Omnivore: A Single Model for Many Visual Modalities

01/20/2022
by Rohit Girdhar, et al.

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train, uses off-the-shelf standard datasets, and performs at par or better than modality-specific models of the same size. A single Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN RGB-D. After finetuning, our models outperform prior work on a variety of vision tasks and generalize across modalities. Omnivore's shared visual representation naturally enables cross-modal recognition without access to correspondences between modalities. We hope our results motivate researchers to model visual modalities together.
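The recipe described above is straightforward to sketch: embed every modality into a common token space (images as single-frame clips, depth as a fourth input channel), run all tokens through one shared transformer trunk, and attach a linear classification head per dataset. The PyTorch sketch below is illustrative only; the module names, sizes, and the use of a vanilla transformer encoder are assumptions for brevity, whereas the actual Omnivore model is built on a Swin transformer trunk.

```python
# Minimal sketch of a shared-trunk, multi-modality classifier in the spirit
# of Omnivore. Names and dimensions are illustrative assumptions, not the
# paper's exact implementation.
import torch
import torch.nn as nn


class OmnivorousClassifier(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12, num_classes=None):
        super().__init__()
        # One head per dataset; the class counts mirror ImageNet-1K,
        # Kinetics-400, and SUN RGB-D scene classification.
        num_classes = num_classes or {"image": 1000, "video": 400, "rgbd": 19}
        # Modality-specific patch embeddings map every input to the same
        # token space: images are treated as single-frame clips, and depth
        # is handled as an extra input channel with its own embedding.
        self.patch_embed_rgb = nn.Conv3d(3, embed_dim, kernel_size=(1, 16, 16),
                                         stride=(1, 16, 16))
        self.patch_embed_rgbd = nn.Conv3d(4, embed_dim, kernel_size=(1, 16, 16),
                                          stride=(1, 16, 16))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        # The trunk is shared: exactly the same parameters for every modality.
        self.trunk = nn.TransformerEncoder(layer, num_layers=depth)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(embed_dim, n) for name, n in num_classes.items()})

    def forward(self, x, modality):
        # x: (B, C, T, H, W); images use T=1, RGB-D uses C=4 with T=1.
        embed = self.patch_embed_rgbd if modality == "rgbd" else self.patch_embed_rgb
        tokens = embed(x).flatten(2).transpose(1, 2)  # (B, num_patches, D)
        feats = self.trunk(tokens).mean(dim=1)        # global average pool
        return self.heads[modality](feats)


model = OmnivorousClassifier(depth=2)     # shallow trunk just for the demo
image = torch.randn(2, 3, 1, 224, 224)    # image as a 1-frame clip
video = torch.randn(2, 3, 8, 224, 224)    # 8-frame video clip
rgbd = torch.randn(2, 4, 1, 224, 224)     # RGB plus a depth channel
print(model(image, "image").shape)  # torch.Size([2, 1000])
print(model(video, "video").shape)  # torch.Size([2, 400])
print(model(rgbd, "rgbd").shape)    # torch.Size([2, 19])
```

Because the trunk parameters are identical for every modality, features for an image, a video clip, and a depth map land in the same representation space, which is what makes the cross-modal recognition described in the abstract possible without paired training data.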


Related research

OmniMAE: Single Model Masked Pretraining on Images and Videos (06/16/2022)
Transformer-based architectures have become competitive across a variety...

Learning to relate images: Mapping units, complex cells and simultaneous eigenspaces (10/01/2011)
A fundamental operation in many vision tasks, including motion understan...

ImageBind: One Embedding Space To Bind Them All (05/09/2023)
We present ImageBind, an approach to learn a joint embedding across six ...

CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation (11/22/2017)
Visual and audio modalities are two symbiotic modalities underlying vide...

Deep Cross Modal Learning for Caricature Verification and Identification (CaVINet) (07/31/2018)
Learning from different modalities is a challenging task. In this paper,...

PolyViT: Co-training Vision Transformers on Images, Videos and Audio (11/25/2021)
Can we train a single transformer model capable of processing multiple m...

A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers (10/03/2022)
The Vision Transformer (ViT) architecture has recently established its p...
