OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

09/15/2022
by   Junke Wang, et al.

This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can therefore perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language pretraining to help video-language tasks). To this end, we propose decoupled joint pretraining of image-language and video-language, which effectively decomposes vision-language modeling into spatial and temporal dimensions and yields a performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss that leverages image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are exploited as much as possible. Without extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results at similar model size and data scale.
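The abstract describes the UniVLC loss only at a high level. As a rough illustration, the sketch below shows one plausible way such a unified contrastive objective could look in PyTorch, following the common multi-positive formulation in which label data contributes extra positives (samples sharing a class count as matches for each other, while web-crawled pairs keep a single positive). The function name `univlc_loss`, the tensor shapes, and the temperature value are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def univlc_loss(visual_emb, text_emb, labels, temperature=0.07):
    """Sketch of a unified vision-language contrastive loss.

    visual_emb: (B, D) image or video embeddings from the visual encoder.
    text_emb:   (B, D) embeddings of captions or label prompts.
    labels:     (B,) id per sample; label data reuses class ids, so rows
                sharing an id become mutual positives (multi-positive loss).
    """
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                    # (B, B) similarities

    # Target matrix: positive wherever two samples share the same label id;
    # the diagonal is always positive (each sample matches its own text).
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    targets = pos / pos.sum(dim=1, keepdim=True)        # row-normalized

    # Symmetric cross-entropy over both retrieval directions.
    loss_v2t = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t2v = -(targets * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

In such a setup, paired image/video-text samples would each get a unique id (e.g., `torch.arange(B)` offset away from the class-id range to avoid collisions), while classification and action-recognition samples would reuse their class id after the label is converted to a text prompt (e.g., "a photo of a {class}"); this is an assumed data-preparation convention, sketched here only to make the multi-positive structure concrete.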

Related Research

12/08/2021 · FLAVA: A Foundational Language And Vision Alignment Model
State-of-the-art vision and vision-and-language models rely on large-sca...

02/01/2023 · mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Recent years have witnessed a big convergence of language, vision, and m...

07/24/2023 · Towards a Visual-Language Foundation Model for Computational Pathology
The accelerated adoption of digital pathology and advances in deep learn...

03/10/2023 · MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling
Video-and-language understanding has a variety of applications in the in...

04/26/2023 · Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining
Medical artificial general intelligence (MAGI) enables one foundation mo...

12/08/2022 · Structured Vision-Language Pretraining for Computational Cooking
Vision-Language Pretraining (VLP) and Foundation models have been the go...

03/23/2023 · MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Foundation models have shown outstanding performance and generalization ...
