COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

06/15/2023
by Sihan Chen, et al.

Due to the limited scale and quality of video-text training corpora, most vision-language foundation models use image-text datasets for pretraining, focusing primarily on visual semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA jointly models visual content and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of downstream tasks, including long-form/short-form video-text tasks and image-text tasks such as retrieval, captioning, and question answering. Notably, COSA achieves state-of-the-art results on various competitive benchmarks. Code and model are released at https://github.com/TXH-mercury/COSA.
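The core data transformation described in the abstract can be sketched in a few lines. The helper below is a hypothetical illustration inferred from the abstract, not the authors' released code: it samples several image-text pairs and concatenates them so the images act as pseudo video "frames" and the captions form a matching "paragraph", preserving the frame-to-sentence (event-description) alignment.

```python
import random

def make_concatenated_sample(image_text_pairs, num_frames=4, rng=None):
    """Hypothetical sketch of COSA-style sample construction (assumed
    from the abstract): concatenate several image-text pairs so the
    images become ordered pseudo-video frames and the captions become
    an event-aligned paragraph."""
    rng = rng or random.Random()
    sampled = rng.sample(image_text_pairs, num_frames)
    frames = [img for img, _ in sampled]              # pseudo video: ordered frames
    paragraph = " ".join(txt for _, txt in sampled)   # matching paragraph, same order
    return frames, paragraph

# Usage (placeholder identifiers stand in for real images/captions):
pairs = [(f"img_{i}", f"Caption {i}.") for i in range(10)]
frames, paragraph = make_concatenated_sample(pairs, num_frames=3,
                                             rng=random.Random(0))
```

Because frames and sentences are concatenated in the same order, each sentence in the paragraph corresponds explicitly to one frame, which is the event-description correspondence the abstract highlights.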


