X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

11/22/2022
by Yan Zeng, et al.

Vision-language pre-training aims to learn alignments between vision and language from a large amount of data. We previously proposed multi-grained vision-language pre-training, a unified approach that learns vision-language alignments at multiple levels of granularity. This paper advances that method by unifying image and video encoding in one model and scaling it up with large-scale data. We present X^2-VLM, a pre-trained VLM with a modular architecture that handles both image-text and video-text tasks. Experimental results show that X^2-VLM performs best at both base and large scale on image-text and video-text tasks, striking a good trade-off between performance and model size. Moreover, we show that the modular design of X^2-VLM makes it highly transferable to new languages and domains. For example, simply replacing the text encoder with XLM-R lets X^2-VLM outperform state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models will be available at github.com/zengyan-97/X2-VLM.
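The text-encoder swap works because the architecture is modular: the vision encoder and cross-modal fusion module stay fixed, while the text encoder is a pluggable component. The sketch below illustrates this idea using Hugging Face encoders; the `ModularVLM` class, the placeholder fusion layer, and the checkpoint names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (NOT the official X^2-VLM code) of a modular VLM whose
# text encoder can be swapped without touching the vision or fusion parts.
import torch.nn as nn
from transformers import AutoModel

class ModularVLM(nn.Module):
    def __init__(self, text_encoder_name: str):
        super().__init__()
        # Vision side is identical across variants (a ViT stands in here).
        self.vision_encoder = AutoModel.from_pretrained(
            "google/vit-base-patch16-224-in21k")
        # Pluggable text encoder: the only component that changes.
        self.text_encoder = AutoModel.from_pretrained(text_encoder_name)
        # Cross-modal fusion (a single cross-attention layer as a placeholder).
        self.fusion = nn.MultiheadAttention(
            embed_dim=768, num_heads=12, batch_first=True)

    def forward(self, pixel_values, input_ids, attention_mask):
        v = self.vision_encoder(pixel_values=pixel_values).last_hidden_state
        t = self.text_encoder(
            input_ids=input_ids,
            attention_mask=attention_mask).last_hidden_state
        # Fuse text tokens (query) with image patches (key/value).
        fused, _ = self.fusion(t, v, v)
        return fused

# English variant with a BERT-style text encoder:
model_en = ModularVLM("bert-base-uncased")
# Multilingual variant: only the text encoder changes, as the abstract
# describes for the XLM-R swap.
model_multi = ModularVLM("xlm-roberta-base")
```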

