Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

05/08/2022
by   Chunyu Xie, et al.

Vision-language pre-training (VLP) on large-scale datasets has shown strong performance on various downstream tasks. A complete and fair benchmark (i.e., one that includes large-scale pre-training datasets and a variety of downstream datasets) is therefore essential for VLP, yet how to construct such a benchmark in Chinese remains a critical problem. To this end, we develop a large-scale Chinese cross-modal benchmark called Zero, which lets AI researchers compare VLP models fairly. We release two pre-training datasets and five fine-tuning datasets for downstream tasks. Furthermore, we propose a novel pre-training framework of pre-Ranking + Ranking for cross-modal learning. Specifically, we apply global contrastive pre-ranking to learn the individual representations of images and Chinese texts. We then fuse the representations in a fine-grained ranking manner via an image-text cross encoder and a text-image cross encoder. To further enhance the capability of the model, we propose a two-way distillation strategy consisting of target-guided Distillation and feature-guided Distillation. For simplicity, we call our model R2D2. We achieve state-of-the-art performance on four public cross-modal datasets and our five downstream datasets. The datasets, models, and code will be made available.
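
The pre-Ranking + Ranking idea can be illustrated with a small, dependency-free sketch. It is not the R2D2 implementation: the abstract does not state the exact loss, so a symmetric InfoNCE-style contrastive loss is assumed here, the temperature of 0.07 is an assumed default, and `fine_scorer` is a hypothetical stand-in for the image-text and text-image cross encoders. The two distillation objectives are not covered.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image embedding and every text embedding."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def global_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed symmetric contrastive loss: matched pairs share the same index."""
    sims = similarity_matrix(image_embs, text_embs)
    logits_i2t = [[s / temperature for s in row] for row in sims]
    logits_t2i = [list(col) for col in zip(*logits_i2t)]  # transpose
    return 0.5 * (cross_entropy_diagonal(logits_i2t)
                  + cross_entropy_diagonal(logits_t2i))


def pre_rank_then_rank(query_emb, candidate_embs, fine_scorer, k=2):
    """Shortlist the top-k candidates by cosine similarity, then re-order the
    shortlist with a finer scorer (hypothetical cross-encoder stand-in)."""
    sims = similarity_matrix([query_emb], candidate_embs)[0]
    shortlist = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
    return sorted(shortlist, key=fine_scorer, reverse=True)
```

In this sketch the inexpensive embedding comparison prunes the candidate set, so the costlier fine-grained scorer only sees `k` items per query. A candidate dropped at the pre-ranking stage cannot be recovered later, which is why the quality of the individual image and text representations matters.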
