EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

09/10/2021
by   Jue Wang, et al.

While large-scale pre-training has achieved great success in bridging the gap between vision and language, it still faces several challenges. First, pre-training is computationally expensive. Second, there is no efficient way to handle the data noise that degrades model performance. Third, previous methods leverage only limited image-text paired data while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose EfficientCLIP, which uses Ensemble Confident Learning to obtain a less noisy subset of the training data. Extra non-paired single-modal text data is used to boost the generalization of the text branch. We achieve state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources of CLIP and WenLan, while showing excellent generalization to single-modal tasks, including text retrieval and text classification.
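The Ensemble Confident Learning step can be pictured as filtering out noisy image-text pairs that an ensemble of models does not agree are clean. A minimal sketch, assuming each ensemble member assigns a match score to every pair; the function name, score layout, and "all members agree" rule below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def ensemble_confident_subset(scores, keep_ratio=0.5):
    """Select a less-noisy subset of image-text pairs (illustrative sketch).

    scores: (n_models, n_pairs) array of image-text match scores, one row
    per ensemble member. A pair is kept only if every member ranks it in
    the top `keep_ratio` fraction of its own scores, i.e. the ensemble
    agrees the pair is a confident (clean) match.
    Returns the indices of the kept pairs.
    """
    scores = np.asarray(scores, dtype=float)
    n_models, n_pairs = scores.shape
    k = max(1, int(n_pairs * keep_ratio))  # pairs each member may endorse
    keep = np.ones(n_pairs, dtype=bool)
    for row in scores:
        # threshold at the k-th largest score for this ensemble member
        thresh = np.partition(row, n_pairs - k)[n_pairs - k]
        keep &= row >= thresh
    return np.flatnonzero(keep)

# Two members, four pairs: both score pairs 0 and 2 highly,
# so only those survive with keep_ratio=0.5.
kept = ensemble_confident_subset(
    [[0.9, 0.1, 0.8, 0.2],
     [0.8, 0.2, 0.9, 0.1]],
    keep_ratio=0.5,
)
```

`keep_ratio` trades subset size against cleanliness: a smaller ratio demands stronger ensemble agreement and yields a smaller but less noisy training subset.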


Related research

12/31/2020
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Existing pre-training methods either focus on single-modal tasks or multi...

10/23/2020
ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding
Language model pre-training has shown promising results in various downs...

09/05/2021
Data Efficient Masked Language Modeling for Vision and Language
Masked language modeling (MLM) is one of the key sub-tasks in vision-lan...

11/09/2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
Unsupervised large-scale vision-language pre-training has shown promisin...

05/20/2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
We present a simplified, task-agnostic multi-modal pre-training approach...

08/21/2021
Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training
Existing approaches to vision-language pre-training (VLP) heavily rely o...

04/07/2021
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
We study joint learning of Convolutional Neural Network (CNN) and Transf...