InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

03/30/2020
by Junyang Lin, et al.

Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, InterBERT (BERT for Interaction), which has a strong capability for modeling interaction between the information flows of different modalities. Its single-stream interaction module effectively processes information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation on single-modal tasks. We pretrain the model with three tasks, masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM), and finetune it on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including recent multi-modal pretraining methods, and our analysis shows that MSM and MRM are effective for pretraining and that our method achieves performance comparable to BERT on single-modal tasks. In addition, we build a large-scale dataset for multi-modal pretraining in Chinese and develop Chinese InterBERT, the first Chinese multi-modal pretrained model. We pretrain Chinese InterBERT on this dataset of 3.1M image-text pairs from Mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval and have recently deployed it online for topic-based recommendation.
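As a rough illustration of the architecture described in the abstract, the PyTorch-style sketch below stacks a single-stream interaction module (full self-attention over the concatenated text tokens and image regions) beneath two modality-specific streams. All module names, layer counts, and dimensions here are assumptions made for illustration only, not the released InterBERT implementation.

```python
# Hypothetical sketch of the InterBERT layout described in the abstract:
# a single-stream interaction module over concatenated text and image
# features, followed by a two-stream module on top. Layer counts,
# dimensions, and names are assumptions, not the authors' code.
import torch
import torch.nn as nn


class InterBERTSketch(nn.Module):
    def __init__(self, hidden=768, heads=12, shared_layers=6, stream_layers=6,
                 vocab_size=30522, region_feat_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_feat_dim, hidden)  # project RoI features

        def encoder(n_layers):
            layer = nn.TransformerEncoderLayer(
                d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
                batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n_layers)

        # Single-stream interaction module: self-attention over the
        # concatenated sequence lets text and image tokens attend to each other.
        self.interaction = encoder(shared_layers)
        # Two-stream module on top: each modality is refined independently,
        # which is meant to preserve single-modal performance.
        self.text_stream = encoder(stream_layers)
        self.image_stream = encoder(stream_layers)

    def forward(self, token_ids, region_feats):
        t = self.text_embed(token_ids)        # (B, T, H) text token embeddings
        v = self.region_proj(region_feats)    # (B, R, H) image region embeddings
        joint = self.interaction(torch.cat([t, v], dim=1))
        t_len = t.size(1)
        text_out = self.text_stream(joint[:, :t_len])
        image_out = self.image_stream(joint[:, t_len:])
        return text_out, image_out
```

In this reading, the MSM and MRM objectives would mask contiguous text segments and image regions in the inputs to this forward pass, while ITM would classify whether the paired text and image match; the corresponding prediction heads are omitted from the sketch.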

Related research

08/20/2021 · Knowledge Perceived Multi-modal Pretraining in E-commerce
In this paper, we address multi-modal pretraining of product data in the...

03/01/2021 · M6: A Chinese Multimodal Pretrainer
In this work, we construct the largest dataset for multimodal pretrainin...

04/06/2023 · Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce
This paper aims to establish a generic multi-modal foundation model that...

08/30/2021 · LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation
Standard multi-task benchmarks are essential for driving the progress of...

02/01/2023 · mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Recent years have witnessed a big convergence of language, vision, and m...

07/15/2022 · Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization
With the prosperity of e-commerce industry, various modalities, e.g., vi...

05/12/2022 · One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code
People perceive the world with multiple senses (e.g., through hearing so...
