Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce

04/06/2023
by Yang Jin, et al.

This paper aims to establish a generic multi-modal foundation model that scales to massive downstream applications in E-commerce. Recently, large-scale vision-language pretraining approaches have achieved remarkable advances in the general domain. However, due to the significant differences between natural and product images, directly applying these image-level representation frameworks to E-commerce is inevitably sub-optimal. To this end, we propose an instance-centric multi-modal pretraining paradigm called ECLIP. In detail, we craft a decoder architecture that introduces a set of learnable instance queries to explicitly aggregate instance-level semantics. Moreover, to let the model focus on the desired product instance without relying on expensive manual annotations, two specially configured pretext tasks are further proposed. Pretrained on 100 million E-commerce-related samples, ECLIP successfully extracts more generic, semantically rich, and robust representations. Extensive experimental results show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating strong transferability to real-world E-commerce applications.
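The core mechanism the abstract describes — a set of learnable instance queries that cross-attend over image features to pool instance-level representations — can be sketched as follows. This is a minimal single-head illustration, not the paper's actual architecture: the dimensions, the number of queries, and the function name `instance_query_decoder` are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instance_query_decoder(patch_feats, queries):
    """One cross-attention step: learnable instance queries attend over
    image patch features to aggregate instance-level semantics.
    (Illustrative sketch only; ECLIP's decoder has more components.)"""
    d = queries.shape[-1]
    # (num_queries, num_patches) attention weights
    attn = softmax(queries @ patch_feats.T / np.sqrt(d), axis=-1)
    # Each query pools a weighted combination of patch features.
    return attn @ patch_feats  # (num_queries, d)

rng = np.random.default_rng(0)
patches = rng.standard_normal((49, 64))   # e.g. a 7x7 patch grid, dim 64
queries = rng.standard_normal((8, 64))    # 8 learnable instance queries
inst = instance_query_decoder(patches, queries)
print(inst.shape)  # (8, 64): one instance-level embedding per query
```

In a full model the queries would be trained parameters and the decoder would stack several such attention layers; the sketch only shows how a fixed set of queries turns patch-level features into a fixed number of instance-level embeddings.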


Related research

08/20/2021
Knowledge Perceived Multi-modal Pretraining in E-commerce
In this paper, we address multi-modal pretraining of product data in the...

03/30/2020
InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
Multi-modal pretraining for learning high-level multi-modal representati...

07/30/2021
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining
Nowadays, customer's demands for E-commerce are more diversified, which ...

03/08/2022
Multi-Modal Mixup for Robust Fine-tuning
Pre-trained large-scale models provide a transferable embedding, and the...

09/22/2021
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation
Self-supervised vision-and-language pretraining (VLP) aims to learn tran...

05/20/2023
Patton: Language Model Pretraining on Text-Rich Networks
A real-world text corpus sometimes comprises not only text documents but...

07/15/2022
Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization
With the prosperity of e-commerce industry, various modalities, e.g., vi...
