M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks

by   Xiao Dong, et al.

In this paper, we aim to advance the research of multi-modal pre-training on E-commerce and subsequently contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs, covering more than 6,000 categories and 5,000 attributes. Generally, existing multi-modal datasets are either limited in scale or modality diversity. Differently, our M5Product is featured from the following aspects. First, the M5Product dataset is 500 times larger than the public multimodal dataset with the same number of modalities and nearly twice larger compared with the largest available text-image cross-modal dataset. Second, the dataset contains rich information of multiple modalities including image, text, table, video and audio, in which each modality can capture different views of semantic information (e.g. category, attributes, affordance, brand, preference) and complements the other. Third, to better accommodate with real-world problems, a few portion of M5Product contains incomplete modality pairs and noises while having the long-tailed distribution, which aligns well with real-world scenarios. Finally, we provide a baseline model M5-MMT that makes the first attempt to integrate the different modality configuration into an unified model for feature fusion to address the great challenge for semantic alignment. We also evaluate various multi-model pre-training state-of-the-arts for benchmarking their capabilities in learning from unlabeled data under the different number of modalities on the M5Product dataset. We conduct extensive experiments on four downstream tasks and provide some interesting findings on these modalities. Our dataset and related code are available at https://xiaodongsuper.github.io/M5Product_dataset.


page 2

page 4


Knowledge Perceived Multi-modal Pretraining in E-commerce

In this paper, we address multi-modal pretraining of product data in the...

Cross-view Semantic Alignment for Livestreaming Product Recognition

Live commerce is the act of selling products online through live streami...

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

Existing vision-language pre-training (VLP) methods primarily rely on pa...

AntM^2C: A Large Scale Dataset For Multi-Scenario Multi-Modal CTR Prediction

Click-through rate (CTR) prediction is a crucial issue in recommendation...

Benchmarking Diverse-Modal Entity Linking with Generative Models

Entities can be expressed in diverse formats, such as texts, images, or ...

MultiMAE: Multi-modal Multi-task Masked Autoencoders

We propose a pre-training strategy called Multi-modal Multi-task Masked ...

MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition

In this paper, we study a novel problem in egocentric action recognition...

Please sign up or login with your details

Forgot password? Click here to reset