AntM^2C: A Large Scale Dataset For Multi-Scenario Multi-Modal CTR Prediction

08/31/2023
by Zhaoxin Huan, et al.

Click-through rate (CTR) prediction is a core problem in recommendation systems, and a variety of public CTR datasets have emerged. However, existing datasets suffer from the following limitations. First, users click different types of items across multiple scenarios, and modeling across scenarios provides a more comprehensive understanding of users, yet existing datasets only cover a single type of item from a single scenario. Second, multi-modal features are essential in multi-scenario prediction because they address the inconsistency of ID encodings between scenarios, yet existing datasets are ID-based and lack multi-modal features. Third, a large-scale dataset provides a more reliable evaluation of models and more fully reflects the performance differences between them, yet existing datasets contain on the order of 100 million samples, which is small relative to real-world CTR prediction. To address these limitations, we propose AntM^2C, a Multi-Scenario Multi-Modal CTR dataset built on industrial data from Alipay. Specifically, AntM^2C provides the following advantages: 1) it covers CTR data for 5 different types of items (advertisements, vouchers, mini-programs, contents, and videos), offering insight into user preferences across item types; 2) beyond ID-based features, it provides 2 multi-modal features, raw text and image features, which can effectively establish connections between items with different IDs; 3) it provides 1 billion CTR samples with 200 features, covering 200 million users and 6 million items, making it currently the largest-scale CTR dataset available. Based on AntM^2C, we construct several typical CTR tasks and provide comparisons with baseline methods. The dataset homepage is available at https://www.atecup.cn/home.
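The abstract's second point, that shared multi-modal features bridge items whose IDs differ across scenarios, can be illustrated with a minimal sketch. This is not code from the AntM^2C release; all names, dimensions, and weights below are hypothetical. The idea: two scenarios encode the same item under unrelated IDs (so their ID embeddings carry no shared signal), but a text embedding derived from the item itself is identical in both, letting one logistic scoring function transfer across scenarios.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8  # hypothetical embedding width

# Two scenarios assign different IDs to the *same* item, so their
# ID embeddings are unrelated random vectors...
id_table_scenario_a = {101: rng.normal(size=EMB_DIM)}
id_table_scenario_b = {577: rng.normal(size=EMB_DIM)}  # same item, different ID
# ...but a modality feature (e.g. an embedding of the item's text)
# is shared, establishing the cross-scenario connection.
shared_text_emb = rng.normal(size=EMB_DIM)

def ctr_score(id_emb, text_emb, w_id, w_text, bias):
    """Logistic CTR score over concatenated ID and text features."""
    logit = id_emb @ w_id + text_emb @ w_text + bias
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical learned parameters (random here, for illustration only).
w_id, w_text, bias = rng.normal(size=EMB_DIM), rng.normal(size=EMB_DIM), 0.0

p_a = ctr_score(id_table_scenario_a[101], shared_text_emb, w_id, w_text, bias)
p_b = ctr_score(id_table_scenario_b[577], shared_text_emb, w_id, w_text, bias)
```

In this toy setup, the `shared_text_emb` term contributes identically to both scores, while the ID terms remain scenario-specific; a real multi-scenario model would learn these components jointly over the dataset's 200 features.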


Related research

11/10/2022
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation
Responding with multi-modal content has been recognized as an essential ...

11/17/2022
DeepSense 6G: A Large-Scale Real-World Multi-Modal Sensing and Communication Dataset
This article presents the DeepSense 6G dataset, which is a large-scale d...

09/09/2021
M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks
In this paper, we aim to advance the research of multi-modal pre-trainin...

08/08/2023
Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation
Multi-modal recommendation systems, which integrate diverse types of inf...

07/27/2019
Many could be better than all: A novel instance-oriented algorithm for Multi-modal Multi-label problem
With the emergence of diverse data collection techniques, objects in rea...

10/21/2021
A scale invariant ranking function for learning-to-rank: a real-world use case
Nowadays, Online Travel Agencies provide the main service for booking ho...

07/17/2023
Unified Open-Vocabulary Dense Visual Prediction
In recent years, open-vocabulary (OV) dense visual prediction (such as O...
