Scaling Up Collaborative Filtering Data Sets through Randomized Fractal Expansions

by Francois Belletti, et al.

Recommender system research suffers from a disconnect between the size of academic data sets and the scale of industrial production systems. To bridge that gap, we propose to generate large-scale user/item interaction data sets by expanding pre-existing public data sets. Our key contribution is a technique that expands user/item incidence matrices to large numbers of rows (users), columns (items), and non-zero values (interactions). The proposed method adapts Kronecker Graph Theory to preserve key higher-order statistical properties such as the fat-tailed distributions of user engagements and item popularity, and the singular value spectra of user/item interaction matrices. Preserving such properties is key to building large, realistic synthetic data sets, which in turn can be employed reliably to benchmark recommender systems and the systems employed to train them. We further apply our stochastic expansion algorithm to the binarized MovieLens 20M data set, which comprises 20M interactions between 27K movies and 138K users. The resulting expanded data set has 1.2B ratings, 2.2M users, and 855K items, and its size can be scaled up or down.
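To make the idea of expanding an incidence matrix concrete, the following is a minimal sketch of a Kronecker-style expansion with random thinning. It is an illustration under stated assumptions, not the authors' algorithm: the function name `kronecker_expand`, the small `seed` matrix, and the `keep_prob` thinning parameter are all hypothetical choices made for this example, and the abstract does not specify how the expansion factor or the randomization is actually constructed.

```python
import numpy as np

def kronecker_expand(seed, base, keep_prob=1.0, rng=None):
    """Expand `base` via a Kronecker product with a small `seed` matrix.

    Hypothetical helper for illustration only. Every entry of `seed`
    scales a full copy of `base`, so the result has
    seed.shape[0] * base.shape[0] rows (users) and
    seed.shape[1] * base.shape[1] columns (items).
    Each entry is then kept independently with probability `keep_prob`
    (an assumed, simple form of stochastic thinning).
    """
    rng = np.random.default_rng() if rng is None else rng
    expanded = np.kron(seed, base)
    if keep_prob < 1.0:
        mask = rng.random(expanded.shape) < keep_prob
        expanded = expanded * mask
    return expanded

# Toy binarized user/item matrix: 2 users x 3 items, 3 interactions.
base = np.array([[1, 0, 1],
                 [0, 1, 0]])

# Toy seed matrix (an assumption): 2 x 2 with 3 non-zero entries.
seed = np.array([[1, 1],
                 [0, 1]])

expanded = kronecker_expand(seed, base)
print(expanded.shape)       # (4, 6)
print(int(expanded.sum()))  # 9 interactions = 3 * 3
```

In this toy setting the row count, column count, and number of interactions each multiply by the corresponding quantity of the seed matrix, which is the sense in which a Kronecker product scales a data set along all three dimensions at once. Setting `keep_prob` below 1 thins the result at random, one simple way to scale the interaction count down.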


Scalable Realistic Recommendation Datasets through Fractal Expansions

Recommender System research suffers currently from a disconnect between ...

Dynamic Graph Collaborative Filtering

Dynamic recommendation is essential for modern recommender systems to pr...

New Recommendation Algorithm for Implicit Data Motivated by the Multivariate Normal Distribution

The goal of recommender systems is to help users find useful items from ...

Collaborative filtering via sparse Markov random fields

Recommender systems play a central role in providing individualized acce...

Deep Item-based Collaborative Filtering for Top-N Recommendation

Item-based Collaborative Filtering(short for ICF) has been widely adopte...

Partially Synthetic Data for Recommender Systems: Prediction Performance and Preference Hiding

This paper demonstrates the potential of statistical disclosure control ...

Breaking the Curse of Quality Saturation with User-Centric Ranking

A key puzzle in search, ads, and recommendation is that the ranking mode...
