Scalable Realistic Recommendation Datasets through Fractal Expansions

01/23/2019 ∙ by Francois Belletti, et al. ∙ Google

Recommender System research currently suffers from a disconnect between the size of academic data sets and the scale of industrial production systems. In order to bridge that gap we propose to generate more massive user/item interaction data sets by expanding pre-existing public data sets. User/item incidence matrices record interactions between users and items on a given platform as a large sparse matrix whose rows correspond to users and whose columns correspond to items. Our technique expands such matrices to larger numbers of rows (users), columns (items) and non-zero values (interactions) while preserving key higher order statistical properties. We adapt the Kronecker Graph Theory to user/item incidence matrices and show that the corresponding fractal expansions preserve the fat-tailed distributions of user engagements, item popularity and singular value spectra of user/item interaction matrices. Preserving such properties is key to building large realistic synthetic data sets which in turn can be employed reliably to benchmark Recommender Systems and the systems employed to train them. We provide algorithms to produce such expansions and apply them to the MovieLens 20 million data set comprising 20 million ratings of 27K movies by 138K users. The resulting expanded data set has 10 billion ratings, 864K items and 2 million users in its smaller version and can be scaled up or down. A larger version features 655 billion ratings, 7 million items and 17 million users.




I Introduction

Machine Learning (ML) benchmarks compare the capabilities of models, distributed training systems and linear algebra accelerators on realistic problems at scale. For these benchmarks to be effective, results need to be reproducible by many different groups which implies that publicly shared data sets need to be available.

Unfortunately, while Recommendation Systems constitute a key industrial application of ML at scale, large public data sets recording user/item interactions on online platforms are not yet available. For instance, although the Netflix data set [4] and the MovieLens data set [16] are publicly available, they are orders of magnitude smaller than proprietary data [10, 3, 50].

                MovieLens 20M   Industrial
#users          138K            Hundreds of Millions
#items          27K             2M
#topics         19              600K
#observations   20M             Hundreds of Billions

TABLE I: Size of MovieLens 20M [16] vs industrial dataset in [50].

Proprietary data sets and privacy: While releasing large anonymized proprietary recommendation data sets may seem an acceptable solution from a technical standpoint, it is a non-trivial problem to preserve user privacy while still maintaining useful characteristics of the dataset. For instance, [34] shows a privacy breach of the Netflix prize dataset. More importantly, publishing anonymized industrial data sets runs counter to user expectations that their data may only be used in a restricted manner to improve the quality of their experience on the platform.

Therefore, we decide not to make user data more broadly available to preserve the privacy of users. We instead choose to produce synthetic yet realistic data sets whose scale is commensurable with that of our production problems while only consuming already publicly available data.

Producing a realistic MovieLens 10 billion+ dataset: In this work, we focus on the MovieLens dataset which only entails movie ratings posted publicly by users of the MovieLens platform. The MovieLens data set has now become a standard benchmark for academic research in Recommender Systems: [44, 23, 17, 29, 45, 51, 53, 40, 49, 1, 31, 6, 19, 35] are only a few of the many recent research articles relying on MovieLens, whose latest version [16] is heavily cited according to Google Scholar. Unfortunately, the data set comprises only a small number of observed interactions and, more importantly, a very small catalogue of users and items when compared to industrial proprietary recommendation data.

In order to provide a new data set that is more aligned with the needs of production scale Recommender Systems, we aim at expanding publicly available data by creating a realistic surrogate. The following constraints help create a production-size synthetic recommendation problem that is similar to the original one and at least as hard an ML problem for matrix factorization approaches to recommendations [22, 17]:

  • orders of magnitude more users and items are present in the synthetic dataset;

  • the synthetic dataset is realistic in that its first and second order statistics match those of the original dataset presented in Figure 1.

Key first and second order statistics of interest we aim to preserve are summarized in Figure 1 — the details of their computation are given in Section IV.

Fig. 1: Key first and second order properties of the original MovieLens 20m user/item rating matrix (after centering and re-scaling into $[-1, 1]$) we aim to preserve while synthetically expanding the data set. Top: item popularity distribution (total ratings of each item). Middle: user engagement distribution (total ratings of each user). Bottom: dominant singular values of the rating matrix (core to the difficulty of matrix factorization tasks). In all log/log plots the small fraction of non-positive row-wise and column-wise sums are removed.

Adapting Kronecker Graph expansions to user/item feedback: We employ the Kronecker Graph Theory introduced in [25] to achieve a suitable fractal expansion of recommendation data to benchmark linear and non-linear user/item factorization approaches for recommendations [22, 17]. Consider a recommendation problem comprising $m$ users and $n$ items. Let $R \in \mathbb{R}^{m \times n}$ be the sparse matrix of recorded interactions (e.g. $R_{i,j}$ is the rating left by user $i$ for item $j$ if any and $0$ otherwise). The key insight we develop in the present paper is that a carefully crafted fractal expansion of $R$ can preserve high level statistics of the original data set while scaling its size up by multiple orders of magnitude.

Many different transforms can be applied to the matrix $R$, which can be considered a standard sparse 2-dimensional image. A recent approach to creating synthetic recommendation data sets consists in making parametric assumptions on user behavior by instantiating a user model interacting with an online platform [9, 42]. Unfortunately, such methods (even calibrated to reproduce empirical facts in actual data sets) do not provide strong guarantees that the resulting interaction data is similar to the original. Therefore, instead of simulating recommendations in a parametric user-centric way as in [9, 42], we choose a non-parametric approach operating directly in the space of user/item affinity. In order to synthesize a large realistic dataset in a principled manner, we adapt the Kronecker expansions which have previously been employed to produce large realistic graphs in [25]. We employ a non-parametric analytically tractable simulation of the evolution of the user/item bi-partite graph to create a large synthetic data set. Our choice trades off realism for analytic tractability, and we emphasize the latter.

While Kronecker Graph Theory is developed in [25, 26] on square adjacency matrices, the Kronecker product operator is well defined on rectangular matrices and therefore we can apply a similar technique to user/item interaction data sets. The Kronecker Graph generation paradigm has to be changed in other respects for the present data set, however: we need to decrease the expansion rate to generate data sets with the scale we desire, not orders of magnitude too large. We need to do so while maintaining key conservation properties of the original algorithm [26].

In order to reliably employ Kronecker based fractal expansions on recommender system data we make the following contributions:

  • we develop a new technique based on linear algebra to adapt fractal Kronecker expansions to recommendation problems;

  • we demonstrate that key recommendation system specific properties of the original dataset are preserved by our technique;

  • we also show that the resulting algorithm we develop is scalable and easily parallelizable as we employ it on the actual MovieLens 20 million dataset;

  • we produce a synthetic yet realistic MovieLens 655 billion dataset to help recommender system research scale up computational benchmarks for model training.

The present article is organized as follows: we first recapitulate prior research on ML for recommendations and large synthetic dataset generation; we then develop an adaptation of Kronecker Graphs to user/item interaction matrices and prove key theoretical properties; finally we employ the resulting algorithm experimentally to MovieLens 20m data and validate its statistical properties.

II Related work

Recommender Systems constitute the workhorse of many e-commerce, social networking and entertainment platforms. In the present paper we focus on the classical setting where the key role of a recommender system is to suggest relevant items to a given user. Although other approaches are very popular such as content based recommendations [39] or social recommendations [8], collaborative filtering remains a prevalent approach to the recommendation problem [30, 41, 38].

Collaborative filtering: The key insight behind collaborative filtering is to learn affinities between users and items based on previously collected user/item interaction data. Collaborative filtering exists in different flavors. Neighborhood methods group users by inter-user similarity and will recommend items to a given user that have been consumed by neighbors [39]. Latent factor methods such as matrix factorization [22] try to decompose user/item affinity as the result of the interaction of a few underlying representative factors characterizing the user and the item. Although other latent models have been developed and could be used to construct synthetic recommendation data sets (e.g. Principal Component Analysis [20] or Latent Dirichlet Allocation [7]), we focus on insights derived from matrix factorization.

The matrix factorization approach represents the affinity $a_{i,j}$ between a user $i$ and an item $j$ with an inner product $x_i^T y_j$ where $x_i$ and $y_j$ are two vectors in $\mathbb{R}^d$ representing the user and the item respectively. Given a sparse matrix of user/item interactions $R = (r_{i,j})_{i=1 \ldots m, \, j=1 \ldots n}$, user and item factors can therefore be learned by approximating $R$ with a low rank matrix $X Y^T$ where $X \in \mathbb{R}^{m \times d}$ entails the user factors and $Y \in \mathbb{R}^{n \times d}$ contains the item factors. The data set $R$ represents ratings as in the MovieLens dataset [16] or item consumption ($r_{i,j} = 1$ if and only if the user $i$ has consumed item $j$ [4]). The matrix factorization approach is an example of a solution to the rating matrix completion problem which aims at predicting the rating of an item $j$ by a user $i$ which has not been observed yet and corresponds to a value of $0$ in the sparse original rating matrix. Such a factorization method learns an approximation of the data that preserves a few higher order properties of the rating matrix $R$. In particular, the low rank approximation tries to mimic the singular value spectrum of the original data set. We draw inspiration from matrix factorization to tackle synthetic data generation. The present paper will adopt a similar approach to extend collaborative filtering data sets. Besides trying to preserve the spectral properties of the original data, we operate under the constraint of conserving its first and second order statistical properties.
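The low rank approximation described above can be illustrated with a minimal sketch. It assumes plain gradient descent on the observed entries with a small L2 penalty; the function names, hyper-parameters and toy data are hypothetical and are not taken from the paper or from [22].

```python
import numpy as np

def factorize(R, mask, d=2, steps=3000, lr=0.02, reg=1e-3, seed=0):
    """Fit R ~ X @ Y.T on observed entries (mask == 1) by gradient descent."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    X = 0.1 * rng.standard_normal((m, d))  # user factors, one row per user
    Y = 0.1 * rng.standard_normal((n, d))  # item factors, one row per item
    for _ in range(steps):
        E = mask * (X @ Y.T - R)           # residual on observed entries only
        X, Y = X - lr * (E @ Y + reg * X), Y - lr * (E.T @ X + reg * Y)
    return X, Y

def observed_rmse(R, mask, X, Y):
    """Root mean squared error restricted to observed entries."""
    return np.sqrt(((mask * (X @ Y.T - R)) ** 2).sum() / mask.sum())

# Toy data (assumption): a rank-2 matrix rescaled to [-1, 1], about 70% observed.
rng = np.random.default_rng(1)
R = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
R /= np.abs(R).max()
mask = (rng.random(R.shape) < 0.7).astype(float)

X, Y = factorize(R * mask, mask)
print("RMSE on observed entries:", observed_rmse(R, mask, X, Y))
```

The product `X @ Y.T` also provides predictions for the unobserved entries, which is the matrix completion task mentioned in the text.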

Deep Learning for Recommender Systems: Collaborative filtering has known many recent developments which motivate our objective of expanding public data sets in a realistic manner. Deep Neural Networks (DNNs) are now becoming common in both non-linear matrix factorization tasks [47, 17, 10] and sequential recommendations [52, 43, 15]. The mapping between user/item pairs and ratings is generally learned by training the neural model to predict user behavior on a large data set of previously observed user/item interactions.

DNNs consume large quantities of data and are computationally expensive to train, therefore they give rise to commonly shared benchmarks aimed at speeding up training. For training, a Stochastic Gradient Descent method is employed [24] which requires forward model computation and back-propagation to be run on many mini-batches of (user, item, score) examples. The matrix completion task still consists in predicting a rating $r_{i,j}$ for the interaction of user $i$ and item $j$ although $r_{i,j}$ has not been observed in the original data set. The model is typically run on billions of examples as the training procedure iterates over the training data set.

Model freshness is generally critical to industrial recommendations [10] which implies that only limited time is available to re-train the model on newly available data. The throughput of the trainer is therefore crucial to providing more engaging recommendation experiences and presenting more novel items. Unfortunately, public recommendation data sets are too small to provide training-time-to-accuracy benchmarks that can be realistically employed for industrial applications. Too few different examples are available in MovieLens 20m for instance and the number of different available items is orders of magnitude too small. In many industrial settings, millions of items (e.g. products, videos, songs) have to be taken into account by recommendation models. The recommendation model learns embedding matrices of size $N \times d$, where $N$ is the number of embedded items and $d$ the embedding dimension. As a consequence, the memory footprint of such a matrix may dominate that of the rest of the model by several orders of magnitude. During training, the latency and bandwidth of the access to such embedding matrices have a prominent influence on the final throughput in examples/second. Such computational difficulties associated with learning large embedding matrices are worthwhile solving in benchmarks. A higher throughput enables training models with more examples which enables better statistical regularization and architectural expressiveness. The multi-billion interaction size of the data set used for training is also a major factor that affects modeling choices and infrastructure development in the industry.

Our aim is therefore to enable a comparison of modeling approaches, software engineering frameworks and hardware accelerators for ML in the context of industry scale recommendations. In the present paper we focus on enabling a better evaluation of examples/sec throughput and training-time-to-accuracy for Neural Collaborative Filtering Approaches [17] and Matrix Factorization Approaches [22]. A major issue with this approach is, as we mentioned, the size of publicly available collaborative filtering data sets, which is orders of magnitude smaller than production grade data (see Table I) and is unrepresentative in terms of numbers of distinct users and items. The present paper offers a first solution to this problem by providing a simple and tractable non-parametric fractal approach to scaling up public recommendation data sets by several orders of magnitude. Such data sets will help build a first set of benchmarks for model training accelerators based on publicly available data. We plan to publish the expanded data. However, MovieLens 20m is publicly available and our method can already be applied to recreate such an expanded data set. Meta-data is also common in industrial applications [10, 3, 5] but we consider its expansion outside the scope of this first development.

III Fractal expansions of user/item interaction data sets

The present section delineates the insights orienting our design decisions when expanding public recommendation data sets.

III-1 Self-similarity in user/item interactions

Interactions between users and items follow a natural hierarchy in data sets where items can be organized in topics, genres, categories etc [50]. There is for instance an item-level fractal structure in MovieLens 20m with a tree-like structure of genres, sub-genres, and directors. If users were clustered according to their demographics and tastes, another hierarchy would be formed [39]. The corresponding structured user/item interaction matrix is illustrated in Figure 2. The hierarchical nature of user/item interactions (topical and demographic) makes the recommendation data set structurally self-similar (i.e. patterns that occur at more granular scales resemble those affecting coarser scales [33]).

Fig. 2: Typical user/item interaction patterns in recommendation data sets. Self-similarity appears as a natural key feature of the hierarchical organization of users and items into groups of various granularity.

One can therefore build a user-group/item-category incidence matrix with user-groups as rows and item-categories as columns — a coarse interaction matrix. As each user group consists of many individuals and each item category comprises multiple movies, the original individual level user/item interaction matrix may be considered as an expanded version of the coarse interaction matrix. We choose to expand the user/item interaction matrix by extrapolating this self-similar structure and simulating its growth to yet another level of granularity: each original item is treated as a synthetic topic in the expanded data set and each actual user is considered a fictional user group.

A key advantage of this fractal procedure is that it may be entirely non-parametric and designed to preserve high level properties of the original dataset. In particular, a fractal expansion re-introduces the patterns originally observed in the entire real dataset within each block of local interactions of the synthetic user/item matrix. By carefully designing the way such blocks are produced and laid out, we can therefore hope to produce a realistic yet much larger rating matrix. In the following, we show how the Kronecker operator enables such a construction.

III-2 Fractal expansion through Kronecker products

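The Kronecker product, denoted $\otimes$, is a non-standard matrix operator with an intrinsic self-similar structure:

$$A \otimes B = \begin{bmatrix} a_{1,1} B & \cdots & a_{1,n} B \\ \vdots & \ddots & \vdots \\ a_{m,1} B & \cdots & a_{m,n} B \end{bmatrix}$$

where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{p \times q}$ and $A \otimes B \in \mathbb{R}^{mp \times nq}$.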
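As a small illustration of the block structure of the Kronecker product, the following sketch uses NumPy's `np.kron` on two toy matrices (the matrices are arbitrary examples, not data from the paper) and checks that block $(i, j)$ of the result equals $a_{i,j} B$.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1, 0],
              [1, 0, 1]])

K = np.kron(A, B)  # shape (2 * 2, 2 * 3) = (4, 6)

# Block (i, j) of K is the matrix B scaled by the entry A[i, j].
p, q = B.shape
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        block = K[i * p:(i + 1) * p, j * q:(j + 1) * q]
        assert np.array_equal(block, A[i, j] * B)

print(K)
```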

In the original presentation of Kronecker Graph Theory [25] as well as the stochastic extension [32] and the extended theory [26], the Kronecker product is the core operator enabling the synthesis of graphs with exponentially growing adjacency matrices. As in the present work, the insight underlying the use of Kronecker Graph Theory in [26] is to produce large synthetic yet realistic graphs. The fractal nature of the Kronecker operator as it is applied multiple times (see Figure 2 in [26] for an illustration) fits the self-similar statistical properties of real world graphs such as the internet, the web or social networks [25].

If $A$ is the adjacency matrix of the original graph, fractal expansions are created in [25] by chaining Kronecker products as follows:

$$A \otimes A \otimes \cdots \otimes A.$$

As adjacency matrices are square, Kronecker Graphs are not employed on rectangular matrices in pre-existing work although the operation is well defined. Another slight divergence between present and pre-existing work is that, in their stochastic version, Kronecker Graphs carry Bernoulli probability distribution parameters in $[0, 1]$ while MovieLens ratings are in $\{0.5, 1, \ldots, 5\}$ originally and in $[-1, 1]$ after we center and rescale them. We show that these differences do not prevent Kronecker products from preserving core properties of rating matrices. A more important challenge is the size of the original matrix we deal with: $R \in \mathbb{R}^{(138 \times 10^3) \times (27 \times 10^3)}$. A naive Kronecker expansion $R \otimes R$ would therefore synthesize a rating matrix with about 19 billion users, which is too large.

Thus, although Kronecker products seem like an ideal candidate for the mechanism at the core of the self-similar synthesis of a larger recommendation dataset, some modifications are needed to the algorithms developed in [26].

III-3 Reduced Kronecker expansions

We choose to synthesize a user/item rating matrix

$$\widetilde{R} = \widehat{R} \otimes R$$

where $\widehat{R}$ is a matrix derived from $R$ but much smaller (for instance with only a few hundred rows and columns). For reasons that will become apparent as we explore some theoretical properties of Kronecker fractal expansions, we want to construct a smaller derived matrix $\widehat{R}$ that shares similarities with $R$. In particular, we seek $\widehat{R}$ with a similar row-wise sum distribution (user engagement distribution), column-wise sum distribution (item popularity distribution) and singular value spectrum (signal to noise ratio distribution in the matrix factorization).
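To make the effect of the reduced matrix on the output size concrete, here is a back-of-the-envelope sketch. It relies on the fact that a Kronecker product multiplies row counts, column counts and non-zero counts. The helper name `expanded_size` is hypothetical, the MovieLens figures are the rounded ones quoted in the text, and the 16 x 32 reduced shape is an illustrative assumption chosen to land near the 10 billion rating version mentioned in the abstract.

```python
def expanded_size(m, n, nnz, m_hat, n_hat, nnz_hat=None):
    """Rows, columns and non-zeros of kron(R_hat, R).

    R is m x n with nnz non-zeros; R_hat is m_hat x n_hat.
    """
    if nnz_hat is None:
        nnz_hat = m_hat * n_hat  # assumption: the reduced matrix is dense
    return m_hat * m, n_hat * n, nnz_hat * nnz

# Rounded MovieLens 20M figures quoted in the text.
m, n, nnz = 138_000, 27_000, 20_000_000

# Naive expansion kron(R, R): far too many users.
naive_users, naive_items, naive_nnz = expanded_size(m, n, nnz, m, n, nnz)

# Reduced expansion with a hypothetical dense 16 x 32 reduced matrix.
reduced = expanded_size(m, n, nnz, m_hat=16, n_hat=32)

print(f"naive:   {naive_users:,} users")
print(f"reduced: {reduced[0]:,} users, {reduced[1]:,} items, {reduced[2]:,} ratings")
```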

III-4 Implementation at scale and algorithmic extensions

Computing a Kronecker product between two matrices $A$ and $B$ is an inherently parallel operation. It is sufficient to broadcast $B$ to each element $(i, j)$ of $A$ and then multiply $B$ by $a_{i,j}$. Such a property implies that scaling can be achieved. Another advantage of the operator is that even a single machine can produce a large output data set by sequentially iterating on the values of $A$. Only storage space commensurable with the size of the original matrix is needed to compute each block of the Kronecker product. It is noteworthy that generalized fractal expansions can be defined by altering the standard Kronecker product. We consider such extensions here as candidates to engineer more challenging synthetic data sets. One drawback though is that these extensions may not preserve analytic tractability.

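A first generalization defines a binary operator $\otimes_F$, with $F$ a function of an entry of $A$, the matrix $B$ and a pseudo-random number, as follows:

$$A \otimes_F B = \begin{bmatrix} F(a_{1,1}, B, \omega_{1,1}) & \cdots & F(a_{1,n}, B, \omega_{1,n}) \\ \vdots & \ddots & \vdots \\ F(a_{m,1}, B, \omega_{m,1}) & \cdots & F(a_{m,n}, B, \omega_{m,n}) \end{bmatrix}$$

where $\omega_{1,1}, \ldots, \omega_{m,n}$ is a sequence of pseudo-random numbers. Including randomization and non-linearity in $F$ appears as a simple way to synthesize data sets entailing more varied patterns. The algorithm we employ to compute Kronecker products is presented in Algorithm 1. The implementation we employ is trivially parallelizable. We only create a list of Kronecker blocks to dump entire rows (users) of the output matrix to file. This is not necessary and can be removed to enable as many processes to run simultaneously and independently as there are elements in $A$ (provided pseudo random numbers are generated in parallel in an appropriate manner).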

  for i = 1 to m do
     kBlocks ← empty list
     for j = 1 to n do
        ω ← next pseudo random number
        kBlock ← F(a(i, j), B, ω)
        kBlocks append kBlock
     end for
     dump kBlocks to file
  end for
Algorithm 1 Kronecker fractal expansion
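The loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the generator name `kronecker_row_strips` is hypothetical, the default `F` is chosen so that the standard Kronecker product is recovered, and row strips are yielded to the caller rather than written to file.

```python
import numpy as np
import scipy.sparse as sp

def kronecker_row_strips(A, B, F=None, seed=0):
    """Yield, for each row i of A, the strip [F(a_i1, B, w), ..., F(a_in, B, w)]."""
    rng = np.random.default_rng(seed)
    if F is None:
        # Assumption: ignoring omega gives back the standard Kronecker product.
        F = lambda a, B, omega: B * a
    A = np.asarray(A, dtype=float)
    B = sp.csr_matrix(B, dtype=float)
    for i in range(A.shape[0]):
        k_blocks = []
        for j in range(A.shape[1]):
            omega = rng.random()                   # next pseudo random number
            k_block = F(float(A[i, j]), B, omega)  # one Kronecker block
            k_blocks.append(sp.csr_matrix(k_block))
        # One strip holds B.shape[0] complete output rows (users).
        yield sp.hstack(k_blocks, format="csr")

A = np.array([[1.0, -0.5],
              [0.0, 2.0]])
B = np.array([[0.0, 1.0, 0.0, 2.0],
              [3.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 0.0]])

K = sp.vstack(list(kronecker_row_strips(A, B)), format="csr")
print(K.shape)
```

Since each strip depends only on one row of `A` and on `B`, strips can be produced by independent workers, provided each worker receives its own pseudo-random stream.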

The only reason why we need a reduced version of $R$ is to control the size of the expansion. Also, $\widehat{R} \otimes R$ and $R \otimes \widehat{R}$ are equal after a row-wise and a column-wise permutation. Therefore, another family of appropriate extensions may be obtained by considering

$$R \otimes_G R = \begin{bmatrix} r_{1,1} \, G(R, \omega_{1,1}) & \cdots & r_{1,n} \, G(R, \omega_{1,n}) \\ \vdots & \ddots & \vdots \\ r_{m,1} \, G(R, \omega_{m,1}) & \cdots & r_{m,n} \, G(R, \omega_{m,n}) \end{bmatrix}$$

where $G$ is a randomized sketching operation on the matrix $R$ which reduces its size by several orders of magnitude. A trivial scheme consists in sampling a small number of rows and columns from $R$ at random. Other random projections [2, 28, 12] may of course be used. The randomized procedures above produce a user/item interaction matrix where there is no longer a block-wise repetitive structure. Less obvious statistical patterns can give rise to more challenging synthetic large-scale collaborative filtering problems.
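The trivial sketching scheme mentioned above (sampling a few rows and columns at random) might look as follows. This is a toy sketch under assumptions: the function names are hypothetical, uniform sampling without replacement is one possible reading of "at random", and the dense copy of the input is only acceptable at toy sizes.

```python
import numpy as np
import scipy.sparse as sp

def sample_sketch(R, m_hat, n_hat, rng):
    """G(R, omega): keep m_hat rows and n_hat columns of R chosen uniformly."""
    rows = np.sort(rng.choice(R.shape[0], size=m_hat, replace=False))
    cols = np.sort(rng.choice(R.shape[1], size=n_hat, replace=False))
    return R[rows][:, cols]

def sketched_expansion(R, m_hat, n_hat, seed=0):
    """Block (i, j) of the output is r_ij times a fresh random sketch of R."""
    rng = np.random.default_rng(seed)
    R = sp.csr_matrix(R, dtype=float)
    dense = R.toarray()  # toy-sized input only
    strips = []
    for i in range(R.shape[0]):
        blocks = [sample_sketch(R, m_hat, n_hat, rng) * float(dense[i, j])
                  for j in range(R.shape[1])]
        strips.append(sp.hstack(blocks, format="csr"))
    return sp.vstack(strips, format="csr")

R = (np.arange(20).reshape(4, 5) % 3 - 1).astype(float)
K = sketched_expansion(R, m_hat=2, n_hat=3)
print(K.shape)  # (4 * 2, 5 * 3)
```

Because every block draws its own sketch, the output no longer repeats a single reduced matrix block-wise, which is the property the paragraph above points to.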

IV Statistical properties of Kronecker fractal expansions

After having introduced Kronecker products to self-similarly expand a recommendation dataset into a much larger one, we now demonstrate how the resulting synthetic user/item interaction matrix shares crucial common properties with the original.

IV-1 Salient empirical facts in MovieLens data

First, we introduce the critical properties we want to preserve. Note that throughout the paper we present results on a centered version of MovieLens 20m. The average rating is subtracted from all ratings (so that the zero elements of the sparse rating matrix match un-observed scores and not bad movie ratings). Furthermore, we re-scale the centered ratings so that they are all in the interval $[-1, 1]$. As a user/item interaction dataset on an online platform, one expects MovieLens to feature common properties of recommendation data sets such as “power-law” or fat-tailed distributions [50].

The first important statistical properties for recommendations concern the distribution of interactions across users and across items. It is generally observed that such distributions exhibit a “power-law” behavior [50, 1, 36, 11, 37, 48, 27, 13]. To characterize such a behavior in the MovieLens data set, we take a look at the distribution of the total ratings along the item axis and the user axis. In other words, we compute row-wise and column-wise sums for the rating matrix $R$ and observe their distributions. The corresponding ranked distributions are shown in Figure 1 and do exhibit a clear “power-law” behavior for rather popular items. However we observe that tail items have a higher popularity decay rate. Similarly, the engagement decay rate increases for the group of less engaged users.
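The ranked row-wise and column-wise sums described above can be computed along the following lines. This is a minimal sketch on a toy matrix: the helper name `ranked_sums` is hypothetical, and dropping non-positive sums mirrors what the caption of Figure 1 says is done before log/log plotting.

```python
import numpy as np
import scipy.sparse as sp

def ranked_sums(R):
    """Row-wise and column-wise sums of R, positive values only, sorted descending."""
    R = sp.csr_matrix(R, dtype=float)
    row_sums = np.asarray(R.sum(axis=1)).ravel()  # one value per user
    col_sums = np.asarray(R.sum(axis=0)).ravel()  # one value per item

    def rank(s):
        return np.sort(s[s > 0])[::-1]

    return rank(row_sums), rank(col_sums)

R = np.array([[1, 0, 2],
              [0, 0, 0],
              [3, 1, 0]])
user_engagement, item_popularity = ranked_sums(R)
print(user_engagement, item_popularity)
```

Plotting each ranked vector against its rank on log/log axes gives curves of the kind summarized in Figure 1.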

The other approximate “power-law” we find in Figure 1 lies in the singular value spectrum of the MovieLens dataset. We compute the top $k$ singular values [18] of the MovieLens rating matrix $R$ by approximate iterative methods (e.g. power iteration) which can scale to its large dimension. The method yields the $k$ dominant singular values of $R$ and the corresponding singular vectors so that one can classically approximate $R$ by $R \approx U \Sigma V$, where $\Sigma$ is diagonal of dimension $k \times k$, $U$ is column-orthogonal of dimension $m \times k$ and $V$ is row-orthogonal of dimension $k \times n$, which yields the rank $k$ matrix closest to $R$ in Frobenius norm.
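One way to obtain the dominant singular triplets of a large sparse matrix is sketched below. Note the swap: the text mentions power iteration, whereas this sketch uses SciPy's `svds`, an ARPACK-based iterative solver, as a readily available stand-in. The wrapper name `top_singular_values` and the toy matrix are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def top_singular_values(R, k):
    """Return (U, s, V) with the k largest singular values, sorted descending."""
    U, s, Vt = svds(sp.csr_matrix(R, dtype=float), k=k)
    order = np.argsort(s)[::-1]  # do not rely on the order returned by svds
    return U[:, order], s[order], Vt[order]

rng = np.random.default_rng(0)
M = rng.standard_normal((30, 20))  # toy stand-in for the rating matrix
U, s, V = top_singular_values(M, k=5)
print(s)
```

Here `U` has shape (m, k) with orthogonal columns and `V` has shape (k, n) with orthogonal rows, matching the convention $R \approx U \Sigma V$ used in the text.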

Examining the distribution of the top singular values of $R$ in the MovieLens dataset (which has at most $\min(m, n)$, i.e. about 27K, non-zero singular values) in Figure 1 highlights a clear “power-law” behavior in the highest magnitude part of the spectrum of $R$. We observe in the spectral distribution an inflection for smaller singular values whose magnitude decays at a higher rate than larger singular values. Such a spectral distribution is a key feature of the original dataset, in particular in that it conditions the difficulty of low-rank approximation approaches to the matrix completion problem. Therefore, we also want the expanded dataset to exhibit a similar behavior in terms of spectral properties.

In all the high level statistics we present, we want to preserve the approximate “power-law” decay as well as its inflection for smaller values. Our requirements for the expanding transform which we apply to $R$ are therefore threefold: we want to preserve the distributions of row-wise sums of $R$, column-wise sums of $R$ and singular value distribution of $R$. Additional requirements, beyond first and second order high level statistics, will further increase the confidence in the realism of the expanded synthetic dataset. Nevertheless, we consider that focusing on these first three properties is a good starting point.

IV-2 Preserving MovieLens data properties while expanding it

We now present the fractal transform design we rely on to preserve the key statistical properties of the previous section.

Definition 1

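Consider $A \in \mathbb{R}^{m \times n}$. We denote the set of row-wise sums of $A$ by $\mathcal{R}(A)$, the set of column-wise sums of $A$ by $\mathcal{C}(A)$, and the set of singular values of $A$ by $\mathcal{S}(A)$.

Definition 2

Consider an integer $i$ and a non-zero positive integer $N$. We denote by $\lfloor i \rfloor_N = \lfloor (i - 1) / N \rfloor$ the integer part of $i - 1$ in base $N$ and by $\{i\}_N = i - N \lfloor i \rfloor_N$ the fractional part.

First we focus on conservation properties in terms of row-wise and column-wise sums which correspond respectively to marginalized user engagement and item popularity distributions. In the following, $\times$ denotes the Minkowski product of two sets, i.e. $X \times Y = \{ x y \mid x \in X, \, y \in Y \}$.

Proposition 1

Consider $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$ and their Kronecker product $K = A \otimes B$. Then $\mathcal{R}(K) = \mathcal{R}(A) \times \mathcal{R}(B)$ and $\mathcal{C}(K) = \mathcal{C}(A) \times \mathcal{C}(B)$.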

Proof 1

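Consider the $i$-th row of $K$. By definition of $K$, the corresponding sum can be rewritten as follows: $\sum_{j=1}^{nq} K_{i,j} = \sum_{j=1}^{nq} a_{\lfloor i \rfloor_p + 1, \lfloor j \rfloor_q + 1} \, b_{\{i\}_p, \{j\}_q}$, which in turn equals

$$\sum_{j=1}^{n} \sum_{j'=1}^{q} a_{\lfloor i \rfloor_p + 1, j} \, b_{\{i\}_p, j'} = \Big( \sum_{j=1}^{n} a_{\lfloor i \rfloor_p + 1, j} \Big) \Big( \sum_{j'=1}^{q} b_{\{i\}_p, j'} \Big).$$

Refactoring the two sums in this way concludes the proof for the row-wise sum properties. The proof for column-wise properties is identical.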
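Proposition 1 can be checked numerically on small random matrices. The sketch below uses arbitrary toy inputs and verifies that the row-wise (respectively column-wise) sums of the Kronecker product are all the pairwise products of the row-wise (respectively column-wise) sums of its factors.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(3, 4)).astype(float)
B = rng.integers(-3, 4, size=(2, 5)).astype(float)
K = np.kron(A, B)

# Row i * p + i' of K pairs row i of A with row i' of B, so the row sums of K
# are the outer product of the two row-sum vectors, flattened row by row.
row_sums_K = K.sum(axis=1)
minkowski_rows = np.outer(A.sum(axis=1), B.sum(axis=1)).ravel()

col_sums_K = K.sum(axis=0)
minkowski_cols = np.outer(A.sum(axis=0), B.sum(axis=0)).ravel()

print(np.allclose(row_sums_K, minkowski_rows),
      np.allclose(col_sums_K, minkowski_cols))
```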

Theorem 1

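Consider $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$ and their Kronecker product $K = A \otimes B$. Then $\mathcal{S}(K) = \mathcal{S}(A) \times \mathcal{S}(B)$.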

Proof 2

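One can easily check that $(XY) \otimes (VW) = (X \otimes V)(Y \otimes W)$ for any quadruple of matrices $X, Y, V, W$ for which the notation makes sense and that $(X \otimes Y)^T = X^T \otimes Y^T$. Let $A = U_A \Sigma_A V_A$ be the SVD of $A$ and $B = U_B \Sigma_B V_B$ the SVD of $B$. Then $(A \otimes B)(A \otimes B)^T = (A A^T) \otimes (B B^T)$. Now, $A A^T = U_A \Sigma_A^2 U_A^T$. Writing the same decomposition for $B B^T$ gives $(A \otimes B)(A \otimes B)^T = (U_A \otimes U_B)(\Sigma_A^2 \otimes \Sigma_B^2)(U_A \otimes U_B)^T$, and considering that $U_A$, $U_B$ are column-orthogonal while $V_A$, $V_B$ are row-orthogonal concludes the proof.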
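Theorem 1 can also be checked numerically. The sketch below uses arbitrary toy matrices whose shapes are chosen so that the Kronecker product has exactly as many singular values as there are pairwise products; with other shapes the product's spectrum would be padded with zeros.

```python
import numpy as np

def singular_values(M):
    """Singular values of M in descending order."""
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))  # 3 singular values
B = rng.standard_normal((2, 5))  # 2 singular values
K = np.kron(A, B)                # 6 x 20, hence 6 singular values

# All pairwise products of singular values, sorted in descending order.
products = np.sort(np.outer(singular_values(A), singular_values(B)).ravel())[::-1]

print(np.allclose(singular_values(K), products))
```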

The properties above imply that knowing the row-wise sums, column-wise sums and singular value spectrum of the reduced rating matrix $\widehat{R}$ and the original rating matrix $R$ is enough to deduce analytically the corresponding properties for the expanded rating matrix $\widetilde{R} = \widehat{R} \otimes R$. As in [26], the Kronecker product enables analytic tractability while expanding data sets in a fractal manner to orders of magnitude more data.

IV-3 Constructing a reduced matrix with a similar spectrum

Considering that the quasi “power-law” properties of $R$ imply, as in [26], that $\mathcal{S}(R) \times \mathcal{S}(R)$ has a similar distribution to $\mathcal{S}(R)$, we seek a small $\widehat{R}$ whose high order statistical properties are similar to those of $R$. As we want to generate a dataset with several billion user/item interactions, millions of distinct users and millions of distinct items, we are looking for a matrix $\widehat{R}$ with a few hundred or thousand rows and columns. The reduced matrix $\widehat{R}$ we seek is therefore orders of magnitude smaller than $R$. In order to produce a reduced matrix of such dimensions one could use the older, smaller MovieLens 100K dataset [16]. Such a dataset can be interpreted as a sub-sampled reduced version of MovieLens 20m with similar properties. However the data sets have been collected seven years apart and therefore temporal non-stationarity issues become concerning. Also, we aim to produce an expansion method where the expansion multipliers can be chosen flexibly by practitioners. It is noteworthy that, in our experiments, naive uniform user and item sampling strategies have not yielded smaller matrices with similar properties to $R$. Different random projections [2, 28, 12] could more generally be employed, however we rely on a procedure better tailored to our specific statistical requirements.

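We now describe the technique we employed to produce a reduced size matrix $\widehat{R}$ with first and second order properties close to those of $R$, which in turn led to constructing an expansion matrix $\widetilde{R} = \widehat{R} \otimes R$ similar to $R$. We want the dimensions of $\widehat{R}$ to be $(\widehat{m}, \widehat{n})$ with $\widehat{m} \ll m$ and $\widehat{n} \ll n$. Consider again the approximate Singular Value Decomposition (SVD) [18] of $R$ with the $k = \min(\widehat{m}, \widehat{n})$ principal singular values of $R$:

$$R \approx U \Sigma V$$

where $U \in \mathbb{R}^{m \times k}$ has orthogonal columns, $V \in \mathbb{R}^{k \times n}$ has orthogonal rows, and $\Sigma \in \mathbb{R}^{k \times k}$ is diagonal with non-negative terms.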

To reduce the number of rows and columns of $R$ while preserving its top $k$ singular values a trivial solution would consist in replacing $U$ and $V$ by small random orthogonal matrices with few rows and columns respectively. Unfortunately such a method would only seemingly preserve the spectral properties of $R$ as the principal singular vectors would be widely changed. Such properties are important: one of the key advantages of employing Kronecker products in [26] is the preservation of the network values, i.e. the distributions of singular vector components of a Graph’s adjacency matrix.

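To obtain a matrix $\widehat{U}$ with fewer rows than $U$ but column-orthogonal and similar to $U$ in the distribution of its values we use the following procedure. We re-size $U$ down to $\widehat{m}$ rows with $\widehat{m} < m$ through an averaging-based down-scaling method that can classically be found in standard image processing libraries (e.g. skimage.transform.resize in the scikit-image library [46]). Let $\overline{U} \in \mathbb{R}^{\widehat{m} \times k}$ be the corresponding resized version of $U$. We then construct $\widehat{U}$ as the column orthogonal matrix in $\mathbb{R}^{\widehat{m} \times k}$ closest in Frobenius norm to $\overline{U}$. Therefore as in [14] we compute

$$\widehat{U} = \overline{U} \left( \overline{U}^T \overline{U} \right)^{-1/2}.$$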
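We apply a similar procedure to $V$ to reduce its number of columns which yields a row orthogonal matrix $\widehat{V} \in \mathbb{R}^{k \times \widehat{n}}$ with $\widehat{n} < n$. The orthogonality of $\widehat{U}$ (column-wise) and $\widehat{V}$ (row-wise) guarantees that the singular value spectrum of

$$\widehat{R} = \widehat{U} \Sigma \widehat{V}$$

consists exactly of the $k$ leading components of the singular value spectrum of $R$. Like $R$, $\widehat{R}$ is re-scaled to take values in $[-1, 1]$. The whole procedure to reduce $R$ down to $\widehat{R}$ is summarized in Algorithm 2.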

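  $(U, \Sigma, V)$ ← approximate SVD of $R$ with its $k$ principal singular values
  $\overline{U}$ ← $U$ re-sized down to $\widehat{m}$ rows
  $\overline{V}$ ← $V$ re-sized down to $\widehat{n}$ columns
  $\widehat{U}$ ← $\overline{U} (\overline{U}^T \overline{U})^{-1/2}$
  $\widehat{V}$ ← $(\overline{V} \, \overline{V}^T)^{-1/2} \overline{V}$
  $\widehat{R}$ ← $\widehat{U} \Sigma \widehat{V}$
  re-scale $\widehat{R}$ so that its values lie in $[-1, 1]$
  return $\widehat{R}$
Algorithm 2 Compute reduced matrix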

We verify empirically that the distributions of values of the reduced singular vectors in $\widehat{U}$ and $\widehat{V}$ are similar to those of $U$ and $V$ respectively to preserve first order properties of $R$ and value distributions of its singular vectors. Such properties are demonstrated through numerical experiments in the next section.
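The reduction procedure described in this section can be sketched end to end on a toy dense matrix. Several choices below are assumptions made for illustration: a plain block average stands in for the image resize (the text mentions skimage.transform.resize), a dense SVD replaces the approximate sparse one, the final re-scaling divides by the largest absolute value, and all function names are hypothetical. The closest column-orthogonal matrix $\overline{U} (\overline{U}^T \overline{U})^{-1/2}$ is computed through the thin SVD $\overline{U} = P S Q^T$, which gives $P Q^T$.

```python
import numpy as np

def block_average_rows(U, m_hat):
    """Averaging-based down-scaling of the rows of U to m_hat rows (stand-in)."""
    groups = np.array_split(np.arange(U.shape[0]), m_hat)
    return np.stack([U[g].mean(axis=0) for g in groups])

def nearest_column_orthogonal(M):
    """Column-orthogonal matrix closest to M in Frobenius norm."""
    P, _, Qt = np.linalg.svd(M, full_matrices=False)
    return P @ Qt

def reduce_matrix(R, m_hat, n_hat):
    k = min(m_hat, n_hat)
    U, s, V = np.linalg.svd(R, full_matrices=False)  # dense SVD: toy sizes only
    U, s, V = U[:, :k], s[:k], V[:k]
    U_hat = nearest_column_orthogonal(block_average_rows(U, m_hat))      # (m_hat, k)
    V_hat = nearest_column_orthogonal(block_average_rows(V.T, n_hat)).T  # (k, n_hat)
    R_hat = U_hat @ np.diag(s) @ V_hat
    return R_hat / np.abs(R_hat).max()  # assumption: re-scale into [-1, 1]

rng = np.random.default_rng(0)
R = rng.standard_normal((40, 30))  # toy stand-in for the rating matrix
R_hat = reduce_matrix(R, m_hat=6, n_hat=8)
print(R_hat.shape)
```

Up to the final re-scaling factor, the singular values of `R_hat` are the leading singular values of `R`, which is the property the construction is designed to guarantee.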

V Experimentation on MovieLens 20 million data

The MovieLens 20m data comprises 20m ratings given by 138 thousand users to 27 thousand items. In the present section, we demonstrate how the fractal Kronecker expansion technique we devised helps scale up this data set to orders of magnitude more users, items and interactions, all in a parallelizable and analytically tractable manner.

V-1 Size of expanded data set

In the present experiments we construct a reduced rating matrix whose size implies that the resulting expanded data set comprises 10 billion interactions between 2 million users and 864K items. In the appendix, we present the results obtained for a larger reduced rating matrix and a synthetic data set consisting of 655 billion interactions between 17 million users and 7 million items.

Such a high number of interactions and items enables the training of deep neural collaborative models such as the Neural Collaborative Filtering model [17] at a scale which is now more representative of industrial settings. Moreover, the increased data set size helps construct benchmarks for deep learning software packages and ML accelerators that operate at the same orders of magnitude as production settings in terms of user base size, item vocabulary size and number of observations.

V-2 Empirical properties of reduced matrix

The aim of the construction technique for the reduced matrix was to produce, just like in [26], a matrix that shares the properties of the original rating matrix while being smaller in size. To that end, we aimed at constructing a much smaller matrix with properties close to those of the original in terms of column-wise sum, row-wise sum and singular value spectrum distributions.

We now check that the construction procedure we devised does produce a reduced matrix with the properties we expected. As the impact of the re-sizing step is unclear from an analytic standpoint, we resort to numerical experiments to validate our method.

Fig. 3: Properties of the reduced dataset built according to steps 4, 5 and 6. We validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between the original rating matrix and the reduced matrix. Note here that as the original matrix is large we only compute its leading singular values. As we want to preserve statistical “power-laws”, we focus on preserving the relative distribution of values and not their absolute magnitude.

In Figure 3, one can assess that the first and second order properties of the original matrix and the reduced matrix match with high enough fidelity. In particular, the higher magnitude column-wise and row-wise sum distributions follow a “power-law” behavior similar to that of the original matrix. Similar observations can be made about the singular value spectra of the two matrices.

There is therefore now a reasonable likelihood that our adapted Kronecker expansion — although somewhat differing from the method originally presented in [26] — will enjoy the same benefits in terms of enabling data set expansion while preserving high order statistical properties.

V-3 Empirical properties of the expanded data set

We now verify empirically that the expanded rating matrix does share common first and second order properties with the original rating matrix. The new data set is orders of magnitude larger in terms of number of rows and columns and orders of magnitude larger in terms of number of non-zero terms. Notice here that, as the reduced matrix is in general dense, the level of sparsity of the expanded data set is the same as that of the original.

Another benefit of using a fractal expansion method with analytic tractability is that we can deduce high order statistics of the expanded data set beforehand without having to instantiate it. In particular, Proposition 1 implies that knowing the column-wise and row-wise sum distributions of the reduced matrix and of the original matrix is sufficient to determine the corresponding marginals for the expanded data set. Similarly, the leading singular values of the Kronecker product can be computed with Theorem 1 just based on the leading singular values of the original matrix and the singular values of the reduced matrix.
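The analytic tractability mentioned here rests on standard identities of the Kronecker product, which can be checked numerically on toy matrices. The sketch below is only an illustration: the matrix sizes, the variable names and the ordering of the two factors in the product are assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins: a small dense "reduced" matrix and a sparse "original" one.
small = rng.random((3, 4)) + 0.1
large = rng.random((6, 5)) * (rng.random((6, 5)) < 0.3)

expanded = np.kron(small, large)

# Marginals of the product follow from the marginals of the factors.
row_sums = np.kron(small.sum(axis=1), large.sum(axis=1))
col_sums = np.kron(small.sum(axis=0), large.sum(axis=0))

# Singular values of the product are all pairwise products of the factors' singular values.
sv_small = np.linalg.svd(small, compute_uv=False)
sv_large = np.linalg.svd(large, compute_uv=False)
predicted_sv = np.sort(np.outer(sv_small, sv_large).ravel())[::-1]

# A dense factor leaves the fraction of non-zero entries unchanged.
density_large = np.count_nonzero(large) / large.size
density_expanded = np.count_nonzero(expanded) / expanded.size

print(np.allclose(expanded.sum(axis=1), row_sums))
print(np.allclose(expanded.sum(axis=0), col_sums))
print(np.isclose(density_large, density_expanded))
```

None of the predicted quantities requires building `expanded`; it is instantiated here only so that the predictions can be compared against it.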

Fig. 4: High order statistical properties of the expanded dataset. We validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between the original rating matrix and the expanded matrix. Here we leverage the tractability of Kronecker products as they impact column-wise and row-wise sum distributions as well as singular value spectra. The plots corresponding to the extended dataset are derived analytically based on the corresponding properties of the reduced matrix and the original matrix. Note here that as we only computed the leading singular values of the original matrix, we only show the leading singular values of the expanded matrix. In all plots we can observe the preservation of the linear log-log correspondence for the higher values in the distributions of interest (row-wise sums, column-wise sums and singular values) as well as the accelerated decay of the smaller values in those distributions.

In Figure 4, one can confirm that the spectral properties of the expanded data set as well as the user engagement (row-wise sums) and item popularity (column-wise sums) are similar to those of the original data set. Such observations indicate that the resulting data set is representative, in its fat-tailed data distribution and quasi “power-law” singular value spectrum, of problems encountered in ML for collaborative filtering. Furthermore, the expanded data set reproduces some irregularities of the original data, in particular the accelerating decay of values in ranked row-wise and column-wise sums as well as in the singular value spectrum.

V-4 Limitations

Although it does not condition the difficulty of rating matrix factorization problems, which depends primarily on the interaction distribution, the number of users, the number of items and the sparsity of the rating matrix, the distribution of ratings is still an important statistical property of the MovieLens 20m data set. Figure 5 shows that there is a certain degree of divergence between the rating scales of the original and synthetic data sets. In particular, many more rating values are present in the expanded data set as a result of the multiplication of terms from the reduced matrix and the original matrix. The transformations turning the original matrix into the reduced matrix create a smoother scale of values, leading to a Kronecker product with many different possible ratings. Although the non-zero value distribution is a point of divergence between the two data sets, ratings in the synthetic data set are dominated by values close to the average, as in the original MovieLens 20m. Therefore the synthetic ratings do have a certain degree of realism, as they represent user/item interactions where strong reactions (positive or negative) from users are much less likely than neutral interactions.
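The finer rating scale can be reproduced on a toy example. In the sketch below, the ten-value half-star scale mirrors MovieLens 20m, while the matrix sizes, the uniform random entries of the dense factor and the variable names are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten-value half-star scale (0.5 to 5.0), drawn uniformly for illustration only.
rating_scale = np.arange(0.5, 5.5, 0.5)
original = rng.choice(rating_scale, size=(50, 40))

# Hypothetical dense reduced matrix with continuous positive entries.
reduced = rng.uniform(0.1, 1.0, size=(4, 3))

expanded = np.kron(reduced, original)

n_original = np.unique(original).size
n_expanded = np.unique(expanded).size
print(n_original, n_expanded)
```

Each entry of the dense factor rescales the whole original scale, so the number of distinct values in the product can grow up to the number of distinct dense entries times the number of original rating levels.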

Fig. 5: Sorted ratings in the original MovieLens 20m data set and the extended data. We sample 20m ratings from the new data set. Multiplications inherent to Kronecker products create a finer granularity rating scale which somewhat differs from the original rating scale. However, we can check that the new rating scale is still representative of common recommendation problems in that most rating values are close to the average and few user/item interactions are very positive or negative.

Another limitation of the synthetic data set is the block-wise repetitive structure of Kronecker products. Although the synthetic data set is still hard to factorize as the product of two low rank matrices, because its singular values are still distributed similarly to those of the original data set, it is now easy to factorize with a Kronecker SVD [21], which takes advantage of the block-wise repetitions in Eq (1). The randomized fractal expansions which we presented in Eq (2) and Eq (3) address this issue. A simple example of such a randomized variation around the Kronecker product consists in shuffling the rows and columns of each block in Eq (1) independently at random. The shuffles break the block-wise repetitive structure and prevent Kronecker SVD from producing a trivial solution to the factorization problem.
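The block-wise shuffling just described can be sketched as follows. This is a hedged illustration: it assumes that each block of the expansion is the larger matrix scaled by one entry of the smaller one, and the function name, seed handling and toy sizes are hypothetical.

```python
import numpy as np


def shuffled_kronecker(small, large, seed=0):
    """Kronecker-like expansion whose blocks are shuffled independently.

    Assumes block (i, j) of the plain expansion is small[i, j] * large.
    """
    rng = np.random.default_rng(seed)
    n, m = large.shape
    out = np.empty((small.shape[0] * n, small.shape[1] * m))
    for i in range(small.shape[0]):
        for j in range(small.shape[1]):
            # Draw a fresh row and column permutation for every block.
            rows = rng.permutation(n)
            cols = rng.permutation(m)
            block = small[i, j] * large[np.ix_(rows, cols)]
            out[i * n:(i + 1) * n, j * m:(j + 1) * m] = block
    return out


rng = np.random.default_rng(1)
small = rng.random((2, 3))
large = rng.random((4, 5))
expanded = shuffled_kronecker(small, large)
plain = np.kron(small, large)
print(expanded.shape == plain.shape, np.isclose(expanded.sum(), plain.sum()))
```

Each block keeps the same values as in the plain Kronecker product, only in a different arrangement, so no two blocks are exact scaled copies of one another any more.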

As a result, the expansion technique we present appears as a reliable first candidate to train linear matrix factorization models [39] and non-linear user/item similarity scoring models [17].

VI Conclusion

In conclusion, this paper presents a first attempt at synthesizing realistic large-scale recommendation data sets without having to make compromises in terms of user privacy. We use a small publicly available data set, MovieLens 20m, and expand it to orders of magnitude more users, items and observed ratings. Our expansion model is rooted in the hierarchical structure of user/item interactions, which naturally suggests a fractal extrapolation model.

We leverage Kronecker products as self-similar operators on user/item rating matrices that impact key properties of row-wise and column-wise sums as well as singular value spectra in an analytically tractable manner. We modify the original Kronecker Graph generation method to enable an expansion of the original data by orders of magnitude that yields a synthetic data set matching industrial recommendation data sets in scale. Our numerical experiments demonstrate the data set we create has key first and second order properties similar to those of the original MovieLens 20m rating matrix.

Our next steps consist in making large synthetic data sets publicly available although any researcher can readily use the techniques we presented to scale up any user/item interaction matrix. Another possible direction is to adapt the present method to recommendation data sets featuring meta-data (e.g. timestamps, topics, device information). The use of meta-data is indeed critical to solve the “cold-start” problem of users and items having no interaction history with the platform. We also plan to benchmark the performance of well established baselines on the new large scale realistic synthetic data we produce.


  • [1] Abdollahpouri, H., Burke, R., and Mobasher, B. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (2017), ACM, pp. 42–46.
  • [2] Achlioptas, D. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences 66, 4 (2003), 671–687.
  • [3] Belletti, F., Beutel, A., Jain, S., and Chi, E. Factorized recurrent neural architectures for longer range dependence. In International Conference on Artificial Intelligence and Statistics (2018), pp. 1522–1530.
  • [4] Bennett, J., Lanning, S., et al. The netflix prize. In Proceedings of KDD cup and workshop (2007), vol. 2007, New York, NY, USA, p. 35.
  • [5] Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. Latent cross: Making use of context in recurrent recommender systems. In International Conference on Web Search and Data Mining (2018), ACM, pp. 46–54.
  • [6] Bhargava, A., Ganti, R., and Nowak, R. Active positive semidefinite matrix completion: Algorithms, theory and applications. In Artificial Intelligence and Statistics (2017), pp. 1349–1357.
  • [7] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
  • [8] Chaney, A. J., Blei, D. M., and Eliassi-Rad, T. A probabilistic model for using social networks in personalized item recommendation. In Proceedings of the 9th ACM Conference on Recommender Systems (2015), ACM, pp. 43–50.
  • [9] Chaney, A. J., Stewart, B. M., and Engelhardt, B. E. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. arXiv preprint arXiv:1710.11214 (2017).
  • [10] Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 191–198.
  • [11] Cremonesi, P., Koren, Y., and Turrin, R. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems (2010), ACM, pp. 39–46.
  • [12] Fradkin, D., and Madigan, D. Experiments with random projections for machine learning. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (2003), ACM, pp. 517–522.
  • [13] Goel, S., Broder, A., Gabrilovich, E., and Pang, B. Anatomy of the long tail: ordinary people with extraordinary tastes. In Proceedings of the third ACM international conference on Web search and data mining (2010), ACM, pp. 201–210.
  • [14] Golub, G. H., and Van Loan, C. F. Matrix computations, vol. 3. JHU Press, 2012.
  • [15] Hariri, N., Mobasher, B., and Burke, R. Context-aware music recommendation based on latent topic sequential patterns. In Proceedings of the sixth ACM conference on Recommender systems (2012), ACM, pp. 131–138.
  • [16] Harper, F. M., and Konstan, J. A. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 4 (2016), 19.
  • [17] He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (2017), International World Wide Web Conferences Steering Committee, pp. 173–182.
  • [18] Horn, R. A., and Johnson, C. R. Matrix analysis. Cambridge university press, 1990.
  • [19] Jawanpuria, P., and Mishra, B. A unified framework for structured low-rank matrix learning. In International Conference on Machine Learning (2018), pp. 2259–2268.
  • [20] Jolliffe, I. Principal component analysis. In International encyclopedia of statistical science. Springer, 2011, pp. 1094–1096.
  • [21] Kamm, J., and Nagy, J. G. Optimal kronecker product approximation of block toeplitz matrices. SIAM Journal on Matrix Analysis and Applications 22, 1 (2000), 155–172.
  • [22] Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 8 (2009), 30–37.
  • [23] Krichene, W., Mayoraz, N., Rendle, S., Zhang, L., Yi, X., Hong, L., Chi, E., and Anderson, J. Efficient training on very large corpora via gramian estimation. arXiv preprint arXiv:1807.07187 (2018).
  • [24] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature 521, 7553 (2015), 436.
  • [25] Leskovec, J., Chakrabarti, D., Kleinberg, J., and Faloutsos, C. Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. In European Conference on Principles of Data Mining and Knowledge Discovery (2005), Springer, pp. 133–145.
  • [26] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11, Feb (2010), 985–1042.
  • [27] Levy, M., and Bosteels, K. Music recommendation and the long tail. In 1st Workshop On Music Recommendation And Discovery (WOMRAD), ACM RecSys, 2010, Barcelona, Spain (2010), Citeseer.
  • [28] Li, P., Hastie, T. J., and Church, K. W. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (2006), ACM, pp. 287–296.
  • [29] Liang, D., Altosaar, J., Charlin, L., and Blei, D. M. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM conference on recommender systems (2016), ACM, pp. 59–66.
  • [30] Linden, G., Smith, B., and York, J. Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet computing, 1 (2003), 76–80.
  • [31] Lu, J., Liang, G., Sun, J., and Bi, J. A sparse interactive model for matrix completion with side information. In Advances in neural information processing systems (2016), pp. 4071–4079.
  • [32] Mahdian, M., and Xu, Y. Stochastic kronecker graphs. In International Workshop on Algorithms and Models for the Web-Graph (2007), Springer, pp. 179–186.
  • [33] Mandelbrot, B. B. The fractal geometry of nature, vol. 1. WH freeman New York, 1982.
  • [34] Narayanan, A., and Shmatikov, V. How to break anonymity of the netflix prize dataset. arXiv preprint cs/0610105 (2006).
  • [35] Nimishakavi, M., Jawanpuria, P. K., and Mishra, B. A dual framework for low-rank tensor completion. In Advances in Neural Information Processing Systems (2018), pp. 5489–5500.
  • [36] Oestreicher-Singer, G., and Sundararajan, A. Recommendation networks and the long tail of electronic commerce. Mis quarterly (2012), 65–83.
  • [37] Park, Y.-J., and Tuzhilin, A. The long tail of recommender systems and how to leverage it. In Proceedings of the 2008 ACM conference on Recommender systems (2008), ACM, pp. 11–18.
  • [38] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. Grouplens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on Computer supported cooperative work (1994), ACM, pp. 175–186.
  • [39] Ricci, F., Rokach, L., and Shapira, B. Recommender systems: introduction and challenges. In Recommender systems handbook. Springer, 2015, pp. 1–34.
  • [40] Rudolph, M., Ruiz, F., Mandt, S., and Blei, D. Exponential family embeddings. In Advances in Neural Information Processing Systems (2016), pp. 478–486.
  • [41] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web (2001), ACM, pp. 285–295.
  • [42] Schmit, S., and Riquelme, C. Human interaction with recommendation systems. arXiv preprint arXiv:1703.00535 (2017).
  • [43] Shani, G., Heckerman, D., and Brafman, R. I. An mdp-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
  • [44] Tang, J., and Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In International Conference on Web Search and Data Mining (2018), IEEE, pp. 565–573.
  • [45] Tu, K., Cui, P., Wang, X., Wang, F., and Zhu, W. Structural deep embedding for hyper-networks. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).
  • [46] Van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., and Yu, T. scikit-image: image processing in python. PeerJ 2 (2014), e453.
  • [47] Wang, H., Wang, N., and Yeung, D.-Y. Collaborative deep learning for recommender systems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015), ACM, pp. 1235–1244.
  • [48] Yin, H., Cui, B., Li, J., Yao, J., and Chen, C. Challenging the long tail recommendation. Proceedings of the VLDB Endowment 5, 9 (2012), 896–907.
  • [49] Zhang, J.-D., Chow, C.-Y., and Xu, J. Enabling kernel-based attribute-aware matrix factorization for rating prediction. IEEE Transactions on Knowledge and Data Engineering 29, 4 (2017), 798–812.
  • [50] Zhao, Q., Chen, J., Chen, M., Jain, S., Beutel, A., Belletti, F., and Chi, E. H. Categorical-attributes-based item classification for recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems (2018), ACM, pp. 320–328.
  • [51] Zheng, Y., Tang, B., Ding, W., and Zhou, H. A neural autoregressive approach to collaborative filtering. arXiv preprint arXiv:1605.09477 (2016).
  • [52] Zhou, B., Hui, S. C., and Chang, K. An intelligent recommender system using sequential web access patterns. In IEEE conference on cybernetics and intelligent systems (2004), vol. 1, IEEE Singapore, pp. 393–398.
  • [53] Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), ACM, pp. 1059–1068.


MovieLens 655 billion

In this section, we present the properties of a larger expansion for MovieLens. The reduced rating matrix is now larger and the synthetic data consists of 655 billion interactions between 17 million users and 7 million items.

Empirical properties of the reduced matrix

We assess the scalability of the approach we present to synthesize the reduced matrix. In particular, we check that the reduced matrix, now built with a larger size, still shares common statistical properties with the original matrix. Figure 6 demonstrates that the construction method we devised for the reduced matrix still preserves key statistical properties of the original.

Fig. 6: Properties of the reduced dataset. Once more, we validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between the original rating matrix and the reduced matrix.

Numerical validation for the expanded data set

We now verify that even with different extension factors and a much larger size, the synthetic data set we generate is similar to the original MovieLens 20m. We focus on the distribution of column-wise and row-wise sums in the expanded matrix as well as on its singular value distribution. In Figure 7, we find again that the “power-law” statistical behaviors and their inflections are preserved by the expansion procedure we designed.

Fig. 7: We again validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between the original rating matrix and the expanded matrix. Here as well, we can observe the similarity between key features of the original rating matrix and its synthetic expansion.


The previous observations demonstrate the scalability and robustness of our expansion method, even with the larger expansion factor used here. However, the same limitations are present as in the smaller case, and Figure 8 shows that a similar divergence in rating scales exists between the original data set and its expanded synthetic version. As before, the synthetic ratings remain realistic in that most of them are near the average. The present section therefore demonstrates that our method scales up and is able to synthesize very large realistic recommendation data sets.

Fig. 8: Sorted ratings in the original MovieLens 20m data set and the extended data. We sample 20m ratings from the new data set. Again we observe a certain divergence between the synthetic data set and the original. There is some realism, though, in that most interactions are near neutral.