I Introduction
Machine Learning (ML) benchmarks compare the capabilities of models, distributed training systems and linear algebra accelerators on realistic problems at scale. For these benchmarks to be effective, results need to be reproducible by many different groups which implies that publicly shared data sets need to be available.
Unfortunately, while Recommendation Systems constitute a key industrial application of ML at scale, large public data sets recording user/item interactions on online platforms are not yet available. For instance, although the Netflix data set [4] and the MovieLens data set [16] are publicly available, they are orders of magnitude smaller than proprietary data [10, 3, 50].
                 MovieLens 20M    Industrial
#users           138K             Hundreds of millions
#items           27K              2M
#topics          19               600K
#observations    20M              Hundreds of billions
Proprietary data sets and privacy: While releasing large anonymized proprietary recommendation data sets may seem an acceptable solution from a technical standpoint, preserving user privacy while maintaining useful characteristics of the data set is a nontrivial problem. For instance, [34] demonstrates a privacy breach of the Netflix Prize data set. More importantly, publishing anonymized industrial data sets runs counter to user expectations that their data may only be used in a restricted manner to improve the quality of their experience on the platform.
Therefore, to preserve the privacy of users, we decide not to make user data more broadly available. We instead choose to produce synthetic yet realistic data sets whose scale is commensurate with that of our production problems while only consuming already publicly available data.
Producing a realistic MovieLens 10 billion+ dataset: In this work, we focus on the MovieLens dataset, which only entails movie ratings posted publicly by users of the MovieLens platform. The MovieLens data set has become a standard benchmark for academic research in Recommender Systems: [44, 23, 17, 29, 45, 51, 53, 40, 49, 1, 31, 6, 19, 35] are only a few of the many recent research articles relying on MovieLens, whose latest version [16] has accrued more than citations according to Google Scholar. Unfortunately, the data set comprises few observed interactions and, more importantly, a very small catalogue of users and items when compared to industrial proprietary recommendation data.
In order to provide a new data set more aligned with the needs of production-scale Recommender Systems, we aim at expanding publicly available data by creating a realistic surrogate. The following constraints help create a production-size synthetic recommendation problem that is similar to, and at least as hard an ML problem as, the original one for matrix factorization approaches to recommendations [22, 17]:

- orders of magnitude more users and items are present in the synthetic dataset;
- the synthetic dataset is realistic in that its first and second order statistics match those of the original dataset presented in Figure 1.
Key first and second order statistics of interest we aim to preserve are summarized in Figure 1 — the details of their computation are given in Section IV.
Adapting Kronecker Graph expansions to user/item feedback: We employ the Kronecker Graph Theory introduced in [25] to achieve a suitable fractal expansion of recommendation data to benchmark linear and nonlinear user/item factorization approaches for recommendations [22, 17]. Consider a recommendation problem comprising $m$ users and $n$ items. Let $R$ be the sparse $m \times n$ matrix of recorded interactions (e.g. $R_{i,j}$ is the rating left by user $i$ for item $j$ if any, and $0$ otherwise). The key insight we develop in the present paper is that a carefully crafted fractal expansion of $R$ can preserve high level statistics of the original data set while scaling its size up by multiple orders of magnitude.
Many different transforms can be applied to the interaction matrix, which can be considered a standard sparse two-dimensional image. A recent approach to creating synthetic recommendation data sets consists in making parametric assumptions on user behavior by instantiating a user model interacting with an online platform [9, 42]. Unfortunately, such methods (even calibrated to reproduce empirical facts in actual data sets) do not provide strong guarantees that the resulting interaction data is similar to the original. Therefore, instead of simulating recommendations in a parametric user-centric way as in [9, 42], we choose a nonparametric approach operating directly in the space of user/item affinity. In order to synthesize a large realistic dataset in a principled manner, we adapt the Kronecker expansions which have previously been employed to produce large realistic graphs in [25]. We employ a nonparametric, analytically tractable simulation of the evolution of the user/item bipartite graph to create a large synthetic data set. Our choice is to trade off realism for analytic tractability, and we emphasize the latter.
While Kronecker Graph Theory is developed in [25, 26] on square adjacency matrices, the Kronecker product operator is well defined on rectangular matrices and therefore we can apply a similar technique to user/item interaction data sets. The Kronecker Graph generation paradigm has to be changed in other respects for the present data set, however: we need to decrease the expansion rate to generate data sets with the scale we desire, not orders of magnitude too large. We need to do so while maintaining key conservation properties of the original algorithm [26].
In order to reliably employ Kronecker-based fractal expansions on recommender system data, we make the following contributions:

- we develop a new technique based on linear algebra to adapt fractal Kronecker expansions to recommendation problems;
- we demonstrate that key recommendation system specific properties of the original dataset are preserved by our technique;
- we also show that the resulting algorithm is scalable and easily parallelizable as we employ it on the actual MovieLens 20 million dataset;
- we produce a synthetic yet realistic MovieLens 655 billion dataset to help recommender system research scale up in computational benchmarks for model training.
The present article is organized as follows: we first recapitulate prior research on ML for recommendations and large synthetic dataset generation; we then develop an adaptation of Kronecker Graphs to user/item interaction matrices and prove key theoretical properties; finally we apply the resulting algorithm to MovieLens 20m data and validate its statistical properties experimentally.
II Related work
Recommender Systems constitute the workhorse of many e-commerce, social networking and entertainment platforms. In the present paper we focus on the classical setting where the key role of a recommender system is to suggest relevant items to a given user. Although other approaches are very popular such as content based recommendations [39] or social recommendations [8], collaborative filtering remains a prevalent approach to the recommendation problem [30, 41, 38].
Collaborative filtering: The key insight behind collaborative filtering is to learn affinities between users and items based on previously collected user/item interaction data. Collaborative filtering exists in different flavors. Neighborhood methods group users by inter-user similarity and recommend to a given user items that have been consumed by neighbors [39]. Latent factor methods such as matrix factorization [22] try to decompose user/item affinity as the result of the interaction of a few underlying representative factors characterizing the user and the item. Although other latent models have been developed and could be used to construct synthetic recommendation data sets (e.g. Principal Component Analysis [20] or Latent Dirichlet Allocation [7]), we focus on insights derived from matrix factorization.
The matrix factorization approach represents the affinity between a user $i$ and an item $j$ with an inner product $x_i^T y_j$ where $x_i$ and $y_j$ are two vectors in $\mathbb{R}^k$ representing the user and the item respectively. Given a sparse matrix $R$ of user/item interactions, user and item factors can therefore be learned by approximating $R$ with a low rank matrix $X Y^T$ where $X$ entails the user factors and $Y$ contains the item factors. The data set $R$ represents ratings as in the MovieLens dataset [16] or item consumption ($R_{i,j} = 1$ if and only if user $i$ has consumed item $j$ [4]). The matrix factorization approach is an example of a solution to the rating matrix completion problem, which aims at predicting the rating of an item by a user which has not been observed yet and corresponds to a value of $0$ in the sparse original rating matrix. Such a factorization method learns an approximation of the data that preserves a few higher order properties of the rating matrix $R$. In particular, the low rank approximation tries to mimic the singular value spectrum of the original data set. We draw inspiration from matrix factorization to tackle synthetic data generation. The present paper will adopt a similar approach to extend collaborative filtering datasets. Besides trying to preserve the spectral properties of the original data, we operate under the constraint of conserving its first and second order statistical properties.
Deep Learning for Recommender Systems: Collaborative filtering has seen many recent developments which motivate our objective of expanding public data sets in a realistic manner. Deep Neural Networks (DNNs) are now becoming common in both nonlinear matrix factorization tasks [47, 17, 10] and sequential recommendations [52, 43, 15]. The mapping between user/item pairs and ratings is generally learned by training the neural model to predict user behavior on a large data set of previously observed user/item interactions.
DNNs consume large quantities of data and are computationally expensive to train; therefore they give rise to commonly shared benchmarks aimed at speeding up training. For training, a Stochastic Gradient Descent method is employed [24] which requires forward model computation and backpropagation to be run on many minibatches of (user, item, score) examples. The matrix completion task still consists in predicting a rating for the interaction of user $i$ and item $j$ although $R_{i,j}$ has not been observed in the original dataset. The model is typically run on billions of examples as the training procedure iterates over the training data set.
Model freshness is generally critical to industrial recommendations [10], which implies that only limited time is available to retrain the model on newly available data. The throughput of the trainer is therefore crucial to providing more engaging recommendation experiences and presenting more novel items. Unfortunately, public recommendation data sets are too small to provide training-time-to-accuracy benchmarks that can be realistically employed for industrial applications. Too few different examples are available in MovieLens 20m for instance and the number of different available items is orders of magnitude too small. In many industrial settings, millions of items (e.g. products, videos, songs) have to be taken into account by recommendation models. The recommendation model learns an embedding matrix of size where and are typical values. As a consequence, the memory footprint of this matrix may dominate that of the rest of the model by several orders of magnitude. During training, the latency and bandwidth of the access to such embedding matrices have a prominent influence on the final throughput in examples/second. Such computational difficulties associated with learning large embedding matrices are worth solving in benchmarks. A higher throughput enables training models with more examples, which enables better statistical regularization and architectural expressiveness. The multi-billion interaction size of the data set used for training is also a major factor that affects modeling choices and infrastructure development in the industry.
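The minibatch training loop on (user, item, score) examples described above can be illustrated with a minimal sketch. It uses a plain linear matrix factorization rather than a neural model, and the function name `sgd_matrix_factorization`, the toy triples and all hyperparameters are assumptions made for illustration only.

```python
import numpy as np

def sgd_matrix_factorization(triples, n_users, n_items, dim=4,
                             lr=0.1, epochs=300, batch_size=4, seed=0):
    """Fit user factors X and item factors Y so that X[u] . Y[i] ~ score."""
    rng = np.random.default_rng(seed)
    users = np.array([t[0] for t in triples])
    items = np.array([t[1] for t in triples])
    scores = np.array([t[2] for t in triples], dtype=float)
    X = 0.1 * rng.standard_normal((n_users, dim))  # user factors
    Y = 0.1 * rng.standard_normal((n_items, dim))  # item factors
    losses = []
    for _ in range(epochs):
        order = rng.permutation(len(scores))
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            u, i, r = users[idx], items[idx], scores[idx]
            # Forward pass: inner products, then per-example errors.
            err = np.sum(X[u] * Y[i], axis=1) - r
            grad_x = err[:, None] * Y[i]
            grad_y = err[:, None] * X[u]
            # np.add.at accumulates updates when an index repeats in a batch.
            np.add.at(X, u, -lr * grad_x)
            np.add.at(Y, i, -lr * grad_y)
        pred = np.sum(X[users] * Y[items], axis=1)
        losses.append(float(np.mean((pred - scores) ** 2)))
    return X, Y, losses

# Hypothetical toy data: (user, item, centered score).
triples = [(0, 0, 1.0), (0, 1, -1.0), (1, 0, 1.0), (1, 2, 0.5),
           (2, 1, -0.5), (2, 2, 1.0), (3, 0, -1.0), (3, 3, 0.5)]
X, Y, losses = sgd_matrix_factorization(triples, n_users=4, n_items=4)
print(losses[0], losses[-1])
```

Every step only touches the rows of the two embedding matrices indexed by the minibatch, which is why access latency and bandwidth to those matrices weigh on throughput once the matrices become large.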
Our aim is therefore to enable a comparison of modeling approaches, software engineering frameworks and hardware accelerators for ML in the context of industry-scale recommendations. In the present paper we focus on enabling a better evaluation of examples/sec throughput and training-time-to-accuracy for Neural Collaborative Filtering approaches [17] and Matrix Factorization approaches [22]. A major issue, as we mentioned, is that publicly available collaborative filtering data sets are orders of magnitude smaller than production-grade data (see Table I) and have unrepresentative numbers of distinct users and items. The present paper offers a first solution to this problem by providing a simple and tractable nonparametric fractal approach to scaling up public recommendation data sets by several orders of magnitude. Such data sets will help build a first set of benchmarks for model training accelerators based on publicly available data. We plan to publish the expanded data. However, MovieLens 20m is publicly available and our method can already be applied to recreate such an expanded data set. Metadata is also common in industrial applications [10, 3, 5] but we consider its expansion outside the scope of this first development.
III Fractal expansions of user/item interaction data sets
The present section delineates the insights orienting our design decisions when expanding public recommendation data sets.
III-1 Self-similarity in user/item interactions
Interactions between users and items follow a natural hierarchy in data sets where items can be organized in topics, genres, categories, etc. [50]. There is for instance an item-level fractal structure in MovieLens 20m with a tree-like structure of genres, sub-genres, and directors. If users were clustered according to their demographics and tastes, another hierarchy would be formed [39]. The corresponding structured user/item interaction matrix is illustrated in Figure 2. The hierarchical nature of user/item interactions (topical and demographic) makes the recommendation data set structurally self-similar (i.e. patterns that occur at more granular scales resemble those affecting coarser scales [33]).
One can therefore build a user-group/item-category incidence matrix with user groups as rows and item categories as columns, i.e. a coarse interaction matrix. As each user group consists of many individuals and each item category comprises multiple movies, the original individual level user/item interaction matrix may be considered as an expanded version of the coarse interaction matrix. We choose to expand the user/item interaction matrix by extrapolating this self-similar structure and simulating its growth to yet another level of granularity: each original item is treated as a synthetic topic in the expanded data set and each actual user is considered a fictional user group.
A key advantage of this fractal procedure is that it may be entirely nonparametric and designed to preserve high level properties of the original dataset. In particular, a fractal expansion reintroduces the patterns originally observed in the entire real dataset within each block of local interactions of the synthetic user/item matrix. By carefully designing the way such blocks are produced and laid out, we can therefore hope to produce a realistic yet much larger rating matrix. In the following, we show how the Kronecker operator enables such a construction.
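III-2 Fractal expansion through Kronecker products
The Kronecker product, denoted $\otimes$, is a nonstandard matrix operator with an intrinsic self-similar structure:
$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix} \qquad (1)$$
where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{p \times q}$ and $A \otimes B \in \mathbb{R}^{mp \times nq}$.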
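The block structure of the Kronecker product can be seen on a small example. The sketch below uses NumPy's `np.kron`; the matrices `A` and `B` are arbitrary toy values chosen for illustration.

```python
import numpy as np

# Toy matrices (illustrative values only).
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1, 0],
              [1, 0, 1]])

# np.kron builds the block matrix whose (i, j) block is A[i, j] * B.
K = np.kron(A, B)

# The output has rows(A) * rows(B) rows and cols(A) * cols(B) columns.
print(K.shape)

# Extract block (1, 0) of K: it is a copy of B scaled by A[1, 0].
p, q = B.shape
block = K[1 * p:2 * p, 0 * q:1 * q]
print(block)
```

Each block repeats the pattern of `B` at a scale set by one entry of `A`, which is the self-similar structure the expansion relies on.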
In the original presentation of Kronecker Graph Theory [25] as well as the stochastic extension [32] and the extended theory [26], the Kronecker product is the core operator enabling the synthesis of graphs with exponentially growing adjacency matrices. As in the present work, the insight underlying the use of Kronecker Graph Theory in [26] is to produce large synthetic yet realistic graphs. The fractal nature of the Kronecker operator as it is applied multiple times (see Figure 2 in [26] for an illustration) fits the self-similar statistical properties of real world graphs such as the internet, the web or social networks [25].
If $A$ is the adjacency matrix of the original graph, fractal expansions are created in [25] by chaining Kronecker products as follows: $A \otimes A \otimes \cdots \otimes A$.
As adjacency matrices are square, Kronecker Graphs are not employed on rectangular matrices in preexisting work although the operation is well defined. Another slight divergence between present and preexisting work is that, in their stochastic version, Kronecker Graphs carry Bernoulli probability distribution parameters in while MovieLens ratings are in originally and after we center and rescale them. We show that these differences do not prevent Kronecker products from preserving core properties of rating matrices. A more important challenge is the size of the original matrix we deal with: . A naive Kronecker expansion would therefore synthesize a rating matrix with billion users which is too large.
Thus, although Kronecker products seem like an ideal candidate for the mechanism at the core of the self-similar synthesis of a larger recommendation dataset, some modifications are needed to the algorithms developed in [26].
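The scale problem of a naive expansion follows from simple arithmetic. The sketch below uses the rounded MovieLens 20M figures from the table in the introduction (138K users, 27K items, 20M ratings), so the results are approximations rather than exact dataset counts.

```python
# Rounded sizes taken from the MovieLens 20M column of the introduction's table.
users = 138_000
items = 27_000
ratings = 20_000_000

# The Kronecker product of the rating matrix with itself multiplies
# the row counts, the column counts and the numbers of nonzero entries.
naive_users = users * users
naive_items = items * items
naive_ratings = ratings * ratings

print(f"{naive_users:,} users, {naive_items:,} items, {naive_ratings:,} ratings")
```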
III-3 Reduced Kronecker expansions
We choose to synthesize a user/item rating matrix $\tilde{R} = \hat{R} \otimes R$ where $R$ is the original rating matrix and $\hat{R}$ is a matrix derived from $R$ but much smaller (for instance ). For reasons that will become apparent as we explore some theoretical properties of Kronecker fractal expansions, we want to construct a smaller derived matrix $\hat{R}$ that shares similarities with $R$. In particular, we seek $\hat{R}$ with a similar row-wise sum distribution (user engagement distribution), column-wise sum distribution (item engagement distribution) and singular value spectrum (signal to noise ratio distribution in the matrix factorization).
III-4 Implementation at scale and algorithmic extensions
Computing a Kronecker product between two matrices $A$ and $B$ is an inherently parallel operation. It is sufficient to broadcast $B$ to each element $a_{ij}$ of $A$ and then multiply $B$ by $a_{ij}$. Such a property implies that scaling can be achieved. Another advantage of the operator is that even a single machine can produce a large output dataset by sequentially iterating on the values of $A$. Only storage space commensurate with the size of the original matrix is needed to compute each block of the Kronecker product. It is noteworthy that generalized fractal expansions can be defined by altering the standard Kronecker product. We consider such extensions here as candidates to engineer more challenging synthetic data sets. One drawback though is that these extensions may not preserve analytic tractability.
A first generalization defines a binary operator with as follows:
(2) 
where is a sequence of pseudorandom numbers. Including randomization and nonlinearity in appears as a simple way to synthesize data sets entailing more varied patterns. The algorithm we employ to compute Kronecker products is presented in Algorithm 1. The implementation we employ is trivially parallelizable. We only create a list of Kronecker blocks to dump entire rows (users) of the output matrix to file. This is not necessary and can be removed to enable as many processes to run simultaneously and independently as there are elements in (provided pseudo random numbers are generated in parallel in an appropriate manner).
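The block-by-block computation described in this section can be sketched as a generator that emits one block-row of the output at a time. This is a minimal illustration of the standard (non-randomized) Kronecker product, not a reproduction of Algorithm 1; the names `kronecker_row_blocks`, `small` and `base` are assumptions.

```python
import numpy as np

def kronecker_row_blocks(small, base):
    """Yield the Kronecker product of `small` and `base` one block-row at a time.

    Each yielded array has shape (base.shape[0], small.shape[1] * base.shape[1])
    and corresponds to one row of `small`, so only one block-row is in memory.
    """
    for i in range(small.shape[0]):
        # Scale a copy of `base` by every entry of row i of `small`
        # and lay the resulting blocks side by side.
        yield np.hstack([small[i, j] * base for j in range(small.shape[1])])

rng = np.random.default_rng(0)
small = rng.uniform(-1.0, 1.0, size=(2, 3))  # stand-in for the reduced matrix
base = rng.uniform(-1.0, 1.0, size=(4, 5))   # stand-in for the original matrix

# Stacking the block-rows recovers the full product (done here only for checking).
stacked = np.vstack(list(kronecker_row_blocks(small, base)))
print(stacked.shape)
```

Because each block depends on a single entry of `small`, the loop iterations are independent and could be distributed across processes, with each process writing its own block-rows to file.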
The only reason why we need a reduced version of the rating matrix is to control the size of the expansion. Also, the Kronecker products of two matrices taken in either order are equal after a row-wise and a column-wise permutation. Therefore, another family of appropriate extensions may be obtained by considering
(3) 
where is a randomized sketching operation on the matrix which reduces its size by several orders of magnitude. A trivial scheme consists in sampling a small number of rows and columns from at random. Other random projections [2, 28, 12] may of course be used. The randomized procedures above produce a user/item interaction matrix where there is no longer a block-wise repetitive structure. Less obvious statistical patterns can give rise to more challenging synthetic large-scale collaborative filtering problems.
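The trivial sketching scheme mentioned above (sampling a few rows and columns at random) can be illustrated as follows. The helper name `sample_sketch`, the toy dense matrix and the choice of taking the Kronecker product of the sketch with the full matrix are assumptions made for this example.

```python
import numpy as np

def sample_sketch(matrix, n_rows, n_cols, rng):
    """Return a submatrix of randomly sampled rows and columns (without replacement)."""
    rows = rng.choice(matrix.shape[0], size=n_rows, replace=False)
    cols = rng.choice(matrix.shape[1], size=n_cols, replace=False)
    return matrix[np.ix_(rows, cols)]

rng = np.random.default_rng(42)
# Toy dense stand-in for a rating matrix (illustrative values only).
ratings = rng.integers(0, 6, size=(50, 30)).astype(float)

sketch = sample_sketch(ratings, n_rows=4, n_cols=3, rng=rng)
expanded = np.kron(sketch, ratings)
print(sketch.shape, expanded.shape)
```

Drawing a fresh sketch with a different seed changes which patterns are replicated, so the expanded matrix no longer repeats one fixed small matrix.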
IV Statistical properties of Kronecker fractal expansions
After having introduced Kronecker products to self-similarly expand a recommendation dataset into a much larger one, we now demonstrate how the resulting synthetic user/item interaction matrix shares crucial common properties with the original.
IV-1 Salient empirical facts in MovieLens data
First, we introduce the critical properties we want to preserve. Note that throughout the paper we present results on a centered version of MovieLens 20m. The average rating of is subtracted from all ratings (so that the zero elements of the sparse rating matrix match unobserved scores and not bad movie ratings). Furthermore, we rescale the centered ratings so that they are all in the interval . As a user/item interaction dataset on an online platform, one expects MovieLens to feature common properties of recommendation data sets such as “power-law” or fat-tailed distributions [50].
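The centering and rescaling step can be sketched on a handful of ratings. The toy values are illustrative, and the target interval $[-1, 1]$ used here is an assumption, since the interval is not legible in the text above.

```python
import numpy as np

# Toy observed ratings (illustrative values only).
ratings = np.array([5.0, 3.0, 4.0, 1.0, 2.5, 4.5])

# Subtract the mean of the observed ratings only, so that an implicit zero
# in the sparse matrix means "unobserved" rather than "bad rating".
centered = ratings - ratings.mean()

# Rescale by the largest magnitude; the interval [-1, 1] is an assumption.
rescaled = centered / np.abs(centered).max()
print(rescaled)
```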
The first important statistical properties for recommendations concern the distribution of interactions across users and across items. It is generally observed that such distributions exhibit a “power-law” behavior [50, 1, 36, 11, 37, 48, 27, 13]. To characterize such a behavior in the MovieLens data set, we take a look at the distribution of the total ratings along the item axis and the user axis. In other words, we compute row-wise and column-wise sums for the rating matrix and observe their distributions. The corresponding ranked distributions are shown in Figure 1 and do exhibit a clear “power-law” behavior for rather popular items. However, we observe that tail items have a higher popularity decay rate. Similarly, the engagement decay rate increases for the group of less engaged users.
The other approximate “power-law” we find in Figure 1 lies in the singular value spectrum of the MovieLens dataset. We compute the top singular values [18] of the MovieLens rating matrix $R$ (with $m$ rows and $n$ columns) by approximate iterative methods (e.g. power iteration) which can scale to its large dimension. The method yields the $k$ dominant singular values of $R$ and the corresponding singular vectors so that one can classically approximate $R$ by $U \Sigma V$ where $\Sigma$ is diagonal of dimension $k \times k$, $U$ is column-orthogonal of dimension $m \times k$ and $V$ is row-orthogonal of dimension $k \times n$, which yields the rank $k$ matrix closest to $R$ in Frobenius norm.
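Power iteration, mentioned above as one way to reach the top of the spectrum, only needs matrix-vector products. The sketch below estimates the largest singular value of a toy matrix; the function name `top_singular_value`, the iteration count and the toy data are assumptions, and a production setting would use a sparse matrix and a method that returns several singular triplets.

```python
import numpy as np

def top_singular_value(matrix, n_iter=200, seed=0):
    """Estimate the largest singular value by power iteration on matrix.T @ matrix."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(matrix.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        # Two matrix-vector products per step; the matrix is never squared explicitly.
        w = matrix.T @ (matrix @ v)
        v = w / np.linalg.norm(w)
    return np.linalg.norm(matrix @ v)

rng = np.random.default_rng(1)
# Toy matrix: a strong rank-one component plus noise, so the top singular
# value is well separated and the iteration converges quickly.
toy = 10.0 * np.outer(rng.uniform(size=40), rng.uniform(size=25))
toy += rng.standard_normal((40, 25))

estimate = top_singular_value(toy)
exact = np.linalg.svd(toy, compute_uv=False)[0]
print(estimate, exact)
```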
Examining the distribution of the top singular values of in the MovieLens dataset (which has at most nonzero singular values) in Figure 1 highlights a clear “power-law” behavior in the highest magnitude part of the spectrum of the rating matrix. We observe in the spectral distribution an inflection for smaller singular values whose magnitude decays at a higher rate than larger singular values. Such a spectral distribution is a key feature of the original dataset, in particular in that it conditions the difficulty of low-rank approximation approaches to the matrix completion problem. Therefore, we also want the expanded dataset to exhibit a similar behavior in terms of spectral properties.
In all the high level statistics we present, we want to preserve the approximate “power-law” decay as well as its inflection for smaller values. Our requirements for the expanding transform which we apply to the rating matrix are therefore threefold: we want to preserve the distribution of its row-wise sums, the distribution of its column-wise sums and its singular value distribution. Additional requirements, beyond first and second order high level statistics, will further increase the confidence in the realism of the expanded synthetic dataset. Nevertheless, we consider that focusing on these first three properties is a good starting point.
IV-2 Preserving MovieLens data properties while expanding it
We now present the fractal transform design we rely on to preserve the key statistical properties of the previous section.
Definition 1
Consider $A \in \mathbb{R}^{m \times n}$. We denote the set of row-wise sums of $A$ by $\mathrm{RS}(A)$, the set of column-wise sums of $A$ by $\mathrm{CS}(A)$, and the set of singular values of $A$ by $\mathrm{S}(A)$.
Definition 2
Consider an integer $i$ and a nonzero positive integer $N$. We denote by $\lfloor i \rfloor_N = \lfloor i / N \rfloor$ the integer part of $i$ in base $N$ and by $\{i\}_N = i \bmod N$ the fractional part.
First we focus on conservation properties in terms of row-wise and column-wise sums which correspond respectively to marginalized user engagement and item popularity distributions. In the following, $\times$ denotes the Minkowski product of two sets, i.e. $S \times T = \{st \mid s \in S, t \in T\}$.
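Proposition 1
Consider $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$ and their Kronecker product $K = A \otimes B$. Then the set of row-wise sums of $K$ is the Minkowski product (the set of all pairwise products) of the sets of row-wise sums of $A$ and of $B$, and the set of column-wise sums of $K$ is the Minkowski product of the sets of column-wise sums of $A$ and of $B$.
Proof 1
Consider the $i$-th row of $K$ (with indices starting at $0$), and write $i = i_1 p + i_2$ and $j = j_1 q + j_2$ for the Euclidean divisions of the row index by $p$ and of the column index by $q$. By definition of $K$, the corresponding sum can be rewritten as follows: $\sum_{j=0}^{nq-1} K_{i,j} = \sum_{j=0}^{nq-1} A_{i_1, j_1} B_{i_2, j_2}$, which in turn equals $\sum_{j_1=0}^{n-1} \sum_{j_2=0}^{q-1} A_{i_1, j_1} B_{i_2, j_2}$.
Refactoring the two sums concludes the proof for the row-wise sum properties. The proof for column-wise properties is identical.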
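The row-wise and column-wise sum property can be checked numerically on random matrices. The sketch below is a sanity check rather than a proof; the matrix sizes are arbitrary, and the index rule used for a single entry follows the Euclidean division of the row and column indices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((5, 2))
K = np.kron(A, B)
p, q = B.shape

# Entry (i, j) of K combines one entry of A and one entry of B,
# located by integer division and remainder of the indices.
i, j = 7, 5
entry = A[i // p, j // q] * B[i % p, j % q]

# All pairwise products of a row sum of A and a row sum of B,
# listed in the same order as the rows of K (likewise for columns).
row_products = np.outer(A.sum(axis=1), B.sum(axis=1)).ravel()
col_products = np.outer(A.sum(axis=0), B.sum(axis=0)).ravel()

print(np.allclose(K.sum(axis=1), row_products))
print(np.allclose(K.sum(axis=0), col_products))
```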
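Theorem 1
Consider $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$ and their Kronecker product $K = A \otimes B$. Then the set of singular values of $K$ is the Minkowski product (the set of all pairwise products) of the sets of singular values of $A$ and of $B$.
Proof 2
One can easily check that $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ for any quadruple of matrices $A, B, C, D$ for which the notation makes sense and that $(A \otimes B)^T = A^T \otimes B^T$. Let $A = U_A \Sigma_A V_A$ be the SVD of $A$ and $B = U_B \Sigma_B V_B$ the SVD of $B$. Then $A \otimes B = (U_A \otimes U_B)(\Sigma_A \otimes \Sigma_B)(V_A \otimes V_B)$. Now, $(U_A \otimes U_B)^T (U_A \otimes U_B) = (U_A^T U_A) \otimes (U_B^T U_B) = I$. Writing the same decomposition for $V_A \otimes V_B$ and considering that $U_A$, $U_B$ are column-orthogonal while $V_A$, $V_B$ are row-orthogonal concludes the proof.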
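The spectral property can also be verified numerically. The sketch below compares the singular values of a Kronecker product with the sorted pairwise products of the singular values of its factors; matrix sizes are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((5, 2))

s_A = np.linalg.svd(A, compute_uv=False)
s_B = np.linalg.svd(B, compute_uv=False)
s_K = np.linalg.svd(np.kron(A, B), compute_uv=False)

# All pairwise products, sorted in decreasing order like the output of svd.
products = np.sort(np.outer(s_A, s_B).ravel())[::-1]

# The product has rank 3 * 2 = 6, so the remaining singular values are zero.
print(np.allclose(s_K[:products.size], products))
print(np.allclose(s_K[products.size:], 0.0))
```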
The properties above imply that knowing the rowwise sums, columnwise sums and singular value spectrum of the reduced rating matrix and the original rating matrix is enough to deduce the corresponding properties for the expanded rating matrix — analytically. As in [26], the Kronecker product enables analytic tractability while expanding data sets in a fractal manner to orders of magnitude more data.
IV-3 Constructing a reduced matrix with a similar spectrum
Considering that the quasi “power-law” properties of imply, as in [26], that has a similar distribution to , we seek a small reduced matrix whose high order statistical properties are similar to those of the original rating matrix. As we want to generate a dataset with several billion user/item interactions, millions of distinct users and millions of distinct items, we are looking for a matrix with a few hundred or thousand rows and columns. The reduced matrix we seek is therefore orders of magnitude smaller than the original. In order to produce a reduced matrix of dimensions one could use the older and smaller MovieLens 100K dataset [16]. Such a dataset can be interpreted as a subsampled reduced version of MovieLens 20m with similar properties. However, the data sets have been collected seven years apart and therefore temporal nonstationarity issues become a concern. Also, we aim to produce an expansion method where the expansion multipliers can be chosen flexibly by practitioners. It is noteworthy that, in our experiments, naive uniform user and item sampling strategies have not yielded smaller matrices with properties similar to those of the original. Different random projections [2, 28, 12] could more generally be employed; however, we rely on a procedure better tailored to our specific statistical requirements.
We now describe the technique we employed to produce a reduced size matrix $\hat{R}$ with first and second order properties close to those of the original rating matrix $R$ (with $m$ rows and $n$ columns), which in turn led to constructing an expansion matrix $\hat{R} \otimes R$ similar to $R$. We want the dimensions of $\hat{R}$ to be $m' \times n'$ with $m' \ll m$ and $n' \ll n$. Consider again the approximate Singular Value Decomposition (SVD) [18] of $R$ with the $k$ principal singular values of $R$:
$$R \approx U \Sigma V \qquad (4)$$
where $U$ has orthogonal columns, $V$ has orthogonal rows, and $\Sigma$ is diagonal with nonnegative terms.
To reduce the number of rows and columns of $R$ while preserving its top singular values, a trivial solution would consist in replacing $U$ and $V$ by small random orthogonal matrices with few rows and columns respectively. Unfortunately such a method would only seemingly preserve the spectral properties of $R$ as the principal singular vectors would be widely changed. Such properties are important: one of the key advantages of employing Kronecker products in [26] is the preservation of the network values, i.e. the distributions of singular vector components of a graph’s adjacency matrix.
To obtain a matrix $\hat{U}$ with fewer rows than $U$ but column-orthogonal and similar to $U$ in the distribution of its values we use the following procedure. We resize $U$ down to $m'$ rows with $m' < m$ through an averaging-based downscaling method that can classically be found in standard image processing libraries (e.g. skimage.transform.resize in the scikit-image library [46]). Let $\tilde{U}$ be the corresponding resized version of $U$. We then construct $\hat{U}$ as the column-orthogonal matrix in $\mathbb{R}^{m' \times k}$ closest in Frobenius norm to $\tilde{U}$. Therefore as in [14] we compute
$$\hat{U} = \tilde{U} (\tilde{U}^T \tilde{U})^{-1/2} \qquad (5)$$
We apply a similar procedure to $V$ to reduce its number of columns, which yields a row-orthogonal matrix $\hat{V}$ with $n'$ columns. The orthogonality of $\hat{U}$ (column-wise) and $\hat{V}$ (row-wise) guarantees that the singular value spectrum of
$$\hat{R} = \hat{U} \Sigma \hat{V} \qquad (6)$$
consists exactly of the $k$ leading components of the singular value spectrum of $R$. Like $R$, $\hat{R}$ is rescaled to take values in . The whole procedure to reduce $R$ down to $\hat{R}$ is summarized in Algorithm 2.
We verify empirically that the distributions of values of the reduced singular vectors in $\hat{U}$ and $\hat{V}$ are similar to those of $U$ and $V$ respectively, to preserve first order properties of $R$ and value distributions of its singular vectors. Such properties are demonstrated through numerical experiments in the next section.
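The reduction procedure of this section can be sketched end to end on a small dense matrix. This is an illustration under stated assumptions, not Algorithm 2 itself: the averaging step is a simplified stand-in for an image-resize routine and requires the row count to be divisible by the target, the nearest column-orthogonal matrix is obtained from an SVD (the polar factor, which coincides with the inverse-square-root formula when the resized matrix has full column rank), and the final rescaling of values is omitted. All function names are hypothetical.

```python
import numpy as np

def average_downscale_rows(matrix, n_rows):
    """Downscale along axis 0 by averaging groups of consecutive rows."""
    group = matrix.shape[0] // n_rows
    if group * n_rows != matrix.shape[0]:
        raise ValueError("row count must be a multiple of n_rows in this sketch")
    return matrix.reshape(n_rows, group, matrix.shape[1]).mean(axis=1)

def closest_column_orthogonal(matrix):
    """Matrix with orthonormal columns closest to `matrix` in Frobenius norm."""
    left, _, right_t = np.linalg.svd(matrix, full_matrices=False)
    return left @ right_t

def reduce_matrix(matrix, n_rows, n_cols, rank):
    """Return a (n_rows x n_cols) matrix keeping the top `rank` singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    u, s, vt = u[:, :rank], s[:rank], vt[:rank, :]
    u_small = closest_column_orthogonal(average_downscale_rows(u, n_rows))
    # Same treatment for the right factor, applied to its transpose.
    v_small = closest_column_orthogonal(average_downscale_rows(vt.T, n_cols))
    return u_small @ np.diag(s) @ v_small.T, s

rng = np.random.default_rng(0)
toy = rng.standard_normal((64, 48))  # dense toy stand-in for a rating matrix
reduced, kept = reduce_matrix(toy, n_rows=16, n_cols=12, rank=8)

# The nonzero singular values of the reduced matrix are the kept ones.
print(np.allclose(np.linalg.svd(reduced, compute_uv=False)[:8], kept))
```

The sketch requires `rank` to be at most `min(n_rows, n_cols)`; otherwise the reduced factors cannot have orthonormal columns and the spectrum is no longer preserved exactly.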
V Experimentation on MovieLens 20 million data
The MovieLens 20m data comprises 20m ratings given by 138 thousand users to 27 thousand items. In the present section, we demonstrate how the fractal Kronecker expansion technique we devised and presented helps scale up this dataset to orders of magnitude more users, items and interactions, all in a parallelizable and analytically tractable manner.
V-1 Size of expanded data set
In present experiments we construct a reduced rating matrix of size which implies the resulting expanded data set will comprise billion interactions between million users and K items. In appendix, we present the results obtained for a reduced rating matrix of size and a synthetic data set consisting of billion interactions between million users and million items.
Such a high number of interactions and items enables the training of deep neural collaborative models such as the Neural Collaborative Filtering model [17] at a scale which is now more representative of industrial settings. Moreover, the increased data set size helps construct benchmarks for deep learning software packages and ML accelerators that employ the same orders of magnitude as production settings in terms of user base size, item vocabulary size and number of observations.
V-2 Empirical properties of reduced matrix
The construction technique of the reduced matrix aimed to produce, just as in [26], a matrix sharing the properties of the original though smaller in size. To that end, we aimed at constructing a matrix of dimension with properties close to those of the original in terms of column-wise sum, row-wise sum and singular value spectrum distributions.
We now check that the construction procedure we devised does produce a reduced matrix with the properties we expected. As the impact of the resizing step is unclear from an analytic standpoint, we had to resort to numerical experiments to validate our method.
In Figure 3, one can assess that the first and second order properties of the reduced and original matrices match with high enough fidelity. In particular, the higher magnitude column-wise and row-wise sum distributions follow a “power-law” behavior similar to that of the original matrix. Similar observations can be made about the singular value spectra of the two matrices.
There is therefore now a reasonable likelihood that our adapted Kronecker expansion, although somewhat differing from the method originally presented in [26], will enjoy the same benefits in terms of enabling data set expansion while preserving high order statistical properties.
V-3 Empirical properties of the expanded data set
We now verify empirically that the expanded rating matrix does share common first and second order properties with the original rating matrix. The new data size is orders of magnitude larger in terms of number of rows and columns and orders of magnitude larger in terms of number of nonzero terms. Notice here that, as the reduced matrix is in general dense, the level of sparsity of the expanded data set is the same as that of the original.
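The sparsity remark can be checked on toy data: when the small factor has no zero entry, the Kronecker product multiplies the number of nonzero terms by the number of entries of the small factor and leaves the density unchanged. The sizes and the 10% density below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse toy "original": roughly 10% of the entries are observed.
mask = rng.random((20, 30)) < 0.1
original = rng.standard_normal((20, 30)) * mask

# Dense toy "reduced" matrix with no zero entry.
reduced = rng.uniform(0.5, 1.0, size=(3, 4))

expanded = np.kron(reduced, original)

density_original = np.count_nonzero(original) / original.size
density_expanded = np.count_nonzero(expanded) / expanded.size
print(expanded.shape, density_original, density_expanded)
```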
Another benefit of using a fractal expansion method with analytic tractability is that we can deduce high order statistics of the expanded data set beforehand without having to instantiate it. In particular, Proposition 1 implies that knowing the columnwise and rowwise sum distributions of the reduced matrix and the original rating matrix is sufficient to determine the corresponding marginals for the expanded data set. Similarly, the leading singular values of the Kronecker product can be computed with Theorem 1 just based on the singular values of its two factors, the reduced matrix and the original rating matrix.
In Figure 4, one can confirm that the spectral properties of the expanded data set as well as the user engagement (rowwise sums) and item popularity (columnwise sums) are similar to those of the original data set. Such observations indicate that the resulting data set is representative, in its fat-tailed data distribution and quasi “power-law” singular value spectrum, of problems encountered in ML for collaborative filtering. Furthermore, the expanded data set reproduces some irregularities of the original data, in particular the accelerating decay of values in ranked rowwise and columnwise sums as well as in the singular value spectrum.
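As an illustration of this analytic tractability, the sketch below checks the standard Kronecker product identities on toy matrices: rowwise and columnwise sums of the product are Kronecker products of the factors' sums, and its singular values are the pairwise products of the factors' singular values. The matrices and variable names are hypothetical stand-ins, and the code is not a restatement of Proposition 1 or Theorem 1 themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy factors (illustrative only).
reduced = rng.uniform(0.5, 1.5, size=(3, 2))
original = rng.integers(0, 6, size=(6, 5)).astype(float)

# Instantiated here only to verify the predictions below.
expanded = np.kron(reduced, original)

# Marginals predicted from the factors alone.
row_sums_pred = np.kron(reduced.sum(axis=1), original.sum(axis=1))
col_sums_pred = np.kron(reduced.sum(axis=0), original.sum(axis=0))

# Singular values predicted as all pairwise products, sorted in decreasing order.
sv_reduced = np.linalg.svd(reduced, compute_uv=False)
sv_original = np.linalg.svd(original, compute_uv=False)
sv_pred = np.sort(np.outer(sv_reduced, sv_original).ravel())[::-1]

# Reference values computed directly on the expanded matrix.
sv_expanded = np.linalg.svd(expanded, compute_uv=False)
```

In practice only the predictions would be computed, since the point of the identities is to avoid materializing the expanded matrix.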
V4 Limitations
Although it does not condition the difficulty of rating matrix factorization problems (which depends primarily on the interaction distribution, the number of users, the number of items and the sparsity of the rating matrix), the distribution of ratings is still an important statistical property of the MovieLens 20m data set. Figure 5 shows that there is a certain degree of divergence between the rating scales of the original and synthetic data sets. In particular, many more rating values are present in the expanded data set as a result of the multiplication of terms from the reduced matrix and the original rating matrix. The transformations turning the original rating matrix into the reduced matrix create a smoother scale of values, leading to a Kronecker product with many different possible ratings. Although the nonzero value distribution is a divergence point between the two data sets, ratings in the synthetic data set are dominated by values which are close to the average, as in the original MovieLens 20m. Therefore the synthetic ratings do have a certain degree of realism as they represent user/item interactions where strong reactions (positive or negative) from users are much less likely than neutral interactions.
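The growth in the number of distinct rating values can be reproduced on a toy example. This is a minimal sketch, assuming a half-star rating scale from 0.5 to 5.0 for the original matrix and a hypothetical reduced matrix with smooth real-valued entries (the range chosen for those entries is arbitrary and not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete rating scale: 0.5, 1.0, ..., 5.0 (10 possible values).
rating_scale = np.arange(0.5, 5.5, 0.5)

# Hypothetical toy "original" matrix: sparse, with ratings on the discrete scale.
mask = rng.random((8, 6)) < 0.4
original = rng.choice(rating_scale, size=(8, 6)) * mask

# Hypothetical toy "reduced" matrix with smooth, real-valued entries.
reduced = rng.uniform(0.5, 1.5, size=(3, 3))

expanded = np.kron(reduced, original)


def distinct_nonzero(m):
    """Sorted array of the distinct nonzero values of a dense array."""
    return np.unique(m[m != 0])


n_original = distinct_nonzero(original).size
n_expanded = distinct_nonzero(expanded).size
```

Each discrete rating is multiplied by every entry of the smooth factor, so `n_expanded` is larger than `n_original`, which mirrors the smoother rating scale discussed above.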
Another limitation of the synthetic data set is the blockwise repetitive structure of Kronecker products. Although the synthetic data set is still hard to factorize as the product of two low rank matrices because its singular values are still distributed similarly to those of the original data set, it is now easy to factorize with a Kronecker SVD [21] which takes advantage of the blockwise repetitions in Eq (1). The randomized fractal expansions presented in Eq (2) and Eq (3) address this issue. A simple example of such a randomized variation around the Kronecker product consists in shuffling the rows and columns of each block in Eq (1) independently at random. The shuffles break the blockwise repetitive structure and prevent Kronecker SVD from producing a trivial solution to the factorization problem.
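The shuffling variation described above can be sketched as follows. This is an illustrative dense implementation on toy matrices, with a hypothetical function name `shuffled_kronecker`. It is not the authors' implementation, and a data set at the scale discussed here would require sparse, block-streamed processing instead.

```python
import numpy as np


def shuffled_kronecker(reduced, original, rng):
    """Kronecker-like expansion where each block is an independently
    row- and column-shuffled copy of `original`, scaled by one entry
    of `reduced`."""
    m, n = reduced.shape
    p, q = original.shape
    out = np.empty((m * p, n * q), dtype=np.result_type(reduced, original))
    for i in range(m):
        for j in range(n):
            # Fresh, independent permutations for every block.
            rows = rng.permutation(p)
            cols = rng.permutation(q)
            block = original[np.ix_(rows, cols)]
            out[i * p:(i + 1) * p, j * q:(j + 1) * q] = reduced[i, j] * block
    return out


rng = np.random.default_rng(0)
# Hypothetical toy factors (illustrative only).
reduced = rng.uniform(0.5, 1.5, size=(3, 2))
original = rng.integers(0, 6, size=(6, 5)).astype(float)

plain = np.kron(reduced, original)
shuffled = shuffled_kronecker(reduced, original, rng)
```

Each shuffled block contains the same multiset of values as the corresponding block of the plain Kronecker product, so the number of nonzeros and the total sum are unchanged. The sketch does not check how the shuffles affect rowwise and columnwise sums or the singular value spectrum.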
VI Conclusion
In conclusion, this paper presents a first attempt at synthesizing realistic large-scale recommendation data sets without having to make compromises in terms of user privacy. We use a small publicly available data set, MovieLens 20m, and expand it to orders of magnitude more users, items and observed ratings. Our expansion model is rooted in the hierarchical structure of user/item interactions, which naturally suggests a fractal extrapolation model.
We leverage Kronecker products as self-similar operators on user/item rating matrices that impact key properties of rowwise and columnwise sums as well as singular value spectra in an analytically tractable manner. We modify the original Kronecker Graph generation method to enable an expansion of the original data by orders of magnitude that yields a synthetic data set matching industrial recommendation data sets in scale. Our numerical experiments demonstrate that the data set we create has key first and second order properties similar to those of the original MovieLens 20m rating matrix.
Our next steps consist in making large synthetic data sets publicly available, although any researcher can readily use the techniques we presented to scale up any user/item interaction matrix. Another possible direction is to adapt the present method to recommendation data sets featuring metadata (e.g. timestamps, topics, device information). The use of metadata is indeed critical to solve the “cold-start” problem of users and items having no interaction history with the platform. We also plan to benchmark the performance of well established baselines on the new large scale realistic synthetic data we produce.
References
 [1] Abdollahpouri, H., Burke, R., and Mobasher, B. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (2017), ACM, pp. 42–46.
 [2] Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66, 4 (2003), 671–687.

 [3] Belletti, F., Beutel, A., Jain, S., and Chi, E. Factorized recurrent neural architectures for longer range dependence. In International Conference on Artificial Intelligence and Statistics (2018), pp. 1522–1530.
 [4] Bennett, J., Lanning, S., et al. The netflix prize. In Proceedings of KDD cup and workshop (2007), vol. 2007, New York, NY, USA, p. 35.
 [5] Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. Latent cross: Making use of context in recurrent recommender systems. In International Conference on Web Search and Data Mining (2018), ACM, pp. 46–54.
 [6] Bhargava, A., Ganti, R., and Nowak, R. Active positive semidefinite matrix completion: Algorithms, theory and applications. In Artificial Intelligence and Statistics (2017), pp. 1349–1357.
 [7] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
 [8] Chaney, A. J., Blei, D. M., and Eliassi-Rad, T. A probabilistic model for using social networks in personalized item recommendation. In Proceedings of the 9th ACM Conference on Recommender Systems (2015), ACM, pp. 43–50.
 [9] Chaney, A. J., Stewart, B. M., and Engelhardt, B. E. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. arXiv preprint arXiv:1710.11214 (2017).
 [10] Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 191–198.
 [11] Cremonesi, P., Koren, Y., and Turrin, R. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems (2010), ACM, pp. 39–46.
 [12] Fradkin, D., and Madigan, D. Experiments with random projections for machine learning. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (2003), ACM, pp. 517–522.
 [13] Goel, S., Broder, A., Gabrilovich, E., and Pang, B. Anatomy of the long tail: ordinary people with extraordinary tastes. In Proceedings of the third ACM international conference on Web search and data mining (2010), ACM, pp. 201–210.
 [14] Golub, G. H., and Van Loan, C. F. Matrix computations, vol. 3. JHU Press, 2012.
 [15] Hariri, N., Mobasher, B., and Burke, R. Context-aware music recommendation based on latent-topic sequential patterns. In Proceedings of the sixth ACM conference on Recommender systems (2012), ACM, pp. 131–138.
 [16] Harper, F. M., and Konstan, J. A. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 4 (2016), 19.
 [17] He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (2017), International World Wide Web Conferences Steering Committee, pp. 173–182.
 [18] Horn, R. A., and Johnson, C. R. Matrix analysis. Cambridge university press, 1990.
 [19] Jawanpuria, P., and Mishra, B. A unified framework for structured low-rank matrix learning. In International Conference on Machine Learning (2018), pp. 2259–2268.
 [20] Jolliffe, I. Principal component analysis. In International encyclopedia of statistical science. Springer, 2011, pp. 1094–1096.
 [21] Kamm, J., and Nagy, J. G. Optimal kronecker product approximation of block toeplitz matrices. SIAM Journal on Matrix Analysis and Applications 22, 1 (2000), 155–172.
 [22] Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 8 (2009), 30–37.
 [23] Krichene, W., Mayoraz, N., Rendle, S., Zhang, L., Yi, X., Hong, L., Chi, E., and Anderson, J. Efficient training on very large corpora via gramian estimation. arXiv preprint arXiv:1807.07187 (2018).
 [24] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature 521, 7553 (2015), 436.
 [25] Leskovec, J., Chakrabarti, D., Kleinberg, J., and Faloutsos, C. Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. In European Conference on Principles of Data Mining and Knowledge Discovery (2005), Springer, pp. 133–145.
 [26] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11, Feb (2010), 985–1042.
 [27] Levy, M., and Bosteels, K. Music recommendation and the long tail. In 1st Workshop On Music Recommendation And Discovery (WOMRAD), ACM RecSys, 2010, Barcelona, Spain (2010), Citeseer.
 [28] Li, P., Hastie, T. J., and Church, K. W. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (2006), ACM, pp. 287–296.
 [29] Liang, D., Altosaar, J., Charlin, L., and Blei, D. M. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM conference on recommender systems (2016), ACM, pp. 59–66.
 [30] Linden, G., Smith, B., and York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing, 1 (2003), 76–80.
 [31] Lu, J., Liang, G., Sun, J., and Bi, J. A sparse interactive model for matrix completion with side information. In Advances in neural information processing systems (2016), pp. 4071–4079.
 [32] Mahdian, M., and Xu, Y. Stochastic kronecker graphs. In International Workshop on Algorithms and Models for the WebGraph (2007), Springer, pp. 179–186.
 [33] Mandelbrot, B. B. The fractal geometry of nature, vol. 1. WH freeman New York, 1982.
 [34] Narayanan, A., and Shmatikov, V. How to break anonymity of the netflix prize dataset. arXiv preprint cs/0610105 (2006).

 [35] Nimishakavi, M., Jawanpuria, P. K., and Mishra, B. A dual framework for low-rank tensor completion. In Advances in Neural Information Processing Systems (2018), pp. 5489–5500.
 [36] Oestreicher-Singer, G., and Sundararajan, A. Recommendation networks and the long tail of electronic commerce. Mis quarterly (2012), 65–83.
 [37] Park, Y.-J., and Tuzhilin, A. The long tail of recommender systems and how to leverage it. In Proceedings of the 2008 ACM conference on Recommender systems (2008), ACM, pp. 11–18.
 [38] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. Grouplens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on Computer supported cooperative work (1994), ACM, pp. 175–186.
 [39] Ricci, F., Rokach, L., and Shapira, B. Recommender systems: introduction and challenges. In Recommender systems handbook. Springer, 2015, pp. 1–34.
 [40] Rudolph, M., Ruiz, F., Mandt, S., and Blei, D. Exponential family embeddings. In Advances in Neural Information Processing Systems (2016), pp. 478–486.
 [41] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web (2001), ACM, pp. 285–295.
 [42] Schmit, S., and Riquelme, C. Human interaction with recommendation systems. arXiv preprint arXiv:1703.00535 (2017).
 [43] Shani, G., Heckerman, D., and Brafman, R. I. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
 [44] Tang, J., and Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In International Conference on Web Search and Data Mining (2018), IEEE, pp. 565–573.
 [45] Tu, K., Cui, P., Wang, X., Wang, F., and Zhu, W. Structural deep embedding for hyper-networks. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).
 [46] Van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., and Yu, T. scikit-image: image processing in python. PeerJ 2 (2014), e453.
 [47] Wang, H., Wang, N., and Yeung, D.-Y. Collaborative deep learning for recommender systems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015), ACM, pp. 1235–1244.
 [48] Yin, H., Cui, B., Li, J., Yao, J., and Chen, C. Challenging the long tail recommendation. Proceedings of the VLDB Endowment 5, 9 (2012), 896–907.
 [49] Zhang, J.-D., Chow, C.-Y., and Xu, J. Enabling kernel-based attribute-aware matrix factorization for rating prediction. IEEE Transactions on Knowledge and Data Engineering 29, 4 (2017), 798–812.
 [50] Zhao, Q., Chen, J., Chen, M., Jain, S., Beutel, A., Belletti, F., and Chi, E. H. Categorical-attributes-based item classification for recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems (2018), ACM, pp. 320–328.
 [51] Zheng, Y., Tang, B., Ding, W., and Zhou, H. A neural autoregressive approach to collaborative filtering. arXiv preprint arXiv:1605.09477 (2016).
 [52] Zhou, B., Hui, S. C., and Chang, K. An intelligent recommender system using sequential web access patterns. In IEEE conference on cybernetics and intelligent systems (2004), vol. 1, IEEE Singapore, pp. 393–398.
 [53] Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), ACM, pp. 1059–1068.
Appendix
MovieLens 655 billion
In this section, we present the properties of a larger expansion of MovieLens. The reduced rating matrix is now larger, and the synthetic data consists of 655 billion interactions between millions of users and millions of items.
Empirical properties of the reduced matrix
We assess the scalability of the approach we present to synthesize the reduced matrix. In particular, we check that the reduced matrix, with its new larger size, still shares common statistical properties with the original rating matrix. Figure 6 demonstrates that the construction method we devised for the reduced matrix still preserves key statistical properties of the original.
Numerical validation for the expanded data set
We now verify that even with different extension factors and a much larger size, the synthetic data set we generate is similar to the original MovieLens 20m. We focus on the distribution of columnwise and rowwise sums in the expanded matrix as well as on its singular value distribution. In Figure 7, we find again that the “power-law” statistical behaviors and their inflections are preserved by the expansion procedure we designed.
Limitations
The previous observations demonstrate the scalability and robustness of our expansion method, even with this larger expansion factor. However, the same limitations are present as in the smaller case, and Figure 8 shows that a similar divergence in rating scales exists between the original data set and its expanded synthetic version. As before, the synthetic ratings remain realistic in that the majority of them are near the average. The present section therefore demonstrates that our method scales up and is able to synthesize very large realistic recommendation data sets.