I Introduction
Machine Learning (ML) benchmarks compare the capabilities of models, distributed training systems and linear algebra accelerators on realistic problems at scale. For these benchmarks to be effective, results need to be reproducible by many different groups which implies that publicly shared data sets need to be available.
Unfortunately, while recommendation systems constitute a key industrial application of ML at scale, large public data sets recording user/item interactions on online platforms are not yet available. For instance, although the Netflix data set [6] and the MovieLens data set [19] are publicly available, they are orders of magnitude smaller than proprietary data [11, 3, 56].
| | MovieLens 20M | Industrial |
|---|---|---|
| #users | 138K | Hundreds of Millions |
| #items | 27K | 2M |
| #topics | 19 | 600K |
| #observations | 20M | Hundreds of Billions |
Proprietary data sets and privacy: While releasing large anonymized proprietary recommendation data sets may seem an acceptable solution from a technical standpoint, it is a nontrivial problem to preserve user privacy while still maintaining useful characteristics of the data set. For instance, [37] shows a privacy breach of the Netflix Prize data set. More importantly, publishing anonymized industrial data sets runs counter to user expectations that their data may only be used in a restricted manner to improve the quality of their experience on the platform.
Therefore, we decide not to make user data more broadly available to preserve the privacy of users. We instead choose to produce synthetic yet realistic data sets whose scale is commensurate with that of our production problems while only consuming already publicly available data.
Producing a realistic binary MovieLens 10 billion+ dataset: In this work, we focus on the MovieLens dataset which only entails movie ratings posted publicly by users of the MovieLens platform. The MovieLens data set has now become a standard benchmark for academic research in recommender systems. Many recent research articles rely on MovieLens [49, 26, 20, 32, 50, 57, 59, 44, 55, 1, 34, 7, 22, 38]. The latest version of MovieLens [19] has accrued more than citations according to Google Scholar. A binarized version of this dataset is obtained when all the ratings are substituted by (proposed in Neural Collaborative Filtering [20]). While in previous work we have considered the original MovieLens data set comprising ratings on a discrete scale [4], we now focus on its binarized version. Although the binarized version is representative of industrial collaborative filtering aiming at predicting which item a given user is most likely to view [11], the data set still only entails few observed interactions and more importantly a very small catalogue of users/items, compared to industrial proprietary recommendation data.
Industrial recommender systems typically have to nominate items from catalogues comprising several million distinct elements. The large number of observations collected by online platforms about user/item interactions also enables performance gains by increasing the dimension of the embeddings employed to represent items. In most modern ML recommendations, the model learns a vector-valued representation for each of the users/items of the catalog. Typically each user and item is represented with scalars and elements are present in user set and in the item set. Storing, accessing and training such vast embedding tables presents unique challenges as large tables will no longer easily fit in the memory of a single machine: distributed embedding tables are often necessary to store the learned embedding tables; hierarchical embedding access strategies such as hierarchical softmax [56] or differentiated softmax [17] provide better data structures and learning paradigms; an appropriate negative sampling strategy [11] or regularization [26] is needed to solve the extreme classification problem of selecting one item from the catalog constituents. By scaling up the public MovieLens data set, we want to move the problem into a regime where such issues are critical so that the corresponding benchmark is helpful for industrial applications.

In order to provide a new data set that is more aligned with the needs of production-scale recommender systems, we therefore aim at expanding publicly available data by creating a realistic surrogate. The following constraints help create a production-size synthetic recommendation problem that is similar to, and at least as hard an ML problem as, the original one for matrix factorization approaches to recommendations [25, 20]:

orders of magnitude more users and items are present in the synthetic dataset;

the synthetic dataset is realistic in that its first and second order statistics match those of the original dataset presented in Figure 1.
Key first and second order statistics of interest we aim to preserve are summarized in Figure 1 — the details of their computation are given in Section IV.
Adapting Kronecker Graph expansions to binarized user/item interactions: We employ the Kronecker Graph Theory introduced in [28] to achieve a suitable fractal expansion of recommendation data to benchmark linear and nonlinear user/item factorization approaches for recommendations [25, 20]. Consider a recommendation problem comprising $m$ users and $n$ items. Let $R \in \{0, 1\}^{m \times n}$ be the sparse matrix of binarized recorded interactions (i.e. $R_{ij} = 1$ if user $i$ has consumed item $j$ and $R_{ij} = 0$ otherwise). The key insight we develop in the present paper is that a carefully crafted fractal expansion of $R$ can preserve high level statistics of the original data set while scaling its size up by multiple orders of magnitude.
Many different transforms can be applied to the interaction matrix, which can be considered a standard sparse binary two-dimensional image. A recent approach to creating synthetic recommendation data sets consists in making parametric assumptions on user behavior by instantiating a user model interacting with an online platform [10, 46]. Unfortunately, such methods (even calibrated to reproduce empirical facts in actual data sets) do not provide strong guarantees that the resulting interaction data is similar to the original. A challenging problem in this domain is to build user models that can provide such guarantees, which can be validated using online experiments. In this work, instead of simulating recommendations in a parametric user-centric way as in [10, 46], we choose a nonparametric approach operating directly in the space of user/item affinity. In order to synthesize a large realistic dataset in a principled manner, we adapt the Kronecker expansions which have previously been employed to produce large realistic graphs in [28]. We employ a nonparametric randomized simulation of the evolution of the user/item bipartite graph to create a large synthetic data set. It is noteworthy that, as opposed to our original approach in [4] (where we put emphasis on analytic tractability), we now employ a method that loses some analytic tractability but still preserves key statistics of the data set. Furthermore, we show how randomized operations help address limitations of the previous method which yielded an interaction matrix with a discernible blockwise repetitive structure.
While Kronecker Graph Theory is developed in [28, 29] on square adjacency matrices, the Kronecker product operator is well defined on rectangular matrices and therefore we can apply a similar technique to user/item interaction data sets, which was already noted in [29] but not developed extensively. As in [29] we will use a stochastic version of the Kronecker extension for a binary original matrix. The Kronecker Graph generation paradigm nevertheless has to be changed for the present data set in other aspects. In particular, we need to decrease the expansion rate to generate data sets with the scale we desire, not orders of magnitude too large. We need to do so while maintaining key conservation properties of the original algorithm [29]. Furthermore, we introduce a new blockwise shuffling to randomize the Kronecker operator and yield a data set more helpful to train ML models for recommendations.
In order to reliably employ Kronecker-based fractal expansions on recommender system data, we make the following contributions:

we develop a new technique based on linear algebra to adapt fractal Kronecker expansions to recommendation problems;

we introduce a randomly shuffled extension of the original Kronecker product to prevent blockwise structural repetitions and take steps to prevent test data from leaking into the data set employed to train collaborative filtering models;

we also show that the resulting algorithm we develop is scalable and easily parallelizable as we employ it on the actual MovieLens 20 million dataset;

we produce a synthetic yet realistic MovieLens 1.2 billion dataset to help recommender system research scale up computational benchmarks for model training;

we demonstrate that key recommendation system specific properties of the original dataset are preserved by the deterministic version of our technique;

we make the corresponding open source code available so that other researchers may reproduce our findings and tailor the generated synthetic data to their needs.
The present article is organized as follows: First, we describe prior research on ML for recommendations and large synthetic dataset generation. Next, we develop a randomized adaptation of Kronecker Graphs to user/item interaction matrices and prove key theoretical properties. Finally, we apply the resulting algorithm experimentally to MovieLens 20m data to validate its statistical properties.
II Related work
Recommender systems constitute the workhorse of many e-commerce, social networking and entertainment platforms. In the present paper we focus on the classical setting where the key role of a recommender system is to suggest relevant items to a given user. Although other approaches such as content-based recommendations [43] or social recommendations [9] are very popular, collaborative filtering remains a prevalent approach to the recommendation problem [33, 45, 42].
Collaborative filtering: The key insight behind collaborative filtering is to learn affinities between users and items based on previously collected user/item interaction data. Collaborative filtering exists in different flavors. Neighborhood methods group users by inter-user similarity and will recommend items to a given user that have been consumed by neighbors [43]. Latent methods try to decompose user/item affinity as the result of the interaction of a few underlying representative factors characterizing the user and the item (e.g. Principal Component Analysis [23], Latent Dirichlet Allocation [8]). Matrix factorization [25] is a Latent Factor Method that relies on solving the matrix completion problem to recommend items for users.

The matrix factorization approach represents the affinity between a user $i$ and an item $j$ with an inner product $x_i^T y_j$ where $x_i$ and $y_j$ are two vectors in $\mathbb{R}^d$ representing the user and the item respectively. Given a sparse matrix of user/item interactions $R = (r_{ij})$ for $m$ users and $n$ items, user and item factors can therefore be learned by approximating $R$ with a low rank matrix $X Y^T$ where $X \in \mathbb{R}^{m \times d}$ entails the user factors and $Y \in \mathbb{R}^{n \times d}$ contains the item factors. The data set $R$ represents ratings as in the MovieLens dataset [19] or item consumption ($r_{ij} = 1$ if and only if the user $i$ has consumed item $j$ [6]), the latter being considered here. The matrix factorization approach is an example of a solution to the rating matrix completion problem which aims at predicting the rating of an item $j$ by a user $i$ which has not been observed yet and corresponds to a value of $0$ in the sparse original rating matrix. Such a factorization method learns an approximation of the data that preserves a few higher order properties of the rating matrix $R$. In particular, the low rank approximation tries to mimic the singular value spectrum of the original data set. We draw inspiration from matrix factorization to tackle synthetic data generation. The present paper will adopt a similar approach to extend collaborative filtering datasets. Besides trying to preserve the spectral properties of the original data, we operate under the constraint of conserving its first and second order statistical properties.
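To make the low rank approximation concrete, here is a minimal NumPy sketch on a hypothetical 4-user by 3-item binary interaction matrix. It uses a truncated SVD as a stand-in for the matrix completion solvers cited above (an assumption: a plain SVD treats unobserved entries as zeros), and the names `low_rank_factors`, `X` and `Y` are illustrative only.

```python
import numpy as np

def low_rank_factors(R, d):
    """Return user factors X (m x d) and item factors Y (n x d) with X @ Y.T ~ R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Split the top-d singular values evenly between the two factors.
    root = np.sqrt(s[:d])
    X = U[:, :d] * root
    Y = Vt[:d, :].T * root
    return X, Y

# Hypothetical toy interaction matrix (rows: users, columns: items).
R = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 0]], dtype=float)

X, Y = low_rank_factors(R, d=2)
approx = X @ Y.T  # affinity of user i for item j is X[i] @ Y[j]
```

Because this toy matrix has rank 2, two factors per user and item are enough to reproduce it; on real data the approximation is only as good as the retained part of the singular value spectrum.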
Deep Learning for recommender systems: Collaborative filtering has seen many recent developments which motivate our objective of expanding public data sets in a realistic manner. Deep Neural Networks (DNNs) are now becoming common in both nonlinear matrix factorization tasks [52, 20, 11] and sequential recommendations [58, 47, 18]. The mapping between user/item pairs and ratings is generally learned by training the neural model to predict user behavior on a large data set of previously observed user/item interactions.

DNNs consume large quantities of data and are computationally expensive to train; they therefore give rise to commonly shared benchmarks aimed at speeding up training. For training, a Stochastic Gradient Descent method [27] is employed, which requires forward model computation and backpropagation to be run on many minibatches of (user, item, score) examples. The matrix completion task still consists in predicting a rating for the interaction of a user and an item although that interaction has not been observed in the original dataset. The model is typically run on billions of examples as the training procedure iterates over the training data set.

Freshness in recommender systems: Model freshness is generally critical to industrial recommendations [11] which implies that only limited time is available to retrain the model on newly available data. The throughput of the trainer is therefore crucial to providing more engaging recommendation experiences and presenting more novel items. Unfortunately, public recommendation data sets are too small to provide training-time-to-accuracy benchmarks that can be realistically employed for industrial applications. Too few different examples are available in MovieLens 20m for instance and the number of different available items is orders of magnitude too small. In many industrial settings, millions of items (e.g. products, videos, songs) have to be taken into account by recommendation models. The recommendation model learns an embedding matrix of size where and are typical values. As a consequence, the memory footprint of this matrix may dominate that of the rest of the model by several orders of magnitude. During training, the latency and bandwidth of the access to such embedding matrices have a prominent influence on the final throughput in examples/second. Such computational difficulties associated with learning large embedding matrices are worthwhile solving in benchmarks. A higher throughput enables training models with more examples which enables better statistical regularization and architectural expressiveness. The multi-billion interaction size of the data set used for training is also a major factor that affects modeling choices and infrastructure development in the industry.
III Fractal expansions of user/item interaction data sets
The present section delineates the insights orienting our design decisions when expanding public recommendation data sets.
III-1 Self-similarity in user/item interactions
Interactions between users and items follow a natural hierarchy in data sets where items can be organized in topics, genres, and categories [56]. There is for instance an item-level fractal structure in MovieLens 20m with a tree-like structure of genres, subgenres, and directors. If users were clustered according to their demographics and tastes, another hierarchy would be formed [43]. The corresponding structured user/item interaction matrix is illustrated in Figure 2. The hierarchical nature of user/item interactions (topical and demographic) makes the recommendation data set structurally self-similar (i.e. patterns that occur at more granular scales resemble those affecting coarser scales [36]).
One can therefore build a user-group/item-category incidence matrix with user groups as rows and item categories as columns, i.e. a coarse interaction matrix. As each user group consists of many individuals and each item category comprises multiple movies, the original individual level user/item interaction matrix may be considered as an expanded version of the coarse interaction matrix. We choose to expand the user/item interaction matrix by extrapolating this self-similar structure and simulating its growth to yet another level of granularity: original items and users are considered fictional topics and user groups in the expanded data set.
A key advantage of this fractal procedure is that it may be entirely nonparametric and designed to preserve high level properties of the original dataset. In particular, a fractal expansion reintroduces the patterns originally observed in the entire real dataset within each block of local interactions of the synthetic user/item matrix. By carefully designing the way such blocks are produced and laid out, we can therefore hope to produce a realistic yet much larger rating matrix. In the following, we show how the Kronecker operator enables such a construction.
III-2 Fractal expansion through Kronecker products
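The Kronecker product, denoted $\otimes$, is a nonstandard matrix operator with an intrinsic self-similar structure:

$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix} \tag{1}$$

where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{p \times q}$ and $A \otimes B \in \mathbb{R}^{mp \times nq}$.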
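The block structure of the Kronecker product in Eq (1) can be illustrated with NumPy's `np.kron`. The two small matrices below are arbitrary examples chosen for readability, not data from the paper.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1, 1],
              [1, 0, 0]])

# Kronecker product: a 2x2 grid of blocks, each a scaled copy of B.
K = np.kron(A, B)

# Block (i, j) of K equals A[i, j] * B; here we extract block (1, 0).
p, q = B.shape
block_10 = K[1 * p:2 * p, 0 * q:1 * q]
```

Each block repeats the pattern of `B` at a different intensity, which is the self-similar structure the expansion relies on.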
In the original presentation of Kronecker Graph Theory [28] as well as the stochastic extension [35] and the extended theory [29], the Kronecker product is the core operator enabling the synthesis of graphs with exponentially growing adjacency matrices. As in the present work, the insight underlying the use of Kronecker Graph Theory in [29] is to produce large synthetic yet realistic graphs. The fractal nature of the Kronecker operator as it is applied multiple times (see Figure 2 in [29] for an illustration) fits the selfsimilar statistical properties of real world graphs such as the internet, the web or social networks [28].
If $A$ is the adjacency matrix of the original graph, fractal expansions are created in [28] by chaining Kronecker products as follows: $A \otimes A \otimes \cdots \otimes A$.
As adjacency matrices are square, Kronecker Graphs are not employed on rectangular matrices in preexisting work although the operation is well defined. We show that these differences do not prevent Kronecker products from preserving core properties of binarized rating matrices. A more important challenge is the size of the original matrix we deal with: . A naive Kronecker expansion would therefore synthesize a rating matrix with billion users which is too large.
Thus, although Kronecker products seem like an ideal candidate for the mechanism at the core of the selfsimilar synthesis of a larger recommendation dataset, some modifications are needed to the algorithms developed in [29].
III-3 Reduced Kronecker expansions
We choose to synthesize a user/item rating matrix

$$\widetilde{R} = \widehat{R} \otimes R$$

where $R$ is the original rating matrix and $\widehat{R}$ is a matrix derived from $R$ but much smaller (for instance ). For reasons that will become apparent as we explore some theoretical properties of Kronecker fractal expansions, we want to construct a smaller derived matrix $\widehat{R}$ that shares similarities with $R$. In particular, we seek $\widehat{R}$ with a similar rowwise sum distribution (user engagement distribution), columnwise distribution (item engagement distribution) and singular value spectrum (signal to noise ratio distribution in the matrix factorization).
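To make the size argument concrete, the following sketch compares the shape of a naive expansion (the interaction matrix multiplied by itself) with that of a reduced expansion. The MovieLens 20M dimensions are the rounded figures from the comparison table in the introduction; the reduced shape `(16, 32)` and the helper `kron_shape` are hypothetical and only serve the illustration.

```python
def kron_shape(shape_a, shape_b):
    """Shape of the Kronecker product of two matrices with the given shapes."""
    return (shape_a[0] * shape_b[0], shape_a[1] * shape_b[1])

# Rounded MovieLens 20M dimensions (users, items) from the comparison table.
movielens_shape = (138_000, 27_000)

# Naive expansion: Kronecker product of the interaction matrix with itself.
naive_shape = kron_shape(movielens_shape, movielens_shape)

# Reduced expansion with a hypothetical small matrix of shape (16, 32).
reduced_shape = (16, 32)
reduced_expansion_shape = kron_shape(reduced_shape, movielens_shape)
```

With these rounded inputs the naive expansion has about 19 billion rows (users), whereas the hypothetical reduced expansion stays in the range of millions of users and hundreds of thousands of items.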
III-4 Implementation at scale and algorithmic extensions
Computing a Kronecker product between two matrices $A$ and $B$ is an inherently parallel operation. It is sufficient to broadcast $B$ to each element $(i, j)$ of $A$ and then multiply $B$ by $a_{ij}$. Such a property implies that scaling can be achieved. Another advantage of the operator is that even a single machine can produce a large output dataset by sequentially iterating on the values of $A$. Only storage space commensurate with the size of the original matrix is needed to compute each block of the Kronecker product. It is noteworthy that generalized fractal expansions can be defined by altering the standard Kronecker product. We consider such extensions here as candidates to engineer more challenging synthetic data sets. One drawback though is that these extensions may not preserve analytic tractability.
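The blockwise computation described above can be sketched as a generator that yields one block at a time, so that only one block needs to be held in memory. This is an illustrative sketch, not the paper's Algorithm 1; `kron_blocks` and `assemble` are hypothetical helper names, and `assemble` exists only to check the result on a toy input (a real pipeline would write each block to storage instead).

```python
import numpy as np

def kron_blocks(A, B):
    """Yield (i, j, block) where block = A[i, j] * B, one block at a time."""
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            yield i, j, A[i, j] * B

def assemble(A, B):
    """Place the streamed blocks into a dense matrix (toy sizes only)."""
    p, q = B.shape
    out = np.zeros((A.shape[0] * p, A.shape[1] * q), dtype=np.result_type(A, B))
    for i, j, block in kron_blocks(A, B):
        out[i * p:(i + 1) * p, j * q:(j + 1) * q] = block
    return out

A = np.array([[1, 0], [2, 3]])
B = np.array([[1, 1, 0], [0, 1, 1]])
K = assemble(A, B)
```

Since each block depends only on one element of the first matrix and on the second matrix, the loop body can be distributed across independent workers.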
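A first generalization defines a binary operator $\otimes_F$, with $F$ a function that produces a block from a scalar $a_{ij}$, the matrix $B$ and a pseudorandom number $\omega_{ij}$, as follows:

$$A \otimes_F B = \begin{bmatrix} F(a_{11}, B, \omega_{11}) & \cdots & F(a_{1n}, B, \omega_{1n}) \\ \vdots & \ddots & \vdots \\ F(a_{m1}, B, \omega_{m1}) & \cdots & F(a_{mn}, B, \omega_{mn}) \end{bmatrix} \tag{2}$$

where $\omega_{11}, \dots, \omega_{mn}$ is a sequence of pseudorandom numbers. Including randomization and nonlinearity in $F$ appears as a simple way to synthesize data sets entailing more varied patterns. The algorithm we employ to compute Kronecker products is presented in Algorithm 1. The implementation we employ is trivially parallelizable. We only create a list of Kronecker blocks to dump entire rows (users) of the output matrix to file. This is not necessary and can be removed to enable as many processes to run simultaneously and independently as there are elements in $A$ (provided pseudorandom numbers are generated in parallel in an appropriate manner).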
The only reason why we need a reduced version of is to control the size of the expansion. Also, and are equal after a rowwise and a columnwise permutation. Therefore, another family of appropriate extensions may be obtained by considering
(3) 
where is a randomized sketching operation on the matrix which reduces its size by several orders of magnitude. A trivial scheme consists in sampling a small number of rows and columns from at random. Other random projections [2, 31, 14] may of course be used. The randomized procedures above produce a user/item interaction matrix where there is no longer a blockwise repetitive structure. Less obvious statistical patterns can give rise to more challenging synthetic large-scale collaborative filtering problems.
III-5 Stochastic Kronecker product and dropout for binary rating matrices
As opposed to our original approach [4] which focused on item ratings, we now consider an original binary rating matrix. In such a setting, a standard Kronecker product is not suitable as the ratings all take the same value of and therefore multiplications by elements of the reduced matrix do not produce ratings that are all still binary. We instead use the stochastic Kronecker graph approach from [29] and employ the reduced matrix’s elements as dropout rates over the original matrix. When computing the block of the expanded rating matrix, rather than using , we consider after having rescaled the reduced matrix so that all its elements are in . For each element of the original matrix, the dropout function for a rate samples independently from a Bernoulli distribution with parameter . If the sampled number is , is kept unchanged, otherwise it is dropped and set to . Such a dropout operator enjoys statistical properties that are similar to the Kronecker product [29] while being readily employable on binary datasets. The stochastic Kronecker product we devise can therefore be written as follows in matrix notation:

where “Sh” denotes the random rowwise and columnwise shuffling operator and “drop” denotes the dropout operator whose first argument is the dropout rate and whose second argument is the matrix from which to zero out elements at random. Algorithm 2 exposes the implementation of the randomized Kronecker product .
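The dropout-based expansion can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 2: it assumes that the rescaled element of the reduced matrix is used as the probability of keeping an interaction (the exact Bernoulli parameter is not legible in the text above), it rescales by the maximum entry, and it omits the shuffling step. The names `drop`, `stochastic_kron`, `R_hat` and the toy matrices are hypothetical.

```python
import numpy as np

def drop(keep_prob, R, rng):
    """Keep each entry of R independently with probability keep_prob (assumed convention)."""
    mask = rng.random(R.shape) < keep_prob
    return R * mask

def stochastic_kron(R_hat, R, rng):
    """Blockwise expansion where block (i, j) is a randomly thinned copy of R."""
    # Assumption: R_hat is nonnegative and not identically zero.
    scaled = R_hat / R_hat.max()
    p, q = R.shape
    out = np.zeros((R_hat.shape[0] * p, R_hat.shape[1] * q), dtype=R.dtype)
    for i in range(R_hat.shape[0]):
        for j in range(R_hat.shape[1]):
            out[i * p:(i + 1) * p, j * q:(j + 1) * q] = drop(scaled[i, j], R, rng)
    return out

R = np.array([[1, 0, 1],
              [0, 1, 1]])
R_hat = np.array([[2.0, 0.0],
                  [1.0, 2.0]])
expanded = stochastic_kron(R_hat, R, np.random.default_rng(0))
```

Under this convention the output stays binary: a block whose rescaled weight is 1 is an exact copy of the original matrix, a block with weight 0 is empty, and intermediate weights yield randomly thinned copies.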
III-6 Randomized shuffling and Kronecker SVD
Another limitation of the synthetic data set initially presented in [4] is the blockwise repetitive structure of Kronecker products. Although the synthetic data set is still hard to factorize as the product of two low rank matrices because its singular values are still distributed similarly to the original data set, it is easy to factorize with a Kronecker SVD [24] which takes advantage of the blockwise repetitions in Eq (1). The randomized fractal expansions presented in Eq (2) and Eq (3) address this issue. The approach we adopt consists in shuffling rows and columns of each block in Eq (1) independently at random. The shuffles will break the blockwise repetitive structure and prevent Kronecker SVD from producing a trivial solution to the factorization problem.
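A minimal sketch of the blockwise shuffle follows, assuming each block of the Kronecker product gets its own independent row and column permutation. The helper names `shuffle_block` and `shuffled_kron` and the toy inputs are hypothetical.

```python
import numpy as np

def shuffle_block(block, rng):
    """Permute the rows and the columns of one block independently at random."""
    rows = rng.permutation(block.shape[0])
    cols = rng.permutation(block.shape[1])
    return block[rows][:, cols]

def shuffled_kron(A, B, rng):
    """Kronecker product in which every block is shuffled independently."""
    p, q = B.shape
    out = np.zeros((A.shape[0] * p, A.shape[1] * q), dtype=np.result_type(A, B))
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            block = shuffle_block(A[i, j] * B, rng)
            out[i * p:(i + 1) * p, j * q:(j + 1) * q] = block
    return out

A = np.array([[1, 2], [3, 4]])
B = np.array([[1, 0, 1], [0, 1, 1]])
K_shuffled = shuffled_kron(A, B, np.random.default_rng(0))
```

Each block keeps the same entries as the corresponding block of the plain Kronecker product, so the total mass is unchanged, but the blocks are no longer aligned copies of one another.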
III-7 Preventing leaks from the test set into the training set
For matrix factorization tasks, the usual procedure to build disjoint training and test data sets for MovieLens consists in selecting some ratings and removing them from the training set while adding them to the test set. A naive adaptation of the test data generation procedure to our extended data set would select test items directly on the larger matrix . Unfortunately, as , such a procedure would implicitly share data between interactions of the training and test sets through which incorporates information from the entire original data set . In order to generate training and test data without leaking test data into the training set, we proceed as follows. We consider two separate training and test sets selected from the original data set : and . With , we have and . For the MovieLens data set, where each rating of a given item by a specific user is timestamped, a typical approach to defining training and testing sets removes the last rating of each user from the train set and adds it to the test set. Such a procedure outputs a matrix with far fewer nonzero elements than .
The smaller matrix is now derived from exclusively, without incorporating any data from . We create the extended versions of the train and test data sets separately as follows:
By construction, the procedure prevents test data from leaking into the train data and implicitly informing the model of patterns that will be present in the test set during training.
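The leak-free procedure can be sketched on toy data: split the original interactions first (last interaction per user goes to the test set), derive the reduced matrix from the training part only, then expand both parts with the same reduced matrix. The deterministic `np.kron` is used here for simplicity, and taking the top-left corner of the training matrix as the reduced matrix is a purely hypothetical stand-in for the reduction technique of the paper.

```python
import numpy as np

def leave_last_out(interactions, n_users, n_items):
    """Move each user's most recent interaction to the test set."""
    last = {}
    for user, item, ts in interactions:
        if user not in last or ts > last[user][1]:
            last[user] = (item, ts)
    train = np.zeros((n_users, n_items), dtype=int)
    test = np.zeros((n_users, n_items), dtype=int)
    for user, item, ts in interactions:
        target = test if last[user] == (item, ts) else train
        target[user, item] = 1
    return train, test

# Hypothetical (user, item, timestamp) triples.
interactions = [(0, 0, 1), (0, 1, 2), (0, 2, 3),
                (1, 1, 1), (1, 2, 5),
                (2, 0, 4), (2, 2, 2)]
train, test = leave_last_out(interactions, n_users=3, n_items=3)

# Stand-in reduced matrix built from the training data only.
reduced = train[:2, :2]
big_train = np.kron(reduced, train)
big_test = np.kron(reduced, test)
```

Because the reduced matrix never sees test interactions and the two expansions share the same block layout, no expanded test entry coincides with an expanded training entry.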
III-8 Consistent randomized operations across training and testing sets
With a stochastic Kronecker featuring dropout and blockwise shuffling, additional precautions need to be taken. In order to guarantee that the randomized shuffles of rows and columns are consistent between the training and testing data, we flip the sign of the test elements in the rating matrix to keep track of their belonging to the test set. We apply all randomized operations to the resulting matrix comprising elements in :
where and . The positive elements of are attributed to and the negative elements are attributed to after having flipped their sign:
Such an operation is simple and guarantees the consistency of randomized shuffles of the subblocks in the extended matrices and .
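The sign trick can be illustrated on a single block: training entries are encoded as +1, test entries as -1, one random shuffle is applied to the signed matrix, and the two sets are recovered from the signs. The toy matrices are hypothetical, and only the shuffle (not the dropout) is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
train = np.array([[1, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
test = np.array([[0, 0, 1],
                 [0, 0, 1],
                 [1, 0, 0]])

# Signed matrix with entries in {-1, 0, 1}: test entries carry a flipped sign.
signed = train - test

# A single random shuffle is applied to the combined matrix.
rows = rng.permutation(signed.shape[0])
cols = rng.permutation(signed.shape[1])
shuffled = signed[rows][:, cols]

# Split by sign, flipping the test entries back to +1.
shuffled_train = (shuffled > 0).astype(int)
shuffled_test = (shuffled < 0).astype(int)
```

Both outputs have undergone exactly the same permutation, which is what keeps the expanded training and test matrices aligned.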
IV Statistical properties of Kronecker fractal expansions
After having introduced Kronecker products to self-similarly expand a recommendation dataset into a much larger one, we now develop theoretical insights about how the transform preserves crucial common properties with the original.
IV-1 Salient empirical facts in MovieLens data
First, we introduce the critical properties we want to preserve. As a user/item interaction dataset on an online platform, one expects MovieLens to feature common properties of recommendation data sets such as power-law or fat-tailed distributions [56] (a power-law or fat-tailed distribution over positive values behaves like for large enough values of with and ).
The first important statistical properties for recommendations concern the distribution of interactions across users and across items. It is generally observed that such distributions exhibit a power-law behavior [56, 1, 39, 13, 40, 53, 30, 15]. To characterize such a behavior in the MovieLens data set, we look at the distribution of the total ratings along the item axis and the user axis. In other words, we compute rowwise and columnwise sums for the rating matrix and observe their distributions. The corresponding ranked distributions are shown in Figure 1 and do exhibit a clear power-law behavior for rather popular items. However, we observe that tail items have a higher popularity decay rate. Similarly, the engagement decay rate increases for the group of less engaged users.
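The ranked rowwise and columnwise sums discussed above can be computed as in the following sketch. The toy matrix and the helper `ranked_sums` are hypothetical; on real data one would plot these ranked values on log-log axes to inspect the power-law behavior.

```python
import numpy as np

def ranked_sums(R, axis):
    """Sums along the given axis, sorted in decreasing order."""
    sums = np.asarray(R.sum(axis=axis)).ravel()
    return np.sort(sums)[::-1]

# Toy binary interaction matrix (rows: users, columns: items).
R = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])

user_engagement = ranked_sums(R, axis=1)  # interactions per user
item_popularity = ranked_sums(R, axis=0)  # interactions per item
```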
The other approximate power-law we find in Figure 1 lies in the singular value spectrum of the MovieLens dataset. We compute the top $k$ singular values [21] of the MovieLens rating matrix $R$ (of dimension $m \times n$) by power iteration, which can scale to its large dimension. The method yields the $k$ dominant singular values of $R$ and the corresponding singular vectors so that one can classically approximate $R$ by $U \Sigma V$ where $\Sigma$ is diagonal of dimension $k \times k$, $U$ is column-orthogonal of dimension $m \times k$ and $V$ is row-orthogonal of dimension $k \times n$, which yields the rank $k$ matrix closest to $R$ in Frobenius norm.
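As a minimal sketch of the power iteration idea, the function below estimates only the leading singular triplet of a dense matrix; obtaining several singular values would require deflation or a block variant, which is not shown. The function name, the iteration count and the random test matrix are assumptions made for the illustration.

```python
import numpy as np

def top_singular_triplet(R, n_iter=500, seed=0):
    """Estimate the largest singular value of R and its singular vectors."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(R.shape[1])
    v /= np.linalg.norm(v)
    sigma = 0.0
    for _ in range(n_iter):
        # Alternate multiplications by R and R^T, normalizing each time.
        u = R @ v
        u /= np.linalg.norm(u)
        v = R.T @ u
        sigma = np.linalg.norm(v)
        v /= sigma
    return sigma, u, v

R = np.random.default_rng(1).random((20, 10))
sigma, u, v = top_singular_triplet(R)
```

Each iteration only needs matrix-vector products, which is why this family of methods scales to large sparse rating matrices.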
Examining the distribution of the top singular values of the rating matrix in the MovieLens dataset (which has at most nonzero singular values) in Figure 1 highlights a clear power-law behavior in the highest magnitude part of the spectrum of the rating matrix. We observe in the spectral distribution an inflection for smaller singular values whose magnitude decays at a higher rate than larger singular values. Such a spectral distribution is a key feature of the original dataset. This property is particularly important for low-rank approximation approaches to the matrix completion problem, which have to choose a sufficiently large rank for approximating the observations. Therefore, we also want the expanded dataset to exhibit a similar behavior in terms of spectral properties.
In all the high level statistics we present, we want to preserve the approximate power-law decay as well as its inflection for smaller values. Our requirements for the expanding transform which we apply to the rating matrix are therefore threefold: we want to preserve the distributions of its rowwise sums, of its columnwise sums and of its singular values. Additional requirements, beyond first and second order high level statistics, will further increase the confidence in the realism of the expanded synthetic dataset.
IV-2 Analytic tractability through standard Kronecker products
Although we use a randomized version of the Kronecker product which does not offer the same level of analytic tractability, the choice of such a transform is deeply anchored in some of the theoretical properties of the standard Kronecker product. We now expose how — in its standard deterministic version — the fractal transform design we rely on preserves the key statistical properties of the previous section.
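Definition 1
Consider $A \in \mathbb{R}^{m \times n}$, we denote the set of rowwise sums of $A$ by $\mathcal{R}(A)$, the set of columnwise sums of $A$ by $\mathcal{C}(A)$, and the set of nonzero singular values of $A$ by $\mathcal{S}(A)$.
Definition 2
Consider an integer $i \geq 1$ and a nonzero positive integer $p$, we denote $\lfloor i \rfloor_p = \lfloor (i - 1) / p \rfloor + 1$ the (1-indexed) integer part of $i$ in base $p$ and $\{i\}_p = i - p \, (\lfloor i \rfloor_p - 1)$ the fractional part.
First we focus on conservation properties in terms of rowwise and columnwise sums which correspond respectively to marginalized user engagement and item popularity distributions. In the following, $\times$ denotes the Minkowski product of two sets, i.e. $S \times T = \{ s \, t \mid s \in S, \, t \in T \}$.
Proposition 1
Consider $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$ and their Kronecker product $K = A \otimes B$. Then
$$\mathcal{R}(K) = \mathcal{R}(A) \times \mathcal{R}(B) \quad \text{and} \quad \mathcal{C}(K) = \mathcal{C}(A) \times \mathcal{C}(B).$$
Proof 1
Consider the $i$-th row of $K$, by definition of $K$ the corresponding sum can be rewritten as follows: $\sum_{j=1}^{nq} K_{ij} = \sum_{j=1}^{nq} a_{\lfloor i \rfloor_p \lfloor j \rfloor_q} \, b_{\{i\}_p \{j\}_q}$ which in turn equals
$$\sum_{j'=1}^{n} \sum_{j''=1}^{q} a_{\lfloor i \rfloor_p j'} \, b_{\{i\}_p j''}.$$
Refactoring the two sums concludes the proof for the rowwise sum properties. The proof for columnwise properties is identical.
Theorem 1
Consider $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$ and their Kronecker product $K = A \otimes B$. Then
$$\mathcal{S}(K) = \mathcal{S}(A) \times \mathcal{S}(B).$$
Proof 2
One can easily check that $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ for any quadruple of matrices $A, B, C, D$ for which the notation makes sense and that $(A \otimes B)^T = A^T \otimes B^T$. Let $A = U_A \Sigma_A V_A$ be the SVD of $A$ and $B = U_B \Sigma_B V_B$ the SVD of $B$. Then $A \otimes B = (U_A \otimes U_B)(\Sigma_A \otimes \Sigma_B)(V_A \otimes V_B)$. Now, $(U_A \otimes U_B)^T (U_A \otimes U_B) = (U_A^T U_A) \otimes (U_B^T U_B) = I$. Writing the same decomposition for $(V_A \otimes V_B)(V_A \otimes V_B)^T$ and considering that $U_A$, $U_B$ are column-orthogonal while $V_A$, $V_B$ are row-orthogonal concludes the proof.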
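The conservation properties stated in Proposition 1 and Theorem 1 (the rowwise sums, columnwise sums and singular values of a Kronecker product are the pairwise products of those of its factors) can be checked numerically. The random matrices below are arbitrary test inputs, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4))
B = rng.random((2, 5))
K = np.kron(A, B)

# Rowwise and columnwise sums of K versus pairwise products of the factors' sums.
row_sums = np.sort(K.sum(axis=1))
row_products = np.sort(np.outer(A.sum(axis=1), B.sum(axis=1)).ravel())
col_sums = np.sort(K.sum(axis=0))
col_products = np.sort(np.outer(A.sum(axis=0), B.sum(axis=0)).ravel())

# Singular values of K versus pairwise products of the factors' singular values.
sv_K = np.linalg.svd(K, compute_uv=False)
sv_A = np.linalg.svd(A, compute_uv=False)
sv_B = np.linalg.svd(B, compute_uv=False)
sv_products = np.sort(np.outer(sv_A, sv_B).ravel())[::-1]
```

Sorting is only used to compare the values as sets; the comparison holds up to floating point error.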
The properties above imply that knowing the rowwise sums, columnwise sums and singular value spectrum of the reduced rating matrix and the original rating matrix is enough to deduce the corresponding properties for the expanded rating matrix — analytically. As in [29], the Kronecker product enables analytic tractability while expanding data sets in a fractal manner to orders of magnitude more data.
In practice, we use a randomized version of the Kronecker product whose blockwise shuffles do not have an analytically tractable effect on the high order statistics of the rating matrix. Therefore, we rely in Section V-4 on a statistical examination of the properties of the extended synthetic data set we produce with our randomized fractal operator to verify that our original theoretical insights from the deterministic case are still valid. In particular, we demonstrate that the high order statistics of the new data set we produce preserve, as in our first deterministic approach [4], the original properties of the binary MovieLens 20m data set.
IV3 Constructing a reduced matrix with a similar spectrum
Considering that the quasi power-law properties of the original rating matrix $R$ imply, as in [29], that $R \otimes R$ has a similar distribution to $R$, we seek a small matrix $\widehat{R}$ whose high order statistical properties are similar to those of $R$. As we want to generate a dataset with several billion user/item interactions, millions of distinct users and millions of distinct items, we are looking for a matrix $\widehat{R}$ with a few hundred or thousand rows and columns. The reduced matrix we seek is therefore orders of magnitude smaller than $R$. In order to produce a reduced matrix of such dimensions, one could use the older, smaller MovieLens 100K dataset [19]. Such a dataset can be interpreted as a subsampled, reduced version of MovieLens 20m with similar properties. However, the two data sets were collected seven years apart, so their characteristics are not directly comparable. Also, we aim to produce an expansion method where the expansion multipliers can be chosen flexibly by practitioners. It is noteworthy that naive uniform user and item sampling strategies have not yielded smaller matrices with properties similar to those of $R$ in our experiments. Different random projections [2, 31, 14] could more generally be employed. However, we rely on a procedure better tailored to our specific statistical requirements.
We now describe the technique we employed to produce a reduced size matrix $\widehat{R}$ with first and second order properties close to those of the original rating matrix $R \in \mathbb{R}^{m \times n}$, which in turn led to constructing an expansion matrix $\widehat{R} \otimes R$ similar to $R \otimes R$. We want the dimensions of $\widehat{R}$ to be $(m', n')$ with $m' \ll m$ and $n' \ll n$. Consider again the approximate Singular Value Decomposition (SVD) [21] of $R$ with the $k$ principal singular values of $R$:
$$R \approx U \Sigma V \tag{4}$$
where $U \in \mathbb{R}^{m \times k}$ has orthogonal columns, $V \in \mathbb{R}^{k \times n}$ has orthogonal rows, and $\Sigma \in \mathbb{R}^{k \times k}$ is diagonal with nonnegative terms.
To reduce the number of rows and columns of $R$ while preserving its top $k$ singular values, a trivial solution would consist in replacing $U$ and $V$ by small random orthogonal matrices with few rows and few columns respectively. Unfortunately, such a method would only seemingly preserve the spectral properties of $R$, as the principal singular vectors would be widely changed. Such properties are important: one of the key advantages of employing Kronecker products in [29] is the preservation of the network values, i.e. the distributions of singular vector components of a graph's adjacency matrix.
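To obtain a matrix $\widehat{U}$ with fewer rows than $U$ but column-orthogonal and similar to $U$ in the distribution of its values, we use the following procedure. We resize $U$ down to $m'$ rows with $m' < m$ by downscaling through local averaging (using skimage.transform.resize in the scikit-image library [51]). Let $\widetilde{U} \in \mathbb{R}^{m' \times k}$ be the corresponding resized version of $U$. We then construct $\widehat{U} \in \mathbb{R}^{m' \times k}$ as the column-orthogonal matrix in $\mathbb{R}^{m' \times k}$ closest in Frobenius norm to $\widetilde{U}$. Therefore, as in [16], we compute
$$\widehat{U} = \widetilde{U} \big( \widetilde{U}^T \widetilde{U} \big)^{-1/2}. \tag{5}$$
We apply a similar procedure to $V$ to reduce its number of columns, which yields a row-orthogonal matrix $\widehat{V} \in \mathbb{R}^{k \times n'}$ with $n' < n$. The orthogonality of $\widehat{U}$ (column-wise) and $\widehat{V}$ (row-wise) guarantees that the singular value spectrum of
$$\widehat{R} = \widehat{U} \Sigma \widehat{V} \tag{6}$$
consists exactly of the $k$ leading components of the singular value spectrum of $R$. Like $R$, $\widehat{R}$ is then rescaled to the same range of values as $R$. The whole procedure to reduce $R$ down to $\widehat{R}$ is summarized in Algorithm 3.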
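The reduction steps above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the procedure of Algorithm 3 itself: it assumes a dense toy matrix, replaces skimage.transform.resize by a plain average over contiguous blocks of rows, and omits the final rescaling. The function names (`shrink_rows`, `nearest_column_orthogonal`, `reduce_matrix`) are hypothetical.

```python
import numpy as np


def shrink_rows(u, new_rows):
    """Downscale `u` to `new_rows` rows by averaging contiguous row blocks.

    Assumption: a simple stand-in for resizing through local averaging.
    """
    blocks = np.array_split(np.arange(u.shape[0]), new_rows)
    return np.stack([u[idx].mean(axis=0) for idx in blocks])


def nearest_column_orthogonal(x):
    """Column-orthogonal matrix closest to `x` in Frobenius norm.

    For a full-column-rank x this equals x (x^T x)^{-1/2}; it is computed
    here from the thin SVD x = P diag(s) Q^T as the product P Q^T.
    """
    p, _, qt = np.linalg.svd(x, full_matrices=False)
    return p @ qt


def reduce_matrix(r, new_rows, new_cols, k):
    """Keep the top-k spectrum of `r` while shrinking its singular vectors."""
    u, s, vt = np.linalg.svd(r, full_matrices=False)
    u, s, vt = u[:, :k], s[:k], vt[:k, :]
    u_hat = nearest_column_orthogonal(shrink_rows(u, new_rows))
    # The right factor is handled by applying the same steps to its transpose.
    v_hat = nearest_column_orthogonal(shrink_rows(vt.T, new_cols)).T
    return u_hat @ np.diag(s) @ v_hat


rng = np.random.default_rng(0)
R = (rng.random((60, 40)) < 0.2).astype(float)  # toy binary rating matrix
R_hat = reduce_matrix(R, new_rows=12, new_cols=8, k=5)

# The reduced matrix carries exactly the k leading singular values of R.
top_k = np.linalg.svd(R, compute_uv=False)[:5]
reduced = np.linalg.svd(R_hat, compute_uv=False)[:5]
assert np.allclose(top_k, reduced)
```

Computing the polar factor through an SVD avoids forming an explicit inverse matrix square root, which can be ill-conditioned when the resized factor is close to rank deficient.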
We verify empirically that the distributions of values of the reduced singular vectors in $\widehat{U}$ and $\widehat{V}$ are similar to those of $U$ and $V$ respectively, so as to preserve the first order properties of $R$ and the value distributions of its singular vectors. Such properties are demonstrated through numerical experiments in the next section.
V Experimentation on MovieLens 20 million data
The MovieLens 20m data comprises 20M ratings given by 138K users to 27K items. In the present section, we demonstrate how the fractal Kronecker expansion technique we devised and presented helps scale up this dataset to orders of magnitude more users, items and interactions, all in a parallelizable manner.
V1 Preprocessing of MovieLens 20m
The first preprocessing step we apply to MovieLens 20m is binarizing the ratings: all the rating values are set to 1. Such a step is standard for tasks such as Neural Collaborative Filtering [20]. The second preprocessing step filters out users who have too few ratings with distinct timestamps. This filter enables the splitting of MovieLens 20m into a train set, consisting of all the ratings of each user except the last one in chronological order, and its complement: for each user, the rating with the latest timestamp is put in the test set. After removal of users with too few ratings and splitting into training and test sets, we expand MovieLens 20m with the randomized Kronecker product presented in Algorithm 2.
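The two preprocessing steps can be illustrated with a short, self-contained sketch. The function name `binarize_and_split`, the tuple layout of the input and the `min_ratings` parameter are assumptions made for this example; the default of 2 is simply the smallest threshold that leaves at least one training rating per user.

```python
from collections import defaultdict


def binarize_and_split(ratings, min_ratings=2):
    """Binarize ratings and hold out each user's latest interaction.

    `ratings` is an iterable of (user, item, rating, timestamp) tuples.
    `min_ratings` is a hypothetical threshold on the number of distinct
    timestamps per user; 2 is the smallest value that allows a split.
    """
    by_user = defaultdict(list)
    for user, item, _rating, timestamp in ratings:
        by_user[user].append((timestamp, item))

    train, test = [], []
    for user, events in by_user.items():
        if len({t for t, _ in events}) < min_ratings:
            continue  # too few ratings with distinct timestamps
        events.sort()  # chronological order
        *history, (_, last_item) = events
        train.extend((user, item, 1) for _, item in history)
        test.append((user, last_item, 1))
    return train, test


ratings = [
    ("u1", "i1", 4.0, 10), ("u1", "i2", 2.5, 20), ("u1", "i3", 5.0, 30),
    ("u2", "i1", 3.0, 15),  # filtered out: a single rating
    ("u3", "i2", 1.0, 5), ("u3", "i3", 4.5, 7),
]
train, test = binarize_and_split(ratings)
```

On this toy input, user `u2` is removed, each remaining user contributes exactly one interaction to the test set, and every kept rating value becomes 1.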
V2 Size of expanded data set
In the present experiments we construct a reduced rating matrix of size . The dropout based method in Algorithm 2 yields a new data set whose size is detailed in Table II.
              MovieLens 20m  Synthetic train set  Synthetic test set
Interactions  M              B                    M
Users         K              M                    M
Items         K              K                    K
Such a high number of interactions and items enables the training of deep neural collaborative models such as the Neural Collaborative Filtering model [20] at a scale which is now more representative of industrial settings. Moreover, the increased data set size helps construct benchmarks for deep learning software packages and ML accelerators that employ the same orders of magnitude as production settings in terms of user base size, item vocabulary size and number of observations.
V3 Empirical properties of reduced matrix
The objective of the construction technique for the reduced matrix $\widehat{R}$ was to produce a matrix sharing the properties of the original matrix $R$ though smaller in size [29]. To that end, we aimed at constructing a smaller matrix $\widehat{R}$ with properties close to those of $R$ in terms of column-wise sum, row-wise sum and singular value spectrum distributions.
We now check that the construction procedure we devised does produce a matrix $\widehat{R}$ with the properties we expected. As the impact of the resizing step is unclear from an analytic standpoint, we had to resort to numerical experiments to validate our method.
In Figure 3, one can assess that the first and second order properties of $\widehat{R}$ and $R$ match with high enough fidelity. In particular, the higher magnitude column-wise and row-wise sum distributions follow a "power-law" behavior similar to that of the original matrix. Similar observations can be made about the singular value spectra of $\widehat{R}$ and $R$.
There is therefore now a reasonable likelihood that our adapted Kronecker expansion — although somewhat differing from the method originally presented in [29] — will enjoy the same benefits in terms of enabling data set expansion while preserving high order statistical properties.
V4 Empirical properties of the expanded data set
We now verify empirically that the expanded rating matrix does share common first and second order properties with the original rating matrix . The new data size is orders of magnitude larger in terms of number of rows and columns and orders of magnitude larger in terms of number of nonzero terms. Notice here that because of the dropout, the density of the resulting data set is about that of the original data set.
In Figure 4, one can confirm that the spectral properties of the expanded data set as well as the user engagement (row-wise sums) and item popularity (column-wise sums) are similar to those of the original data set. Such observations demonstrate that the theoretical insights from Proposition 1 and Theorem 1 are indeed informative of the high order statistics of the synthetic data set we generate. Our ex-post empirical study indicates that the resulting data set is representative, in its fat-tailed data distribution and quasi "power-law" singular value spectrum, of problems encountered in ML for collaborative filtering. Furthermore, the expanded data set reproduces some unique properties of the original data, in particular the accelerating decay of values in ranked row-wise and column-wise sums as well as in the singular value spectrum.
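The three diagnostics compared in this section (ranked row-wise sums, ranked column-wise sums and singular values) can be computed as in the following sketch. It assumes a small dense matrix for readability; the function name `ranked_statistics` and the toy data are illustrative, and a data set at the scale discussed here would call for sparse matrices and a truncated SVD instead.

```python
import numpy as np


def ranked_statistics(m):
    """Row sums, column sums and singular values, each in decreasing order."""
    return {
        "row_sums": np.sort(m.sum(axis=1))[::-1],  # user engagement
        "col_sums": np.sort(m.sum(axis=0))[::-1],  # item popularity
        "singular_values": np.linalg.svd(m, compute_uv=False),
    }


rng = np.random.default_rng(0)
toy = (rng.random((50, 30)) < 0.1).astype(float)  # toy binary interactions
stats = ranked_statistics(toy)
```

Plotting each ranked sequence against its rank on log-log axes, for both the original and the expanded matrices, gives the kind of side-by-side comparison described above.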
VI Conclusion
In conclusion, this paper presents an attempt at synthesizing realistic large-scale recommendation data sets without having to make compromises in terms of user privacy. We use a small publicly available data set, MovieLens 20m, and expand it to orders of magnitude more users, items and observed ratings. Our expansion model is rooted in the hierarchical structure of user/item interactions, which naturally suggests a fractal extrapolation model.
We leverage randomized Kronecker products as self-similar operators on user/item rating matrices that preserve key properties of row-wise and column-wise sums as well as singular value spectra. We modify the original Kronecker Graph generation method to enable a randomized expansion of the original data by orders of magnitude that yields a synthetic data set matching industrial recommendation data sets in scale. Our numerical experiments demonstrate that the data set we create has key first and second order properties similar to those of the original MovieLens 20m binarized rating matrix.
Our next steps consist in making large synthetic data sets publicly available, although any researcher can readily use the techniques we presented to scale up any user/item interaction matrix. Another possible direction is to adapt the present method to recommendation data sets featuring metadata (e.g. timestamps, topics, device information). The use of metadata is indeed critical to solve the "cold-start" problem of users and items having no interaction history with the platform. In this work, we did not consider the temporal structure of the MovieLens data set. We leave the study of sequential user behavior, often found to be Long Range Dependent [41, 48, 5, 12], and the extension of synthetic data generation to sequential recommendations [3, 49, 54] for further work. We also plan to benchmark the performance of well established baselines on the new large scale realistic synthetic data we produce.
References
 [1] Abdollahpouri, H., Burke, R., and Mobasher, B. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (2017), ACM, pp. 42–46.
 [2] Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66, 4 (2003), 671–687.

 [3] Belletti, F., Beutel, A., Jain, S., and Chi, E. Factorized recurrent neural architectures for longer range dependence. In International Conference on Artificial Intelligence and Statistics (2018), pp. 1522–1530.
 [4] Belletti, F., Lakshmanan, K., Krichene, W., Chen, Y.-F., and Anderson, J. Scalable realistic recommendation datasets through fractal expansions. arXiv preprint arXiv:1901.08910 (2019).
 [5] Belletti, F., Sparks, E., Bayen, A., and Gonzalez, J. Random projection design for scalable implicit smoothing of randomly observed stochastic processes. In Artificial Intelligence and Statistics (2017), pp. 700–708.
 [6] Bennett, J., Lanning, S., et al. The netflix prize. In Proceedings of KDD cup and workshop (2007), vol. 2007, New York, NY, USA, p. 35.
 [7] Bhargava, A., Ganti, R., and Nowak, R. Active positive semidefinite matrix completion: Algorithms, theory and applications. In Artificial Intelligence and Statistics (2017), pp. 1349–1357.
 [8] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
 [9] Chaney, A. J., Blei, D. M., and Eliassi-Rad, T. A probabilistic model for using social networks in personalized item recommendation. In Proceedings of the 9th ACM Conference on Recommender Systems (2015), ACM, pp. 43–50.
 [10] Chaney, A. J., Stewart, B. M., and Engelhardt, B. E. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. arXiv preprint arXiv:1710.11214 (2017).
 [11] Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 191–198.
 [12] Crane, R., and Sornette, D. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences 105, 41 (2008), 15649–15653.
 [13] Cremonesi, P., Koren, Y., and Turrin, R. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems (2010), ACM, pp. 39–46.
 [14] Fradkin, D., and Madigan, D. Experiments with random projections for machine learning. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (2003), ACM, pp. 517–522.
 [15] Goel, S., Broder, A., Gabrilovich, E., and Pang, B. Anatomy of the long tail: ordinary people with extraordinary tastes. In Proceedings of the third ACM international conference on Web search and data mining (2010), ACM, pp. 201–210.
 [16] Golub, G. H., and Van Loan, C. F. Matrix computations, vol. 3. JHU Press, 2012.
 [17] Grave, E., Joulin, A., Cissé, M., Jégou, H., et al. Efficient softmax approximation for gpus. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (2017), JMLR.org, pp. 1302–1310.
 [18] Hariri, N., Mobasher, B., and Burke, R. Context-aware music recommendation based on latent-topic sequential patterns. In Proceedings of the sixth ACM conference on Recommender systems (2012), ACM, pp. 131–138.
 [19] Harper, F. M., and Konstan, J. A. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 4 (2016), 19.
 [20] He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (2017), International World Wide Web Conferences Steering Committee, pp. 173–182.
 [21] Horn, R. A., and Johnson, C. R. Matrix analysis. Cambridge university press, 1990.
 [22] Jawanpuria, P., and Mishra, B. A unified framework for structured low-rank matrix learning. In International Conference on Machine Learning (2018), pp. 2259–2268.
 [23] Jolliffe, I. Principal component analysis. In International encyclopedia of statistical science. Springer, 2011, pp. 1094–1096.
 [24] Kamm, J., and Nagy, J. G. Optimal kronecker product approximation of block toeplitz matrices. SIAM Journal on Matrix Analysis and Applications 22, 1 (2000), 155–172.
 [25] Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 8 (2009), 30–37.
 [26] Krichene, W., Mayoraz, N., Rendle, S., Zhang, L., Yi, X., Hong, L., Chi, E., and Anderson, J. Efficient training on very large corpora via gramian estimation. arXiv preprint arXiv:1807.07187 (2018).
 [27] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature 521, 7553 (2015), 436.
 [28] Leskovec, J., Chakrabarti, D., Kleinberg, J., and Faloutsos, C. Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. In European Conference on Principles of Data Mining and Knowledge Discovery (2005), Springer, pp. 133–145.
 [29] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11, Feb (2010), 985–1042.
 [30] Levy, M., and Bosteels, K. Music recommendation and the long tail. In 1st Workshop On Music Recommendation And Discovery (WOMRAD), ACM RecSys, 2010, Barcelona, Spain (2010), Citeseer.
 [31] Li, P., Hastie, T. J., and Church, K. W. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (2006), ACM, pp. 287–296.
 [32] Liang, D., Altosaar, J., Charlin, L., and Blei, D. M. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM conference on recommender systems (2016), ACM, pp. 59–66.
 [33] Linden, G., Smith, B., and York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing, 1 (2003), 76–80.
 [34] Lu, J., Liang, G., Sun, J., and Bi, J. A sparse interactive model for matrix completion with side information. In Advances in neural information processing systems (2016), pp. 4071–4079.
 [35] Mahdian, M., and Xu, Y. Stochastic kronecker graphs. In International Workshop on Algorithms and Models for the WebGraph (2007), Springer, pp. 179–186.
 [36] Mandelbrot, B. B. The fractal geometry of nature, vol. 1. WH freeman New York, 1982.
 [37] Narayanan, A., and Shmatikov, V. How to break anonymity of the netflix prize dataset. arXiv preprint cs/0610105 (2006).

 [38] Nimishakavi, M., Jawanpuria, P. K., and Mishra, B. A dual framework for low-rank tensor completion. In Advances in Neural Information Processing Systems (2018), pp. 5489–5500.
 [39] Oestreicher-Singer, G., and Sundararajan, A. Recommendation networks and the long tail of electronic commerce. Mis quarterly (2012), 65–83.
 [40] Park, Y.-J., and Tuzhilin, A. The long tail of recommender systems and how to leverage it. In Proceedings of the 2008 ACM conference on Recommender systems (2008), ACM, pp. 11–18.
 [41] Pipiras, V., and Taqqu, M. S. Long-range dependence and self-similarity, vol. 45. Cambridge university press, 2017.
 [42] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. Grouplens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on Computer supported cooperative work (1994), ACM, pp. 175–186.
 [43] Ricci, F., Rokach, L., and Shapira, B. Recommender systems: introduction and challenges. In Recommender systems handbook. Springer, 2015, pp. 1–34.
 [44] Rudolph, M., Ruiz, F., Mandt, S., and Blei, D. Exponential family embeddings. In Advances in Neural Information Processing Systems (2016), pp. 478–486.
 [45] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web (2001), ACM, pp. 285–295.
 [46] Schmit, S., and Riquelme, C. Human interaction with recommendation systems. arXiv preprint arXiv:1703.00535 (2017).
 [47] Shani, G., Heckerman, D., and Brafman, R. I. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
 [48] Tang, J., Belletti, F., Jain, S., Chen, M., Beutel, A., Xu, C., and Chi, E. H. Towards neural mixture recommender for long range dependent user sequences. arXiv preprint arXiv:1902.08588 (2019).
 [49] Tang, J., and Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In International Conference on Web Search and Data Mining (2018), IEEE, pp. 565–573.
 [50] Tu, K., Cui, P., Wang, X., Wang, F., and Zhu, W. Structural deep embedding for hyper-networks. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).
 [51] Van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., and Yu, T. scikit-image: image processing in python. PeerJ 2 (2014), e453.
 [52] Wang, H., Wang, N., and Yeung, D.-Y. Collaborative deep learning for recommender systems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015), ACM, pp. 1235–1244.
 [53] Yin, H., Cui, B., Li, J., Yao, J., and Chen, C. Challenging the long tail recommendation. Proceedings of the VLDB Endowment 5, 9 (2012), 896–907.
 [54] Yu, F., Liu, Q., Wu, S., Wang, L., and Tan, T. A dynamic recurrent model for next basket recommendation. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (2016), ACM, pp. 729–732.
 [55] Zhang, J.-D., Chow, C.-Y., and Xu, J. Enabling kernel-based attribute-aware matrix factorization for rating prediction. IEEE Transactions on Knowledge and Data Engineering 29, 4 (2017), 798–812.
 [56] Zhao, Q., Chen, J., Chen, M., Jain, S., Beutel, A., Belletti, F., and Chi, E. H. Categorical-attributes-based item classification for recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems (2018), ACM, pp. 320–328.
 [57] Zheng, Y., Tang, B., Ding, W., and Zhou, H. A neural autoregressive approach to collaborative filtering. arXiv preprint arXiv:1605.09477 (2016).
 [58] Zhou, B., Hui, S. C., and Chang, K. An intelligent recommender system using sequential web access patterns. In IEEE conference on cybernetics and intelligent systems (2004), vol. 1, IEEE Singapore, pp. 393–398.
 [59] Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), ACM, pp. 1059–1068.