It is widely acknowledged that novel recommendation approaches should be evaluated in online experiments involving human subjects in order to obtain reasonably robust results about their performance (Herlocker et al., 2004). Nevertheless, most of the studies available in the literature support their conclusions with offline trials relying on user preferences collected without considering the algorithms under investigation (Gunawardana and Shani, 2015). Despite the possible weaknesses of this approach (Said and Bellogín, 2014), offline experiments are extremely popular among researchers because of their limited costs and the theoretical reproducibility of their results. In industry, they are usually considered a powerful tool for pruning the number of candidate recommender systems that need to be tested with real users, thus mitigating the economic impact of possible failures.
An offline experiment requires a collection of user preferences obtained in a particular domain. For example, the MovieLens datasets represent a popular choice for conducting an offline evaluation in the field of movie recommender systems (Harper and Konstan, 2015). Nevertheless, the number and the variety of publicly available rating datasets are often limited, especially in less mainstream domains (Tso and Schmidt-Thieme, 2006). It is possible to identify different causes for this problem. For example, the companies capable of collecting rating datasets are usually reluctant to share them, because of the fear of violating the privacy of their users or of exposing commercially sensitive data to their competitors. On the other hand, researchers often do not have the resources for obtaining a number of ratings large enough to be worth releasing publicly.
Because of the shortage of public datasets, practitioners have started to rely on synthetic ratings in order to conduct their offline experiments (Yu et al., 2012). An obvious advantage of such an approach is that it enables the creation of rating datasets with an arbitrary number of users and items at a limited acquisition cost. However, the results obtained from such experiments may be questionable, as the generated datasets are usually not capable of capturing the characteristics of a particular domain of interest (Montaner et al., 2004).
In this work, we propose a novel approach for automatically generating synthetic datasets with a configurable number of users. The approach leverages a reference dataset that is used as the seed of the process and that encodes the peculiarities of a domain of interest. Such a generative method can be exploited to create different rating datasets containing users that exhibit behaviors similar to the ones available in the reference dataset. However, the synthetic users do not have a direct relation with the real users and, therefore, no private or commercially sensitive information is leaked. At the same time, because the number of synthetic users is configurable, the generated dataset can be exploited to conduct scalability tests in a realistic way and to train recommendation algorithms using reinforcement learning approaches.
More formally, we aim to provide an answer to the following research questions.
RQ1. What is the impact of using a synthetic dataset instead of a real one on the results of an offline experiment in the context of recommender systems?
RQ2. Can a generative approach be exploited to create a synthetic dataset that exhibits properties similar enough to the ones of a real dataset?
RQ3. To what extent can this method be consistently applied to datasets from different domains and of different sizes?
The remainder of this paper is structured as follows: in Section 2 we review related work and compare it to our proposal. In Section 3 we introduce the generative approach for creating synthetic datasets, while in Section 4 we describe the experimental setup designed to validate it. We present and discuss the results in Section 5 and, in Section 6, we provide the conclusions.
2. Related work
Synthetic datasets are commonly used in the literature to assess the performance of database systems or to study the behavior of data mining algorithms. For example, Agrawal and Srikant (Agrawal and Srikant, 1994) created a generator of retail transactions intended for the evaluation of association rule algorithms, while Houkjær et al. (Houkjær et al., 2006) introduced a software tool capable of creating relational data for benchmarking purposes. Such tools can generate realistic data in terms of their statistical distributions, which can be empirically learned from existing datasets or provided by a researcher using specialized languages.
Similar approaches have also been explored in the field of recommender systems, usually because of the lack of public datasets with the required characteristics. Tso and Schmidt-Thieme (Tso and Schmidt-Thieme, 2006) created a synthetic data generator for evaluating context-aware recommenders based on Dirichlet and Chi-square distributions. The metric of information entropy is then exploited to control the randomness of the synthetic data. A similar method has been discussed by Pasinato et al. (Pasinato et al., 2013): their intuition is to represent the heterogeneous rating behaviors of the users with different statistical distributions.
Manouselis and Costopoulou (Manouselis and Costopoulou, 2008) presented a tool, named CollaFis, capable of creating synthetic ratings for the evaluation of either single-criteria or multi-criteria recommender systems. The users of CollaFis need to specify the characteristics of the generated data, like the number of users, items, and criteria. A common aspect of all the previously mentioned methods is that researchers are required to choose and configure the statistical distributions that are exploited to generate the artificial datasets. However, the main problem of such an approach is that it is impossible to predict the real behavior of many different users with a few statistical distributions (Montaner et al., 2004).
Another possible line of research is related to the imitation of a real collection of preferences. For example, del Carmen Rodríguez-Hernández et al. (del Carmen Rodríguez-Hernández et al., 2017) developed a software tool, DataGenCARS, for creating artificial ratings using a set of parameters provided by the user or inferred from a reference dataset. However, we argue that statistics computed at a global level are not informative enough to create a synthetic dataset, as they are not able to capture the different behaviors of the various groups of users.
3. Dataset generation
Our approach for generating synthetic datasets starting from a reference dataset consists of two steps. In the first one, it is necessary to analyze an existing collection of user preferences in order to obtain an accurate representation of the domain of interest. Then, in the second one, it is possible to exploit such a representation for creating different generated datasets.
We argue that only relying on a few statistical distributions computed empirically at a global level from an existing dataset or specified by a researcher is not sufficient to realistically simulate the individual tastes of human beings (Montaner et al., 2004). Such methods would lead to the creation of datasets with users having no individual preferences, thus making the task of any recommender system nearly impossible.
For this reason, we included a preliminary clustering phase as part of the first step in order to group the users into a fixed number of communities. The individual rating behaviors, represented by different statistical distributions, are learned for each community and then exploited during the sampling phase.
For simplicity, we assume that each user can only express positive preferences about the items available in the system. However, this approach can also be exploited to simulate datasets with ratings expressed on a more complex scale by repeating these steps for each rating value and then by merging the results.
In the following, we detail the user clustering and distribution learning process and the rating sampling algorithm.
3.1. User clustering and distribution learning
We represent each user from the reference dataset as a vector with length equal to the number of items. The i-th component of such a vector is equal to 1 if the user expressed a positive rating about the i-th item of the catalog, otherwise it is equal to 0.
Given this data structure, we decided to apply the K-means clustering algorithm (Hartigan and Wong, 1979) to group users who liked a similar set of items into the same cluster. The value of K needs to be empirically selected by the experimenter because, in general, it depends on the characteristics of the reference dataset.
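The representation and the clustering step can be illustrated with a minimal Python sketch. It assumes that the reference ratings are available as (user, item) pairs of positive preferences; the function names `build_user_vectors` and `kmeans` are hypothetical, and the clustering is a plain Lloyd-style iteration written only for illustration, not the Hartigan and Wong variant cited above. In practice, an optimized library implementation would normally be preferred.

```python
import random

def build_user_vectors(positive_ratings, items):
    """Map each user to a 0/1 vector with one component per catalog item."""
    index = {item: i for i, item in enumerate(items)}
    vectors = {}
    for user, item in positive_ratings:
        vec = vectors.setdefault(user, [0] * len(items))
        vec[index[item]] = 1
    return vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd-style K-means (illustrative); returns user -> cluster index."""
    rng = random.Random(seed)
    users = sorted(vectors)
    # Initialize the centroids with k randomly chosen user vectors.
    centroids = [list(vectors[u]) for u in rng.sample(users, k)]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each user joins the nearest centroid.
        new_assignment = {
            u: min(range(k), key=lambda c: squared_distance(vectors[u], centroids[c]))
            for u in users
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [vectors[u] for u in users if assignment[u] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy data (hypothetical): two users like items a and b, two like c and d.
ratings = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"),
           ("u3", "c"), ("u3", "d"), ("u4", "c"), ("u4", "d")]
items = ["a", "b", "c", "d"]
vectors = build_user_vectors(ratings, items)
clusters = kmeans(vectors, k=2)
```

In this toy example, users who liked the same items end up in the same community, which is the property that the later distribution learning step relies on.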
Every cluster identifies a different community of users. For generating a dataset similar to the reference one, it is necessary to know how many users belong to each community and which item preferences are associated with them. In more detail, we create the following empirical distributions from the reference ratings:
- the distribution of users per cluster, that is, how users are distributed in clusters;
- the distribution of ratings per user, that is, how ratings are distributed in users for each cluster;
- the distribution of ratings per item, that is, how ratings are distributed in items for each cluster.
Note that only the first distribution is global, while the second and the third ones are associated with a cluster.
The distribution of users per cluster represents the probability of assigning a user to a certain cluster and it is computed by counting the number of users per cluster. The distribution of ratings per user represents the probability of finding a certain number of ratings per user in the cluster and it is computed by counting the number of ratings per user. Finally, the distribution of ratings per item represents the probability of finding a certain number of ratings per item in the cluster and it is computed by counting the number of ratings per item.
The user clustering and distribution learning process is formalized in Algorithm 1. Its output is represented by the previously mentioned empirical distributions.
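As an illustration of the distribution learning step, the following Python sketch counts the three empirical distributions from a list of positive (user, item) ratings and a user-to-cluster assignment. The function name `learn_distributions` and the toy data are hypothetical, and the unnormalized counts are used directly as weights; this is one possible reading of the description above, not a transcription of Algorithm 1.

```python
from collections import Counter, defaultdict

def learn_distributions(positive_ratings, assignment):
    """Return the three empirical distributions as unnormalized counts."""
    # Users per cluster: cluster -> number of users (global distribution).
    users_per_cluster = Counter(assignment.values())

    # Ratings per user: for each cluster, number of ratings -> number of users.
    ratings_of_user = Counter(user for user, _ in positive_ratings)
    ratings_per_user = defaultdict(Counter)
    for user, n_ratings in ratings_of_user.items():
        ratings_per_user[assignment[user]][n_ratings] += 1

    # Ratings per item: for each cluster, item -> number of ratings received.
    ratings_per_item = defaultdict(Counter)
    for user, item in positive_ratings:
        ratings_per_item[assignment[user]][item] += 1

    return users_per_cluster, dict(ratings_per_user), dict(ratings_per_item)

# Toy data (hypothetical): cluster 0 contains u1 and u2, cluster 1 contains u3 and u4.
ratings = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "a"), ("u2", "b"),
           ("u3", "c"), ("u3", "d"), ("u4", "c"), ("u4", "d")]
assignment = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}
users_per_cluster, ratings_per_user, ratings_per_item = learn_distributions(
    ratings, assignment
)
```

Keeping raw counts instead of normalized probabilities is a design choice of this sketch: weighted sampling only needs relative weights, so normalization can be deferred or skipped.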
3.2. Rating sampling
Starting from the empirical distributions obtained from Algorithm 1, it is possible to generate a synthetic dataset by applying a sampling function to them. In the following, we assume that this sampling function is weighted random sampling.
As discussed in Section 1, the experimenter can select the number of users available in the generated dataset. This value is an input of the rating sampling algorithm, together with the probability distributions. The synthetic dataset can also have the same number of users as the reference dataset.
First, each generated user is assigned to a cluster from the reference dataset, according to the distribution of users per cluster. Then, the number of ratings for that user is selected according to the distribution of ratings per user in that cluster. Finally, items are sampled without replacement from the distribution of ratings per item in that cluster. Thus, both the number of ratings of a user and the items she likes are associated with a particular community of users.
The rating sampling procedure is formalized in Algorithm 2.
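The three sampling steps can be sketched in Python as follows. This is an illustrative reading of the procedure, not a transcription of Algorithm 2: the function names, the toy distributions, and the choice of drawing items one at a time with weights proportional to their rating counts are all assumptions of this example.

```python
import random

def weighted_sample_without_replacement(weights, n, rng):
    """Draw up to n distinct keys, each draw proportional to the remaining weights."""
    remaining = dict(weights)
    chosen = []
    for _ in range(min(n, len(remaining))):
        keys = list(remaining)
        pick = rng.choices(keys, weights=[remaining[key] for key in keys])[0]
        chosen.append(pick)
        del remaining[pick]  # an item can be liked at most once per user
    return chosen

def sample_dataset(n_users, users_per_cluster, ratings_per_user,
                   ratings_per_item, seed=0):
    """Generate (synthetic_user, item) pairs from the learned distributions."""
    rng = random.Random(seed)
    clusters = list(users_per_cluster)
    cluster_weights = [users_per_cluster[c] for c in clusters]
    dataset = []
    for user_id in range(n_users):
        # Step 1: assign the synthetic user to a community.
        cluster = rng.choices(clusters, weights=cluster_weights)[0]
        # Step 2: pick how many ratings this user expresses.
        counts = list(ratings_per_user[cluster])
        count_weights = [ratings_per_user[cluster][n] for n in counts]
        n_ratings = rng.choices(counts, weights=count_weights)[0]
        # Step 3: pick the liked items within the community.
        liked = weighted_sample_without_replacement(
            ratings_per_item[cluster], n_ratings, rng
        )
        dataset.extend((user_id, item) for item in liked)
    return dataset

# Toy distributions (hypothetical), expressed as unnormalized counts.
users_per_cluster = {0: 2, 1: 2}
ratings_per_user = {0: {2: 1, 3: 1}, 1: {2: 2}}
ratings_per_item = {0: {"a": 2, "b": 2, "c": 1}, 1: {"c": 2, "d": 2}}
synthetic = sample_dataset(10, users_per_cluster, ratings_per_user, ratings_per_item)
```

Because the number of synthetic users is a free parameter, the same learned distributions can be reused to produce datasets that are smaller or larger than the reference one.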
4. Experimental setup
We evaluated different recommenders on popular datasets typically exploited in the literature and compared the results with the ones computed, under the same experimental conditions, on various collections of synthetic preferences generated from those datasets using multiple techniques.
In fact, we claim that a synthetic dataset can be successfully used during an evaluation campaign if the behavior of the recommender systems under analysis is similar to the one that would be observed with the reference dataset. Thus, almost all the possible pairs of recommenders should exhibit the same order relation for a given dimension and lead to similar conclusions.
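One way to make this criterion operational is to count the fraction of recommender pairs that are ordered in the same way on both datasets. The following Python sketch is only an illustration of that idea: the function name `pairwise_order_agreement` is hypothetical and the scores are invented placeholder values, not results from our experiments.

```python
from itertools import combinations

def sign(x):
    return (x > 0) - (x < 0)

def pairwise_order_agreement(reference_scores, synthetic_scores):
    """Fraction of recommender pairs ranked in the same order on both datasets."""
    names = sorted(set(reference_scores) & set(synthetic_scores))
    pairs = list(combinations(names, 2))
    agreeing = sum(
        sign(reference_scores[a] - reference_scores[b])
        == sign(synthetic_scores[a] - synthetic_scores[b])
        for a, b in pairs
    )
    return agreeing / len(pairs)

# Placeholder scores (hypothetical values chosen only to exercise the function).
reference = {"Random": 0.01, "Most Popular": 0.20, "User KNN": 0.30,
             "BPRMF": 0.28, "WRMF": 0.32}
synthetic = {"Random": 0.01, "Most Popular": 0.10, "User KNN": 0.15,
             "BPRMF": 0.12, "WRMF": 0.14}
agreement = pairwise_order_agreement(reference, synthetic)
```

With these placeholder values, only the pair (User KNN, WRMF) swaps order, so nine of the ten pairs agree.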
Furthermore, we investigated the impact of the parameter K on the results of the evaluation, in order to understand how to empirically select the most appropriate value for it.
In our experiments, we utilized the Random, Most Popular, User KNN, BPRMF, and WRMF recommendation algorithms and the metrics of precision, recall, and NDCG as defined in the evaluation framework RecLab (Monti et al., 2018). Regarding the user preferences, we exploited the binarized versions of the MovieLens 100K, MovieLens 1M, and LastFM (Cantador et al., 2011) datasets. We considered as positive all ratings with a value higher than for MovieLens and than for LastFM. We relied on the default values of the evaluation framework for all other experimental parameters: we followed a random splitting protocol with a test set size equal to the of all available ratings and we recommended items for each test user.
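For reference, the three metrics can be sketched in Python for binary relevance as follows. These are the common textbook definitions, written under the assumption that each test user has a ranked recommendation list and a set of held-out liked items; the exact definitions implemented in RecLab may differ in details such as normalization.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that the user actually liked."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Share of the liked items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary NDCG: discounted gain divided by the gain of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical): four recommendations, three held-out liked items.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
p = precision_at_k(recommended, relevant, 4)
r = recall_at_k(recommended, relevant, 4)
n = ndcg_at_k(recommended, relevant, 4)
```

In the toy example two of the four recommended items are relevant, so precision is 2/4 and recall is 2/3; NDCG additionally rewards the fact that one of the hits is ranked first.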
From the aforementioned reference datasets we generated their synthetic versions exploiting the procedure described in Section 3. We set the number of synthetic users equal to the number of users originally available, in order to compare datasets of similar size. Furthermore, we also created three baseline synthetic collections with the same number of ratings by not applying the user clustering phase. All the users of such baselines exhibit the same rating behavior, similarly to the approach described in (del Carmen Rodríguez-Hernández et al., 2017). In Table 1, we report different statistics regarding the baseline, generated, and reference datasets.
5. Results
In this section, we first discuss the impact of the number of user communities on the evaluation results, then we present a comparison between exploiting the synthetic and the reference datasets.
5.1. Number of user communities
| Dataset | Most Popular | User KNN | BPRMF | WRMF |
|---|---|---|---|---|
| K = 5 | 0.088449 | 0.099890 | 0.078768 | 0.091749 |
| K = 10 | 0.095793 | 0.124595 | 0.102805 | 0.111974 |
| K = 50 | 0.098378 | 0.133946 | 0.103243 | 0.133838 |
| K = 100 | 0.102415 | 0.150494 | 0.115587 | 0.149945 |
| K = 200 | 0.099672 | 0.154158 | 0.122538 | 0.164114 |
To study the impact of the value of K on the results of an evaluation conducted with a synthetic dataset, we computed the measure of precision on different synthetic versions of the MovieLens 100K dataset created with K equal to 5, 10, 50, 100, and 200. We report the numerical outcomes of this experiment in Table 2.
We also observed that it is possible to obtain similar results by considering other datasets and metrics. As expected, the values of precision for all the algorithms but the Random and Most Popular approaches improve by increasing the number of available clusters. However, this relationship is not linear, as doubling the number of clusters from 100 to 200 only slightly improves the results.
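The diminishing returns can be checked directly against the precision values reported in Table 2. The following Python sketch computes the relative change in precision between two values of K; the dictionary simply transcribes the table, while the names `precision_table` and `relative_gain` are introduced here only for illustration.

```python
# Precision values transcribed from Table 2 (MovieLens 100K, synthetic versions).
precision_table = {
    5:   {"Most Popular": 0.088449, "User KNN": 0.099890,
          "BPRMF": 0.078768, "WRMF": 0.091749},
    10:  {"Most Popular": 0.095793, "User KNN": 0.124595,
          "BPRMF": 0.102805, "WRMF": 0.111974},
    50:  {"Most Popular": 0.098378, "User KNN": 0.133946,
          "BPRMF": 0.103243, "WRMF": 0.133838},
    100: {"Most Popular": 0.102415, "User KNN": 0.150494,
          "BPRMF": 0.115587, "WRMF": 0.149945},
    200: {"Most Popular": 0.099672, "User KNN": 0.154158,
          "BPRMF": 0.122538, "WRMF": 0.164114},
}

def relative_gain(algorithm, k_from, k_to):
    """Relative change in precision when moving from k_from to k_to clusters."""
    before = precision_table[k_from][algorithm]
    after = precision_table[k_to][algorithm]
    return (after - before) / before

# Compare the first and the last doubling of K for each personalized algorithm.
for algorithm in ("User KNN", "BPRMF", "WRMF"):
    early = relative_gain(algorithm, 5, 10)
    late = relative_gain(algorithm, 100, 200)
    print(f"{algorithm}: {early:+.1%} (5 to 10) vs {late:+.1%} (100 to 200)")
```

For every personalized algorithm, the relative gain obtained by doubling K from 5 to 10 is larger than the one obtained by doubling it from 100 to 200.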
We empirically observed that reasonable values for could be or . In Section 5.2, we will assume that .
Therefore, we can provide an answer to RQ1 by observing that the impact of using a synthetic dataset in an evaluation campaign can be mitigated if we are able to simulate a sufficient number of heterogeneous user communities.
|Baseline dataset||Generated dataset||Reference dataset|
5.2. Synthetic and reference datasets
As anticipated in Section 4, we compared the evaluation results obtained when relying on the reference dataset and two synthetic datasets created with different approaches. We repeated this experiment with datasets of different sizes and from different domains in order to assess the generalizability of the results.
We observe that in all experiments and for almost all the possible pairs of recommenders the relative order of the measures is the same between the generated and the reference datasets.
As expected, their values are lower when exploiting the synthetic ratings, as they do not represent real preferences. Nevertheless, they are still useful to identify the most promising recommendation techniques in a certain domain, while the results obtained with the baseline datasets cannot be exploited for such a purpose.
With respect to RQ2, we can conclude that a generative approach capable of replicating the behaviors of different groups of users can be used for creating realistic datasets. We also discovered, as an answer to RQ3, that our approach can be potentially applied to datasets from different domains and of different sizes.
6. Conclusion and future work
In this paper, we have discussed a method for generating synthetic datasets with an arbitrary number of users starting from existing collections of preferences. Unlike the approaches already available in the literature, we propose to first model user communities in order to generate more realistic ratings that can be successfully exploited during an evaluation campaign.
We empirically verified that the outcome of an offline comparison among different recommender systems conducted exploiting the generated datasets is consistent with the results obtained when using the reference datasets, provided that a sufficient number of user clusters is selected. This finding could encourage private companies to publicly release synthetic datasets created from internally available data without the fear of violating the privacy of their users or of exposing commercially sensitive information.
As future work, we would like to explore additional methods for creating synthetic datasets. We believe that Generative Adversarial Networks (GANs) could be successfully exploited for this task, as they are already used to generate fake images starting from real ones (Goodfellow et al., 2014). Such approaches would require the definition of a way for representing the preferences of a user similarly to an image.
- Agrawal and Srikant (1994) Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94). Morgan Kaufmann Publishers, Burlington, MA, USA, 487–499.
- Cantador et al. (2011) Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. Second Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec2011). In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys ’11). ACM, New York, NY, USA, 387–388. https://doi.org/10.1145/2043932.2044016
- del Carmen Rodríguez-Hernández et al. (2017) María del Carmen Rodríguez-Hernández, Sergio Ilarri, Ramón Hermoso, and Raquel Trillo-Lado. 2017. DataGenCARS: A generator of synthetic data for the evaluation of context-aware recommendation systems. Pervasive and Mobile Computing 38 (jul 2017), 516–541. https://doi.org/10.1016/j.pmcj.2016.09.020
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 2. MIT Press, Cambridge, MA, USA, 2672–2680.
- Gunawardana and Shani (2015) Asela Gunawardana and Guy Shani. 2015. Evaluating Recommender Systems. In Recommender Systems Handbook (2 ed.). Springer Publishing, New York, NY, USA, Chapter 8, 265–308. https://doi.org/10.1007/978-1-4899-7637-6_8
- Harper and Konstan (2015) F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems 5, 4 (2015), 1–19. https://doi.org/10.1145/2827872
- Hartigan and Wong (1979) John A. Hartigan and Manchek A. Wong. 1979. Algorithm AS 136: A K-Means Clustering Algorithm. Applied Statistics 28, 1 (1979), 100. https://doi.org/10.2307/2346830
- Herlocker et al. (2004) Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22, 1 (2004), 5–53. https://doi.org/10.1145/963770.963772
- Houkjær et al. (2006) Kenneth Houkjær, Kristian Torp, and Rico Wind. 2006. Simple and Realistic Data Generation. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB ’06). VLDB Endowment, Los Angeles, CA, USA, 1243–1246.
- Manouselis and Costopoulou (2008) Nikos Manouselis and Constantina Costopoulou. 2008. Preliminary Study of the Expected Performance of MAUT Collaborative Filtering Algorithms. In The Open Knowlege Society. A Computer Science and Information Systems Manifesto. Springer Publishing, New York, NY, USA, 527–536. https://doi.org/10.1007/978-3-540-87783-7_67
- Montaner et al. (2004) Miquel Montaner, Beatriz López, and Josep Lluís de la Rosa. 2004. Evaluation of recommender systems through simulated users. In Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems. SciTePress, Setúbal, Portugal, 1–6. https://doi.org/10.5220/0002622703030308
- Monti et al. (2018) Diego Monti, Giuseppe Rizzo, and Maurizio Morisio. 2018. A Distributed and Accountable Approach to Offline Recommender Systems Evaluation. In Proceedings of the Workshop on Offline Evaluation for Recommender Systems at the 12th ACM Conference on Recommender Systems. REVEAL 2018, Vancouver, BC, Canada, Article 6, 5 pages. https://arxiv.org/abs/1810.04957
- Pasinato et al. (2013) Marden Pasinato, Carlos Eduardo Mello, Marie-Aude Aufaure, and Geraldo Zimbrao. 2013. Generating Synthetic Data for Context-Aware Recommender Systems. In 2013 BRICS Congress on Computational Intelligence and 11th Brazilian Congress on Computational Intelligence. IEEE, New York, NY, USA, 563–567. https://doi.org/10.1109/brics-cci-cbic.2013.99
- Said and Bellogín (2014) Alan Said and Alejandro Bellogín. 2014. Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys ’14). ACM, New York, NY, USA, 129–136. https://doi.org/10.1145/2645710.2645746
- Tso and Schmidt-Thieme (2006) Karen H. L. Tso and Lars Schmidt-Thieme. 2006. Empirical Analysis of Attribute-Aware Recommendation Algorithms with Variable Synthetic Data. In Studies in Classification, Data Analysis, and Knowledge Organization. Springer Publishing, New York, NY, USA, 271–278. https://doi.org/10.1007/3-540-34416-0_29
- Yu et al. (2012) Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. 2012. Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems. In 2012 IEEE 12th International Conference on Data Mining. IEEE, New York, NY, USA, 765–774. https://doi.org/10.1109/icdm.2012.168