Doing Data Right: How Lessons Learned Working with Conventional Data should Inform the Future of Synthetic Data for Recommender Systems

We present a case that the newly emerging field of synthetic data in the area of recommender systems should prioritize ‘doing data right’. We consider this catchphrase to have two aspects: first, we should not repeat the mistakes of the past, and, second, we should explore the full scope of opportunities presented by synthetic data as we move into the future. We argue that explicit attention to dataset design and description will help to avoid past mistakes with dataset bias and evaluation. In order to fully exploit the opportunities of synthetic data, we point out that researchers can investigate new areas, such as using data synthesis to support reproducibility by making data open as well as FAIR, and to push forward our understanding of data minimization.




1. Introduction

The era of big data has seen impressive examples of how knowledge and value can be created using data. It has also seen sobering reminders of how easy it is to ‘do data wrong’, causing unintended outcomes and often outright harm to people (Joanna Redden and Terzieva, 2020). In this position paper, we point out that we are at the beginning of a new era of synthetic data and that we should take this beginning as an opportunity to ‘do data right’.

Synthetic data is data that is created to serve in the place of original data, which is directly collected or captured. Synthetic data makes it possible to carry out analyses or develop new algorithms on data that would otherwise be too sensitive to retain, share, or release. Synthetic data can also be used to augment an existing dataset to improve the performance of algorithms. Data that is only partially synthesized is referred to as semi-synthetic data. The goal of this paper is to remind researchers that we must not repeat mistakes that have been made in the past and must also ensure that, as our research moves forward, we take advantage of the full scope of the opportunities presented by synthetic data.

Our focus is on recommender system data, which takes the form of a user-item matrix whose rows correspond to users and whose columns correspond to items. If a user has interacted with an item, the corresponding matrix cell contains a 1. Interactions can include clicks, views and purchases, which implicitly express a preference of a user. Today, recommender system research focuses on such implicit data, since explicit data, which consists of ratings explicitly expressing user preferences, is harder to come by. Recommender system data differs from many datasets of the big data era in that it is highly sparse and characterized by long-tail distributions. Specifically, we have active users who watch/consume/click many items and non-active users (including cold start users) who watch/consume/click very few items. Similarly for items, we have popular items that are consumed by many users and unpopular items that are consumed by few users.
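The data format described above can be made concrete with a small sketch. The interaction log, the user and item identifiers, and the helper names `build_matrix` and `sparsity` are hypothetical and chosen only for illustration; real recommender datasets are far larger and far sparser.

```python
from collections import Counter

# Hypothetical toy log of implicit feedback: (user, item) pairs
# standing in for clicks, views, or purchases.
interactions = [
    ("u1", "i1"), ("u1", "i2"), ("u1", "i3"), ("u1", "i4"),
    ("u2", "i1"), ("u2", "i2"),
    ("u3", "i1"),
    ("u4", "i1"),
]


def build_matrix(pairs):
    """Return (users, items, matrix); matrix[u][i] is 1 if user u interacted with item i."""
    users = sorted({u for u, _ in pairs})
    items = sorted({i for _, i in pairs})
    seen = set(pairs)
    matrix = [[1 if (u, i) in seen else 0 for i in items] for u in users]
    return users, items, matrix


def sparsity(matrix):
    """Fraction of empty cells in the user-item matrix."""
    cells = sum(len(row) for row in matrix)
    filled = sum(sum(row) for row in matrix)
    return 1 - filled / cells


users, items, matrix = build_matrix(interactions)

# Long-tail view: interactions per item (popularity) and per user (activity).
item_popularity = Counter(i for _, i in set(interactions))
user_activity = Counter(u for u, _ in set(interactions))
```

In this toy log, one item is consumed by every user and one user accounts for half of all interactions, which mirrors on a small scale the popular/unpopular and active/non-active split described above.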

The special characteristics of recommender system datasets make them challenging to synthesize, and research has just begun in this direction. This means that the time is right to avoid the pitfalls already encountered with conventional data. Further motivation comes from a past failure in ‘doing data right’ with regard to user privacy, which is still fresh in the minds of recommender system researchers. Memorably, in 2010, the Netflix Prize competition was discontinued after it was demonstrated that the data released to allow competitors to develop recommender systems could be deanonymized, revealing the identity of individual users (Narayanan and Shmatikov, 2008). Since then, the Netflix Prize debacle has inspired research on synthetic data. From the experience of our own research, we see two directions emerging. The first is research on using synthetic or semi-synthetic data to replace captured data in competitions (Slokom et al., 2019). The second is research on ensuring that synthetic or semi-synthetic data derived from data originally collected from users does not leak sensitive information about those users (Slokom et al., 2021).

The paper is structured as follows. First, we look at two areas, going beyond privacy, where past research in recommender systems arguably failed in ‘doing data right’ when working with conventional data: bias and evaluation. We discuss how research in synthetic data can take the opportunity to avoid repeating past mistakes. Then, we discuss two opportunities that are opened by synthetic data and are not offered by conventional data: open data and data minimization. We present remarks on how the recommender system community can build on these opportunities. Our paper closes with a short summary and an outlook.

2. Addressing Past Mistakes

The era of big data has been driven by the idea that more data will automatically give rise to more reliable analysis and better systems. In recent years, however, machine learning researchers have initiated a more systematic approach to data in which the quality, not just the quantity, of data is central. These efforts are well represented by the initiative of datasheets for datasets (Gebru et al., 2018). In a nutshell, this initiative proposes that every dataset is described by a datasheet with a standardized format that documents: the motivation (why the dataset was created), creation (how the dataset was created), composition (what information it contains), intended uses (what tasks it should and should not be used for), and data distribution (what the properties of the dataset are). In this section, we look at past cases of ‘doing data wrong’ related to data bias and to evaluation. We comment on how understanding, documenting, and explicitly designing the characteristics of data is currently offering a course correction for research practices. We also comment on how work on synthetic data can be steered so that the problems we have confronted while working with conventional data do not arise anew.
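As a rough illustration of how the datasheet sections listed above could travel with a synthetic dataset in machine-readable form, consider the following sketch. The class `Datasheet`, its field names, and the example values are our own assumptions for illustration; they are not a format defined by (Gebru et al., 2018).

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Hypothetical container for the five datasheet sections named in the text."""
    motivation: str = ""         # why the dataset was created
    creation: str = ""           # how the dataset was created or synthesized
    composition: str = ""        # what information it contains
    intended_uses: str = ""      # tasks it should and should not be used for
    data_distribution: str = ""  # properties of the dataset

    def missing_sections(self):
        """Return the names of sections that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Example values are invented for illustration only.
sheet = Datasheet(
    motivation="Replace captured interaction data in a public challenge.",
    creation="Semi-synthetic: part of each user profile was synthesized.",
)
incomplete = sheet.missing_sections()
```

A check such as `missing_sections` could be run before a synthetic dataset is released, so that gaps in the documentation are noticed early.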

2.1. Bias Mitigation

In its early days, the recommender systems community did not consider issues of bias and fairness. Thankfully, recent work has started to illuminate these issues. Here, we provide a brief summary. Discrimination and unfairness in recommender systems can originate from different sources. First, input bias (Tsintzou et al., 2018; Lin et al., 2019) is the bias that users exhibit in the input data. In (Lin et al., 2019), the authors studied how different collaborative filtering algorithms propagate bias existing in the input data and its impact on users. In (Ekstrand et al., 2018), the authors evaluated the ability of recommender system algorithms to produce equal utility for users of different demographic groups. Their results showed statistically significant differences in effectiveness between users’ gender and age groups. Second, algorithmic bias (Tsintzou et al., 2018; Mansoury et al., 2019) concerns how effectively recommendation algorithms capture different users’ interests across item categories. An example is popularity bias, where higher accuracy scores go to algorithms that favor popular items irrespective of their ability to meet user needs. In (Edizel et al., 2019), the authors proposed FaiRecSys, an algorithm that mitigates algorithmic bias by post-processing the recommendation matrix with minimum impact on the accuracy of recommendations provided to the end-users. Third, evaluation metric error and bias have been studied by (Tian and Ekstrand, 2018), who simulated the recommender data generation and evaluation processes to quantify how erroneous current evaluation practices are. In (Yao et al., 2021), the authors proposed a simulation framework for measuring the impact of a recommender system under different types of user behavior. The framework goes beyond one-step recommendation and incorporates the interaction between user preferences and system effects, to better understand recommender system biases over time.

Biased data, a biased algorithm, and a biased metric will have an impact on all users to different degrees, which leads to discrimination, unfairness and harm. Data synthesis is an important approach to mitigating bias. Synthesized data can potentially support recommender systems’ experimentation, tuning, validation and performance prediction. When synthesizing data, there are several goals that we can attempt to achieve or test. For instance, (semi-)synthesized data can be used to mitigate bias (Krishnan et al., 2014; Huang et al., 2020), to improve consumer-provider fairness (Li et al., 2021; Boratto et al., 2020), and for data augmentation (Belletti et al., 2019).

We argue that although data synthesis is helpful in addressing bias, it is not enough on its own. It is critical that the design decisions made when creating a synthesized dataset are well motivated, made explicit, and well documented. In this way, future researchers can understand how bias was handled and assure themselves that new forms of bias were not introduced during the synthesis process. With explicit design and careful documentation, we can learn, understand, and explain where things have gone wrong and, ideally, work toward redressing problems, i.e., harms, and preventing further problems. The goal of datasheets for datasets is to provide more transparency, accountability and control in the machine learning and recommender system communities. Moving forward, it is crucial that datasheets are also created for synthetic data.

2.2. Reliable Evaluation

In its early days, the recommender systems community did not fully appreciate the importance of systematic evaluation. Arguably, it was (Said and Bellogín, 2014) that awakened researchers to the importance of completely controlling the dimensions of an evaluation in order to achieve a fair comparison. The first dimension mentioned by (Said and Bellogín, 2014) is data. In recent years, the community has made strides in evaluation practices and reproducibility; see (Bellogín and Said, 2021), which contains a section documenting the effort. We point out that a datasheets approach to synthetic data will ensure that synthetic data is used appropriately for evaluation from the start and that invalid comparisons between datasets are avoided. Another dimension mentioned by (Said and Bellogín, 2014) is evaluation strategies. Here, we dive deeper to discuss why careful attention must be paid to evaluation strategies for synthetic or semi-synthetic data.

In the machine learning literature, the quality of synthetic data is often evaluated using machine learning performance. Such an evaluation compares the performance metrics of predictive models trained on synthetic data and on real data (referred to as model compatibility). The performance of machine learning models trained and tested on real and/or synthetic data is compared across different scenarios (Heyburn et al., 2018; Jordon et al., 2018a; Fekri et al., 2020): Train on Real and Test on Synthetic; Train on Synthetic and Test on Real; Train on Real and Test on Real; Train on Synthetic and Test on Synthetic; and, lastly, training and testing on a mixture of real and synthetic data. In principle, these scenarios are transferable to the evaluation of synthetic data in recommender systems. However, it is important to consider whether the two cross scenarios, Train on Real and Test on Synthetic and Train on Synthetic and Test on Real, actually yield meaningful information about how useful synthetic data is for recommendation. The reason is that, if the synthetic data provides synthetic users, then users in the training set (or test set) are different from those in the test set (respectively training set).
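The four train/test combinations of real and synthetic data can be enumerated as in the sketch below. The non-personalized popularity recommender, the top-k hit-rate metric, the function names, and the toy data are all assumptions made for illustration; they are not the models or metrics used in the cited work. A popularity model is used precisely because it ignores user identity. A personalized model would run into the problem raised above: in the cross scenarios, the users seen at training time do not appear at test time.

```python
from itertools import product


def hit_rate(train, test, k=1):
    """Share of test interactions whose item is among the k most popular training items."""
    if not test:
        raise ValueError("test set must not be empty")
    counts = {}
    for _, item in train:
        counts[item] = counts.get(item, 0) + 1
    # Sort by descending popularity, breaking ties by item id for determinism.
    top_k = sorted(counts, key=lambda item: (-counts[item], item))[:k]
    return sum(1 for _, item in test if item in top_k) / len(test)


def run_scenarios(real, synthetic, k=1):
    """Evaluate every train/test combination of the two data sources.

    `real` and `synthetic` are dicts with 'train' and 'test' lists of (user, item) pairs.
    """
    sources = {"real": real, "synthetic": synthetic}
    results = {}
    for train_src, test_src in product(sources, repeat=2):
        name = f"train_{train_src}_test_{test_src}"
        results[name] = hit_rate(sources[train_src]["train"], sources[test_src]["test"], k)
    return results


# Hypothetical toy data; the synthetic split uses synthetic user ids (s1, s2, s3).
real = {"train": [("u1", "a"), ("u2", "a"), ("u3", "b")],
        "test": [("u1", "a"), ("u2", "b")]}
synthetic = {"train": [("s1", "b"), ("s2", "b"), ("s3", "a")],
             "test": [("s1", "b"), ("s2", "b")]}
results = run_scenarios(real, synthetic)
```

The mixed real-and-synthetic scenario is omitted here, since how to mix the two sources is itself a design decision that would need to be documented.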

It is critical to develop evaluation frameworks that are suitable for evaluating synthetic data in the context of recommender systems. In other words, evaluation itself must be an object of research. Here, we cite two directions that could serve as a starting point. First, the relative ranking of a set of algorithms, rather than their absolute scores, could serve as an important tool. The relative performance of a set of algorithms trained and tested on the synthetic dataset should be the same as their relative performance when trained and tested on the original dataset (Jordon et al., 2018a, b; Bowen and Snoke, 2019; Slokom et al., 2019). For example, if (semi-)synthetic data is released for use in a data science challenge, this relative ranking would be more important than the absolute scores achieved by the algorithms. This direction of research is not yet well explored by researchers in the recommender system community. Second, special attention must be paid to ensure that the test set remains comparable when different types of (semi-)synthetic data are compared. We have proposed one way to address this issue for semi-synthetic data (Slokom et al., 2021).
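One possible way to quantify whether a synthetic dataset preserves the relative ranking of algorithms is a rank correlation between the scores obtained on the original data and on the synthetic data. The sketch below uses Kendall's tau (the tau-a variant, which does not correct for ties); choosing this statistic is our assumption, not a method prescribed by the cited work, and the algorithm names and scores are invented for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation (tau-a) between two score dicts over the same algorithms.

    Requires at least two algorithms. Returns 1.0 when both datasets rank the
    algorithms identically and -1.0 when the ranking is fully reversed.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both score dicts must cover the same algorithms")
    names = sorted(scores_a)
    if len(names) < 2:
        raise ValueError("need at least two algorithms to compare rankings")
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical accuracy scores of three algorithms on each dataset.
on_original = {"pop": 0.10, "knn": 0.20, "mf": 0.30}
on_synthetic_good = {"pop": 0.05, "knn": 0.12, "mf": 0.25}  # same ordering, lower scores
on_synthetic_bad = {"pop": 0.05, "knn": 0.25, "mf": 0.12}   # knn and mf swap places

tau_good = kendall_tau(on_original, on_synthetic_good)
tau_bad = kendall_tau(on_original, on_synthetic_bad)
```

In the first comparison the absolute scores differ but the ordering is preserved, which is the property that matters for a challenge setting; in the second, one pair of algorithms swaps places and the correlation drops.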

We close this section on evaluation by mentioning the importance of studying data characteristics. There is an interaction between the exact nature of the data and the types of recommender system algorithms that perform well on that data. These aspects have traditionally been understudied by the community, although the situation is hopefully changing in the wake of (Adomavicius and Zhang, 2012; Deldjoo et al., 2021). Because it is straightforward to control the properties of synthetic data, the study of synthetic data opens a whole new world of possibilities for us to understand which algorithms work well with which type of data, and why. Again, we see that the proper documentation of synthetic data in datasheets is critical for such research to be reproducible and thereby useful.

3. Paving the Way for Future Research

Synthetic data is generally intended to take the place of original data. However, in order to take advantage of the full potential of synthetic data, we must also invest research effort in developing its potential to transcend conventional data and to be used for purposes for which conventional data is not suited.

3.1. FAIR and Beyond

FAIR is the combination of different small practices that make the data easier to find, easier to understand, less likely to be lost, and more likely to be usable during the project time and years later (Inau et al., 2021). FAIR principles (Fair, [n.d.]) are guidelines for data management and stewardship that are valid for both machines and humans: Findable: (meta)data should be discoverable, identifiable and searchable via the assignment of metadata and unique identifiers. Accessible: (meta)data should be available and retrievable with access via authentication and authorisation procedures. Interoperable: (meta)data should be semantically understandable, allowing the broadest possible data exchange i.e., exchange and reuse between researchers, institutions, organisations or countries. Reusable: (meta)data should be sufficiently described, well documented, and shared with the least restrictive licenses, allowing the widest reuse possible.

The FAIR principles can drive forward progress in recommender system research because they can support reproducibility. However, the FAIR principles do not dictate that the data has to be shared openly (OpenAIRE, 2018), which is a hindrance to reproducibility. For instance, data can be FAIR but not open: it is FAIR within the company, but it is not open to researchers, scientists and users outside the company. Data synthesis offers a possibility to make data FAIRly open without the need to release the original data. The (semi-)synthetic data could be designed to protect users’ sensitive information while still maintaining its value for training recommender systems, which is needed for reproducibility. We have suggested one approach in (Slokom et al., 2021), but this work represents only a beginning. The (semi-)synthetic data could also be designed to protect information that is important for companies’ competitiveness while preserving the information that third parties need in order to have oversight over how companies collect and use the data of users.

3.2. Data Minimization

Finally, we discuss the issue of data minimization. Article 5(1)(c) of the European Union’s General Data Protection Regulation (GDPR) requires that personal data should be limited to only what is necessary for the purposes for which the data is processed (Regulation, 2018). Linking back to the discussion of FAIR, we note that in (Inau et al., 2021; Boeckhout et al., 2018), the authors suggested that FAIR data and metadata can facilitate compliance with the data minimization principle, since FAIR principles allow for an assessment of which data to reuse.

Here, we zero in specifically on data minimization for recommender systems. In (Larson et al., 2017), the authors proposed to adopt training data requirements analysis to analyze and evaluate the trade-off between the amount of data that the system requires and the performance of the system. In (Krishnaraj, 2019), the authors proposed to extend the data minimization principles advocated in the GDPR and studied their effect on recommender systems. They investigated the effects of reducing the amount of data used to model a recommender system and showed that a substantial amount of data can be dropped without a large impact on performance. In (Biega et al., 2020), the authors pointed to the lack of a homogeneous interpretation of the data minimization principle. They argued that personalization-based systems do not necessarily need to collect user data, but that they do so to improve the quality of the results. They found that the performance decrease incurred by data minimization might not be substantial, but that it might disparately impact different users. To support minimization, (Biega et al., 2020) suggested that we need to design new protocols for user-system interaction, i.e., a system that does not focus only on providing infinite recommendations while collecting infinite data about its users. In other words, we need to propose new learning mechanisms that select the necessary data in a way that respects specific minimization requirements while maintaining good personalized recommendation performance.

Synthetic data presents a promising opportunity to understand what data minimization means for recommender systems. Minimized datasets can be synthesized with different characteristics, and the impact of these characteristics can be studied. We believe that cold start user profiles could be a good starting point to understand and find the minimal necessary data in a user profile. Then, recommender system research needs to look at how much data is really necessary to accomplish a given recommender system task. We expect the study of data minimization to move forward the state of the art in recommender systems, but also to make it possible to gain an understanding of how the GDPR must be enforced for recommender system data.
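To make the idea of studying minimized profiles more tangible, the sketch below caps every user profile at a fixed number of interactions and reports how much data is retained. Truncating to the first `max_items` interactions is only one of many conceivable minimization strategies and is our assumption; the function names and toy profiles are hypothetical.

```python
def minimize_profiles(profiles, max_items):
    """Keep at most `max_items` interactions per user (the first ones in each list)."""
    if max_items < 0:
        raise ValueError("max_items must be non-negative")
    return {user: list(items[:max_items]) for user, items in profiles.items()}


def retained_fraction(original, minimized):
    """Fraction of all interactions that survive minimization."""
    total = sum(len(items) for items in original.values())
    kept = sum(len(items) for items in minimized.values())
    return kept / total if total else 1.0


# Hypothetical profiles: one active user, one cold start user, one in between.
profiles = {"u1": ["a", "b", "c", "d"], "u2": ["a"], "u3": ["b", "c"]}
minimized = minimize_profiles(profiles, max_items=2)
fraction = retained_fraction(profiles, minimized)
```

Sweeping `max_items` and re-running a recommender on each minimized dataset would trace out a data-versus-performance trade-off of the kind discussed above. Note that the cap changes only the active user's profile here, which hints at why minimization can affect different users differently.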

Using synthetic data to study data minimization is potentially relevant to oversight beyond the GDPR as well. Currently, there is growing concern about the manipulative impact of hypertargeting, which infringes on privacy and consumer rights. Previously, we have proposed the concept of hypotargeting (Larson and Slokom, 2019), i.e., imposing a constraint on the number of unique recommendation lists that a recommender system can present to its users in a given time window. Because the number of unique lists remains finite, it becomes feasible to audit the experience that a recommender system is offering to its users. Such oversight can watch for bias, filter bubbles, and unfair targeting.
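The hypotargeting constraint described above can be checked mechanically on a log of served recommendations, as in the following sketch. The log format (timestamp, user, list of items), the function names, and the decision to treat lists with the same items in a different order as distinct are assumptions made for illustration.

```python
def unique_lists_in_window(served, start, end):
    """Count distinct recommendation lists served with start <= timestamp < end.

    `served` is an iterable of (timestamp, user, items) records. Lists are
    compared in order, so ["a", "b"] and ["b", "a"] count as two lists.
    """
    return len({tuple(items) for ts, _user, items in served if start <= ts < end})


def satisfies_hypotargeting(served, start, end, max_unique_lists):
    """True if the number of unique lists in the window stays within the cap."""
    return unique_lists_in_window(served, start, end) <= max_unique_lists


# Hypothetical log of served recommendation lists.
served = [
    (1, "u1", ["a", "b"]),
    (2, "u2", ["a", "b"]),   # same list as u1
    (3, "u3", ["b", "a"]),   # same items, different order
    (12, "u4", ["c", "d"]),  # outside the window [0, 10)
]
count = unique_lists_in_window(served, start=0, end=10)
```

Because the set of unique lists in a window is small by construction, an auditor could inspect each list directly rather than sampling from an unbounded space of personalized outputs.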

4. Summary and Outlook

In this position paper, we have described mistakes that have occurred over the history of recommender system research, specifically, neglecting the issue of bias and overlooking the importance of evaluation frameworks. We have argued that we must ensure that these mistakes are not repeated as we develop approaches to create synthetic data for evaluation. We have also pointed to areas where synthetic data has a special contribution to make in the future, specifically, extending FAIR principles to make data open, and moving forward our understanding of data minimization for recommender systems and of how to minimize data appropriately and effectively.

Throughout we have emphasized the importance of explicitly designing and documenting synthetic datasets, following the idea of datasheets for datasets (Gebru et al., 2018). Future research will need to embrace the development of best practices for design, documentation, and evaluation of synthetic data as research areas in their own right.

Recommender system research must also create bridges across disciplines. As pointed out by (Gebru et al., 2018), the risk of datasets causing harm can be exacerbated when developers are not domain experts. Moving forward, it is essential to include experts from specific domains, such as health, psychology, and communication science, in synthetic data research. Further, interdisciplinary collaboration with legal experts is also necessary to understand how synthetic data can best protect privacy and support data minimization and regulatory oversight.