Synthetic Attribute Data for Evaluating Consumer-side Fairness

by Robin Burke, et al.
DePaul University

When evaluating recommender systems for their fairness, it may be necessary to make use of demographic attributes, which are personally sensitive and usually excluded from publicly-available data sets. In addition, these attributes are fixed and therefore it is not possible to experiment with different distributions using the same data. In this paper, we describe the Frequency-Linked Attribute Generation (FLAG) algorithm, and show its applicability for assigning synthetic demographic attributes to recommendation data sets.









1. Introduction

Fairness in recommender systems spans a number of different research questions. One key area is the impact of users' demographic attributes on their experiences with recommender systems, an aspect of consumer-side fairness (C-fairness) (Burke, 2017). In some application areas, such as employment, there may be a legal mandate to ensure that users in protected groups receive recommendations of similar quality to those of users who are not.

A challenge in performing C-fairness research is that demographic attributes are rarely included in public data sets used in recommendation, especially in sensitive areas such as employment: such attributes would make it much easier to de-anonymize the data and uncover the identities of the users.

In this position paper, we outline our solution, the Frequency-Linked Attribute Generation (FLAG) algorithm, for probabilistic generation of synthetic demographic attributes.

1.1. Fairness

A standard simplification in fairness-aware machine learning is to consider users as divided into protected and unprotected groups, where fairness towards the protected group is desired 

(Zemel et al., 2013). In the case of job seekers, the protected group may depend on the job category, but may often be associated with gender and / or racial / ethnic identity.

For the 2017 RecSys Challenge (Abel et al., 2017), the career-oriented social networking site XING released a data set consisting of interactions between users and job postings. Most attributes of jobs and users were anonymized, so it is not possible to make use of any demographic information for fairness-aware recommendation research. The data set is large and sparse, with over 10 million interactions. We produced a sample of the data by concentrating on users of career level 0 within region 7, leaving approximately 410k users and around 3 million interactions. These users have profiles ranging in length from one interaction up to 30; the very small number of users with larger profiles was removed.

2. Synthetic attribute generation

It is well established that different types of users have different behaviors in employment-seeking contexts (Wanberg et al., 1996). Male job seekers tend to be more optimistic relative to expected salary, for example (Heckert et al., 2002). Similar differences are reported for race and ethnic identity, even when controlling for background (Avery, 2003).

These findings suggest that the potential exists for a feedback loop in job recommendation with respect to job quality. White male users may click more optimistically on desirable jobs, while other users may be less likely to do so. A recommender system may pick up on this difference and allocate recommendations accordingly, leading to an unacceptable degree of disparity between the quality of jobs presented to different groups.

2.1. Frequency-Linked Attribute Generation

The FLAG algorithm does not attempt to uncover any ground truth about the demographic status of any individual or group within its input data, and could not be used for this purpose. We treat the task as one of generating a membership probability distribution, which can then be applied to assign a binary-valued attribute.

In order to serve as a useful proxy for unprotected / protected status, labeled A and B respectively, there are certain requirements that a synthetic demographic attribute should have:

  • Group labels should be assigned based on a probability distribution, with every user having some non-zero probability of receiving either label A or B.

  • The feature should be correlated with differences in user behavior, so that it can be applied to data sets where only behavior is known.

  • The data generator should be parameterized such that groups A and B can have different relative sizes, and such that they can have behavioral profiles that vary in overall similarity. This will allow us to evaluate algorithms under a range of conditions.

To understand the FLAG algorithm and how it meets these requirements, we start with the observation that profile sizes in recommendation data sets generally follow a power-law distribution, with a small number of very active users and a much larger number of less active ones. Our subset of the XING data follows such a power-law distribution for profile size, with an estimated exponent of 1.45 (calculations performed with the poweRlaw package version 0.70.1 in R 3.4.4). Note that we are using profile length (number of clicks) as our behavioral indicator in this research, but it could be any other property that has a right-skewed, long-tailed distribution and that is probabilistically associated with a demographic attribute of interest.
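As a rough illustration of how such an exponent can be estimated, the standard discrete maximum-likelihood approximation of Clauset, Shalizi, and Newman can be written in a few lines. This is a sketch only (the function name is ours); the paper's figure of 1.45 came from the poweRlaw package in R:

```python
import numpy as np

def estimate_power_law_exponent(sizes, s_min=1):
    """Approximate discrete MLE for a power-law exponent
    (the Clauset, Shalizi & Newman approximation)."""
    s = np.asarray(sizes, dtype=float)
    s = s[s >= s_min]
    return 1.0 + len(s) / np.log(s / (s_min - 0.5)).sum()

# Quick check against synthetic data: draw heavy-tailed sizes via
# inverse-CDF sampling from a continuous power law, then floor them.
rng = np.random.default_rng(42)
true_gamma = 1.45
x = (1.0 - rng.random(50_000)) ** (-1.0 / (true_gamma - 1.0))
samples = np.floor(x)
```

On such synthetic data the estimate lands in the neighborhood of the true exponent; the flooring step introduces a modest downward bias, which is why dedicated packages such as poweRlaw are preferable for real analyses.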

Let D be the distribution of profile sizes for the n users: D(s) equals the number of users with profiles of size s. The maximum profile size is m; m equals 30 in our XING subset. In FLAG, membership probability in groups A and B is a function of the size of a user's profile, following a power-law distribution.

We begin by setting the probability of membership in group B to be p_B(s) = s^(-γ), with the parameter γ controlling the skew of the distribution. As γ approaches 0, the distribution approaches a uniform line at 1 (all users are in group B), and as γ approaches infinity, it approaches zero (all users are in group A).

This gives us a variety of different distributional shapes for groups A and B, but it does not allow us to control their relative sizes. The expected number of group B users under p_B is given by:

E[n_B] = Σ_{s=1}^{m} D(s) · s^(-γ)
To control the relative sizes of the two groups, we introduce a parameter α that specifies the fraction of users that we would like in group B. In other words, E[n_B] = α · n. We can achieve this result by uniformly scaling each p_B value: we multiply each p_B(s) by α · n / Σ_{j=1}^{m} D(j) · j^(-γ). This ensures that each group has, in expectation, the desired size. Taking both parameters into account, we can write the FLAG function as a membership probability distribution over profile sizes:

p_B(s) = (α · n / Σ_{j=1}^{m} D(j) · j^(-γ)) · s^(-γ)
To generate attributes, we process each user profile and, given the profile size s, calculate the group membership probability p_B(s). We conduct a Bernoulli trial with probability p_B(s): on success, we assign the user to group B; otherwise, to group A.
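The generation procedure can be sketched as follows. This is a minimal NumPy illustration under the definitions above; the function names are our own, not from any released FLAG implementation:

```python
import numpy as np

def flag_membership_probs(profile_sizes, gamma, alpha):
    """Probability of group-B membership for each user under FLAG.

    profile_sizes: one profile length s >= 1 per user.
    gamma: skew parameter for the base probability s**(-gamma).
    alpha: desired expected fraction of users in group B.
    """
    s = np.asarray(profile_sizes, dtype=float)
    base = s ** (-gamma)
    # Uniform scaling so the probabilities sum to alpha * n, i.e. the
    # expected size of group B is alpha * n.
    probs = alpha * len(s) * base / base.sum()
    if probs.max() > 1.0:
        raise ValueError("illegal (gamma, alpha): some probability > 1")
    return probs

def flag_assign(profile_sizes, gamma, alpha, seed=None):
    """One Bernoulli trial per user: True = group B, False = group A."""
    rng = np.random.default_rng(seed)
    probs = flag_membership_probs(profile_sizes, gamma, alpha)
    return rng.random(len(probs)) < probs
```

For example, `flag_assign(sizes, gamma=1.45, alpha=0.1)` yields roughly 10% group-B users in expectation, concentrated at the short-profile end of the distribution.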

Note that not all combinations of γ and α are possible. If we attempt to make the two groups very different in behavior (with a large γ value), it may be impossible to have the groups be similar in size. Legal values of α will be in the following range:

0 < α ≤ (1/n) Σ_{s=1}^{m} D(s) · s^(-γ)
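This bound can be checked directly; a small helper (our own naming, assuming the FLAG probabilities defined above) computes the largest legal α for a given γ and profile-size distribution:

```python
import numpy as np

def max_legal_alpha(profile_sizes, gamma):
    """Largest alpha for which every FLAG probability stays <= 1.

    The binding constraint is at the smallest profile size, where the
    unscaled probability s**(-gamma) is largest; with a minimum size of
    1 this reduces to (1/n) * sum over s of D(s) * s**(-gamma).
    """
    s = np.asarray(profile_sizes, dtype=float)
    base = s ** (-gamma)
    return base.sum() / (len(s) * base.max())
```

When all users have the same profile length, the bound is 1 (any split is achievable); the more skewed the distribution and the larger γ, the smaller the largest achievable group-B fraction.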
2.2. Generation Results

In Figure 1, we have set γ to 1.45, mirroring the overall distribution, and α to 0.4. (Note that with γ = 1.45, the maximum α value is 0.43 for the XING data set.) The figure shows the distribution of the expected value of the profile count at each size, using a log-log scale. The full profile size distribution is included for comparison. As can be seen, group B dominates in the lower profile sizes, where the bulk of the data lies. As the proportion of group B users gets smaller and smaller, the group A distribution approaches the original data.

Figure 1. Distributions for generated groups.
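The expected per-size counts plotted in Figure 1 can be computed directly from the profile-size histogram. A sketch, assuming the definitions above (the function name is our own):

```python
import numpy as np

def expected_counts_by_size(size_counts, gamma, alpha):
    """Expected number of group-A and group-B users at each profile size.

    size_counts: histogram D, where size_counts[i] is the number of
    users with profile size i + 1 (sizes run from 1 to m).
    Returns (expected_A, expected_B), arrays of length m.
    """
    d = np.asarray(size_counts, dtype=float)
    s = np.arange(1, len(d) + 1, dtype=float)
    base = s ** (-gamma)
    # FLAG membership probability at each profile size.
    p_b = alpha * d.sum() * base / (d * base).sum()
    return d * (1 - p_b), d * p_b
```

Plotting both returned arrays on a log-log scale against the original histogram reproduces the kind of comparison shown in the figure.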

Figure 2 shows expected profile sizes for a range of legal γ values with α held fixed. As γ increases, the behaviors of the two groups, as expressed in profile size, become increasingly different.

Figure 2. Expected profile sizes with increasing γ.

3. Examples

Recommendation data sets for public use that contain sensitive demographic characteristics are relatively rare. To demonstrate the validity of our synthetic data generation method, we worked with the well-known MovieLens 1M data set, which does contain a gender attribute for users. Figure 3 shows the distribution of profile lengths for this data. We can see that female users make up a minority of the user base (1709 female vs. 4331 male users) and tend to have shorter profiles (an average of 164 movies per male user, shown in light blue, and 144 movies per female user, in magenta; the total distribution is in dark blue). Note the relatively linear appearance of the distribution in the log-log plot, suggesting that a power law is an appropriate model.

Figure 3. Profile length distribution by gender.

We followed the procedure described above to generate a synthetic attribute associated with profile length, tuning the γ and α parameters to match the real distribution as closely as we could. Figure 4 shows one run of attribute generation using these tuned values, and indicates that the distribution of the synthetic feature matches the gender feature in the original data in many respects. In this run, the total number of group A profiles is 4592, with 1468 profiles in group B. Since this is a stochastic process, different runs produce slightly different results.

Figure 4. Profile length distribution by synthetic attribute.

We can apply a similar process to features associated with items, which would be needed for the evaluation of provider-side fairness (P-fairness) in situations where the demographics of providers are relevant. For example, jobs in minority-owned businesses might be considered a protected class in some job recommendation settings. Again, we turn to MovieLens, using genre information as the protected feature. Figure 5 shows the distribution of item profiles for movies with the “Documentary” feature as opposed to those without it. There are considerably fewer such movies (110 out of 3706), and they tend to have much smaller item profiles, meaning that documentaries do not tend to attract as many ratings as other movies in the data set.

Figure 5. Profile length distribution by “Documentary” genre.

For this data set, we generated a synthetic attribute with tuned values of γ and α, as shown in Figure 6. In this case, the fit is not quite as good: there are somewhat fewer group B movies, and they do not extend quite as far in profile length as the original data. Compared to the original attribute distribution, the slope of the synthetic distribution is too steep. However, further adjustment of γ produces illegal values. Such findings suggest that it may be necessary to augment the model with a third parameter that adjusts the power-law baseline value, to account for all distributions that may arise.

Figure 6. Profile length distribution by synthetic genre attribute.

4. Ethical Considerations

As noted earlier, the beneficial aim of this research is to enable experimental development of recommendation algorithms with improved fairness properties, an important goal given the prevalence of recommendation algorithms in commerce, social media, and other online settings. Without synthetic data, research into such systems becomes limited by the availability of data about real individuals and their sensitive demographic characteristics. A select group of researchers (especially in industry settings) may have access to such data, but the inability to share data and compare results inevitably slows research productivity and innovation. Synthetic data is therefore essential to progress towards fairness-aware recommender systems.

Two concerns might be raised about the ethical propriety of generating synthetic data for recommendation evaluation: de-anonymization and external validity. De-anonymization occurs when operations performed on the data make it possible to recover some aspects of it that were not intended to be made public: for example, the gender of a user, if this information was withheld in the original data release, or most significantly, the identity of a system user. Since recommendation data sets contain profiles of user activity, including consumer behavior, the revelation of user identity is considered an important risk. Data sets such as those associated with the RecSys Challenge are carefully anonymized precisely so that such recovery of individual identity is not possible.

Note first that the FLAG algorithm does not assign real demographic attributes to users. It generates a synthetic A or B label, which has no demographic meaning. We do know, based on the way the labels are generated, that a user assigned to group A is more likely to have a larger user profile, but that is only a probabilistic association. We assume that profile length is distributed according to a power law, and in such a distribution there will always be more users with small profiles, even in group A. It should also be noted that every user's profile length is readily derived in any recommendation data set. Therefore, access to the output of FLAG tells a researcher nothing about an anonymized user record that could not already be determined by simple inspection of the input data, namely the position of that user's profile length relative to the distribution as a whole.

The question of external validity asks whether demographic attributes of interest such as gender, race, age, etc. follow the type of distribution we assume and are linked to profile length as our model suggests. If such attributes have a very different relation to the input variable, then our synthetic attribute will fail to be a good stand-in for the real demographic attribute. The example of MovieLens above suggests that the labels generated are statistically similar to some user and item features, but this association would have to be verified in any domain where this technique is to be applied.

Since FLAG assigns labels probabilistically, it will not capture other aspects of the data that may be correlated with a demographic attribute. For example, our prior work demonstrated some differences in genre preferences between male and female users in the MovieLens 1M data set (Burke et al., 2018). The group A and group B users assigned by FLAG would not show these differences. This is a consequence of using a single dimension of user behavior to control attribute generation. Note that other features such as user age might also be associated with differences in profile length and any such variables would be conflated in synthetic attribute production. This is one reason to avoid any claim that FLAG is inferring unknown demographic aspects of users.

In order to make the assigned labels track additional aspects of user behavior, such as genre preference, these dimensions would have to be incorporated into the model. The model would then move closer to an inferential approach (inferring missing demographic attributes) rather than a synthetic one, in which the labels are meaningless although systematically assigned. This tension could be resolved by generating fully synthetic data, with an algorithm that generates all of the profile information and captures demographic differences as well. This is a much more complex challenge, which we leave for future work.

We believe that using the fairly neutral and domain-general profile-length characteristic is a good compromise between the competing concerns of avoiding attribute inference and avoiding unrealistic data. However, the question of the external validity of FLAG's attribute generation remains to be fully answered. It is possible that fairness results relative to synthetic data will not translate to real-world applications. This concern must be answered through additional research.

5. Conclusion

Fairness-aware recommendation research requires appropriate data for experimentation. However, sensitive demographic characteristics that are of most interest in areas where fairness is important are precisely those that are least likely to be disclosed. This paper has outlined the Frequency-Linked Attribute Generation (FLAG) algorithm for generating such attributes. We show it is possible to augment real user data with synthetic data designed to closely match the characteristics of the real attributes in distribution and linkage to user behavior. We provide a suggestive example of generating synthetic data in the MovieLens data set and show that we are able to reproduce a probabilistic association between a demographic attribute and profile length.


  • Abel et al. (2017) Fabian Abel, Yashar Deldjoo, Mehdi Elahi, and Daniel Kohlsdorf. 2017. Recsys challenge 2017: Offline and online evaluation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 372–373.
  • Avery (2003) Derek R Avery. 2003. Racial differences in perceptions of starting salaries: How failing to discriminate can perpetuate discrimination. Journal of Business and Psychology 17, 4 (2003), 439–450.
  • Burke et al. (2018) Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced Neighborhoods for Multi-sided Fairness in Recommendation. In Conference on Fairness, Accountability and Transparency. 202–214.
  • Burke (2017) Robin Burke. 2017. Multisided Fairness for Recommendation. In Workshop on Fairness, Accountability and Transparency in Machine Learning (FATML), Halifax, Nova Scotia. arXiv:1707.00093 [cs.CY], 5 pages.
  • Heckert et al. (2002) Teresa M Heckert, Heather E Droste, Patrick J Adams, Christopher M Griffin, Lisa L Roberts, Michael A Mueller, and Hope A Wallis. 2002. Gender differences in anticipated salary: Role of salary estimates for others, job characteristics, career paths, and job inputs. Sex roles 47, 3-4 (2002), 139–151.
  • Wanberg et al. (1996) Connie R Wanberg, John D Watt, and Deborah J Rumsey. 1996. Individuals without jobs: An empirical study of job-seeking behavior and reemployment. Journal of Applied Psychology 81, 1 (1996), 76.
  • Zemel et al. (2013) Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13). ACM, Atlanta, GA, USA, 325–333.