Reputation (In)dependence in Ranking Systems: Demographics Influence Over Output Disparities

05/25/2020 ∙ by Guilherme Ramos, et al. ∙ Association for Computing Machinery

Recent literature on ranking systems (RS) has considered users' exposure when they are the object of the ranking. Although items are the object of reputation-based RS, users also have a central role in this class of algorithms. Indeed, when ranking the items, user preferences are weighted by how relevant each user is in the platform (i.e., their reputation). In this paper, we formulate the concept of disparate reputation (DR) and study whether users characterized by sensitive attributes systematically get a lower reputation, leading to a final ranking that reflects their preferences less. We consider two demographic attributes, i.e., gender and age, and show that DR systematically occurs. Then, we propose a mitigation, which ensures that reputation is independent of the users' sensitive attributes. Experiments on real-world data show that our approach can overcome DR and also improve ranking effectiveness.




1. Introduction

Ranking algorithms are one of our primary forms of interaction with Web content, from search to recommendations. The fact that human beings are inherently part of the ranking process has become a topic of prime relevance. On the one hand, it is known that the users perceive highly ranked results as more reliable. Hence, a biased ranking would lead to a loss of trust in the system (Pan et al., 2007; Ramos et al., 2020). On the other hand, users can also be the object of the ranking (e.g., when dealing with job candidates in services such as LinkedIn), so the ranking position gives them a certain exposure (Singh and Joachims, 2018); if user exposure is affected by their sensitive attributes (such as gender), this might lead to undesired effects, such as discrimination (Hajian et al., 2016).

Item rankings are based on user preferences, and the literature has studied the impact of biased rankings on the users (Kulshrestha et al., 2019). To the best of our knowledge, the impact of users’ sensitive attributes in the scope of item ranking systems is an underexplored area. No study analyzes how the ranking systems’ underlying mechanisms might lead to a biased ranking w.r.t. users’ sensitive attributes.

To tackle this issue in-depth, we focus on a class of ranking systems where each user is given a different relevance by the ranking algorithm, based on a notion of reputation. Specifically, reputation-based ranking systems score the items by weighting user preferences with the reputation of each user in the system. Reputation may be automatically computed based on user behavior or notions of trust (Medo and Wakeling, 2010; Li et al., 2012; Saúde et al., 2020). These systems are a form of non-personalized ranking, useful when users are not logged in (e.g., movie rankings in IMDB) or to preserve the system in case of attacks (Li et al., 2012; Saúde et al., 2020, 2017).

Kamishima et al. (2018) recently introduced the concept of recommendation independence. Given a sensitive feature (either associated with the consumers, the providers, or the items), they present a framework to generate recommendations whose outcome is statistically independent of a specified sensitive feature. In this paper, we embrace a similar concept in the ranking systems domain. We propose a method to ensure that, for a given sensitive feature of the users, the reputation scores are not biased; that is, on average, the opinions of the different groups of users have the same importance. However, since no personalization exists in our class of ranking systems, our formulation drastically differs from that of Kamishima et al., and no comparison is possible.

To characterize our problem, we divide users into classes based on the demographic attributes that define them and introduce the concept of disparate reputation (DR), capturing whether users belonging to different classes are given systematically lower/higher reputation values. In our case, we compute reputation solely considering user preferences. If DR occurs, then the ranking exhibits a bias based on users’ sensitive attributes, thus not reflecting their preferences. This bias might lead to negative consequences, such as (i) unfairness towards the consumers, as users belonging to minorities might receive systematically worse results, (ii) unfairness towards providers whose items are targeted mainly to/preferred by minorities, as these items would systematically get a worse exposure, and (iii) trust issues for the platform as a whole, with underwhelmed minorities and providers possibly leaving the platform.

In this work, we show that DR systematically affects the generated rankings, considering two different attributes of the users (gender and age). This allows us to study and tackle the problem under binary and multiclass settings.¹ To avoid this phenomenon, we propose an algorithm that ensures that reputation is independent of users’ sensitive attributes. Further, the proposed additional step to introduce reputation independence may be included in any ranking system that computes rankings as a weighted average of the ratings. Results on real-world data show the effectiveness of our approach at generating rankings not affected by DR. Besides, thanks to reputation independence, the generated rankings are closer to the primary user preferences than those from state-of-the-art solutions.

¹ While gender is by no means a binary construct, to the best of our knowledge, no real-world dataset containing non-binary genders exists. By a binary setting, we mean that we consider a binary gender feature. By also dealing with the age attribute, we show that our problem and solution can be adapted, as is, to non-binary genders.

Our contributions can be summarized as follows: (i) we propose a metric to characterize DR based on users’ sensitive attributes, (ii) we present an algorithm to introduce reputation independence from sensitive attributes, (iii) we measure DR and the effects of our mitigation on real-world data.

2. Preliminaries

Notation. We denote sets by calligraphic letters, e.g., $\mathcal{U}$, $\mathcal{I}$, and $\mathcal{A}$. We denote a set of users by $\mathcal{U}$ and a set of items by $\mathcal{I}$. We denote a possibly sparse matrix of ratings that users in $\mathcal{U}$ give to items in $\mathcal{I}$ by $R$. We assume the ratings to be normalized (dividing by the maximum allowed rating) to be in $(0, 1]$, and we denote by $\Delta_R$ the difference between the maximum and the minimum normalized ratings. Hence, for $u \in \mathcal{U}$ and $i \in \mathcal{I}$, $R_{ui} = 0$ if user $u$ did not rate item $i$, and it is positive otherwise. We denote a set of user attributes by $\mathcal{A}$, which correspond, for example, to gender, age, etc. An attribute $a \in \mathcal{A}$ has different classes. For instance, the attribute gender has two or more classes, such as male and female. We denote the classes of an attribute $a$ by $a_1, \ldots, a_n$. If user $u$ belongs to class $a_j$ for attribute $a$, then we write $a(u) = a_j$. We denote the set of users that rated item $i$ by $\mathcal{U}_i$, the set of items that user $u$ rated by $\mathcal{I}_u$, and the set of users with class $a_j$ of attribute $a$ by $\mathcal{U}_{a_j}$. In this work, we assume that if an attribute has classes $a_1, \ldots, a_n$, then $\mathcal{U}_{a_j} \cap \mathcal{U}_{a_k} = \emptyset$ for all $j \neq k$. Given a vector $v$, we denote its average by $\mu(v)$ and its standard deviation by $\sigma(v)$.

Problem Statement. Given a dataset with users $\mathcal{U}$, items $\mathcal{I}$, ratings $R$ given by users to items, and an attribute $a \in \mathcal{A}$ with classes $a_1, \ldots, a_n$, we would like to achieve the following:

(i) compute users’ reputation based on user preferences; reputation should capture how relevant the preferences of a user are for the rest of the community, thus excluding the trivial reputation assignment that yields equal reputation for each user;

(ii) compute the items’ rankings as an average of the items’ ratings weighted by users’ reputations;

(iii) obtain reputation distributions that, for each pair of user sets $\mathcal{U}_{a_j}$ and $\mathcal{U}_{a_k}$, where $a_j$ and $a_k$ are classes of the same attribute, are statistically indistinguishable (reputation independence).

3. Reputation-based Ranking Systems

In (Li et al., 2012), Li et al. proposed a reputation-based RS that is an iterative scheme that converges with exponential rate and is more robust to attacks than the arithmetic average (AA). At each iteration, the system: (i) estimates the ranking, $r_i$, of each item $i$ by combining the ratings given to the item with the reputations, $c_u$, of the users that rated the item; (ii) estimates the users’ reputation by measuring how different the user’s ratings are from the items’ rankings estimated in (i). Specifically, for the variant L1-AVG, for $k \geq 0$,

$$r_i^{k+1} = \frac{1}{|\mathcal{U}_i|} \sum_{u \in \mathcal{U}_i} R_{ui}\, c_u^{k}, \qquad c_u^{k+1} = 1 - \frac{\lambda}{|\mathcal{I}_u|} \sum_{i \in \mathcal{I}_u} \left| R_{ui} - r_i^{k+1} \right|, \tag{1}$$

for any initial $c^{0}$, where $\mathcal{U}_i$ is the set of users that rated item $i$, $\mathcal{I}_u$ is the set of items rated by user $u$, and $\lambda$ is a hyper-parameter that penalizes the disagreement of ratings with rankings for each user. We denote by $c$ the vector collecting users’ reputations in the same order as $\mathcal{U}$. However, this RS presents some unintuitive properties. The ranking is a weighted sum divided by the number of terms in the summation, instead of being normalized by the sum of the weights. Hence, an item for which all the given ratings are maximum will yield a ranking that is not the maximum as long as at least one of the users that rated the item has reputation below 1. Therefore, in (Saúde et al., 2020), the system in (1) was enhanced, adjusting the ranking computation of (1) to be

$$r_i^{k+1} = \frac{\sum_{u \in \mathcal{U}_i} R_{ui}\, c_u^{k}}{\sum_{u \in \mathcal{U}_i} c_u^{k}}. \tag{2}$$

This RS also converges with exponential rate and is more robust to attacks than (1).
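The two ranking rules can be contrasted on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes the L1-AVG reputation update (one minus the penalized mean absolute disagreement), an initial reputation of 1 for every user, and a hypothetical penalty value `lam=0.5`; the function names and the toy ratings are illustrative only.

```python
def rank_items(ratings, reputation, normalize_by_weights=True):
    """Reputation-weighted ranking of each item.

    normalize_by_weights=False divides by the number of raters, as in (1);
    normalize_by_weights=True divides by the sum of the weights, as in (2).
    """
    num, den = {}, {}
    for (u, i), r in ratings.items():
        num[i] = num.get(i, 0.0) + r * reputation[u]
        den[i] = den.get(i, 0.0) + (reputation[u] if normalize_by_weights else 1.0)
    return {i: num[i] / den[i] for i in num}


def update_reputation(ratings, ranking, lam):
    """Reputation = 1 - lam * mean absolute disagreement with the rankings."""
    total, count = {}, {}
    for (u, i), r in ratings.items():
        total[u] = total.get(u, 0.0) + abs(r - ranking[i])
        count[u] = count.get(u, 0) + 1
    return {u: 1.0 - lam * total[u] / count[u] for u in total}


def run(ratings, lam=0.5, iterations=50, normalize_by_weights=True):
    """Alternate ranking and reputation updates for a fixed number of steps."""
    reputation = {u: 1.0 for (u, _) in ratings}  # assumed initialization
    ranking = {}
    for _ in range(iterations):
        ranking = rank_items(ratings, reputation, normalize_by_weights)
        reputation = update_reputation(ratings, ranking, lam)
    return ranking, reputation


# Toy data: (user, item) -> normalized rating. Item "x" gets the maximum
# rating from everyone; user "c" disagrees with the others on item "y".
ratings = {
    ("a", "x"): 1.0, ("b", "x"): 1.0, ("c", "x"): 1.0,
    ("a", "y"): 0.8, ("b", "y"): 0.8, ("c", "y"): 0.2,
}
rank_v1, rep_v1 = run(ratings, normalize_by_weights=False)
rank_v2, rep_v2 = run(ratings, normalize_by_weights=True)
print("ranking of x with (1):", rank_v1["x"])
print("ranking of x with (2):", rank_v2["x"])
```

In this sketch, item "x" stays at the maximum ranking only when the weighted sum is normalized by the sum of the weights, which is the unintuitive property of (1) described above.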

4. Reputation Independence

This section presents our metric to characterize DR and our mitigation algorithm to ensure reputation independence from sensitive attributes of the users.

Characterizing disparate reputation (DR). Let $a$ be an attribute of the users, with classes $a_1, \ldots, a_n$. Considering two classes $a_1$ and $a_2$ (in the same attribute), we denote as $\mu_{a_1}$ and $\mu_{a_2}$ the average reputation of the users characterized by each class.

We are interested in studying whether a common attribute characterizes polarized reputations for groups of users, for example, whether there is a bias in the reputation-based RS such that the opinions of users coming from different classes (e.g., males and females) contribute differently to the ranking computation.

To characterize whether users belonging to different classes have different reputation scores, we define the disparate reputation metric, computed as $\mathrm{DR}(a_1, a_2) = \mu_{a_1} - \mu_{a_2}$. The metric ranges in $[-1, 1]$; it is $0$ when both averages of the reputations are the same ($\mu_{a_1} = \mu_{a_2}$). Negative values indicate that class $a_2$ has users with higher reputation values and, vice versa, positive values indicate higher reputation values for class $a_1$.
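A minimal sketch of the DR computation, assuming the metric is the difference between the average reputations of the two classes (consistent with its stated range, its zero value for equal averages, and its sign convention). The reputation values and class labels below are made up for illustration.

```python
from statistics import mean


def disparate_reputation(reputation, user_class, class_1, class_2):
    """Difference between the average reputations of two classes of users."""
    avg_1 = mean(r for u, r in reputation.items() if user_class[u] == class_1)
    avg_2 = mean(r for u, r in reputation.items() if user_class[u] == class_2)
    return avg_1 - avg_2


# Hypothetical reputations and class labels.
reputation = {"u1": 0.9, "u2": 0.7, "u3": 0.6, "u4": 0.8}
user_class = {"u1": "M", "u2": "M", "u3": "F", "u4": "F"}

dr = disparate_reputation(reputation, user_class, "M", "F")
dr_swapped = disparate_reputation(reputation, user_class, "F", "M")
print(dr, dr_swapped)  # positive for (M, F), same magnitude with opposite sign for (F, M)
```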

To characterize whether disparate reputation systematically affects the users belonging to a class, we propose applying a statistical test, the Mann-Whitney (MW) test (Mann and Whitney, 1947), to each pair of classes of the attribute. It is a nonparametric test whose null hypothesis is that a randomly selected value from one population is equally likely to be less than or greater than a randomly selected value from the other population. This test is often used to assess whether two independent samples were selected from populations with the same distribution.
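For illustration, the MW test can be sketched in plain Python. The sketch below assumes a two-sided test with the normal approximation and tie-corrected variance, without continuity correction; the source does not state which variant or software was used, so p-values may differ slightly from those of a statistics package. The function name and the samples are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value) via normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, k = 0.0, 0.0, 0
    while k < n:
        j = k
        while j < n and pooled[j][0] == pooled[k][0]:
            j += 1
        midrank = (k + 1 + j) / 2.0          # average of ranks k+1, ..., j
        ties = j - k
        tie_term += ties ** 3 - ties         # used in the variance correction
        rank_sum_x += midrank * sum(1 for m in range(k, j) if pooled[m][1] == 0)
        k = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                           # all values identical
        return u_x, 1.0
    z = (u_x - mean_u) / sqrt(var_u)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return u_x, p_value


# Two clearly separated samples versus two interleaved samples.
u_sep, p_sep = mann_whitney_u(list(range(30)), list(range(30, 60)))
u_mix, p_mix = mann_whitney_u([1, 4, 5, 8], [2, 3, 6, 7])
print(p_sep < 0.05, p_mix)
```

With a 5% level, the null hypothesis would be rejected for the separated samples and not rejected for the interleaved ones.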

Reputation independence. To prevent users’ sensitive attributes from systematically affecting the ranking system, we design a strategy that, given a sensitive attribute of the users in the system, mitigates the bias in the user reputations for each group of users with different values for that attribute, thus making the reputation computation independent of the sensitive attribute.

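Given a reputation-based RS that updates the items’ ranking as a weighted average of ratings with users’ reputations (such as (1) and (2)), we propose to harmonize users’ reputations inside each group of a specific attribute, to achieve a similar distribution of reputations among the groups. If we use (1) or (2) to compute rankings and reputations for $K$ iterations, we take the resulting $r^{K}$ and $c^{K}$ and perform the following additional step to ensure independence for attribute $a$, with classes $a_1, \ldots, a_n$: for each user $u$ in class $a_j$ and each item $i$,

$$\tilde{c}_u = \left( c_u^{K} - \mu_{a_j} \right) \frac{\sigma}{\sigma_{a_j}} + \mu, \qquad \tilde{r}_i = \frac{\sum_{u \in \mathcal{U}_i} R_{ui}\, \tilde{c}_u}{\sum_{u \in \mathcal{U}_i} \tilde{c}_u}, \tag{3}$$

where $\mu = \min_j \mu_{a_j}$ and $\sigma = \min_j \sigma_{a_j}$, with $\mu_{a_j}$ and $\sigma_{a_j}$ the average and the standard deviation of the reputations $c_u^{K}$ of the users in class $a_j$.
Observe that, in Equation (3), we select the minimum between the averages and the minimum between the standard deviations to ensure that the readjusted reputations still lie in the interval $[0, 1]$. So, Equation (3) harmonizes the reputation distributions for each class of an attribute to follow a common probability distribution, ensuring that the reputations of each class are “statistically indistinguishable”.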
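The harmonization step can be sketched as follows, under the assumption that each class's reputations are standardized and then rescaled to the minimum class average and the minimum class standard deviation, as the text describes. The use of the population standard deviation, the handling of zero-variance classes, the function name, and the data are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def harmonize(reputation, user_class):
    """Rescale reputations so every class shares the same mean and spread."""
    groups = {}
    for u, c in user_class.items():
        groups.setdefault(c, []).append(reputation[u])
    mu = {c: mean(values) for c, values in groups.items()}
    sigma = {c: pstdev(values) for c, values in groups.items()}  # assumed: population std
    mu_min, sigma_min = min(mu.values()), min(sigma.values())
    adjusted = {}
    for u, c in user_class.items():
        if sigma[c] == 0:
            adjusted[u] = mu_min  # degenerate class: all reputations equal
        else:
            adjusted[u] = (reputation[u] - mu[c]) * sigma_min / sigma[c] + mu_min
    return adjusted


# Hypothetical reputations after the iterative scheme, with class labels.
reputation = {"u1": 0.9, "u2": 0.7, "u3": 0.6, "u4": 0.8, "u5": 0.5}
user_class = {"u1": "M", "u2": "M", "u3": "F", "u4": "F", "u5": "F"}

harmonized = harmonize(reputation, user_class)
mean_m = mean(harmonized[u] for u in ("u1", "u2"))
mean_f = mean(harmonized[u] for u in ("u3", "u4", "u5"))
print(mean_m - mean_f)  # the class averages coincide up to rounding
```

After this step, the rankings would be recomputed as the reputation-weighted average of the ratings using the adjusted reputations.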

5. Evaluation on Real Data

Capturing and dealing with DR is not a trivial task due to the lack of public datasets with ratings and sensitive attributes of the users. This led us to this preliminary study, whose goal is to illustrate the problem and to validate it considering different sensitive attributes.

Here, we compare the state-of-the-art RS proposed in (Saúde et al., 2020) and computed as in (2), and the solution we introduce in (3). We do this both in terms of DR and ranking effectiveness.

We use the MovieLens-1M dataset, which has 1 000 209 ratings from 6 040 users to 3 952 items. We evaluate our work on the users’ attributes gender and age, available in the dataset.

5.1. Evaluating Disparate Reputation

Attribute: Gender. First, we investigate whether there is bias in users’ reputations under the attribute gender. We start by characterizing DR through the box-and-whisker chart (BWC) of the reputations, see Figure 1 (a). We then use the proposed DR metric to assess bias concerning the attribute gender. Using solely Equation (2), we obtain a nonzero DR whose sign shows that the class of male users has, on average, larger reputation values than the class of female users, yielding a bias on the attribute gender (as we also observe in Figure 1 (a)). Next, we test the null hypothesis that the median difference is 0 at the 5% level based on the MW test. The hypothesis is rejected. Hence, we confirm that there is bias in the reputations for these two classes.

Figure 1. BWC for reputations of users resulting from (2) in (a), and from (2) and (3) in (b), for groups of users under the attribute gender.

To mitigate the bias, we compute the reputation and the ranking as presented in Equation (3), obtaining the BWC for reputations of Figure 1 (b). In this case, the resulting DR shows that we mitigate the bias on the attribute gender (as we also observe in Figure 1 (b)). This time, the null hypothesis that the median difference is 0 is not rejected at the 5% level based on the MW test. This result confirms that we successfully mitigated the bias on the reputations for these two classes.


Table 1. MW tests for the reputations resulting from (2) of each pair of classes, marking for each pair whether $H_0$ is rejected or not rejected.

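| DR($a_i$, $a_j$) | $a_2$ | $a_3$ | $a_4$ | $a_5$ | $a_6$ | $a_7$ |
|---|---|---|---|---|---|---|
| $a_1$ | -0.0089 | -0.0142 | -0.0161 | -0.0159 | -0.0153 | -0.0164 |
| $a_2$ | | -0.0053 | -0.0072 | -0.0070 | -0.0064 | -0.0075 |
| $a_3$ | | | -0.0019 | -0.0017 | -0.0011 | -0.0022 |
| $a_4$ | | | | 0.0002 | 0.0008 | -0.0003 |
| $a_5$ | | | | | 0.0006 | -0.0005 |
| $a_6$ | | | | | | -0.0011 |

Table 2. DR of reputations resulting from (2), for attribute age, where $a_1, \ldots, a_7$ denote the age classes (rows $a_i$, columns $a_j$).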

Attribute: Age. Subsequently, we perform a similar analysis for the attribute age, where we consider groups of users in the seven age classes available in the dataset (under 18, 18–24, 25–34, 35–44, 45–49, 50–55, and 56+). We characterize whether DR occurs by computing the box-and-whisker chart (BWC) for the reputations, see Figure 2 (a).

The DR metric, when only (2) is used, yields the results in Table 2, which reveal the existence of bias. We only filled the upper-triangular part of the table because DR anti-commutes, so the lower-triangular part is the negative of the upper-triangular one. The results of the MW test for the reputations resulting from (2) for each pair of classes, for the null hypothesis that the median difference is 0 ($H_0$) at the 5% significance level vs. the alternative hypothesis that the median difference is not 0 ($H_1$), are summarized in Table 1. We only filled the upper-triangular part of the table since the MW test commutes.
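As a quick consistency check on the reported values, and assuming that DR is the difference between the class averages, DR values must add up along a chain of classes. Reading the 21 reported entries of Table 2 row by row as the upper triangle of a 7×7 table, and writing $a_1, a_2, a_3$ for the first three age classes in the table's order (labels chosen here for illustration):

```latex
\begin{align*}
\mathrm{DR}(a_i, a_j) &= \mu_{a_i} - \mu_{a_j} = -\,\mathrm{DR}(a_j, a_i), \\
\mathrm{DR}(a_1, a_3) &= \mathrm{DR}(a_1, a_2) + \mathrm{DR}(a_2, a_3)
                       = -0.0089 + (-0.0053) = -0.0142 .
\end{align*}
```

The sum agrees with the reported entry of $-0.0142$, and the same additivity holds for the next entry: $-0.0142 + (-0.0019) = -0.0161$.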

When we mitigate bias for the attribute age with (3), we achieve the results in Table 3 and Figure 2 (b). Now, the null hypothesis that the reputations’ median difference is 0 ($H_0$) is not rejected at the 5% significance level, using the MW test, for any pair of classes under attribute age (the table is not reported due to space constraints).

Figure 2. BWC for reputations of users resulting from (2) in (a), and from (2) and (3) in (b), for groups of users under the attribute age.

() 0 0

Table 3. DR of reputations resulting from (2) and (3), for attribute age.

5.2. Evaluating Effectiveness

No attribute Gender Age
Table 4. Kendall Tau ($\tau$) using AA rankings as the ground truth and rankings obtained with (2).

Finally, to evaluate the effectiveness of the proposed method, we use the Kendall Tau (Kendall, 1938) with AA as the ground truth, as done in (Li et al., 2012). We report the observed $\tau$ for each of the attributes considered in Table 4. We notice that the Kendall Tau improves when we mitigate bias relative to an attribute. This improvement means that our approach yields an ordering of the items closer to that of the AA than (2) does, while still assigning different relevance to different users.
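To make the effectiveness measure concrete, the sketch below computes a Kendall Tau between two sets of item scores. It assumes the tau-a variant, in which tied pairs count as neither concordant nor discordant; the source does not say which variant is used. The function name and the scores are made up for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dictionaries over their common items."""
    items = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for i, j in combinations(items, 2):
        agreement = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores: the arithmetic average (ground truth) versus a
# reputation-based ranking that swaps the order of items "x" and "y".
aa_scores = {"w": 4, "x": 3, "y": 2, "z": 1}
rs_scores = {"w": 4, "x": 2, "y": 3, "z": 1}
reversed_scores = {"w": 1, "x": 2, "y": 3, "z": 4}

tau_same = kendall_tau(aa_scores, aa_scores)
tau_swap = kendall_tau(aa_scores, rs_scores)
tau_rev = kendall_tau(aa_scores, reversed_scores)
print(tau_same, tau_swap, tau_rev)
```

A value closer to 1 means the ordering is closer to the one induced by the AA.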

The reputation concept treats users differently, which may lead to a ranking with a bias for specific users’ attributes. With the proposed approach, for a specific attribute, we mitigate bias. With our method, the concept of reputation still plays a role inside each group with a particular attribute value, but it does not cause bias. So, we get “closer” to the average in the sense that the AA does not treat groups differently, and with our approach, we also do not treat groups differently for a given attribute.

6. Conclusions

Reputation-based ranking systems try to rank items by ensuring the preferences of the community as a whole are reflected in the way items are sorted. In this sense, computing effective formulations of user reputation, to weight the individual preferences, is vital.

In this work, we introduce a measure of disparate reputation (DR), to analyze if user reputation is affected by users’ sensitive attributes. To avoid this, we introduce a novel approach that ensures reputation independence from sensitive user attributes. Experiments on real data, which considered different demographic attributes of the users, showed that DR occurs in state-of-the-art approaches and that our mitigation can introduce reputation independence from sensitive attributes and, at the same time, increase ranking quality.

Avenues for further research include exploring further datasets, considering attributes with possibly not disjoint classes, and specifying multiple attributes to mitigate bias.


This work was supported in part by FCT project POCI-01-0145-FEDER-031411-HARMONY. G. Ramos further acknowledges the support of Institute for Systems and Robotics, Instituto Superior Técnico (Portugal), through scholarship BL229/2018_IST-ID. L. Boratto acknowledges Agència per a la Competitivitat de l’Empresa, ACCIÓ, for their support under project “Fair and Explainable Artificial Intelligence (FX-AI)”.


  • S. Hajian, F. Bonchi, and C. Castillo (2016) Algorithmic bias: from discrimination discovery to fairness-aware data mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2125–2126. Cited by: §1.
  • T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma (2018) Recommendation independence. In Conference on Fairness, Accountability and Transparency, FAT 2018, Proceedings of Machine Learning Research, Vol. 81, pp. 187–201. Cited by: §1.
  • M. G. Kendall (1938) A new measure of rank correlation. Biometrika 30 (1/2), pp. 81–93. Cited by: §5.2.
  • J. Kulshrestha, M. Eslami, J. Messias, M. B. Zafar, S. Ghosh, K. P. Gummadi, and K. Karahalios (2019) Search bias quantification: investigating political bias in social media and web search. Inf. Retr. Journal 22 (1-2), pp. 188–227. Cited by: §1.
  • R. Li, J. Xu Yu, X. Huang, and H. Cheng (2012) Robust reputation-based ranking on bipartite rating networks. In Proceedings of the 2012 SIAM international conference on data mining, pp. 612–623. Cited by: §1, §3, §5.2.
  • H. B. Mann and D. R. Whitney (1947) On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pp. 50–60. Cited by: §4.
  • M. Medo and J. R. Wakeling (2010) The effect of discrete vs. continuous-valued ratings on reputation and ranking systems. EPL (Europhysics Letters) 91 (4), pp. 48004. Cited by: §1.
  • B. Pan, H. Hembrooke, T. Joachims, L. Lorigo, G. Gay, and L. Granka (2007) In google we trust: users’ decisions on rank, position, and relevance. Journal of Computer-Mediated Communication 12 (3), pp. 801–823. Cited by: §1.
  • G. Ramos, L. Boratto, and C. Caleiro (2020) On the negative impact of social influence in recommender systems: a study of bribery in collaborative hybrid algorithms. Information Processing & Management 57 (2), pp. 102058. External Links: ISSN 0306-4573 Cited by: §1.
  • J. Saúde, G. Ramos, L. Boratto, and C. Caleiro (2020) A robust reputation-based group ranking system and its resistance to bribery. arXiv, pp. arXiv–2004. Cited by: §1, §3, §5.
  • J. Saúde, G. Ramos, C. Caleiro, and S. Kar (2017) Reputation-based ranking systems and their resistance to bribery. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 1063–1068. Cited by: §1.
  • A. Singh and T. Joachims (2018) Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, pp. 2219–2228. Cited by: §1.