1. Introduction
Ranking algorithms are one of our primary forms of interaction with Web content, from search to recommendations. The fact that human beings are inherently part of the ranking process has become a topic of prime relevance. On the one hand, it is known that users perceive highly ranked results as more reliable; hence, a biased ranking would lead to a loss of trust in the system (Pan et al., 2007; Ramos et al., 2020). On the other hand, users can also be the object of the ranking (e.g., when dealing with job candidates in services such as LinkedIn), so the ranking position gives them a certain exposure (Singh and Joachims, 2018); if user exposure is affected by their sensitive attributes (such as gender), this might lead to undesired effects, such as discrimination (Hajian et al., 2016).
Item rankings are based on user preferences, and the literature has studied the impact of biased rankings on the users (Kulshrestha et al., 2019). To the best of our knowledge, the impact of users’ sensitive attributes in the scope of item ranking systems is an underexplored area. No study analyzes how the ranking systems’ underlying mechanisms might lead to a biased ranking w.r.t. users’ sensitive attributes.
To tackle this issue in depth, we focus on a class of ranking systems where each user is given a different relevance by the ranking algorithm, based on a notion of reputation. Specifically, reputation-based ranking systems score the items by weighting user preferences with the reputation of each user in the system. Reputation may be automatically computed based on user behavior or notions of trust (Medo and Wakeling, 2010; Li et al., 2012; Saúde et al., 2020). These systems are a form of non-personalized ranking, useful when users are not logged in (e.g., movie rankings in IMDb) or to preserve the system in case of attacks (Li et al., 2012; Saúde et al., 2020, 2017).
Kamishima et al. (Kamishima et al., 2018) recently introduced the concept of recommendation independence. Given a sensitive feature (associated with the consumers, the providers, or the items), they present a framework to generate recommendations whose outcome is statistically independent of that feature. In this paper, we embrace a similar concept in the ranking systems domain. We propose a method to ensure that, for a given sensitive user feature, the reputation scores are not biased, i.e., the opinions of users in each class have, on average, the same importance. However, since no personalization exists in our class of ranking systems, our formulation drastically differs from that of Kamishima et al., and no comparison is possible.
To characterize our problem, we divide users into classes based on the demographic attributes that define them and introduce the concept of disparate reputation (DR), capturing whether users belonging to different classes are given systematically lower/higher reputation values. In our case, we compute reputation solely considering user preferences. If DR occurs, then the ranking exhibits a bias based on users' sensitive attributes, thus not reflecting their preferences. This bias might lead to negative consequences, such as (i) unfairness towards the consumers, as users belonging to minorities might receive systematically worse results, (ii) unfairness towards providers whose items are mainly targeted to/preferred by minorities, as these items would systematically get worse exposure, and (iii) trust issues for the platform as a whole, with dissatisfied minority users and providers possibly leaving the platform.
In this work, we show that DR systematically affects the generated rankings, considering two different attributes of the users (gender and age). This allows us to study and tackle the problem under binary and multi-class settings.¹ To avoid this phenomenon, we propose an algorithm that ensures that reputation is independent of users' sensitive attributes. Further, the proposed additional step to introduce reputation independence may be included in any ranking system that computes rankings as a weighted average of the ratings. Results on real-world data show the effectiveness of our approach at generating rankings not affected by DR. Besides, thanks to reputation independence, the generated rankings are closer to the primary user preferences, w.r.t. those from state-of-the-art solutions.

¹ While gender is by no means a binary construct, to the best of our knowledge, no real-world dataset containing non-binary genders exists. By a binary setting, we mean that we consider a binary gender feature. By also dealing with the age attribute, we show that our problem and solution can be adapted, as is, to non-binary genders.
Our contributions can be summarized as follows: (i) we propose a metric to characterize DR based on users' sensitive attributes, (ii) we present an algorithm to introduce reputation independence from sensitive attributes, and (iii) we measure DR and the effects of our mitigation on real-world data.
2. Preliminaries
Notation. We denote sets by calligraphic letters, e.g., $\mathcal{U}$, $\mathcal{I}$, and $\mathcal{A}$. We denote a set of users by $\mathcal{U}$ and a set of items by $\mathcal{I}$. We denote the (possibly sparse) matrix of ratings that users in $\mathcal{U}$ give to items in $\mathcal{I}$ by $R$. We assume the ratings to be normalized (dividing by the maximum allowed rating) so that they lie in $[0,1]$, and we denote by $\Delta$ the difference between the maximum and the minimum normalized ratings. Hence, for $u \in \mathcal{U}$ and $i \in \mathcal{I}$, $R_{ui} = 0$ if user $u$ did not rate item $i$, and it is positive otherwise. We denote a set of user attributes by $\mathcal{A}$; attributes correspond, for example, to gender, age, etc. An attribute $a \in \mathcal{A}$ has $k$ different classes (for instance, the attribute gender has two or more classes), which we denote by $A_1, \ldots, A_k$. If user $u$ belongs to class $A_p$ of attribute $a$, then we write $u \in A_p$. We denote the set of users that rated item $i$ by $\mathcal{U}_i$ and the set of items that user $u$ rated by $\mathcal{I}_u$. In this work, we assume that the classes of an attribute are disjoint, i.e., if an attribute has classes $A_1, \ldots, A_k$, then $A_p \cap A_q = \emptyset$ for all $p \neq q$. Given a vector $v$, we denote its average by $\mu(v)$ and its standard deviation by $\sigma(v)$.

Problem Statement. Given a dataset with users $\mathcal{U}$, items $\mathcal{I}$, ratings $R$ given by users to items, and an attribute $a \in \mathcal{A}$ with classes $A_1, \ldots, A_k$, we would like to achieve the following:
compute users’ reputations based on user preferences; reputation should capture how relevant the preferences of a user are for the rest of the community, thus excluding the trivial assignment that yields equal reputation for every user;
compute the items’ rankings as an average of the items’ ratings weighted by the users’ reputations;
obtain reputation distributions that, for each pair of user sets $A_p$ and $A_q$, where $A_p$ and $A_q$ are classes of the same attribute, are statistically indistinguishable (reputation independence).
3. Reputation-based Ranking Systems
In (Li et al., 2012), Li et al. proposed a reputation-based RS: an iterative scheme that converges at an exponential rate and is more robust to attacks than the arithmetic average (AA). At each iteration, the system: (i) estimates the ranking $r_i$ of each item $i$ by combining the ratings given to the item with the reputations $c_u$ of the users $u$ that rated it; (ii) estimates the users' reputations by measuring how much each user's ratings differ from the items' rankings estimated in (i). Specifically, for the variant L1-AVG, for $t = 0, 1, \ldots$,

(1)
$$r_i^{(t+1)} = \frac{1}{|\mathcal{U}_i|} \sum_{u \in \mathcal{U}_i} c_u^{(t)} R_{ui}, \qquad c_u^{(t+1)} = 1 - \frac{\lambda}{\Delta} \cdot \frac{1}{|\mathcal{I}_u|} \sum_{i \in \mathcal{I}_u} \left| R_{ui} - r_i^{(t+1)} \right|,$$

for any initial reputation vector $c^{(0)}$ (we opt to set $c^{(0)}$ to the all-ones vector), where $\lambda$ is a hyperparameter that penalizes the disagreement of each user's ratings with the rankings. We denote by $c$ the vector collecting users' reputations in the same order as $\mathcal{U}$. However, this RS presents some unintuitive properties. The ranking is a weighted sum divided by the number of terms in the summation, instead of being normalized by the sum of the weights. Hence, an item whose given ratings are all maximal will yield a ranking that is not maximal as long as at least one of the users that rated the item has reputation below 1. Therefore, in (Saúde et al., 2020), the system in (1) was enhanced, adjusting the ranking computation of (1) to be
(2)
$$r_i^{(t+1)} = \frac{\sum_{u \in \mathcal{U}_i} c_u^{(t)} R_{ui}}{\sum_{u \in \mathcal{U}_i} c_u^{(t)}}.$$

This RS also converges at an exponential rate and is more robust to attacks than (1).
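As a rough illustration, the iterations of (1) and (2) can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes a dense rating matrix with a boolean mask for existing ratings, an all-ones initial reputation vector, and a penalty scaled by `lam` alone (the $\Delta$ normalization is folded into `lam` for simplicity); `reputation_ranking` is a hypothetical helper name.

```python
import numpy as np

def reputation_ranking(R, mask, lam=0.5, iters=50, normalize=True):
    """Iterative reputation-based ranking (sketch of the L1-AVG scheme).

    R    : (n_users, n_items) matrix of ratings normalized to [0, 1]
    mask : boolean matrix, True where a rating exists
    lam  : hyperparameter penalizing rating/ranking disagreement
    normalize : if True, divide by the sum of reputations (Eq. 2);
                otherwise by the number of raters (Eq. 1)
    """
    n_users, n_items = R.shape
    c = np.ones(n_users)                        # initial reputations, all ones
    for _ in range(iters):
        weights = mask * c[:, None]             # reputation of each rater
        num = (weights * R).sum(axis=0)
        den = weights.sum(axis=0) if normalize else mask.sum(axis=0)
        # item rankings: reputation-weighted combination of the ratings
        r = np.divide(num, den, out=np.zeros(n_items), where=den > 0)
        # reputations: 1 minus the scaled average |rating - ranking| gap
        gap = (np.abs(R - r[None, :]) * mask).sum(axis=1)
        n_rated = mask.sum(axis=1)
        c = 1.0 - lam * np.divide(gap, n_rated,
                                  out=np.zeros(n_users), where=n_rated > 0)
    return r, c
```

With `normalize=True`, an item whose existing ratings are all maximal obtains the maximal ranking regardless of its raters' reputations, which is exactly the property that motivates (2).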
4. Reputation Independence
This section presents our metric to characterize DR and our mitigation algorithm to ensure reputation independence from sensitive attributes of the users.
Characterizing disparate reputation (DR). Let $a$ be an attribute of the users, with classes $A_1, \ldots, A_k$. Considering two classes $A_p$ and $A_q$ of the same attribute, we denote by $\mu(c_{A_p})$ and $\mu(c_{A_q})$ the average reputation of the users belonging to each class.

We are interested in studying whether a common attribute characterizes polarized reputations for groups of users, i.e., whether there is a bias in the reputation-based RS such that the opinions of users coming from different classes (e.g., males and females) contribute differently to the ranking computation.

To characterize whether users belonging to different classes have different reputation scores, we define the disparate reputation metric, computed as
$$\mathrm{DR}(A_p, A_q) = \mu(c_{A_p}) - \mu(c_{A_q}).$$
The metric ranges in $[-1, 1]$; it is $0$ when both averages of the reputations are the same ($\mu(c_{A_p}) = \mu(c_{A_q})$). Negative values indicate that class $A_q$ has users with higher reputation values, and vice versa for class $A_p$ and positive values.
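As an illustration, the metric amounts to a one-line computation over the reputation vector; this is a sketch under the reconstruction above, with `disparate_reputation` a hypothetical helper name and the index arrays assumed inputs.

```python
import numpy as np

def disparate_reputation(c, class_p, class_q):
    """DR(A_p, A_q): difference between the mean reputations of two classes.

    c       : array of user reputations in [0, 1]
    class_p : indices of the users in class A_p
    class_q : indices of the users in class A_q
    Returns a value in [-1, 1]; 0 means no disparity, and negative values
    mean class A_q has, on average, higher reputations (and vice versa).
    """
    return c[class_p].mean() - c[class_q].mean()
```

Note that swapping the two classes flips the sign, i.e., the metric anticommutes.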
To characterize whether disparate reputation systematically affects the users belonging to a class, we propose to apply a statistical test, the Mann-Whitney (MW) test (Mann and Whitney, 1947), to each pair of classes in $\{A_1, \ldots, A_k\}$. It is a non-parametric test whose null hypothesis is that it is equally likely that a randomly selected value from one population will be less than or greater than a randomly selected value from the other population. This test is often used to scrutinize whether two independent samples were drawn from populations with the same distribution.
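A minimal sketch of applying the MW test to the reputations of two classes, using SciPy's implementation; the reputation values below are synthetic assumptions for illustration, not the paper's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical reputations for two classes of an attribute (assumed data,
# drawn at random here just to illustrate the test).
rng = np.random.default_rng(0)
rep_p = rng.uniform(0.6, 0.9, size=200)   # reputations of class A_p
rep_q = rng.uniform(0.4, 0.7, size=200)   # reputations of class A_q

# H0: a value drawn from one class is equally likely to be smaller or
# larger than a value drawn from the other class.
stat, p_value = mannwhitneyu(rep_p, rep_q, alternative="two-sided")
biased = p_value < 0.05   # reject H0 at the 5% level -> disparate reputation
```

For a multi-class attribute such as age, the test would be repeated for every pair of classes.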
Reputation independence. To avoid users' sensitive attributes systematically impacting the ranking system, we design a strategy that, given a sensitive attribute of the users in the system, mitigates the bias in user reputations across the groups of users with different values of that attribute, thus making the reputation computation independent of the sensitive attribute.
Given a reputation-based RS that updates the items' rankings as a weighted average of ratings with users' reputations (such as (1) and (2)), we propose to harmonize users' reputations inside each group of a specific attribute, to achieve a similar distribution of reputations among the groups. If we use (1) or (2) to compute rankings and reputations over $N$ iterations, we take $r^{(N)}$ and $c^{(N)}$ and apply the following additional step to ensure independence for attribute $a$ with classes $A_1, \ldots, A_k$: for each class $A_p$ and each user $u \in A_p$,

(3)
$$c_u \leftarrow \sigma^* \, \frac{c_u^{(N)} - \mu_p}{\sigma_p} + \mu^*,$$

where $\mu_p = \mu(c^{(N)}_{A_p})$ and $\sigma_p = \sigma(c^{(N)}_{A_p})$, with
$$\mu^* = \min_{1 \le p \le k} \mu_p, \qquad \sigma^* = \min_{1 \le p \le k} \sigma_p.$$

Observe that, in Equation (3), we select the minimum among the averages and the minimum among the standard deviations to ensure that the readjusted reputations still lie in the interval $[0, 1]$. Hence, Equation (3) harmonizes the reputation distributions of the classes of an attribute to follow a common probability distribution, ensuring that the reputations of each class are "statistically indistinguishable".
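The harmonization step can be sketched as follows; this is a minimal illustration under the reconstruction of (3) above, assuming reputations come as a NumPy array and each class as an array of user indices, with `harmonize_reputations` a hypothetical helper name.

```python
import numpy as np

def harmonize_reputations(c, classes):
    """Additional step of Eq. (3): rescale per-class reputations so that
    every class shares the minimum mean and minimum standard deviation.

    c       : array of reputations after N iterations of Eq. (1) or (2)
    classes : list of index arrays, one per class of the attribute
    """
    mus = np.array([c[idx].mean() for idx in classes])
    sds = np.array([c[idx].std() for idx in classes])
    mu_star, sd_star = mus.min(), sds.min()    # keep values inside [0, 1]
    c_new = c.copy()
    for idx, mu, sd in zip(classes, mus, sds):
        if sd > 0:
            c_new[idx] = sd_star * (c[idx] - mu) / sd + mu_star
        else:                                   # degenerate class: all equal
            c_new[idx] = mu_star
    return c_new
```

After this step, all classes have the same mean and standard deviation of reputation, while the relative ordering of users inside each class is preserved.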
5. Evaluation on Real Data
Capturing and dealing with DR is not a trivial task due to the lack of public datasets with ratings and sensitive attributes of the users. This led us to this preliminary study, whose goal is to illustrate the problem and to validate it considering different sensitive attributes.
Here, we compare the state-of-the-art RS proposed in (Saúde et al., 2020) and computed as in (2), and the solution we introduce in (3). We do this both in terms of DR and ranking effectiveness.
We use the MovieLens-1M dataset, which has 1,000,209 ratings from 6,040 users to 3,952 items. We evaluate our work on the users' attributes gender and age, available in the dataset.
5.1. Evaluating Disparate Reputation
Attribute: Gender. First, we investigate whether there is bias in users' reputations under the attribute gender. We start by characterizing DR through the box-whisker chart (BWC) of the reputations, see Figure 1 (a), and we use the proposed DR metric to assess bias with respect to the attribute gender. Using solely equation (2), we obtain a negative DR value. Hence, the class of male users has, on average, larger reputation values than the class of female users, yielding a bias on the attribute gender (as we also observe in Figure 1 (a)). Next, we test the null hypothesis that the median difference is 0 at the 5% level based on the MW test. The hypothesis is rejected. Hence, we confirm that there is bias in the reputations of these two classes.
To mitigate the bias, we compute the reputations and the rankings as presented in equation (3), obtaining the BWC for reputations of Figure 1 (b). In this case, the DR value is close to zero. Hence, we mitigate the bias on the attribute gender (as we also observe in Figure 1 (b)). This time, the null hypothesis that the median difference is 0 is not rejected at the 5% level based on the MW test. This result confirms that we successfully mitigated the bias in the reputations of these two classes.
Attribute: Age. Subsequently, we perform a similar analysis for the attribute age, where we consider groups of users partitioned into the age classes available in the dataset. We characterize whether DR occurs by computing the box-whisker chart (BWC) of the reputations, see Figure 2 (a).
The DR metric, when only (2) is used, yields the results in Table 2, which reveal the existence of bias. We only fill the upper-triangular part of the table because DR anticommutes, so the lower-triangular part is the negation of the upper-triangular one. The result of the MW test on the reputations resulting from (2), for each pair of classes, testing the null hypothesis that the median difference is 0 ($H_0$) at the 5% significance level against the alternative that the median difference is not 0 ($H_1$), is summarized in Table 1. We only fill the upper-triangular part of the table since the MW test is symmetric.
When we mitigate bias for the attribute age with (3), we achieve the results in Table 3 and Figure 2 (b). Now, the null hypothesis that the reputations' median difference is 0 ($H_0$) is not rejected at the 5% significance level, using the MW test, for any pair of classes of the attribute age (the table is not reported due to space constraints).
5.2. Evaluating Effectiveness
[Table 4. Kendall's $\tau$ w.r.t. the AA ranking, without mitigation (No attribute) and with mitigation on each attribute (Gender, Age).]
Finally, to evaluate the effectiveness of the proposed method, we use the Kendall tau (Kendall, 1938) with the AA as the ground truth, as done in (Li et al., 2012). We report the observed $\tau$ for each of the attributes considered in Table 4. We notice that the Kendall tau improves when we mitigate bias relative to an attribute. This improvement means that our approach yields a ranking order closer to the AA, while still assigning different relevance to different users, w.r.t. (2).
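As an illustration, Kendall's tau against the AA baseline can be computed with SciPy; the item scores below are made-up examples (not the paper's results), chosen so that the mitigated ranking preserves the AA order while the unmitigated one swaps a pair.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical item scores: the arithmetic average (AA) is the ground
# truth, compared against scores before and after mitigation.
aa        = np.array([0.90, 0.70, 0.60, 0.40, 0.20])
before    = np.array([0.80, 0.50, 0.70, 0.30, 0.10])  # one swapped pair
mitigated = np.array([0.85, 0.68, 0.62, 0.35, 0.15])  # same order as AA

tau_before, _ = kendalltau(aa, before)
tau_after, _ = kendalltau(aa, mitigated)
# A larger tau means the induced item order is closer to the AA ranking.
```

Here `tau_after` equals 1 because the mitigated scores induce exactly the AA order, whereas the single discordant pair in `before` lowers `tau_before`.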
The reputation concept treats users differently, which may lead to a ranking biased with respect to specific user attributes. With the proposed approach, we mitigate this bias for a specific attribute: reputation still plays a role inside each group sharing an attribute value, but it no longer causes bias across groups. So, we get "closer" to the average in the sense that the AA does not treat groups differently and, with our approach, neither do we for a given attribute.
6. Conclusions
Reputationbased ranking systems try to rank items by ensuring the preferences of the community as a whole are reflected in the way items are sorted. In this sense, computing effective formulations of user reputation, to weight the individual preferences, is vital.
In this work, we introduce a measure of disparate reputation (DR), to analyze if user reputation is affected by users’ sensitive attributes. To avoid this, we introduce a novel approach that ensures reputation independence from sensitive user attributes. Experiments on real data, which considered different demographic attributes of the users, showed that DR occurs in stateoftheart approaches and that our mitigation can introduce reputation independence from sensitive attributes and, at the same time, increase ranking quality.
Avenues for further research include exploring further datasets, considering attributes with possibly not disjoint classes, and specifying multiple attributes to mitigate bias.
Acknowledgments.
This work was supported in part by FCT project POCI010145FEDER031411HARMONY. G. Ramos further acknowledges the support of Institute for Systems and Robotics, Instituto Superior Técnico (Portugal), through scholarship BL229/2018_ISTID. L. Boratto acknowledges Agència per a la Competivitat de l’Empresa, ACCIÓ, for their support under project “Fair and Explainable Artificial Intelligence (FXAI)”.
References
Hajian et al. (2016). Algorithmic bias: from discrimination discovery to fairness-aware data mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2125–2126.

Kamishima et al. (2018). Recommendation independence. In Conference on Fairness, Accountability and Transparency, FAT 2018, Proceedings of Machine Learning Research, Vol. 81, pp. 187–201.

Kendall (1938). A new measure of rank correlation. Biometrika 30 (1/2), pp. 81–93.

Kulshrestha et al. (2019). Search bias quantification: investigating political bias in social media and web search. Information Retrieval Journal 22 (1–2), pp. 188–227.

Li et al. (2012). Robust reputation-based ranking on bipartite rating networks. In Proceedings of the 2012 SIAM International Conference on Data Mining, pp. 612–623.

Mann and Whitney (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pp. 50–60.

Medo and Wakeling (2010). The effect of discrete vs. continuous-valued ratings on reputation and ranking systems. EPL (Europhysics Letters) 91 (4), pp. 48004.

Pan et al. (2007). In Google we trust: users' decisions on rank, position, and relevance. Journal of Computer-Mediated Communication 12 (3), pp. 801–823.

Ramos et al. (2020). On the negative impact of social influence in recommender systems: a study of bribery in collaborative hybrid algorithms. Information Processing & Management 57 (2), pp. 102058.

Saúde et al. (2020). A robust reputation-based group ranking system and its resistance to bribery. arXiv preprint.

Saúde et al. (2017). Reputation-based ranking systems and their resistance to bribery. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 1063–1068.

Singh and Joachims (2018). Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, pp. 2219–2228.