Interplay between Upsampling and Regularization for Provider Fairness in Recommender Systems

06/07/2020 · Ludovico Boratto, et al. · Association for Computing Machinery · Università di Cagliari

Considering the impact of recommendations on item providers is one of the duties of multi-sided recommender systems. Item providers are key stakeholders in online platforms, and their earnings and plans are influenced by the exposure their items receive in recommended lists. Prior work showed that certain minority groups of providers, characterized by a common sensitive attribute (e.g., gender or race), are disproportionately affected by indirect and unintentional discrimination. Existing fairness-aware frameworks show limitations in handling a situation where all of these conditions hold: (i) the same provider is associated with multiple items of a list suggested to a user, (ii) an item is created jointly by more than one provider, and (iii) the predicted user-item relevance scores for items of the provider groups are biased estimates. Under this scenario, we characterize provider (un)fairness with a novel metric that calls for equity of relevance scores among provider groups, based on their contribution in the catalog. We assess this form of equity on synthetic data, simulating diverse representations of the minority group in the catalog and the observations. Based on the lessons learned, we devise a treatment that combines observation upsampling and loss regularization while learning user-item relevance scores. Experiments on real-world data show that our treatment leads to higher equity of relevance scores. The resulting suggested items provide fairer visibility and exposure, wider minority-group item coverage, and no or limited loss in recommendation utility.


1 Introduction

Recommender systems help individuals explore vast catalogs of items. To this end, such systems adopt a model that implements a suitable way of ranking items. Conventionally, items are ranked in order of decreasing relevance for a given user, estimated via machine learning. The literature has traditionally focused on optimizing user-item relevance for the user's recommendation utility Ricci et al. (2015). However, many recommendation scenarios involve multiple stakeholders, and should account for the impact on more than one group of participants Burke (2017). For instance, the ranked lists may influence the profits and plans of item providers Jannach and Jugovac (2019).

Motivation. The motivation driving this paper is that an automated model, optimized for the user's recommendation utility, can introduce an indirect and unintentional discrimination against providers belonging to a legally-protected minority class (e.g., defined by gender or ethnicity) Zliobaite (2017); Dwork et al. (2012). Given the primary role of recommender systems also for minority providers, having their items unfairly recommended would have human, ethical, social, and economic consequences Ricci et al. (2015). Furthermore, due to these phenomena, providers might lose their trust in the platform and leave it, affecting the ecosystem as a whole. Hence, it is imperative to uncover, characterize, and mitigate the discrimination inherent in the recommendation model, so that no platform systematically and repeatedly disadvantages minority providers.

Problem Statement. The literature on ranking and recommendation has recently focused on aligning the exposure or attention given to providers with their relevance or contribution in the catalog, at the individual or group level Yang and Stoyanovich (2017); Liu et al. (2019); Kamishima et al. (2018); Biega et al. (2018). Our study encodes the idea of a group-level proportionality between the contribution in the catalog and the relevance assigned to the items of a provider group, following a distributive norm based on equity Walster et al. (1973). Operationalizing this notion during the user-item relevance optimization stage can be seen as a proactive way of addressing provider fairness along the recommendation pipeline. Besides potentially bringing fairness-related benefits to the suggested lists by itself, this action may also help produce fairer relevance scores when the true expected relevances required by other fairness-aware treatments are not available. Ensuring equity of relevance for minority providers is not trivial, since their items tend to be under-represented in the observations. This may influence the predicted relevance and, in cascade, the recommendations involving minority providers. The disparate impact we address consists of the items of a small minority group of providers systematically receiving a relevance, and potentially an exposure, not proportional to their contribution in the catalog.

Open Issues. While a range of frameworks to assess and mitigate provider unfairness have been introduced in the context of non-personalized people rankings Biega et al. (2018); Singh and Joachims (2018); Lahoti et al. (2019) and item recommendation Kamishima et al. (2018); Beutel et al. (2019), several issues remain open.

Existing frameworks for provider fairness consider a one-to-one association between items and providers Beutel et al. (2019); Sapiezynski et al. (2019). This is natural in a people-ranking setting, since the concepts of provider and item being ranked coincide. However, in a more general item recommendation scenario, items and providers may be linked by a many-to-many relationship. Items might be created jointly by more than one provider (e.g., a movie having multiple directors), and the same provider can offer more than one item. Hence, current frameworks fail to assess how fair recommendations are for providers in the general context we described (e.g., for items having both female and male providers). Further, provider unfairness is traditionally mitigated through a form of re-ranking, assuming access to true unbiased relevances Singh and Joachims (2018); Biega et al. (2018). In practice, these relevances are typically estimated via machine learning, leading to biased estimates of the relevance scores. Recommender systems are known to be biased from several perspectives (e.g., popularity, presentation, and, obviously, unfairness for users and providers). Predicting a relevance score on biased/unfair results and basing a re-ranking approach on this possibly biased relevance might lead to undesired effects.

To our knowledge, no approach can deal with equity of relevance for provider groups under the above scenario. Indeed, while in-processing regularizations of relevance exist Kamishima et al. (2018); Beutel et al. (2019) and would overcome the second issue, these treatments are fundamentally driven by a fairness objective different from ours, not relying on equity, and are still based on a one-to-one relationship between an item and its provider (i.e., there is no straightforward extension of these works to consider items associated with more than one value of a sensitive attribute).

Motivating Example. These challenges are depicted with concrete examples, taken from the MovieLens-10M dataset, presented in detail in Section 5.1.1. The ID 8097 is associated with the movie "Shark Tale", which has three directors (our providers): Bibo Bergeron, Vicky Jenson, and Rob Letterman. Considering a binary gender attribute (while gender is by no means a binary construct, to the best of our knowledge no dataset with non-binary genders exists; what we consider is a binary feature, as offered by the currently available public datasets), it is clear that a one-to-one mapping would assign this item either to the male or to the female group. In reality, the recommendation of this item should account for both the group proportions within the item and the number of providers associated with each value of the sensitive attribute. Indeed, recommending an item with 11 providers of the same gender (e.g., "Fantasia", ID 1282, 11 male directors) or an item with 2 providers of different genders (e.g., "Shrek", ID 4306) would impact fairness for provider groups in different ways. Further, female directors appear in only a small share of the items in the catalog, and end up even more under-represented in the observations. Considering the pair-wise approach we employ in this work and the (un)fairness metric we will present, female providers receive a share of relevance (and of exposure) lower than their contribution in the catalog, and are thus affected by our target disparate impact. Hence, an approach to overcome such an impact under this recommendation scenario is needed.

Contributions. Compared to prior work, in both the fairness metric and the mitigation, we consider a many-to-many relationship between items and their providers, and assess the representation of each value of a sensitive attribute within a given item (in our previous example, we would assess how represented each gender is in that item). Second, we introduce and optimize for a notion of equity of relevance, considering a user-item relevance learning procedure to be fair if the relevance given to the items of a certain group of providers is proportional to its representation in the catalog. To this end, we propose a pre-processing strategy that up-samples observations where the minority group is predominant (e.g., an item where the minority is represented by two providers is preferable to an item with only one provider of that group; moreover, the lower the representation of the majority in an item, the more we can help the minority by favoring the upsampling of that item). In addition, an in-processing component aims to ensure that the relevance given to the items of the minority group is proportional to its contribution in the catalog. Specifically, our contribution is summarized as follows:

  • we define provider fairness in recommendations through a notion of equity between relevance given to provider groups and their contribution in the catalog, under a many-to-many relation between items and providers;

  • we assess our notion of provider unfairness for the minority group of providers on synthetic data that simulates diverse representations of the group in the catalog and the observations, and learn lessons that guide our mitigation;

  • we present a mitigation approach that relies on (i) tailored upsampling in pre-processing and (ii) a regularization term added to the original training optimization function to operationalize our notion of fairness;

  • we extend two public datasets with gender information of the providers, enabling the consequent evaluation of the impact of our metrics and strategies on real-world datasets with very small minority groups.

Roadmap. The remainder of this paper is structured as follows: Section 2 formalizes key concepts and metrics, and Section 3 describes our exploratory analysis. Then, Section 4 introduces our mitigation approach, while Section 5 assesses its feasibility. Section 6 provides connections with prior work. Finally, Section 7 presents concluding remarks and future perspectives.

2 Concepts and Definitions

In this section, we outline the recommendation scenario we seek to investigate and the concepts and definitions used throughout this paper.

2.1 Recommender System Formalization

Given a set of users $U$, a set of items $I$, and a set of providers $P$, we assume that each item $i \in I$ is jointly offered by a subset of providers $P_i \subseteq P$, with $|P_i| \geq 1$, and that a provider $p \in P$ offers a subset of items $I_p \subseteq I$, with $|I_p| \geq 1$. For instance, in the context of course recommendation, if we consider instructors as providers of course items, a course could have two instructors who give lectures cooperatively. Similarly, the same instructor could deliver three different courses on the platform, two of them cooperatively and one alone, just as an example. Each provider $p$ is associated with a value $a_p$ of a discrete sensitive attribute, whose set of possible values is denoted by $A$. For instance, $A$ could be associated with the gender attribute and, thus, be defined as $A = \{0, 1\}$, assuming that the attribute is discrete and that we encode each discrete value as a unique integer.

We assume that users have interacted with a subset of the items in $I$. The feedback collected from user-item interactions can be abstracted as a set of pairs $(u, i)$ obtained from the normal user activity, or as triplets $(u, i, v)$, where $v$ is either provided by users (e.g., ratings) or computed by the system (e.g., frequency). In our study, we consider pairs derived from explicit feedback, obtained by applying a pre-selected threshold to rating values, in order to model the recommendation task as a personalized ranking problem. We denote the user-item feedback matrix by $R \in \{0, 1\}^{|U| \times |I|}$, where $R_{ui} = 1$ indicates that user $u$ interacted with item $i$, and $R_{ui} = 0$ otherwise. Furthermore, we denote the set of items that user $u$ interacted with by $I_u^+$.

We assume that each user $u$ and item $i$ is internally represented by a $d$-sized numerical vector from a user-vector matrix $W \in \mathbb{R}^{|U| \times d}$ and an item-vector matrix $X \in \mathbb{R}^{|I| \times d}$, respectively. The recommender system's task is to optimize the prediction of unobserved user-item relevance. It can be abstracted as learning $\tilde{r}(u, i) = f(W_u, X_i)$, where $\tilde{r}(u, i)$ denotes the predicted relevance, $W$ and $X$ denote the learnt user and item matrices, and $f$ denotes the function predicting the relevance between $u$ and $i$. Given a user $u$, items are ranked by decreasing $\tilde{r}(u, i)$, and the top-$k$ items, with $k \ll |I|$, are recommended. Our study focuses on top-10 recommendations, since they likely get the most attention from users and 10 is a widely employed cut-off. Finally, we denote the set of $k$ items recommended to user $u$ by $\tilde{I}_u$.
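As a minimal illustration of this formalization, the sketch below (Python with NumPy; the matrix sizes, names such as W, X, and top_k, and the random initialization are our own illustrative choices, not the paper's code) scores every item for a user through the inner product of their vectors and keeps the k highest-scoring ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 100, 500, 16

# User-vector matrix W and item-vector matrix X (randomly initialized for illustration).
W = rng.uniform(-0.05, 0.05, size=(n_users, d))
X = rng.uniform(-0.05, 0.05, size=(n_items, d))

def relevance(u):
    """Predicted relevance of every item for user u (inner product of embedding vectors)."""
    return X @ W[u]

def top_k(u, k=10, exclude=None):
    """Rank items by decreasing predicted relevance and keep the first k."""
    scores = relevance(u)
    if exclude:  # typically the items the user already interacted with
        scores[list(exclude)] = -np.inf
    return np.argsort(-scores)[:k].tolist()

print(top_k(u=0, k=10))
```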

2.2 Associating Providers’ Sensitive Attributes to Items

Formalizing our target notion of fairness for provider groups, under the scenario depicted in Section 2.1, requires dealing with several aspects. Fairness studies in ranking and recommendation have traditionally targeted people as the entities to be ranked or recommended Biega et al. (2018); Yang and Stoyanovich (2017); Lahoti et al. (2019). We argue that, while individuals are still directly affected by how recommendations are generated, the entities to be recommended are not always individuals, and may include items (e.g., movies, courses). This leads to key challenges that arise in cascade.

First, in many cases, there is no direct one-to-one mapping between an item and the individual who has created or offered it (i.e., the provider). Realistic scenarios need to consider items created cooperatively by more than one provider (e.g., a course with two instructors) and how the sensitive attributes of the involved providers are associated with the item. It can even be difficult to come up with a one-to-many mapping for items offered by an entity not directly linked to individuals (e.g., a training company providing an online course).

Second, the fact that an item might have more than one provider behind it poses the problem of how to model the representation of a providers' sensitive attribute for that item (e.g., how each gender is represented in a given item), based on the individuals associated with it. It should be clear that linking a single attribute value, whether binary or multi-class, discrete or continuous, to an item and claiming fairness on such a variable is impractical. More sophisticated solutions should be considered.

Based on these observations, we define a notion of sensitive attribute representation for an item $i$, with respect to a sensitive attribute $A$. This notion requires considering the membership of each provider in a class of the sensitive attribute (which we previously denoted as $a_p$), while mapping sensitive attributes to items.

Definition 1 (Sensitive attribute representation)

Given a sensitive attribute $A$, the sensitive attribute representation of an item $i \in I$ with respect to $A$ is defined as:

$$h_i = \big[\, |P_i^{a}| \,\big]_{a \in A} \qquad (1)$$

where $P_i^{a} = \{p \in P_i : a_p = a\}$ is the set of $i$'s providers with attribute value $a$. Each vector $h_i$ has size $|A|$ for all items $i \in I$, and each of its values represents the number of providers who belong to a given class of the attribute $A$, ranging in $[0, |P_i|]$. Similarly to us, Sapiezynski et al. (2019) use a function to map each ranked item to a vector. However, their vector is used as a proxy of uncertainty while assigning a sensitive attribute value to a person to be ranked (e.g., given a binary gender construct, if a system estimates the probability that a person is male, the vector associated with that person encodes that probability and its complement). Our notion differs both conceptually and operationally, as we model and compute how each value that a sensitive attribute can assume is represented, in magnitude, across the providers associated with a given item. Our notion could be extended to model uncertainty in the value of the sensitive attribute associated with a single provider, which we assume here to be known. To better highlight our contribution, our study leaves this combination as future work.
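A small sketch of how the representation vector of Eq. 1 can be computed follows (Python; the toy provider-attribute and item-provider maps are hypothetical, not taken from the datasets used in this paper).

```python
from collections import Counter

# Hypothetical maps: provider -> attribute value (e.g., 0/1 for a binary gender feature),
# and item -> providers who jointly offer it.
provider_attr = {"p1": 0, "p2": 1, "p3": 1}
item_providers = {"shark_tale": ["p1", "p2", "p3"], "fantasia": ["p2"]}

ATTR_VALUES = sorted(set(provider_attr.values()))

def attribute_representation(item_id):
    """Vector counting, for each attribute value, how many of the item's providers hold it (Eq. 1)."""
    counts = Counter(provider_attr[p] for p in item_providers[item_id])
    return [counts.get(a, 0) for a in ATTR_VALUES]

print(attribute_representation("shark_tale"))  # [1, 2]: one provider with value 0, two with value 1
```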

2.3 Identifying the Minority Group

Our study considers groups of providers who belong to a given class $a \in A$ of the sensitive attribute. Each group is involved in the creation/delivery of a certain number of items in the catalog and, consequently, appears in a certain number of the user-item observations. Specifically, given the definitions provided in Section 2.2, the representation of a group in the catalog and in the observations is computed in our study as follows:

Definition 2 (Provider representation in the catalog)

Given a sensitive attribute $A$, the representation in the catalog of the providers with sensitive attribute value $a \in A$ is defined as:

$$\mathcal{C}(a) = \frac{\sum_{i \in I} |P_i^{a}|}{\sum_{i \in I} |P_i|} \qquad (2)$$

where $|P_i^{a}|$ is the element of the vector $h_i$ associated with the value $a$, as per the definition in Eq. 1. The representation $\mathcal{C}(a)$ ranges in $[0, 1]$, and accounts for the contribution of the providers belonging to a given group to the delivery of the items in the catalog. A value close to $0$ means that providers with attribute value $a$ rarely contribute to items in the catalog, and vice versa for values close to $1$. Similarly, we define the representation of a provider group in the observations.

Definition 3 (Provider representation in the observations)

Given a sensitive attribute $A$, the representation in the observations of the providers with sensitive attribute value $a \in A$ is defined as:

$$\mathcal{O}(a) = \frac{\sum_{(u,i)\,:\,R_{ui}=1} |P_i^{a}|}{\sum_{(u,i)\,:\,R_{ui}=1} |P_i|} \qquad (3)$$

where the sums run over the observed user-item interactions. In our study, we are interested in investigating how recommendation decisions impact a group of providers identified as a minority. There exist different ways to identify a minority group, one of them being the lowest representation in the catalog, i.e., $m = \arg\min_{a \in A} \mathcal{C}(a)$. This choice better supports us in accounting for differences in contribution among provider groups, assuming that the catalog curation does not suffer from sampling bias (e.g., a job site that refuses to add female engineers to its catalog). While it may be reasonable to assume that certain groups of providers are less represented than others in the catalog (e.g., because certain categories of items are traditionally offered by providers of a given gender), the recommendation loop may under-represent the minority group in the observations more and more with respect to its contribution in the catalog, i.e., $\mathcal{O}(m) < \mathcal{C}(m)$. This effect may inadvertently bias the learnt relevance scores and, consequently, hold back recommendations of minority-group items.
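Both representations, and the resulting minority group, can be derived from the same structures, as in the sketch below (toy data again; the observation list simply repeats one entry per observed user-item interaction).

```python
# Hypothetical maps, as in the previous sketch.
provider_attr = {"p1": 0, "p2": 1, "p3": 1}
item_providers = {"i1": ["p1", "p2", "p3"], "i2": ["p2"], "i3": ["p1"]}
ATTR_VALUES = sorted(set(provider_attr.values()))

def group_share(items, group):
    """Share of provider slots held by `group` across a collection of items (Eqs. 2-3)."""
    in_group = sum(sum(1 for p in item_providers[i] if provider_attr[p] == group) for i in items)
    total = sum(len(item_providers[i]) for i in items)
    return in_group / total

catalog = list(item_providers)            # every item counted once: representation in the catalog
observations = ["i2", "i2", "i3", "i2"]   # one entry per observed user-item interaction

minority = min(ATTR_VALUES, key=lambda a: group_share(catalog, a))
print(minority, group_share(catalog, minority), group_share(observations, minority))
```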

2.4 Defining Equity of Relevance

Compared with the widely-explored context of fairness in people rankings, recommender systems involve personalization (while in non-personalized ranking the utility associated with a query is unique) and need to consider that the same provider can appear behind more than one suggested item in the same list (e.g., an instructor with two of their courses recommended in the top-$k$ list for a user). This point, combined with the concept of item sensitive attribute representation, should be considered when dealing with fairness for (a group of) providers, based on how their items are recommended.

Our fairness notion is driven by the idea of a fair sharing of assets. In the context of recommendations, we consider the relevance of the items delivered by provider groups as an asset to be distributed fairly. One popular distributive norm, equity, encodes the idea of proportionality between two variables Walster et al. (1973), and has recently been applied to the context of people ranking Biega et al. (2018). While technically complementary and similar to our approach, their notion targets a purpose different from group fairness, and does not aim at binding relevance to contribution in recommendation.

Our notion of provider group fairness for recommender systems, called equity of relevance, requires that groups of providers characterized by a common sensitive attribute receive a relevance proportional to their contribution to the platform. As a proxy for contribution, we consider the representation of a provider group in the catalog.

Definition 4 (Group relevance)

Given a sensitive attribute $A$ and a recommender system with predicted relevance $\tilde{r}$, the relevance of the group of providers with attribute value $a \in A$ is defined as:

$$\mathcal{R}(a) = \sum_{u \in U} \sum_{i \in I} \tilde{r}(u, i) \cdot \frac{|P_i^{a}|}{|P_i|} \qquad (4)$$

Definition 5 (Fairness objective: equity of relevance)

Given a sensitive attribute $A$, the relevance returned by a recommender system is fair if the following condition is met:

$$\frac{\mathcal{R}(a)}{\sum_{a' \in A} \mathcal{R}(a')} = \mathcal{C}(a) \qquad \forall\, a \in A \qquad (5)$$

It should be noted that this definition is suitable to be optimized directly during the user-item relevance learning step, and that the relevance scores are determined by the user and the item. To keep the reader engaged with the presentation of our contribution, the contextualization of these metrics with respect to the literature is presented in Section 6.1.
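The sketch below shows how this condition can be checked for a set of predicted scores (all numbers are hypothetical; `minority_share` stands for the ratio $|P_i^{m}| / |P_i|$ of each scored item).

```python
import numpy as np

pred_relevance = np.array([0.9, 0.4, 0.7])   # predicted relevance of the scored items
minority_share = np.array([2/3, 0.0, 0.5])   # share of each item's providers in the minority group
catalog_contribution = 0.30                   # C(m) from Eq. 2, hypothetical

group_rel = float((pred_relevance * minority_share).sum())  # relevance credited to the minority (Eq. 4)
rel_share = group_rel / float(pred_relevance.sum())         # its share of the total relevance
print(abs(rel_share - catalog_contribution))                # 0 would mean perfect equity (Eq. 5)
```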

3 Optimizing under Different Catalog-Observation Representations

To illustrate the unfairness towards a minority group of providers, and further emphasize the value of our analytical modeling, we simulate various imbalances in catalog and observations of the minority group. Then, we characterize to what extent a model is unfair against the minority group.

3.1 Synthetic Datasets

Our exploratory study in this section is set up in a recommendation context that associates each provider to a generic binary sensitive attribute. The unfairness towards a certain group of providers, characterized by a common sensitive attribute, can occur with imbalanced data. To study the effect of imbalances, we characterize them in two forms: catalog imbalance and observation imbalance. To facilitate this, we assume that each item is associated with a single provider, leaving experiments on items associated with more than one provider to the real-world datasets leveraged in Section 5.

Catalog imbalances emerge when providers from different groups occur in the catalog with different frequencies. For instance, there may be significantly fewer providers of one gender than of the other offering items to users. On the other hand, with observation imbalances, users interact with items from certain provider groups with different tendencies. This imbalance is often part of a feedback loop involving existing recommendation methods, introduced either by models or by humans. If users do not receive any item offered by a provider belonging to a certain group, they will not interact with that class of providers. In cascade, models will be served with only little data on this preference relation. For instance, the training data about one group of providers may be significantly scarcer than the training data about the other.

We simulate these two types of imbalance through synthetic datasets, using two stochastic block models Yao and Huang (2017). We create a catalog block model to determine the probability that an item is offered by a provider in a particular group. Non-uniformity in this block model will lead to catalog imbalances. We then arrange an observation block model, determining the probability that a user observes an item from a given provider group, simulating an implicit feedback scenario. The group ratios may be non-uniform, leading to observation imbalance. Formally, let the vector $c$, with $\sum_{a \in A} c_a = 1$, contain the block-model parameters for the catalog probability: for an item $i$, the probability of assigning it to a provider with attribute value $a$ is $c_a$. Moreover, given a user $u$, let $o$ be such that the probability of observing an item with a provider having attribute value $a$ is $o_a$. Specifically, based on the two groups in $A$, we consider five catalog block models and five observation block models, varying the share assigned to the minority group. To replicate our target recommendation context, where the observation imbalance is assumed to be equal to or higher than the catalog imbalance, our study considers only setups in which the minority share in the observations does not exceed its share in the catalog. Hence, our exploration covers both situations with a really small minority and situations where the groups are more balanced. Specifically, the providers of the less-represented group are identified as the minority group, i.e., $m = \arg\min_{a \in A} \mathcal{C}(a)$.

(a) Popularity tail
(b) Provider group imbalance
Figure 1: Synthetic Datasets Imbalance. (a) Popularity tail across items based on the observed interactions, conveyed by each of our synthetic datasets, according to the procedure in Section 3.1. (b) Catalog and observation representations of the minority group in synthetic data, where C stands for "Catalog", O stands for "Observations", and C-O is the difference between the catalog and observation representations, based on Eq. 2 and Eq. 3, respectively.

For each setup, we selected a catalog block model and an observation block model, (i) generating a fixed number of users and items, (ii) assigning catalog representations based on the catalog block model, and (iii) sampling implicit observations according to the observation block model. In this last step, we randomly sampled a user $u$, then selected the provider group of the item in that pair according to the observation block model, and finally sampled an item from the selected group. To limit anomalous results and distorted recommendation outputs, for each provider group, our procedure samples the item in a way that simulates a scenario where items in the same provider group have different probabilities of being selected. To this end, we used an exponential distribution over item indexes: given the list of items in a group and the distribution, the index of the sampled item is the absolute rounded value of a random variable drawn from that distribution. The scale of the distribution controls how uniform the selection is across items, and we set it with the aim of reflecting realistic popularity tails. Figure 1 shows the popularity tails and the catalog and observation representations in the synthetic datasets.
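A possible implementation of this generation procedure is sketched below (Python with NumPy). The block-model probabilities, the exponential scale, and the dataset sizes are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, n_obs = 1000, 500, 20000
catalog_probs = np.array([0.7, 0.3])  # catalog block model: P(item offered by group g)
obs_probs = np.array([0.8, 0.2])      # observation block model: P(observation involves group g)
scale = 50.0                           # scale of the exponential skewing item selection (popularity tail)

# Assign each item to a provider group according to the catalog block model.
item_group = rng.choice(len(catalog_probs), size=n_items, p=catalog_probs)
items_by_group = [np.flatnonzero(item_group == g) for g in range(len(catalog_probs))]

def sample_observation():
    """Sample one implicit (user, item) observation following the observation block model."""
    u = int(rng.integers(n_users))
    g = int(rng.choice(len(obs_probs), p=obs_probs))
    pool = items_by_group[g]
    # Exponential skew: low indexes (head items) are drawn far more often than tail items.
    idx = min(int(round(rng.exponential(scale))), len(pool) - 1)
    return u, int(pool[idx])

observations = [sample_observation() for _ in range(n_obs)]
print(observations[:3])
```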

3.2 Pair-wise Optimization and Exploratory Protocols

Pair-wise optimization is one of the most influential approaches to train recommendation models, and represents the foundation of many cutting-edge personalized algorithms Chen et al. (2017); Xue et al. (2017); Xiao et al. (2017). The underlying Bayesian formulation Rendle et al. (2012) aims to maximize a posterior probability and can be adapted to the parameter vector of an arbitrary model class (e.g., matrix factorization or neighborhood-based). In our study, we adopt matrix factorization Koren et al. (2009), due to its popularity and flexibility. The model parameters, i.e., the user and item matrices $W$ and $X$, are estimated through an objective function that maximizes the margin between the relevance predicted for an observed item and the relevance predicted for an unobserved item. The optimization process considers a set of triplets that are fed into the model during training:

$$T = \{(u, i, j) \mid u \in U \,\wedge\, i \in I_u^+ \,\wedge\, j \in I_u^-\} \qquad (6)$$

where $I_u^+$ and $I_u^-$ are the sets of items for which user $u$'s feedback is observed and unobserved, respectively.

The original implementation proposed by Rendle et al. (2012) requires that, for each user $u$, a number of triplets per observed item $i$ be created; the unobserved item $j$ is randomly selected. The objective function can be formalized as follows:

$$\max_{W, X} \;\; \sum_{(u, i, j) \in T} \ln \, \sigma\big(\tilde{r}(u, i) - \tilde{r}(u, j)\big) \qquad (7)$$

where $\sigma(\cdot)$ is a sigmoid function returning a value between 0 and 1.

The code for our exploratory study was implemented in Python on top of TensorFlow. The user and item matrices, with vectors of size $d$, were initialized with values drawn from a uniform distribution. The optimization function was transformed into the equivalent minimization dual problem. For each user, we randomly set apart a portion of his/her observations for training, one for validation, and one for testing. Given the training user-item observations, the model was served with batches of triplets. For each user $u$, we created a fixed number of triplets per observed item $i$; the unobserved item $j$ was randomly selected for each triplet. The optimizer used for the gradient updates was Adam. Training lasted until convergence on the validation set. Parameters were selected via grid search. The validity of a model was assessed on the test set.
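A compact TensorFlow 2 sketch of one pair-wise (BPR-style) update follows; the embedding size, learning rate, and batch construction are illustrative, not the grid-searched values of the paper.

```python
import numpy as np
import tensorflow as tf

n_users, n_items, d = 1000, 500, 16
rng = np.random.default_rng(0)

W = tf.Variable(rng.uniform(-0.05, 0.05, (n_users, d)).astype("float32"))  # user vectors
X = tf.Variable(rng.uniform(-0.05, 0.05, (n_items, d)).astype("float32"))  # item vectors
opt = tf.keras.optimizers.Adam(learning_rate=1e-3)

def bpr_step(u, i, j):
    """One gradient step on a batch of (user, observed item, unobserved item) triplets."""
    with tf.GradientTape() as tape:
        r_ui = tf.reduce_sum(tf.gather(W, u) * tf.gather(X, i), axis=1)
        r_uj = tf.reduce_sum(tf.gather(W, u) * tf.gather(X, j), axis=1)
        # Minimization dual of the pair-wise objective in Eq. 7.
        loss = -tf.reduce_mean(tf.math.log_sigmoid(r_ui - r_uj))
    grads = tape.gradient(loss, [W, X])
    opt.apply_gradients(zip(grads, [W, X]))
    return float(loss)

# Toy batch: random users, observed items, and randomly drawn "unobserved" items.
u = rng.integers(0, n_users, 256)
i = rng.integers(0, n_items, 256)
j = rng.integers(0, n_items, 256)
print(bpr_step(u, i, j))
```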

3.3 Observations on Synthetic Datasets

Through synthetic data, we explore a wider range of configurations, questioning situations not usually observable in public datasets but that might occur in the real world, e.g., datasets with different representations of the minority.

First, we run the pair-wise optimization procedure on all our synthetic datasets. Then, we analyze the resulting relevance scores for each provider group with respect to their representation in the catalog and in the observations, seeking to understand the relevant characteristics of our notion of equity in Eq. 5. To this end, Figure 2 depicts the representation of the minority group in the catalog, the observations, and the relevance. This plot allows us to see to what extent each perspective influences the measures obtained with our metric of fairness, represented by the purple bar. Results show that the relevance (red) is consistent across datasets having the same representation of the minority group in the observations (e.g., 0.5-0.4 and 0.4-0.4). Further, for each dataset, the relevance is similar to the amount of observations (green), and increases as the amount of observations increases. It follows that the representation of the minority group in the observations plays a key role in shaping the representation of the group in terms of relevance. Our notion of fairness may directly depend on the gap between the representation in contribution (amber) and in observations (green). The smaller the gap, the higher the equity of relevance.

Figure 2: Contribution-Relevance Relation. The representation of the minority group in terms of items in the catalog, observations, and relevance, with the purple bar indicating the difference between the contribution and relevance representations, as per our notion of fairness in Eq. 5.

Observation 1. Equity of relevance depends on the difference between the contribution and observation representations. The larger the difference between these two representations, the larger the disparity in equity of relevance.

Next, according to the relevance learnt by the recommendation model on each synthetic dataset, we suggested the top-10 items to each user; then, in Figure 3, we measured the disparate visibility and disparate exposure for the minority group. We consider visibility as the percentage of providers of a given group in the recommendations (regardless of their position in the ranking), while we use a definition of exposure inspired by Singh and Joachims (2018). Both metrics will be explicitly defined in Section 5.1.2. The higher the value, the higher the disparate impact. Connecting all these results allows us to understand how much the inequalities in relevance for provider groups, learnt by the recommendation model, result in inequalities in the recommended lists.

(a) Disparate visibility for the minority
(b) Disparate exposure for the minority
Figure 3: Disparate Impacts. Disparate visibility (a) and exposure (b) for the minority group in top-10 lists. The disparate impacts are calculated with Eq. 13 and 14, respectively.

Observation 2. In contexts with high catalog and observation imbalances, there is a larger disparate visibility (exposure) against the minority group, based on its contribution in the catalog. Furthermore, the higher the disparate equity of relevance, the higher the disparate visibility (exposure).

We can observe that the effect on exposure is more evident. We conjecture that this result might depend on the fact that, in the presence of a small minority, the items from the minority group are progressively pushed to lower positions of the top-10, or even excluded, because of their lower predicted relevance. These considerations suggest investigating treatments that control the interplay between the observation and relevance representations, i.e., the direct input and output of the optimization process. Hence, we will act on the presence of the minority group in the observations, and regularize for equity of relevance.

4 Treatments for Equity of Relevance

With an understanding of our fairness goals and of the intuitions we came up with in the exploratory study, this section describes how we can optimize a recommender system to meet our notion of fairness, while preserving utility.

Our exploratory analysis revealed that equity of relevance may depend on the representation of provider groups in both the catalog and the observations, and that the more similar the two representations are for a group, the higher the resulting equity of relevance. It is unlikely that this property is met in observations collected from real-world platforms, as we will show later. It follows that controlling the balance between the catalog and observation representations of a group could require acting on the observations. To this end, we will up-sample observations of the minority group, to fill the gap left by existing imbalances.

Balanced representations of the minority group between the catalog and the observations would not, by themselves, ensure equity of relevance in real-world situations. Differently from the synthetic data we generated, observations in the real world exhibit several imbalances (e.g., due to presentation, preferences, user interfaces), which are hard to simulate and may still distort the output relevance. It follows that, when an upsampling mechanism is not sufficient to accomplish our goals, we need a regularization approach to account for equity of relevance during optimization. Regularizing for equity alone, with no upsampling, may not be sufficient either, if minority observations are too few. Therefore, our treatment procedure investigates the interplay between upsampling and regularization, with respect to the notion of equity presented in Eq. 5.

To deal with upsampling, we act on the data sampling strategies that generate observation instances (i.e., user-item pairs); conversely, to account for equity of relevance, we define a training loss function aimed at minimizing the pair-wise error specified in Eq. 7 and maximizing the equity of relevance defined in Eq. 5. We will show empirically that, although the optimization relies on a given set of interactions, even artificially up-sampled, the approach generalizes to real and unseen interactions. The treatment builds upon the following steps:

Observation Upsampling. We propose to up-sample observations related to the minority group with different user-item selection techniques, with the aim of covering a range of alternative setups:


  • real consists of an upsampling of existing observations belonging to the minority group, with repetitions. Specifically, we select the item of the existing user-item interaction to be up-sampled based on a probability function that takes into account the contribution of the minority within each item $i$, i.e., $|P_i^{m}| / |P_i|$. The higher the contribution of the minority group, the higher the probability of being selected. Then, the real interactions involving the selected item are retrieved, and the one to be up-sampled is randomly selected.

  • fake stands for a random upsampling of synthetic observations, with no repetitions. This strategy instills new observations related to items from the minority group. Similarly to real, the item involved in the up-sampled interaction is selected based on a probability function that accounts for the contribution of the minority within each item $i$. Then, the user to be included in the up-sampled interaction is randomly selected among the users in $U$ who have not already interacted with the selected item.

  • fake-by-pop refers to an upsampling of synthetic observations based on item popularity, with no repetitions. Given the items with at least one provider from the minority, the item to be inserted in the up-sampled observation is selected according to an item-popularity probability. The higher the popularity, the higher the probability of being selected. The user of the up-sampled interaction is randomly chosen among the users in $U$ who have not already interacted with the selected item. This latter point makes this upsampling procedure different from real, even though both strategies keep the same popularity tail across minority-group items.

The mentioned strategies up-sample pairs $(u, i)$ until the representation of the minority group in the observations meets a target percentage of the total observations. This percentage, investigated in the experimental section, targets the representation of the minority group in the catalog.
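As an illustration of the first strategy, the sketch below up-samples existing (real) observations; item selection is weighted by the minority contribution within each item, and the stopping criterion is a simplified proxy for the target representation (all names and data are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(7)

def upsample_real(observations, minority_share, target_fraction):
    """Duplicate existing minority-group observations (with repetition) until the fraction of
    observations touching the minority reaches `target_fraction`.
    `minority_share[i]` is |P_i^m| / |P_i| for item i; items with a larger minority
    contribution are picked with higher probability."""
    observations = list(observations)
    minority_items = [i for i, s in minority_share.items() if s > 0]
    probs = np.array([minority_share[i] for i in minority_items], dtype=float)
    probs /= probs.sum()
    by_item = {i: [obs for obs in observations if obs[1] == i] for i in minority_items}

    def minority_fraction(obs):
        # Simplified proxy: an observation counts as "minority" if its item has any minority provider.
        return sum(minority_share.get(it, 0.0) > 0 for _, it in obs) / len(obs)

    while minority_fraction(observations) < target_fraction:
        item = minority_items[rng.choice(len(minority_items), p=probs)]
        observations.append(by_item[item][rng.integers(len(by_item[item]))])  # repeat a real interaction
    return observations

obs = [(0, "a"), (1, "a"), (2, "b"), (3, "c")]
share = {"a": 0.0, "b": 0.5, "c": 1.0}   # hypothetical minority contribution per item
print(len(upsample_real(obs, share, target_fraction=0.6)))
```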

Regularized Optimization. Given a range of batches of training data samples (i.e., either pairs for a point-wise approach or triplets for a pair-wise approach), built on top of the up-sampled observations, each training batch $B$ is fed into a model that follows a regularized paradigm derived from a traditional optimization setup. The loss function can be formalized as follows:

$$L = (1 - \lambda)\, L_{acc} + \lambda\, L_{eq} \qquad (8)$$

where $L_{acc}$ is the original accuracy loss, computed over $B$. In our experimental study, we deal with a pair-wise optimization, thus the accuracy loss is computed as in Eq. 7. The parameter $\lambda \in [0, 1]$ expresses the trade-off between accuracy and equity of relevance. With $\lambda = 0$, we yield the original output of the recommender, not taking equity into account. Conversely, with $\lambda = 1$, the output of the recommender is discarded, and we focus only on maximizing equity.

The regularization term, $L_{eq}$, operationalizes our notion of equity of relevance formulated in Eq. 5. The proposed fairness criterion is equivalent to computing, as a percentage, the relevance received by minority-group items in a batch with respect to the total relevance received by all items in that batch, and then equalizing it to the percentage of contribution of the minority group in the catalog. Let $\mathcal{C}(m)$ be the contribution of the minority group in the catalog, computed as in Eq. 2; the regularization can be defined as follows:

$$L_{eq} = \left| \frac{\sum_{(u, i, j) \in B} \tilde{r}(u, i) \cdot \frac{|P_i^{m}|}{|P_i|}}{\sum_{(u, i, j) \in B} \tilde{r}(u, i)} - \mathcal{C}(m) \right| \qquad (9)$$

This regularized optimization implies that the model is penalized if the difference between relevance and contribution for the minority group of providers is high. The denominator counts all the relevance scores in the batch, including the ones for items with no minority-group provider involved. The choice of the absolute value, instead of an L2 norm or an Earth mover distance, for example, is due to its simplicity and effectiveness, especially when dealing with two groups. Our framework can be easily extended to other options.
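A batch-level sketch of the resulting loss follows (TensorFlow); the $(1-\lambda)/\lambda$ blending mirrors our reading of Eq. 8, and the small constant in the denominator only guards against division by zero.

```python
import tensorflow as tf

def regularized_loss(r_ui, r_uj, minority_share, catalog_contribution, lam):
    """Pair-wise accuracy loss blended with the equity regularizer of Eqs. 8-9.
    r_ui / r_uj: predicted relevance for observed / unobserved items in the batch.
    minority_share: |P_i^m| / |P_i| for each observed item in the batch."""
    acc = -tf.reduce_mean(tf.math.log_sigmoid(r_ui - r_uj))              # Eq. 7 (minimization form)
    rel_to_minority = tf.reduce_sum(r_ui * minority_share)                # relevance credited to the minority
    rel_total = tf.reduce_sum(r_ui) + 1e-8                                # all relevance scores in the batch
    equity = tf.abs(rel_to_minority / rel_total - catalog_contribution)   # Eq. 9
    return (1.0 - lam) * acc + lam * equity                               # Eq. 8

# Toy batch with hypothetical values.
r_ui = tf.constant([2.0, 1.5, 0.3])
r_uj = tf.constant([0.5, 1.0, 0.2])
share = tf.constant([0.5, 0.0, 1.0])
print(float(regularized_loss(r_ui, r_uj, share, catalog_contribution=0.3, lam=0.2)))
```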

The contextualization of our regularization with respect to the literature is presented in Section 6.2.

5 Experimental Treatment Evaluation and Analysis

In this section, we empirically study the effects of each component of our treatment and of the treatment as a whole on the needs of both users (i.e., recommendation utility) and providers (i.e., equity of provider relevance, visibility, and exposure). We answer the following three research questions:

RQ1. How much should we up-sample minority-group observations to improve the trade-off between recommendation utility and equity of relevance?

RQ2. How do upsampling and regularization impact on the trade-off between recommendation utility and equity of relevance, individually and jointly?

RQ3. How does our treatment concretely enable equity of relevance for the minority group? How does it impact on internal mechanisms?

5.1 Experimental Setup

5.1.1 Datasets

In order to validate our proposal and ensure its reproducibility, we selected publicly available datasets covering different domains. We remark that this experimentation is made difficult by the fact that very few datasets target our scenario, and the datasets we consider are highly sparse.

Movielens-10M (ml-10m) Harper and Konstan (2016) includes ratings applied to movies by users of the MovieLens platform. In order to be fed into a pair-wise model, observations are binarized using a threshold on the rating value (i.e., ratings equal to or higher than the threshold are marked as 1, the other ones are set to 0). This dataset does not contain sensitive attributes of the providers, and there is no notion of provider. Our study considers movie directors as providers, to reflect a real-world scenario. To link movies to their corresponding directors, we capitalized on the methods offered by the TMDB APIs (https://developers.themoviedb.org/3). Specifically, we used the getCredits(tmdbId) method to retrieve data about the people involved in a movie (note that the links.csv file in MovieLens includes movieId-tmdbId associations). We filtered the records for individuals with "Director" as a role. Then, we called the getDetails(peopleId) method, passing the id retrieved for each director; this method returns, among other information, the name and the gender of the director. Note that there are movies with more than one director. The representation of women directors in the catalog is small, and it is further reduced in the observations.
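The retrieval step can be sketched as follows with the TMDB v3 REST endpoints that back the getCredits/getDetails methods mentioned above; the API key is a placeholder, and the exact response fields should be checked against the current TMDB documentation.

```python
import requests

API_KEY = "YOUR_TMDB_KEY"   # placeholder
BASE = "https://api.themoviedb.org/3"

def movie_director_genders(tmdb_id):
    """Return (name, gender code) for every director of a movie via the TMDB credits/person endpoints."""
    credits = requests.get(f"{BASE}/movie/{tmdb_id}/credits", params={"api_key": API_KEY}).json()
    directors = [c for c in credits.get("crew", []) if c.get("job") == "Director"]
    people = []
    for d in directors:
        person = requests.get(f"{BASE}/person/{d['id']}", params={"api_key": API_KEY}).json()
        people.append((person.get("name"), person.get("gender")))  # gender is an integer code in TMDB
    return people

# MovieLens' links.csv maps each movieId to the tmdbId expected by this function.
```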

COCO Course Collection (coco) Dessì et al. (2018) includes learners who gave ratings to online courses. Similarly to ml-10m, ratings are binarized using a threshold (i.e., ratings equal to the selected value are marked as 1, the other ones are set to 0). We selected this threshold due to the extremely high imbalance among rating values, as reported in the original paper. In this scenario, we assume that instructors act as providers. Providers representing a company or an institution were removed, since there was no practical way to associate their items with gender representations. One or more instructors can cooperate on the same course. However, no information about their gender is reported. To extract this attribute, we considered their full names. We point out the challenges of inferring gender from a name, considering that the retrieved gender might not match the expected gender for someone; related to that issue is the assumption of a binary gender, as most datasets and tools only consider the two genders "male" and "female", leaving us no chance to also consider non-binary attributes. While keeping this in mind, we recognize that all genders should be treated respectfully, and our framework naturally adapts to multi-class attributes and non-binary genders; we believe that our study will deserve attention in this context. Specifically, we used the methods offered by Gender API (https://gender-api.com/), which determine the gender from a full name with a certain confidence. Such a practice has been adopted in prior work to deal with the absence of gender labels Chen et al. (2018); Mansoury et al. (2019). Only predictions above a confidence threshold were kept. The representation of women instructors in the catalog is small, and it is further reduced in the observations.

5.1.2 Evaluation Metrics

In this section, we present the metrics we considered to assess the impact of our work. Although our goal was to bind relevance to contribution of the minority group, several other perspectives of the recommender system should be considered. Our study in this paper includes an assessment (i) of personalization in terms of recommendation utility, (ii) of disparate impacts on relevance, visibility, and exposure with respect to the minority-group contribution in the catalog, and (iii) of coverage of items for the various provider groups and as a whole.

Personalization. To evaluate personalization, we compute the utility of the recommended lists via the Normalized Discounted Cumulative Gain (NDCG) Järvelin and Kekäläinen (2002):

$$DCG@k(u) = \sum_{p=1}^{k} \frac{rel(u, i_p^u)}{\log_2(p + 1)} \qquad (10)$$
$$NDCG@k(u) = \frac{DCG@k(u)}{IDCG@k(u)} \qquad (11)$$

where $i_p^u$ is the item recommended to user $u$ at position $p$, and the values in the feedback matrix $R$ formalized in Section 2.1 are considered as the user-item relevances $rel(u, i)$ while computing $DCG@k$. The ideal $IDCG@k$ is calculated by sorting items based on decreasing true relevance (i.e., for an item, the true relevance is 1 if the user interacted with the item in the test set, 0 otherwise). The higher the better.

Disparate Impacts. To understand the interplay between our notion of equity and the concepts of visibility and exposure Singh and Joachims (2018) of the minority group in the recommended lists, we measure the difference between the contribution in the catalog and the percentage of relevance ($\Delta REL$), visibility ($\Delta VIS$), and exposure ($\Delta EXP$) achieved by items of the minority group. The lower the better.

$$\Delta REL = \bigg|\, \mathcal{C}(m) - \frac{\sum_{u \in U} \sum_{i \in \tilde{I}_u} \tilde{r}(u, i) \cdot \frac{|P_i^{m}|}{|P_i|}}{\sum_{u \in U} \sum_{i \in \tilde{I}_u} \tilde{r}(u, i)} \,\bigg| \qquad (12)$$
$$\Delta VIS = \bigg|\, \mathcal{C}(m) - \frac{\sum_{u \in U} \sum_{i \in \tilde{I}_u} \frac{|P_i^{m}|}{|P_i|}}{\sum_{u \in U} |\tilde{I}_u|} \,\bigg| \qquad (13)$$
$$\Delta EXP = \bigg|\, \mathcal{C}(m) - \frac{\sum_{u \in U} \sum_{p=1}^{k} \frac{1}{\log_2(p+1)} \cdot \frac{|P^{m}_{i_p^u}|}{|P_{i_p^u}|}}{\sum_{u \in U} \sum_{p=1}^{k} \frac{1}{\log_2(p+1)}} \,\bigg| \qquad (14)$$

where $\tilde{r}$ refers to the predicted relevance formalized in Section 2.1, while the terms $\mathcal{C}(m)$ and $|P_i^{m}|$ derive from Eqs. 2 and 1, respectively. The scores $\Delta REL$, $\Delta VIS$, and $\Delta EXP$ refer to top-$k$ recommendations and range in $[0, 1]$, with lower values indicating more equity w.r.t. the contribution in the catalog.

Item Coverage. In addition to personalization and disparate impacts, we measure the total coverage of items ($COV$) and of the items delivered by providers in the minority ($COV_m$) and majority ($COV_M$) groups. Coverage is an important property Kaminskas and Bridge (2017), since an approach that only increases the recommendation of one item provider of the minority group would likely not be fair within the minority group.

$$COV = \frac{\big|\bigcup_{u \in U} \tilde{I}_u\big|}{|I|} \qquad (15)$$
$$COV_m = \frac{\big|\bigcup_{u \in U} \tilde{I}_u \cap I_m\big|}{|I_m|} \qquad (16)$$
$$COV_M = \frac{\big|\bigcup_{u \in U} \tilde{I}_u \cap I_M\big|}{|I_M|} \qquad (17)$$

where $I_m$ is the set of items that have at least one provider belonging to the minority group, and $I_M$ is the analogous set for the majority group. Each coverage score ranges in $[0, 1]$, with values close to 1 indicating higher coverage. The higher the better.
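The sketch below computes these quantities for toy top-k lists (Python; the position-bias weighting used for exposure is our assumption, inspired by Singh and Joachims (2018), and all inputs are hypothetical).

```python
import numpy as np

def disparity_and_coverage(rec_lists, relevance, minority_share, catalog_contribution, minority_items, n_items):
    """Disparate relevance/visibility/exposure (difference w.r.t. catalog contribution) and coverage.
    rec_lists: {user: [item, ...]} top-k lists; relevance: {(user, item): score};
    minority_share: {item: |P_i^m| / |P_i|}."""
    rel_min = rel_tot = vis_min = vis_tot = exp_min = exp_tot = 0.0
    recommended = set()
    for u, items in rec_lists.items():
        for pos, i in enumerate(items, start=1):
            w = 1.0 / np.log2(pos + 1)                     # position bias used for exposure
            r = relevance[(u, i)]
            s = minority_share.get(i, 0.0)
            rel_min += r * s;  rel_tot += r
            vis_min += s;      vis_tot += 1
            exp_min += w * s;  exp_tot += w
            recommended.add(i)
    d_rel = abs(catalog_contribution - rel_min / rel_tot)
    d_vis = abs(catalog_contribution - vis_min / vis_tot)
    d_exp = abs(catalog_contribution - exp_min / exp_tot)
    cov = len(recommended) / n_items
    cov_min = len(recommended & minority_items) / max(len(minority_items), 1)
    return d_rel, d_vis, d_exp, cov, cov_min

recs = {0: ["a", "b"], 1: ["b", "c"]}
rel = {(0, "a"): 0.9, (0, "b"): 0.5, (1, "b"): 0.8, (1, "c"): 0.4}
share = {"a": 0.5, "b": 0.0, "c": 1.0}
print(disparity_and_coverage(recs, rel, share, 0.3, {"a", "c"}, n_items=5))
```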

5.1.3 Experimental Setting

We considered several optimization settings, each one characterized by a different combination of upsampling and regularization treatments, as proposed in Section 4. They are briefly identified as follows:


  • baseline: training without any upsampling and regularization treatment;

  • real: only real upsampling;

  • fake: only fake upsampling;

  • fake-by-pop: only fake-by-pop upsampling;

  • reg: only regularization;

  • real+reg: real upsampling, followed by regularization;

  • fake+reg: fake upsampling, followed by regularization;

  • fake-by-pop+reg: fake-by-pop upsampling, followed by regularization.

5.1.4 Implementation Details

For each dataset, a temporal train-test split was performed by including the most recent observations released by a user into the test set, an earlier portion into the validation set, and the remaining oldest ones into the training set Campos et al. (2014); Sánchez and Bellogín (2020). The embedding matrices, with vectors of size $d$, were initialized with values drawn from a uniform distribution. The optimization function was transformed into the equivalent minimization dual problem. During training, the model was served with batches of training triplets, chosen from a pre-computed set of triplets. To populate it, for each user $u$, we created a fixed number of triplets per observed item $i$; the unobserved item $j$ was randomly selected for each triplet. Before each epoch, we shuffled the training batches. The optimizer used for the gradient updates was Adam. The relevance between a user and an item was computed as the dot product of their vectors. Each model was trained until convergence on the validation set, for a fixed maximum number of epochs.
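A per-user temporal split of this kind can be sketched as follows (pandas; the split fractions are illustrative, as the paper's exact percentages are not reported here).

```python
import pandas as pd

def temporal_split(df, test_frac=0.2, val_frac=0.1):
    """Per-user temporal split: the newest interactions go to test, the next-newest to validation,
    and the oldest to training. Fractions are illustrative, not the paper's."""
    train, val, test = [], [], []
    for _, g in df.sort_values("timestamp").groupby("user"):
        n = len(g)
        n_test = max(int(n * test_frac), 1)
        n_val = max(int(n * val_frac), 1) if n > 2 else 0
        test.append(g.iloc[n - n_test:])
        val.append(g.iloc[n - n_test - n_val:n - n_test])
        train.append(g.iloc[:n - n_test - n_val])
    return pd.concat(train), pd.concat(val), pd.concat(test)

df = pd.DataFrame({"user": [0, 0, 0, 1, 1, 1, 1],
                   "item": ["a", "b", "c", "a", "c", "d", "e"],
                   "timestamp": [1, 2, 3, 1, 2, 3, 4]})
train, val, test = temporal_split(df)
print(len(train), len(val), len(test))
```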

5.2 Experimental Results

(a) Real on coco
(b) Fake on coco
(c) Fake-by-pop on coco
(d) Real on ml-10m
(e) Fake on ml-10m
(f) Fake-by-pop on ml-10m
Figure 4: Influence of Upsampling Degree on Trade-off. The trade-off between Normalized Discounted Cumulative Gain (NDCG: red line with bullet markers) and Disparate Exposure ($\Delta EXP$: blue line with star markers) based on the degree of upsampling, varying the upsampling techniques and datasets. Dotted lines indicate the degree of upsampling resulting in a good trade-off (i.e., high NDCG and low $\Delta EXP$). Disparate visibility and relevance showed similar patterns, and are omitted for the sake of clarity and readability.

5.2.1 Comparing Upsampling Techniques (RQ1)

With this experiment, we aim to understand to what degree upsampling influences recommendation utility and the disparate impacts on group relevance, visibility, and exposure, and to investigate how, and how much, we should up-sample to obtain a good trade-off among the metrics. Although our exploratory study revealed that pairing the percentage of observations for the minority group with its percentage of contribution in the catalog may be the best choice, observations in the real world show several imbalances that may distort the output relevance. Hence, we experiment with different degrees of upsampling, not just targeting a minority-group representation in the observations equal to its representation in the catalog.

To this end, for each dataset and upsampling technique, we created a range of model instances fed with different amounts of up-sampled data, using the upsampling techniques described in Section 4. Figure 4 depicts NDCG and $\Delta EXP$ at increasing percentages of minority observation upsampling. Patterns related to $\Delta REL$ and $\Delta VIS$ were similar to the ones obtained for $\Delta EXP$, so we do not report them for conciseness and readability. The plots show that NDCG tended to decrease when the amount of up-sampled data became larger. The loss in recommendation utility depends on the dataset and the technique, with fake suffering from the largest loss. Conversely, we observed that $\Delta EXP$ achieved its lowest value at an intermediate degree of upsampling, depending on the dataset. This behavior comes from the fact that, for small upsampling amounts, the model tended to show a disparate impact in favor of the majority group. Increasing the upsampling leads the minority to get more and more exposure; this can reach the point where the majority is affected by a disparate impact, i.e., the minority group is favored more than expected (e.g., in Figure 4(a), for the largest upsampling amounts).

Moving to the comparison of the results across datasets, coco experiences a lower loss in NDCG than ml-10m for the same upsampling technique. Interestingly, for small upsampling amounts, NDCG even increases on coco with respect to the baseline, which does not make use of upsampling. Furthermore, coco is more susceptible to the amount of upsampling, resulting in larger variations of $\Delta EXP$. Considering the same dataset and observing the patterns for different upsampling techniques, it can be observed that real preserves a good level of NDCG, even for high amounts of upsampling. Conversely, $\Delta EXP$ follows similar patterns for all the upsampling techniques, with the exception of real on ml-10m, which showed a decreasing yet noisy trend. Therefore, while upsampling in general is beneficial for controlling $\Delta EXP$, each technique preserves differently the NDCG originally achieved by the model, changing the trade-off between recommendation utility and disparate impacts.

Observation 3. The upsampling of minority-group observations reduces disparate impacts, i.e., the inequality of exposure, visibility, and relevance with respect to the contribution of the minority group in the catalog. The loss in recommendation utility is negligible or even absent in many cases. The amount of needed upsampling depends on the dataset and the upsampling technique.

Data   | Type        | NDCG    | ΔREL    | ΔVIS    | ΔEXP    | COV    | COV_m  | COV_M
coco   | baseline    | 0.0153  | 0.0770  | 0.0733  | 0.0686  | 0.2165 | 0.1413 | 0.2321
coco   | real        | 0.0157  | 0.0067* | 0.0077* | 0.0018* | 0.2523 | 0.2906 | 0.2443
coco   | fake        | 0.0140* | 0.0347* | 0.0351* | 0.0302* | 0.2494 | 0.2504 | 0.2491
coco   | fake-by-pop | 0.0197* | 0.0231* | 0.0243* | 0.0129* | 0.2202 | 0.1444 | 0.2361
ml-10m | baseline    | 0.0344  | 0.0253  | 0.0361  | 0.0347  | 0.1654 | 0.1224 | 0.1682
ml-10m | real        | 0.0302* | 0.0037* | 0.0047* | 0.0009* | 0.1734 | 0.1776 | 0.1732
ml-10m | fake        | 0.0343  | 0.0085* | 0.0077* | 0.0088* | 0.1725 | 0.1879 | 0.1715
ml-10m | fake-by-pop | 0.0336* | 0.0188* | 0.0163* | 0.0171* | 0.1638 | 0.1069 | 0.1675
Table 1: Impact of Upsampling on Recommended Lists. Normalized Discounted Cumulative Gain (NDCG); Disparate Relevance (ΔREL), Disparate Visibility (ΔVIS), and Disparate Exposure (ΔEXP) based on the minority contribution in the catalog; Coverage of the catalog (COV), of items from the minority group (COV_m), and of items from the majority group (COV_M). For each setting, we report results for the upsampling levels identified with dotted lines in Figure 4. (*) indicates scores statistically different with respect to the baseline (paired t-test).

To characterize the peculiarities of each upsampling technique, Table 1 reports recommendation utility, disparate impact, and coverage for representative settings, which achieved a good trade-off. The results show that, in general, upsampling brings benefits in disparate impacts and coverage, while preserving recommendation utility. Specifically, on coco, real experienced a very low disparate impact at all levels (i.e., relevance, visibility, exposure) and doubled the coverage of minority-group items (i.e., column COV_m). Conversely, fake-by-pop allowed us to improve the original recommendation utility, but disparate impact and coverage did not experience the same gains as with real. On ml-10m, similar patterns were observed for real, even though the loss in NDCG was larger. Compared with coco, fake and fake-by-pop achieved a better trade-off among the metrics on ml-10m.

Observation 4. Upsampling real existing observations involving the minority (real) makes it possible to achieve the best trade-off among recommendation utility, disparate impacts, and coverage. This holds regardless of the dataset. Upsampling minority-group observations via fake user-item interactions (fake and fake-by-pop) is suitable when the minority group is very small.

5.2.2 Benchmarking Combined Treatments (RQ2)

Even though upsampling made it possible to achieve good trade-offs, there are still disparate impacts that should be reduced. Hence, in this experiment, we are interested in understanding the impact of regularization on the representative settings considered in the previous section. To this end, we applied the regularization described in Section 4 to each of the settings reported in Table 1. Given that the disparate impacts to be reduced are small, we adopted a small regularization weight $\lambda$; our empirical results with lower or larger values led to unreasonable variations in NDCG and/or disparate impact.

The results in Table 2 show the recommendation utility, disparate impact, and coverage achieved by the model instances trained with upsampling and regularization jointly. Comparing the results between baseline and reg, it can be observed that a plain regularization, without upsampling, fails to bring a proper reduction of the disparate impact. This is caused by the fact that the regularization depends on the amount of minority-group observations, and that amount is small when upsampling is not performed. Conversely, the regularization can introduce benefits for the other settings, especially for the fake and fake-by-pop settings. The observation we can draw is the following.

Observation 5. Combining regularization and upsampling is crucial to fine-tune trade-offs achieved with the upsampling-only instance, especially when the up-sampled user-item observations are fake.

The regularization is essential to fine-tune the trade-off in cases where the upsampling alone does not allow reducing the disparate impact any further. On both coco and ml-10m, this effect is observed for fake and fake-by-pop. With a small loss in NDCG, disparate impact and coverage experienced significant improvements. Under the real scenario, the regularization helps to improve NDCG, with a small loss in the other metrics. In other words, each upsampling technique, combined with regularization, leads to a good trade-off between recommendation utility and disparate impacts. Notably, an upsampling of real existing minority-group observations shows a wider coverage of minority-group items with respect to the other settings.

While it is the responsibility of scientists to bring forth the discussion about metrics and trade-offs, and possibly to design models that control them by tuning parameters, it should be noted that it is ultimately up to the stakeholders to select the metrics and the trade-offs most suitable for their objectives.

Data    Type             NDCG     Disp.Rel.  Disp.Vis.  Disp.Exp.  Cov.(all)  Cov.(min.)  Cov.(maj.)
coco    reg              0.1801*  0.0664*    0.0665*    0.0654*    0.2570     0.1963      0.2699
                         +0.0270  -0.0106    -0.0108    -0.0032    +0.0405    +0.0550     +0.0378
        real+reg         0.0176*  0.0137*    0.0144*    0.0086*    0.2418     0.2723      0.2353
                         +0.0019  +0.0070    +0.0067    +0.0068    -0.0105    -0.0183     -0.0110
        fake+reg         0.0136   0.0042*    0.0061*    0.0101*    0.2580     0.2581      0.2579
                         -0.0004  -0.0305    -0.0290    -0.0201    +0.0086    +0.0076     +0.0088
        fake-by-pop+reg  0.0190   0.0180*    0.0193*    0.0063*    0.2601     0.1791      0.2772
                         -0.0007  -0.0054    -0.0050    -0.0066    +0.0401    +0.0347     +0.0411
ml-10m  reg              0.0338   0.0213*    0.0213*    0.0198*    0.1623     0.1207      0.1650
                         -0.0006  -0.0154    -0.0148    -0.0149    -0.0031    -0.0017     -0.0032
        real+reg         0.0379*  0.0033     0.0059     0.0028     0.1664     0.1599      0.1669
                         +0.0077  -0.0004    +0.0012    +0.0021    -0.0070    -0.0177     -0.0064
        fake+reg         0.0334   0.0031*    0.0023*    0.0052*    0.1764     0.1939      0.1752
                         -0.0009  -0.0054    -0.0052    -0.0031    +0.0041    +0.0064     +0.0037
        fake-by-pop+reg  0.0327   0.0019*    0.0020*    0.0004*    0.1684     0.1173      0.1718
                         -0.0007  -0.0172    -0.0140    -0.0167    +0.0043    +0.0122     +0.0043
Table 2: Impact of Regularization on Recommended Lists. Columns: Normalized Discounted Cumulative Gain (NDCG); disparate relevance (Disp.Rel.), disparate visibility (Disp.Vis.), and disparate exposure (Disp.Exp.) based on group contribution in the catalog; coverage of the catalog (Cov.(all)), of minority-group items (Cov.(min.)), and of majority-group items (Cov.(maj.)). Below each row of scores, we report the gain/loss of the regularized setting with respect to the corresponding non-regularized setting in Table 1. (*) indicates scores statistically different from the non-regularized version (paired t-test).
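The significance markers in Table 2 result from paired comparisons against the non-regularized counterparts; the snippet below sketches how such a comparison could be run, assuming paired per-user (or per-fold) metric samples. The significance level alpha is only a placeholder, since the threshold is not reproduced in the caption above.

    from scipy import stats

    def compare_settings(metric_reg, metric_base, alpha=0.05):
        # metric_reg / metric_base: paired samples (e.g., per-user NDCG) for the
        # regularized setting and its non-regularized counterpart
        gain = sum(metric_reg) / len(metric_reg) - sum(metric_base) / len(metric_base)
        _, p_value = stats.ttest_rel(metric_reg, metric_base)
        return gain, p_value < alpha  # average gain/loss and whether to mark the score with '*'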

5.2.3 Provider-level Walk-through Inspection of the Treatment (RQ3)

Next, we analyze how our treatment acts on the internal mechanisms of the user-item relevance learning step, and how these internal changes influence the recommended lists. To this end, we focus on a walk-through example of the problem and how our treatment addresses it. The goal is to understand where and how our treatment supports minority providers.

Figure 5: Walk-through Example. Internal and external properties concerning minority providers on coco, considering a baseline recommender and its corresponding treatments with fake upsampling (a fixed share of minority data) and a regularization (with a fixed weight). Panels: (a) minority instances: number of triplets where the minority group is involved for the observed/unobserved item; (b) provider triplets: average number of triplets where a minority provider is involved for the observed item; (c) provider margin: average margin between observed and unobserved items in a triplet, for triplets involving observed items of a minority provider; (d-f) provider relevance, visibility, and exposure: average relevance, visibility, and exposure proportion assigned to items of a minority provider.

To characterize our treatment, we consider the baseline recommender optimized on coco data. We are interested in showing how our treatment based on fake upsampling (a fixed share of minority data), followed by a regularization (with a fixed weight), changes the internal and external properties shown by the baseline. Similar observations still apply to the other settings. Figure 5(a) depicts the number of training triplets wherein an item delivered by a minority provider appears as an observed item (positive) or an unobserved item (negative). Being under-represented in the observations, items of minority-group providers appear less frequently as an observed item under the baseline setting (left-most pair of bars). It follows that the average number of triplets per provider, where a given minority provider is involved for the observed item, is limited, as reported in Figure 5(b) (left-most box plot). These imbalances strongly limit the ability of the pair-wise optimization to compute good margins between the observed and the unobserved item, when the former is delivered by a minority provider (Figure 5(c) - left-most box plot). With our upsampling, we introduce new user-item observations involving minority providers, resulting in more triplets for the minority group and a higher number of triplets per minority provider, on average (Figure 5(a) and 5(b) - two right-most box plots). This results in larger positive margins between observed and unobserved items for items of a minority provider (see Figure 5(c), fake setting). Despite relying on the same up-sampled data, the regularized version further condenses the margins for observed items of minority providers around the average value (Figure 5(c), fake+reg setting). This treatment fundamentally changes the relevance assigned to items of each minority provider and, by extension, their visibility and exposure, as highlighted in Figure 5(d)-5(f). This gain is reflected at the group level, as shown in the two previous sections.
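The quantities inspected in this walk-through can be recomputed from the training triplets and the learned relevance scores; the sketch below shows one plausible way to derive them. All names are hypothetical, and the handling of items with several providers (counting each minority co-provider once) is an assumption made for illustration.

    from collections import defaultdict
    import numpy as np

    def minority_triplet_stats(triplets, scores, item_providers, provider_group):
        # triplets: list of (user, observed_item, unobserved_item) used by the pairwise optimizer
        # scores: dict (user, item) -> predicted relevance
        # item_providers: dict item -> set of provider ids (an item can have several providers)
        # provider_group: dict provider -> "minority" or "majority"
        pos_count = neg_count = 0                    # Figure 5(a): minority as observed / unobserved item
        triplets_per_provider = defaultdict(int)     # Figure 5(b): triplets per minority provider
        margins_per_provider = defaultdict(list)     # Figure 5(c): margins for minority observed items

        for user, pos, neg in triplets:
            pos_minority = [p for p in item_providers[pos] if provider_group[p] == "minority"]
            neg_minority = [p for p in item_providers[neg] if provider_group[p] == "minority"]
            pos_count += bool(pos_minority)
            neg_count += bool(neg_minority)
            margin = scores[(user, pos)] - scores[(user, neg)]
            for p in pos_minority:
                triplets_per_provider[p] += 1
                margins_per_provider[p].append(margin)

        avg_margin = {p: float(np.mean(m)) for p, m in margins_per_provider.items()}
        return pos_count, neg_count, dict(triplets_per_provider), avg_margin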

5.3 Discussion

Our experiments demonstrate that our metric is suitable for measuring the degree of fairness conveyed by relevance scores with respect to the providers' contribution to the catalog, and that it can also be directly optimized.
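As a minimal illustration of what the metric captures, the sketch below contrasts the share of predicted relevance received by minority-group items with the minority group's share of the catalog. Names are hypothetical, and the exact definition (e.g., normalization and per-user aggregation) is the one given earlier in the paper.

    import numpy as np

    def disparate_relevance(rel_scores, item_groups, minority_catalog_share):
        # rel_scores: predicted relevance of the candidate items for a user (or all users)
        # item_groups: group label ("minority"/"majority") of each corresponding item
        # minority_catalog_share: fraction of the catalog contributed by the minority group
        rel = np.asarray(rel_scores, dtype=float)
        groups = np.asarray(item_groups)
        minority_share = rel[groups == "minority"].sum() / (rel.sum() + 1e-12)
        # positive value: the minority group receives less relevance than its catalog contribution
        return minority_catalog_share - minority_share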

Beyond our empirical work, we believe that our mapping approach, which associates providers' sensitive attributes to items, sheds light on new perspectives of fairness in recommender systems. Many platforms include a range of items whose mapping to the sensitive attributes of the providers is not as direct as in the case of items representing individuals. Existing approaches should move in this direction, and future fairness-aware recommendation approaches will need to embed our mapping to realistically shape real-world conditions. Indeed, this aspect will also drive the creation of new evaluation metrics and protocols that make it possible to investigate algorithmic facets so far under-explored.
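For instance, under our mapping an item created jointly by several providers contributes to the groups of all of its providers; the sketch below computes group-level catalog contributions by splitting each item equally among its co-providers. The equal split is an assumption made for illustration, not necessarily the weighting adopted in the paper.

    def group_catalog_contribution(item_providers, provider_group):
        # item_providers: dict item -> set of provider ids (items can be jointly created)
        # provider_group: dict provider -> "minority" or "majority"
        contribution = {"minority": 0.0, "majority": 0.0}
        for item, providers in item_providers.items():
            share = 1.0 / len(providers)          # split the item equally among its co-providers
            for p in providers:
                contribution[provider_group[p]] += share
        total = sum(contribution.values()) or 1.0
        return {g: c / total for g, c in contribution.items()}  # each group's share of the catalog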

Our study uncovered key connections among core components of the optimization of recommendation models when dealing with provider fairness. These results should further promote the inspection of the internal mechanisms of traditional strategies (e.g., pair-wise and point-wise), enabling a pro-active reaction to unfairness. Despite being relatively simple, our combination of upsampling and regularization provides fairness to target groups of providers, which could not be achieved by either component individually. Beyond being applied alone, our treatment can be envisioned as a pre-processing step for procedures that seek fine-grained control of fairness by acting directly on recommended lists. In this case, our adjusted relevance scores can be fed into post-processing fairness-aware procedures, possibly leading to a new optimization space between fairness and recommendation utility. Our treatment is flexible enough to incorporate other notions of fairness for controlling the relevance output by a recommendation algorithm, opening interesting future-work directions.

Concerning possible limitations of our treatment: first, the validity of our fairness notion depends on the integrity of the platform catalog, which requires auditing the catalog curation for sampling bias stemming from direct discrimination (e.g., an educational platform that refuses to add courses provided by female instructors to its database). Second, our empirical work dealt with scenarios with a very small minority, whose exact share depends on the dataset. There are many domains (or attributes) without this kind of minority, and this may lead to novel extensions and variants, starting from those suggested in this paper. Third, experiments were based on a binary gender construct, since the datasets provide only the two genders “male” and “female”. Although we had no chance to consider non-binary constructs, our formulation can still be applied to attributes with more than two values. Fourth, to better characterize our contributions, we focused on a matrix factorization approach optimized via pair-wise comparisons. Other variants could be tested within our framework as well, since our treatment does not rely on any specific peculiarity of the pair-wise optimization (we used it because it aligns well with top-k recommendation problems). Lastly, we fully recognize that our treatment does not directly guarantee other notions of fairness in the recommended lists; however, we showed that it yields fairer relevance and reduces disparate impacts on visibility and exposure with respect to the contribution of the minority in the catalog. Further, our treatment can be used as a pre-processing step for relevance scores, for notions that claim fairness based on relevance (e.g., equity of attention Biega et al. (2018)).

Despite these limitations, we believe that the notion of equity of relevance and the treatments we devised contribute to shaping a more complete design of recommender systems.

6 Related Work

Our research is inspired by two research areas impacting recommender systems: (i) notions recently formalized in the context of fairness-aware rankings, and (ii) unfairness mitigation procedures for recommended lists.

6.1 Relation to Provider Fairness Notions in Ranking and Recommendation

Equity has been proposed as a norm for provider fairness, especially in people ranking and recommendation. Traditionally, fairness for individuals ensures that exposure is proportional to relevance for every provider, while fairness for groups requires that exposure be equally distributed between groups characterized by sensitive attributes (e.g., gender, race). Biega et al. Biega et al. (2018), Singh and Joachims Singh and Joachims (2018), and Yadav et al. Yadav et al. (2019) consider a notion of fairness based on equity/proportions similar to ours. While also working on provider groups, our work situates fairness in the context of recommender systems, allowing us to (i) account for situations where multiple providers lie behind an item and the same provider can appear more than once in a list, (ii) relate these situations to the objectives and formalism of recommendation metrics, and (iii) introduce a new experimentation of equity acting at the relevance level for provider groups. Indeed, we control unfairness at an earlier stage, targeting a different reference for equity (i.e., catalog contribution, not system-predicted relevance). Further, provider unfairness is traditionally mitigated by assuming access to true unbiased relevances. In practice, these relevances are estimated via machine learning, leading to biased estimates of the relevance scores. Recommender systems are known to be biased from several perspectives (e.g., popularity, presentation, unfairness for users and providers). With this in mind, we control how relevance scores are distributed fairly across groups.

Comparing an outcome distribution (e.g., ranked lists) with a population distribution was explored by Yang and Stoyanovich Yang and Stoyanovich (2017) and Sapiezynski et al. Sapiezynski et al. (2019). Differently from us, Sapiezynski et al. model the uncertainty of the group membership of a given individual, and do not deal with contexts where more than one provider lies behind an item. Further, their outcome distribution is linked to a population distribution, under the assumption that the items the vendor chooses to show in the top-k are a proportional representation of a subset of the catalog, sub-sampled via machine learning. Compared to ours, this assumption may underestimate the real representation in the catalog. Furthermore, Yang and Stoyanovich compute the difference in the proportion of members of the protected group in the top-k and in the overall population. Compared to them, we target a proportion between relevance and contribution, not between items from a minority group in a top-k list and in the catalog. Their formulations complement our ideas, as they drive fairness optimization at different levels, and it would be interesting to see how these formulations can be combined.

Other fairness definitions in practice lead to enhanced fairness in exposure, for instance, by requiring equal proportions of individuals from different groups in ranking prefixes Celis et al. (2018); Zehlike et al. (2017); Zehlike and Castillo (2020). Mehrotra et al. Mehrotra et al. (2018) achieved fairness through a re-ranking function, which balances accuracy and fairness by adding a personalized bonus to items of uncovered providers. Similarly, Burke et al. Burke et al. (2018) define the concept of local fairness and identify protected groups based on local conditions. In contrast to this literature, we study metrics that have a clear link between contribution, observation, and relevance. The setup we study in this paper is very different, as we interpret equity of relevance such that providers are supposed to receive relevance and, possibly, visibility and exposure, according to their contribution in the catalog.

Furthermore, Patro et al. Patro et al. (2020) account for uniform exposure over providers, while we deal with a relevance proportional to the provider group's contribution. Moreover, their definition assumes that items are not shareable, i.e., no item is allocated to multiple providers. Kamishima et al. Kamishima et al. (2018) model fairness as independence between the predicted rating values and the sensitive values of the providers, without taking into account any notion of equity between relevance (i.e., predicted rating) and contribution in the catalog. Beutel et al. Beutel et al. (2019) shape fairness of providers in the context of pair-wise optimization, claiming fairness if the likelihood of an observed item being ranked above another relevant unclicked item is the same across both groups. Similarly, Narasimhan et al. Narasimhan et al. (2019) propose a notion of pair-wise equal opportunity, requiring pairs to be equally likely to be ranked correctly regardless of the group membership of the items in a pair. Our notion of fairness differs from such prior work, since it aims to bind relevance and catalog contribution. Considering that the fairness objective is different, the resulting fairness notions are not comparable alternatives, but complementary ways of modelling fairness.

6.2 Relation to Other Treatments for Provider Fairness

There are relationships between our treatment and existing approaches, even though treatments fundamentally vary due to the different fairness notions they are driven by.

Pre-processing for fairness in recommender systems has been considered in the context of consumer fairness. Rastegarpanah et al. Rastegarpanah et al. (2019) proposed to add new fake users who provide ratings on existing items, to minimize the losses of all user groups, computed as the mean squared estimation error over all known ratings in each group. Working on the provider side instead, our upsampling extends the observations of real users and items, and aims at adjusting observations involving minority providers.

In-processing regularization in recommender systems has traditionally focused on point-wise scenarios. Kamishima et al. Kamishima et al. (2018) introduce a regularization requiring that the distance between the distributions of predicted ratings for items belonging to two different groups is as small as possible. However, compared to the pair-wise optimization we leveraged, this way of optimizing says little about the recommended lists that users actually see. Further, they do not take into account to what degree the ratings of different groups are proportional to the provider groups' contribution in the catalog.

Beutel et al. Beutel et al. (2019) targeted provider fairness optimization under a pair-wise optimization scenario, similarly to us. However, while pair-wise comparisons are fundamental to enable Beutel et al.'s treatment, our treatment is merely tested under a pair-wise optimization scenario and does not leverage any peculiarity of this scenario. Further, while both works are tested on binary attributes, we generalize to a wider variety of groups and to contexts where items are associated with more than one provider. Their training methodology is also very different: the fixed regularization term they add to the loss function is based on the correlation between the residual estimate and the group membership, and their base approach does not offer a way to control the trade-off between fairness and accuracy. These conceptual and operative differences lead us to investigate clearly different, under-explored facets. Further, compared to our work, they are driven by a different fairness objective, making the two treatments not directly comparable to each other. It would be interesting to see how they can be integrated, taking the benefits of both notions, but this requires non-trivial extensions left as future work.

Finally, other fairness-aware approaches, whose notions of fairness were presented in the previous section, are operationalized in quite different ways. Biega et al. Biega et al. (2018) solve an integer linear program. Patro et al. Patro et al. (2020) implement a greedy round-robin strategy. Similarly to us, Zehlike and Castillo Zehlike and Castillo (2020) use stochastic gradient descent, but they operationalize it in a list-wise manner. The fact that our work is driven by a different motivation and objective does not allow for a meaningful comparison against these existing methods.

7 Conclusions

In this paper, we introduce the concept of equity of relevance, requiring that the relevance given to a group of providers by a recommendation algorithm must be proportional to their contribution in the catalog of items. To operationalize this definition, we propose a treatment that combines upsampling of observations from the minority group and regularization of the equity of the respective relevance throughout the optimization process.

Our experimental study analyzes relevance scores and recommended lists generated on fifteen synthetic datasets, which simulate specific situations of imbalance in the catalog and in the observations, and on two real-world datasets, which represent existing conditions in modern platforms. Our first exploratory results highlight that the discrepancy between the relevance given to provider groups by recommendation models and their contribution in the catalog is not negligible. This effect results in less (than expected) visibility and exposure for the minority group. From these observations, we argue that improving equity of relevance is crucial, as it also leads to less disparity in visibility and exposure, and can often be achieved without sacrificing much recommendation utility. Incorporating such fairness mechanisms makes it possible to act directly on the output of the recommendation model and to mitigate distortions at an earlier step, which would also be useful for post-processing fairness procedures.
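For concreteness, the sketch below shows one plausible way to quantify the visibility and exposure received by the minority group in the recommended lists referred to above: visibility as the share of recommended slots, exposure as a rank-discounted share. The logarithmic discount follows common exposure definitions in the ranking-fairness literature and is an assumption here, as are all the names.

    import math

    def minority_visibility_exposure(recommended_lists, item_group, k=10):
        # recommended_lists: one ranked list of item ids per user
        # item_group: dict item -> "minority" or "majority"
        # k: cut-off of the recommended list (illustrative default)
        vis_num = vis_den = exp_num = exp_den = 0.0
        for ranked_items in recommended_lists:
            for rank, item in enumerate(ranked_items[:k], start=1):
                weight = 1.0 / math.log2(rank + 1)   # rank-discounted exposure weight
                vis_den += 1.0
                exp_den += weight
                if item_group[item] == "minority":
                    vis_num += 1.0
                    exp_num += weight
        return vis_num / max(vis_den, 1.0), exp_num / max(exp_den, 1e-12)  # visibility share, exposure share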

Future work will build on these insights to devise mitigation methods that treat provider fairness promotion as a temporal process. The improvement in provider fairness might not be large immediately, and we argue that repeating our treatment over time will lead to increasingly fair recommendations, better fitting real-world situations and platforms. We will also investigate the relation between the recommendations returned by the algorithm and the tendency of each user to prefer items from different groups of providers. Finally, we aim to devise other regularization approaches that link internal parameters to metrics and control the interpretability of the returned lists.

Acknowledgements.

Mirko Marras acknowledges Sardinia Regional Government for the financial support of his PhD scholarship (P.O.R. Sardegna F.S.E. Operational Programme of the Autonomous Region of Sardinia, European Social Fund 2014-2020 - Axis III Education and Training, Thematic Goal 10, Priority of Investment 10ii, Specific Goal 10.5). Ludovico Boratto acknowledges Agència per a la Competivitat de l’Empresa, ACCIÓ, for their support under project “Fair and Explainable Artificial Intelligence (FX-AI)”.

References

  • A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, and C. Goodrow (2019) Fairness in recommendation ranking through pairwise comparisons. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, A. Teredesai, V. Kumar, Y. Li, R. Rosales, E. Terzi, and G. Karypis (Eds.), pp. 2212–2220.
  • A. J. Biega, K. P. Gummadi, and G. Weikum (2018) Equity of attention: amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, K. Collins-Thompson, Q. Mei, B. D. Davison, Y. Liu, and E. Yilmaz (Eds.), pp. 405–414.
  • R. Burke, N. Sonboli, and A. Ordonez-Gauger (2018) Balanced neighborhoods for multi-sided fairness in recommendation. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, S. A. Friedler and C. Wilson (Eds.), Proceedings of Machine Learning Research, Vol. 81, pp. 202–214.
  • R. Burke (2017) Multisided fairness for recommendation. CoRR abs/1707.00093.
  • P. G. Campos, F. Díez, and I. Cantador (2014) Time-aware recommender systems: a comprehensive survey and analysis of existing evaluation protocols. User Model. User-Adapt. Interact. 24 (1-2), pp. 67–119.
  • L. E. Celis, D. Straszak, and N. K. Vishnoi (2018) Ranking with fairness constraints. In 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech Republic, I. Chatzigiannakis, C. Kaklamanis, D. Marx, and D. Sannella (Eds.), LIPIcs, Vol. 107, pp. 28:1–28:15.
  • J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T. Chua (2017) Attentive collaborative filtering: multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–344.
  • L. Chen, R. Ma, A. Hannák, and C. Wilson (2018) Investigating the impact of gender on rank in resume search engines. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI 2018, Montreal, QC, Canada, April 21-26, 2018, R. L. Mandryk, M. Hancock, M. Perry, and A. L. Cox (Eds.), pp. 651.
  • D. Dessì, G. Fenu, M. Marras, and D. R. Recupero (2018) COCO: semantic-enriched collection of online courses at scale with experimental use cases. In Trends and Advances in Information Systems and Technologies - Volume 2 [WorldCIST’18, Naples, Italy, March 27-29, 2018], Á. Rocha, H. Adeli, L. P. Reis, and S. Costanzo (Eds.), Advances in Intelligent Systems and Computing, Vol. 746, pp. 1386–1396.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel (2012) Fairness through awareness. In Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA, January 8-10, 2012, S. Goldwasser (Ed.), pp. 214–226.
  • F. M. Harper and J. A. Konstan (2016) The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5 (4), pp. 19:1–19:19.
  • D. Jannach and M. Jugovac (2019) Measuring the business value of recommender systems. ACM Trans. Management Inf. Syst. 10 (4), pp. 16:1–16:23.
  • K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20 (4), pp. 422–446.
  • M. Kaminskas and D. Bridge (2017) Diversity, serendipity, novelty, and coverage: a survey and empirical analysis of beyond-accuracy objectives in recommender systems. ACM Trans. Interact. Intell. Syst. 7 (1), pp. 2:1–2:42.
  • T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma (2018) Recommendation independence. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, S. A. Friedler and C. Wilson (Eds.), Proceedings of Machine Learning Research, Vol. 81, pp. 187–201.
  • Y. Koren, R. Bell, and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer (8), pp. 30–37.
  • P. Lahoti, K. P. Gummadi, and G. Weikum (2019) Operationalizing individual fairness with pairwise fair representations. Proc. VLDB Endow. 13 (4), pp. 506–518.
  • B. Liu, Y. Su, D. Zha, N. Gao, and J. Xiang (2019) CARec: content-aware point-of-interest recommendation via adaptive bayesian personalized ranking. Aust. J. Intell. Inf. Process. Syst. 15 (3), pp. 61–68.
  • M. Mansoury, B. Mobasher, R. Burke, and M. Pechenizkiy (2019) Bias disparity in collaborative recommendation: algorithmic evaluation and comparison. In Proceedings of the Workshop on Recommendation in Multi-stakeholder Environments co-located with the 13th ACM Conference on Recommender Systems (RecSys 2019), Copenhagen, Denmark, September 20, 2019, R. Burke, H. Abdollahpouri, E. C. Malthouse, K. P. Thai, and Y. Zhang (Eds.), CEUR Workshop Proceedings, Vol. 2440.
  • R. Mehrotra, J. McInerney, H. Bouchard, M. Lalmas, and F. Diaz (2018) Towards a fair marketplace: counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018, A. Cuzzocrea, J. Allan, N. W. Paton, D. Srivastava, R. Agrawal, A. Z. Broder, M. J. Zaki, K. S. Candan, A. Labrinidis, A. Schuster, and H. Wang (Eds.), pp. 2243–2251.
  • H. Narasimhan, A. Cotter, M. R. Gupta, and S. Wang (2019) Pairwise fairness for ranking and regression. CoRR abs/1906.05330.
  • G. K. Patro, A. Biswas, N. Ganguly, K. P. Gummadi, and A. Chakraborty (2020) FairRec: two-sided fairness for personalized recommendations in two-sided platforms. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, Y. Huang, I. King, T. Liu, and M. van Steen (Eds.), pp. 1194–1204.
  • B. Rastegarpanah, K. P. Gummadi, and M. Crovella (2019) Fighting fire with fire: using antidote data to improve polarization and fairness of recommender systems. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, J. S. Culpepper, A. Moffat, P. N. Bennett, and K. Lerman (Eds.), pp. 231–239.
  • S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2012) BPR: bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618.
  • F. Ricci, L. Rokach, and B. Shapira (2015) Recommender systems: introduction and challenges. In Recommender Systems Handbook, F. Ricci, L. Rokach, and B. Shapira (Eds.), pp. 1–34.
  • P. Sánchez and A. Bellogín (2020) Applying reranking strategies to route recommendation using sequence-aware evaluation. User Model. User-Adapt. Interact.
  • P. Sapiezynski, W. Zeng, R. E. Robertson, A. Mislove, and C. Wilson (2019) Quantifying the impact of user attention on fair group representation in ranked lists. CoRR abs/1901.10437.
  • A. Singh and T. Joachims (2018) Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, Y. Guo and F. Farooq (Eds.), pp. 2219–2228.
  • E. Walster, E. Berscheid, and G. W. Walster (1973) New directions in equity research. Journal of Personality and Social Psychology 25 (2), pp. 151.
  • J. Xiao, H. Ye, X. He, H. Zhang, F. Wu, and T. Chua (2017) Attentional factorization machines: learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617.
  • H. Xue, X. Dai, J. Zhang, S. Huang, and J. Chen (2017) Deep matrix factorization models for recommender systems. In IJCAI, pp. 3203–3209.
  • H. Yadav, Z. Du, and T. Joachims (2019) Fair learning-to-rank from implicit feedback. CoRR abs/1911.08054.
  • K. Yang and J. Stoyanovich (2017) Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, IL, USA, June 27-29, 2017, pp. 22:1–22:6.
  • S. Yao and B. Huang (2017) Beyond parity: fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 2921–2930.
  • M. Zehlike, F. Bonchi, C. Castillo, S. Hajian, M. Megahed, and R. Baeza-Yates (2017) FA*IR: a fair top-k ranking algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06-10, 2017, E. Lim, M. Winslett, M. Sanderson, A. W. Fu, J. Sun, J. S. Culpepper, E. Lo, J. C. Ho, D. Donato, R. Agrawal, Y. Zheng, C. Castillo, A. Sun, V. S. Tseng, and C. Li (Eds.), pp. 1569–1578.
  • M. Zehlike and C. Castillo (2020) Reducing disparate exposure in ranking: a learning to rank approach. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, Y. Huang, I. King, T. Liu, and M. van Steen (Eds.), pp. 2849–2855.
  • I. Zliobaite (2017) Measuring discrimination in algorithmic decision making. Data Min. Knowl. Discov. 31 (4), pp. 1060–1089.