A Survey of Research on Fair Recommender Systems

Recommender systems can strongly influence which information we see online, e.g., on social media, and thus impact our beliefs, decisions, and actions. At the same time, these systems can create substantial business value for different stakeholders. Given the growing potential impact of such AI-based systems on individuals, organizations, and society, questions of fairness have gained increased attention in recent years. However, research on fairness in recommender systems is still a developing area. In this survey, we first review the fundamental concepts and notions of fairness that were put forward in the area in the recent past. Afterward, we provide a survey of how research in this area is currently operationalized, for example, in terms of the general research methodology, fairness metrics, and algorithmic approaches. Overall, our analysis of recent works points to certain research gaps. In particular, we find that very abstract problem operationalizations are prevalent in many research works in computer science, which circumvents the fundamental and important question of what represents a fair recommendation in the context of a given application.

1 Introduction

Recommender systems (RS) are one of the most visible and successful applications of AI technology in practice, and personalized recommendations, as provided on many modern e-commerce or media sites, can have a substantial impact on different stakeholders. On e-commerce sites, for example, the choices of consumers can be largely influenced by recommendations, and these choices are often directly related to the profitability of the platform. On news websites or social media, on the other hand, personalized recommendations may determine to a large extent which information we see, which in turn may shape not only our own beliefs, decisions, and actions, but also the beliefs of a community of users or an entire society.

In academia, recommenders have historically been considered "benevolent" systems that create value for consumers, e.g., by helping them find relevant items, with the assumption that this value for consumers then translates into value for businesses, e.g., due to higher sales numbers or increased customer retention jannachjugovactmis2019. Only in recent years has more awareness been raised regarding possible negative effects of automated recommendations, e.g., that they may promote items on an e-commerce site that mainly maximize the profit of providers or that they may lead to an increased spread of misinformation on social media.

Given the potentially significant effects of recommendations on different stakeholders, researchers increasingly argue that providing recommendations may raise various ethical questions and should thus be done in a responsible way trattner2022aiethics. One important ethical question in this context is that of the fairness of a recommender system, see Burke2017Fairness; EkstrandFoundations2021, reflecting related discussions at the more general level of fair machine learning and fair AI mehrabi2021fairnesssurvey; barocas-hardt-narayanan.

During the last few years, researchers have discussed and analyzed different dimensions in which a recommender system should be fair or, vice versa, may lead to a lack of fairness. Given the nature of fairness as a social construct, it however seems difficult (or even impossible EkstrandFoundations2021) to establish a general definition of what represents a fair recommendation. Besides the subjective nature of fairness, there are also often competing interests of different stakeholders to be considered in real-world recommendation settings abdollahpouri2020.

With this survey, our goal is to provide an overview of what has been achieved in this emerging area so far and to highlight potential research gaps. Specifically, drawing on an analysis of more than 130 recent papers in computer science, we investigate: (i) which dimensions and definitions of fairness in RS have been identified and established, (ii) which application scenarios researchers target and which examples they provide, and (iii) how they operationalize the research problem in terms of methodology, algorithms, and metrics. Based on this analysis, we then paint a landscape of current research in various dimensions and discuss potential shortcomings and future directions for research in this area.

Overall, we find that research in computing typically assumes that a clear definition of fairness is available, thus rendering the problem as one of designing algorithms to optimize a given metric. Such an approach may however appear too abstract and simplistic, cf. Selbst2019Fairness, calling for more faceted and multi-disciplinary approaches to research in fairness-aware recommendation.

2 Background: Fairness in Recommender Systems

2.1 Examples of Unfair Recommendations

In the general literature on fair ML/AI, a key use case is the automated prediction of whether a convicted criminal will recidivate. In this case, an ML-based system is usually considered unfair if its predictions depend on demographic aspects like ethnicity and it thus ultimately discriminates against members of certain ethnic groups. Such use cases of ML-based decision-support systems are not the focus of our present work. Instead, we focus on common application areas of RS where personalized item suggestions are made to users, e.g., on e-commerce, media streaming, or news and social media sites.

At first sight, one might think that the recommendation providers here are independent businesses and that it is entirely at their discretion which shopping items, movies, jobs, or social connections they recommend on their platforms. Also, one might assume that the harm done by such recommendations is limited compared, e.g., to the legal decision problem mentioned above. There are, however, a number of situations also in common application scenarios of RS where many people might consider a system unfair in some sense. For example, an e-commerce platform might be considered unfair if it mainly promotes those shopping items that maximize its own profit but not consumer utility. Besides such intentional interventions, there might also be situations where an RS reinforces existing discrimination patterns or biases in the data, e.g., when a system on an employment platform mainly recommends lower-paid jobs to certain demographic groups.

Questions of fairness in RS are however not limited to the consumer’s side. In reality, a recommendation service often involves multiple stakeholders abdollahpouri2020. On a music streaming platform, for example, we not only have the consumers, but also the artists, record labels, and the platform itself, which might have diverging goals that may be affected by the recommendation service. Artists and labels are usually interested in increasing their visibility through the recommendations. Platform providers, on the other hand, might seek to maximize engagement with the service across the entire user base, which might result in promoting mostly already popular artists and tracks with the recommendations. Such a strategy however easily leads to a “rich-get-richer” effect and reduces the chances of less popular artists being exposed to consumers, which might be considered unfair to providers. Finally, there are also use cases where recommendations may have societal impacts, in particular on news and social media sites. Some may for example consider it unfair if a recommender system only promotes content that emphasizes one side of a political discussion or promotes misinformation that is suited to discriminate against certain user groups.

Some of the discussed examples of unfair recommendations might appear to be rather ethical or moral questions or related to an organization’s business model, e.g., when an e-commerce provider does not optimize for consumer value or when niche artists are not frequently exposed to users. However, note that being fair in the examples above may also serve providers, e.g., when consumers establish long-term trust due to valuable recommendations or when they engage more with a music service because they discover more niche content. Finally, there are also legal guardrails that may come into play, e.g., when a large platform uses a monopoly-like market position to put certain providers into an inappropriately bad position. The current draft of the European Commission’s Digital Services Act (https://eur-lex.europa.eu/legal-content/en/TXT/?uri=COM:2020:825:FIN) can be seen as a prime example where recommender systems and their potential harms are explicitly addressed in legal regulations.

Overall, a number of examples exist where recommendations might be considered unfair for different stakeholders. In the context of the survey presented in this work, we are particularly interested in which specific real-world problems related to unfair recommendations are considered in the existing literature.

2.2 Reasons for Unfairness

There are different reasons why a recommender system might exhibit a behavior that may be considered unfair, see EkstrandFoundations2021 and DBLP:journals/ipm/AshokanH21. One common issue mentioned in the literature is that the data on which the machine learning model is trained is biased. Such biases might for example be the result of the specifics of the data collection process, e.g., when a biased sampling strategy is applied. A machine learning model may then “pick up” such a bias and reflect it in the resulting recommendations.

Another source of unfairness may lie in the machine learning model itself, e.g., when it even reinforces existing biases or existing skewed distributions in the underlying data. Differences between recommendation algorithms in terms of reinforcing popularity biases and concentration effects were for example examined in JannachLercheEtAl2015. In some cases, the machine learning model might also directly consider a “protected characteristic” (or a proxy thereof) in its predictions EkstrandFoundations2021. To avoid discrimination, and thus unfair treatment, of certain groups, a machine learning model should therefore not make use of protected characteristics such as age, color, or religion.

Unfairness that is induced by the underlying data or algorithms may arise unknowingly to the recommendation provider. It is however also possible that a certain level of unfairness is designed into a recommendation algorithm, e.g., when a recommendation provider aims to maximize monetary business metrics while at the same time keeping users satisfied as much as possible ghanem2022balancing; JannachAdomaviciusVAMS2017. Likewise, a recommendation provider may have a political agenda and particularly promote the distribution of information that mainly supports their own viewpoints.

Some works finally mention that the “world itself may be unfair or unjust” EkstrandFoundations2021, e.g., due to historical discrimination of certain groups. In the context of algorithmic fairness, which is the topic of our present work, such historical developments are however often not the focus. Rather, the question is to what extent such unfairness is reflected in the data or how it influences the fairness goals, e.g., by implementing affirmative action policies, where the goal is to support traditionally underrepresented groups.

In general, the underlying reasons also determine where in a machine learning pipeline interventions can or should be made to ensure fairness (or to mitigate unfairness); in DBLP:journals/ipm/AshokanH21, Ashokan and Haas review where biases may occur in a typical machine learning pipeline, from data generation, over model building and evaluation, to deployment and user interaction. In a common categorization mehrabi2021fairnesssurvey; Shrestha19Fairness; pitoura2021fairness; zehlike2021fairness, fairness could be addressed (i) in a data pre-processing phase, (ii) during model learning and optimization, and (iii) in a post-processing phase. In particular in the model learning and post-processing phases, fairness-ensuring algorithmic interventions must be guided by an operationalizable (i.e., mathematically expressed) goal. In the case of affirmative action policies, one could for example aim for an equal distribution of recommendations between members of the majority group and members of an underrepresented group. As we will see in Section 4, such a goal is often formalized as a target distribution and/or in the form of an evaluation metric to gauge the level of existing or mitigated fairness.
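To make this concrete, the following minimal sketch (our own illustration, not taken from any of the surveyed papers) shows how such a fairness goal could be expressed as an evaluation metric: it measures how far the observed share of recommendations per group deviates from a prescribed target distribution. The group labels and the 50/50 target are purely hypothetical.

```python
from collections import Counter

def distribution_deviation(recommended_items, item_group, target_share):
    """L1 distance between the observed share of recommendations per group
    and a prescribed target distribution (0 means the target is met exactly)."""
    counts = Counter(item_group[i] for i in recommended_items)
    total = sum(counts.values())
    groups = set(target_share) | set(counts)
    return sum(abs(counts.get(g, 0) / total - target_share.get(g, 0.0))
               for g in groups)

# Hypothetical example: an affirmative-action-style 50/50 target distribution
item_group = {"a": "majority", "b": "majority", "c": "protected", "d": "protected"}
target = {"majority": 0.5, "protected": 0.5}
print(distribution_deviation(["a", "b", "c"], item_group, target))  # ~0.33
```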

2.3 Notions of Fairness

When we deal with phenomena of unfairness like those described, and when our goal is to avoid or mitigate such phenomena, the question naturally arises what we consider to be fair, both in general and in a specific application context. Fairness is fundamentally a societal construct or human value, which has been discussed for centuries in many disciplines like philosophy and moral ethics, sociology, law, or economics. Correspondingly, countless definitions of fairness were proposed in different contexts, see for example Verma et al. Verma18definitions; Verma20facets for a high-level discussion of the definition of fairness in machine learning and ranking algorithms, or Mulligan et al. Mulligan2019ThisThing for the relationship to social science conceptions of fairness.

One popular characterization can be found in mehrabi2021fairnesssurvey, where fairness in the context of decision making is considered as the “absence of any prejudice or favoritism towards an individual or a group based on their intrinsic or acquired traits”. This definition captures two common notions of fairness that are used in the recommender systems literature, where a differentiation between individual fairness and group fairness is often made. Individual fairness roughly expresses that similar individuals should be treated similarly, e.g., candidates with similar qualifications should be ranked similarly in a job recommendation scenario. How to determine similarity is key here, and protected characteristics like religion or gender should not be factors that make candidates dissimilar. Group fairness, in contrast, aims to ensure that “similar groups have similar experience” EkstrandFoundations2021. Typical groups in such a context are a majority or dominant group and a protected group (e.g., an ethnic minority). Questions of group fairness were traditionally discussed in fair ML research in the context of classification problems, and are often referred to as different forms of statistical parity. A fair classifier would therefore assign a member of a protected and an unprotected class with equal probability to the “positive” class, e.g., the class that is assumed to pay back a loan.
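As a toy illustration of statistical (demographic) parity in this classification setting, the following sketch compares the rate of positive predictions between an unprotected and a protected group; a difference of zero indicates parity. The data and group labels are hypothetical and only serve to make the notion concrete.

```python
def demographic_parity_difference(predictions, group_labels, protected="protected"):
    """Difference in positive-prediction rates between the unprotected
    and the protected group; 0 means statistical parity."""
    def positive_rate(group):
        preds = [p for p, g in zip(predictions, group_labels) if g == group]
        return sum(preds) / len(preds)
    return positive_rate("unprotected") - positive_rate(protected)

# Toy data: 1 = predicted to pay back the loan
preds  = [1, 1, 0, 1, 0, 0]
groups = ["unprotected", "unprotected", "unprotected",
          "protected", "protected", "protected"]
print(demographic_parity_difference(preds, groups))  # 2/3 - 1/3 = ~0.33
```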

An in-depth discussion of these (sometimes even incompatible) notions of fairness is beyond the scope of this work, which focuses on an analysis of how scholars in recommender systems operationalize the research problem. For questions of individual fairness, this might relate to the problem of defining a similarity function. For certain group fairness goals, on the other hand, one has to determine which (protected) attributes determine group membership. Furthermore, it is often required to precisely define some target distribution. Later, in Section 4, where we review the current literature, we will introduce additional notions of fairness and their operationalizations as they are found in the studied papers. As we will see, a key point here is that researchers often propose very abstract operationalizations (e.g., in the form of fairness metrics), which was identified earlier as a potential key problem in the broader area of fair ML in Selbst2019Fairness.

2.4 Related Concepts: Responsible Recommendation and Biases

Issues of fairness are often discussed within the broader area of responsible recommendation elahi2021aiethics; ekstrand2021fairness. In elahi2021aiethics, the authors in particular discuss potential negative effects of recommendations and their underlying reasons with a focus on the media domain. Specific phenomena in this domain include the emergence of filter bubbles and echo chambers. There are, however, also other more general potential harms such as popularity biases as well as fairness-related aspects like discrimination that can emerge in media recommendation settings. Fairness is therefore seen as a particular aspect of responsible recommendation in elahi2021aiethics. A similar view is taken in ekstrand2021fairness, where the authors review a number of related concerns of responsibility: accountability, transparency, safety, privacy, and ethics. In the context of our present work, most of these concepts are however only of secondary interest.

More important, however, is the use of the term bias in the related literature. As discussed above, one frequently discussed topic in the area of recommender systems is the problem of biased data chen2020bias; Baeza2018CACM. One issue in this context is that the data that is collected from existing websites—e.g., regarding which content visitors view or what consumers purchase—may not be “natural” but biased by what is shown to users through an already existing recommender system. This, in turn, may lead to biased recommendations when machine learning models reflect or reinforce the bias, as mentioned above. In works that address this problem, the term bias is often used in a more statistical sense, as done in ekstrand2021fairness. However, the use of the term is not consistent in the literature, as observed also in chen2020bias and in our work. In some early papers, bias is used almost synonymously with fairness. In Friedman1996Bias, for example, bias is used to “refer to computer systems that systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others”. In our work, we acknowledge that biased recommendations may be unfair, but we do not generally equate bias with unfairness. Considering the problem of popularity bias in recommender systems, such a bias may lead to an over-proportional exposure of certain items to users. This, however, does not necessarily lead to unfairness in an ethical or legal sense.

3 Research Methodology

In this section, we first describe our methodology for identifying relevant papers for our survey. Afterwards, we briefly discuss how our survey extends previous works in this area.

3.1 Paper Collection Process

We identified relevant research papers in a systematic way Kitchenham04 by querying digital libraries with predefined search terms. Based on our prior knowledge about the literature, we used the following search terms in order to cover a wide range of works in an emerging area, where terminology is not yet entirely unified: fair recommend, fair collaborative system, fair collaborative filtering, bias recommend, debias recommend, fair ranking, bias ranking, unbias ranking, re-ranking recommend, reranking recommend. To identify papers, we queried DBLP and the ACM Digital Library (https://dl.acm.org, https://dblp.org) in their respective search syntax, specifying that the provided keywords must appear in the title of the paper.

From the returned results, we then removed all papers that were published as preprints on arXiv.org (note that DBLP indexes arXiv papers) as well as survey papers. We then manually scanned the remaining 234 papers. In order to be included in this survey, a paper had to fulfill the following additional criteria:

  • It had to be explicitly about fairness, at least by mentioning this concept somewhere in the paper. Papers which, for example, focus on mitigating popularity biases, but which do not mention that fairness is an underlying goal of their work, were thus not considered.

  • It had to be about recommender systems. Given the inclusiveness of our set of query terms, a number of papers were returned which focused on fair information retrieval. Such works were also excluded from our study.

This process left us with 130 papers. The papers were read by at least two researchers and categorized in various dimensions, see Section 4.

3.2 Relation to Previous Surveys

A number of related surveys were published in the last few years. The recent monograph by Ekstrand et al. EkstrandFoundations2021 discusses fairness aspects in the broader context of information access systems, an area which covers both information retrieval and recommender systems. Their comprehensive work in particular includes a taxonomy of various fairness dimensions, which also serves as a foundation of our present work. The survey provided in chen2020bias focuses on biases in recommender systems, and connects different types of biases, e.g., popularity biases, with questions of fairness, see also Abdollahpouri2020Connection. A categorization of different types of biases is provided in the work along with a review of existing approaches to bias mitigation. Both works, EkstrandFoundations2021 and chen2020bias, are different from our present work as our goal is not to provide a novel categorization of fairness concepts or algorithms used in the literature. Instead, our main goal is to investigate the current state of existing research, e.g., in terms of which concepts and algorithmic approaches are predominantly investigated and where there might be research gaps.

Different survey papers were published also in the more general area of fair machine learning or fair AI, as mentioned above mehrabi2021fairnesssurvey; barocas-hardt-narayanan. Clearly, many questions and principles of fair AI apply also to recommender systems, which can be seen as a highly successful area of applied machine learning. Differently from such more general works, however, our present work focuses on the particularities of fairness in recommender systems.

4 A Landscape of Research

In this section, we categorize the identified literature along different dimensions to paint a landscape of current research and to identify existing research gaps.

4.1 Publication Activity per Year

Interest in fairness in recommender systems has been growing constantly over the past few years. Figure 1 shows the number of papers per year that were considered in our survey. Questions of fairness in information retrieval have been discussed for many years, see, e.g., Pedreshi2008Distriminationaware for an earlier work. In the area of recommender systems, however, the earliest paper we identified through our search, which only considers papers in which fairness is explicitly addressed, was published in 2017.

Figure 1: Number of papers published per year.

4.2 Types of Contributions

Academic research on recommender systems in general is largely dominated by algorithmic contributions, and we correspondingly observe a large number of new methods that are published every year. Clearly, building an effective recommender system requires more than a smart algorithm, e.g., because recommendation to a large extent is also a problem of human-computer interaction and user experience design JannachResnickEtAl2016; jannach2021aimagintro. When questions of fairness should be considered as well, the problem becomes even more complex, as for example ethical questions may come into play and we may be interested in the impact of recommendations on individual stakeholders, including society.

In the context of our study, we were therefore interested in which general types of contributions we find in the computer science and information systems literature on fair recommendation. Based on the analysis of the relevant papers, we first identified two general types of works: (a) technical papers, which, e.g., propose new algorithms, protocols, and metrics or analyze data, and (b) conceptual papers. The latter class of papers is diverse and includes, for example, papers that discuss different dimensions of fair recommendations, papers that propose conceptual frameworks, or works that connect fairness with other quality dimensions like diversity.

We then further categorized the technical papers in terms of their specific technical type of contribution. The main categories we identified are (a) algorithm papers, which for example propose re-ranking techniques, (b) analytic papers, which for example study the outcomes of a given algorithm, and (c) methodology papers, which propose new metrics or evaluation protocols.

Figure 2 shows how many papers in our survey were considered as technical and conceptual papers. Non-technical papers cover a wide range of contributions, such as guidelines for designers to avoid compounding previous injustices DBLP:conf/um/Schelenz21, exploratory studies that investigate user perceptions of fairness Sonboli2021Fairness, or discussions about how difficult it is to audit these types of systems DBLP:conf/bias/KrafftHZ20.

Figure 2: Technical vs. Conceptual Papers.

We observe that today’s research on fairness in recommender systems is dominated by technical papers. In addition, we find that the majority of these works focuses on improved algorithms, e.g., to debias data or to obtain a fairer recommendation outcome through list re-ranking. To some extent this is expected, as we focus on the computer science literature. However, we have to keep in mind that the concepts of fairness and unfairness are social constructs and may depend on a variety of environmental factors in which a recommender system is deployed. As such, the research focus in the area of fair recommender systems seems rather narrow and centered on algorithmic solutions. As we will observe later, however, such algorithmic solutions commonly assume that a pre-existing and mathematically defined optimization goal is available, e.g., a target distribution of recommendations. In practical applications, the major challenges mostly lie (a) in establishing a common understanding and agreement on such a fairness goal and (b) in finding or designing an operationalizable optimization goal (e.g., a computational metric) which represents a reliable measure or proxy for the given fairness goal.

4.3 Notions of Fairness

In Li2021tutorial, a taxonomy of different notions of fairness was introduced. In the following, we review the literature following this taxonomy.

Group Fairness vs. Individual Fairness

A very common differentiation in fair recommendation is to distinguish between group fairness and individual fairness, as indicated before. With group fairness, the goal is to achieve some sort of statistical parity between protected groups Binns2019Apparent. In fair machine learning, a traditional goal often is to ensure that there are equal numbers of members of each protected group in the outcome, e.g., when it comes to producing a ranked list of job candidates. The protected groups in such situations are commonly determined by characteristics like age, gender, or ethnicity. Achieving individual fairness in the described scenario means that candidates with similar characteristics should be treated similarly. To operationalize this idea, some distance metric is therefore needed to assess the similarity of individuals. This can be a challenging task, since there is no consensus on the notion of similarity, and it may be task-specific Dwork12. Ideas of individual fairness in machine learning were discussed in an early work in Dwork12, where it was also observed that achieving group fairness might lead to unfair treatment at the individual level. In the candidate ranking example, favoring members of protected groups to achieve parity might ultimately result in the non-consideration of a better qualified candidate from a non-protected group. As a result, group and individual fairness are frequently viewed as trade-offs, which is not always immediately evident Binns2019Apparent.

Figure 3: Individual vs. Group Fairness.

Figure 3 shows how many of the surveyed papers focus on each category. The figure shows that research on group fairness is more common than work that adopts the concept of individual fairness. Only in rare cases are both types of fairness considered.

Group fairness entails comparing, on average, the members of the privileged group against the unprivileged group. One overarching way to characterize research papers on group fairness is the distinction between (i) the benefit type (exposure vs. relevance) and (ii) the major stakeholders (consumer vs. provider). Exposure relates to the degree to which items or item groups are exposed uniformly to all users/user groups. Relevance (accuracy) indicates how effective an item’s exposure is, i.e., how well it meets the user’s preference. For recommender systems, where users are first-class citizens, there are multiple stakeholders: consumers, producers, and side stakeholders (see next section).
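The two benefit types can be illustrated with a small sketch. The functions below are our own simplified examples (not metrics prescribed by the surveyed works): one aggregates position-discounted exposure per item (provider) group, the other a simple hit rate per user (consumer) group as a stand-in for relevance; all group labels and toy data are hypothetical.

```python
import math
from collections import defaultdict

def group_exposure(ranked_items, item_group):
    """Position-discounted exposure (1 / log2(rank + 1)) summed per item group."""
    exposure = defaultdict(float)
    for rank, item in enumerate(ranked_items, start=1):
        exposure[item_group[item]] += 1.0 / math.log2(rank + 1)
    return dict(exposure)

def group_hit_rate(recommendations, relevant_items, user_group):
    """Share of users with at least one relevant recommendation, per user group."""
    hits, counts = defaultdict(int), defaultdict(int)
    for user, rec_list in recommendations.items():
        counts[user_group[user]] += 1
        hits[user_group[user]] += int(bool(set(rec_list) & relevant_items[user]))
    return {g: hits[g] / counts[g] for g in counts}

# Hypothetical usage with toy data
print(group_exposure(["i1", "i2", "i3"],
                     {"i1": "popular", "i2": "long-tail", "i3": "long-tail"}))
print(group_hit_rate({"u1": ["i1", "i2"], "u2": ["i3"]},
                     {"u1": {"i2"}, "u2": {"i9"}},
                     {"u1": "active", "u2": "less-active"}))
```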

To perform fairness evaluations for item recommendation tasks, the users or items are divided into non-overlapping groups (segments) based on some form of attributes. These attributes can be either supplied externally by the data provider (e.g., gender, age, race) or computed internally from the interaction data (e.g., based on user activity level, mainstreamness, or item popularity). In Table 1, we provide a list of the most commonly used attributes in the recommendation fairness literature, which can be utilized to operationalize the group fairness concept. They are divided according to consumer fairness (C-Fairness), provider fairness (P-Fairness), and combinations of the two (CP-Fairness).

Goal 1: Consumer Fairness (C-Fairness)

  • Target: Demographic parity – sensitive attributes are attained by birth and not under a user’s control.
    Gender: deldjoo2021flexible; deldjoo2021explaining; Wu21decomposed; gorantla2021problem; wang2021practical; ghosh2021fair; wan2020addressing; edizel2020fairecsys; Tsintzou19BiasDisparity; DBLP:conf/recsys/MansouryMBP19; DBLP:conf/recsys/LinSMB19; DBLP:conf/kdd/GeyikAK19; DBLP:journals/kbs/XiaYXL19; DBLP:conf/fat/BurkeSO18; DBLP:conf/icwsm/ChakrabortyMBGG17
    Race: DBLP:conf/icml/GorantlaDL21; ghosh2021fair; DBLP:conf/um/ZhengDMK18; DBLP:conf/cikm/ZhuHC18; DBLP:conf/icwsm/ChakrabortyMBGG17
    Age: deldjoo2021flexible; DBLP:journals/ijimai/BobadillaLG021; suhr2021does; DBLP:journals/ipm/MelchiorreRPBLS21; gorantla2021problem
    Nationality: Weydemann19location

  • Target: Merit-based fairness – attained through a user’s merit over time.
    Education: suhr2021does; GomezWinner2021
    Income: suhr2021does

  • Target: Behavior-oriented fairness – attained based on a user’s engagement with the system/item catalog.
    User (in)activeness: Hao21Pareto; Li2012UserOriented; Xiao20SocialActiv; Fu20Explainable; Chakraborty19Equality
    User (non-)mainstreaminess: Abdollahpouri2020Connection; Abdollahpouri21User

  • Target: Other emerging attributes.
    Physiological/psychological: wan2020addressing; htun2021perception
    Sentiment-based: Lin21sentiment

Goal 2: Provider Fairness (P-Fairness)

  • Target: Item producer/creator – sensitive attribute based on who the item producer is.
    News author: GharahighehiVP21; music artist: Ferraro19Music; movie director: Boratto21Interplay

  • Target: Producer’s demographic or general information – sensitive attribute based on the demographic group to which the item producer belongs, e.g., male vs. female artists.
    Gender: Kirnap21Estimation; Boratto21Interplay; Shakespeare20Exploring; xia2019we
    Geographical region: GomezWinner2021

  • Target: Item information – sensitive attribute based on the item information itself.
    Price and brand: deldjoo2021flexible; Dash21umpire
    Geographical region: DBLP:conf/pakdd/LiuLTLCH20; DBLP:conf/fat/BurkeSO18

  • Target: Interaction-oriented fairness – sensitive attribute based on the interactions observed on items, e.g., popularity.
    Popularity: deldjoo2021flexible; DBLP:journals/access/DongXL21; DASILVA2021115112; Wundervald21Cluster; Borges21mitigating; Ge21Towards; Sun19Debiasing; Weydemann19location; Abdollahpouri19TheUnfairness; Zhu18FMSR
    Cold items: Zhu21NewItems

  • Target: Other emerging attributes.
    Premium membership: Deldjoo19GCE
    Sentiment and reputation: Lin21sentiment; Zhu20FARM

  • Target: Non-sensitive attributes.
    Movie and music genre: Tsintzou19BiasDisparity; DBLP:conf/recsys/LinSMB19; rastegarpanah2019fighting; Ferraro19Music

Goal 3: Consumer-Provider Fairness (CP-Fairness)

  • Target: Combinations of two targets from C-Fairness and P-Fairness.
    Same category of sensitive attributes for both users and items (e.g., behavior-oriented): naghiaei2022cpfair; rahmani2022unfairness; Lin21sentiment; Abdollahpouri19TheUnfairness; DBLP:conf/fat/BurkeSO18
    Different categories of sensitive attributes: deldjoo2021flexible; Deldjoo19GCE; DBLP:conf/recsys/MansouryMBP19; Tsintzou19BiasDisparity; Weydemann19location; xia2019we

Table 1: Overview of common attributes used when addressing fairness concepts from the perspectives of consumers, providers, or both.

Moreover, in the area of recommender systems, a number of people recommendation scenarios can be identified that are similar to classical fair ML problems. These include recommenders on dating sites, social media sites that provide suggestions for connections, and specific applications, e.g., in the educational context GomezWinner2021. In these cases, user demographics may play a major role. However, in many other cases, e.g., in e-commerce or media recommendation, it is not always immediately clear what the protected groups may be. In Li2012UserOriented and other works, for example, user groups are defined based on their activity level, and it is observed that highly active users (of an e-commerce site) receive higher-quality recommendations in terms of usual accuracy measures. This is in general not surprising because there is more information a recommender system can use to make suggestions for more active users. However, it is questionable whether an algorithm that returns the best recommendations it can generate given the available amount of information should be considered unfair. Recent studies have also focused on two-sided CP-Fairness, as illustrated in naghiaei2022cpfair; rahmani2022unfairness. In these works, the authors demonstrate the existence of inequity in terms of exposure of popular products and the quality of recommendations offered to active users. It is unknown whether increasing fairness on one or both sides (consumer/provider) has an effect on the overall quality of the system. In naghiaei2022cpfair, an optimization-based re-ranking strategy is then presented that leverages consumer-side and provider-side benefits as constraints. The authors demonstrate that it is feasible to boost fairness on both the user and item sides without compromising (and even while enhancing) recommendation quality.
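As a simplified illustration of how such re-ranking ideas can work in general (this is not the actual optimization model of naghiaei2022cpfair), the sketch below greedily rebuilds a top-k list by trading off predicted relevance against the exposure already given to each provider group; the weight lambda_, the group labels, and the scores are assumptions made for illustration only.

```python
from collections import defaultdict

def rerank_with_provider_fairness(candidates, scores, item_group, k=10, lambda_=0.5):
    """Greedy re-ranking: at each position, pick the item with the best
    relevance score minus a penalty for the exposure its provider group
    has already received in the list built so far."""
    exposure = defaultdict(int)
    remaining = set(candidates)
    reranked = []
    while remaining and len(reranked) < k:
        best = max(remaining,
                   key=lambda i: scores[i] - lambda_ * exposure[item_group[i]])
        reranked.append(best)
        exposure[item_group[best]] += 1
        remaining.remove(best)
    return reranked

# Hypothetical usage with toy data
print(rerank_with_provider_fairness(
    ["i1", "i2", "i3"],
    {"i1": 0.9, "i2": 0.8, "i3": 0.4},
    {"i1": "major-label", "i2": "major-label", "i3": "indie"},
    k=2))
```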

Different from traditional fairness problems in ML, research on fairness in recommender systems also frequently considers the concept of fairness towards items or their suppliers, see also Li2021tutorial, which differentiates between user and item fairness. In these cases, items can be related to users, for example artists in a music recommendation scenario, but they can also be arbitrary objects. In these research works, the idea often is to avoid an unequal (or: unfair) exposure of items of different groups. In some works, e.g., BORATTO2021102387, the popularity of items is considered an important attribute, and the goal is to give fair exposure to items that belong to the long tail. In other research works that focus on fair item exposure, e.g., in Gupta2021Online, groups are defined based on attributes that are in practice not protected, e.g., the price range of an accommodation; often, synthetic data is also used. The purpose of such experiments is usually to demonstrate the effectiveness of an algorithm if (any) groups were given. Nonetheless, in these cases it often remains unclear in which ways evaluations make sense with datasets from domains where there is no clear motivation for considering questions of fairness. Also, in cases where the goal is to increase the exposure of long-tail items, no particular motivation is usually provided as to why recommending (already) popular items is generally unfair. There are often good reasons why certain items are unpopular and should not be recommended, for example, simply because they are of poor quality DBLP:journals/corr/abs-2109-07946.

Fairness for items at the individual level, in particular for cold-start items, is for example discussed in Zhu21NewItems. In general, as shown in Figure 3, works that consider aspects of individual fairness are rather rare, and the definition from classical fair ML settings—similar individuals should be treated similarly—cannot always be directly transferred to recommendation scenarios. In Edizel2019FaiRecSysMA, for example, the goal is to make sure that the system is not able to derive a user’s sensitive attribute, e.g., gender, and should thus treat male and female individuals similarly. Most other works that focus on individual fairness address problems of group recommendation, i.e., situations where a recommender is used to make item suggestions for a group of users. Group recommendation problems have been studied for many years Masthoff2011GroupRS; felfernig2018group, usually with the goal of making item suggestions that are acceptable for all group members and where all group members are treated similarly. In the past, these works often did not explicitly mention fairness as a goal, because it was an implicit underlying assumption of the problem setting. In more recent works on group recommendation, in contrast, fairness is explicitly mentioned, e.g., in htun2021perception; kaya2020ensuring; Malecek2021Group, perhaps also due to the current interest in this topic. Notable works in this context are htun2021perception and Want2021BiasFriend, which are among the few works in our survey that consider questions of fairness perceptions.

Finally, we highlight the resurgence of the notion of calibrated recommendation or calibration fairness in recommender systems. In ML, calibration is a fundamental concept which is achieved when the expected proportions of (predicted) classes match the observed proportions of data points in the available data. Similarly, the purpose of calibration fairness is to keep the deviation between users’ interests and the suggested recommendations within an acceptable proportion Oh2011; steck2018calibrated; JugovacJannachLerche2017eswa.
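One way this idea can be operationalized (similar in spirit to the calibration metric in steck2018calibrated) is to compare the distribution over categories such as genres in a user’s interaction history with the distribution in the recommended list, e.g., using a smoothed Kullback-Leibler divergence. The sketch below is our own simplified version; the smoothing constant and the genre labels are illustrative assumptions.

```python
import math

def calibration_divergence(history_dist, rec_dist, alpha=0.01):
    """Smoothed KL divergence between the genre distribution in a user's
    history (p) and in the recommended list (q); 0 = perfectly calibrated."""
    divergence = 0.0
    for genre, p in history_dist.items():
        q = (1 - alpha) * rec_dist.get(genre, 0.0) + alpha * p  # avoid division by zero
        if p > 0:
            divergence += p * math.log(p / q)
    return divergence

# Hypothetical example: a user with a 70% drama / 30% comedy history
history = {"drama": 0.7, "comedy": 0.3}
recs    = {"drama": 1.0}  # the recommendations contain only drama
print(calibration_divergence(history, recs))  # clearly > 0, i.e., miscalibrated
```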

Single-Sided vs. Multi-Sided Fairness

Traditionally, research in computer science on recommender systems has focused on the consumer value (or utility) of recommender systems, e.g., on how algorithmically generated suggestions may help users deal with information overload. Providers of recommendation services are however primarily interested in the value a recommender can ultimately create for their organization. The organizational impact of recommender systems has, for many years, been a focus in the field of information systems, see Xiao:2007:EPR:2017327.2017335 for a survey. Only in recent years have we observed an increased interest in such topics in the computer science literature. Many of these recent works aim to shed light on the impact of recommendations in a multistakeholder environment, where typical stakeholders may include consumers, service providers, suppliers of the recommendable items, or even society abdollahpouri2020; jannach2021mcnamara.

In multistakeholder environments, there may exist trade-offs between the goals of the involved entities. A recommendation that is good for the consumer might, for example, not be the best from the profit perspective of the provider JannachAdomaviciusVAMS2017. In a similar vein, questions of fairness can be viewed from the perspectives of multiple stakeholders, leading to the concept of multisided fairness burke2017multisided. As mentioned above, there can be fairness questions that relate to the suppliers of the items. Again, there can also be trade-offs, i.e., what may be a fair recommendation for users might in some ways be seen as unfair to item suppliers, e.g., when their items get limited exposure.

Figure 4 shows the distribution of works that focus on one single side of fairness and works which address questions of multisided fairness. The illustration clearly shows that the large majority of the works concentrates on the single-sided case, indicating an important research gap in the area of multisided fairness within multistakeholder application scenarios.

Figure 4: Fairness Notions: Single-sided vs. Multi-sided Fairness.

Among the few studies on multi-sided fairness, Abdollahpouri19Multi discusses techniques for CP-fairness in matching platforms such as Airbnb and Uber. Patro et al. patro2020fairrec model the fair recommendation problem as a constrained fair allocation problem with indivisible goods and propose a recommendation algorithm that takes producer fairness into consideration. Wu et al. Wu21TFROM propose an individual-based perspective, where fairness is defined as the same exposure for all producers and the same NDCG for all consumers involved.

Static vs. Dynamic Fairness

Another dimension of fairness research relates to the question of whether the fairness assessment is done in a static or dynamic environment Li2021tutorial. In static settings, the assessment is done at a single point in time, as is commonly done also in offline evaluations that focus on accuracy. Thus, it is assumed that the attributes of the items do not change, that the set of available items does not change, and that an analysis made at one point in time is sufficient to assess the fairness of algorithms or whether an unfairness mitigation technique is effective.

Such static evaluations however have their shortcomings, e.g., as there may be feedback loops that are induced by the recommendations. Also, some effects of unfairness and the effects of corresponding mitigation strategies might only become visible over time. Such longitudinal studies require alternative evaluation methodologies, for example, approaches based on synthetic data or different types of simulation, such as those developed in the context of reinforcement learning algorithms; see rohde2018recogym; mladenov2021recsimng; ghanem2022balancing; longitudinalimpact2021; AdomaviciusJannach2021 for simulation studies and related frameworks in recommender systems.

Figure 5: Fairness Notions: Static vs. Dynamic Evaluation.

Figure 5 shows how many studies in our survey considered static and dynamic evaluation settings, respectively. Static evaluations are clearly predominant: we only found 12 works that consider dynamically changing environments. In Ge21Towards, for example, the authors consider the dynamic nature of the recommendation environment by proposing a fairness-constrained reinforcement learning algorithm, so that the model dynamically adjusts its recommendation policy to ensure that the fairness requirement is satisfied even when the environment changes. A similar idea is developed in DBLP:conf/pakdd/LiuLTLCH20, where a long-term balance between fairness and accuracy is sought for interactive recommender systems by incorporating fairness into the reward function of the reinforcement learning algorithm. On the other hand, works such as DBLP:conf/kdd/BeutelCDQWWHZHC19 and deldjoo2021flexible model fairness in a specific snapshot of the system, simply taking the system and its training data as a fixed image of the interactions performed by the users on the system.
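A minimal sketch of the general idea behind such approaches (our own illustration, not the concrete reward design of Ge21Towards or DBLP:conf/pakdd/LiuLTLCH20): the reward optimized by the recommendation policy combines an accuracy signal with a penalty for accumulated unfairness, so that the trade-off is re-evaluated at every interaction step. The parity target of 0.5 and the weight lambda_ are assumptions.

```python
def fairness_aware_reward(click, exposure_counts, protected_group, lambda_=0.2):
    """Accuracy reward (1 if the user clicked) minus a penalty proportional
    to how far the protected group's share of accumulated exposure falls
    below an illustrative parity target of 0.5."""
    total = sum(exposure_counts.values()) or 1
    protected_share = exposure_counts.get(protected_group, 0) / total
    unfairness = max(0.0, 0.5 - protected_share)
    return float(click) - lambda_ * unfairness

# Hypothetical step: the user clicked, but protected items got little exposure so far
print(fairness_aware_reward(click=1, exposure_counts={"majority": 9, "protected": 1},
                            protected_group="protected"))
```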

Associative vs. Causal Fairness

The final categorization discussed in Li2021tutorial contrasts associative and causal fairness. One key observation by the authors in that context is that most research in fair ML is based on association-based (correlation-based) approaches. In such approaches, researchers typically investigate the potential “discrepancy of statistical metrics between individuals or subpopulations”. However, certain aspects of fairness cannot be investigated properly without considering potential causal relations, e.g., between a sensitive (protected) feature like gender and the model’s output. In terms of methodology, causal effects are often investigated based on counterfactual reasoning KusnerCounterfactual2017; li-2021-towards-1.

Figure 6 shows that there are only three works investigating recommendation fairness problems based on causality considerations. In DBLP:conf/recsys/CornacchiaNR21, for example, the authors propose the use of counterfactual explanation to provide fair recommendations in the financial domain. An interesting alternative is presented in li-2021-towards-1, where the authors analyze the causal relations between the protected attributes and the obtained results.

Figure 6: Fairness Notions: Associative vs. Causal Fairness.

One additional dimension we identified through our literature analysis is the use of constraint-based approaches to integrate or model fairness characteristics in recommender systems. For example, Hao21Pareto address the issue of enforcing equality on biased data by formulating a constrained multi-objective optimization problem to ensure that sampling from imbalanced sub-groups does not affect gradient-based learning algorithms; the same work and others—including DBLP:conf/recsys/SeymenAM21a or DBLP:conf/sigir/YadavDJ21—define fairness as another constraint to be optimized by the algorithms. In DBLP:conf/sigir/YadavDJ21, for example, such a constraint is amortized fairness-of-exposure.

4.4 Application Domains and Datasets

Next, we look at the application domains that are in the focus of research on fair recommendations. Figure 7 shows an overview of the most frequent application domains and how many papers focused on these domains in their evaluations. By far the most researched domain is the recommendation of videos (movies) and music, followed by e-commerce and finance. For many other domains shown in the figure (e.g., jobs, tourism, or books), only a few papers were identified. Certain domains were only considered in one or two papers; these are combined under “Other” in Figure 7.

Figure 7: Application Domains.

Since most of the studied papers are technical papers and use an offline experimental procedure, corresponding datasets from the respective domains are used. Strikingly often, in more than one third of the papers, one of the MovieLens datasets is used. This may seem surprising, as some of these datasets do not even contain information about sensitive attributes. Generally, these observations reflect a common pattern in recommender systems research, which is largely driven by the availability of datasets. The MovieLens datasets are a notorious case and have been used for all sorts of research in the past ML2015. Fairness research in recommender systems thus seems to have a quite different focus than fair ML research in general, which is often about avoiding discrimination of people.

We may now wonder which specific fairness problems are studied with the help of the MovieLens rating datasets. What would be unfair recommendations to users? What would be unfair towards the movies (or their providers)? It turns out that item popularity is often the decisive attribute to achieve fairness towards items, and quite a number of works aim to increase the exposure of long-tail items which are not too popular, see, e.g., DBLP:journals/access/DongXL21. In terms of fairness towards users, the technical proposal in DASILVA2021115112 for example aims to serve users with recommendations that reflect their past diversity preferences with respect to movie genres. An approach towards fairness to groups is proposed in MISZTALRADECKA2021102519. Here, groups are not identified by a protected attribute, but by the recommendation accuracy that is achieved (using any metric) for the members of the group.

Continuing our discussion above, such notions of unfairness may not be undisputed. When some users receive recommendations with lower accuracy, this might be caused by their limited activity on the platform or their unwillingness to allow the system to collect data. Actually, one may consider it unfair to artificially lower the quality of recommendations for the group of highly active and open users. Also, a movie may simply not be popular because it is of poor quality, as mentioned above. It is not clear why recommending such a movie to many users would make the system fairer, and equating bias (or skewed distributions) with unfairness in general seems questionable. Finally, also for the user fairness calibration approach from DASILVA2021115112, it is less than clear why diversifying recommendations according to user tastes would increase the system’s fairness. It may increase the quality of the recommendations, but a system that generates lower-quality recommendations for everyone is probably not one we would call unfair.

In several cases, it therefore seems that the addressed problem settings are not very realistic or even artificial. One main reason for this phenomenon, in our view, lies in the lack of suitable datasets for domains where fairness really matters. These could, for example, be job recommendations on business networks or people recommendations on social media, which can be discriminatory. In today’s research, datasets from rather non-critical domains or synthetic datasets are often used to showcase the effectiveness of a technical solution Ge21Towards; Abdollahpouri21User; Yao2017BeyondParity; MISZTALRADECKA2021102519; Hao21Pareto; Tsintzou19BiasDisparity; Sun19Debiasing; DBLP:conf/kdd/GeyikAK19; Stratigi17health.

While this may certainly be meaningful to demonstrate the effects of, e.g., a fairness-aware re-ranking algorithm, such research may appear to remain quite disconnected from real-world problems. This phenomenon of an “abstraction trap” was discussed earlier in selbst2019.

4.5 Methodology

In this section, we review how researchers approach the problems from a methodological perspective.

Research Methods

In principle, research in recommender systems can be done through experimental research (e.g., with a field study or through a simulation) or non-experimental research (e.g., through observational studies or with qualitative methods) JannachZankerEtAl2010. In recommender systems research, three main types of experimental research are common: (a) offline experiments based on historical data, (b) user studies (laboratory studies), and (c) field tests (A/B tests, where different system versions are evaluated in the real world). Figure 8 shows how many papers fall into each category. As in general recommender systems research JannachZankerEtAl2012, we find that offline experiments are the predominant form of research. Note that we here only consider the technical papers, and not the conceptual or theoretical ones that we identified. Only in very few cases were humans involved in the experiments, and in even fewer cases did we find reports of field tests. Regarding user studies, htun2021perception for example involves real users to evaluate fairness in a group recommendation setting. On the other hand, notable examples of field experiments are provided in DBLP:conf/kdd/GeyikAK19, where a gender-representative re-ranker is deployed for a randomly chosen 50% of the recruiters on the LinkedIn Recruiter platform (A/B testing), and in DBLP:conf/kdd/BeutelCDQWWHZHC19. We only found one paper that relied on interviews as a qualitative research method Sonboli2021Fairness. Also, only very few papers used more than one experiment type, e.g., Serbos17package, where both a user study and an offline experiment were conducted.

Figure 8: Experiment Types.

The dominance of offline experiments points to a research gap in terms of our understanding of fairness perceptions by users. Many technical papers that use offline experiments assume that there is some target distribution or a target constraint that should be met. These papers then use computational metrics to assess to what extent an algorithm is able to meet those targets. The target distribution, e.g., of popular and long-tail content, is usually assumed to be given or to be a system parameter. To what extent a certain distribution or metric value would be considered fair by users or other stakeholders in a given domain is usually not discussed. In any practical application, this question is however fundamental, and again the danger exists that research is stuck in an abstraction trap. In a recent work on job recommendations Want2021BiasFriend, it was for example found that a debiasing algorithm led to fairer recommendations without a loss in accuracy. A user study then however revealed that participants actually preferred the original system’s recommendations.

Main Technical Contributions and Algorithmic Approaches

Looking only at the technical papers, we identified three main groups of technical contributions: (i) works that report outcomes of data analyses or which compare recommendation outcomes, (ii) works that propose algorithmic approaches to increase the fairness of the recommendations, and (iii) works that propose new metrics or evaluation approaches. Figure 9 shows the distribution of papers according to this categorization.

Figure 9: Technical Focus of Papers.