1. Introduction
Recommender systems are used in a variety of applications, including movies, music, e-commerce, online dating, and many other areas where the number of options from which the user needs to choose can be overwhelming. There are many metrics for evaluating the performance of recommender systems, ranging from accuracy metrics such as precision, recall, and normalized discounted cumulative gain (NDCG) to non-accuracy ones such as novelty and serendipity (kaminskas2016). One measure often used to evaluate the effectiveness of a recommender system is how diverse the list of recommendations given to each user is (also known as individual list diversity) (hurley2011). Recommending a diverse list of items has been shown to improve user satisfaction, as it gives the user a wider range of options (brynjolfsson2003).
The problem with individual list diversity is that it does not capture the extent to which an algorithm covers a diverse set of items across all users, which is an important consideration for many applications. Aggregate diversity (adomavicius2011) is a notion that measures this characteristic of recommender systems, and several algorithms have been proposed to improve it (adomavicius2011; antikacioglu2017). Note that a high individual list diversity of recommendations does not necessarily imply high aggregate diversity. For instance, if the system recommends to all users the same 10 items that are not similar to each other, the recommendation list for each user is diverse (i.e., high individual list diversity), but only 10 distinct items are recommended across all users (i.e., low aggregate diversity).
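The example above can be made concrete with a small sketch. The user and item names, the catalog size of 1,000, and the helper `aggregate_diversity` are illustrative assumptions, not part of any dataset used in this paper.

```python
# Toy illustration: every user receives the same 10 items.
# All names and sizes here are hypothetical.
users = [f"u{k}" for k in range(100)]
catalog = [f"item{k}" for k in range(1000)]
shared_list = catalog[:10]

recommendations = {u: list(shared_list) for u in users}


def aggregate_diversity(recs):
    """Number of distinct items recommended across all users."""
    return len({item for items in recs.values() for item in items})


distinct = aggregate_diversity(recommendations)
coverage = distinct / len(catalog)
print(distinct, coverage)  # 10 0.01
```

Even if the 10 shared items are mutually dissimilar, only 1% of this toy catalog is ever shown to any user.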
An algorithm with low aggregate diversity could be problematic for several reasons. On the one hand, it concentrates on a limited number of popular items which, in the long run, might negatively affect users’ experience in terms of item discovery. Users already know about popular items and recommending them would not add any new information. On the other hand, often items belong to different suppliers and, hence, covering fewer distinct items can indirectly result in an unfair distribution of items across recommendations from the suppliers’ perspective. Thus, a low aggregate diversity in recommendation results would have a negative impact on business success and profit (brynjolfsson2011; goldstein2006).
In this paper, we introduce FairMatch, a general graph-based algorithm that works as a post-processing approach after recommendation generation (on top of any existing standard recommendation algorithm) for improving aggregate diversity. The idea is to use a standard recommendation algorithm to generate a list of recommendations that is larger than what we ultimately want for the final list, and then use our FairMatch algorithm to build the final list from a subset of the items in the original list. In FairMatch, the main goal is to improve the visibility of high-quality items that have low visibility in the original set of recommendations. This is done by iteratively solving a maximum flow problem on a recommendation bipartite graph built from the recommendations in the original list (left nodes are recommended items and right nodes are users). At each iteration, the items that are good candidates for the final list are selected and removed from the graph, and the process continues on the remaining part of the graph.
To show the effectiveness of our FairMatch algorithm in improving aggregate diversity and fair visibility of recommended items, we perform a comprehensive set of experiments on recommendation lists of different sizes generated by two standard recommendation algorithms on two publicly available datasets. We intentionally picked the two algorithms from different classes (factorization and neighborhood-based models) to show that our approach does not depend on any particular type of recommendation algorithm.
Comparison with several state-of-the-art baselines shows that our FairMatch algorithm significantly improves recommendation results in terms of aggregate diversity and long-tail visibility, with a negligible loss in recommendation accuracy in some cases.
2. Related Work
The concept of aggregate diversity has been studied by many researchers, often under different names such as long-tail recommendation (recsys2017; yin2012challenging), the Matthew effect (moller2018not), and, of course, aggregate diversity (liu2015trust; adomavicius2011improving). All of these refer to the idea that a recommender system should recommend a wider variety of items across all users.
Vargas and Castells in (vargas2011) proposed probabilistic models for improving novelty and diversity of recommendations by taking into account both relevance and novelty of target items when generating recommendation lists. In another work (vargas2014), they proposed the idea of recommending users to items for improving novelty and aggregate diversity. They applied this idea to nearest neighbor models as an inverted neighbor and a factorization model as a probabilistic reformulation that isolates the popularity components.
Adomavicius and Kwon (adomavicius2011) proposed the idea of diversity maximization using a maximum flow approach. They used a specific setting for the bipartite recommendation graph in a way that the maximum amount of flow that can be sent from a source node to a sink node would be equal to the maximum aggregate diversity for those recommendation lists. In their setting, given the number of users is , the source node can send a flow of up to to the left nodes, left nodes can send a flow of up to 1 to the right nodes, and right nodes can send a flow of up to 1 to the sink node. Since the capacity of left nodes to right nodes is set to 1, thus the maximum possible amount of flow through that recommendation bipartite graph would be equivalent to the maximum aggregate diversity.
A more recent graph-based approach for improving aggregate diversity was proposed by Antikacioglu and Ravi in (antikacioglu2017). They generalized the idea proposed in (adomavicius2011) and showed that the minimum-cost network flow method can be efficiently used for finding recommendation subgraphs that optimize diversity. In this work, an integer-valued constraint and an objective function are introduced for discrepancy minimization. The constraint defines the maximum number of times that each item should appear in the recommendation lists, and the objective function aims to find an optimal subgraph that gives the minimum discrepancy from the constraint. This work shows improvement in aggregate diversity with a smaller accuracy loss compared to the work in (vargas2011) and (vargas2014). Similar to this work, our FairMatch algorithm also uses a graph-based approach to improve aggregate diversity. However, unlike the work in (antikacioglu2017), which tries to minimize the discrepancy between the distribution of the recommended items and a target distribution, our FairMatch algorithm has more freedom in promoting high-quality items with low visibility since it does not assume any target distribution of the recommendation frequency.
3. FairMatch Algorithm
We formulate our FairMatch algorithm as a post-processing step after recommendation generation. In other words, we first use any standard recommendation algorithm to generate, for each user, a recommendation list that is longer than what we ultimately desire, and then use these lists to build the final recommendation lists. FairMatch works as a batch process, similar to that proposed in (surer2018), where all the recommendation lists are produced at once and re-ranked simultaneously to achieve the objective. In this formulation, we produce a longer recommendation list of size for each user and then, after identifying high-quality items (items closer to the top of the list) with low visibility (i.e., items that are not recommended frequently) by iteratively solving the maximum flow problem on the recommendation bipartite graph, we generate a shorter recommendation list of size (where ).
Let be a bipartite graph of recommendation lists where is the set of left nodes (representing items), is the set of right nodes (representing users), and is the set of edges between left and right nodes showing an item in the left nodes is recommended to a user in the right nodes in recommendation lists of size . is initially a uniformly weighted graph, but we will update the weights for edges as part of our algorithm. We will discuss the initialization and our weighting method in section 3.2.
Given a weighted bipartite graph , the goal of our FairMatch algorithm is to find high-quality items with low visibility and to maximize their visibility as much as possible without a significant loss in the accuracy of the recommendations. Visibility is characterized by the degree of the node in the recommendation graph, while accuracy is captured by the rank position of the items in the original recommendation list. We develop our algorithm by extending the approach introduced in (bonchi2018) to improve the aggregate diversity of recommender systems.
We use an iterative process to identify the subgraphs of that contain the highest quality items with low visibility for each user. After identifying a subgraph at each iteration, we remove from and continue the process of finding subgraphs on the rest of the graph (i.e., ). We keep track of all the subgraphs as we use them to generate the final recommendations in the last step.
Identifying at each iteration is done by solving a Maximum Flow problem (explained in section 3.3) on the graph obtained from the previous iteration. Solving the maximum flow problem returns the left nodes connected to the edges with lower weight (i.e., more relevant items with low visibility) on the graph. After finding those left nodes, we form subgraph by separating identified left nodes and their connected right nodes from . Finally, pairs in subgraphs are used to construct the final recommendation lists of size . We will discuss this process in detail in the following sections.
Algorithm 1 shows the pseudocode for FairMatch. Overall, our FairMatch algorithm consists of the following four steps: 1) Graph preparation, 2) Weight computation, 3) Candidate selection, and 4) Recommendation list construction.
3.1. Graph Preparation
Given long recommendation lists of size generated by a standard recommendation algorithm, we create a bipartite graph from the recommendation lists in which items and users are the nodes (called, respectively, left and right nodes) and recommendations are expressed as edges. Since our FairMatch algorithm is formulated as a maximum flow problem, we also add two nodes, source () and sink (). The purpose of having a source and a sink node in the maximum flow problem is to have a start point and an end point for the flow going through the graph. We connect node to all left nodes, and we connect all right nodes to . Figure 1 shows a sample bipartite graph resulting from this step.
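The graph preparation step can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `build_recommendation_graph`, the node labels `"source"` and `"sink"`, and the adjacency-set representation are hypothetical choices, not the data structures of the original implementation.

```python
from collections import defaultdict

SOURCE, SINK = "source", "sink"  # hypothetical labels for the two extra nodes


def build_recommendation_graph(rec_lists):
    """Build the bipartite recommendation graph with a source and a sink.

    rec_lists maps each user to a ranked list of items (best first).
    Returns (edges, ranks): edges maps a node to the set of its successors,
    ranks maps (item, user) to the 1-based position of the item in that
    user's list.
    """
    edges = defaultdict(set)
    ranks = {}
    for user, items in rec_lists.items():
        user_node = ("user", user)
        edges[user_node].add(SINK)            # right node -> sink
        for pos, item in enumerate(items, start=1):
            item_node = ("item", item)
            edges[SOURCE].add(item_node)      # source -> left node
            edges[item_node].add(user_node)   # left node -> right node
            ranks[(item, user)] = pos
    return edges, ranks


edges, ranks = build_recommendation_graph({"u1": ["a", "b"], "u2": ["b", "c"]})
```

In this toy input, item `b` is recommended to both users, so its node has two outgoing edges, which is exactly the degree used later as its visibility.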
3.2. Weight Computation
Given the bipartite recommendation graph, , the task of weight computation is to calculate the weight for edges between the source node and left nodes, left nodes and right nodes, and right nodes and sink node.
For edges between left nodes and right nodes, we define the weights as the weighted sum of item visibility and relevance. The visibility of each item is defined as the degree of the node corresponding to that item (excluding the edge with the source node). Item degree is the number of edges going out from that node connecting it to the user nodes and that shows how often it is recommended to different users. Relevance is based on the rank of the item in the original recommendation list for each user (lower rank is more relevant).
For computing the weight between and , we use the following equation:
(1) 
where is the number of edges from to right nodes (i.e., ), is the position of item in the recommendation list of size generated for user , and is a coefficient to control the tradeoff between accuracy and diversity (or visibility).
Note that in equation 1, and have different ranges. The range for is from 1 to (there are different positions in the original list) and the range of depends on the frequency of the item recommended to the users (the more frequent it is recommended to different users the higher its degree is). Hence, for a meaningful weighted sum, we normalize to be in the same range as .
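A minimal sketch of such a weighted sum is given below. Two points are assumptions made only for illustration: the coefficient (called `coeff` here) is taken to multiply the normalized degree while its complement multiplies the rank, and degrees are min-max scaled into the range of the ranks. The function name `edge_weights` is also hypothetical.

```python
def edge_weights(rec_lists, coeff=0.5):
    """Weight each item-user edge by a weighted sum of degree and rank.

    ASSUMPTIONS (illustrative only): coeff multiplies the normalized item
    degree and (1 - coeff) multiplies the rank; degrees are min-max scaled
    to [1, n], where n is the length of the longest list.
    """
    n = max(len(items) for items in rec_lists.values())

    # Item degree: how many users the item is recommended to.
    degree = {}
    for items in rec_lists.values():
        for item in items:
            degree[item] = degree.get(item, 0) + 1
    lo, hi = min(degree.values()), max(degree.values())

    def scaled(d):
        if hi == lo:
            return 1.0
        return 1 + (d - lo) * (n - 1) / (hi - lo)

    weights = {}
    for user, items in rec_lists.items():
        for pos, item in enumerate(items, start=1):
            weights[(item, user)] = coeff * scaled(degree[item]) + (1 - coeff) * pos
    return weights


toy = {"u1": ["a", "b"], "u2": ["a", "c"]}
only_degree = edge_weights(toy, coeff=1.0)
only_rank = edge_weights(toy, coeff=0.0)
```

With `coeff=1.0` the weights depend only on visibility (the popular item `a` gets the larger weight), and with `coeff=0.0` they depend only on rank. Under the opposite convention the two roles would simply be swapped.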
Given weights of the edges between and , , total capacity of and would be which simply shows the sum of the weights of the edges connecting left nodes to the right nodes.
For computing the weight for edges connected to the source and sink nodes, first, we equally distribute to left and right nodes. Therefore, the capacity of each left node, , and right node, , would be as follows:
(2) 
where returns the ceiling of . Then, based on the equal capacity assigned to each left and right node, we follow the method introduced in (bonchi2018) to compute the weights for edges connected to the source and sink nodes as follows:
(3) 
(4) 
where is the greatest common divisor of the distributed capacity of left and right nodes. Assigning the same weight to edges connected to the source and sink nodes guarantees that all nodes in and are treated equally and that the weights between them play an important role in our FairMatch algorithm.
3.3. Candidate Selection
The graph constructed in the previous steps is ready to be used for solving the maximum flow problem. In a maximum flow problem, the main goal is to find the maximum amount of feasible flow that can be sent from the source node to the sink node through the flow network. Several algorithms have been proposed for solving a maximum flow problem. Well-known algorithms are Ford–Fulkerson (ford1956), push-relabel (goldberg1988), and Dinic's algorithm (dinic1970). In this paper, we use the push-relabel algorithm to solve the maximum flow problem on our bipartite recommendation graph, as it is an efficient algorithm for this purpose.
In the push-relabel algorithm, each node is assigned two attributes: label and excess flow. The label attribute is an integer value that is used to identify the neighbors to which the current node can send flow. A node can only send flow to neighbors that have a lower label than the current node. Excess flow is the remaining flow of a node that can still be sent to its neighbors. When all nodes of the graph have excess flow equal to zero, the algorithm terminates.
The push-relabel algorithm combines push operations, which send a specific amount of flow to a neighbor, and relabel operations, which change the label of a node under a certain condition (when the node has excess flow greater than zero and there is no neighbor with a label lower than the label of this node).
Here is how the push-relabel algorithm works. Figure 2 shows a typical graph in the maximum flow problem and an example of push and relabel operations. In Figure 2(a), and are the current flow and weight of the given edge, respectively. In the push-relabel algorithm, a residual graph, , is also created from graph . While graph shows the flow of forward edges, graph shows the flow of backward edges calculated as . Figure 2(b) shows the residual graph of the graph in Figure 2(a). Now, we want to perform a push operation on node and send its excess flow to its neighbors.
Given as the excess flow of node , operation will send a flow of amount from node to node and then will decrease the excess flow of by (i.e., ) and will increase the excess flow of by (i.e., ). After operation, node will be put in a queue of active nodes to be considered by the push-relabel algorithm in the next iterations, and the residual graph will be updated. Figure 2(c) shows the result of and on the graph shown in Figure 2(b). In , for instance, since and all of its neighbors have the same label value, in order to perform a push operation, we first need to perform a relabel operation on node to set the label of to one unit more than the minimum label of its neighbors, which guarantees that there is at least one neighbor with a lower label for performing the push operation. After that, node can send flow to its neighbors.
Given , , and in Figure 2(b), after performing the relabel operation, we can only send a flow of amount 8 from to and a flow of amount 4 from to . After these operations, the residual graph (backward flow from and to ) will be updated.
The push-relabel algorithm starts with a "preflow" operation to initialize the variables and then iteratively performs push or relabel operations until no active node exists for performing operations. Assuming as the label of node , in the preflow step, we initialize all nodes as follows: , , , and . This way, we will be able to send the flow from to as the left nodes have a higher label than the right nodes. Also, we will push a flow of amount (where ) from to all the left nodes.
After preflow, all of the left nodes will be in the queue, , as active nodes because all those nodes now have positive excess flow. The main part of the algorithm now starts by dequeuing an active node from and performing either push or relabel operations on as explained above. This process continues until is empty. At the end, each node will have a specific label value, and the sum of all the flows coming into node will be the maximum flow of graph . For more details, see (goldberg1988).
An important question is: how can the push-relabel algorithm find high-quality nodes (items) with low degree (visibility)?
We answer this question by referring to the example in Figure 2(c). In this figure, assume that has a backward edge to . Since has excess flow greater than zero, it should send it to its neighbors. However, as can be seen in the figure, does not have any forward edge to or nodes. Therefore, it has to send its excess flow back to , as is the only reachable neighbor for . Since has the highest label in our setting, in order for to push all its excess flow back to , it should go through a relabel operation so that its label becomes larger than that of . Therefore, the label of will be set to for an admissible push.
The reason that receives a high label value is that it initially receives a high flow from , but it does not have enough capacity to send all that flow to its neighbors (the sum of the weights between and its neighbors is smaller than its excess flow, i.e., 8+4<15). In FairMatch, in step 3 (i.e., section 3.3), left nodes without sufficient capacity on their edges are returned as part of the output of the push-relabel algorithm and are considered for constructing the final recommendation list in step 4 (i.e., section 3.4). These are the nodes whose edges received low weights from equation 1 in step 2 (i.e., section 3.2) because of their low degree (low visibility) and rank (high relevance) in the graph. Therefore, FairMatch aims at promoting those high-relevance items with low visibility.
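The behavior described above can be reproduced with a generic FIFO push-relabel sketch. Note the assumptions: this uses the textbook initialization (source label equal to the number of nodes, all other labels zero) rather than the FairMatch-specific initialization, integer capacities, and hypothetical node names (`s`, `i`, `u1`, `u2`, `t`). The toy capacities mirror the 15 versus 8+4 situation discussed above.

```python
from collections import deque


def push_relabel(capacity, source, sink):
    """Generic FIFO push-relabel; returns (max_flow, labels).

    capacity maps (u, v) to a non-negative integer capacity.
    """
    nodes = {x for edge in capacity for x in edge}
    residual = {}
    neighbors = {v: set() for v in nodes}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)
        neighbors[u].add(v)
        neighbors[v].add(u)

    label = {v: 0 for v in nodes}
    excess = {v: 0 for v in nodes}
    label[source] = len(nodes)
    active = deque()

    # Preflow: saturate every edge leaving the source.
    for v in neighbors[source]:
        flow = residual[(source, v)]
        if flow > 0:
            residual[(source, v)] -= flow
            residual[(v, source)] += flow
            excess[v] += flow
            if v != sink:
                active.append(v)

    while active:
        u = active.popleft()
        while excess[u] > 0:
            pushed = False
            for v in neighbors[u]:
                # Push only along admissible edges (label drops by one).
                if residual[(u, v)] > 0 and label[u] == label[v] + 1:
                    delta = min(excess[u], residual[(u, v)])
                    residual[(u, v)] -= delta
                    residual[(v, u)] += delta
                    excess[u] -= delta
                    excess[v] += delta
                    # v becomes active only if it had no excess before.
                    if v not in (source, sink) and excess[v] == delta:
                        active.append(v)
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:
                # Relabel: one more than the lowest reachable neighbor.
                label[u] = 1 + min(
                    label[v] for v in neighbors[u] if residual[(u, v)] > 0
                )
    return excess[sink], label


# Item i receives 15 units but can forward only 8 + 4 = 12 of them.
toy = {("s", "i"): 15, ("i", "u1"): 8, ("i", "u2"): 4,
       ("u1", "t"): 10, ("u2", "t"): 10}
flow, labels = push_relabel(toy, "s", "t")
```

In this toy graph the maximum flow is 12, and node `i` ends with a label above that of the source because it has to return its remaining 3 units. Nodes with this property are the ones the text identifies as candidates.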
3.4. Recommendation List Construction
In this step, the goal is to construct a recommendation list of size from the pairs identified in the previous step.
Given a recommendation list of size for user , , sorted based on the scores generated by a standard recommendation algorithm, candidate items identified by FairMatch connected to as , and visibility of each item, , in recommendation lists of size as , we use the following process for generating recommendation list for . First, we sort recommended items in based on their in ascending order. Then, we remove from the bottom of sorted and add items from to the end of .
This process ensures that the items extracted in the previous step replace the frequently recommended items. In other words, it decreases the visibility of frequently recommended items and increases the visibility of rarely recommended items, which yields a fairer distribution over the recommended items.
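The replacement step can be sketched as follows. The number of items removed from the bottom of the sorted list is an assumption here (we remove exactly as many items as there are new candidates), and the function name `construct_final_list` and its arguments are hypothetical.

```python
def construct_final_list(original, candidates, visibility, size):
    """Replace the most visible items of a user's list with candidates.

    original:   the user's list sorted by predicted score (best first)
    candidates: items selected for this user in the previous step
    visibility: item -> number of users the item is recommended to
    size:       length of the final list
    ASSUMPTION: one item is dropped for every candidate that is added.
    """
    top = original[:size]
    new_items = [c for c in candidates if c not in top][:size]
    # Ascending visibility: the most frequently recommended items end up
    # at the bottom, where they are removed first.
    by_visibility = sorted(top, key=lambda item: visibility[item])
    kept = by_visibility[: size - len(new_items)]
    return kept + new_items


final = construct_final_list(
    original=["a", "b", "c", "d", "e"],
    candidates=["x"],
    visibility={"a": 50, "b": 5, "c": 20, "x": 1},
    size=3,
)
print(final)  # ['b', 'c', 'x']
```

Here the most visible item `a` is dropped from the top-3 list and the low-visibility candidate `x` takes its place.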
4. Experiments
We performed a comprehensive evaluation of the effectiveness of FairMatch in improving aggregate diversity of recommender systems. Our evaluation on two standard recommendation algorithms and comparison to various diversification methods to increase aggregate diversity as baselines on two datasets shows that FairMatch significantly improves item visibility with a negligible loss in the accuracy of recommendations.
4.1. Experimental Setup
Experiments are performed on two publicly available datasets: Epinions and MovieLens. The Epinions dataset was collected from the Epinions website, which is an item reviewing system. We use a subset extracted from the Epinions dataset in which each user has rated at least 15 items and each item is rated by at least 15 users (i.e., core-15). The MovieLens dataset (harper2015) contains movie ratings and was collected by the GroupLens research group. The characteristics of the datasets are summarized in Table 1.
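Table 1. Characteristics of the datasets.
| Dataset | #users | #items | #ratings | density |
| --- | --- | --- | --- | --- |
| Epinions | 5,531 | 4,287 | 186,995 | 0.789% |
| ML1M | 6,040 | 3,706 | 1,000,209 | 4.468% |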
The initial longer recommendation lists of size are generated by two well-known recommendation algorithms: listwise matrix factorization (ListRank) (shi2010) and user-based collaborative filtering (UserKNN) (Resnick:1994a). As mentioned earlier, we chose these two algorithms to cover different approaches in recommender systems: matrix factorization and neighborhood models. We performed a grid search on the hyperparameters of each algorithm (footnote 1: for ListRankMF, we set all regularizers , , , and ; for UserKNN, we set ) and selected the results with the highest precision value for our next analysis.
To show the effectiveness of the FairMatch algorithm in improving aggregate diversity, we compare its performance with two state-of-the-art algorithms and two simple baselines.
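Table 2. Results on the Epinions dataset. The three column groups correspond to initial recommendation lists of size 20, 50, and 100.
| algorithms | baselines | P@10 (20) | C@10 (20) | G@10 (20) | E@10 (20) | P@10 (50) | C@10 (50) | G@10 (50) | E@10 (50) | P@10 (100) | C@10 (100) | G@10 (100) | E@10 (100) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ListRankMF | Standard | 0.015 | 24.4% | 0.937 | 4.35 | 0.015 | 24.4% | 0.937 | 4.35 | 0.015 | 24.4% | 0.937 | 4.35 |
| ListRankMF | Random | 0.010 | 33.1% | 0.882 | 5.23 | 0.006 | 45.0% | 0.869 | 5.97 | 0.004 | 53.5% | 0.856 | 6.39 |
| ListRankMF | Reverse | 0.005 | 37.3% | 0.839 | 5.69 | 0.003 | 51.9% | 0.867 | 5.58 | 0.002 | 60.7% | 0.814 | 6.70 |
| ListRankMF | FA*IR | 0.013 | 28.8% | 0.917 | 4.56 | 0.009 | 30.4% | 0.934 | 5.01 | 0.008 | 33.8% | 0.945 | 5.01 |
| ListRankMF | DM | 0.014 | 36.9% | 0.907 | 4.45 | 0.011 | 56.6% | 0.748 | 5.69 | 0.010 | 69.1% | 0.680 | 6.07 |
| ListRankMF | FairMatch | 0.014 | 38.0% | 0.884 | 4.72 | 0.010 | 61.4% | 0.789 | 5.96 | 0.008 | 77.7% | 0.720 | 6.53 |
| UserKNN | Standard | 0.045 | 46.4% | 0.925 | 5.19 | 0.045 | 46.4% | 0.925 | 5.19 | 0.045 | 46.4% | 0.925 | 5.19 |
| UserKNN | Random | 0.035 | 53.6% | 0.896 | 5.63 | 0.023 | 65.7% | 0.875 | 6.26 | 0.016 | 75.6% | 0.831 | 6.73 |
| UserKNN | Reverse | 0.025 | 58.2% | 0.870 | 5.96 | 0.013 | 73.3% | 0.825 | 6.70 | 0.008 | 83.1% | 0.753 | 7.20 |
| UserKNN | FA*IR | 0.044 | 61.1% | 0.868 | 5.62 | 0.038 | 65.0% | 0.865 | 6.22 | 0.030 | 65.4% | 0.867 | 6.41 |
| UserKNN | DM | 0.044 | 64.1% | 0.850 | 5.65 | 0.041 | 84.3% | 0.732 | 6.40 | 0.037 | 95.4% | 0.529 | 7.18 |
| UserKNN | FairMatch | 0.044 | 67.0% | 0.853 | 5.72 | 0.038 | 90.8% | 0.732 | 6.67 | 0.029 | 98.1% | 0.580 | 7.41 |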
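Table 3. Results on the MovieLens dataset. The three column groups correspond to initial recommendation lists of size 20, 50, and 100.
| algorithms | baselines | P@10 (20) | C@10 (20) | G@10 (20) | E@10 (20) | P@10 (50) | C@10 (50) | G@10 (50) | E@10 (50) | P@10 (100) | C@10 (100) | G@10 (100) | E@10 (100) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ListRankMF | Standard | 0.152 | 14.0% | 0.916 | 4.13 | 0.152 | 14.0% | 0.916 | 4.13 | 0.152 | 14.0% | 0.916 | 4.13 |
| ListRankMF | Random | 0.124 | 17.5% | 0.861 | 4.76 | 0.089 | 24.6% | 0.834 | 5.48 | 0.066 | 32.2% | 0.809 | 5.97 |
| ListRankMF | Reverse | 0.097 | 19.0% | 0.831 | 4.97 | 0.055 | 28.4% | 0.786 | 5.73 | 0.037 | 37.9% | 0.757 | 6.20 |
| ListRankMF | FA*IR | 0.143 | 14.2% | 0.907 | 4.26 | 0.136 | 14.3% | 0.937 | 4.34 | 0.128 | 16.5% | 0.949 | 4.41 |
| ListRankMF | DM | 0.148 | 18.7% | 0.850 | 4.41 | 0.138 | 28.4% | 0.801 | 4.76 | 0.130 | 38.1% | 0.764 | 5.02 |
| ListRankMF | FairMatch | 0.149 | 19.4% | 0.870 | 4.40 | 0.138 | 30.0% | 0.836 | 4.90 | 0.130 | 40.2% | 0.834 | 5.10 |
| UserKNN | Standard | 0.196 | 10.7% | 0.884 | 4.37 | 0.196 | 10.7% | 0.884 | 4.37 | 0.196 | 10.7% | 0.884 | 4.37 |
| UserKNN | Random | 0.163 | 12.3% | 0.836 | 4.73 | 0.120 | 15.9% | 0.805 | 5.29 | 0.094 | 19.4% | 0.780 | 5.71 |
| UserKNN | Reverse | 0.130 | 13.4% | 0.791 | 4.99 | 0.082 | 17.7% | 0.726 | 5.63 | 0.058 | 22.2% | 0.703 | 6.01 |
| UserKNN | FA*IR | 0.192 | 11.3% | 0.855 | 4.60 | 0.181 | 12.3% | 0.869 | 4.88 | 0.168 | 18.0% | 0.858 | 5.19 |
| UserKNN | DM | 0.192 | 13.8% | 0.835 | 4.63 | 0.184 | 19.2% | 0.800 | 4.98 | 0.180 | 25.0% | 0.780 | 5.21 |
| UserKNN | FairMatch | 0.193 | 13.9% | 0.863 | 4.48 | 0.184 | 18.6% | 0.872 | 4.69 | 0.170 | 23.6% | 0.850 | 5.05 |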

FA*IR. This is the method introduced in (zehlike2017) and was mentioned in our related work section. The method was originally used for improving group fairness in job recommendation. However, we use this method for improving aggregate diversity in item recommendation. We define protected and unprotected groups as long-tail and short-head items, respectively. For separating short-head from long-tail items, we consider those top items which cumulatively take up K% of the ratings as the short-head and the rest as long-tail items. For experiments in this paper, we have tried different values of . Also, we set the other two hyperparameters, the proportion of protected candidates in the top items (footnote 2: based on the suggestion from the released code, the range should be in ) and the significance level (footnote 3: based on the suggestion from the released code, the range should be in ), to and , respectively.
Discrepancy Minimization (DM). This is the method introduced in (antikacioglu2017) and was explained in our related work section. For hyperparameter tuning, we followed the experimental settings suggested by the original paper for our experiments. We set the target degree distribution to and relative weight of the relevance term to .

Reverse. Given a recommendation list of size for each user generated by standard recommendation algorithm, in this method, instead of picking the items from the top, we pick them from the bottom of the list. In this approach, we expect to see an increase in aggregate diversity as we are giving higher priority to the items with lower scores to be picked first. However, the accuracy of the recommendations will decrease as we give higher priority to lower quality items.

Random. Given a recommendation list of size for each user generated by a standard recommendation algorithm, we randomly choose items from that list and create a final recommendation list for that user. Note that this is different from randomly choosing items from the whole catalog to recommend to users. The reason we randomly choose the items from the original recommended list of items (size ) is to compare other post-processing and re-ranking techniques with a simple random re-ranking.
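The two simple baselines above, Reverse and Random, can be sketched in a few lines. The function names and the optional `rng` argument are our own illustrative choices.

```python
import random


def reverse_rerank(long_list, size):
    """Reverse baseline: pick items from the bottom of the original list."""
    return list(reversed(long_list))[:size]


def random_rerank(long_list, size, rng=None):
    """Random baseline: sample items from the original list, not the catalog."""
    rng = rng or random.Random()
    return rng.sample(long_list, size)


long_list = ["a", "b", "c", "d", "e"]   # best-scored item first
print(reverse_rerank(long_list, 2))      # ['e', 'd']
picked = random_rerank(long_list, 3, random.Random(0))
```

Both baselines only reorder or subsample each user's own long list, so any item they recommend was already among that user's top-scored items.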
FairMatch algorithm only involves one hyperparameter, , to control the balance between the node degree and relevance. For our experiments we try . A lower value for indicates more focus on maintaining the accuracy of the recommendations, while a higher value for indicates more focus on improving aggregate diversity. We also perform a sensitivity analysis to show how can play an important role in the accuracydiversity tradeoff.
For evaluation, we use the following metrics to measure the effectiveness of each method:

Precision (): The fraction of the recommended items shown to the users that are part of the users’ profile in the test set.

Coverage (): The percentage of items which appear at least once in the recommendation lists.

Gini index (): The measure of fair distribution of recommended items. It takes into account how uniformly items appear in recommendation lists. Uniform distribution will have Gini index equal to zero which is the ideal case (lower Gini index is better). Given all the recommendation lists for users, , and as the probability of the th least recommended item being drawn from calculated as (vargas2014):
(5)
where is the recommendation list for user . Now, Gini index of can be computed as:
(6)

Entropy (): Given the distribution of recommended items, entropy measures the uniformity of that distribution. Uniform distribution has the highest entropy or information gain, thus higher entropy is more desired when the goal is increasing diversity.
(7)
where is the observed probability value of item in recommendation lists .
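The three diversity metrics above can be sketched as follows. Several details are assumptions: the Gini index follows the common formulation attributed to (vargas2014), in which item probabilities are sorted in ascending order and computed over the whole catalog; the entropy uses the natural logarithm; and all function names are hypothetical. The catalog is assumed to contain at least two items and every recommended item.

```python
import math
from collections import Counter


def item_probabilities(rec_lists, catalog):
    """Share of all recommendations taken by each catalog item, ascending."""
    counts = Counter(item for items in rec_lists.values() for item in items)
    total = sum(counts.values())
    return sorted(counts.get(item, 0) / total for item in catalog)


def coverage(rec_lists, catalog):
    """Fraction of catalog items that appear in at least one list."""
    recommended = {item for items in rec_lists.values() for item in items}
    return len(recommended & set(catalog)) / len(catalog)


def gini(rec_lists, catalog):
    """Gini index over sorted probabilities (0 means perfectly uniform)."""
    p = item_probabilities(rec_lists, catalog)
    n = len(p)
    return sum((2 * k - n - 1) * p_k for k, p_k in enumerate(p, start=1)) / (n - 1)


def entropy(rec_lists, catalog):
    """Shannon entropy (natural log, an assumption) of the item distribution."""
    p = item_probabilities(rec_lists, catalog)
    return -sum(p_k * math.log(p_k) for p_k in p if p_k > 0)


catalog = ["a", "b", "c", "d"]
uniform = {"u1": ["a", "b"], "u2": ["c", "d"]}
concentrated = {"u1": ["a"], "u2": ["a"]}
```

On the `uniform` toy lists the Gini index is 0 and the entropy reaches its maximum, log 4. On the `concentrated` lists the Gini index is 1 and the entropy is 0.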
We performed 5-fold cross-validation in our experiments, and we generated recommendation lists of size 10, 20, 50, and 100 for each user with each recommendation algorithm. Recommendation lists of size 10 are used for evaluating the standard recommendation algorithms, and the longer recommendation lists of size 20, 50, and 100 are used as input for the diversification techniques, which generate recommendation lists of size 10 as output. The recommendation lists of size 10 generated by each diversification technique are evaluated with the aforementioned metrics, and their effectiveness is compared. We used librec-auto and LibRec 2.0 for running the experiments (mansoury2018automating; Guo2015).
4.2. Comparative Evaluation
Tables 2 and 3 summarize the performance of FairMatch and the other baselines on the Epinions and MovieLens datasets, respectively. For each metric (ignoring the Random and Reverse techniques), the bolded values show the best results and a statistically significant change from the second best baseline with .
As mentioned earlier, extensive experiments are performed with each diversification technique using multiple hyperparameter values. For the purpose of comparison, for each of the three diversification algorithms (DM, FA*IR, and our FairMatch) we report the configuration that yields, more or less, the same precision loss. These results enable us to better compare the performance of each technique in improving aggregate diversity while maintaining the same level of accuracy.
Based on the experiments on the Epinions dataset shown in Table 2, FairMatch significantly outperforms all the baselines on various sizes of initial recommendation lists generated by both recommendation algorithms in terms of coverage (). The coverage of FairMatch is even higher than that of the Random and Reverse techniques, without losing much accuracy, which is indicative of its power in finding high-quality items with minimum visibility. Note that the Random technique picks items at random from the original list and puts them in the final list, so it is still possible that many popular items end up in the final list. In terms of fair distribution, the same improvement is also consistently observed on entropy. The entropy of the FairMatch technique is significantly higher than that of the other techniques in all cases, showing that the recommendations generated by FairMatch are fairer and closer to a uniform distribution. However, in terms of Gini index, FairMatch generated results comparable to DM.
Table 3 shows the experimental results on the MovieLens dataset. Based on these results, except for UserKNN with and , FairMatch provides higher coverage in all cases, which is consistent with the results on the Epinions dataset. In terms of entropy and Gini index, FairMatch was outperformed by DM in most of the cases.
It is worth noting that the Gini can be a misleading measure if it is looked at in isolation. For instance, if an algorithm recommends only a few items (low coverage) but does so by recommending each item exactly in an equal proportion, then it will achieve a perfect Gini. However, having a low coverage is not desired and therefore it is more reasonable to look at the coverage and Gini together.
4.3. Accuracy-Diversity Trade-Off
We also investigated the precision and diversity tradeoff in our FairMatch algorithm under various settings. Figure 3 shows the experimental results on Epinions (Figure 2(a)) and MovieLens (Figure 2(b)) datasets. In these plots, xaxis shows the precision and yaxis shows the entropy of the recommendation results at size 10. Similar results are also observed when Gini index or coverage metrics are used as diversity measures. Each point on the plot corresponds to a specific value and the black cross shows the performance of original recommendation lists at size 10.
Results in Figure 3 show that plays an important role in controlling the precisiondiversity tradeoff. As we increase the value, precision increases, while diversity decreases. According to equation 1, for a higher value, FairMatch will concentrate more on improving the accuracy of the recommendations, while for lower value, it will have a higher concentration on improving the diversity of the recommendations.
Figure 3 also shows that for longer initial recommendation lists, the diversity of the recommendations increases but the precision decreases. Together, these two parameters allow system designers to better control the precision-diversity trade-off.
4.4. Long-tail Coverage Analysis
Covering more items is a desirable characteristic of a recommendation algorithm. However, it is important to check whether an increase in item coverage comes from recommending more long-tail items or merely from covering more popular items. Figure 4 shows the long-tail coverage of the different algorithms on the Epinions (Figure 3(a)) and MovieLens (Figure 3(b)) datasets for different sizes of the original recommendation lists. For these experiments, we identified long-tail items using the technique introduced in (celma2008). Except for UserKNN on the MovieLens dataset, where FairMatch covers fewer long-tail items than DM, FairMatch outperforms all other algorithms on both datasets. In fact, even on MovieLens, FairMatch beats DM by a slight margin when the size of the original recommendation list is 20. In other words, when time and space complexity become an issue for larger initial lists and a smaller list size is desired, FairMatch outperforms every other algorithm in this experiment on both datasets.
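The check described above can be sketched as follows. Note that this sketch does not reproduce the technique of (celma2008); as a plainly simpler stand-in, it treats as long-tail every item outside the popular head that accounts for a given share of all ratings. The names `long_tail_items`, `long_tail_coverage`, and `head_share`, the choice to report coverage as a fraction, and the toy data are all assumptions made for illustration.

```python
def long_tail_items(popularity, head_share=0.8):
    """Items outside the popular head that accounts for `head_share` of all ratings.

    Assumption: a simple cumulative-share cutoff, used here only as a
    stand-in for the long-tail definition of (celma2008).
    """
    total = sum(popularity.values())
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    tail, cumulative = set(), 0
    for item in ranked:
        # Once the head already holds the required share, the rest is tail.
        if cumulative >= head_share * total:
            tail.add(item)
        cumulative += popularity[item]
    return tail


def long_tail_coverage(rec_lists, popularity, head_share=0.8):
    """Fraction of long-tail items that appear in at least one recommendation list."""
    tail = long_tail_items(popularity, head_share)
    recommended = {item for items in rec_lists for item in items}
    return len(recommended & tail) / len(tail) if tail else 0.0


# Hypothetical rating counts per item.
popularity = {"a": 60, "b": 25, "c": 8, "d": 4, "e": 3}

print(long_tail_coverage([["a", "c"], ["a", "b"]], popularity))
print(long_tail_coverage([["a", "c"], ["d", "e"]], popularity))
```

Both toy sets of lists recommend several distinct items, but only the second reaches most of the tail, which is the distinction this analysis is meant to expose.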
4.5. Complexity Analysis
Solving the maximum flow problem is the core computation of FairMatch. We used the push-relabel algorithm, one of the efficient algorithms for solving the maximum flow problem. This algorithm has polynomial time complexity, $O(V^2 E)$, where $V$ is the number of nodes and $E$ is the number of edges in the bipartite graph. For the other parts of FairMatch, the time complexity is in the order of the number of edges, as they mainly iterate over the edges of the bipartite graph.
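For readers unfamiliar with the method, the following is a compact, unoptimized sketch of the generic push-relabel procedure on a dense capacity matrix. It is not the implementation used in the experiments; the function name, the node numbering, and the toy graph (a source, two users, two items, and a sink) are assumptions chosen only for illustration.

```python
def push_relabel_max_flow(capacity, source, sink):
    """Generic push-relabel maximum flow on a dense capacity matrix."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    height = [0] * n
    excess = [0] * n
    height[source] = n

    # Initial preflow: saturate every edge leaving the source.
    for v in range(n):
        if capacity[source][v] > 0:
            flow[source][v] = capacity[source][v]
            flow[v][source] = -capacity[source][v]
            excess[v] += capacity[source][v]
            excess[source] -= capacity[source][v]

    active = [v for v in range(n) if v not in (source, sink) and excess[v] > 0]
    while active:
        u = active.pop()
        while excess[u] > 0:
            pushed = False
            for v in range(n):
                residual = capacity[u][v] - flow[u][v]
                # Push only along residual edges that go exactly one level down.
                if residual > 0 and height[u] == height[v] + 1:
                    delta = min(excess[u], residual)
                    flow[u][v] += delta
                    flow[v][u] -= delta
                    excess[u] -= delta
                    excess[v] += delta
                    if v not in (source, sink) and excess[v] == delta:
                        active.append(v)  # v has just become active
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:
                # Relabel: lift u just above its lowest residual neighbour.
                height[u] = 1 + min(
                    height[v] for v in range(n) if capacity[u][v] - flow[u][v] > 0
                )
    return sum(flow[source][v] for v in range(n))


# Toy graph: 0 = source, 1-2 = users, 3-4 = items, 5 = sink.
cap = [[0] * 6 for _ in range(6)]
cap[0][1] = cap[0][2] = 2            # source -> users
cap[1][3] = cap[1][4] = cap[2][3] = 1  # user -> recommended item
cap[3][5] = cap[4][5] = 1            # items -> sink

print(push_relabel_max_flow(cap, 0, 5))
```

The dense matrix keeps the sketch short; an implementation intended for real recommendation graphs would use adjacency lists, since those graphs are sparse.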
Since FairMatch is an iterative process, unlike other maximum-flow-based techniques (adomavicius2011; antikacioglu2017), it requires solving the maximum flow problem on the graph multiple times, which could be one limitation of our work. However, only the first iteration runs on the original graph; in subsequent iterations the graph shrinks, because FairMatch removes parts of it at each iteration. Regardless, an upper bound on the complexity of FairMatch is the cost of one maximum flow computation on the full graph multiplied by the number of iterations, which assumes that every iteration still operates on the entire graph (which is not the case). The actual complexity of FairMatch is therefore below this bound, which is still polynomial.
5. Discussion and Future Work
In this section, we discuss the advantages that FairMatch offers for improving the performance of recommender systems, as well as possible future work for further improving FairMatch.
Generalization. In this paper, we studied the ability of FairMatch to improve aggregate diversity and one special case of supplier fairness, under the assumption that each item belongs to one supplier (i.e., fair distribution over recommended items). However, FairMatch can be generalized to other definitions of fairness, including supplier-side fairness. In this scenario, we can create a recommendation bipartite graph between users and suppliers (based on the recommended items), and then assign weights to the edges based on suppliers' information (e.g., the probability of their items being shown in recommendation results and the quality of their items). In the third step, we can solve the maximum flow problem on this graph to extract high-quality suppliers with unfair visibility in the recommendation lists. Finally, we can reconstruct the final recommendation lists by adding high-quality items from those suppliers according to each user's preferences.
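The first two steps of this supplier-side variant can be sketched as follows, under the stated assumption that each item belongs to exactly one supplier. The function names, the data layout, and the use of the share of recommendation slots as a visibility signal are hypothetical; how edge weights would actually be derived from this information is left open here, as it is in the text.

```python
from collections import Counter, defaultdict


def user_supplier_graph(rec_lists, item_supplier):
    """Collapse user-item recommendation edges into user-supplier edges.

    rec_lists: dict mapping each user to a list of recommended items.
    item_supplier: dict mapping each item to its single supplier.
    Returns a dict mapping (user, supplier) to the supplier's items
    that were recommended to that user.
    """
    edges = defaultdict(list)
    for user, items in rec_lists.items():
        for item in items:
            edges[(user, item_supplier[item])].append(item)
    return dict(edges)


def supplier_visibility(edges):
    """Share of all recommendation slots occupied by each supplier's items."""
    slots = Counter()
    for (_, supplier), items in edges.items():
        slots[supplier] += len(items)
    total = sum(slots.values())
    return {supplier: count / total for supplier, count in slots.items()}


# Hypothetical toy data: two users, three items, two suppliers.
rec_lists = {"u1": ["i1", "i2"], "u2": ["i1", "i3"]}
item_supplier = {"i1": "s1", "i2": "s1", "i3": "s2"}

edges = user_supplier_graph(rec_lists, item_supplier)
print(edges)
print(supplier_visibility(edges))
```

Keeping the list of items on each edge makes the final reconstruction step possible, since the items of a selected supplier can be recovered per user.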
Similar settings can also be considered for improving user fairness with FairMatch (mansoury2020a). In the job recommendation domain, where the task is recommending jobs to users, FairMatch can be formulated to fairly distribute "good" jobs (e.g., high-paying jobs) across groups of users defined by sensitive attributes (e.g., men and women). We will consider these scenarios in our future work.
Flexibility. Another potentially interesting improvement to FairMatch is to take the item ranking in the final recommendation lists into account. In this paper, we aimed at creating final recommendation lists that include high-quality items with low visibility, and we measured this in terms of precision. However, FairMatch can also be used to create fair ranked lists by modifying the last step (recommendation construction). To do this, given the items extracted in step 3 and the top recommendation lists from the standard recommendation algorithm, the goal is to find a fair position for the extracted items in the top recommendations.
Finally, the weight computation in step 2 also provides flexibility for optimizing FairMatch to capture other aspects. For instance, taking the popularity of items into account when computing the edge weights of the recommendation bipartite graph may help to further control popularity bias in recommender systems.
6. Conclusion
In this paper, we proposed a graph-based approach, FairMatch, for improving the aggregate diversity of recommender systems. FairMatch is a post-processing technique that works on top of any recommendation algorithm. In other words, it re-ranks the output of the standard recommendation algorithm so that it improves the aggregate diversity of the final recommendation lists while maintaining the accuracy of the recommendations. Experimental results on two publicly available datasets showed that FairMatch outperforms several state-of-the-art methods in improving aggregate diversity. One limitation of our work is that the algorithm does not leverage information about the popularity of items in the rating data. We believe this information could play an important role in further improving the aggregate diversity of the final recommendation lists, because algorithms are usually biased toward popular items (himan2019b), and tackling this bias could increase the number of distinct recommended items and hence yield higher aggregate diversity. We intend to investigate this limitation in future work.