
DeepFair: Deep Learning for Improving Fairness in Recommender Systems

by   Jesús Bobadilla, et al.
Universidad Politécnica de Madrid

The lack of bias management in Recommender Systems leads to minority groups receiving unfair recommendations. Moreover, the trade-off between equity and precision makes it difficult to obtain recommendations that meet both criteria. Here we propose a Deep Learning based Collaborative Filtering algorithm that provides recommendations with an optimum balance between fairness and accuracy without knowing demographic information about the users. Experimental results show that it is possible to make fair recommendations without losing a significant proportion of accuracy.



1. Introduction

Fairness in Recommender Systems (RS) is a very important issue, since it is part of the path towards a fair society. Nowadays, recommendations come to us from a variety of online services such as Netflix, Spotify, TripAdvisor, Facebook, Amazon, etc. All these services rely on hybrid RS [Cano2017Jan] whose kernel is Collaborative Filtering (CF). CF data is the set of the users’ preferences on the items: tens or hundreds of millions of ratings, likes, clicks, etc. This seems ideal since, in theory, the more data the better the recommendations; unfortunately, this data is usually biased [Bellogin2017Dec, Gao2020Jan] and minority groups are the most harmed. Common minority groups are female (vs. male) and senior (vs. young); both groups tend to receive unfair recommendations from online services. This situation has a perverse effect: a feedback loop in which unfair recommendations make minority users lose confidence in the system and decrease their interaction and, thus, receive even more unfair recommendations. The time has come to increase research in fair RS as a way to reduce the digital gap [Fatehkia2018Jul, Santos2019] between minority and non-minority groups.

CF RS research has traditionally focused on accuracy improvement [Portugal2018May], although other objectives have attracted increasing research attention in recent years: novelty [Mendoza2020Apr], reliability [Bobadilla2018May], diversity [Kunaver2017May] and serendipity [deGemmis2015Sep, Kotkov2016Nov] among them. Surprisingly, fairness has not been a main objective among the RS priorities. One of the reasons is the idea that improving fairness does not lead to more valued recommendations, as improving accuracy, novelty or diversity clearly does. Nevertheless, society’s needs point in the opposite direction [Holstein2019], and a set of new quality goals is growing [Mehrotra2018]: relevance, fairness and satisfaction among them. The historical development of CF has not helped fairness research either: when the k-Nearest Neighbors (kNN) algorithm [Herlocker2004Jan] dominated the field, it was less likely that a reduced set of neighbours produced biased recommendations. However, in a very short time the Matrix Factorization (MF) method prevailed as the standard, and the relevance of the fairness goal grew [Hernando2016Apr]. MF builds a compressed version of the ratings in the dataset, capturing their essence. The compressed models are sensitive to data biases such as the demographic ones (gender, age, etc.) [mehrabi2019survey], making fairness a particularly relevant goal.

As a consequence of the CF research evolution, existing publications to improve fairness using the kNN algorithm are scarce; as an example, in [pmlr-v81-burke18a] the authors look for balanced neighbourhoods as a mechanism to preserve personalization (accuracy) while enhancing the fairness of the recommendations. The differentiation made in this context between consumer-centred and provider-centred fairness is also remarkable. Fairness has been studied in the CF context in two main directions: a) showing that data biases really generate unfair recommendations, and b) providing quality measures or methods to quantify the fairness of recommendations. From the first block, in [10.1145/3184558.3186949] the authors argue that improving recommendations diversity leads to discrimination among the users and unfair results. The response of CF algorithms to the demographic distribution of ratings is studied in [10.1145/3240323.3240373]; the authors find that common CF algorithms differ in the gender distribution of their recommendation lists. A preliminary experimental study on synthetic data was conducted in [DBLP:journals/corr/abs-1811-01461], where the conditions under which a recommender exhibits bias disparity and the long-term effect of recommendations on data bias are investigated. From the second block (quality measures), in [DBLP:journals/corr/YaoH17] the authors claim that biased data can lead CF methods to make unfair predictions for users from minority groups, and they propose new metrics that help reduce unfairness. Disparity scores have also been proposed [10.1145/3184558.3186949] to obtain fairness measures. Bias disparity can be defined as “how much an individual’s recommendation list deviates from his or her original preferences in the training set” [DBLP:journals/corr/abs-1811-01461], whereas average disparity measures how much the preference disparity between training data and recommendation list for the minority group of users differs from that for the non-minority group [Mansoury2019Aug]. Fairness quality results in our paper implement these concepts.

Fairness in information retrieval has focused on studying data bias more than on acting on the machine learning models: “teams typically look to their training datasets, not their machine learning models, as the most important place to intervene to improve fairness in their products” [Holstein2019]. The machine learning achievements in the fairness issue have been reviewed in [DBLP:journals/corr/abs-1810-08810], where the authors find some “frontiers” that machine learning has not crossed yet. The MF disadvantages in CF have been studied in [DBLP:journals/corr/YaoH17], where the authors state that the MF model cannot manage the two main types of imbalanced data: population imbalance and observation bias. RS fairness has been even less covered in DL than in machine learning; as an example, in a recent survey of RS based on DL [Mu2018Nov] the fairness goal is not mentioned, not even in its “possible research directions” section. The same happens with the recent review paper [Batmaz2019Jun], where fairness is not mentioned despite the complete set of DL-based RS included in the publication. In fact, state-of-the-art research in this area is focused on accuracy improvements [IJIMAI-3874, Bobadilla2020Jan] and has not covered this subject. Building a DL-based fair RS is difficult because of the neural black box model [Choo2018Jul], which is not easy to explain or modify. Nevertheless, tackling CF fairness using DL has the advantage of providing a starting point where accuracy is high [Wu2018Apr]; this is particularly convenient since an increase in fairness usually leads to a decrease in accuracy.

For the stated reasons, the hypothesis of this paper is that it is possible to design a DL architecture that provides fair CF recommendations at the cost of reasonable decreases in accuracy. A DL approach to obtain fair recommendations provides a novel scenario in the RS field. This scenario opens the door to accurate and fair predictions, but the architectural design is not straightforward: we have to deal not just with raw ratings data, but also with the demographic information needed to determine the target minority groups: female vs. male, senior vs. young, etc. Moreover, the neural network learning model cannot be changed as easily as the kNN approach or even some machine learning algorithms. For all these reasons, the proposed DL approach relies on an enriched set of input data and a tailored loss function that minimizes not only the accuracy errors but also the fairness ones. Fairness errors can be measured using the disparity scores concept [10.1145/3184558.3186949], but how these scores are fed to the model is an open research issue.

The proposed neural network learns from data that follow the current disparity concept: “deviation from the list of recommendations and the training data”. We have specified it as two related indexes: the items one, that assigns a minority value to each item (e.g. a femininity value to a film, that depends on the female and the male preferences on this movie), and the users one, that assigns a minority value to each user (e.g. a femininity value to a user, that depends on the femininity of the items preferred by this user). Once both indexes have been set, it is possible to design a neural network loss function that rewards equality between each user minority value and the minority values of his/her recommended items. An additional design decision we have taken is to choose a regression approach [Bobadilla2018May] instead of a classification one [Bobadilla2020Jan]: since we need to simultaneously minimize accuracy and fairness errors in the loss function, it is straightforward to pack them into a combined value so that the neural network provides us with balanced fairness/accuracy regression results. Finally, we have chosen a combined MF and DL approach [Bobadilla2018May, 10.1145/3038912.3052569]; this design allows us to decouple the accuracy and the fairness abstraction levels by assigning accuracy to the MF stage and fairness to the DL stage.

A main advantage of the proposed architecture is that, once the model has learned, recommendations can be made to users that do not have associated demographic information; that is, we can fairly recommend to users without knowing their minority nature. This is possible because the neural network can learn the minority pattern in the same process in which it learns to minimize the accuracy/fairness prediction error. It is a commercial advantage, since many users avoid filling in their personal data.

The rest of the paper is structured as follows: Section 2 states the research objective; in Section 3 the proposed method is explained and the design of the experiments is defined. Section 4 shows the experiments’ results and their discussion. Finally, Section 5 contains the main conclusions of the paper and the future work.

2. Research objective

As already discussed in Section 1, recommender systems are primarily focused on providing recommendations with as high an accuracy as possible. This results in biased recommendations being provided to minority groups of users, whose representation within the overall data is very unbalanced. This focus on accuracy, coupled with the trade-off between accuracy and the equity of recommendations, makes the recommendations provided by an accuracy-focused RS unfair to some minority groups.

Our research objective is to study the possibility of finding a balance between accuracy and fairness when it comes to providing recommendations to users. To this end, we propose a CF approach capable of modulating the fairness within the recommendations.

3. Materials and Methods

The proposed architecture incorporates four different abstraction levels, as depicted in Figure 1, to get the desired fair recommendations: a) raw ratings and demographic information, b) minority indexes for both users and items, c) accurate predictions, and d) fair recommendations. Level ‘b’ just makes some simple statistical operations by combining ratings and demographic information; level ‘c’ uses the classical Probabilistic Matrix Factorization (PMF) model in order to obtain users and items hidden factors; finally, level ‘d’ makes use of a Multilayer Neural Network (MLN) to combine hidden factors and a ‘fairness’ parameter. This MLN generates the desired fair recommendations.

Figure 1. Architecture overview.

We will develop each of the three levels built on top of the raw data: first, in the lowest level we create two related indexes: 1) the items minority index (IM), and 2) the users minority index (UM). The IM index assigns a minority value to each item in the dataset; e.g. when the minority group is ‘female’ we could call the index ‘femininity’. It contains signed values, where negative ones mean feminine preferences and positive ones mean masculine preferences. Thus, when an item has been assigned a negative value it means that it has been rated better by women than by men. Once the IM index has been created it contains the minority values of all the items. By using the IM index, we create the UM index, which assigns a minority value to each user in the dataset. It also contains signed values, where negative ones mean minority preferences and positive ones mean non-minority preferences (masculine, in our example). A user with a negative UM value prefers negative IM items, and vice versa. Please note that, on many occasions, female users may be assigned positive UM values and male users may be assigned negative UM values, since there are women with masculine preferences and men with feminine ones; the same holds for young and older persons or any other minority versus majority groups. Thus, an important concept is that the IM and UM indexes do not contain disjoint minority/majority demographic values; they contain minority/majority preferences. This design accurately fits the existing diversity of preferences contained in CF-based RS.

Now, we will explain the design of the IM and UM indexes that we will take as a base to get fair recommendations in the DL stage. First, we differentiate between relevant and non-relevant votes: relevant votes are those that indicate that the user liked the item; conversely, non-relevant votes (in our context) are those that indicate that the user did not like the item. There can also be votes that indicate indifference on the part of the user. In our formulation, relevant and non-relevant votes are chosen by means of two thresholds: given the set of possible votes of a dataset, we establish one value as the relevant threshold and another as the non-relevant threshold. In this way the votes are split into a relevant set, a non-relevant set and an ‘indifference’ set.

We define the IM index (equation 11) for each item as the majority score of the item minus its minority score. The majority score (resp. minority score) of an item is the number of majority (resp. minority) users that voted it as relevant minus the number of majority (resp. minority) users that voted it as non-relevant, divided by the total number of majority (resp. minority) users that did not consider it as indifferent; see equations 10 and 9 (resp. equations 8 and 7). When the proportion of the minority user preferences exceeds the proportion of the non-minority ones, the IM index values are negative. In the gender example, equation 11 can be read as: “proportion of males that liked the item minus males that did not like it, minus the proportion of females that liked the item minus females that did not like it”. We have also set a minimum number of votes to consider both the minority and non-minority sides of equation 11.

Once the IM index has been created, we can use it to establish the UM index values. Each UM value corresponds to a user of the RS dataset, and it provides the minority value of that user. Each user minority value is defined by the minority of his/her preferences: to obtain each user UM value we average the IM minority values of the items that the user has voted, weighting each IM minority value by its corresponding user rating. Equation 13 models the explained behaviour.
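The IM and UM definitions described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, the use of 0 to encode "not voted", the thresholds (relevant from 4, non-relevant up to 2, on a 1 to 5 scale), the division of each rating by the maximum vote as the UM weight, and the toy ratings are all assumptions made for the example. The minimum-number-of-votes filter mentioned in the text is omitted.

```python
import numpy as np

def group_score(ratings, relevant_min, non_relevant_max):
    """Per-item (relevant - non-relevant) / (relevant + non-relevant) for one group."""
    voted = ratings > 0                              # 0 encodes "not voted" (assumption)
    liked = (ratings >= relevant_min) & voted
    disliked = (ratings <= non_relevant_max) & voted
    diff = liked.sum(axis=0) - disliked.sum(axis=0)
    denom = liked.sum(axis=0) + disliked.sum(axis=0)  # users that were not indifferent
    # Items without relevant or non-relevant votes in this group get a score of 0.
    return np.divide(diff, denom, out=np.zeros(ratings.shape[1]), where=denom > 0)

def item_minority_index(ratings, is_minority, relevant_min=4, non_relevant_max=2):
    """IM: majority score minus minority score; negative means a 'minority' item."""
    majority = group_score(ratings[~is_minority], relevant_min, non_relevant_max)
    minority = group_score(ratings[is_minority], relevant_min, non_relevant_max)
    return majority - minority

def user_minority_index(ratings, im, max_vote=5):
    """UM: average IM of the voted items, each weighted by rating / max_vote (assumption)."""
    weights = ratings / max_vote                      # not-voted items get weight 0
    counts = (ratings > 0).sum(axis=1)
    total = (weights * im).sum(axis=1)
    return np.divide(total, counts, out=np.zeros(ratings.shape[0]), where=counts > 0)

# Hypothetical toy ratings: rows are male 1, male 2, female 1, female 2, male 3;
# columns are items a, b, c, d.
ratings = np.array([
    [5, 1, 5, 4],
    [5, 1, 4, 1],
    [1, 4, 1, 4],
    [2, 5, 4, 5],
    [5, 1, 5, 1],
])
is_minority = np.array([False, False, True, True, False])

im = item_minority_index(ratings, is_minority)
um = user_minority_index(ratings, im)
print("IM:", np.round(im, 3))
print("UM:", np.round(um, 3))
```

With these hypothetical ratings, the male users receive positive UM values and the female users negative ones, and 'male 1' (who liked the 'feminine' item d) ends up with a lower UM value than the other two male users.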

(1) Let
(2) Let
(3) Let
(4) Let

We will assign the following meanings to super index numbers, for minority and for non-minority:

(5) Let
(6) Let
(7) Let
(8) Let

The majority score is


The minority score is


The IM and UM indexes are


where means “not voted item” and is the maximum possible vote.

Figure 2. Data-toy example to get IM and UM minority values.
item value
Figure 3. Data-toy IM results

Figure 2 shows a data-toy example containing five users and four items. We will suppose that women are a minority group in this RS, compared to men. We can observe that ‘item a’ is clearly ‘masculine’, since it has been voted as ‘relevant’ by all the male users and as ‘non-relevant’ by all the female users. The opposite situation holds for ‘item b’: it is a ‘feminine’ item according to the female relevant votes and the male non-relevant ones. ‘Item c’ is quite masculine, although a female user liked it. Finally, ‘item d’ shows the opposite situation to ‘item c’. Accordingly, the proposed IM equations return the following item minority values:

that fit the explained behaviour (Figure 3). Once the items’ minority values (IM) are obtained, we can get the users’ minority ones (UM). First, we can observe how the ‘male 2’ and ‘male 3’ users in the data-toy example have cast very ‘masculine’ ratings, since they have voted ‘relevant’ for the more ‘masculine’ items and ‘non-relevant’ for the more ‘feminine’ items. This is not the case for the ‘male 1’ user, who has cast a ‘relevant’ vote on the ‘feminine’ ‘item d’. The comparison between female users is more complicated: ‘female 1’ has cast all her votes in a ‘feminine’ way, whereas the ‘female 2’ vote on the ‘masculine’ ‘item c’ was ‘relevant’; nevertheless, the ‘female 2’ feminine votes are higher than the ‘female 1’ ones. In this way, we expect the following results: a) positive UM values for male users and negative ones for female users, and b) a more ‘minority’ (feminine) value assigned to ‘male 1’ than to ‘male 2’ and ‘male 3’. Figure 3 shows the IM results of the Figure 2 data-toy example and Table 1 shows the UM ones.

user value
male 1
male 2
female 1
female 2
male 3
Table 1. Data-toy UM results

Our architecture uses the PMF method to reduce the dimension of the ratings matrix and to get a condensed knowledge representation. From the condensed results we will be able to make accurate predictions. Equations 15 to 24 show the model formalization: the original ratings matrix is condensed into two lower-dimension matrices, the users’ matrix and the items’ matrix (equation 15). Both matrices share a common dimension of hidden factors, which is smaller than both the number of users and the number of items. Once the model has learnt, each user is represented by a vector of factors, and each item is also represented by a vector of factors. Each prediction of an item for a user is obtained as the dot product of these two vectors (equation 16). Since the users’ and the items’ hidden factors share the same semantics, predictions will be relevant when high values (positive or negative) of the factors line up in each user and item.


The user and item factors will be used in our architecture to feed the DL process input as well as to set the output target labels. Factors are obtained by means of the gradient descent algorithm. The loss function just minimizes the prediction error: the difference between the predicted value and the existing rating (equation 17).


In order to carry out the gradient descent minimization process we obtain the partial derivatives of the loss with respect to the user factors and to the item factors (equations 19 and 18).


This gives rise to the corresponding gradient descent factors update Equations 21 and 20.


Finally, we can add a regularization term to control the growth of the factors during the learning process, which gives rise to the loss function and the update rules shown in Equations 22 to 24.
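The factorization just described (dot-product predictions, gradient descent on the prediction error, and a regularization term) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `P`, `Q`, `lr` and `reg`, the per-rating stochastic updates, the hyperparameter values and the tiny ratings matrix are all hypothetical, and 0 is assumed to encode a missing rating.

```python
import numpy as np

def train_pmf(ratings, n_factors=30, lr=0.01, reg=0.05, epochs=200, seed=0):
    """Factorize `ratings` (0 = missing) into user factors P and item factors Q."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    users, items = np.nonzero(ratings)               # only known ratings are trained on
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = ratings[u, i] - P[u] @ Q[i]        # prediction error for this rating
            p_u = P[u].copy()                        # keep the old value for the Q update
            P[u] += lr * (err * Q[i] - reg * P[u])   # gradient step with L2 regularization
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error over the known ratings only."""
    mask = ratings > 0
    return float(np.sqrt(np.mean((ratings - P @ Q.T)[mask] ** 2)))

# Hypothetical 4 x 4 ratings matrix used only to exercise the functions.
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 1, 1],
    [1, 1, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

P, Q = train_pmf(R, n_factors=5, lr=0.02, epochs=500)
print("training RMSE:", round(rmse(R, P, Q), 3))
print("prediction for user 0, item 2:", round(float(P[0] @ Q[2]), 3))
```

The rows of `P` and `Q` are the per-user and per-item factor vectors that the later DL stage takes as input.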

Figure 4. Training information for the proposed MLN.

The highest semantic level of the proposed architecture is based on an MLN. Our MLN model (see figure 4) takes input vectors containing the following information: a) the user hidden factors, b) the item hidden factors, and c) the value of the fairness parameter. This parameter is used to balance fairness and accuracy in predictions and recommendations: high values enhance accuracy, whereas low values enhance fairness. This balance is a key objective of our method: “To obtain fair recommendations just losing an acceptable degree of accuracy”. Please note that we do not include demographic information in the MLN input, so once the MLN has learnt it will be able to make fair recommendations to users that have not filled in demographic forms asking for gender, age, etc. This is an important commercial advantage, since it allows better marketing processes, improved fairness, more focused prediction tasks, etc. It is also a challenge for the proposed machine learning framework, because it is more difficult to increase recommendation fairness when demographic data is missing. The learning process is based on input vectors containing the three specified information sources. We have set 11 input vectors to the MLN for each rating of the dataset, one for each value of the fairness parameter.

The objective is to teach the neural network eleven fairness levels for each rating, as can be seen on the left side of figure 4.

Once the MLN input vectors have been established, it is necessary to define their corresponding output labels in order to let the back propagation algorithm learn the pattern. In our case we design a loss function that minimizes both the prediction error and the fairness error. Equation 25 shows the typical prediction loss function, as in equation 17. We define the fairness error as the distance between the user’s minority value and the item’s minority value; e.g. films recommended to a user (male or female) with a given UM femininity value should have IM values as similar as possible to it in order to be fair. Since the UM and IM values do not have the same distribution, we normalize both of them and we use the names UM’ and IM’ for the normalized versions. Then, to obtain the fairness error we establish equation 26. Finally, to combine equation 25 (accuracy) and equation 26 (fairness), the fairness parameter is added (equation 27).
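The construction of the training data described above can be sketched as follows. Since the original equations are not reproduced here, several choices are assumptions: the parameter is named `beta`, both errors are taken as absolute differences, and they are combined as a convex combination (`beta` times the accuracy error plus `1 - beta` times the fairness error), which matches the statement that high values favour accuracy and low values favour fairness. The inputs `um_norm` and `im_norm` stand for the normalized UM’ and IM’ values.

```python
import numpy as np

def combined_loss(rating, prediction, um_norm, im_norm, beta):
    """Target label: blend of accuracy error and fairness error (assumed form)."""
    accuracy_error = abs(rating - prediction)
    fairness_error = abs(um_norm - im_norm)          # distance between user and item minority
    return beta * accuracy_error + (1.0 - beta) * fairness_error

def training_samples(p_u, q_i, rating, um_norm, im_norm, betas=np.linspace(0.0, 1.0, 11)):
    """Build one (input vector, target) pair per beta value for a single rating."""
    prediction = p_u @ q_i                           # MF prediction for this user and item
    X = np.array([np.concatenate([p_u, q_i, [b]]) for b in betas])
    y = np.array([combined_loss(rating, prediction, um_norm, im_norm, b) for b in betas])
    return X, y

# Hypothetical factors and minority values for one rating.
p_u = np.array([0.8, -0.2, 1.1])
q_i = np.array([1.0, 0.5, 2.0])
X, y = training_samples(p_u, q_i, rating=4.0, um_norm=0.3, im_norm=0.7)
print(X.shape, y.shape)
```

Note that no demographic field appears in `X`: the minority information only enters through the target values, which is what lets the trained network serve users without demographic data.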


In the feed forward prediction stage, for each testing input vector, the proposed neural network returns a real number whose meaning is the predicted loss error of recommending the item to the user. The lower the predicted loss error, the better the recommendation given the chosen accuracy vs. fairness balance. Once the network has learnt and the RS is in its production phase, to make recommendations to an active user we first fix the value of the fairness parameter and then we feed the MLN with one input vector for each item that the user has not voted (equation 28).


The set of recommendations for the user is the collection of items with minimum predicted loss, where the prediction is obtained through the MLN feed forward operation.
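The recommendation step described above can be sketched as a ranking over the items the user has not voted. In this sketch `predict_loss` is a hypothetical stand-in for the trained MLN feed forward operation, and `beta`, `n` and the toy factors are likewise assumptions introduced for the example.

```python
import numpy as np

def recommend(predict_loss, p_u, Q, voted_items, beta, n=10):
    """Return the n not-voted items with the lowest predicted loss for one user."""
    candidates = np.array([i for i in range(Q.shape[0]) if i not in voted_items])
    # One input row per candidate item: user factors, item factors, beta.
    X = np.hstack([
        np.tile(p_u, (len(candidates), 1)),
        Q[candidates],
        np.full((len(candidates), 1), beta),
    ])
    losses = np.asarray(predict_loss(X)).ravel()
    order = np.argsort(losses)[:n]                   # lowest predicted loss first
    return candidates[order]

# Stand-in "network": uses the first item factor as the predicted loss.
p_u = np.array([0.2, 0.4])
Q = np.array([[0.5, 0.0], [0.1, 0.0], [0.9, 0.0], [0.3, 0.0]])
fake_mln = lambda X: X[:, 2]

top = recommend(fake_mln, p_u, Q, voted_items={1}, beta=0.5, n=2)
print(top)
```

Any object exposing a batch prediction function with the same input layout could replace `fake_mln`.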

Experiments have been conducted using a well-known dataset called MovieLens 1M [10.1145/2827872]. It contains 1,209,000 votes, 6040 users and 3952 items. We have used eleven different values of the fairness parameter (from 0.0 to 1.0, step 0.1); consequently, the MLN has been trained using 13,299,000 input vectors and output target values. Training, validation and test sets have been established: 70%, 10% and 20%, respectively. The PMF process has been run using 30 hidden factors, 80% training ratings and 20% testing ratings. Please note that these are the PMF parameters of the proposed method, different from the ones previously specified for the DL stage. The designed MLN contains an input layer that receives the user factors, the item factors and the fairness parameter value (figure 4). The first MLN internal layer has been set to 80 neurons (relu activation), followed by a 0.2 dropout layer to avoid overfitting. The second internal layer has been set to 10 neurons (relu activation) and, finally, the output layer contains just one neuron with no activation function. The chosen loss function has been mae and the optimizer rmsprop.
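To make the layer sizes concrete, the following plain NumPy sketch runs a forward pass through a network with the described shape (80 relu units, 0.2 dropout, 10 relu units, one linear output) and computes the mean absolute error. It uses random, untrained weights and does not implement the rmsprop training itself. The input width of 61 is an assumption (30 user factors, 30 item factors and one fairness parameter value), as are the function names.

```python
import numpy as np

def init_mln(n_inputs=61, seed=0):
    """Random weights for layers n_inputs -> 80 -> 10 -> 1 (untrained)."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, 80, 10, 1]
    return [(rng.normal(scale=0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, X, dropout_rate=0.2, training=False, rng=None):
    """Feed forward pass; dropout is only applied when training=True."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h = np.maximum(X @ W1 + b1, 0.0)                 # first hidden layer, relu
    if training:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(h.shape) >= dropout_rate   # inverted dropout mask
        h = h * keep / (1.0 - dropout_rate)
    h = np.maximum(h @ W2 + b2, 0.0)                 # second hidden layer, relu
    return (h @ W3 + b3).ravel()                     # single linear output neuron

def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

params = init_mln()
n_params = sum(W.size + b.size for W, b in params)
X = np.random.default_rng(1).normal(size=(4, 61))
print("trainable parameters:", n_params)
print("output shape:", forward(params, X).shape)
```

Under the assumed input width, the network is small (a few thousand weights), so most of the training cost comes from the number of input vectors rather than from the model size.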

4. Results

The experiments we have conducted are:

  • Item Minority Index (IM) and User Minority Index (UM) distributions.

  • User Minority Index (UM) comparative between each minority and non-minority group.

  • Fairness prediction improvement using the heuristic algorithm.

  • Fairness recommendation improvement using the heuristic algorithm.

  • Fairness error and accuracy error for recommendations using the proposed DL architecture.

Figure 5. Proportion of users in the MovieLens gender and age minority and non-minority groups.

This section contains a subsection for each of the above sets of performed experiments. We have selected two types of minority sets: a) gender: female vs. male, and b) youth: senior vs. young. Results are provided showing both minority types in two separate graphs in each figure. The MovieLens dataset, as happens in many other CF RS, is biased towards male and young people. Thus, the chosen minority types are relevant and representative for this experimental study. Specifically, the MovieLens dataset contains more males than females, and most of its users are under 45 years old. Figure 5 shows the proportions.

Figure 6. Item Minority Index (IM) and User Minority Index (UM) distributions.

Equations 13 and 11 describe the behaviour of both indexes. The IM index semantics are simple and convincing, but it is necessary to be aware that we are not working with absolute values: in order to prevent data biases and to keep the index values in a bounded range, we are working with proportions of preferences; e.g. “proportion of male users that liked the item minus proportion of female users that liked the item”. Since we expect a significant number of items that both minority and non-minority groups simultaneously like or dislike, IM proportions will be similar for both groups and consequently a significant number of IM values will concentrate around the 0.0 value. Figure 6 shows the distributions of the items and users minority indexes, both for the gender and the youth minority groups.

The UM index values are obtained from the ratings that each user has cast on the items and from the IM value of each of those items. We can see in figure 6 that the users’ UM indexes (both for gender and youth) have a large concentration of values around 0. This provides an important conclusion: “In the reference dataset, most users have similar preferences with regard to the chosen minority groups”. Looking at the UM distributions we can also draw another main conclusion: “Although users have similar preferences, there is a clear separation between minority groups” (left and right side of the graphs). Since the UM index is only used to feed internal DL processes, the relevant information here is the proportion of the differences between values, and not their absolute values.

4.1. User Minority Index (UM) comparative between each minority and non-minority group

Figure 7. User Minority Index (UM) comparative.

In the above section we have confirmed two facts: 1) users’ preferences are similar, even if they belong to different minority groups, and 2) despite the previous conclusion, there is room to find minority behaviours of users. In this section we look more closely at the UM values of users to separate our specific groups: male vs. female and senior vs. young. Figure 7 shows the results: we can observe, in both cases, that the groups have different behaviours and also that they share a relevant number of preferences. Groups present different behaviours because their user minority values do not completely overlap; as expected, minority groups return a mean less than zero whereas non-minority groups return a mean greater than zero. Groups share a relevant number of preferences because there exists a proportion of minority and non-minority users that share UM values (areas around 0.0 under both curves).

group   type    correct  incorrect  correct%
gender  female  1147     562        67.11
gender  male    3648     683        84.22
youth   senior  1231     195        86.32
youth   young   3144     1470       68.14
Table 2. Users classification attending to the minority/non-minority groups.

Given these results, we can confirm that there is a non-negligible proportion of minority users with non-minority preferences and vice versa. In any case, it varies depending on the specific minority group. As an example, we can observe in figure 7 that senior users have far fewer non-minority preferences than female ones, since only a small number of senior users have a minority value greater than zero. Results show the convenience of using modern machine learning approaches to make fair recommendations to those users that share minority and non-minority preferences. Table 2 shows the specific number of users that have been classified as belonging to the minority or to the non-minority groups. Minority users (female, senior) have an expected UM index less than zero. Non-minority users (male, young) have an expected UM index greater than zero.
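The classification behind Table 2 amounts to checking whether the sign of each user's UM value matches the user's group. The sketch below illustrates that rule and recomputes the correct% column from the correct and incorrect counts reported in Table 2. The function names are hypothetical, a UM value of exactly zero is treated as non-minority here, and the percentages are truncated to two decimals because that is what reproduces the reported figures (an inference, not something the text states).

```python
import math

def classify_by_um(um_values, is_minority):
    """Count users whose UM sign matches their group (negative = minority)."""
    correct = sum((um < 0) == minority for um, minority in zip(um_values, is_minority))
    return correct, len(um_values) - correct

def correct_pct(correct, incorrect):
    """Percentage of correctly grouped users, truncated to two decimals."""
    return math.floor(10000 * correct / (correct + incorrect)) / 100

# Counts taken from Table 2.
table2 = {
    "female": (1147, 562),
    "male": (3648, 683),
    "senior": (1231, 195),
    "young": (3144, 1470),
}
for group, (ok, bad) in table2.items():
    print(group, ok + bad, correct_pct(ok, bad))

# Hypothetical UM values for three minority users and two non-minority users.
print(classify_by_um([-0.2, 0.1, -0.4, 0.3, -0.1], [True, True, True, False, False]))
```

Both pairs of groups add up to the 6040 users of the dataset, which is a quick consistency check on the table.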

4.2. Fairness prediction improvement using a heuristic algorithm

         female  male   senior  young
IM mean  -0.014  0.041  -0.025  0.028
Table 3. Averaged IM values for the predictions made to each users’ group

Figure 7 and Table 2 show us that the majority of the users are correctly grouped according to their UM indexes, especially seniors and males. They also show a considerable number of incorrectly classified cases, particularly in the young and female groups. In this situation, we obtain predictions from the test set and then check their quality in terms of the IM index. Table 3 contains the results of these experiments: the IM averages fit the expected ranges (negative IM average for minority users, and positive IM average for non-minority users). Despite these positive results, the ranges can be too narrow to ensure fair predictions. On the other hand, there will be situations in which the intention is to force the recommendations of an RS to move towards minority items, or perhaps towards majority items, depending on the type of users and/or the company policy.

Figure 8. Groups quality improvement by filtering predictions. x axis: alpha values used to filter on the IM items index. y axis: averaged minority of the filtered predictions. Minority (female, senior) curves are drawn using their absolute values.

By filtering on the IM index, we can discard those predictions whose IM value is greater than a negative threshold and, in this way, increase the proportion of minority predictions. In the same way we can discard those predictions whose IM value is less than a positive threshold to increase the proportion of majority predictions. We have performed this experiment, calling the threshold alpha. We can observe the expected behaviour in figure 8, where growing minority (and majority) IM values are obtained in predictions when the alpha parameter increases. It can also be seen that the non-minority users (male, young) always obtain better predictions due to the biases of the RS datasets. Finally, we can state that, in this case, minority values can reach the starting majority ones by using low values of the alpha parameter (0.025 for gender and 0.05 for age).

4.3. Fairness recommendation improvement using the heuristic algorithm

The results of the previous section show that it is possible to provide a heuristic method to improve recommendation fairness. To conduct the experiment, from the alpha-filtered predictions (Figure 8) we extract the ones with the highest prediction values, as is usual in CF. Thus, the complete recommendation method involves three sequential phases: 1) obtain all the pairs, 2) filter the pairs according to the minority threshold (the alpha parameter) and each minority value, and 3) select the filtered predictions that have the highest prediction values.
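The three phases can be sketched as follows. This is a minimal illustration under stated assumptions: `recommend`, the `predict` callable (standing in for whatever CF model produces the predictions), the `minority_user` flag and the toy values are hypothetical names and data introduced here, not taken from the paper.

```python
def recommend(user, items, predict, im, alpha, minority_user, n):
    """Heuristic fair recommendation in three sequential phases."""
    # Phase 1: obtain a prediction for every candidate item of the user.
    predictions = [(item, predict(user, item)) for item in items]

    # Phase 2: filter on the item IM index using the alpha threshold.
    # The side of the threshold depends on the user's demographic group.
    if minority_user:
        kept = [(i, p) for i, p in predictions if im[i] <= -alpha]
    else:
        kept = [(i, p) for i, p in predictions if im[i] >= alpha]

    # Phase 3: select the n filtered items with the highest predictions.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]


# Hypothetical IM values and predicted ratings, for illustration only.
im = {"a": -0.20, "b": -0.08, "c": -0.01, "d": 0.07, "e": 0.15}
scores = {"a": 3.1, "b": 4.6, "c": 4.9, "d": 4.2, "e": 3.8}


def toy_predict(user, item):
    return scores[item]


top_minority = recommend("u1", list(im), toy_predict, im, 0.05, True, 1)
top_majority = recommend("u2", list(im), toy_predict, im, 0.05, False, 2)
```

In the toy data the item with the highest prediction ("c") is discarded for both users because its IM value lies inside the (-alpha, +alpha) band; this is the accuracy price of the filter.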

Figure 9. Recommendation quality obtained by filtering predictions. x axis: alpha values used to filter on the IM items index. y axis: averaged error of the recommendations. Lower error values are the better ones.

Results in figure 9 show the correlation between recommendation errors and the chosen alpha value: the higher the alpha value, the better the recommendation fairness (Figure 8) but, as expected, the worse the recommendation accuracy (higher error values in figure 9). In other words, we pay an accuracy price when we force fairer recommendations.

We have fixed the number of recommendations, N, to process the set of experiments. From figure 9 it can be observed that in the ‘youth’ experiment our method provides better results (lower errors) for the minority ‘senior’ group than for the ‘young’ one. This is a good indication that the proposed heuristic method works as intended. The ‘gender’ experiments show an improvement for the minority female group from a specific threshold value onwards. All these results are consistent with the values in Tables 1 and 2.

4.4. Fairness error and accuracy error for recommendations using the proposed DL architecture

The results obtained in the previous subsection tell us that we have designed a method that correctly provides fair recommendations. It is a simple, functional and easy-to-implement machine learning approach. Nevertheless, it has some drawbacks:

  • Choosing an adequate value for the alpha parameter requires a fine-tuning process.

  • Since the sign of the alpha threshold (less than or greater than zero) depends on whether the user receiving the recommendations is a minority or a non-minority one, this recommendation method can only be applied to users with associated demographic information.

This subsection provides a DL approach that works without the above drawbacks. This method only needs a single parameter, which is used to select the accuracy vs. fairness balance. Its range is [0, 1], where 0 means 100% fairness and 0% accuracy, and 1 means 100% accuracy and 0% fairness. As can be seen, choosing a value is straightforward and intuitive. Moreover, the chosen value does not depend on whether the user belongs to a minority group or not.

Figure 10. Recommendation results using the proposed DL approach. y axis: averaged recommendations error (normalized in the right graphs); x axis: balance between fairness and accuracy (0.0 means 100% fairness and 0% accuracy, and 1.0 means 100% accuracy and 0% fairness).

The proposed DL recommendation method explained in section 3 returns the results shown in figure 10. The graphs on the left of the figure contain the main information. The graphs on the right are scaled to find the optimum accuracy vs. fairness balances. The averaged error of the recommendations (equation 25) is plotted using black lines. Dotted and dashed lines show the minority errors (equation 26); that is, the distance between the minority value of each recommended user (UM) and the average of the minority values (IM) of their N recommended items. We are looking for recommended items in the minority range of the user; e.g. if a user (male or female) has a quite masculine UM value, recommended items whose IM values are near it are the fairest ones, and they generate a low minority (‘femininity’) error.
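As a rough illustration of the minority error described above, the following sketch computes, for each user, the distance between the user's UM value and the average IM value of the recommended items, and then averages over users. It assumes the absolute difference as the distance; the exact form is given by equation 26, and all function names and values here are hypothetical.

```python
from statistics import mean


def minority_error(um_user, im_recommended):
    """Distance between a user's UM value and the mean IM of their items.

    The absolute difference is an assumption made for this sketch.
    """
    return abs(um_user - mean(im_recommended))


def averaged_minority_error(um, im, recommendations):
    """Average the per-user minority error over all recommended users.

    um: dict user_id -> UM value.
    im: dict item_id -> IM value.
    recommendations: dict user_id -> list of the N recommended item ids.
    """
    return mean(
        minority_error(um[user], [im[item] for item in items])
        for user, items in recommendations.items()
    )


# Hypothetical UM and IM values, for illustration only.
um = {"u1": -0.10, "u2": 0.20}
im = {"a": -0.12, "b": -0.08, "c": 0.10, "d": 0.30}
recommendations = {"u1": ["a", "b"], "u2": ["a", "d"]}

fairness_error = averaged_minority_error(um, im, recommendations)
```

Here the items recommended to "u1" average to the user's own UM value (error close to 0), whereas "u2" receives one item far from their minority range, which raises the averaged error.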

‘Gender’ results are shown in the top-left graph of figure 10: as expected, accuracy increases (error decreases) as the balance parameter increases (more importance given to accuracy). The price to pay for this accuracy improvement is a simultaneous increase in the fairness error. As the parameter decreases (more importance given to fairness), the opposite happens: higher prediction errors and lower fairness errors. ‘Youth’ results are shown in the bottom-left graph of figure 10: the curve trends are similar to the ‘gender’ ones. The graphs on the right of figure 10 show the same results using a normalized y axis, which lets us find the optimum values to balance accuracy and fairness in the recommendation task. To optimize results in this experiment, it is necessary to choose a balanced value, somewhat skewed towards the fairness objective. This result tells us that the balanced option can be the default one.
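One possible way to locate such an optimum from the normalized curves is sketched below. It is an assumption of this sketch, not a procedure stated in the paper, that both error curves are min-max normalized and that the optimum is taken where the two normalized curves are closest (their crossing point). The function names and the error values are hypothetical toy data, not experimental results.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] range."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def optimum_balance(balances, accuracy_errors, fairness_errors):
    """Return the balance value where the normalized curves are closest."""
    norm_acc = min_max_normalize(accuracy_errors)
    norm_fair = min_max_normalize(fairness_errors)
    gaps = [abs(a - f) for a, f in zip(norm_acc, norm_fair)]
    return balances[gaps.index(min(gaps))]


# Toy curves: accuracy error falls and fairness error rises as the
# balance parameter moves from 0 (all fairness) to 1 (all accuracy).
balances = [0.0, 0.25, 0.5, 0.75, 1.0]
accuracy_errors = [1.20, 0.95, 0.90, 0.85, 0.80]
fairness_errors = [0.02, 0.05, 0.08, 0.10, 0.12]

best = optimum_balance(balances, accuracy_errors, fairness_errors)
```

Other criteria, such as minimizing the sum of the two normalized errors, are equally plausible; the crossing point is used here only because it is simple to read off a normalized plot.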

5. Conclusions

The obtained results show that designing methods to improve CF fairness is not a simple task, but that it is possible to carry it out. Because an appreciable proportion of minority and non-minority users share preferences, it is necessary to use modern machine learning approaches in order to make fair recommendations not only to the ‘purest’ minority or non-minority users, but also to the users that mix some proportion of minority and non-minority preferences.

The state of the art shows a lack of DL approaches to tackle fairness in RS, probably due to the black box nature of neural networks. The method proposed in this paper relies on an original loss function and input data to balance fairness and accuracy. This method combines several abstraction levels, and it can serve as a baseline for future DL works in the field. An original architecture is provided, where machine learning and DL models are combined to obtain balanced accuracy vs. fairness recommendations. The architecture is based on two base levels, statistical and machine learning, which provide the necessary information to train the DL model that constitutes the third architectural level. The proposed DL method provides a modern approach to tackle fairness in RS. We can easily balance accuracy and fairness, or we can automatically select the optimum trade-off. That is to say, the proposed method manages the inherent loss of accuracy when fairness is increased. Additionally, once the neural network is trained using demographic information, it can predict and recommend to users whose demographic information is unknown.

Results show adequate trends in the tested quality measures: an improvement in fairness at the cost of an expected worsening in accuracy. The proposed machine learning-based heuristic approach and the DL model return similar quality results. Nevertheless, the proposed DL method does not need demographic information in the recommendation feed-forward process. It is also able to balance fairness and accuracy better, and to do so automatically.

Proposed future works are: a) simplifying the architecture, by removing the MF and transferring its functionality to the DL model, b) redefining the item and user minority indexes to better capture the differences between minority and non-minority groups, c) testing the method's behaviour on a variety of CF datasets, d) extending the experiments to different demographic groups (nationality, profession, studies), and e) testing the architecture on non-demographic groups (users that share minority preferences).