
Deep Learning feature selection to unhide demographic recommender systems factors

Extracting demographic features from hidden factors is an innovative concept with multiple relevant applications. The matrix factorization model generates factors that do not incorporate semantic knowledge. This paper provides a deep learning-based method, DeepUnHide, able to extract demographic information from the user and item factors in collaborative filtering recommender systems. The core of the proposed method is the gradient-based localization used in the image processing literature to highlight the representative areas of each classification class. Validation experiments make use of two public datasets and current baselines. Results show the superiority of DeepUnHide in feature selection and demographic classification, compared to state-of-the-art feature selection methods. Relevant and direct applications include explanation of recommendations, fairness in collaborative filtering and recommendation to groups of users.



1. Introduction

Recommender Systems (RS) [28, 37] play an important role in our society: they provide useful information to users by recommending highly demanded products and services. Remarkable examples of RS are Amazon, Netflix, TripAdvisor and Spotify. RS are implemented by means of several filtering strategies, mainly the collaborative [28, 37], content [44], demographic [1], context [40] and social [35] ones. Most commercial RS are based on hybrid models that combine Collaborative Filtering (CF) with some other filtering approaches. In the early days of RS research, CF was implemented using the K-Nearest Neighbours (KNN) algorithm [7]: it is easy to understand, implement and analyse, since it can be considered a white-box method. This approach has also been updated and improved in recent years with promising approaches such as hybrid methods [2] or the addition of information-theoretic quality measures [19]. Nevertheless, the main drawbacks of KNN are its lack of scalability and its poor accuracy.

Due to these drawbacks, this memory-based algorithm has been replaced by model-based ones, mainly Probabilistic Matrix Factorization (PMF) [29] and its variations and improvements, such as non-Negative Matrix Factorization (NMF) [23], Bayesian NMF (BNMF) [14] or two-level MF (TLMF) [25]. Currently, research is also focusing on Deep Learning (DL) [30, 6, 17] based approaches. Model-based approaches are scalable and accurate, but they act as a black box, making it difficult to address some RS goals such as explanation of recommendations or improvements in beyond-accuracy goals such as fairness, diversity, reliability or serendipity. This research evolution is relevant to this paper, since it makes use of an architecture that combines the Matrix Factorization (MF) and DL approaches, trying to unhide the MF black-box model.

CF RS datasets are very sparse [8], and MF models perform a dimensionality reduction to obtain compressed and dense versions of them. In MF models, each user is represented by a reduced number of factors (real numbers) that encode the user’s essence. Each dataset item is represented in the same way. Figure 1 shows the basic MF operation. Its top part represents the compression, where a sparse matrix of ratings is converted into two dense matrices of factors: the users’ and the items’ ones. To predict the rating of a user for an item, the dot product is used; the recommendations for each user are just those items with the best predictions. The bottom of figure 1 shows the essence of the dot product: predictions will be high (the user will like the item) when the user’s factors and the item’s factors are significant and they also match (they have similar values). In the figure 1 example, we can observe that the first and the last factors have similar values: they match each other. The third factors of the user and the item match, but they are not relevant. Finally, the second factors do not match. Each user factor encodes some combination of features; a simplistic view could state that the fifth factor encodes that the user is female and young, whereas the third factor encodes that she is female and likes musical films. Please note that each feature can be coded in several factors. Items are also encoded in factors; e.g., the fourth factor of the film Avatar could encode ‘young’, ‘scifi’ and ‘popular’, whereas the first factor could encode ‘scifi’ and ‘current’.

Figure 1. Matrix factorization and the dot product to make predictions
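The prediction step of figure 1 can be sketched in a few lines of NumPy. The factor values below are invented for illustration and are not taken from the paper or from any dataset; the names `user`, `item_match` and `item_mismatch` are hypothetical.

```python
import numpy as np

# Hypothetical factors for one user and two items (5 factors each).
user = np.array([0.9, 0.1, 0.05, 0.4, 0.8])
item_match = np.array([0.8, 0.9, 0.05, 0.3, 0.9])     # large where the user's factors are large
item_mismatch = np.array([0.1, 0.9, 0.05, 0.0, 0.1])  # large where the user's factors are small

def predict(user_factors, item_factors):
    """MF prediction: dot product of the user and item factor vectors."""
    return float(np.dot(user_factors, item_factors))

pred_match = predict(user, item_match)
pred_mismatch = predict(user, item_mismatch)

# The per-factor products show which factors drive the prediction.
contributions = user * item_match
```

The item whose significant factors coincide with the user's significant factors receives the higher prediction, and `contributions` splits that prediction into one term per factor.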

It would be great if MF models could return the semantic meaning of each of their factors, but in fact none of them can. MF can predict how much a user will like an item not yet rated, and it can even relate users or items by measuring the distance between their factors, but MF cannot directly establish why it predicts that you will like one item and dislike another. MF acts only on the ratings; it does not directly process demographic information (gender, age, etc.) because it has not been designed to do so. To better understand this concept, figure 2 shows the hidden factors of both the items and the users in grey: this means that we do not know the semantics of the hidden factors. Some algorithmic process is needed to show us the MF factors from a different perspective, in the same way that infrared cameras show our environment. This algorithmic process is represented in figure 2 as a magnifying glass. The method proposed in this paper performs the function of this magnifying glass, and it can show the semantic meaning of MF factors with respect to the features we have selected (usually demographic ones).

Inside the magnifying glass, new coloured information is shown. It tells us the degree to which each factor encodes the different demographic features; e.g., factor 1 of user u has a large proportion of the female feature, followed by a smaller proportion of the young feature; in the same way, factor 2 of item i mostly encodes a drama film that female fans of Brad Pitt could like. Obtaining this additional information is important, since it opens the door to designing improved methods in different RS research fields, such as recommendation to groups of users or recommendation to users who share minority preferences. In the former case, a representative virtual user can be obtained by combining the factor values with the factor demographic proportions; in the latter case, a new feature ‘minority’ can be created to identify minority users. Beyond these two examples, we have selected two main RS research fields where the proposed method can be particularly important: explanation of recommendations and fair recommendations. The right side of figure 2 shows the proposed strategies to address both objectives. Explanation of recommendations can be based on the impact that each demographic feature has on the recommendation; in this example we can inform the user that the recommendation is mostly based on the feminine component of the film and, to a much lesser extent, on its being a drama film that female fans of Brad Pitt usually like. Fair recommendations can be obtained by acting on the dot product stage, positively weighting the desired demographic features; figure 2 shows an example where female factors are given most of the prediction importance: it is an example of fair recommendations applied to the female minority group (most RS datasets are biased in gender and age).

Figure 2. Matrix Factorization hidden factors semantic and applications
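The two strategies on the right side of figure 2 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the factor values, the per-factor `female_share` proportions, and in particular the weighting scheme `1 + boost * share`, which is one simple way of "positively weighting" a feature and is not a formula given in the paper.

```python
import numpy as np

user = np.array([0.9, 0.2, 0.6])
item = np.array([0.8, 0.5, 0.7])
# Hypothetical share of each factor attributed to the 'female' feature (values in [0, 1]).
female_share = np.array([0.7, 0.1, 0.4])

products = user * item              # per-factor terms of the dot product
prediction = float(products.sum())

# Explanation: fraction of the prediction attributable to the 'female' feature.
female_part = float((products * female_share).sum() / prediction)

def weighted_prediction(user_factors, item_factors, share, boost):
    """Dot product in which each factor is up-weighted in proportion to `share`."""
    weights = 1.0 + boost * share
    return float(np.sum(weights * user_factors * item_factors))

fair_prediction = weighted_prediction(user, item, female_share, boost=0.5)
```

With `boost=0` the weighted prediction reduces to the plain dot product, so the parameter controls how strongly the chosen demographic feature is privileged.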

The recommendation fairness approach is innovative in the field. Some of the CF fairness research has focused on the KNN algorithm, since it is a white-box approach that allows the design of tailored solutions, such as in [9], where fairness is obtained by choosing balanced neighborhoods. In model-based CF, data biases have been studied as a source of unfair recommendations [11]. Diversity in recommendations leads to unfair results and discrimination [24], but it is also necessary to balance different goals such as fairness, accuracy, diversity and novelty. Bias disparity has been defined as “how much an individual’s recommendation list deviates from his or her original preferences in the training set” [38]. Research in CF fairness has focused on studying dataset bias rather than on designing models to tackle the problem: “teams typically look to their training datasets, not their machine learning models, as the most important place to intervene to improve fairness in their products” [16]. A review of the RS fairness issue is presented in [10], where some frontiers in the field are highlighted. The MF method cannot easily manage the two main sources of imbalanced data: observation bias and population imbalance [43]. As can be seen, no model-based CF approach has been proposed along the same lines as the one in this paper; additionally, our method makes use of DL technology, a specific field with little research on the RS fairness issue: in the survey of DL-based RS [30], fairness is not mentioned, not even in its research directions section. The review paper [3] does not address fairness either.

The research field of recommendation explanations [32] has a KNN-based area [45, 33] that is not relevant to our model-based approach. Several strategies have been designed to address CF explanations: graphs have been used to relate recommendation sources [27], and explanations of group recommendations have been designed based on the social reality of the group, looking for positive reactions from the members of each group [34].

Recommendations have also been explained by using temporal information of the ratings [4, 39]. Trees have been used in which neighbour users and related items are drawn around the position of the recommended user [15]. As far as we know, there is no published DL model that addresses the explanation of CF recommendations made through MF factors; nevertheless, there is a paper that emphasizes the importance of demographic information versus content information in CF explanations [5].

Feature selection is also related to this paper, since we test the results of the proposed method by selecting the most promising factors, discarding the rest and measuring the impact of the filtering. Figure 3 shows the concept: we can compare the classification results on a demographic target (e.g., gender) obtained by using all the existing factors with the classification results obtained by using just a subset of the factors (only 2 factors in this example). The more similar the classification accuracy, the better the feature selection. To claim the superiority of the proposed DL feature selection method, we compare it with a set of popular feature selection baselines: logistic [31], entropy [18], variance [41] and Principal Component Analysis (PCA) [21]. Fairness impact can also be tested, since by focusing on demographic feature selection we can obtain demographic-based fair recommendations.

Figure 3. Feature selection based on demographic information
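As a concrete instance of the idea in figure 3, the sketch below implements the simplest of the baselines named above, variance-based selection, on a synthetic users-by-factors matrix. The data, the column scales and the function name `select_by_variance` are assumptions for illustration; the paper does not specify how its variance baseline was implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a users x factors matrix (200 users, 6 factors);
# the columns are scaled so that they have clearly different variances.
scales = np.array([0.1, 2.0, 0.5, 3.0, 0.2, 1.0])
factors = rng.normal(size=(200, 6)) * scales

def select_by_variance(matrix, n):
    """Return the indices of the n columns with the largest variance."""
    variances = matrix.var(axis=0)
    return np.argsort(variances)[::-1][:n]

selected = select_by_variance(factors, 2)
reduced = factors[:, selected]      # only the selected factors are kept
```

A classifier trained on `reduced` can then be compared with one trained on `factors`; the closer the two accuracies, the better the selection.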

Finally, there is a DL research field that we have borrowed from the image processing area to act as the kernel of the proposed method: gradient-based localization in deep networks. Grad-CAM [36] uses any target concept (say ‘cat’) in a classification network to generate a localization map. It activates the relevant areas of the image that encode the concept. Grad-CAM generalizes the CAM research [46], where generic localizable DL representations are built. CAM uses global average pooling as a structural regularizer [26]. Neural style transfer (NST) [20] is also a reference for the proposed approach; in this case, a source image is converted to the style of another image that acts as a target. This is done by minimizing the gradient between the source image and one or several chosen filters of a Convolutional Neural Network (CNN). Our method performs this operation using a noisy source instead of a regular image. NST was introduced by [13], using intermediate layers of the VGG-19 [42] network to capture different styles. The style representation is based on the Gram matrix [12], matching style and stylised images. To show the concept graphically, we have designed an NST and fed it with two works by Picasso; as can be seen in figure 4, the style of the style image has been transferred to the source image. We have chosen the ’block1_pool’ and ’block2_pool’ layers of the VGG19 network as style filters.

Figure 4. Example of Neural Style Transfer
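The Gram-matrix style representation mentioned above can be sketched as follows. This is a simplified illustration: random arrays stand in for the activations of a CNN layer, and the plain mean squared difference used in `style_loss` is an assumed normalization, not necessarily the one used to produce figure 4.

```python
import numpy as np

def gram_matrix(feature_maps):
    """Gram matrix of a (height, width, channels) activation tensor."""
    h, w, c = feature_maps.shape
    flat = feature_maps.reshape(h * w, c)   # one row per spatial position
    return flat.T @ flat                    # (channels, channels) correlations

def style_loss(style_maps, generated_maps):
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(style_maps) - gram_matrix(generated_maps)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
style = rng.random((4, 4, 3))       # random stand-ins for CNN activations
generated = rng.random((4, 4, 3))
loss_value = style_loss(style, generated)
```

Because the Gram matrix discards spatial positions and keeps only channel correlations, minimizing this loss transfers texture-like style rather than image content.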

The rest of the paper is structured as follows: Section 2 explains the proposed method and defines the design of the experiments. Section 3 shows the results of the experiments and their discussion. Finally, Section 4 contains the main conclusions of the paper and future work.

2. Model

The proposed method to unhide MF factors is inspired by the gradient-based localization [13] and the neural transfer learning [42] techniques. Since it is a DL approach to unhide factors, we have called it DeepUnHide. First, we must design an architecture and then apply the proposed method to it. Figure 5 shows the DeepUnHide architecture; it is composed of three abstraction levels: raw data, Machine Learning (ML) and DL. The raw data abstraction level feeds the architecture with the necessary information; in our case it just needs the CF matrix of ratings and the selected demographic information (gender, age, etc.). The ML abstraction level is in charge of providing the MF hidden factors. For that purpose, in the proposed model we use a standard MF method such as PMF [29].

As can be seen in figure 5, we only take the users’ factors, since we want to explain recommendations based on demographic data related to users. It is also possible to explain recommendations based on demographic data related to items (genre, popularity, director, etc.); in that case we would take the items’ factors to feed the following architectural layer. Our last abstraction level is the DL one: we make use of a Multilayer Neural Network (MLN) to classify users by demographic information. In this paper, as an example, we have chosen the male/female and young/senior groups. Notice that our objective is not to classify users: we train this MLN to feed the proposed method with the learned weights of the neural network.

Figure 5. DeepUnHide Architecture

2.1. Gradient localization for image processing

Once the architecture is set, we can explain the details of the proposed DeepUnHide method, which uses the MLN information shown with the magnifying glass metaphor in figure 5. In gradient-based localization [13], we conceptually carry out a process similar to the one shown in figure 4, but there are some key differences. One of them is that we do not have a defined source image: we use a noise source. Another difference is that, instead of using images, our source is a list of hidden factors. To explain the first concept graphically, we make use of the image processing field: figure 6 shows the learnt pattern of each filter in the VGG16 Block4_conv1 layer. These filters help to classify some of the images used to train the VGG16 Convolutional Neural Network (CNN) [22]. Each of these activation maps can serve to obtain the input pattern that best activates the corresponding filter: this is the key to gradient-based localization. To obtain these input patterns, we apply an initial random noise image to the input of the classification neural network. Afterwards, we use the gradient descent algorithm to iteratively change the input image until the loss function is minimized. Here, the loss function is the distance between each activation map that the input image generates and the chosen filter values.

Figure 6. Learnt patterns in the 64 filters of the VGG16 Block4_conv1 layer

Each row in figure 7 shows an example of the aforementioned process: rows contain the gradient descent result of applying a noisy image (left) to the CNN, by minimizing the loss differences with four of the filters in figure 6. Each of the right-most images of figure 7 shows the input pattern that maximizes its corresponding filter detection. They can be considered as representative patterns in some areas of different types of images. What is relevant to us is the concept that using gradient descent on a pre-trained MLN we can find representative input patterns of the output targets. Moving to the RS field and using the DeepUnHide Architecture (Figure 5) we can find representative patterns of demographic features; more precisely: we can find the user factors values that best represent the male, female, young, senior, etc. users.

Figure 7. Gradient descent intermediate images obtained from a noisy picture to four of the activation maps in figure 6

2.2. Gradient localization DeepUnHide 

The proposed DeepUnHide method is explained in figure 8. Starting from the trained MLN of the DeepUnHide architecture (figure 5), an initial random list of factors, or an initial list of factors filled with zeros, is presented to the MLN (“initial list of noisy factors”). Using this starting vector, a feed-forward pass is run to obtain the prediction; then an output and a loss error are obtained; e.g., we expect the value 0 in the female case or the value 1 in the male one. Then, in the first iteration, the gradient descent algorithm moves the input values in the direction that reduces the error. As usual in gradient descent, the process is repeated until the error reaches a threshold or until a prefixed number of iterations has been run. At the end of the process we obtain the factor values of the representative demographic user (male, female, young, etc.). Please note that by repeating this process for a set of demographic features we can obtain the proportions shown in figures 2 and 3; it demographically unhides the users’ factors. This process can also be applied to demographically unhide the items’ factors.

Figure 8. DeepUnHide gradient based localization method

2.3. Mathematical formulation of DeepUnHide 

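To be precise, suppose that we have an initial sparse matrix of ratings $R = (r_{ui})$, where $r_{ui}$ is the rating that user $u$ assigned to item $i$ (say, on a discrete scale from $1$ to $5$). Denote the number of users in the model by $N$ and the number of items by $M$, so that $R$ is an $N \times M$ matrix.

The objective of PMF is to find a dense matrix $\hat{R}$ that coincides with $R$ as much as possible in the known ratings. For this purpose, we look for a factorization of the form $\hat{R} = P Q^{T}$, where $P$ is an $N \times K$ matrix and $Q$ is an $M \times K$ matrix. The interpretation of these matrices is that the $u$-th row of $P$, $p_u$, is the $K$-dimensional vector of hidden factors of the user $u$; and analogously for the $i$-th row of $Q$, $q_i$, for the hidden factors of the item $i$. In this way, we want to minimize the cost function

$$\mathcal{F}(P, Q) = \left\| R - P Q^{T} \right\|^{2} = \sum_{r_{ui} \neq \bullet} \left( r_{ui} - p_u \cdot q_i \right)^{2},$$

where $r_{ui} \neq \bullet$ denotes that the rating of the user $u$ to the item $i$ is known, and $\| \cdot \|$ denotes the usual Euclidean distance between the vectors of known ratings. A standard gradient descent algorithm with regularization for minimizing the cost function leads to the update rule

$$p_u \leftarrow p_u + \gamma \left( e_{ui}\, q_i - \lambda\, p_u \right), \qquad q_i \leftarrow q_i + \gamma \left( e_{ui}\, p_u - \lambda\, q_i \right),$$

where $e_{ui} = r_{ui} - p_u \cdot q_i$ is the prediction error and $\gamma, \lambda > 0$ are two hyperparameters of the training method (the step of the gradient descent and the regularization weight).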
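The regularized gradient descent described above can be sketched with stochastic updates over the known ratings. This is a minimal illustration under assumptions: the toy ratings, the function names `train_mf` and `cost`, and the hyperparameter names `gamma` (step) and `lam` (regularization) are all hypothetical, and the paper does not state which PMF implementation it used.

```python
import numpy as np

def train_mf(ratings, k=2, gamma=0.01, lam=0.02, epochs=500, seed=0):
    """Factorize the known ratings by stochastic gradient descent.

    ratings: list of (user, item, rating) triples; unknown ratings are absent.
    """
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    P = rng.normal(scale=0.1, size=(n_users, k))   # users' hidden factors
    Q = rng.normal(scale=0.1, size=(n_items, k))   # items' hidden factors
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - P[u] @ Q[i]                    # prediction error
            p_old = P[u].copy()
            P[u] += gamma * (e * Q[i] - lam * P[u])
            Q[i] += gamma * (e * p_old - lam * Q[i])
    return P, Q

def cost(ratings, P, Q):
    """Sum of squared errors over the known ratings only."""
    return sum((r - P[u] @ Q[i]) ** 2 for u, i, r in ratings)

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P, Q = train_mf(ratings)
```

The rows of `P` are the per-user factor vectors that the rest of the method takes as input; `P @ Q.T` gives the dense matrix of predictions.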

As mentioned above, in this paper we will focus on unhiding the users’ factors, $P$, so we set aside the items’ factors $Q$. Now, we focus on some demographic binary classification into a majority/minority group (say male/female or young/senior). For that purpose, we consider an MLN, $f \colon \mathbb{R}^{K} \to [0, 1]$, trying to fit the perfect classification given by $f(p_u) = 1$ if $u$ belongs to the majority group and $f(p_u) = 0$ if $u$ belongs to the minority group. The neural network $f$ is trained with the usual gradient descent optimization on its parameters (the so-called backpropagation method).

Once this DL step is completed, we look for the factors $p \in \mathbb{R}^{K}$ that maximize the expectancy of $f$ predicting a given demographic group. Hence, we fix an objective target $t$ ($t = 1$ if we are focusing on the majority group and $t = 0$ if we are interested in the minority group). Now, we define the cost function

$$\mathcal{L}_t(p) = \frac{1}{2} \left( t - f(p) \right)^{2}.$$

That is, $\mathcal{L}_t(p) = 0$ if and only if $f(p) = t$, which means that $p$ is the ‘archetypal’ vector of user factors of a member of the demographic group $t$. In order to minimize $\mathcal{L}_t$, we use a standard gradient descent algorithm. For this purpose, observe that the gradient of $\mathcal{L}_t$ is given by

$$\nabla \mathcal{L}_t(p) = -\left( t - f(p) \right) \nabla f(p).$$

Observe that the gradient $\nabla f(p)$ can be easily computed in terms of the internal weights of the MLN by means of the usual backpropagation method. Therefore, the usual gradient descent method leads to the update rule

$$p \leftarrow p - \eta\, \nabla \mathcal{L}_t(p) = p + \eta \left( t - f(p) \right) \nabla f(p).$$

Here, $\eta > 0$ is a hyperparameter of the training process that corresponds to the step of the gradient descent. The initial guess for $p$ can be taken as a random vector drawn from a uniform distribution, or just as the zero vector. This process is the so-called gradient localization in the image processing literature.
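The gradient descent on the input described above can be sketched without any DL framework. The sketch below is an illustration under stated assumptions: the network is a stand-in with fixed random weights rather than a trained MLN, it uses a tanh hidden layer (the MLN in the experiments uses ReLU), and the names `f`, `grad_f`, `archetype` and the step `eta` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H = 8, 4                      # number of factors, hidden units
# Stand-in for a trained MLN: fixed random weights (not trained on real data).
W1 = rng.normal(size=(K, H))
w2 = rng.normal(size=H)

def f(p):
    """Classifier output in (0, 1) for a vector of factors p."""
    h = np.tanh(p @ W1)
    return 1.0 / (1.0 + np.exp(-(h @ w2)))

def grad_f(p):
    """Gradient of f with respect to the input p (backpropagation by hand)."""
    h = np.tanh(p @ W1)
    y = 1.0 / (1.0 + np.exp(-(h @ w2)))
    return y * (1.0 - y) * (W1 @ (w2 * (1.0 - h ** 2)))

def archetype(target, steps=300, eta=0.05):
    """Gradient descent on the input, minimizing 0.5 * (target - f(p))**2."""
    p = np.zeros(K)                                    # zero initial guess
    for _ in range(steps):
        p = p + eta * (target - f(p)) * grad_f(p)      # step against the loss gradient
    return p

p_major = archetype(1.0)   # representative factors for target 1
p_minor = archetype(0.0)   # representative factors for target 0
```

Only the input vector is updated; the weights `W1` and `w2` stay fixed throughout, which is what distinguishes this procedure from ordinary training.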

As a result of this optimization step, we get two preferred vectors of user factors, $p^{+}$ and $p^{-}$, for the majority and the minority group. As mentioned above, these can be understood as the factors of a representative user of each demographic group. Let us write the components of these vectors as $p^{+} = (p^{+}_1, \ldots, p^{+}_K)$ and $p^{-} = (p^{-}_1, \ldots, p^{-}_K)$. In order to interpret these vectors as amounts of affinity, we normalize them to take values in the interval $[0, 1]$ as

$$\tilde{p}^{+}_k = \frac{p^{+}_k - p_{\min}}{p_{\max} - p_{\min}}, \qquad \tilde{p}^{-}_k = \frac{p^{-}_k - p_{\min}}{p_{\max} - p_{\min}},$$

where $p_{\min} = \min_k \{ p^{+}_k, p^{-}_k \}$, $p_{\max} = \max_k \{ p^{+}_k, p^{-}_k \}$, and $k = 1, \ldots, K$. In this way, a value of $\tilde{p}^{+}_k$ (resp. $\tilde{p}^{-}_k$) near $1$ shows that the $k$-th factor characterizes a hidden characteristic that is like-minded to the majority (resp. minority) group, whereas a value near $0$ evidences that the $k$-th factor measures a characteristic that is typically disliked by the majority (resp. minority) group.

This idea leads to a feature selection criterion of relevant factors for the majority (resp. minority) group by sorting the factors by decreasing value of $\tilde{p}^{+}_k$ (resp. $\tilde{p}^{-}_k$). In this way, having fixed a number of desired factors $n \leq K$, we can obtain the subsets $S^{+}_n$ and $S^{-}_n$ of the $n$ most relevant factors for the majority and minority group, respectively.

Moreover, this information can also be used for providing an absolute measure of the importance of each factor to the majority/minority dichotomy, as the distance of this factor between the majority archetypal user and the minority archetypal user. Hence, we take

$$\rho_k = \left| \tilde{p}^{+}_k - \tilde{p}^{-}_k \right|.$$

In this way, high values of $\rho_k$ evidence that the $k$-th factor typically has a large variation from one demographic group to the other (say, it is high in the majority group and low in the minority group, or vice versa), whereas low values of $\rho_k$ point out that this factor is similar in both demographic groups. Therefore, factors with high relevancy are the best indicators of the membership of a user in a group. Again, this relevancy can also be used as a feature selection criterion for choosing the factors that are most relevant for the associated classification problem.
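The normalization, the per-group ranking and the relevancy measure can be sketched together. The two archetypal vectors below are invented, and the sketch assumes that both vectors are rescaled with a shared minimum and maximum; that joint min-max scaling is an assumption about the normalization, as are the names `major_norm`, `minor_norm`, `relevancy` and `top_n`.

```python
import numpy as np

# Hypothetical archetypal factor vectors for the majority and minority groups.
p_major = np.array([2.0, -1.0, 0.5, 0.0])
p_minor = np.array([-2.0, -0.5, 0.5, 1.0])

# Joint min-max normalization to [0, 1] (assumed form of the normalization).
lo = min(p_major.min(), p_minor.min())
hi = max(p_major.max(), p_minor.max())
major_norm = (p_major - lo) / (hi - lo)
minor_norm = (p_minor - lo) / (hi - lo)

# Relevancy: absolute difference between the two normalized archetypes.
relevancy = np.abs(major_norm - minor_norm)

def top_n(scores, n):
    """Indices of the n largest scores, in decreasing order of score."""
    return np.argsort(scores)[::-1][:n]

best_for_major = top_n(major_norm, 2)   # factors most aligned with the majority group
most_relevant = top_n(relevancy, 2)     # factors that best separate the two groups
```

In this toy case factor 2 is identical in both archetypes, so its relevancy is zero and it is the last candidate for selection.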

2.4. Implementation of the model

Algorithm 1 implements the DeepUnHide internals using Keras and Python. Since it is a really short piece of code, we have considered it useful to include the algorithm in this paper in order to explain the method, to make the experiments easy to reproduce and to serve as a base for future work. Before running the procedure shown, we have trained an MLN using Keras. In our example, we have chosen an architecture with 2 layers, where ’dense_1’ is the hidden layer and ’dense_2’ is the output layer (see figure 8). Line 2 establishes the output of the model: the ’dense_2’ layer of the neural network drawn in figure 8. Line 3 sets the loss function, which is zero when the MLN predicts the requested demographic feature correctly. Line 4 does the hard work, obtaining the gradients of the loss with respect to the input. Line 5 just normalizes the gradients. Line 6 defines the function that returns the loss and the gradient for a given input (the input factors). These input factors are initialized in line 7. Then, a gradient descent loop is set up in line 8 to run each iteration and to obtain the new gradient values in line 9. Finally, the input factors are updated in small 0.1 steps in line 10. This line of code generates the subsequent input factors shown on the left of figure 8, where the result is shown on a grey background.

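import numpy as np
from keras import backend as K

# `model` (the trained Keras MLN) and `userFactors` (the users' factor matrix)
# are assumed to exist already. K.gradients requires graph (non-eager) mode.
def factors(gender):                                                         # line 1
    output = model.get_layer("dense_2").output                              # line 2: model output
    loss = 0.5 * (gender - output) ** 2                                     # line 3: squared error
    gradient = K.gradients(loss, model.input)[0]                            # line 4: d(loss)/d(input)
    gradient /= (K.sqrt(K.mean(K.square(gradient))) + 1e-5)                 # line 5: normalize
    iteration = K.function([model.input], [loss, gradient])                 # line 6
    input_factors = np.expand_dims(np.zeros(userFactors.shape[1]), axis=0)  # line 7
    for i in range(20):                                                     # line 8
        loss_value, gradient_value = iteration([input_factors])             # line 9
        input_factors -= gradient_value * 0.1                               # line 10: subtract to minimize the loss
    return input_factors

MALE, FEMALE = 1., 0.
male_reference = factors(MALE)
female_reference = factors(FEMALE)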

3. Experiments and results

This section tests the proposed DL method and architecture on two representative public datasets. Four consolidated feature selection baselines are used (logistic, entropy, variance and PCA), together with a random baseline. The classification accuracy score has been selected to measure the quality of the results. Most of the experiments compare the results obtained by choosing different numbers of selected features. Four classification models are used in the experiments: neural networks, logistic regression, SVM and random forest. Cross validation has been implemented by using a 70% training set, a 10% validation set and a 20% testing set. The datasets chosen for the experiments are the popular MovieLens and MyAnimeList. Both of them contain demographic information. MyAnimeList contains more than five million ratings, and the selected MovieLens version has only 100,000 ratings; in this way we test the proposed method on two datasets of very different sizes. Some relevant dataset facts are shown in table 1. We will show all the results from the MovieLens dataset, and the most representative ones from the MyAnimeList dataset. Two demographic features have been tested: gender and age; we have found only small differences between their results. To keep the paper to a reasonable size, and to avoid including redundant information, the figures in this section are restricted to the gender results.

| Dataset | #users | #items | #ratings | scores |
| --- | --- | --- | --- | --- |
| MovieLens | 943 | 1,682 | 100,000 | 1 to 5 |
| MyAnimeList | 69,600 | 9,927 | 5,788,207 | 1 to 10 |

Table 1. Datasets used in the experimentation
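The 70% / 10% / 20% partition described above can be sketched as a shuffled index split. The sketch assumes that users are the unit being split (since users are what is classified) and that the split is random; neither detail is stated in the text, and `split_indices` is a hypothetical helper.

```python
import numpy as np

def split_indices(n_samples, train=0.7, val=0.1, seed=0):
    """Shuffle sample indices and split them into train/validation/test parts."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(round(train * n_samples))
    n_val = int(round(val * n_samples))
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

# 943 is the number of MovieLens users listed in table 1.
train_idx, val_idx, test_idx = split_indices(943)
```

The three index arrays are disjoint and together cover every user, so each user's factors end up in exactly one of the three sets.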

Throughout the experiments, MF has been run with a large number of factors, $K = 100$, which makes it possible to spread the features among them. Once MF has been run on both datasets, the following step is to train the MLN included in the DeepUnHide architecture (figure 5). We have classified users for both the gender and the age demographic features. The age groups are under 40 years old and 40 or more years old. Figure 9 shows the classification accuracy obtained on the MovieLens dataset for both the gender and the age demographic features. On the MyAnimeList dataset, the MLN reaches a gender accuracy of 0.82. The MLN designed for MovieLens contains a 100-neuron input layer, a 10-neuron hidden layer (ReLU activation), a 0.3 dropout layer and finally a 2-neuron output layer to encode gender and age, using sigmoid activation. The chosen loss function is binary cross-entropy, and the optimizer is RMSprop. For the MyAnimeList dataset, the MLN is similar to the MovieLens one, the only difference being that the hidden layer contains 20 neurons.

Figure 9. Training and test classification accuracy reached in the MLN of the DeepUnHide architecture. Gender classification (a), age classification (b). MovieLens dataset.

Once the MLN of the DeepUnHide architecture (figure 5) has been trained, we run DeepUnHide (figure 8) to obtain the demographic proportions of each user factor, as seen in figure 2. The MovieLens gender (male, female) results are shown in figure 10: its top graph draws the male (blue) and the female (red) proportions that each factor encodes ($x$-axis: factors). Representative factors are those that mostly encode male or mostly encode female: this helps to perform feature selection. The middle graph in figure 10 shows the normalized absolute difference between the female and male proportions: the larger the absolute difference, the better the factor distinguishes the feature. From these values we can select those whose normalized absolute difference exceeds a threshold, obtaining the most relevant factors (selected features).

As an example, the bottom graph of figure 10 shows the factors that exceed the 0.5 threshold: they are the result of the proposed DeepUnHide feature selection. Applying the same process to the age (young, senior) demographic feature in the MovieLens dataset, we obtain the results shown in the top graph of figure 11. Please note that the same factor can be relevant to two different demographic features (such as factor 0 in figures 10 and 11), although this will not be the usual situation when the number of MF factors is high. The bottom graph in figure 11 compares the proportions of the gender and age demographic features for each MF factor. Recommendations can be explained using this information: since we know the importance of each demographic feature for each hidden factor, it is possible to extract this information from the prediction dot product, as shown in figure 2.

Figure 10. Male and female proportions encoded in each of the MF factors; MovieLens dataset. $x$-axis: factor number; top graph: male and female proportions; middle graph: normalized absolute difference between male and female proportions; bottom graph: most relevant factors for the gender demographic feature

Figure 11. Top graph: most relevant factors for the age (young, senior) demographic feature; bottom graph: gender versus age proportions for each MF factor. MovieLens dataset. $x$-axis: factor number
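The threshold-based selection behind the bottom graph of figure 10 can be sketched as follows. The difference values are invented, and dividing by the maximum is an assumed reading of "normalized absolute difference"; the helper name `select_by_threshold` is hypothetical.

```python
import numpy as np

def select_by_threshold(differences, threshold=0.5):
    """Indices of the factors whose normalized difference exceeds the threshold."""
    scaled = differences / differences.max()   # largest difference becomes 1
    return np.flatnonzero(scaled > threshold)

# Hypothetical absolute male/female differences for six factors.
differences = np.array([0.8, 0.1, 0.3, 0.5, 0.05, 0.35])
selected = select_by_threshold(differences)
```

Unlike selecting a fixed number of factors, the size of `selected` here depends on the data: a lower threshold keeps more factors.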

Please note that the bottom graph in figure 10 shows the set of factors that best discriminate the gender demographic feature, whereas the top graph in figure 11 shows the set of factors that best discriminate the age demographic feature. All these factors have been obtained by using a threshold value. Another approach is to select the $n$ factors that best discriminate the desired demographic feature: instead of using the indicated threshold, we just take the $n$ most promising factors. The first two experiments in this section compare the classification quality results obtained by using different values of $n$ (different numbers of factors). These are feature selection experiments; it is expected that the higher the value of $n$, the better the quality results. It is also expected that a reduced number of factors can provide accurate classification values. Finally, the proposed DeepUnHide method should show better scores and trends than the baselines do. The first experiment uses an $n$ range from 1 to 20; the second experiment uses an $n$ range from 5 to 70; finally, the third experiment fixes $n$ to 50.

Once the DeepUnHide feature selection has been made, we have designed three experiments to test that it is correct and that it improves on the state of the art. The three experiments use the classification accuracy quality measure, and all of them compare the proposed approach with several state-of-the-art feature selection methods: logistic [31], entropy [18], variance [41] and PCA [21]. Additionally, a random baseline is used. Each of the three experiments is explained in a separate subsection. The first experiment focuses on the four generated users: the representative male, female, young and senior; it uses an MLN to classify these four vectors of factors. The second experiment classifies all the dataset users by means of an MLN. Finally, the third experiment classifies all the dataset users by means of several ML classification models.

3.1. Classification of the representative users applying a neural network

This experiment uses the factors of the representative male and female (Figure 8, iteration ). The hypothesis is that correct classification can be achieved by using a reduced set of the selected features. Classification is performed by running forward (predicting) the same MLN of the DeepUnHide architecture (Figure 8). Predictions near the value 1 can be considered users classified as ‘male’, whereas predictions near the value 0 can be considered ‘female’ users (the same applies to ‘young’ and ‘senior’). Different classification processes are made using different numbers of factors (from 1 to 20). Figure 12 shows the results obtained for the tested datasets, MovieLens (top graph) and MyAnimeList (bottom graph), when the gender feature is chosen. The correct classification value for the male user (blue) is 1 on the y-axis, and the correct classification value for the female user (red) is 0. Both graphs in figure 12 show, as expected, that classification accuracy rises as the number of selected factors increases. Solid lines in figure 12 correspond to the proposed DeepUnHide method (“Deep”, for short, in legends). Our method works well even with a very small number of selected factors: using just 3 factors it classifies correctly with small errors. All the baselines need a larger number of factors than the proposed method to reach a similar accuracy; logistic is the best baseline on MovieLens, whereas entropy is better on MyAnimeList. None of them can compete with DeepUnHide. Please note that experiments that select a very low number of factors (one to three) can return an ambiguous classification result, since they do not manage enough information to correctly classify users; e.g. the 0.5 and 0.6 classification values in figure 12 when only one factor is selected. The demographic age classification results are very similar to the gender ones.
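The following sketch illustrates how a prediction can be read as ‘male’, ‘female’ or ambiguous when only some factors are kept. It rests on several assumptions that are not stated in the text: non-selected factors are set to zero before the forward pass, a toy single-layer sigmoid model stands in for the trained MLN, and the 0.2 ambiguity margin, the weights and the factor values are purely hypothetical.

```python
import numpy as np

def mask_factors(factors, selected):
    """Keep the selected factors and set every other factor to zero (assumed masking)."""
    masked = np.zeros_like(factors, dtype=float)
    masked[selected] = factors[selected]
    return masked

def predict(factors, weights, bias=0.0):
    """Toy single-layer sigmoid model standing in for the trained MLN."""
    return 1.0 / (1.0 + np.exp(-(factors @ weights + bias)))

def label(prediction, margin=0.2):
    """Map a prediction to a class, or to 'ambiguous' when it is close to 0.5."""
    if abs(prediction - 0.5) < margin:
        return "ambiguous"
    return "male" if prediction > 0.5 else "female"

# Hypothetical model weights and representative male factors (illustrative only)
weights = np.array([2.0, -1.0, 3.0, 0.5])
representative_male = np.array([0.2, -0.5, 1.5, 0.1])

results = {}
for selected in ([0], [0, 2], [0, 1, 2]):
    p = float(predict(mask_factors(representative_male, selected), weights))
    results[tuple(selected)] = label(p)
    print(selected, round(p, 3), results[tuple(selected)])
```

In this toy setting a single factor yields a prediction close to 0.5, which mirrors the ambiguous values reported when very few factors are selected.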

Figure 12. Classification results of the proposed DeepUnHide method (Deep for short) compared to the feature selection baselines: logistic [31], entropy [18], variance [41] and PCA [21]. x-axis: number of selected factors, y-axis: correct classification values (0 for female and 1 for male)

3.2. Classification of all the users applying a neural network

The previous experiment did not classify all the testing users in the dataset. It just classified the representative male user and the representative female user (by using different numbers of selected factors). Experiments in this section make use of an MLN to classify all the users according to their gender demographic feature. The hypothesis here is the same as in the previous experiment: correct classification can be achieved by using a reduced set of selected features. Experiments in this section have been performed varying the number of selected factors. Figure 13 shows the obtained results: as expected, accuracy increases when the number of selected factors grows. The MyAnimeList dataset reaches better classification results; DeepUnHide (Deep for short) improves on the accuracy of all the baselines for every tested number of selected factors.
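The evaluation loop of this experiment can be sketched as below. This is only an illustration on synthetic data: scikit-learn's `MLPClassifier` stands in for the MLN, the ranking by difference of class means is a simple stand-in for the feature selection method (it is not DeepUnHide), and the dataset sizes and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_users, n_factors = 400, 20
X = rng.normal(size=(n_users, n_factors))            # synthetic user factors
y = (X[:, 0] + X[:, 1] + X[:, 2] > 0).astype(int)    # synthetic binary label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Stand-in relevance: absolute difference between class means (training data only)
relevance = np.abs(X_train[y_train == 1].mean(axis=0)
                   - X_train[y_train == 0].mean(axis=0))
ranking = np.argsort(-relevance)

accuracy = {}
for n in (1, 3, 10):
    cols = ranking[:n]                                # keep the n top-ranked factors
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    clf.fit(X_train[:, cols], y_train)
    accuracy[n] = clf.score(X_test[:, cols], y_test)

for n, acc in accuracy.items():
    print(f"{n:2d} selected factors: accuracy = {acc:.3f}")
```

Replacing `ranking` with the ordering produced by each feature selection method gives one accuracy curve per method, which is the kind of comparison plotted in figure 13.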

Figure 13. Classification accuracy of the users based on their gender; a) MovieLens, b) MyAnimeList. x-axis: number of selected factors, y-axis: accuracy. Proposed method: DeepUnHide (Deep for short). Baselines: logistic [31], entropy [18], variance [41] and PCA [21]

3.3. Classification of all the users applying several machine learning models

Since the proposed feature selection method and its architecture are based on a DL model, and the previous section tested accuracy by means of a DL classifier, it is convenient to also test quality with different classification models. In particular, logistic regression, SVM and random forest have been chosen. In this section we compare the accuracy obtained by applying these models to the factors selected by the proposed DeepUnHide method and by the baselines. Figure 14 shows the results obtained on: a) the MovieLens dataset and b) the MyAnimeList dataset. The number of selected factors has been fixed to 50 (half of all the available factors). In line with the previous experiments, the proposed Deep feature selection gets better accuracy than the baselines in all cases. MyAnimeList reaches better classification accuracy than MovieLens, and random forest is the ML model with the best results, although they are worse than those of the DL model (comparing figures 14 and 13).
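A comparison of this kind can be sketched with scikit-learn as follows. The data are synthetic, the selection of 50 out of 100 factors uses a simple class-mean ranking as a stand-in for any of the feature selection methods, and all hyperparameters are library defaults or assumptions; the numbers it prints are not related to the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_users, n_factors, n_selected = 600, 100, 50
X = rng.normal(size=(n_users, n_factors))        # synthetic user factors
y = (X[:, :5].sum(axis=1) > 0).astype(int)       # synthetic binary label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Stand-in feature selection: rank factors by the gap between class means
gap = np.abs(X_train[y_train == 1].mean(axis=0)
             - X_train[y_train == 0].mean(axis=0))
selected = np.argsort(-gap)[:n_selected]

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train[:, selected], y_train)     # train on the selected factors only
    scores[name] = model.score(X_test[:, selected], y_test)
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Keeping the classifier fixed while swapping the feature selection method, and then repeating for each classifier, separates the effect of the selection from the effect of the model.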

Figure 14. Classification accuracy of the users; a) MovieLens, b) MyAnimeList. x-axis: classification model, y-axis: accuracy. Proposed feature selection: DeepUnHide (Deep for short). Baselines: logistic [31], entropy [18], variance [41] and PCA [21]. Demographic feature: gender

3.4. Unhiding the hidden factors in the MovieLens dataset

DeepUnHide has been designed to facilitate the understanding of hidden factors in an MF-based RS by extracting demographic information from users. This information can be used to explain the recommendations made by the RS. This can be done by making use of the hidden factors of the majority archetypal user and the minority archetypal user (see section 2.3). By selecting their highest hidden factors, we can obtain the most representative factors of the majority group and of the minority group.

In this experiment we define an index that assigns a minority value (say, femininity) to each item, and equivalently another index that assigns a majority value (say, masculinity) to each item. These two weights can be obtained by comparing the archetypal users with the hidden factors of the item. To be precise, we define the minority weight of each item as

Analogously, we define a majority weight of each item as

We focus on the MovieLens dataset, assigning male users to the majority group and female users to the minority group. Using the previously computed archetypal male user and archetypal female user, we compute the minority and majority weights for each item of the dataset. The 10 most representative movies of the minority group (i.e. the top 10 items with the highest minority weight) are the following:

  1. Evita (1996)

  2. Crucible, The (1996)

  3. Dirty Dancing (1987)

  4. Nell (1994)

  5. Rosewood (1997)

  6. Dante’s Peak (1997)

  7. Jungle2Jungle (1997)

  8. On Golden Pond (1981)

  9. My Best Friend’s Wedding (1997)

  10. Little Women (1994)

On the other hand, the following 10 movies are the most representative of the majority group (i.e. the top 10 items with the highest majority weight):

  1. Fifth Element, The (1997)

  2. Trainspotting (1996)

  3. Crumb (1994)

  4. Die Hard (1988)

  5. Clerks (1994)

  6. Aliens (1986)

  7. Miller’s Crossing (1990)

  8. Lost Highway (1997)

  9. Brazil (1985)

  10. Dances with Wolves (1990)

For this experiment we have used the top hidden factors of the archetypal users to compute both weights, and all movies with fewer than 75 ratings have been filtered out to avoid cold start situations.
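The ranking procedure of this experiment can be sketched as below. The scoring rule is an assumption made for illustration: the weight of an item is taken as the dot product between the archetypal user and the item factors, restricted to the highest factors of the archetypal user. The function names, the `top_k` value and the toy data are hypothetical; only the filter on items with fewer than 75 ratings comes from the text.

```python
import numpy as np

def group_item_weights(archetype, item_factors, rating_counts, top_k, min_ratings=75):
    """Score every item against an archetypal user using its top_k highest factors.

    Items with fewer than min_ratings ratings get a weight of -inf so that
    they never appear in a ranking (cold start filter).
    """
    archetype = np.asarray(archetype, dtype=float)
    item_factors = np.asarray(item_factors, dtype=float)
    top = np.argsort(-archetype)[:top_k]              # highest factors of the archetype
    weights = item_factors[:, top] @ archetype[top]   # assumed scoring rule
    enough_ratings = np.asarray(rating_counts) >= min_ratings
    return np.where(enough_ratings, weights, -np.inf)

def top_items(weights, n):
    """Return the indices of the n items with the highest finite weight."""
    order = np.argsort(-weights)
    return [int(i) for i in order[:n] if np.isfinite(weights[i])]

# Toy data: 4 items, 3 hidden factors (illustrative values only)
archetypal_user = np.array([0.9, 0.1, 0.5])
item_factors = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.5, 0.0, 1.0],
                         [2.0, 0.0, 2.0]])
rating_counts = np.array([100, 200, 80, 10])

weights = group_item_weights(archetypal_user, item_factors, rating_counts, top_k=2)
ranking = top_items(weights, 2)
print(ranking)
```

Running the same function once with the minority archetypal user and once with the majority archetypal user would produce two item rankings analogous to the two movie lists above.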

4. Conclusions

An innovative approach to unhide demographic features in matrix factorization is presented. It uses the gradient-based localization concept, borrowed from deep learning image processing. The representative user vector obtained for each demographic feature (say, ‘gender’) serves to make the feature selection. Results show an important improvement in classification accuracy when the selected features are applied, compared to the baseline methods. Thus, the results indicate that the proposed deep learning method and architecture accurately capture the hidden semantics of the matrix factorization factors. The obtained results open the door to improvements in several representative research fields in the recommender systems area. Recommendation explanation can be addressed by translating the obtained demographic information into a visual representation of demographic features. Fairness is another important research field where the proposed method has a direct application: fair recommendations can be made by weighting those factors that belong to the biased group of users. The proposed method can also be applied to recommendation to groups of users, making use of the gradient-obtained representative user, which can act as a virtual user for the group.


  • [1] M. Y. H. Al-Shamri. User profiling approaches for demographic recommender systems. Knowledge-Based Systems, 100:175–187, May 2016.
  • [2] A. B. Barragáns-Martínez, E. Costa-Montenegro, J. C. Burguillo, M. Rey-López, F. A. Mikic-Fonte, and A. Peleteiro. A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Information Sciences, 180(22):4290–4311, Nov 2010.
  • [3] Z. Batmaz, A. Yurekli, A. Bilge, and C. Kaleli. A review on deep learning for recommender systems: challenges and remedies. Artificial Intelligence Review, 52(1):1–37, Jun 2019.
  • [4] H. Bharadhwaj and S. Joshi. Explanations for Temporal Recommendations. Künstl. Intell., 32(4):267–272, Nov 2018.
  • [5] M. Bilgic and R. J. Mooney. Explaining recommendations: Satisfaction vs. promotion. In Beyond Personalization Workshop, IUI, volume 5, page 153, 2005.
  • [6] J. Bobadilla, F. Ortega, A. Gutiérrez, and S. Alonso. Classification-based deep neural network architecture for collaborative filtering recommender systems. IJIMAI, 6(1):68–77, 2020.
  • [7] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, Jul 2013.
  • [8] J. Bobadilla and F. Serradilla. The effect of sparsity on collaborative filtering metrics. Proceedings of the Twentieth Australasian Conference on Australasian Database - Volume 92, pages 9–18, 2020.
  • [9] R. Burke, N. Sonboli, and A. Ordonez-Gauger. Balanced neighborhoods for multi-sided fairness in recommendation. In S. A. Friedler and C. Wilson, editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 202–214, New York, NY, USA, 23–24 Feb 2018. PMLR.
  • [10] A. Chouldechova and A. Roth. A snapshot of the frontiers of fairness in machine learning. Commun. ACM, 63(5):82–89, 2020.
  • [11] M. D. Ekstrand, M. Tian, M. R. I. Kazi, H. Mehrpouyan, and D. Kluver. Exploring author gender in book rating and recommendation. Proceedings of the 12th ACM Conference on Recommender Systems, pages 242–250, 2020.
  • [12] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 262–270. Curran Associates, Inc., 2015.
  • [13] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2414–2423. IEEE Computer Society, 2016.
  • [14] A. Hernando, J. Bobadilla, and F. Ortega. A non negative matrix factorization for collaborative filtering recommender systems based on a bayesian probabilistic model. Knowledge-Based Systems, 97:188–202, Apr 2016.
  • [15] A. Hernando, J. Bobadilla, F. Ortega, and A. Gutiérrez. Trees for explaining recommendations made through collaborative filtering. Information Sciences, 239:1–17, Aug 2013.
  • [16] K. Holstein, J. W. Vaughan, H. D. III, M. Dudík, and H. M. Wallach. Improving fairness in machine learning systems: What do industry practitioners need? In S. A. Brewster, G. Fitzpatrick, A. L. Cox, and V. Kostakos, editors, Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI 2019, Glasgow, Scotland, UK, May 04-09, 2019, page 600. ACM, 2019.
  • [17] T. Huang, D. Zhang, and L. Bi. Neural embedding collaborative filtering for recommender systems. Neural Comput. & Applic., pages 1–15, Jun 2020.
  • [18] F. Jiang, Y. Sui, and L. Zhou. A relative decision entropy-based feature selection approach. Pattern Recognit., 48(7):2151–2163, Jul 2015.
  • [19] M. Jiang, Z. Zhang, J. Jiang, Q. Wang, and Z. Pei. A collaborative filtering recommendation algorithm based on information theory and bi-clustering. Neural Comput. & Applic., 31(12):8279–8287, Dec 2019.
  • [20] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song. Neural style transfer: A review. IEEE Transactions on Visualization and Computer Graphics, pages 1–1, 2019.
  • [21] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag New York, 2002.
  • [22] A. Krishnaswamy Rangarajan and R. Purushothaman. Disease Classification in Eggplant Using Pre-trained VGG16 and MSVM. Sci. Rep., 10(2322):1–11, Feb 2020.
  • [23] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 556–562. MIT Press, 2001.
  • [24] J. Leonhardt, A. Anand, and M. Khosla. User Fairness in Recommender Systems. Companion Proceedings of the The Web Conference 2018, pages 101–102, 2020.
  • [25] F. Li, G. Xu, and L. Cao. Two-level matrix factorization for recommender systems. Neural Comput. & Applic., 27(8):2267–2278, Nov 2016.
  • [26] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2014.
  • [27] V. Lully, P. Laublet, M. Stankovic, and F. Radulovic. Enhancing explanations in recommender systems with knowledge graphs. Procedia Comput. Sci., 137:211–222, Jan 2018.
  • [28] K. Madadipouya and S. Chelliah. A Literature Review on Recommender Systems Algorithms, Techniques and Evaluations. BRAIN. Broad Research in Artificial Intelligence and Neuroscience, 8(2):109–124, Jul 2017.
  • [29] A. Mnih and R. R. Salakhutdinov. Probabilistic matrix factorization. In Advances in neural information processing systems, pages 1257–1264, 2008.
  • [30] R. Mu. A Survey of Recommender Systems Based on Deep Learning. IEEE Access, 6:69009–69022, Nov 2018.
  • [31] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. Proceedings of the twenty-first international conference on Machine learning, page 78, 2020.
  • [32] I. Nunes and D. Jannach. A systematic review and taxonomy of explanations in decision support and recommender systems. User Model. User-Adap. Inter., 27(3):393–444, Dec 2017.
  • [33] A. Papadimitriou, P. Symeonidis, and Y. Manolopoulos. A generalized taxonomy of explanations styles for traditional and social recommender systems. Data Min. Knowl. Disc., 24(3):555–583, May 2012.
  • [34] L. Quijano-Sanchez, C. Sauer, J. A. Recio-Garcia, and B. Diaz-Agudo. Make it personal: A social explanation system applied to group recommendations. Expert Syst. Appl., 76:36–48, Jun 2017.
  • [35] A. Rezvanian, B. Moradabadi, M. Ghavipour, M. M. D. Khomami, and M. R. Meybodi. Social Recommender Systems. SpringerLink, pages 281–313, 2019.
  • [36] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, Oct 2017.
  • [37] S. S. Sohail, J. Siddiqui, and R. Ali. Classifications of recommender systems: A review. Journal of Engineering Science and Technology Review, 10(4):132–153, 2017.
  • [38] V. Tsintzou, E. Pitoura, and P. Tsaparas. Bias Disparity in Recommendation Systems. Preprint: arXiv, Nov 2018.
  • [39] P. Valdiviezo-Diaz, F. Ortega, E. Cobos, and R. Lara-Cabrera. A Collaborative Filtering Approach Based on Naïve Bayes Classifier. IEEE Access, 7:108581–108592, Aug 2019.
  • [40] N. M. Villegas, C. Sánchez, J. Díaz-Cely, and G. Tamura. Characterizing context-aware recommender systems: A systematic literature review. Knowledge-Based Systems, 140:173–200, Jan 2018.
  • [41] X. Wang and X. Qian. Total variance based feature point selection and applications. Comput.-Aided Des., 101:37–56, Aug 2018.
  • [42] L. Wen, X. Li, X. Li, and L. Gao. A new transfer learning based on vgg-19 network for fault diagnosis. In 2019 IEEE 23rd International Conference on Computer Supported Cooperative Work in Design (CSCWD), pages 205–209, 2019.
  • [43] S. Yao and B. Huang. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems, pages 2921–2930, 2017.
  • [44] H. Zamani and A. Shakery. A language model-based framework for multi-publisher content-based recommender systems. Information Retrieval Journal, 21(5):369–409, Oct 2018.
  • [45] M. Zanker and D. Ninaus. Knowledgeable Explanations for Recommender Systems. IEEE, pages DateofConference:31Aug.–3Sept.2010, 2020.
  • [46] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, Jun 2016.