RS is an artificial intelligence field that provides methods and models to predict and recommend items to users (e.g. films to persons, e-commerce products to customers, services to companies, Quality of Service (QoS) to Internet of Things (IoT) devices, etc.) (beel2013research). Popular current Recommender Systems are Spotify, Netflix, TripAdvisor, Amazon, etc. RS are usually categorized according to their filtering strategy, mainly demographic (bobadilla2021deep), content-based (deldjoo2020recommender), context-aware (kulkarni2020context), social (shokeen2020study), Collaborative Filtering (CF) (bobadilla2020deep; beel2013research) and filtering ensembles (forouzandeh2021presentation; ccano2017hybrid). CF is the most accurate and widely used filtering approach to implement RSs. CF models have evolved from the K-Nearest Neighbors (KNN) algorithm to Probabilistic Matrix Factorization (PMF) (mnih2007probabilistic), non-negative Matrix Factorization (NMF) (fevotte2011algorithms) and Bayesian non-negative Matrix Factorization (BNMF) (hernando2016non). Currently, deep learning research approaches are growing in strength: they provide improved accuracy compared to the Machine Learning (ML)-based Matrix Factorization (MF) models (rendle2020neural). Additionally, deep learning architectures are usually more flexible than the MF-based ones, introducing combined deep and shallow learning (he2017neural), integrated content-based ensembles (narang2018deep), and generative approaches (bobadilla2020deepfair; gao2021recommender), among others.
DeepMF (xue2017deep) is a neural network model that implements the popular MF concept. DeepMF was designed to take as input a user-item matrix with explicit ratings and non-preference implicit feedback, although current implementations use two embedding layers whose inputs are, respectively, users and items. The experimental results evidence the superiority of Deep Matrix Factorization (DeepMF) over the traditional ML-based RS approaches, particularly the most used MF models: PMF, NMF, and BNMF. Currently, DeepMF is a popular model that is rapidly replacing the traditional MF models based on classical ML. Additionally, DeepMF has been used in the RS field to combine social behaviors (clicks, ratings, …) with images (wen2018visual), and a social trust-aware RS has been implemented by using DeepMF to extract features from the user-item rating matrix to improve the initialization accuracy (wan2020deep). QoS predictions have also been addressed by using DeepMF (zou2020ndmf). To learn attribute representations, a DeepMF model has been used that creates a low-dimensional representation of a dataset that lends itself to a clustering interpretation (trigeorgis2016deep). Finally, the classical matrix completion task has been addressed by using the DeepMF approach (fan2018matrix).
The not so widely spread Neural Collaborative Filtering (NCF) model (he2017neural) may be seen as an augmented DeepMF model, where deeper layers are added on top of the ‘Dot’ one. Additionally, the ‘Dot’ layer can be replaced by a ‘Concatenate’ layer. Figure 1 shows these concepts. NCF slightly outperforms the DeepMF accuracy results, but it increases the runtime required to train the model and to run the forward process: it is necessary to execute the ‘extra’ Multi-Layer Perceptron (MLP) on top of the ‘Dot’ or ‘Concatenate’ layers. Moreover, compared to DeepMF, the NCF architecture adds new hyper-parameters to set: mainly the number of hidden layers (depth) of the MLP and the size (number of neurons) of each layer.
In a different setting, Variational AutoEncoders (VAEs) act as regular autoencoders: they aim to compress the input raw values into a latent space representation by means of an encoder neural network, whereas the decoder neural network performs the opposite operation, seeking to decompress from the latent space to the output raw values. The main difference between classical autoencoders and VAEs lies in the latent space design, meaning, and operation. Classical autoencoders do not generate structured latent spaces, whereas VAEs introduce a statistical process that forces them to learn continuous and structured latent spaces. In this way, VAEs encode each input as the parameters of a multivariate distribution, as shown in fig. 2. From these parameters, a random sample is drawn, and a latent space sample is obtained for each training input (center of fig. 2). The stochasticity of the random sampling improves robustness and forces the encoding of continuous and meaningful latent space representations, as can be seen in fig. 3, which shows the difference between a regular autoencoder latent space representation and its equivalent VAE one.
Due to their properties, VAEs have been used as generative deep learning models in the image processing field. Reconstruction of a multispectral image has been performed by means of a VAE (liu2020multispectral) that parameterizes the latent space with Gaussian distribution parameters. VAEs have also been used to create super-resolution images, as in liu2020unsupervised, where a model is proposed to encode low-resolution images into a dense latent space vector that can be decoded for target high-resolution image denoising. The blurred image problem is tackled using a VAE in liu2020photo by adding a conditional sampling mechanism that narrows down the latent space, making it possible to reconstruct high-resolution images. Moreover, in zhang2021online, the authors propose a flexible autoencoder model able to adapt to data patterns that vary with time. By importing the VAE concept from image processing, several papers have used these models to improve RS results. For instance, denoising and variational autoencoders are tested in liang2018variational, where the authors report the superiority of the VAE option over other models, or in nisha2019social, where variational autoencoders are combined with social information to improve the quality of the recommendations.
The aim of this paper is to propose a neural architecture that joins the best of the DeepMF and NCF models with the VAE concept. These novel models will be called, respectively, Variational Deep Matrix Factorization (VDeepMF) and Variational Neural Collaborative Filtering (VNCF). In contrast with the autoencoder and Generative Adversarial Network (GAN) approaches in the CF field (gao2021recommender), we shall not use the generative decoder stage, and we maintain the regression output layer present in the DeepMF and NCF models. The main advantage of the VAE operation is the robustness that it confers to the latent representation. This robustness can be seen in fig. 3: if we consider each dot as a training sample representation in the latent space, then test samples are more likely to be correctly classified in the VAE model (right graph in fig. 3) than in the regular autoencoder model (left graph in fig. 3). In short, the variational approach stochastically ‘spreads’ the samples in the latent space, improving the chances of classifying samples correctly.
In our proposed RS CF scenario, we expect that rating values can be better predicted when a variational latent space has been learnt, because this space covers a wider, more robust, and more representative latent area. Whereas with a traditional autoencoder each sample would be coded as a single value in the latent space (white circle in fig. 4), the VAE encodes the parameters of a multivariate distribution (e.g. mean and variance of both the blue and the orange Gaussian distributions in fig. 4). From the learnt distribution parameters, random sampling is carried out to generate stochastic latent space values (gray circles in fig. 4). Each epoch in the learning process generates a new set of latent space values. Once the proposed model has been trained, when a ⟨user, item⟩ tuple is presented to the model, the obtained latent space value (green circle in fig. 4) can be better predicted in the VAE scenario than in the regular autoencoder scenario: the randomly sampled values (gray circles) of the enriched latent space help to associate the predicted sample (green circle) with its associated training samples (white circle), making the prediction process much more robust and accurate.
Current CF-based variational autoencoders usually produce raw augmented data: mainly synthetic ratings from users to items, or generated relevant versus non-relevant votes from users to items (liang2018variational; gao2021recommender). This strategy forces us to sequentially run two separate models: the generative model (GAN or VAE) that provides the augmented data, and the regression CF model that makes predictions and recommendations. This approach presents three main drawbacks: 1) complexity, as two separate models are necessary; 2) large time consumption; and 3) sparsity management. As we will explain in depth in the following section, our proposed model does not generate raw augmented data. On the contrary, its innovation is based on the use of a single model to internally manage both the augmentation and the prediction aims. Particularly significant is the way in which the proposed model addresses the sparsity problem: we do not perform augmentation on the sparse raw data (ratings cast from users to items), but an internal ‘augmentation’ process in the dense latent space of the model (figs. 3 and 4). Each sample that is randomly generated from the latent space feeds the regression layers of the model. Thereby, we propose a model that first generates stochastic variational samples in a dense latent space, and then these generated samples act as inputs to the regression stage of the model.
To test these ideas, the hypothesis considered in this paper is that the augmented samples will be more accurate and effective if they are generated in an inner and dense latent space rather than in a very sparse input space. It is important to realize that enriching the inner latent space can improve the recommendation results, but it also injects noise into the latent space that may potentially worsen them. It is expected that the proposed approach will work better with poor latent spaces, whereas when applied to rich spaces, the spurious entropy added by the variational stage could worsen recommendations. Thus, medium-size CF datasets, or large and complex ones, are better candidates to improve their results when the variational proposal is applied, whereas large datasets with predictable data distributions will probably not benefit from the noise injection of the variational architecture.
2 Proposed model
The proposed neural architecture mixes the VAE and the DeepMF (or the NCF) models. From the VAE we take the encoder stage and its variational process, and from the DeepMF or the NCF model we use its regression layers. This is an innovative approach in the RS field, since the VAE and GAN neural networks have only been used as a separate preliminary stage to perform data augmentation, i.e. to obtain enriched input datasets to feed the CF DeepMF or NCF models. Hence, the traditional approach needs to separately train two models: first the VAE and then the DeepMF/NCF networks.
In sharp contrast, our proposed approach efficiently joins the VAE and the deep CF regression concepts to obtain improved predictions with a single training process. In the learning stage, the training samples feed the model (left hand side of fig. 5). Each training sample consists of a ⟨user, item, rating⟩ tuple, i.e. the rating cast by the user to the item. In the DeepMF/NCF architecture, each user is represented by his/her vector of cast ratings, and each item is represented by its vector of received ratings. The model learns the ratings (third element of the tuples) cast by the users (first element) to the items (second element). In other words, the ratings are the outputs of the neural network (right hand side of fig. 5).
2.1 Formalization of the model
The architectural details of the proposed models are shown in fig. 6. For simplicity, only the Variational Deep Matrix Factorization (VDeepMF) architecture is shown in this figure. The corresponding model for NCF, named Variational Neural Collaborative Filtering (VNCF), is analogous to the VDeepMF one: it has the same ‘Embedding’ and ‘Variational’ layers, and we only need to replace the ‘Dot’ layer of DeepMF by a ‘Concatenate’ layer followed by an MLP.
To fix the notation, let us suppose that our dataset contains $N$ users and $M$ items. In general, the aim of any deep learning model for CF-based prediction is to train a (stochastic) neural network that implements a function

$f \colon \mathbb{R}^N \times \mathbb{R}^M \longrightarrow \mathbb{R}.$

This function operates as follows. Let us codify the $i$-th user of the dataset (resp. the $j$-th item) using one-hot encoding as the $i$-th canonical basis vector $x_i \in \mathbb{R}^N$ (resp. the $j$-th canonical basis vector $y_j \in \mathbb{R}^M$). Then, $f(x_i, y_j)$ seeks to predict the score that the $i$-th user would assign to the $j$-th item. To train this function $f$, in the learning phase the neural network is fed with a set of training tuples $\langle i, j, r_{i,j} \rangle$, where user $i$ rated item $j$ with a score $r_{i,j}$, and the function is trained to fit $f(x_i, y_j) \approx r_{i,j}$.
Our proposal for the VDeepMF consists of decomposing $f$ as the composition of an ‘Embedding’ stage, followed by a ‘Variational’ stage and a final ‘Dot’ layer, as shown in fig. 6. The first ‘Embedding’ layer (left hand side of fig. 6) is borrowed from the natural language processing field (he2017neural). The idea is that this layer provides a fast translation of users and items into their respective representations in the latent space. To be precise, this layer implements a function $\mathrm{Embedding}$ that maps a pair $(i, j)$ into a pair of dense vectors $(u_i, v_j) \in \mathbb{R}^d \times \mathbb{R}^d$ that represent the $i$-th user and the $j$-th item, where $d$ is the dimension of the representations.
It is worth mentioning that, even though from a conceptual point of view the ‘Embedding’ layer is a regular MLP dense layer, to save time and space, these ‘Embedding’ layers are typically implemented through lookup tables. In this way, instead of feeding the network with the one-hot encoding $x_i$ of the user (resp. $y_j$ of the item), we input the user via its ID $i$ (resp. the item via its ID $j$). The lookup table efficiently recovers the $i$-th (resp. $j$-th) column of the embedding matrix that contains $u_i$ (resp. $v_j$), so that the translation can be conducted in a more efficient way than with a standard MLP layer by exploiting the sparsity of the input.
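As an illustration, the following NumPy sketch (with toy sizes and a random, hypothetical embedding matrix, not the paper's trained weights) shows why the lookup-table implementation is equivalent to multiplying the embedding matrix by a one-hot vector:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3                   # number of users and embedding dimension (toy sizes)
P = rng.normal(size=(d, N))   # embedding matrix: column i represents user i

i = 2
x_i = np.zeros(N)
x_i[i] = 1.0                  # one-hot encoding of user i

# Dense-layer view: multiply the embedding matrix by the one-hot vector.
u_dense = P @ x_i
# Lookup-table view: directly fetch the i-th column; no multiplication needed.
u_lookup = P[:, i]

assert np.allclose(u_dense, u_lookup)
```

The lookup avoids a full matrix-vector product whose result is determined by a single non-zero input entry, which is exactly the sparsity-exploiting shortcut described above.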
The variational process is carried out by the ‘Variational’ stage (labeled as ‘variational layers’ in the middle of fig. 6). From the latent space representations of the $i$-th user and the $j$-th item, two separate dense layers return the mean and variance parameters of two multivariate Gaussian distributions. In this way, if we fix a latent space dimension $k$, the first part of this ‘Variational’ stage (left part of the middle rectangle of fig. 6) computes a map

$(u_i, v_j) \longmapsto (\mu_i^u, \sigma_i^u, \mu_j^v, \sigma_j^v) \in \mathbb{R}^k \times \mathbb{R}^k \times \mathbb{R}^k \times \mathbb{R}^k,$

where $\mu_i^u$ and $\mu_j^v$ represent the means of the Gaussian distributions associated to the user and the item respectively, and $\sigma_i^u$ and $\sigma_j^v$ their variances. Thus, the output of the ‘Variational’ stage (right part of the middle rectangle of fig. 6) is a pair of random vectors $(\hat{u}_i, \hat{v}_j)$ where

$\hat{u}_i \sim \mathcal{N}(\mu_i^u, \sigma_i^u), \qquad \hat{v}_j \sim \mathcal{N}(\mu_j^v, \sigma_j^v).$
Here, $\mathcal{N}(\mu, \sigma)$ denotes a $k$-dimensional multivariate normal distribution with mean vector $\mu$ and diagonal covariance matrix $\operatorname{diag}(\sigma)$, so that our covariance matrix is always diagonal. Each time a sample is drawn, the ‘Variational’ stage thus returns a pair $(\hat{u}_i, \hat{v}_j)$, which represents the stochastic latent representations associated to the pair $(i, j)$.
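This sampling step is commonly implemented via the reparameterization trick, writing the sample as the mean plus the element-wise standard deviation times standard Gaussian noise. The following NumPy sketch illustrates it under the common convention that the dense layers output the log-variance; the function name and toy values are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) via the reparameterization trick:
    z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

k = 4
mu_u, log_var_u = rng.normal(size=k), rng.normal(size=k)
z_u = sample_latent(mu_u, log_var_u, rng)        # stochastic user representation

# With variance -> 0 the sampling collapses to the mean (deterministic encoder).
z_det = sample_latent(mu_u, np.full(k, -50.0), rng)
assert np.allclose(z_det, mu_u, atol=1e-6)
```

Writing the sample this way keeps the stochastic node outside the learned parameters, so gradients can flow through the mean and log-variance during training.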
The final ‘Dot’ layer (labeled as ‘regression layer’ at the right hand side of fig. 6) in the VDeepMF model is very simple. It is a linear layer that simply computes the dot product of the latent vectors $\hat{u}_i$ and $\hat{v}_j$. Therefore,

$f(x_i, y_j) = \hat{u}_i \cdot \hat{v}_j.$
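Putting the three stages together, a minimal NumPy sketch of the full VDeepMF forward pass might look as follows. All weights here are random toy values, and using a single shared weight matrix per variational head is a simplification of the actual dense layers; the sketch only illustrates the data flow ‘Embedding’ -> ‘Variational’ -> ‘Dot’:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 6, 8, 4            # users, items, latent dimension (toy sizes)

# Toy "trained" parameters: embeddings plus mean/log-variance heads.
P_user, P_item = rng.normal(size=(N, k)), rng.normal(size=(M, k))
W_mu, W_lv = rng.normal(size=(k, k)), rng.normal(size=(k, k)) * 0.1

def vdeepmf_forward(i, j, rng):
    u, v = P_user[i], P_item[j]                  # 'Embedding' stage (lookup)
    mu_u, lv_u = u @ W_mu, u @ W_lv              # 'Variational' heads (user)
    mu_v, lv_v = v @ W_mu, v @ W_lv              # 'Variational' heads (item)
    z_u = mu_u + np.exp(0.5 * lv_u) * rng.standard_normal(k)
    z_v = mu_v + np.exp(0.5 * lv_v) * rng.standard_normal(k)
    return float(z_u @ z_v)                      # 'Dot' regression layer

# Stochastic outputs: repeated calls give different predictions for one pair,
# and at test time several such predictions can be averaged.
preds = [vdeepmf_forward(1, 2, rng) for _ in range(100)]
avg_pred = float(np.mean(preds))
```

The last two lines anticipate the evaluation protocol used later in the paper, where the stochastic predictions for a user-item pair are averaged.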
2.2 Implementation of the model
The model described in Section 2.1 has been implemented in Keras (chollet2015keras), a widely used Python library for deep learning and neural computing. For the sake of reproducibility, the code framework that implements the architecture shown in fig. 6 (both in its VDeepMF and VNCF versions) and the experiments explained in the next section is available at the GitHub repository https://github.com/KNODIS-Research-Group/deep-variational-models-for-collaborative-filtering.
Additionally, as an example, the VDeepMF kernel listing shows the source code of the proposed model: lines 8 to 13 implement the user side of the fig. 6 architecture, whereas lines 15 to 20 do the same job on the item side. Please note the use of the Keras Embedding layers in lines 9 and 16. Lines 10-12 and 17-19 carry out the ‘Variational’ stage. In particular, both the user and the item Lambda layers (lines 12 and 19) run the variational process. They use the sampling function (lines 3 to 6) to combine the mean and variance latent values, making use of the Keras backend random_normal procedure to implement the stochasticity (line 5). Finally, the latent values of users and items are combined by means of the ‘Dot’ layer (line 22) to produce the final output.
3 Empirical evaluation
3.1 Experimental setup
The experimental evaluation has been performed over four different datasets to measure the performance of the proposed method in different environments. The selected datasets are: FilmTrust (guo2013novel), a small dataset that contains the ratings of thousands of users to movies; MovieLens 1M (harper2015movielens), the gold standard dataset in CF-based RS; MyAnimeList (myanimelist), a dataset extracted from Kaggle (www.kaggle.com) that contains the ratings of thousands of users to anime comics; and Netflix (bennett2007netflix), a popular dataset with about a hundred million ratings used in the Netflix Prize competition. Table 1 shows the main parameters of these datasets. The corpus of each dataset has been randomly split into training ratings (80% of the ratings) and test ratings (20% of the ratings).
|Dataset||Number of users||Number of items||Number of ratings||Scores||Sparsity|
|FilmTrust||1,508||2,071||35,494||0.5 to 4.0||98.86%|
|MovieLens||6,040||3,706||1,000,209||1 to 5||95.53%|
|MyAnimeList||69,600||9,927||6,337,234||1 to 10||99.08%|
|Netflix||480,189||17,770||100,480,507||1 to 5||98.82%|
The proposed method has been evaluated from three different points of view: the quality of the predictions, the quality of the recommendations, and the quality of the recommendation lists.
To measure the quality of the predictions, we have compared the real rating $r_{i,j}$ of a user $i$ to an item $j$ in the test split with the predicted one, $\hat{r}_{i,j}$. This comparison has been carried out in three ways: using the Mean Absolute Error (MAE) as in eq. 1, using the Mean Squared Error (MSE) as in eq. 2, and computing the proportion of the explained variance as in eq. 3. Notice that, in eq. 3, $\bar{r}$ denotes the mean of the ratings contained in the test split.
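Assuming the standard definitions of MAE, MSE, and explained variance (the paper's eqs. 1 to 3 are not reproduced here), these three prediction-quality measures can be sketched as:

```python
import numpy as np

def mae(r, r_hat):
    """Mean Absolute Error between real and predicted ratings."""
    return float(np.mean(np.abs(r - r_hat)))

def mse(r, r_hat):
    """Mean Squared Error between real and predicted ratings."""
    return float(np.mean((r - r_hat) ** 2))

def r2(r, r_hat):
    """Proportion of explained variance; r.mean() plays the role of the
    mean rating of the test split."""
    return float(1.0 - np.sum((r - r_hat) ** 2) / np.sum((r - r.mean()) ** 2))

# Toy test-split ratings and predictions (illustrative values only).
r     = np.array([4.0, 2.0, 5.0, 3.0])
r_hat = np.array([3.5, 2.5, 4.5, 3.0])
errors = mae(r, r_hat), mse(r, r_hat), r2(r, r_hat)
```

For MAE and MSE lower is better, whereas for the explained variance higher is better, which is why the three metrics are reported side by side.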
To measure the quality of the recommendations, we have analyzed the impact of the top $n$ items recommended to each user $u$, collected in the list $T_u$. Using precision (eq. 4), we measure the proportion of relevant recommendations (i.e. the user rated the item with a rating equal to or greater than a threshold $\theta$) among the top $n$. Here, $U$ denotes the set of users in the test split. In a similar vein, using recall (eq. 5), we measure the proportion of the test items rated by the user $u$, $R_u$, that were relevant to him or her and were included in the recommended items $T_u$. For the conducted experiments, a dataset-specific threshold $\theta$ was fixed for FilmTrust, for MovieLens and Netflix, and for MyAnimeList, according to each dataset's rating scale.
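Assuming the standard single-user definitions of precision and recall at $n$ (the paper's eqs. 4 and 5 average these over the users in $U$), a minimal sketch is:

```python
def precision_recall_at_n(recommended, test_ratings, theta, n):
    """precision@n and recall@n for a single user.
    recommended: ranked list of item IDs; test_ratings: {item: rating} in
    the user's test split; theta: relevance threshold."""
    top_n = recommended[:n]
    relevant_in_top = sum(1 for i in top_n if test_ratings.get(i, 0) >= theta)
    relevant_total = sum(1 for v in test_ratings.values() if v >= theta)
    precision = relevant_in_top / n
    recall = relevant_in_top / relevant_total if relevant_total else 0.0
    return precision, recall

# Toy example: items 10 and 12 are relevant (rating >= theta = 4).
ratings = {10: 5, 11: 2, 12: 4, 13: 1}
p, r = precision_recall_at_n([10, 11, 12], ratings, theta=4, n=3)
```

In this toy case two of the three recommended items are relevant (precision 2/3), and both relevant test items were recommended (recall 1.0).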
Finally, to measure the quality of the recommendation lists, we use the normalized Discounted Cumulative Gain (nDCG). Suppose that the recommendation list $T_u$ of the user $u$ is sorted decreasingly, so that the items predicted as most relevant are placed in the first positions. Given an item $i \in T_u$, let $p_i$ be its position in the recommendation list. Analogously, suppose that the list $R_u$ of real top recommendations for user $u$ is sorted decreasingly, and denote by $\hat{p}_i$ the position of the item $i$ in this list. In this setting, the Discounted Cumulative Gain (DCG) and the Ideal Discounted Cumulative Gain (IDCG) of the user $u$ are defined as in eq. 6.
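Since eq. 6 is not reproduced here, the following sketch uses a common binary-relevance formulation of DCG/IDCG, in which an item at 1-based position $p$ contributes $1/\log_2(p+1)$ if it is truly relevant; it is meant only to illustrate the mechanics of nDCG, not the paper's exact formula:

```python
import math

def ndcg(predicted_list, ideal_list):
    """nDCG for one user: predicted_list is the model's ranked top-n,
    ideal_list is the true relevance-sorted top-n (both lists of item IDs)."""
    def dcg(ranking, reference):
        # An item at position p contributes 1/log2(p + 1) if it belongs
        # to the reference (truly relevant) set.
        return sum(1.0 / math.log2(p + 1)
                   for p, item in enumerate(ranking, start=1)
                   if item in reference)
    ideal = dcg(ideal_list, set(ideal_list))     # IDCG: best possible ordering
    return dcg(predicted_list, set(ideal_list)) / ideal if ideal else 0.0

# Toy example: one irrelevant item (9) pushed into the second position.
score = ndcg([3, 9, 1], [1, 2, 3])
```

A perfect ranking yields nDCG = 1.0, and every misplaced or irrelevant item lowers the score logarithmically with its position.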
Due to the stochastic nature of the variational embedded space of the proposed method, the test predictions used to evaluate it have been computed as the average of several stochastic predictions performed for each pair of user $u$ and item $i$.
3.2 Experimental results
Table 2 includes the quality of the predictions performed by the proposed model; best values for each dataset are highlighted in bold. Table 2(a) contains the MAE (eq. 1), table 2(b) contains the MSE (eq. 2), and table 2(c) contains the explained variance score (eq. 3). We can observe that the proposed variational approach improves the prediction capability of DeepMF in all datasets except Netflix, and reports worse predictions when it is applied to NCF.
We justify these results by taking into account the features of the deep learning models used and the properties of each dataset. On the one hand, the larger the dataset, the less necessary it is to enrich the ratings with the proposed variational approach. In other words, when the dataset is small, the amount of Shannon entropy (shannon1949mathematical) that it contains might be quite limited. By using a variational method to generate new samples, we add some extra entropy that enriches the dataset, giving the regression part the chance to exploit this extra data. However, large datasets usually present a large entropy, in such a way that the regression models can effectively extract very subtle information from them. In this setting, if we add a variational stage, instead of adding new relevant variability to the dataset, we only add noise that muddies the underlying patterns. For this reason, the variational approach is of no benefit in huge datasets like Netflix.
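To make the entropy argument concrete, the Shannon entropy of an empirical rating distribution can be computed as follows (toy data; the point is that a peaked distribution carries less entropy than a near-uniform one):

```python
import math
from collections import Counter

def shannon_entropy(ratings):
    """Shannon entropy (in bits) of the empirical rating distribution."""
    counts = Counter(ratings)
    total = len(ratings)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A near-uniform rating distribution carries more entropy than a peaked one.
peaked  = [5, 5, 5, 5, 5, 5, 5, 4]
uniform = [1, 2, 3, 4, 5, 1, 2, 3]
assert shannon_entropy(uniform) > shannon_entropy(peaked)
```

Under this view, the variational stage is most useful precisely when the empirical distribution of the dataset is peaked or limited, since the injected stochasticity supplies variability the data itself lacks.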
On the other hand, the NCF model is more complex than the DeepMF one, so data enrichment has less impact for complex models that are able to find more sophisticated relationships between data than simpler models. In fact, based on these results, we can assert that including the variational approach into a simple model such as DeepMF is equivalent to using a more complex model such as NCF.
Furthermore, fig. 7 contains the precision and recall results. In FilmTrust (fig. 7(a)) we can observe that the proposed variational approach reports a huge benefit for the DeepMF model and significantly worsens the results of the NCF model. In MovieLens (fig. 7(b)) and MyAnimeList (fig. 7(c)) the same tendency as in FilmTrust is observed but, in this case, the proposed VDeepMF model is the one that computes the best recommendations for these datasets. In Netflix (fig. 7(d)) the proposed variational approach decreases the quality of the recommendations. These results are consistent with those analyzed when measuring the quality of the predictions. Consequently, it is evident that the proposed variational approach works adequately when the dataset is not too large and the model used is not too complex.
Additionally, fig. 8 contains the nDCG results. We can observe the same trends as in fig. 7. In FilmTrust (fig. 8(a)), the quality of the recommendation lists does not vary regardless of whether the variational approach is used or not. In MovieLens (fig. 8(b)) and MyAnimeList (fig. 8(c)), the combination of the variational approach with a simple model such as DeepMF provides the best results. In Netflix (fig. 8(d)), the variational approach significantly worsens the quality of the recommendation lists.
Finally, table 3 shows the total time and the number of epochs required to fit each model to each dataset using a Quadro RTX 8000 GPU; the best time for each dataset is highlighted in bold. We can observe that adding a variational layer to the model significantly reduces the required fitting time. Variational models are able to generate Shannon entropy that is transferred to the regression stage, leading to a more effective training that requires fewer epochs. Therefore, the fitting time needed to reach acceptable results is substantially lower.
|Model||FilmTrust||MovieLens||MyAnimeList||Netflix|
|VDeepMF||61s (15 epochs)||601s (6 epochs)||7629s (9 epochs)||12655s (3 epochs)|
|DeepMF||75s (25 epochs)||677s (10 epochs)||13217s (20 epochs)||15697s (4 epochs)|
|VNCF||35s (7 epochs)||1030s (9 epochs)||9945s (9 epochs)||12650s (3 epochs)|
|NCF||56s (15 epochs)||876s (10 epochs)||12111s (15 epochs)||16896s (4 epochs)|
4 Conclusions

In the latest trends, the accuracy of RSs is being improved by using deep learning models such as deep matrix factorization and neural collaborative filtering. However, these models do not incorporate stochasticity in their design, unlike variational autoencoders. Variational random sampling has been used to create augmented raw input data in the collaborative filtering context, but the inherent collaborative filtering data sparsity makes it difficult to obtain accurate results. This paper applies the variational concept not to generate augmented sparse data, but to create augmented samples in the latent space codified at the dense inner layers of the proposed neural network. This is an innovative approach that combines the potential of variational stochasticity with the augmentation concept: augmented samples are generated in the dense latent space of the neural network model, thereby avoiding the sparse scenario in the variational process.
The results show an important improvement when the proposed models are applied to middle-size representative collaborative filtering datasets, compared to the state-of-the-art baselines, testing both prediction and recommendation quality measures. In contrast, testing on the huge Netflix dataset not only leads to no improvement, but the recommendation quality worsens: increasing Shannon entropy in rich latent spaces causes the negative effect of the introduced noise to exceed its benefit. Therefore, the proposed deep variational models should be applied seeking a fair balance between their positive enrichment and their negative noise injection. The results presented in this work can be considered generalizable, since they were obtained on four representative and open CF datasets. Researchers can reproduce our experiments and easily create their own models by using the provided framework referenced in section 2. The authors of this work are committed to reproducible science, so the code used in these experiments is publicly available.
Among the most promising future works, we propose the following: 1) Introducing the variational process in the alternative inner layers of the relevant architectures in the collaborative filtering area, 2) Screening the learning evolution in the training process, since it is faster than the classical models but it also requires early stopping in the training stage, 3) Providing further theoretical explanations of the properties of the CF datasets, in terms of Shannon entropy or other statistical features, that ensure a good performance of the proposed models, 4) Applying probabilistic deep learning models in the CF field to capture complex non-linear stochastic relationships between random variables, and 5) Testing the impact of the proposed concept when recommendations are made to groups of users.
Á. G.-P. acknowledges the hospitality of the Department of Mathematics at Universidad Autónoma de Madrid where part of this work was developed. This work was partially supported by Ministerio de Ciencia e Innovación of Spain under the project PID2019-106493RB-I00 (DL-CEMG) and the Comunidad de Madrid under Convenio Plurianual with the Universidad Politécnica de Madrid in the actuation line of Programa de Excelencia para el Profesorado Universitario.
Conflict of interest
The authors declare that they have no conflict of interest.