## 1 Introduction

Recommender systems have demonstrated great commercial value in the era of information overload, because they help users filter out their favorite items precisely from large repositories. Whether in traditional matrix factorization (MF for short) based models [NMF, SVD++] or in recent deep neural models [NCF, NFM], users and items are generally represented as low-dimensional vectors, also known as *embeddings*, which are learned from observed user-item interactions or user/item features. In these models, a user/item representation is a single fixed point of the continuous vector space, which represents a user’s preferences or an item’s characteristics. The final recommendation results are then generated by computing the correlations between user embeddings and item embeddings, for example by taking the inner product of two embeddings [NAIS] or by feeding them into a multi-layer perceptron (MLP for short) [DRM, NCF].

Despite their successes, one non-negligible limitation of these embedding-based models is that they do not handle uncertainty. In a recommender system, users may induce uncertainty for several reasons. One reason is the lack of discriminative information [zhu2018deep], especially for users who have very few or even no observed user-item interactions, e.g., historical ratings or reviews for items. Even for users who have sufficient interactions, uncertainty may also be caused by diversity [bojchevski2018deep], e.g., some users exhibit many very distinct genres of preferences. We use the example in Figure 1 to explain why embeddings corresponding to fixed points cannot handle such cases well. Suppose a user has rated two movies with high scores, and these two movies belong to very distinct genres, which are labeled with different colors. If we use fixed embeddings learned from observed user-movie interactions to represent users and movies, the user’s position in the embedding space (mapped into a 2D map) may lie midway between the two rated movies. If the recommendation is made based on the distance between the embedding position of the user and those of the candidate movies, the user may be recommended a movie whose genre differs from both rated movies, instead of a movie of the same genre as one of them, simply because the former is closer to the user. In another case, the user’s position may be closer to one rated movie than to the other, and then movies of the other genre still have less chance of being recommended to the user.

In recent years, some researchers have employed *Gaussian embeddings* to learn the representations of words [vilnis2014word] and nodes in a graph [bojchevski2018deep, zhu2018deep] because of their advantage in capturing the uncertainty in representations. This motivates us to employ Gaussian embeddings in recommender systems to represent users and items more flexibly. In a Gaussian embedding based mechanism, each user or item embedding corresponds to a Gaussian distribution whose mean and variance are learned from observed user-item interactions. In other words, each user or item is represented by a density instead of a fixed point in the latent feature space. The variance of the Gaussian distribution of a user/item measures the uncertainty associated with the user/item’s representation. Recall the example in Figure 1: if the user and all movies are represented by Gaussian embeddings, their positions in the space are the distribution ranges labeled by the dashed ellipses rather than fixed points. As depicted in the figure, the user’s range may overlap the range of a movie from a genre the user likes rather than the range of a movie from an unrelated genre. Thus more precise recommendation results can be achieved for users with uncertainty.

Most existing Gaussian embedding based models are learned with a ranking-based loss [dos2017gaussian, vilnis2014word], which is not applicable to tasks other than learning to rank, such as predicting a value or classification. This is because the metrics used in previous work, such as KL-divergence, take on a more limited range of values, which is not enough for the input of a classifier [vilnis2014word]. Besides, models for rating prediction operate in an absolute rather than a relative manner. Therefore it is not feasible to employ such a ranking scheme for recommendation tasks other than ranking candidate items.

In this paper, we propose a recommendation framework in terms of implicit feedback [NCF, NAIS] rather than only ranking candidate items. As a result, we adopt a learning principle different from previous ranking-based Gaussian embedding models. Specifically, our framework first learns a Gaussian distribution for each user and item. Then, according to the distribution, a group of samples is generated through *Monte-Carlo* sampling [MC] for the objective user and the candidate item, respectively. The generated samples are used to compute the correlation between the user and the item, based on which precise recommendation results are achieved. Furthermore, in order to compute the correlation between the user and the item effectively, our framework incorporates a convolutional neural network (CNN for short) to extract and compress the features from the user-item sample pairs. Our experimental results show that this convolutional operation is more effective than the previous average-based method [joon2019iclr].

In our framework, if the user and the item are regarded as two objects, the correlation computed based on their Gaussian embeddings actually quantifies the matching degree of the two objects. Therefore, our framework can be extended to other machine learning tasks such as link prediction and classification.

In summary, the contributions of our work include:

1. We introduce Gaussian embeddings into recommender systems to represent users and items, which is more effective than traditional embeddings of fixed points.

2. We adopt convolutional operations to learn the correlation between the Gaussian embeddings of an objective user and a candidate item efficiently, which is critical to generating precise recommendation results.

3. The extensive experiments conducted on two benchmark datasets justify that our proposed Gaussian embeddings capture the uncertainty of some users well, and demonstrate our framework’s superiority over the state-of-the-art recommendation models.

The rest of this paper is organized as follows. We present the design details of our framework in Section 2, and show our experimental results in Section 3. In Section 4, we introduce related work and conclude our work in Section 5.

## 2 Methodology

### 2.1 Problem Statement

In the following, we use a bold uppercase letter to represent a matrix or a cube, and a bold lowercase letter to represent a vector, unless otherwise specified.

#### 2.1.1 Implicit Feedback

We design our framework in terms of implicit feedback, which is also the focus of many recommendation models [CKE, NCF, NAIS]. Given a user and an item, we define the user’s observed implicit feedback on the item as 1 if an interaction between them has been observed, and 0 otherwise.

The task of our framework is to predict a given objective user’s implicit feedback on a candidate item, which is essentially a binary classification problem. Accordingly, our framework should estimate the probability that the user’s implicit feedback on the item is 1.

#### 2.1.2 Gaussian Embedding

In our framework, each user or item is represented by a Gaussian distribution consisting of an expectation embedding (a mean vector) and a covariance matrix, both sized by the embedding dimension. Such latent representations should preserve the similarities between users and items in the embedding space, based on which the correlations between users and items are evaluated. As a result, given a user and an item, our framework evaluates the probability of positive feedback based on their two Gaussian distributions.

More specifically, to limit the complexity of the model and reduce the computational overhead, we assume that the embedding dimensions are uncorrelated [bojchevski2018deep]. Thus the covariance matrix is diagonal and can be represented by an array holding one variance per embedding dimension.

#### 2.1.3 Algorithm Objective

Before describing the details of our proposed framework, we first summarize our algorithm’s objective. Formally, the objective is the probability that the matching degree between a user and an item takes a given value. In our scenario of implicit feedback, the matching degree is either 1 or 0, and the quantity of interest is the probability that it equals 1. A high value of this probability therefore indicates that we should recommend the item to the user. If the matching degree is instead a rating score, the same probability can be used to predict the user’s rating of the item, which reflects the degree of the user’s preference for the item. Moreover, the formulation also covers a classification task if the matching degree is regarded as a class label.

According to the aforementioned problem definition, this probability is estimated from the Gaussian distributions of the user and the item. Recall that a Gaussian distribution is a probability distribution of a random variable, so we calculate the probability as

(1)

where the integration variables are the vectors of random variables sampled from the Gaussian distributions of the user and the item, respectively.
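For illustration, the expectation described above can be written as an integral over the two Gaussian-distributed latent vectors. The notation here is an assumption introduced for this sketch: $\mathbf{z}_u, \mathbf{z}_v$ denote the sampled vectors, $\boldsymbol{\mu}, \boldsymbol{\Sigma}$ the mean and covariance of each Gaussian embedding, and $l$ the matching degree.

```latex
p(l \mid u, v) = \iint p(l \mid \mathbf{z}_u, \mathbf{z}_v)\,
  \mathcal{N}(\mathbf{z}_u; \boldsymbol{\mu}_u, \boldsymbol{\Sigma}_u)\,
  \mathcal{N}(\mathbf{z}_v; \boldsymbol{\mu}_v, \boldsymbol{\Sigma}_v)\,
  d\mathbf{z}_u\, d\mathbf{z}_v
```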

To approximate the integration in Eq. 1, we adopt Monte-Carlo sampling [MC]. Specifically, suppose that we draw a fixed number of samples for both the user and the item; then we have

(2) |

The calculation of Eq. 2 is challenging. On the one hand, a large number of samples incurs prohibitive computational cost. On the other hand, a small number of samples incurs bias, resulting in unsatisfactory recommendation results. What is more, the terms inside Eq. 2 are not trivial to compute. In fact, we can rewrite Eq. 2 as

(3) |

This formula implies that the probability is computed from the correlations of the sampled vector pairs. Inspired by the power of CNNs in extracting and compressing features in image processing, we choose a CNN fed with the vector pairs to compute Eq. 3, in which the convolution kernels are used to learn the pairwise correlations of the samples. The computation details are introduced in the next subsection. Our experimental results show that the CNN-based computation of Eq. 3 is more effective than simply averaging over the sample pairs.
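The Monte-Carlo approximation discussed in this subsection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the pairwise scoring function (a sigmoid of the inner product) is a hypothetical stand-in for the learned CNN, and averaging over all user-sample/item-sample pairs is one way to form the estimate. The helper names are invented for this sketch.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_diag_gaussian(mean, var, rng):
    """Draw one vector from a Gaussian with diagonal covariance."""
    return [m + math.sqrt(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, var)]

def mc_match_probability(mu_u, var_u, mu_v, var_v, num_samples, rng):
    """Average a pairwise score over all (user sample, item sample) pairs."""
    users = [sample_diag_gaussian(mu_u, var_u, rng) for _ in range(num_samples)]
    items = [sample_diag_gaussian(mu_v, var_v, rng) for _ in range(num_samples)]
    total = 0.0
    for z_u in users:
        for z_v in items:
            # Hypothetical scoring function: sigmoid of the inner product.
            total += sigmoid(sum(a * b for a, b in zip(z_u, z_v)))
    return total / (num_samples * num_samples)

rng = random.Random(0)
estimate = mc_match_probability([0.5, -0.2], [0.1, 0.3], [0.4, 0.1], [0.2, 0.2], 50, rng)
```

The nested loop makes the cost trade-off visible: the number of scored pairs grows quadratically with the number of samples per side.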

### 2.2 Framework Description

#### 2.2.1 Embedding Layer

The first layer is the *embedding layer*. At first, a user and an item are each represented as a one-hot vector. Besides the user’s ID, the dimensions of value 1 in the user’s vector can also correspond to the user’s feature IDs, and likewise for the item. In our experiments, we only input user/item IDs into our framework. Furthermore, we initialize four embedding matrices, namely a mean matrix and a variance matrix for users and the same pair for items, sized by the user (or user feature) number and the item (or item feature) number, respectively. Then, we obtain the Gaussian mean vector and variance vector of the user and the item through the following lookup operations,

(4) | |||

(5) |

where the constant term is an array filled with the value 1 and ELU is the Exponential Linear Unit. Together they guarantee that every element of the variance vector is non-negative. Thus, we obtain the Gaussian distributions of the user and the item.
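As a minimal sketch of this lookup, assuming the variance is obtained as ELU of a raw table entry plus one: the table contents and the names `user_mean`, `user_raw_var` and `lookup_gaussian` are hypothetical, and selecting a row stands in for multiplying the table by a one-hot vector.

```python
import math

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x if x > 0, else alpha * (exp(x) - 1)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def lookup_gaussian(mean_table, raw_var_table, index):
    """Return (mean, variance) vectors for one ID.

    Indexing a row is equivalent to multiplying the table by a one-hot vector.
    ELU with alpha = 1 never goes below -1, so ELU(raw) + 1 is never negative.
    """
    mean = list(mean_table[index])
    variance = [elu(r) + 1.0 for r in raw_var_table[index]]
    return mean, variance

# Hypothetical tables for 3 users with embedding dimension 2.
user_mean = [[0.1, -0.3], [0.7, 0.2], [-0.5, 0.9]]
user_raw_var = [[-4.0, 0.0], [1.5, -0.1], [0.3, -2.0]]
mu, var = lookup_gaussian(user_mean, user_raw_var, 1)
```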

#### 2.2.2 Interaction Layer

The second layer is the *interaction layer*, in which samples are drawn for the user and the item by Monte-Carlo sampling under their respective Gaussian distributions. In order to perform backpropagation, we use the reparameterization trick in [kingma2013auto] to obtain the embedding of each sample of the user as follows

(6)

where the auxiliary noise variable varies in each sample. The item’s samples are obtained in the same way.
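The reparameterization trick can be illustrated as follows, assuming a diagonal covariance so that each sample is the mean plus the element-wise standard deviation times standard normal noise. The function name and the toy parameters are chosen for this sketch only.

```python
import math
import random

def reparameterized_samples(mean, variance, num_samples, rng):
    """Draw samples as mean + sqrt(variance) * noise with noise ~ N(0, 1).

    The randomness lives only in `noise`, so each sample is a deterministic
    function of `mean` and `variance`, which is what allows gradients to
    flow back to those parameters in an autodiff framework.
    """
    samples = []
    for _ in range(num_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in mean]  # fresh noise per sample
        samples.append([m + math.sqrt(v) * e for m, v, e in zip(mean, variance, noise)])
    return samples

rng = random.Random(42)
z = reparameterized_samples([1.0, -2.0], [0.25, 4.0], num_samples=5000, rng=rng)
empirical_mean = [sum(s[d] for s in z) / len(z) for d in range(2)]
```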

As stated in subsection 2.1.3, is computed based on the correlations of sample pairs. Hence in this layer we construct a big *interaction map* consisting of units. Each unit represents the correlation of a sample pair , which has the following expression,

(7) |

where is the concatenation of two vectors. As a result, is actually a cube of dimension. Then, we should utilize to compute , which is implemented in the next layer.

We note that other interaction operations of two vectors, such as inner product and element-wise product, are also widely used in other recommendation models. But our empirical results show that concatenation outperforms other functions consistently. One possible explanation is concatenation preserves original feature of two vectors and thus neural networks can better learn their proximity.
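A small sketch of the concatenation-based interaction map, using nested Python lists in place of tensors. The sample values are hypothetical; with K samples per side and embedding dimension D, the result has K x K cells of length 2D.

```python
def interaction_map(user_samples, item_samples):
    """Build a K x K grid whose (i, j) cell is the concatenation of
    the i-th user sample and the j-th item sample."""
    return [[z_u + z_v for z_v in item_samples] for z_u in user_samples]

# Hypothetical samples: K = 2 samples each, embedding dimension D = 3.
user_samples = [[1, 2, 3], [4, 5, 6]]
item_samples = [[7, 8, 9], [10, 11, 12]]
cube = interaction_map(user_samples, item_samples)
shape = (len(cube), len(cube[0]), len(cube[0][0]))
```

Because each cell keeps both original vectors intact, later layers are free to learn whatever pairwise function fits the data, which matches the explanation given for why concatenation works well.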

#### 2.2.3 Feature Extraction Layer

The input of this layer is the output of the preceding interaction layer, i.e., the interaction cube. In fact, the cube contains rich features that are beneficial for computing Eq. 3. It is analogous to an image containing pixel features, except for its number of channels. Inspired by the use of CNNs for extracting and compressing object features, which has proven effective in image processing [krizhevsky2012imagenet], we employ a multi-layer CNN followed by an MLP in this feature extraction layer.

Specifically, for each layer of the CNN, we apply filters to extract specific local patterns where is kernel (window) size of the filter, and is channel number. We feed into the first layer of our CNN to generate its output as follows

(8) |

where , is convolution operator and is bias. In general, is set to a large number such as 32 or 64, which helps us learn more than one correlation of each vector pair. Besides, one filter computes the correlation of exactly one vector pair if we set =1. Otherwise, the filter extracts features from adjacent vector pairs. In different layers of the CNN, we can set different s. Our empirical results show that a larger kernel size greatly reduces computing cost but contributes little to overall accuracy. Another reason for adopting convolution is that it can reduce the dimensions with fewer parameters.

For each of the rest layers of the CNN, its input is the output of the preceding layer. Suppose is the output of the CNN’s last layer, all of ’s features are flattened through concatenation to be fed into an MLP to obtain final output of this feature extraction layer, i.e.,

(9) |

where and is the flattened array of feature map corresponding to the -th filter. In the following evaluation of our framework, we adopted a CNN of two layers, i.e., =2. The first layer’s is set to 1, and the second layer’s is set to 2.
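To make the two-layer setting concrete, the following plain-Python sketch applies a 1 x 1 filter and then a 2 x 2 filter to a toy interaction map. Several details are assumptions made for this illustration rather than settings confirmed by the text: the values 1 and 2 are read as kernel sizes, the stride equals the kernel size, and a ReLU follows each layer. The function name `conv_layer` and all weights are hypothetical.

```python
def conv_layer(cube, filters, biases, kernel):
    """Apply square filters with stride equal to the kernel size (an assumption).

    cube:    H x W x C_in nested lists
    filters: list of C_out filters, each kernel x kernel x C_in
    Returns (H // kernel) x (W // kernel) x C_out after a ReLU (also assumed).
    """
    height, width = len(cube), len(cube[0])
    out = []
    for top in range(0, height - kernel + 1, kernel):
        row = []
        for left in range(0, width - kernel + 1, kernel):
            cell = []
            for filt, bias in zip(filters, biases):
                acc = bias
                for di in range(kernel):
                    for dj in range(kernel):
                        pixel = cube[top + di][left + dj]
                        acc += sum(w * x for w, x in zip(filt[di][dj], pixel))
                cell.append(max(0.0, acc))
            row.append(cell)
        out.append(row)
    return out

# Toy 2 x 2 interaction map with 2 channels per cell.
cube = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [2.0, -1.0]]]
# Layer 1: one 1 x 1 filter that sums both channels of each cell.
layer1 = conv_layer(cube, filters=[[[[1.0, 1.0]]]], biases=[0.0], kernel=1)
# Layer 2: one 2 x 2 filter that sums the four cells of layer 1.
layer2 = conv_layer(layer1, filters=[[[[1.0], [1.0]], [[1.0], [1.0]]]], biases=[0.0], kernel=2)
```

With kernel size 1 each output cell depends on exactly one sample pair, while the 2 x 2 filter mixes adjacent pairs and shrinks the map.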

#### 2.2.4 Prediction Layer

The last layer of our framework is the prediction layer, which accepts the output of the preceding feature extraction layer to generate the final prediction score. In this layer, we feed that output into a single-layer perceptron and use the Sigmoid function to compute the score as follows

(10)

where the perceptron’s parameters are a weight matrix and a bias vector. According to the predicted score, we can decide whether the item deserves being recommended to the user.

#### 2.2.5 Model Learning

To learn our model’s parameters including all embeddings mentioned before, we use binary cross-entropy loss since it is suitable for binary classification. Specifically, we have

(11) |

where the loss is taken over the set of observed interactions and the set of negative instances, which are sampled randomly from unobserved interactions. In our experiments, we use the Adam algorithm [Adam] to optimize Eq. 11, because it has proven to be a powerful stochastic-gradient optimization algorithm for training deep learning models.
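The training objective can be sketched as a binary cross-entropy over positive instances and randomly sampled negatives. This is a minimal illustration, not the released implementation: the function names are hypothetical, the loss is summed rather than averaged (an assumption), and the predictions are hard-coded instead of coming from a model.

```python
import math
import random

def binary_cross_entropy(predictions, labels, eps=1e-12):
    """Sum of -[y log p + (1 - y) log(1 - p)] over all instances."""
    loss = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss

def sample_negatives(user, observed, num_items, num_negatives, rng):
    """Randomly pick items the user has no observed interaction with."""
    candidates = [v for v in range(num_items) if (user, v) not in observed]
    return rng.sample(candidates, num_negatives)

observed = {(0, 1), (0, 3)}
negatives = sample_negatives(0, observed, num_items=10, num_negatives=4, rng=random.Random(7))
loss = binary_cross_entropy([0.9, 0.2], [1, 0])
```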

Please note that our framework can be applied to various recommendation tasks, including personalized ranking and rating prediction, through simply modifying the loss function.

## 3 Experiments

In this section, we conduct extensive experiments to answer the following research questions.

*RQ1*: Which hyper-parameters are critical and sensitive to our framework and how do they impact the final performance?

*RQ2*: Does our recommendation framework outperform the previous state-of-the-art recommendation models in terms of predicting implicit feedback?

*RQ3*: Can our proposed Gaussian embeddings well capture the preferences of the users with uncertainty, further resulting in better recommendation performance?

### 3.1 Experimental Settings

#### 3.1.1 Dataset Description

We evaluated our models on two public benchmark datasets: MovieLens 1M (ml-1m, https://grouplens.org/datasets/movielens/) and Amazon music (Music, http://jmcauley.ucsd.edu/data/amazon/). The detailed statistics of the two datasets are summarized in Table 1. In the ml-1m dataset, each user has at least 20 ratings. In the Music dataset, we retained only the users who have at least 1 rating record, given its sparsity.

Table 1: Statistics of the two datasets.

Dataset | # user | # item | # interaction | Sparsity |
---|---|---|---|---|
ml-1m | 6040 | 3706 | 1000209 | 0.9553 |
Music | 1776 | 12929 | 46087 | 0.9980 |
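The sparsity values in Table 1 are consistent with one minus the interaction density, i.e., the fraction of user-item pairs with no observed interaction. A quick check under that assumed definition (the function name is chosen for this sketch):

```python
def sparsity(num_users, num_items, num_interactions):
    """Fraction of user-item pairs with no observed interaction."""
    return 1.0 - num_interactions / (num_users * num_items)

ml1m = round(sparsity(6040, 3706, 1000209), 4)
music = round(sparsity(1776, 12929, 46087), 4)
```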

#### 3.1.2 Evaluation Protocols

Following [NAIS, NCF], we adopted leave-one-out evaluation. We held out the latest interaction of each user as the positive sample in the test set, and paired it with 99 items randomly sampled from unobserved interactions. For each positive sample of every user in the training set, we randomly sampled 4 negative samples. We then predicted and evaluated the 100 user-item interactions of each user in the test set. We used two popular evaluation measures, i.e., Hit Ratio (HR) and Normalized Discounted Cumulative Gain (nDCG) [nDCG], to evaluate the recommendation performance of all compared models. The ranked list is truncated at 3 and 10 for both measures. Compared with Hit Ratio, nDCG is more sensitive to rank position because it assigns higher scores to hits at top positions.
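For a single test user with one held-out positive and 99 sampled negatives, the two measures reduce to simple functions of the positive item’s rank. The sketch below assumes ties are counted against the positive item, which is a convention chosen here; the scores are hypothetical.

```python
import math

def rank_of_positive(positive_score, negative_scores):
    """0-based rank of the held-out item among all candidates (ties count against it)."""
    return sum(1 for s in negative_scores if s >= positive_score)

def hit_ratio_at_k(rank, k):
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k):
    # With a single relevant item the ideal DCG is 1, so nDCG = 1 / log2(rank + 2).
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

# Hypothetical scores: one positive and 99 negatives, two of which score higher.
negatives = [0.95, 0.90] + [0.10] * 97
rank = rank_of_positive(0.80, negatives)
hr3, hr10 = hit_ratio_at_k(rank, 3), hit_ratio_at_k(rank, 10)
ndcg10 = ndcg_at_k(rank, 10)
```

The reported HR@k and nDCG@k are then the averages of these per-user values over all test users.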

#### 3.1.3 Baselines

1. *MF-BPR*: This model optimizes the standard MF with the pairwise Bayesian Personalized Ranking (BPR for short) loss [BPR].

2. *NCF*: This model [NCF] has been proven to be a powerful DNN-based CF framework consisting of a GMF (generalized matrix factorization) layer and an MLP (multi-layer perceptron). Both GMF and MLP are fed with randomly initialized user and item representations. NCF parameters are learned from the observed user-item interactions.

3. *ConvNCF*: This is an improved version [he2018outer] of NCF which uses outer product to explicitly model the pairwise correlations between the dimensions of the fixed point embedding space, and then applies multi-layer CNN to extract signal from the interaction map.

4. *DeepCF*: This is a deep version [DCF] of CF, aiming to fuse representation learning based methods and matching function based methods. It employs MLP to learn the complex matching function and low-rank relations between users and items.

5. *NAIS*: In this framework [NAIS], a user’s representation is the attentive sum of his/her historical favorite items’ embeddings. A historical item’s attention is computed based on the similarity between it and the candidate item. Thus such representations, especially for users with many historical favorite items, are also flexible w.r.t. different candidate items.

6. *GER*: To the best of our knowledge, this baseline [dos2017gaussian] is the only Gaussian embedding based recommendation model. It replaces the dot product of vectors with an inner product between two Gaussian distributions within the BPR framework. As we stated before, such a ranking-based loss is not applicable to other recommendation tasks.

7. *MoG*: This is a variant of the model in [joon2019iclr], which averages predefined soft contrastive loss between vector pairs to obtain matching probability between stochastic embeddings. We set its stochastic mappings to Gaussian embeddings. We compared MoG with our framework to highlight the effectiveness of computing matching probability based on convolutional operations.

In addition, we denote our framework as *GeRec*. In order to achieve a fair comparison, we set the embedding dimension to 64 in all above baselines. The code implementing our framework is published at *https://github.com/JunyangJiang/gaussian-recommender*.

### 3.2 Experimental Results

#### 3.2.1 Hyper-parameter Tuning

At first, we try to answer RQ1 through the empirical studies of hyper-parameter tuning in our framework. Due to space limitation, we only display the results of tuning three critical hyper-parameters of our framework GeRec, i.e., embedding dimension , Monte-Carlo sampling number and our CNN’s kernel number , which were obtained from the evaluation on MovieLens dataset. Compared with previous deep models, only is additionally imported into our framework. Please note that when we tuned one hyper-parameter, we set the rest hyper-parameters to their best values. Table 2 displays our framework’s performance of movie recommendation under different hyper-parameter settings. In general, larger and result in better performance. But we only selected and in our experiments given model training cost. And we set in the following comparison experiments according to the results in Table 2. In addition, is also set to 64.

#### 3.2.2 Global Performance Comparisons

To answer RQ2, we compared our framework with the baselines in terms of recommendation performance. The results listed in Table 3 are the average scores of 5 runs, showing that our framework GeRec performs best on both datasets on nearly every metric (the only exception is nDCG@3 on Music, where ConvNCF is marginally higher). Specifically, GeRec’s advantage over MF-BPR, NCF, ConvNCF and DeepCF shows that Gaussian embeddings represent users and items better than embeddings of fixed points, resulting in more precise recommendation results. GeRec’s advantage over NAIS shows that although attention-based user representations are also flexible embeddings, they do not perform as well as Gaussian embeddings. GeRec’s superiority over GER and MoG confirms that our CNN-based evaluation of the correlations between the Gaussian samples of users and items is more effective than the operations in GER and MoG.

Table 3: Recommendation performance of all compared models on MovieLens 1M (ml-1m) and Amazon Music (Music).

Model | ml-1m HR@3 | ml-1m nDCG@3 | ml-1m HR@10 | ml-1m nDCG@10 | Music HR@3 | Music nDCG@3 | Music HR@10 | Music nDCG@10 |
---|---|---|---|---|---|---|---|---|
MF-BPR | 0.3996 | 0.3085 | 0.6760 | 0.3978 | 0.1536 | 0.1198 | 0.2711 | 0.1448 |
NCF | 0.4739 | 0.3685 | 0.7288 | 0.4652 | 0.1777 | 0.1336 | 0.3358 | 0.1913 |
ConvNCF | 0.4772 | 0.3622 | 0.7290 | 0.4597 | 0.1758 | 0.1431 | 0.3370 | 0.1990 |
DeepCF | 0.4755 | 0.3823 | 0.7326 | 0.4680 | 0.1798 | 0.1396 | 0.3416 | 0.1952 |
NAIS | 0.4497 | 0.3618 | 0.7182 | 0.4418 | 0.1703 | 0.1317 | 0.2852 | 0.1721 |
GER | 0.4016 | 0.3171 | 0.7018 | 0.4264 | 0.1541 | 0.1258 | 0.2953 | 0.1489 |
MoG | 0.4586 | 0.3669 | 0.7245 | 0.4625 | 0.1716 | 0.1309 | 0.3196 | 0.1791 |
GeRec | 0.4841 | 0.3846 | 0.7474 | 0.4807 | 0.1863 | 0.1429 | 0.3464 | 0.2034 |

#### 3.2.3 Effectiveness on Capturing User Uncertainty

To answer RQ3, we evaluated our framework specifically on the users with uncertain preferences. At first, we introduce how to categorize such users. As stated in Sec. 1, we focus on two kinds of users with uncertainty in this paper. The first kind are those with sparse observed user-item interactions, because very little information about their preferences can be obtained from their historical actions. The second kind are those having many distinct preferences, because we cannot identify which genre of preference is their favorite.

Table 4: Average learned Gaussian variances of MovieLens user groups, for the two kinds of uncertain users.

1st kind: metric range | 1st kind: variance | 2nd kind: metric range | 2nd kind: variance |
---|---|---|---|
1.1-1.5 | 0.790 | 0.9-1 | 1.057 |
1.5-1.9 | 0.796 | 0.8-0.9 | 1.003 |
1.9-2.3 | 0.778 | 0.7-0.8 | 0.855 |
2.3-2.7 | 0.746 | 0.6-0.7 | 0.801 |
2.7-3.1 | 0.549 | 0.5-0.6 | 0.754 |
3.1-3.5 | 0.435 | 0.4-0.5 | 0.754 |

Inspired by [zhu2018deep], we identified these two kinds of uncertain users according to two metrics, respectively. Specifically, for the first kind of users, we filtered out six user groups according to a metric . The users of are those who have observed user-item interactions. Thus small indicates the users with more the first kind of uncertainty. For the second kind of users, we also filtered out six user groups according to metric . We compute for a given user as follows. For each pair of movies rated by , suppose and are the genre sets of and , respectively. Then, we set . Finally, we use average of all movie pairs as ’s . As a result, large indicates more preference diversity, i.e., the second kind of uncertainty. For space limitation, we only display the results of MovieLens users in Table 4. In the table, the displayed variances are the average Gaussian variances learned by our framework, showing that our proposed Gaussian embeddings assign larger variances to the users with more uncertainty. Thus, such distribution based embeddings represent the users with uncertainty well, resulting in better recommendation performance.
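The second metric averages a pairwise genre dissimilarity over all movie pairs a user rated. As one illustrative possibility only, the sketch below assumes the pairwise measure is the Jaccard distance between genre sets; this choice is an assumption consistent with "larger means more diverse" and is not necessarily the measure used in the experiments. The function names and the example user are hypothetical.

```python
from itertools import combinations

def jaccard_distance(genres_a, genres_b):
    """1 - |A ∩ B| / |A ∪ B|; an assumed stand-in for the pairwise measure."""
    union = genres_a | genres_b
    if not union:
        return 0.0
    return 1.0 - len(genres_a & genres_b) / len(union)

def preference_diversity(rated_movie_genres):
    """Average pairwise genre dissimilarity over all movies a user rated."""
    pairs = list(combinations(rated_movie_genres, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical user who rated three movies.
genres = [{"Action", "Sci-Fi"}, {"Action"}, {"Romance"}]
diversity = preference_diversity(genres)
```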

## 4 Related Work

MF-based models constitute one important family of recommendation models, such as the latent factor model [MF] and non-negative matrix factorization [NMF]. Based on these traditional MF-based models, some improved versions have been proposed and proven more effective. For example, SVD++ [SVD++] improves SVD by taking into account the latent preferences of users besides explicit user-item interactions. MF-BPR optimizes standard MF with the pairwise Bayesian Personalized Ranking [BPR] loss. Factorization Machine (FM) [FM] captures the interactions between user/item features to improve the performance of model learning. All these models represent users and items by a vector containing the latent factors of users/items, whose values are fixed once they are learned from user-item interactions, and so they are not adaptive to users/items with uncertainty.

In recent years, many researchers have shown that traditional recommendation models, including CF and MF-based models, can be improved by employing DNNs. For example, the authors in [AutoRec] proposed a novel AutoEncoder (AE) framework for CF. DeepMF [DMF] is a deep version of the MF-based recommendation model. In addition, the NCF model [NCF] integrates a generalized matrix factorization model (GMF) and a multi-layer perceptron (MLP) to predict CF-based implicit feedback. DeepCF [DCF] also employs an MLP to learn the complex matching function and low-rank relations between users and items, to enhance the performance of CF. In general, these deep models also represent users/items by embeddings which are fed into the neural networks, and their embeddings also correspond to fixed points in the embedding space without flexibility. Although the models in [DRM, NAIS] introduce attention mechanisms to make user representations more flexible, such attention-based embeddings were shown by our experiments to be inferior to Gaussian embeddings.

Gaussian embeddings are generally trained with ranking objectives and energy functions, such as the probability product kernel and KL-divergence. The authors in [vilnis2014word] first used a max-margin loss to learn word representations in the space of Gaussian distributions to model uncertainty. Similarly, [he2015learning] and [dos2016multilabel] learn Gaussian embeddings for knowledge graphs and heterogeneous graphs, respectively; [dos2017gaussian] uses Gaussian distributions to represent users and items in ranking-based recommendation. To improve graph embedding quality, [bojchevski2018deep] takes into account node attributes and employs a personalized ranking formulation, and [zhu2018deep] incorporates 2-Wasserstein distance and Wasserstein Auto-Encoders. All these methods employ ranking functions and thus cannot easily be applied to other recommendation tasks. Recently, [joon2019iclr] learns stochastic mappings of images with a contrastive loss and also uses Gaussian embeddings.

## 5 Conclusion

In this paper, we propose a unified recommendation framework in which each user or item is represented by a Gaussian embedding instead of a vector corresponding to a single fixed point in feature space. Moreover, convolutional operations are adopted to effectively evaluate the correlations between users and items, based on which precise recommendation results of both personalized ranking and rating prediction can be obtained. Our extensive experiments not only demonstrate our framework’s superiority over the state-of-the-art recommendation models, but also justify that our proposed Gaussian embeddings capture the preferences of the users with uncertainty very well.
