Variational Bandwidth Auto-encoder for Hybrid Recommender Systems

05/17/2021 ∙ by Yaochen Zhu, et al. ∙ IEEE 0

Hybrid recommendations have recently attracted a lot of attention where user features are utilized as auxiliary information to address the sparsity problem caused by insufficient user-item interactions. However, extracted user features generally contain rich multimodal information, and most of them are irrelevant to the recommendation purpose. Therefore, excessive reliance on these features will make the model overfit on noise and difficult to generalize. In this article, we propose a variational bandwidth auto-encoder (VBAE) for recommendations, aiming to address the sparsity and noise problems simultaneously. VBAE first encodes user collaborative and feature information into Gaussian latent variables via deep neural networks to capture non-linear user similarities. Moreover, by considering the fusion of collaborative and feature variables as a virtual communication channel from an information-theoretic perspective, we introduce a user-dependent channel to dynamically control the information allowed to be accessed from the feature embeddings. A quantum-inspired uncertainty measurement of the hidden rating embeddings is proposed accordingly to infer the channel bandwidth by disentangling the uncertainty information in the ratings from the semantic information. Through this mechanism, VBAE incorporates adequate auxiliary information from user features if collaborative information is insufficient, while avoiding excessive reliance on noisy user features to improve its generalization ability to new users. Extensive experiments conducted on three real-world datasets demonstrate the effectiveness of the proposed method. Codes and datasets are released at





1 Introduction

Corresponding author: Zhenzhong Chen, E-mail:

In the era of information overload, people have been inundated by large amounts of online content, and it becomes increasingly difficult for them to discover interesting information. Consequently, recommender systems play a pivotal role in modern web applications due to their ability to help users discover items that they may be interested in from a large collection of candidates. Based on how recommendations are made, existing recommender systems can be categorized into three classes [55]: collaborative-based methods, content-based methods, and hybrid methods. Collaborative-based methods [45, 13] predict user preferences by exploiting their past activities, such as clicks or ratings, where the recommendation quality relies heavily on peers with similar behavior patterns. Content-based methods [41, 50], on the other hand, make recommendations based on users or items that share similar features. Hybrid methods [28, 51, 53] combine the advantages of both worlds where the collaborative information and user/item features are comprehensively considered to generate more precise recommendations.

Recent years have witnessed an upsurge of interest in applying auto-encoders [37] to both collaborative and content-based recommender systems, where compact representations of sparse ratings [27, 25] or high-dimensional user/item features [46, 59] can be learned to more effectively exploit the similarity patterns between users or items for recommendation. As a Bayesian version of the auto-encoder where the encoded latent representations are modeled as random variables, the variational auto-encoder (VAE) has demonstrated superiority compared to other forms of auto-encoders, such as the contractive auto-encoder [56] and the denoising auto-encoder [49]. Among VAE-based methods, the collaborative variational auto-encoder (CVAE) [26] first used a VAE to infer latent item content embeddings from tf-idf textual features, and then iteratively fine-tuned the embeddings with rating information via matrix factorization. Multi-VAE [27], in contrast, used the VAE in the collaborative setting to learn compact user embeddings from discrete user ratings. Recently, Macrid-VAE [29, 30] further extended Multi-VAE, where learned user representations were constrained to disentangle at both the macro and micro levels to improve the robustness and interpretability of learned embeddings.

However, it is difficult to generalize the VAE to a hybrid recommender system, due to various challenges from both the collaborative and content-based components (Fig. 1). As users tend to vary in their activity levels and tastes, user embeddings learned through collaborative filtering bear different degrees of uncertainty, which hinders good recommendations for users with unreliable collaborative embeddings (e.g., users #4 and #5 in Fig. 1). The uncertainty mainly comes from three aspects: (1) Sparsity: For a user with sparser interactions (user #4), her associated embedding is more unreliable due to the insufficient information in her historical interactions, which makes the similarity measure induced by the embedding space less informative compared to users with denser interactions. (2) Diversity: Even if a user has denser interactions, we cannot safely conclude that we can estimate her preferences with more confidence, because her ratings may focus on a few types of items, which makes the collaborative information conveyed by different items correlated. If she rated items of more diverse types, however, we could have more confidence in the estimate of her preferences based on the rating information because the item space is more thoroughly explored. (3) Overlapping: In addition, the uncertainty of a user embedding may also be large if the items the user has interacted with are seldom visited by other users (user #5). Consider two users who click the same number of items: if the items that the former user clicks are also clicked by many other users, while the latter user only clicks items that no other users have yet clicked, then the embedding uncertainty of the latter user would be larger than that of the former.

Figure 1: Challenges associated with collaborative and feature components of a hybrid recommendation system: (left) users whose ratings are sparse or who rate only rarely-rated items have large uncertainty in their collaborative embeddings; (right) user/item features contain large amounts of information that are irrelevant for recommendations.

Although the user/item features could be exploited to reduce the uncertainty incurred by the sparsity of ratings, the main obstacle to utilizing them lies in the heavy noise that may outweigh the useful information. Here, by noise, we mean any pattern that is irrelevant for the recommendation purpose, which should be distinguished from low-level noise such as image blur, audio aliasing, and textual typos. Consider, for example, recommending academic articles to researchers: the related work and empirical study sections are less informative than the abstract or methodology, and they should generally be regarded as noise in recommendations, although both contain valuable information once a researcher is attracted by the abstract and decides to delve deeper into the article. Moreover, since different users may consider varied aspects when rating an item, such noise exhibits a personalized characteristic that makes it difficult to eliminate [15, 54]. A similar analysis can be made for the user features: since the collection of certain user attributes, such as location, may raise privacy issues, a widely adopted surrogate strategy is to empirically combine the features of the items that the users have interacted with to build the user profiles for recommendations. In fact, the consensus in the community is that collaborative filtering is more reliable than feature-based methods for large-scale web recommendations if sufficient user interactions are available [39]. Therefore, a good hybrid recommender system should avoid unnecessary reliance on the noisy user/item features depending upon the sufficiency level of the collaborative information in the ratings (e.g., users #1, #2, and #3 in Fig. 1), such that the noise in these features does not outweigh the useful information and hurt the generalization ability of the model.

To address the above challenges, we propose an information-driven Bayesian generative model called the variational bandwidth auto-encoder (VBAE) for hybrid recommendations. The model first jointly learns the generation of user features and ratings from latent collaborative and user-feature embeddings. These embeddings are modeled as Gaussian random variables and inferred via deep neural networks through the auto-encoding variational Bayes (AEVB) algorithm [21]. Furthermore, observing that the extracted user features could be extremely noisy, we consider the fusion of the collaborative and feature embeddings from an information-theoretic perspective, i.e., as a virtual communication channel. We then introduce a novel user-dependent channel that dynamically controls the amount of information that is allowed to be accessed from the user features based on the collaborative information already contained in the user ratings. A quantum-inspired uncertainty measurement of the hidden rating embeddings is proposed accordingly to infer the channel bandwidth by disentangling the uncertainty information in the ratings from the semantic information. Through this mechanism, sufficient auxiliary information can be accessed from user features if collaborative information is inadequate, while unnecessary dependence of the model on noisy user features can be avoided otherwise, which improves its generalization ability. The main contributions of this article are summarized as follows:

  • We present VBAE, a unified information-driven recommendation framework where the generation of user ratings and features is parameterized via deep Bayesian networks and their fusion is modeled as a personalized virtual communication channel, such that the rating sparsity and feature noise problems can be simultaneously addressed.

  • A novel quantum-inspired uncertainty measurement of the hidden rating embedding is proposed to infer the bandwidth of the user-dependent channel, which enhances the model generalization ability by dynamically controlling the information allowed to be accessed from user features based on the sufficiency level of collaborative information.

  • Two kinds of channel implementations with different desired properties, i.e., Bernoulli and Beta channels, are thoroughly discussed, with the corresponding optimization objectives derived with distribution approximation and variational inference to make them amenable to stochastic gradient descent.

  • The proposed VBAE empirically outperforms state-of-the-art hybrid recommendation baselines. We also find that the inferred bandwidth of the channel variable distinguishes well between users with different sufficiency levels of collaborative information.

2 Related Work

As a special kind of deep neural network, auto-encoders aim to learn compact representations of inputs by reconstruction [2]. Since both user ratings and features are high-dimensional sparse vectors that are difficult to manipulate directly in the original space, researchers have recently devoted much effort to improving representation learning strategies with auto-encoders [37]. Generally, auto-encoder-based recommenders can be categorized into two main classes: user-oriented auto-encoders (UAEs) [23, 49, 27, 36] and item-oriented auto-encoders (IAEs) [37, 57], based on whether the auto-encoder is used to tackle the user or the item side information.

2.1 Item-oriented Auto-encoders

The advent of IAE predates that of UAE, where item content auto-encoders are built on top of matrix factorization (MF)-based collaborative backbones, such as weighted matrix factorization [32], to incorporate auxiliary content information into the factorized item collaborative embeddings. Two exemplar methods from this category are CDL [47] and CVAE [26], where an item offset variable is introduced to tightly couple the Bayesian stacked denoising auto-encoder (SDAE) [42] or variational auto-encoder (VAE) [21] with MF to enhance its performance. The MF and item content auto-encoder are trained in an iterative manner. Recently, auto-encoders have also been exploited to capture the item collaborative information. Among them, DICER [57] was proposed to capture non-linear item similarity based on user ratings, from which the content information is disentangled for better generalization. Since in collaborative IAEs the input dimension and the number of trainable weights are proportional to the number of users, and the number of training samples equals the number of items, these methods generally require a large item-to-user ratio such that a good representation of items can be learned for satisfactory recommendations.

2.2 User-oriented Auto-encoders

Compared with IAEs, UAEs have attracted more attention among researchers because they break the long-standing bottleneck of the linear collaborative modeling limitation of MF and allow modeling users in a deeper manner [23, 49, 58]. Instead of factorizing the rating matrix into user and item embeddings, UAE-based recommenders take the historical ratings of users as the inputs, embed them into hidden user representations with a deep encoder network, and reconstruct the ratings from these representations with a deep decoder network. The reconstructed ratings for unrated items are then ranked for recommendations. Since UAE-based recommenders eliminate the need to model item latent variables and reconstruct all ratings directly from the user latent embedding, another advantage of UAEs over MF is that they can efficiently fold in new users for whom historical ratings have been recorded, since recommendations can be made with a single forward propagation. The first UAE-based recommender system is the collaborative denoising auto-encoder (CDAE) [49], where the input ratings are randomly masked with zeros to simulate the rating missing process. Afterwards, Multi-VAE [27] was proposed, where a VAE with multinomial likelihood on ratings is used instead of the DAE, which demonstrates clear advantages. However, one key problem for these collaborative UAEs is that if the ratings of certain users are sparse, the recommendation performance could be severely degraded due to the lack of collaborative information.

2.3 Hybrid Recommendation Techniques

Due to the wide availability of user features and of various methods that build user profiles from item features, hybridizing UAEs with auxiliary user features to address the sparsity problem has become a new trend in the recommendation community. A simple but effective method to incorporate user feature information into a UAE is to adopt an early fusion strategy similar to [22], where the user features are concatenated with ratings as the input of the UAE and only the ratings are reconstructed for recommendations. In this way, the first dense layer of the UAE can be viewed as calculating a weighted combination of user ratings and features, which may be over-parameterized and is susceptible to overfitting. A more sophisticated approach is the conditional VAE (CondVAE) [33], where the user features are exploited to calculate the conditional prior of the user latent variables, which is then updated into the posterior by the collaborative information, i.e., ratings, for recommendations. However, all these methods treat the relative importance of collaborative and content information as fixed for all users, ignoring individual differences both in the reliability of the extracted features and in the sufficiency level of historical rating information [54]. This is problematic, since user features generally contain much irrelevant information and noise, and a good recommender system should avoid unnecessary dependence on these features when collaborative information is sufficient, in order to improve generalization. This motivates us to design the information-driven variational bandwidth auto-encoder with a user-dependent channel to fuse the user feature and collaborative information.


3 Methodology


Figure 2: (Left): the probabilistic graphical model (PGM) of the proposed VBAE. (Right): the zoomed-in view of the collaborative and user feature networks. User $i$ has sparser ratings than user $i'$, which leads to a larger uncertainty of her collaborative embedding, and VBAE infers a larger bandwidth to allow more information from the user feature embedding $\mathbf{z}_t$ to flow into the user latent embedding $\mathbf{u}$. For user $i'$, on the other hand, a smaller bandwidth is inferred to avoid overfitting on noise in user features.

3.1 Problem Formulation

The focus of this article is on recommendations with implicit feedback [16]. We define the rating matrix as $\mathbf{R} \in \mathbb{R}^{I \times J}$, where each row $\mathbf{r}_i$ is the bag-of-words vector denoting whether user $i$ has visited each of the $J$ items. $\mathbf{R}$ is obtained by keeping track of user activities for a certain amount of time. In addition, the user profiles are represented by the matrix $\mathbf{X} \in \mathbb{R}^{I \times F}$, where the row vector $\mathbf{x}_i$ is the extracted feature for the $i$th user. $\mathbf{x}_i$ can contain inherent user attributes such as her age, location, and self-description, or be built from the features of items that the user has interacted with when such information is not available. The capital non-boldface letters $R_i$ and $X_i$ are used to denote the corresponding random variables, respectively (the subscript $i$ will be omitted for simplicity if no ambiguity exists). Note that the density of $\mathbf{r}_i$, which is defined as $\lVert \mathbf{r}_i \rVert_0 / J$, could vary dramatically for different $i$, and $\mathbf{x}_i$ is generally high-dimensional and noisy. Given the partial observation of the records in $\mathbf{R}$ and the user features $\mathbf{X}$, the problem is to make predictions of the remaining ratings in $\mathbf{R}$ so as to recommend new relevant items to users.

3.2 Model Overview


The PGM in the left part of Fig. 2 shows the overall generative and inference process of VBAE. The details of VBAE are discussed in the following sections.

3.3 Generative Process

3.3.1 User Embeddings

In this article, the collaborative and feature embeddings of users are assumed to lie in $K$-dimensional latent spaces. Therefore, VBAE starts by generating for each user the collaborative-based latent variable $\mathbf{z}_b$ from a normal prior:

$$\mathbf{z}_b \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_K).$$

Since user features are extracted as the auxiliary information to complement the collaborative information, they are embedded in a user feature latent variable $\mathbf{z}_t$, which is drawn from another Gaussian distribution as follows:

$$\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_K).$$

3.3.2 Channel with User-dependent Bandwidth

Traditional generative models for hybrid recommendation [47, 26, 8] directly add the collaborative embedding $\mathbf{z}_b$ and the feature embedding $\mathbf{z}_t$ via an offset variable to form the latent user embedding $\mathbf{u}$, or learn their fixed relative weights by direct concatenation. They consider neither that the uncertainty of $\mathbf{z}_b$ usually varies across users due to both explicit factors (differences in the sparsity and diversity of ratings) and implicit factors (differences in the number of overlapped ratings), nor that $\mathbf{z}_t$ contains much irrelevant information that may distract the recommendation model. This could be problematic: for users who have denser, more diverse ratings and who rate items that are also rated by many other users, the associated collaborative embeddings $\mathbf{z}_b$ are more reliable, and $\mathbf{u}$ can afford to access less information from $\mathbf{z}_t$ such that unnecessary dependence of the model on noisy user features can be avoided. For users with sparser ratings, however, $\mathbf{u}$ has to access more information from $\mathbf{z}_t$ to enable a more personalized recommendation even if the user features are noisy.

From an information-theoretic perspective, if we view the fusion of the collaborative embedding $\mathbf{z}_b$ and the feature embedding $\mathbf{z}_t$ into the user embedding $\mathbf{u}$ as a virtual communication channel, the issue stems from the assumption that the channel is deterministic and independent of the information already contained in the collaborative embedding $\mathbf{z}_b$, so that individual differences in the sufficiency level of collaborative information are ignored. To address this problem, we design a user-dependent channel in VBAE by introducing a latent capacity variable $b \in [0, 1]$ that determines the bandwidth of the channel from $\mathbf{z}_t$ to $\mathbf{u}$. The bandwidth variable encodes, for each user, our belief about how much extra information is required from the user features given the information already contained in the observed interactions. Through this mechanism, the channel dynamically allocates the amount of information that is allowed to flow from $\mathbf{z}_t$ to $\mathbf{u}$ conditional on the collaborative information already observed. Two strategies are explored to implement the user-dependent channel with bandwidth $b$. The first strategy we consider is called the "hard" channel, where the bandwidth $b$ is achieved when $\mathbf{u}$ losslessly accesses $\mathbf{z}_t$ with probability $b$, and accesses no information otherwise [10]. In the generative case, we can introduce an auxiliary channel variable $d$, and draw $d$ from a Bernoulli distribution with probability $b$:

$$d \sim \mathrm{Bernoulli}(b).$$

Although the hard channel complies more strictly with the definition of bandwidth in information theory, it may result in training instability because the bandwidth only appears as a statistical property, i.e., an expectation when the feature and collaborative embeddings of a user are repeatedly fused multiple times. For one user in one recommendation, however, the user latent embedding either accesses the user feature information or does not, which is too coarse in granularity to distinguish users with different uncertainty levels of collaborative information. Therefore, we consider a second strategy, the "soft" channel, which is a relaxed version of the hard channel and bears more resemblance to the variational attention approach [7] than to the variational information bandwidth theory [10]. This strategy assumes that the channel variable $d$ is drawn from a Beta distribution and that the bandwidth $b$ determines the mean of the Beta:

$$d \sim \mathrm{Beta}(\alpha, \beta), \qquad \mathbb{E}[d] = \frac{\alpha}{\alpha + \beta} = b,$$

where the channel curtails or amplifies the weights of user feature information based on the bandwidth $b$. Given only $b$, however, the Beta distribution for the channel variable $d$ is undetermined, as its variance remains to be specified to calculate both $\alpha$ and $\beta$. Since we care primarily about the bandwidth itself rather than its uncertainty, we fix the variance of the Beta, which is then treated as a nuisance parameter, to a small value for simplicity. Hereafter, we use the mean-variance parameterization of the Beta distribution unless otherwise specified, since it explicitly contains the bandwidth as its first parameter. We use VBAE-hard and VBAE-soft to distinguish the two channel implementation strategies. The detailed comparisons between the soft and hard channels are summarized in Fig. 3. After drawing $d$, the user latent variable $\mathbf{u}$ is deterministically calculated as

$$\mathbf{u} = \mathbf{z}_b + d \cdot \mathbf{z}_t,$$

which defines the fusion process of the feature embedding $\mathbf{z}_t$ into $\mathbf{u}$ via the user-dependent channel of VBAE-hard and VBAE-soft.

Figure 3: Comparison between the soft and hard channels.
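The two channel strategies can be illustrated with a minimal NumPy sketch. It assumes the additive fusion of a collaborative embedding `z_b` and a feature embedding `z_t` weighted by a scalar channel variable `d`; the function names `beta_params` and `fuse`, the default variance `1e-3`, and the variable names are hypothetical choices for illustration rather than the authors' released code. The helper converts the mean-variance parameterization of the Beta distribution into its usual shape parameters.

```python
import numpy as np

def beta_params(mean, var):
    """Convert a (mean, variance) pair into Beta shape parameters (alpha, beta).

    Valid only when 0 < mean < 1 and 0 < var < mean * (1 - mean).
    """
    nu = mean * (1.0 - mean) / var - 1.0  # nu = alpha + beta
    return mean * nu, (1.0 - mean) * nu

def fuse(z_b, z_t, b, mode="soft", var=1e-3, rng=None):
    """Fuse embeddings through a channel with bandwidth b (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    if mode == "hard":
        # Bernoulli channel: feature information is accessed fully or not at all.
        d = rng.binomial(1, b)
    else:
        # Beta channel: a fractional weight concentrated around the bandwidth b.
        alpha, beta = beta_params(b, var)
        d = rng.beta(alpha, beta)
    return z_b + d * z_t, d
```

With a small fixed variance, soft-channel samples stay close to `b` on every draw, whereas hard-channel samples equal `b` only on average, which mirrors the granularity argument made for the two channels.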

3.3.3 Neural Network Implementations

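To model the non-linear generation process of user features and ratings from the corresponding latent variables, we parameterize the generative distributions as deep neural networks. The user feature $\mathbf{x}$ is generated from the user-feature latent variable $\mathbf{z}_t$ via a multilayer perceptron (MLP) $f_t$: if $\mathbf{x}$ is binary, we squash the output of $f_t$ with the sigmoid function and draw $\mathbf{x}$ from $\mathrm{Bernoulli}(\mathrm{sigmoid}(f_t(\mathbf{z}_t)))$; if $\mathbf{x}$ is real-valued, we take the raw outputs of $f_t$ as the mean of a Gaussian distribution and draw $\mathbf{x}$ from $\mathcal{N}(f_t(\mathbf{z}_t), \lambda_x^{-1}\mathbf{I})$. Finally, we put a multinomial likelihood on the ratings as in [27], and generate $\mathbf{r}$ from the latent user variable $\mathbf{u}$ via $\mathrm{Multi}(N, \boldsymbol{\pi})$ parameterized as $\boldsymbol{\pi} = \mathrm{softmax}(f_b(\mathbf{u}))$, where $N$ is the number of interactions of the user and $f_b$ is another MLP-based generative neural network. The generation process of $\mathbf{x}$ and $\mathbf{r}$ from $\mathbf{z}_t$ and $\mathbf{u}$ is given as follows: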

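(1)  For each layer $l$ of the collaborative and the user-feature modules of the generation network:

(a)  For each column $n$ of the weight matrices, draw $\mathbf{W}_{l,:n} \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I})$;

(b)  Draw the bias vector from $\mathbf{c}_l \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I})$;

(c)  For the hidden activation $\mathbf{h}_l$ of a user $i$, draw $\mathbf{h}_l \sim \delta\big(g(\mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{c}_l)\big)$.

(2) For user features that are binary, draw $\mathbf{x} \sim \mathrm{Bernoulli}(\mathrm{sigmoid}(\mathbf{h}_L))$. For user-item interactions, draw $\mathbf{r} \sim \mathrm{Multi}(N, \mathrm{softmax}(\mathbf{h}_L))$,

where $\mathbf{h}_0 = \mathbf{u}$ for the collaborative module, $\mathbf{h}_0 = \mathbf{z}_t$ for the user-feature module, $\lambda_w$ is a hyperparameter, $g(\cdot)$ is the intermediate activation function, and $\delta(\cdot)$ is the Dirac delta function. Step 1.c can be alternatively viewed as putting Gaussian priors on the intermediate activations and setting the precision to infinity.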
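For the multinomial likelihood on ratings, the log-likelihood of one user reduces, up to the constant multinomial coefficient, to the rating vector dotted with the log-softmax of the decoder output. The following is a minimal NumPy sketch of that term; the function names and the use of raw `logits` in place of a trained decoder are assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def multinomial_log_likelihood(r, logits):
    """Return sum_j r_j * log softmax(logits)_j for each user.

    The multinomial coefficient is omitted because it does not depend
    on the model parameters.
    """
    return (r * log_softmax(logits)).sum(axis=-1)
```

For example, a user with two interactions among three items and uniform decoder logits has log-likelihood $2 \log(1/3)$ under this term.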

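The generative model of VBAE is described by the joint distribution of all observed and hidden variables:

$$p_\theta(R, X, \mathbf{z}_b, \mathbf{z}_t, d) = p(\mathbf{z}_b)\, p(\mathbf{z}_t)\, p(d)\, p_\theta(X \mid \mathbf{z}_t)\, p_\theta(R \mid \mathbf{u}), \qquad \mathbf{u} = \mathbf{z}_b + d \cdot \mathbf{z}_t, \qquad (1)$$

where the symbol $\theta$ denotes the set of trainable parameters that pertain to the generation network.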

3.4 Inferential Process

Given Eq. (1), however, it is intractable to calculate the posterior exactly, as the non-linearity of the generative process precludes us from integrating over the latent space and calculating the marginal distribution of the observed evidence $p_\theta(R, X)$. Therefore, we resort to amortized variational inference [3], where we introduce a variational posterior $q_\phi(\mathbf{z}_b, \mathbf{z}_t, d \mid R, X)$ parameterized by an inference neural network as an approximation to the true but intractable posterior. Using the conditional independence assumptions implied by VBAE, the joint variational posterior can be decomposed into the compact product of two factors that make up two modules of the inference network: the collaborative module $q_\phi(\mathbf{z}_b, d \mid R)$ and the user feature module $q_\phi(\mathbf{z}_t \mid X)$, which infer the user collaborative embeddings, the bandwidth of the user-dependent channel, and user feature embeddings from the user ratings and features, respectively.

3.4.1 Quantum-inspired Semantic and Uncertainty Measurement of Collaborative Information

The collaborative module infers the collaborative-based latent user variable $\mathbf{z}_b$ and the personalized latent channel variable $d$ from the observed user-item interactions. Since the bandwidth of the channel denotes the sufficiency level of the collaborative information, an important role that this module should play is to disentangle uncertainty and semantic information from the user ratings. To achieve this objective, we first use an MLP to embed the raw, sparse rating vector $\mathbf{r}$ into a compact hidden representation $\mathbf{h}_b$ for each user. Inspired by [24], we then use the length (L2-norm) of the hidden representation as the uncertainty measurement to calculate the channel bandwidth and the direction (L2-normalized hidden embedding) as the representation of semantic information to infer the latent collaborative embedding. The details of the introduced semantic and uncertainty measurement of the hidden rating embeddings are illustrated in Fig. 4, which draws inspiration from theoretical quantum mechanics. To see the link, we first liken the hidden rating embedding to a quantum superposition state of a physical system, which is represented by a complex vector in $\mathbb{C}^n$. The superposition vector has the property that its norm is positively correlated with the probability that this superposition is observed when measuring the system (which depicts the fundamental uncertainty of quantum physics), while its direction distinguishes this superposition state from other states (which carries the semantic information of the state).

Figure 4: Illustration of the proposed quantum-inspired semantic and uncertainty measurement of collaborative information from user interactions.

Such an uncertainty-semantic interpretation of the length and direction of a vector also works for the hidden rating embedding $\mathbf{h}_b$ in recommendation, where the norm and direction of $\mathbf{h}_b$ can be associated with similar meanings. To gain the intuition, we first decompose the calculation of the hidden rating embedding from the raw rating vector $\mathbf{r}$ with a dense layer into two basic operations, embedding and element-wise sum, as follows:

$$\mathbf{h}_b = \mathbf{W}^\top \mathbf{r} = \sum_{j:\, r_j = 1} \mathbf{w}_j,$$

where $\mathbf{w}_j$, the $j$th row of the weight matrix $\mathbf{W}$, acts as the embedding of item $j$, and $H$ denotes the dimension of $\mathbf{h}_b$. If we assume that each element in $\mathbf{W}$, i.e., $w_{jk}$, is an independent and identically distributed (i.i.d.) Gaussian variable with zero mean and a small variance $\sigma^2$, the $k$th element in $\mathbf{h}_b$, which is denoted as $h_k$, is the sum of $n$ independent Gaussian variables. Therefore, $h_k$ is also a Gaussian variable that follows $\mathcal{N}(0, n\sigma^2)$, where $n$ is the number of items this user has interacted with, i.e., the density of ratings. According to basic probability theory, the squared L2-norm of $\mathbf{h}_b$, which we denote as $\lVert \mathbf{h}_b \rVert_2^2$, is the sum of squares of $H$ Gaussian variables with zero mean and $n\sigma^2$ variance, which follows the scaled Chi-square distribution $n\sigma^2 \chi^2_H$ and is equivalent to the Gamma distribution with shape $H/2$ and scale $2n\sigma^2$, whose cumulative distribution function is $F(x) = \gamma\big(H/2,\, x / (2n\sigma^2)\big) / \Gamma(H/2)$ [19]. The expected value of $\lVert \mathbf{h}_b \rVert_2^2$, according to the property of the Gamma distribution, is $H n \sigma^2$, which is a monotonically increasing function of the number of interacted items $n$, i.e., the rating density, a main indicator for the sufficiency level of collaborative information.
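The claim that the expected squared norm grows linearly with the number of interacted items can be checked with a small Monte Carlo simulation. The sketch below assumes i.i.d. zero-mean Gaussian item embeddings with standard deviation `sigma` and a bias-free dense layer, as in the argument above; the function name, the hidden dimension of 64, and the other default values are arbitrary illustrative choices.

```python
import numpy as np

def mean_squared_norm(n_items, hidden_dim=64, sigma=0.1, trials=500, seed=0):
    """Monte Carlo estimate of E[||h||^2] for a user with n_items interactions."""
    rng = np.random.default_rng(seed)
    # Each trial draws fresh embeddings for the n_items interacted items.
    w = rng.normal(0.0, sigma, size=(trials, n_items, hidden_dim))
    h = w.sum(axis=1)  # element-wise sum of the item embeddings
    return (h ** 2).sum(axis=1).mean()

if __name__ == "__main__":
    for n in (5, 20, 80):
        # Theory predicts hidden_dim * n * sigma**2 for the expectation.
        print(n, mean_squared_norm(n), 64 * n * 0.1 ** 2)
```

The estimates should track the theoretical value `hidden_dim * n * sigma**2` up to sampling noise, so users with more interactions obtain hidden embeddings with larger norms on average.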

3.4.2 Bandwidth and User Collaborative Embedding

This property reveals that the norm $\lVert \mathbf{h}_b \rVert_2$ of the hidden rating embedding $\mathbf{h}_b$ is positively correlated with the sufficiency level of the collaborative information. Moreover, compared to the L2-norm of the sparse rating vector $\mathbf{r}$, the hidden embedding $\mathbf{h}_b$ lies in a low-dimensional latent space where user collaborative representations are more compact, which could better reflect the similarity relationship of user rating patterns by eliminating redundant information contained in similar items. Therefore, the L2-norm of $\mathbf{h}_b$ contains important information regarding interaction sparsity and interaction similarity that is suitable for the inference of the bandwidth defined in the previous section. In addition, the L2-normalization of the embeddings eliminates the negative influence of the difference in the number of items the users have interacted with by scaling the hidden rating embeddings of users with different activity levels onto the same sphere, which makes the direction of $\mathbf{h}_b$ a more suitable representation than the original $\mathbf{h}_b$ to infer the user collaborative embedding. This is in contrast with previous auto-encoder-based approaches such as Multi-VAE [27] and CondVAE [33], where the information regarding the sparsity level of the ratings is discarded after the L2-normalization of the input ratings.

In practice, we calculate the bandwidth $b$ by a linear transformation of the norm $\lVert \mathbf{h}_b \rVert_2$ of the hidden rating embedding, followed by the sigmoid activation function to squash the value into $[0, 1]$:

$$b = \mathrm{sigmoid}\big(w \cdot \lVert \mathbf{h}_b \rVert_2 + c\big). \qquad (3)$$

Since $\lVert \mathbf{h}_b \rVert_2$ is strictly non-negative, in backward propagation the weight $w$ can only be updated in one direction for all the samples in one mini-batch. This is problematic, since the training loop would make $b$ converge to zero or one, where the bandwidths for all users are identical, which fails to discriminate users with different sufficiency levels of collaborative information. This is referred to as the modality collapsing problem in variational inference. In this article, this problem is addressed through batch normalization [17]. We denote a mini-batch of the lengths of user hidden embeddings as $\{\ell_i\}_{i=1}^{B}$ with $\ell_i = \lVert \mathbf{h}_{b,i} \rVert_2$. Before feeding each sample in the mini-batch into Eq. (3) to calculate the bandwidth, we renormalize it by:

$$\hat{\ell}_i = \frac{\ell_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$$

where $\mu_B$ and $\sigma_B^2$ are the mean and sample variance of $\ell_i$ in the mini-batch, and $\epsilon$ is a small value to avoid division by zero. With batch normalization, we observe that the inferred bandwidth corresponds more consistently with the sufficiency level of the user collaborative information, and the training procedure becomes more stable as well. In the testing phase, $\mu_B$ and $\sigma_B^2$ are fixed to their running averages estimated from the training data.
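The bandwidth computation can be sketched in a few lines of NumPy: split each hidden rating embedding into its norm and direction, batch-normalize the norms, and squash a linear transformation of the result with the sigmoid. The function name `infer_bandwidth`, the scalar weight `w` and bias `c`, and the default `w = -1.0` (so that a larger norm yields a smaller bandwidth) are assumptions for illustration only; in the model these parameters are learned, and running statistics replace the batch statistics at test time.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer_bandwidth(h, w=-1.0, c=0.0, eps=1e-5):
    """Map hidden rating embeddings h (batch x dim) to bandwidths and directions."""
    norms = np.linalg.norm(h, axis=1)  # uncertainty signal
    # Batch normalization of the norms (biased batch variance, no running averages).
    normed = (norms - norms.mean()) / np.sqrt(norms.var() + eps)
    direction = h / (norms[:, None] + eps)  # semantic signal, approximately unit length
    return sigmoid(w * normed + c), direction
```

Because the normalized norms are centered at zero, the gradient with respect to `w` no longer has the same sign for every sample in the mini-batch, which is the property the batch normalization step relies on.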

3.4.3 Neural Network Implementations

The user feature module infers the feature-based latent user embedding $\mathbf{z}_t$ from the extracted user features by another MLP, which serves as the auxiliary information source to the collaborative information. The detailed inference process of the latent variables $\mathbf{z}_b$, $d$, and $\mathbf{z}_t$ through the inference network is described as follows:

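(1)  For each layer $l$ of the collaborative and user-feature modules of the inference network:

(a)  For each column $n$ of the weight matrices, draw $\mathbf{W}_{l,:n} \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I})$;

(b)  Draw the bias vector from $\mathbf{c}_l \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I})$;

(c)  For the hidden activation $\mathbf{h}_l$ of a user $i$, draw $\mathbf{h}_l \sim \delta\big(g(\mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{c}_l)\big)$, with $\mathbf{h}_0 = \mathbf{r}$ for the collaborative module and $\mathbf{h}_0 = \mathbf{x}$ for the user-feature module.

(2)  For the user-dependent channel variable:

(a) Calculate the bandwidth $b$ from its logits, which are inferred from the L2-norm of the hidden rating embedding $\mathbf{h}_b$ as follows: $b = \mathrm{sigmoid}\big(w \cdot \lVert \mathbf{h}_b \rVert_2 + c\big)$.

(b) For VBAE-hard, draw the Bernoulli channel variable: $d \sim \mathrm{Bernoulli}(b)$.

(c) For VBAE-soft, draw the Beta channel variable: $d \sim \mathrm{Beta}(b, v)$, in the mean-variance parameterization with a fixed small variance $v$.

(3)  For the collaborative and user feature latent variables:

(a) Draw the mean and standard deviation: $\boldsymbol{\mu} \sim \delta\big(f_{\mu}(\mathbf{h}_L)\big)$, $\boldsymbol{\sigma} \sim \delta\big(f_{\sigma}(\mathbf{h}_L)\big)$.

(b) Draw the sample of the latent variable: $\mathbf{z} \sim \mathcal{N}\big(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2)\big)$.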

It is not trivial to draw samples from the Bernoulli or the Beta channel variable such that gradients can be back-propagated to the trainable weights of the inference network, and we defer the discussion of the solution, which is called the reparameterization trick, to Section 3.7. In VBAE, the collaborative embedding is inferred from the observed ratings via a non-linear neural network, where the explicit uncertainty caused by sparsity can be directly captured from the input. In addition, since the collaborative embedding variables are inferred in an amortized manner, the implicit uncertainty can be captured by the weights, which are actively learned and shared among all users. The user latent variable simultaneously captures the user collaborative similarity based on user interactions and the user feature similarity from extracted user features. The neural network implementation of VBAE is schematically illustrated in Fig. 2.

3.5 Training Objective

To jointly learn the parameters of the generative network and the inference network, we maximize the evidence lower bound (ELBO), which is a lower bound on the marginal log-likelihood of the evidence:


The value of the ELBO, for fixed parameters of the generative network, achieves its maximum if and only if the discrepancy between the variational approximation and the true posterior, measured by the Kullback-Leibler (KL) divergence, is zero.
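For reference, the standard decomposition behind this statement can be written in generic notation. The symbols below (x for the observations, z for the latent variables, θ and φ for the generative and inference parameters) are placeholders chosen for this sketch and are not necessarily the notation of Eq. (5):

```latex
\log p_\theta(x)
  = \mathcal{L}(\theta, \phi; x)
  + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big),
\qquad
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big).
```

Since the KL divergence is non-negative, the ELBO never exceeds the marginal log-likelihood, and the gap closes exactly when the variational approximation matches the true posterior.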

3.6 Maximum A Posteriori Estimation

Although the collaborative and user feature modules of VBAE can be jointly trained as in Eq. (5), this incurs extra computational and memory cost. In addition, under joint training, the VBAE model may converge to an undesirable sub-optimum where it relies solely on one main information source for the recommendation, even though the user ratings and features contain complementary information, both of which are important for recommendations. Therefore, we take an EM-like optimization approach, where we iteratively consider only one of the two variational distributions (collaborative or user feature) and fix the random variables that concern the other (e.g., to their means or to their previous estimates). Consequently, for the collaborative part, we fix the user feature embedding to its estimated mean to calculate the user latent variable, and the objective becomes:


Eq. (6) can alternatively be viewed as paying a cost equal to the KL divergence from the prior whenever the model tries to access the noisy user feature information. Moreover, the cost is dynamically decided by the urgency of introducing the extra user feature information, based on the sufficiency level of the collaborative information, which prevents the model from depending unnecessarily on the noisy features. After a one-step optimization of the collaborative part, we then fix the collaborative embedding and the channel variable to their estimated values to calculate the user latent variable, and maximize the following objective for the user feature part:


Intuitively, for both steps, the objective consists of two parts. The first part is the expected log-likelihood term, where the inferred hidden Gaussian embeddings and the latent channel variable are encouraged to best explain the extracted user features and the observed historical interactions; the second part comprises the KL-with-prior terms and the L2 weight decay terms, which act as regularizers to prevent over-fitting and avoid polarization of embeddings due to the sparsity of interactions [34]. Liang et al. [27] have shown that the KL regularization in the collaborative part could be too strong, which over-constrains the representational ability of the latent collaborative embeddings. As a solution, they introduced a scalar to control the weight of the KL term for the latent collaborative variable in Eq. (6), which has its theoretical foundation in both beta-VAE [14] and variational information bottleneck theory [1]. We anneal this weight from 0 to 0.2 as in [27] in our implementation of VBAE, and have confirmed the effectiveness of KL annealing in our experiments. Under such settings, the model learns to encode as much information from the ratings into the collaborative embedding as it can in the initial training stages, while gradually regularizing the embedding by forcing it close to the prior as the training proceeds [4].
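A KL annealing schedule of this kind can be sketched as follows. The linear shape of the schedule and the number of annealing steps are assumptions of this sketch (the text only states that the weight goes from 0 to 0.2), and the closed-form KL term assumes a diagonal Gaussian posterior with a standard normal prior, which may differ from the prior used in VBAE. The names `kl_weight` and `gaussian_kl` are hypothetical.

```python
import numpy as np

def kl_weight(step, anneal_steps, cap=0.2):
    """Linearly increase the KL weight from 0 to `cap` over `anneal_steps` updates."""
    if anneal_steps <= 0:
        return cap
    return cap * min(1.0, step / anneal_steps)

def gaussian_kl(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var, axis=-1)

def regularized_loss(neg_log_likelihood, mean, log_var, step, anneal_steps):
    # Negative ELBO with an annealed weight on the KL term (averaged over the batch).
    return neg_log_likelihood + kl_weight(step, anneal_steps) * gaussian_kl(mean, log_var).mean()
```

Early in training the weight is close to zero, so the reconstruction term dominates; the KL penalty is phased in as the step counter grows.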

3.7 Monte Carlo Gradient Estimator

In this section, we derive the gradients of the two objectives w.r.t. the trainable parameters of the generation and inference networks to make them amenable to SGD optimization. For Gaussian and Bernoulli distributions, since their KL divergence from the prior can be computed analytically, the gradients of the KL terms in Eqs. (6) and (7) w.r.t. the weights of the inference network can also be calculated analytically. However, the gradients of the expected log-likelihood terms need to be back-propagated through the stochastic nodes (the latent embeddings and the channel variable), which precludes an analytic solution. Hence, we introduce Monte Carlo methods to form unbiased estimators of the gradients. For the parameters of the generative network, as the generative distribution is explicit in expectation form, the gradient can be estimated by generating samples from the encoder distribution, calculating the gradients, and taking the average [35].

3.7.1 User Embeddings: Vanilla Reparameterization

As for the parameters associated with the inference of the user feature and collaborative embeddings, we use the reparameterization trick, where we transform the stochastic nodes into differentiable bivariate functions of their parameters and random noises to allow gradients to pass through the distribution parameters [9]. Specifically, for the Gaussian embedding variables, with the vanilla reparameterization trick [21], their samples can be reformulated via:


where the noise is drawn from a standard Gaussian distribution. Eq. (8) could alternatively be viewed as injecting Gaussian noise into the hidden user collaborative and user feature variables, which is the main mechanism that previous auto-encoder-based recommender systems adopt to address the rating and feature noise problems.
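The vanilla reparameterization of a Gaussian variable can be sketched as below. The function name `sample_gaussian` is hypothetical; the sketch only shows the forward sampling step, whereas in an automatic differentiation framework the same expression lets gradients flow to the mean and standard deviation.

```python
import numpy as np

def sample_gaussian(mean, std, rng):
    """Draw mean + std * eps with eps ~ N(0, I); the randomness lives only in eps."""
    eps = rng.standard_normal(np.shape(mean))
    return mean + std * eps

rng = np.random.default_rng(0)
mean = np.array([0.5, -1.0])
std = np.array([0.1, 2.0])
samples = np.stack([sample_gaussian(mean, std, rng) for _ in range(20000)])
```

The sample is a deterministic, differentiable function of the distribution parameters once the noise is drawn, which is what makes the estimator compatible with back-propagation.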

3.7.2 Hard Channel: Gumbel-Softmax Reparameterization

Injecting Gaussian noise into the user latent collaborative and feature embeddings can only simulate the generation process of low-level noise. However, as we have argued in the Introduction, pervasive high-level noise, i.e., information that is irrelevant to the recommendation purpose, exists in the extracted user features. Therefore, we introduce the user-dependent channel to avoid excessive model reliance on these features for better generalization. For the channel in VBAE-hard, which follows the Bernoulli distribution, we note that sampling from it is equivalent to sampling a one-hot vector from a two-class Categorical distribution, whose probability masses are the bandwidth and its complement, and discarding the second dimension. Therefore, we resort to the Gumbel-softmax trick [18] and reformulate samples of the channel via the Concrete distribution [31]:


where the noise terms are i.i.d. samples from the Gumbel distribution, and the temperature is shared by the softmax and the sigmoid. When the temperature approaches zero, the samples are proved to be equivalent to samples drawn from the corresponding Bernoulli distribution with the bandwidth as its probability. In practice, the temperature is generally annealed as the training proceeds for a more stable convergence.
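A relaxed Bernoulli sample of this kind can be sketched as follows, using the fact that the first component of a two-class Gumbel-softmax reduces to a sigmoid of the logit plus the difference of two Gumbel noises. The function names and the clipping constant are assumptions of this sketch, and the exact parameterization of Eq. (9) may differ.

```python
import numpy as np

def stable_sigmoid(x):
    # Equivalent to 1 / (1 + exp(-x)) but safe for large |x|.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def sample_hard_channel(bandwidth, temperature, rng):
    """Relaxed Bernoulli(bandwidth) sample via the two-class Gumbel-softmax trick."""
    logit = np.log(bandwidth) - np.log1p(-bandwidth)
    # Two i.i.d. Gumbel(0, 1) noises, one per class of the Categorical.
    u = rng.uniform(low=1e-10, high=1.0, size=(2,) + np.shape(bandwidth))
    gumbel = -np.log(-np.log(u))
    # First softmax component == sigmoid of the noisy logit difference.
    return stable_sigmoid((logit + gumbel[0] - gumbel[1]) / temperature)

rng = np.random.default_rng(0)
bandwidth = np.full(20000, 0.7)
channel = sample_hard_channel(bandwidth, temperature=0.05, rng=rng)
```

With a small temperature most samples sit near 0 or 1, and the fraction of samples above 0.5 matches the bandwidth, which is the sense in which the relaxation approaches the Bernoulli channel.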

3.7.3 Soft Channel: Logistic-Normal Reparameterization

Similarly, we draw the Beta channel variable of VBAE-soft by keeping the first dimension of a sample from the corresponding two-class Dirichlet distribution. However, unlike Gaussian and Categorical distributions, there is no consensus regarding how to reparameterize a Dirichlet variable [20, 40, 7]. In this article, we eschew the commonly used reparameterization strategies that transform a uniformly distributed vector by the inverse of the Dirichlet cumulative distribution function as the bivariate transformation, and instead derive the reparameterization with a logistic-normal approximation. The reason is that the logistic-normal distribution converts the original parameters of the Dirichlet (the values that should be predicted by the inference network) to the mean and standard deviation of a Gaussian distribution, such that the convergence to a low-entropy area is smoother. Otherwise, reaching a low-variance area of the Dirichlet requires large parameter values, which are difficult for the inference network to learn and result in unstable training dynamics [7]. The relationship between the parameters of the logistic-normal and the corresponding Dirichlet is formulated as follows:


Note that we fix the variance of the logistic-normal to a small value, as here only the mean of the Beta channel variable (i.e., the bandwidth) is important, and a small value of the variance prevents the Dirichlet distribution from getting stuck in a low-entropy area. This is also a common trick in most regression tasks, where the outputs are assumed to be Gaussian and the variance is trivial to model. The sample from the logistic-normal is drawn according to


where the noise terms are drawn from a standard Gaussian distribution. A close look at Eq. (11) shows that it bears great similarity to Eq. (9), since we can view the mean of the logistic-normal as pseudo-logits of the bandwidth, and both equations add and subtract two i.i.d. random variables to the logits of the bandwidth before squashing it into (0, 1) with the sigmoid function. The major difference between these two equations is that for Eq. (9), a small temperature of the sigmoid pushes the value of the channel variable to 0 or 1, i.e., the two extremes, such that for each user the channel is either open or closed in one iteration. In Eq. (11), however, the temperature is set to one, and therefore the channel variable can take any value in (0, 1), which avoids abrupt swings of the gradient direction in training and smooths the convergence.
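The soft channel can be sketched along the same lines. This is an assumption-laden illustration rather than a transcription of Eq. (11): it treats the inferred value as pseudo-logits of the bandwidth, adds and subtracts two i.i.d. Gaussian noises scaled by a small fixed standard deviation, and applies a sigmoid with temperature one. The names `sample_soft_channel` and `sigma` are hypothetical.

```python
import numpy as np

def stable_sigmoid(x):
    # Equivalent to 1 / (1 + exp(-x)) but safe for large |x|.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def sample_soft_channel(pseudo_logit, sigma, rng):
    """Logistic-normal style sample in (0, 1) around sigmoid(pseudo_logit)."""
    eps = rng.standard_normal((2,) + np.shape(pseudo_logit))
    # Two-class softmax over Gaussian variables, first component only.
    return stable_sigmoid(pseudo_logit + sigma * (eps[0] - eps[1]))

rng = np.random.default_rng(0)
pseudo_logit = np.full(10000, 0.4)
channel = sample_soft_channel(pseudo_logit, sigma=0.1, rng=rng)
```

With a small `sigma` the samples concentrate around the sigmoid of the pseudo-logits instead of being pushed to the extremes, which matches the smooth discounting behavior described for VBAE-soft.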

3.7.4 Unbiased Gradient Estimators

With the stochastic user latent embedding variable and the channel variable reparameterized with the strategies introduced above, the unbiased gradient estimator of the objective w.r.t. the parameters of the inference network can be formulated as:


where the relation symbol denotes that the right-hand side is an unbiased estimator of the left-hand side. Since the variance of the gradient estimated by the reparameterization trick is low, previous work has demonstrated that as long as the batch size is large enough, it suffices to take a single sample for each user for the training to converge [9].

3.8 Prediction for New Users

After the weights of the generative and inference networks of the VBAE model are learned, our discussion shifts towards how to predict new relevant items for users given their observed ratings and noisy features. For a user, we first calculate the mean of the collaborative embedding and the bandwidth from the ratings via the collaborative inference network, and the mean of the feature embedding from the user features via the feature inference network. The user latent variable can then be approximated as:


To avoid randomness of the channel in testing, for VBAE-hard, we set the channel variable to a fixed sample from its Bernoulli distribution to determine whether or not the information in the user feature embeddings needs to be introduced to support the recommendation, whereas for VBAE-soft, we use the mean of the Beta channel variable (approximated by the logistic-normal), i.e., the bandwidth, as the channel value. Finally, we calculate the multinomial probabilities of the remaining items from the user latent variable via the generative network as:


where the approximation is due to the non-linearity of the generative network, and the estimated logits of the probabilities of unobserved items are sorted to get the final ranked list of items for recommendation.
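The final ranking step can be sketched as follows. The sketch starts from already computed item logits and leaves out the networks themselves; the function name `rank_unobserved` and the boolean-mask representation of observed items are assumptions made for illustration.

```python
import numpy as np

def rank_unobserved(logits, observed, top_k):
    """Return indices of the top_k unobserved items, sorted by decreasing logit."""
    logits = np.asarray(logits, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    # Items the user has already interacted with can never be recommended.
    scores = np.where(observed, -np.inf, logits)
    order = np.argsort(-scores, kind="stable")
    n_unobserved = int((~observed).sum())
    return order[:min(top_k, n_unobserved)]

logits = [0.1, 2.0, 0.5, 1.5]
observed = [False, True, False, False]
recommended = rank_unobserved(logits, observed, top_k=2)
```

Sorting the logits is sufficient because the softmax that turns logits into multinomial probabilities is monotone and therefore preserves the ranking.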

4 Empirical Study


Figure 5: The distribution of user rating density for the citeulike-a, citeulike-t, and Amazon toys & games datasets. The red curves illustrate the estimated probability density functions, and the light blue dashed vertical lines show the percentiles. (The maximum interaction value is cut off at 100, 80, and 60 for the three datasets, respectively, for better illustration.)

| | citeulike-a | citeulike-t | toys & games |
| --- | --- | --- | --- |
| # users | 5,551 | 5,031 | 14,706 |
| # items | 16,980 | 21,133 | 11,722 |
| % interactions | 0.217% | 0.114% | 0.072% |
| max/min #visits | 403/10 | 1,932/5 | 546/5 |
| avg ± std #visits | | | |
| # features | 8,000 | 20,000 | 8,000 |

Table 1: Attributes of the citeulike-a, citeulike-t, and Amazon toys & games datasets after preprocessing. In the table, % interactions refers to the density of the rating matrix, and max/min/avg/std #visits refer to the corresponding statistics of the number of items on which users provide implicit feedback.

In this section, we present and analyze the extensive experiments we conducted on three real-world datasets to demonstrate the effectiveness of the proposed VBAE model for hybrid recommender systems.

4.1 Dataset

We use three real-world datasets to evaluate the model performance. Two of the datasets, citeulike-a [43] and citeulike-t [44], are from CiteULike, where scholars can add academic articles they are interested in to their libraries such that new relevant articles can be automatically recommended. The third dataset, toys & games, is collected by [12] from Amazon. In preprocessing, we randomly split the users by the ratio of 8:1:1 for training, validation, and testing. For each user, 80% of the interactions are selected as the observed interactions to learn the user collaborative embedding and the bandwidth of the channel, and the remaining 20% are held out for testing. The user profiles are built from the features of their interacted items. We represent each article in the citeulike datasets by the concatenation of its title and abstract, and each item in toys & games by combining all of its reviews. We then select discriminative words according to the tf-idf values and normalize the word counts of each item over the maximum occurrences of each word in all items. Finally, we calculate the element-wise maximum of the normalized word counts of the observed items for each user as the user features. Table 1 summarizes the details of the datasets after preprocessing. Fig. 5 illustrates the distributions of interaction density for different users. From Fig. 5, we can see that the interaction density distribution clearly demonstrates a long-tail characteristic, which reflects the uneven distribution of the sufficiency level of collaborative information across users in all three datasets.
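The construction of user features from item word counts can be sketched as below. The sketch assumes that the vocabulary has already been selected (the tf-idf step is omitted) and that interactions are given as a binary user-by-item matrix; the function name `build_user_features` and both array layouts are hypothetical.

```python
import numpy as np

def build_user_features(item_word_counts, interactions):
    """item_word_counts: (n_items, n_words); interactions: (n_users, n_items), binary."""
    counts = np.asarray(item_word_counts, dtype=float)
    interactions = np.asarray(interactions)
    # Normalize each word by its maximum occurrence over all items.
    max_per_word = counts.max(axis=0)
    normalized = counts / np.maximum(max_per_word, 1.0)
    features = np.zeros((interactions.shape[0], counts.shape[1]))
    for user in range(interactions.shape[0]):
        items = np.flatnonzero(interactions[user])
        if items.size > 0:
            # Element-wise maximum over the user's observed items.
            features[user] = normalized[items].max(axis=0)
    return features

item_word_counts = [[2, 0], [1, 4], [0, 2]]
interactions = [[0, 1, 1], [1, 0, 0], [0, 0, 0]]
user_features = build_user_features(item_word_counts, interactions)
```

Users without any observed item receive an all-zero feature vector in this sketch, which is a design choice made here for simplicity rather than something stated in the text.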

4.2 Evaluation Metrics

Two ranking-based metrics are used to evaluate the recommendation performance: Recall@R and the truncated normalized discounted cumulative gain (NDCG@R), where R denotes the cutoff rank. We do not use the precision metric, since the rating matrices in all three datasets record implicit feedback, where a zero entry does not necessarily imply that the user is not interested in the item; it may also be that the user is not aware of its existence [16]. For a user, we first obtain the rank of the held-out items by sorting their multinomial probabilities calculated as in Eq. (14). Given the item at each rank and the set of held-out items for the user, Recall@R is calculated as:


where the numerator counts, via the indicator function, the held-out items that appear among the top R ranks, and the denominator is the minimum of the cutoff rank R and the number of held-out items. Recall@R has a maximum of 1, which is achieved when all relevant items are ranked among the top positions. The truncated discounted cumulative gain (DCG@R) is computed as


which, instead of uniformly weighting all positions, introduces a logarithmic discount function over the ranks, where larger weights are applied to recommended items that appear at higher ranks [48]. NDCG@R is calculated by normalizing the DCG@R to [0, 1] by the ideal DCG@R, where all relevant items are ranked at the top.
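Both metrics can be sketched for a single user as follows. The recall follows the description above directly; the DCG uses a binary gain with a 1/log2(rank + 1) discount, which is a common convention and an assumption here, since the exact form of the equation is not reproduced in this text. The function names and the cutoff argument `r` are hypothetical, and the held-out set is assumed to be non-empty.

```python
import math

def recall_at_r(ranked_items, held_out, r):
    """Fraction of held-out items found in the top r, normalized by min(r, |held_out|)."""
    hits = sum(1 for item in ranked_items[:r] if item in held_out)
    return hits / min(r, len(held_out))

def ndcg_at_r(ranked_items, held_out, r):
    """Truncated DCG with a log2 discount, divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position 0 is rank 1, so the discount is log2(rank + 1)
        for position, item in enumerate(ranked_items[:r])
        if item in held_out
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(r, len(held_out))))
    return dcg / ideal

ranked = [5, 3, 9, 1]
held_out = {3, 1}
recall = recall_at_r(ranked, held_out, r=2)
ndcg = ndcg_at_r(ranked, held_out, r=4)
```

The base of the logarithm cancels in the normalization, so the NDCG value does not depend on whether a natural or base-2 logarithm is used.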

4.3 Implementation Details

Since the datasets we consider vary both in their scale and scope, we select the structure and the hyperparameters of VBAE based on evaluation metrics on validation users through grid search (due to space limits, please refer to the JSON files we release with the code for the searched optimal hyperparameters and model architecture for each of the three datasets). In VBAE, sigmoid is used as both the intermediate and the output activation. The weights of the inference network are tied to the generation network in the same way as [26] to more effectively learn representations of user features. Specifically, to avoid the component collapsing problem where the inferred bandwidths for all users are identical, batch normalization [17] is applied to the L2-norm of the latent feature representations such that they have zero mean and unit variance before the inference of the bandwidth; in addition, a larger decay rate is applied to the weights of the dense layer for bandwidth inference for regularization. We first layerwise pretrain the user feature network as the initial starting point for VBAE, and then iteratively train the collaborative network and the user feature network for 100 epochs. Adam is used as the optimizer with a batch size of 500 users. We randomly split the datasets into ten train/val/test splits as described in Section 4.1. For each split, we keep the model with the best NDCG@100 on the validation users and report the mean metrics on the test users for all splits.

4.4 Comparisons with Baselines

In this section, we compare the proposed VBAE with the following state-of-the-art collaborative and hybrid recommendation baselines to demonstrate its effectiveness:

| Method | Recall@20 | Recall@40 | NDCG@100 |
| --- | --- | --- | --- |
| VBAE-soft | 0.299 | 0.376 | 0.296 |
| VBAE-hard | 0.293 | 0.373 | 0.294 |
| FM | 0.231 | 0.312 | 0.238 |
| CTR | 0.169 | 0.250 | 0.190 |
| CDL | 0.209 | 0.295 | 0.226 |
| CVAE | 0.236 | 0.334 | 0.247 |
| Multi-VAE | 0.261 | 0.346 | 0.265 |
| RecVAE | 0.265 | 0.354 | 0.269 |
| CoVAE | 0.247 | 0.338 | 0.260 |
| CondVAE | 0.274 | 0.359 | 0.275 |
| DICER | 0.279 | 0.363 | 0.272 |

(a) citeulike-a

| Method | Recall@20 | Recall@40 | NDCG@100 |
| --- | --- | --- | --- |
| VBAE-soft | 0.227 | 0.306 | 0.190 |
| VBAE-hard | 0.223 | 0.308 | 0.193 |
| FM | 0.154 | 0.224 | 0.135 |
| CTR | 0.132 | 0.189 | 0.118 |
| CDL | 0.200 | 0.271 | 0.168 |
| CVAE | 0.216 | 0.294 | 0.181 |
| Multi-VAE | 0.168 | 0.247 | 0.139 |
| RecVAE | 0.177 | 0.251 | 0.148 |
| CoVAE | 0.194 | 0.257 | 0.167 |
| CondVAE | 0.215 | 0.279 | 0.172 |
| DICER | 0.203 | 0.283 | 0.161 |

(b) citeulike-t

| Method | Recall@20 | Recall@40 | NDCG@100 |
| --- | --- | --- | --- |
| VBAE-soft | 0.145 | 0.196 | 0.107 |
| VBAE-hard | 0.144 | 0.193 | 0.105 |
| FM | 0.088 | 0.121 | 0.062 |
| CTR | 0.124 | 0.179 | 0.089 |
| CDL | 0.133 | 0.181 | 0.092 |
| CVAE | 0.139 | 0.188 | 0.094 |
| Multi-VAE | 0.114 | 0.157 | 0.082 |
| RecVAE | 0.110 | 0.154 | 0.077 |
| CoVAE | 0.120 | 0.174 | 0.085 |
| CondVAE | 0.132 | 0.180 | 0.094 |
| DICER | 0.127 | 0.172 | 0.092 |

(c) toys & games

Table 2: Comparisons between VBAE and various baselines. We report the metrics averaged on ten splits of the datasets. The best method is highlighted in bold, where the best method in each part is marked with underlines.

  • FM (Factorization Machine) is a widely employed algorithm for hybrid recommendation with sparse inputs [11]. We use Bayesian parameter search as suggested in [6] to find optimal hyperparameters and the loss function on the validation users.

  • CTR [43] learns the topics of item content via latent Dirichlet allocation (LDA) and couples it with probabilistic matrix factorization (PMF) for collaborative filtering. We find the optimal hyperparameters and latent dimension through grid search.

  • CDL [47] replaces the LDA in CTR with a stacked Bayesian denoising auto-encoder (SDAE) [42] to learn the item content embeddings in an end-to-end manner. We set the mask rate of SDAE to 0.3 and search its architecture the same way as VBAE.


  • CVAE [26] further improves over the CDL by utilizing a VAE in place of the Bayesian SDAE, where a self-adaptive Gaussian noise is introduced to corrupt the latent item embeddings instead of corrupting the input features with zero masks.

  • Multi-VAE [27] breaks the linear collaborative modeling ability of PMF by using a VAE with multinomial likelihood to capture the user collaborative information in ratings for recommendations.

  • CoVAE [5] utilizes the non-linear Multi-VAE as the collaborative backbone and incorporates item feature information by treating their co-occurrences as pseudo training samples to collectively train the Multi-VAE with the user ratings.

  • CondVAE [33] builds a user conditional VAE where the user features are used as the conditions. We extend the original CondVAE by replacing the categorical user features with the ones we build from the interacted items, which we find have a better performance on all the datasets.

  • DICER [57] is an item-oriented auto-encoder (IAE)-based recommender system where the item content information is utilized to learn disentangled item embeddings from their user ratings to achieve more robust recommendations.

  • RecVAE [38] improves over the Multi-VAE by designing a new encoder architecture with a composite prior for user collaborative latent variables that leads to a more stable training procedure.

Table 2 summarizes the comparison results between VBAE and the selected baselines. As can be seen, Table 2 comprises three parts. The middle part shows four hybrid baselines with a linear collaborative filtering module, i.e., matrix factorization (MF). Generally, the performance improves with the increase of the representational ability of the utilized item embedding model. Specifically, CVAE, which uses a VAE to encode the item content information into Gaussian variables, performs consistently better than CDL and CTR on all three datasets. However, we also observe that simple methods such as FM can outperform some of the deep learning-based baselines (e.g., CDL on the citeulike-a dataset) when their parameters are systematically searched with a Bayesian optimizer [6].

The bottom part shows baselines that utilize deep neural networks (DNNs) as the collaborative module. Multi-VAE and RecVAE can capture non-linear similarities among users, so they improve almost consistently over the linear hybrid baselines when the datasets are comparatively dense (e.g., the citeulike-a dataset), even though they do not use any user or item side information. When the datasets get sparser, however, they cannot perform on par with the PMF-based hybrid recommenders that are augmented with item side information, due to the lack of sufficient collaborative information. Moreover, we find that although it augments user ratings with item feature co-occurrences as extra pseudo training samples, CoVAE does not consistently outperform Multi-VAE on all three datasets, which could suggest that item feature co-occurrences do not necessarily imply user co-purchases. Treating user feature embeddings as the condition for the user collaborative embeddings, CondVAE achieves the best performance among all the non-linear UAE-based baselines on the two denser citeulike datasets and performs on par with CVAE on the sparser Amazon toys & games dataset. DICER, which is an IAE-based recommender that we include for comparison, shows clear merits when the dataset has a large item-to-user ratio (e.g., the citeulike-a and citeulike-t datasets). The reason may be that for IAE-based recommenders, the number of training samples is proportional to the number of items, whereas the number of trainable weights is proportional to the number of users, so a large item-to-user ratio ensures sufficient training samples and a reasonable number of trainable weights to guarantee good model generalization.

Simultaneously addressing the uncertainty of user ratings and the noise in user features, VBAE-soft and VBAE-hard outperform all baselines on all three datasets. Although the Bayesian SDAE in CDL, the VAE in CVAE, and CondVAE also have denoising ability in that they corrupt the item features, latent item embeddings, or latent user embeddings via masked noise or self-adaptive Gaussian noise, the noise they address is not recommendation-oriented and is therefore inevitably low-level. However, high-level and personalized noise (information that is not relevant to the recommendation purpose) exists pervasively in recommendation tasks and cannot be addressed by these models. In contrast, through the introduction of a user-dependent channel variable, VBAE actively decides how much information should be accessed from the user features based on the information already contained in the ratings, through a quantum-inspired collaborative uncertainty measurement mechanism. This ensures the personalized recommendation quality when the ratings are sparse by incorporating sufficient user feature information, while improving the model generalization ability by avoiding unnecessary dependence on the noisy user features when the collaborative information is sufficient.

4.5 Ablation Study On the User-dependent Channel


Figure 6: NDCG@100 breakdown for users with increasing levels of activity, measured by the number of items that a user has interacted with in the history. The error bars represent the standard deviation across ten random splits.

In this section, we further demonstrate the effectiveness of the established information regulation mechanism in VBAE by answering the following two research questions:

RQ1: How do VBAE-hard and VBAE-soft perform compared to VBAE-like models which, instead of explicitly considering the personalized difference of the collaborative uncertainty and feature noise for different users, treat the fusion of user features as a fixed procedure for all the users?

RQ2: How well does the inferred bandwidth correspond to the sparsity of interactions and, therefore, the scarcity of collaborative information? The answer to this question shows the effectiveness of the proposed quantum-inspired collaborative uncertainty measurement to distinguish users with varied sufficiency levels of collaborative information.

4.5.1 Comparisons with Models with Fixed Fusion Strategies

To answer the first research question, we design the following three baseline models as ablation studies:

  • DBAE-pass uses an "all-pass" channel to link the user collaborative and feature networks, where all the information in the user feature embeddings is losslessly transferred to the corresponding user latent variables, irrespective of the individual difference in the sufficiency level of the collaborative information;

  • DBAE-stop uses a "stop" channel where the user feature information is entirely blocked, and only the collaborative information is exploited to calculate the user latent variables. The difference between DBAE-stop and Multi-VAE is that Multi-VAE imposes the L2-normalization on the input ratings, whereas DBAE-stop imposes it on the hidden rating embeddings to make it comparable with VBAE.

  • VAE-concat concatenates the user ratings and features as the inputs to the Multi-VAE to reconstruct the ratings, instead of viewing their fusion from an information-theoretic perspective. The fusion can be viewed as learning a fixed weighted combination between user features and ratings.

| Method | R@20 | N@100 | Bandwidth | PCC |
| --- | --- | --- | --- | --- |
| VBAE-soft | 0.299 | 0.296 | 0.543 ± 0.054 | -0.898 |
| VBAE-hard | 0.293 | 0.294 | 0.812 ± 0.048 | -0.901 |
| DBAE-stop | 0.263 | 0.269 | 0.000 ± 0.000 | N/A |
| DBAE-pass | 0.287 | 0.285 | 1.000 ± 0.000 | N/A |
| VAE-concat | 0.274 | 0.280 | N/A | N/A |

(a) citeulike-a

| Method | R@20 | N@100 | Bandwidth | PCC |
| --- | --- | --- | --- | --- |
| VBAE-soft | 0.227 | 0.190 | 0.546 ± 0.050 | -0.887 |
| VBAE-hard | 0.223 | 0.193 | 0.805 ± 0.035 | -0.910 |
| DBAE-stop | 0.170 | 0.142 | 0.000 ± 0.000 | N/A |
| DBAE-pass | 0.212 | 0.178 | 1.000 ± 0.000 | N/A |
| VAE-concat | 0.215 | 0.172 | N/A | N/A |

(b) citeulike-t

| Method | R@20 | N@100 | Bandwidth | PCC |
| --- | --- | --- | --- | --- |
| VBAE-soft | 0.145 | 0.107 | 0.560 ± 0.057 | -0.803 |
| VBAE-hard | 0.144 | 0.105 | 0.829 ± 0.031 | -0.825 |
| DBAE-stop | 0.119 | 0.088 | 0.000 ± 0.000 | N/A |
| DBAE-pass | 0.135 | 0.094 | 1.000 ± 0.000 | N/A |
| VAE-concat | 0.132 | 0.095 | N/A | N/A |

(c) toys & games

Table 3: Comparisons between VBAEs, two VBAE-like models with deterministic channels, and an early fusion baseline which concatenates the user ratings and features as the inputs for recommendations.

The collaborative network structure of DBAE-pass and DBAE-stop is set to be the same as VBAE for a fair comparison. The comparison results are listed in Table 3.

As can be seen, among the five models that we compare, DBAE-stop performs the worst on all the datasets. Since DBAE-stop can be viewed as an altered version of Multi-VAE [27] where the L2-normalization is applied on the hidden representations rather than the input ratings and extra L2 penalties are imposed on the network weights, this confirms the previous finding that hybrid recommendation methods augmented with feature information usually perform better than collaborative-based methods when the ratings are sparse [47, 26]. Comparatively, DBAE-pass is much harder to beat than DBAE-stop, since the deficiency of collaborative information for a large number of users with sparse interactions makes the auxiliary user features valuable for personalized recommendation, even if the features are noisy. Still, the two VBAE-based methods achieve better performance on all three datasets, which demonstrates that constraining the feature information allowed to be accessed for users with sufficient collaborative information can indeed improve model generalization. Although VAE-concat uses a dense layer to learn a weighted combination of user features and ratings, it is over-parameterized and prone to overfitting when the datasets are sparse. Moreover, in VAE-concat the weights are fixed for all users, which ignores the individual differences in the sufficiency level of collaborative information. Accordingly, we also observe that the two VBAE models outperform VAE-concat on all three datasets. The superiority of the user-dependent bandwidth over the all-pass and VAE-concat models indicates that for users with more informative interactions (i.e., dense and overlapping with the interactions of other users), the collaborative information in the ratings is per se very reliable for recommendations, and the noise introduced by the fusion of user features may outweigh the useful information and degrade the recommendation performance.

The explanation for the superiority of VBAE-soft over VBAE-hard could be that VBAE-soft uses a Beta channel variable with its variance fixed to a small value, where the feature embeddings are stably and smoothly discounted based on the bandwidth inferred from user ratings. In contrast, the Bernoulli channel in VBAE-hard determines whether or not to access user features with the inferred bandwidth as the access probability, which may be coarse in granularity and makes the training process less stable than the Beta channel in VBAE-soft.

To further investigate the effectiveness of the user-dependent channels for users with different activity levels, we divide the test users into quartiles and report the NDCG@100 on each group in Fig. 6. When comparing with DBAE-stop, we mainly focus on users with low activity levels, since for these users, DBAE-stop accesses no information from user features, while VBAE-hard and VBAE-soft infer a large bandwidth for the channel that allows more information to be accessed from their features. The leftmost bar group in Fig. 6 shows that VBAE-hard and VBAE-soft significantly outperform DBAE-stop on all three datasets. The result confirms that incorporating auxiliary feature information can alleviate the uncertainty of collaborative embeddings and improve recommendation performance when the ratings are extremely sparse, even if the user features are noisy. When comparing with DBAE-pass, on the other hand, we focus on users with high activity levels. Although VBAE-hard and VBAE-soft access less information from user features for these users, the rightmost bar group of Fig. 6 shows that NDCG@100 improves consistently for them. This indicates that for users with dense interactions, the collaborative information in the ratings is per se very reliable for recommendations, and the noise introduced by the fusion of user features may outweigh the useful information and lower the recommendation performance. The improvement is more significant on the citeulike-t and toys & games datasets. Table 1 shows that users in these two datasets span a wider spectrum in their activity levels, and therefore the reliability of collaborative embeddings varies drastically for these users. In such a case, the channel can better distinguish these users and allocate for each user a suitable budget for user feature information when calculating the user latent variables for recommendations.
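A per-quartile breakdown of this kind can be sketched as follows. The grouping rule (quartile edges from `np.quantile`, with ties assigned to the lower group) and the function name `metric_by_activity_quartile` are assumptions of this sketch; the text does not specify how the quartile boundaries are handled.

```python
import numpy as np

def metric_by_activity_quartile(activity, metric):
    """Average a per-user metric within each quartile of user activity."""
    activity = np.asarray(activity, dtype=float)
    metric = np.asarray(metric, dtype=float)
    # Quartile boundaries of the activity distribution.
    edges = np.quantile(activity, [0.25, 0.5, 0.75])
    # Group index 0..3; a value equal to an edge falls into the lower group.
    group = np.digitize(activity, edges, right=True)
    return [
        float(metric[group == g].mean()) if np.any(group == g) else float("nan")
        for g in range(4)
    ]

activity = [1, 2, 3, 4, 5, 6, 7, 8]           # toy interaction counts per user
ndcg = [0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8]  # toy per-user NDCG values
per_quartile = metric_by_activity_quartile(activity, ndcg)
```

The first entry of the result corresponds to the least active users and the last entry to the most active ones, mirroring the left-to-right order of the bar groups.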

4.5.2 Statistical Analysis of the Bandwidth

To answer the second research question, we calculate several statistics of the inferred bandwidth for all the test users: its average value, its variability across users, and its Pearson correlation coefficient (PCC) with the rating density, and report them in Table 3. Table 3 shows that the bandwidth inferred through the proposed quantum-inspired collaborative uncertainty measurement tends to vary across users with different rating sparsity levels. Moreover, the PCC between the bandwidth and the density of user interactions is below -0.8 on all the datasets, i.e., the two are strongly negatively correlated. Such results indicate that the channel in VBAE-hard and VBAE-soft can distinguish users with different amounts of collaborative information in their ratings and dynamically control the extra amount of information that needs to be accessed from the user features based on the inferred bandwidth, which more convincingly demonstrates the effectiveness of the user-dependent channel in VBAE-hard and VBAE-soft. In addition, the average bandwidth of VBAE-hard is significantly larger than that of VBAE-soft on all three datasets. The reason could be that a large bandwidth for VBAE-hard helps to maintain the stability of the Bernoulli channel in training.
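These statistics can be computed with a few NumPy calls, as sketched below. The function name `bandwidth_statistics` is hypothetical, and the use of the standard deviation as the measure of variability across users is an assumption of this sketch.

```python
import numpy as np

def bandwidth_statistics(bandwidth, density):
    """Return (mean, standard deviation, Pearson correlation with rating density)."""
    bandwidth = np.asarray(bandwidth, dtype=float)
    density = np.asarray(density, dtype=float)
    mean = float(bandwidth.mean())
    std = float(bandwidth.std())
    # Off-diagonal entry of the 2x2 correlation matrix.
    pcc = float(np.corrcoef(bandwidth, density)[0, 1])
    return mean, std, pcc

density = np.array([0.1, 0.2, 0.3, 0.4])  # toy per-user rating densities
bandwidth = 0.9 - density                  # toy bandwidths that shrink as density grows
mean, std, pcc = bandwidth_statistics(bandwidth, density)
```

A PCC close to -1 means that users with denser interactions receive systematically smaller bandwidths, which is the pattern reported for both VBAE variants.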

4.6 Discussion of Broader Impacts of VBAE

Although we demonstrate the effectiveness of VBAE through its application to recommender systems in this article, VBAE is a general framework applicable to any heterogeneous information system in which one information source is comparatively reliable but could be missing, whereas another is abundant but susceptible to noise. One typical example of such a system other than recommendations is the "audio-assisted action recognition in the dark" task [52], which aims to detect actions in under-illuminated videos. In this task, the visual information is the more reliable modality for action prediction but could be missing due to bad illumination, whereas the audio track always accompanies the video but may contain much information that is irrelevant to action recognition (e.g., background music). To apply VBAE to such new tasks, the only mandatory change is to design a suitable per-data-point uncertainty measurement of the first information source, in place of the quantum-inspired measurement proposed in this paper, which is tailored for user ratings. This measurement dynamically decides how much information may be accessed from the second source, so that the model does not overfit on the noise in the auxiliary modality. Therefore, we speculate that VBAE could have a broader potential impact in areas of data mining and heterogeneous information systems other than recommendations.
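As a rough illustration of this recipe, the sketch below gates an auxiliary embedding by a per-data-point uncertainty of the primary source. It is not the VBAE formulation: the normalized-entropy uncertainty, the additive fusion, and the names `entropy_uncertainty` and `gated_fusion` are all hypothetical stand-ins chosen only to show where a task-specific uncertainty measurement would plug in.

```python
import math

def entropy_uncertainty(probs):
    """Normalized Shannon entropy in [0, 1] of a categorical prediction.

    Hypothetical uncertainty measure for the primary (e.g., visual) source.
    """
    n = len(probs)
    if n < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Clamp to guard against floating-point overshoot above 1.
    return min(1.0, entropy / math.log(n))

def gated_fusion(primary, auxiliary, bandwidth):
    """Add the auxiliary embedding scaled by a bandwidth in [0, 1]."""
    if len(primary) != len(auxiliary):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= bandwidth <= 1.0:
        raise ValueError("bandwidth must lie in [0, 1]")
    return [p + bandwidth * a for p, a in zip(primary, auxiliary)]

# Well-lit clip: confident visual prediction, so little audio is admitted.
u_bright = entropy_uncertainty([0.98, 0.01, 0.01])
# Dark clip: uninformative visual prediction, so the audio is fully admitted.
u_dark = entropy_uncertainty([1 / 3, 1 / 3, 1 / 3])

visual, audio = [1.0, 2.0], [0.5, -0.5]
fused_bright = gated_fusion(visual, audio, u_bright)
fused_dark = gated_fusion(visual, audio, u_dark)
```

Here the uncertainty is used directly as the bandwidth; a learned or stochastic channel could replace this mapping without changing the overall structure.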

5 Conclusions

In this article, we develop an information-driven generative model, the collaborative variational bandwidth auto-encoder (VBAE), to address the uncertainty and noise problems associated with two heterogeneous sources in recommender systems, i.e., ratings and user features. In VBAE, we establish an information regulation mechanism to fuse the collaborative and feature information, where a user-dependent channel variable is introduced to dynamically control how much information should be accessed from the user features given the information already contained in the collaborative embedding. The channel alleviates the uncertainty problem when the ratings are sparse while improving the generalization ability of the model with respect to noisy user features. The effectiveness of VBAE is demonstrated by extensive experiments conducted on three real-world datasets.


  • [1] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2017) Deep variational information bottleneck. In Proc. ICLR, Cited by: §3.6.
  • [2] P. Baldi (2012) Autoencoders, unsupervised learning, and deep architectures. In Proc. ICML workshop, pp. 37–49. Cited by: §2.
  • [3] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. Cited by: §3.4.
  • [4] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. Bengio (2016) Generating sentences from a continuous space. In Proc. CoNLL, pp. 10–21. Cited by: §3.6.
  • [5] Y. Chen and M. de Rijke (2018) A collective variational autoencoder for top-n recommendation with side information. In Proc. WDLRS, pp. 3–9. Cited by: 6th item.
  • [6] M. F. Dacrema, P. Cremonesi, and D. Jannach (2019) Are we really making much progress? a worrying analysis of recent neural recommendation approaches. In Proc. RecSys, pp. 101–109. Cited by: 1st item, §4.4.
  • [7] Y. Deng, Y. Kim, J. Chiu, D. Guo, and A. Rush (2018) Latent alignment and variational attention. In Proc. NeurIPS, pp. 9712–9724. Cited by: §3.3.2, §3.7.3.
  • [8] X. Dong, L. Yu, Z. Wu, Y. Sun, L. Yuan, and F. Zhang (2017) A hybrid collaborative filtering model with deep structure for recommender systems. In Proc. AAAI, pp. 1309–1315. Cited by: §3.3.2.
  • [9] Y. Gal (2016) Uncertainty in deep learning. Ph.D. Thesis, University of Cambridge. Cited by: §3.7.1, §3.7.4.
  • [10] A. Goyal, Y. Bengio, M. Botvinick, and S. Levine (2020) The variational bandwidth bottleneck: stochastic evaluation on an information budget. In Proc. ICLR, Cited by: §3.3.2.
  • [11] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: A factorization-machine based neural network for ctr prediction. In Proc. IJCAI, pp. 1725–1731. Cited by: 1st item.
  • [12] R. He and J. McAuley (2016) Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proc. WWW, pp. 507–517. Cited by: §4.1.
  • [13] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proc. WWW, pp. 173–182. Cited by: §1.
  • [14] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. In Proc. ICLR, Cited by: §3.6.
  • [15] Y. Hou, N. Yang, Y. Wu, and P. S. Yu (2019) Explainable recommendation with fusion of aspect information. World Wide Web 22 (1), pp. 221–240. Cited by: §1.
  • [16] Y. Hu, Y. Koren, and C. Volinsky (2008) Collaborative filtering for implicit feedback datasets. In Proc. ICDM, pp. 263–272. Cited by: §3.1, §4.2.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. ICML, pp. 448–456. Cited by: §3.4.2, §4.3.
  • [18] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In Proc. ICLR, Cited by: §3.7.2.
  • [19] R. A. Johnson, D. W. Wichern, et al. (2002) Multivariate linear regression models. In Applied Multivariate Statistical Analysis, pp. 360–417. Cited by: §3.4.1.
  • [20] W. Joo, W. Lee, S. Park, and I. Moon (2020) Dirichlet variational autoencoder. Pattern Recognit. 107, pp. 107514. Cited by: §3.7.3.
  • [21] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In Proc. ICLR, Cited by: §1, §2.1, §3.7.1.
  • [22] Y. Koren (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proc. KDD, pp. 426–434. Cited by: §2.3.
  • [23] W. Lee, K. Song, and I. Moon (2017) Augmented variational autoencoders for collaborative filtering with auxiliary information. In Proc. CIKM, pp. 1139–1148. Cited by: §2.2, §2.
  • [24] Q. Li, B. Wang, and M. Melucci (2019) CNM: an interpretable complex-valued network for matching. In Proc. NAACL, pp. 4139–4148. Cited by: §3.4.1.
  • [25] S. Li, J. Kawale, and Y. Fu (2015) Deep collaborative filtering via marginalized denoising auto-encoder. In Proc. CIKM, pp. 811–820. Cited by: §1.
  • [26] X. Li and J. She (2017) Collaborative variational autoencoder for recommender systems. In Proc. KDD, pp. 305–314. Cited by: §1, §2.1, §3.3.2, 4th item, §4.3, §4.5.1.
  • [27] D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara (2018) Variational autoencoders for collaborative filtering. In Proc. WWW, pp. 689–698. Cited by: §1, §2.2, §2, §3.3.3, §3.4.2, §3.6, 5th item.
  • [28] J. P. Lucas, N. Luz, M. N. Moreno, R. Anacleto, A. A. Figueiredo, and C. Martins (2013) A hybrid recommendation approach for a tourism system. Expert Syst. Appl. 40 (9), pp. 3532–3550. Cited by: §1.
  • [29] J. Ma, C. Zhou, P. Cui, H. Yang, and W. Zhu (2019) Learning disentangled representations for recommendation. In Proc. NeurIPS, pp. 5711–5722. Cited by: §1.
  • [30] J. Ma, C. Zhou, H. Yang, P. Cui, X. Wang, and W. Zhu (2020) Disentangled self-supervision in sequential recommenders. In Proc. KDD, pp. 483–491. Cited by: §1.
  • [31] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In Proc. ICLR, Cited by: §3.7.2.
  • [32] A. Mnih and R. R. Salakhutdinov (2007) Probabilistic matrix factorization. In Proc. NeurIPS, pp. 1257–1264. Cited by: §2.1.
  • [33] B. Pang, M. Yang, and C. Wang (2019) A novel top-N recommendation approach based on conditional variational auto-encoder. In Proc. PAKDD, pp. 357–368. Cited by: §2.3, §3.4.2, 7th item.
  • [34] D. Park, H. Song, M. Kim, and J. Lee (2020) TRAP: two-level regularized autoencoder-based embedding for power-law distributed data. In Proc. WWW, pp. 1615–1624. Cited by: §3.6.
  • [35] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proc. ICML, Vol. 32, pp. 1278–1286. Cited by: §3.7.
  • [36] N. Sachdeva, G. Manco, E. Ritacco, and V. Pudi (2019) Sequential variational autoencoders for collaborative filtering. In Proc. WSDM, pp. 600–608. Cited by: §2.
  • [37] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie (2015) AutoRec: autoencoders meet collaborative filtering. In Proc. WWW, pp. 111–112. Cited by: §1, §2.
  • [38] I. Shenbin, A. Alekseev, E. Tutubalina, V. Malykh, and S. I. Nikolenko (2020) RecVAE: a new variational autoencoder for top-N recommendations with implicit feedback. In Proc. WSDM, pp. 528–536. Cited by: 9th item.
  • [39] M. Slaney (2011) Web-scale multimedia analysis: does content matter?. IEEE MultiMedia 18 (2), pp. 12–15. Cited by: §1.
  • [40] A. Srivastava and C. Sutton (2017) Autoencoding variational inference for topic models. In Proc. ICLR, Cited by: §3.7.3.
  • [41] J. Su, W. Huang, P. S. Yu, and V. S. Tseng (2010) Efficient relevance feedback for content-based image retrieval by mining user navigation patterns. IEEE Trans. Knowl. Data Eng. 23 (3), pp. 360–372. Cited by: §1.
  • [42] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11 (Dec), pp. 3371–3408. Cited by: §2.1, 3rd item.
  • [43] C. Wang and D. M. Blei (2011) Collaborative topic modeling for recommending scientific articles. In Proc. KDD, pp. 448–456. Cited by: 2nd item, §4.1.
  • [44] H. Wang, B. Chen, and W. Li (2013) Collaborative topic regression with social regularization for tag recommendation. In Proc. IJCAI, pp. 2719–2725. Cited by: §4.1.
  • [45] H. Wang and W. Li (2014) Relational collaborative topic regression for recommender systems. IEEE Trans. Knowl. Data Eng. 27 (5), pp. 1343–1355. Cited by: §1.
  • [46] H. Wang, X. Shi, and D. Yeung (2016) Collaborative recurrent autoencoder: recommend while learning to fill in the blanks. In Proc. NeurIPS, pp. 415–423. Cited by: §1.
  • [47] H. Wang, N. Wang, and D. Yeung (2015) Collaborative deep learning for recommender systems. In Proc. KDD, pp. 1235–1244. Cited by: §2.1, §3.3.2, 3rd item, §4.5.1.
  • [48] Y. Wang, L. Wang, Y. Li, D. He, W. Chen, and T. Liu (2013) A theoretical analysis of NDCG ranking measures. In Proc. CoLT, pp. 1–30. Cited by: §4.2.
  • [49] Y. Wu, C. DuBois, A. X. Zheng, and M. Ester (2016) Collaborative denoising auto-encoders for top-N recommender systems. In Proc. WSDM, pp. 153–162. Cited by: §1, §2.2, §2.
  • [50] B. Xu, J. Bu, C. Chen, C. Wang, D. Cai, and X. He (2013) EMR: a scalable graph-based ranking model for content-based image retrieval. IEEE Trans. Knowl. Data Eng. 27 (1), pp. 102–114. Cited by: §1.
  • [51] Y. Xu, L. Zhu, Z. Cheng, J. Li, Z. Zhang, and H. Zhang (2021) Multi-modal discrete collaborative filtering for efficient cold-start recommendation. IEEE Trans. Knowl. Data Eng. Cited by: §1.
  • [52] Y. Xu, J. Yang, H. Cao, J. Yin, and S. See (2021) Arid: a new dataset for recognizing action in the dark. In IJCAI Workshop, Vol. 1370, pp. 70. Cited by: §4.6.
  • [53] J. Yi, Y. Zhu, J. Xie, and Z. Chen (2021) Cross-modal variational auto-encoder for content-based micro-video background music recommendation. ArXiv preprint. Cited by: §1.
  • [54] Q. Yi, N. Yang, and P. Yu (2021) Dual adversarial variational embedding for robust recommendation. IEEE Trans. Knowl. Data Eng. Cited by: §1, §2.3.
  • [55] S. Zhang, L. Yao, A. Sun, and Y. Tay (2019) Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys 52 (1), pp. 1–38. Cited by: §1.
  • [56] S. Zhang, L. Yao, and X. Xu (2017) AutoSVD++: an efficient hybrid collaborative filtering model via contractive auto-encoders. In Proc. SIGIR, pp. 957–960. Cited by: §1.
  • [57] Y. Zhang, Z. Zhu, Y. He, and J. Caverlee (2020) Content-collaborative disentanglement representation learning for enhanced recommendation. In Proc. RecSys, pp. 43–52. Cited by: §2.1, §2, 8th item.
  • [58] J. P. Zhou, Z. Cheng, F. Pérez, and M. Volkovs (2020) TAFA: two-headed attention fused autoencoder for context-aware recommendations. In Proc. RecSys, pp. 338–347. Cited by: §2.2.
  • [59] Z. Zhu, J. Wang, and J. Caverlee (2019) Improving top-k recommendation via joint collaborative autoencoders. In Proc. WWW, pp. 3483–3482. Cited by: §1.