VBAE
Collaborative variational bandwidth autoencoder (VBAE) for recommender systems.
Hybrid recommendation has recently attracted a lot of attention, where user features are utilized as auxiliary information to address the sparsity problem caused by insufficient user-item interactions. However, extracted user features generally contain rich multimodal information, much of which is irrelevant to the recommendation purpose. Excessive reliance on these features therefore makes the model overfit the noise and generalize poorly. In this article, we propose a variational bandwidth autoencoder (VBAE) for recommendation, aiming to address the sparsity and noise problems simultaneously. VBAE first encodes user collaborative and feature information into Gaussian latent variables via deep neural networks to capture nonlinear user similarities. Moreover, by considering the fusion of collaborative and feature variables as a virtual communication channel from an information-theoretic perspective, we introduce a user-dependent channel to dynamically control the information allowed to be accessed from the feature embeddings. A quantum-inspired uncertainty measurement of the hidden rating embeddings is proposed accordingly to infer the channel bandwidth by disentangling the uncertainty information in the ratings from the semantic information. Through this mechanism, VBAE incorporates adequate auxiliary information from user features when collaborative information is insufficient, while avoiding excessive reliance on noisy user features, which improves its generalization ability to new users. Extensive experiments conducted on three real-world datasets demonstrate the effectiveness of the proposed method. Code and datasets are released at https://github.com/yaochenzhu/vbae.
In the era of information overload, people have been inundated by large amounts of online content, and it becomes increasingly difficult for them to discover interesting information. Consequently, recommender systems play a pivotal role in modern web applications due to their ability to help users discover items that they may be interested in from a large collection of candidates. Based on how recommendations are made, existing recommender systems can be categorized into three classes [55]: collaborative-based methods, content-based methods, and hybrid methods. Collaborative-based methods [45, 13] predict user preferences by exploiting their past activities, such as clicks or ratings, where the recommendation quality relies heavily on peers with similar behavior patterns. Content-based methods [41, 50], on the other hand, make recommendations based on users or items that share similar features. Hybrid methods [28, 51, 53] combine the advantages of both worlds, comprehensively considering the collaborative information and user/item features to generate more precise recommendations.
Recent years have witnessed an upsurge of interest in employing autoencoders [37] in both collaborative and content-based recommender systems, where compact representations of sparse ratings [27, 25] or high-dimensional user/item features [46, 59] can be learned to more effectively exploit the similarity patterns between users or items for recommendation. As a Bayesian version of the autoencoder in which the encoded latent representations are modeled as random variables, the VAE has demonstrated superiority over other forms of autoencoders, such as the contractive autoencoder [56] and the denoising autoencoder [49]. Among VAE-based recommenders, the collaborative variational autoencoder (CVAE) [26] first used a VAE to infer latent item content embeddings from tf-idf textual features, and then iteratively fine-tuned the embeddings with rating information via matrix factorization. Multi-VAE [27], in contrast, used the VAE in the collaborative setting to learn compact user embeddings from discrete user ratings. Recently, MacridVAE [29, 30] further extended Multi-VAE, constraining the learned user representations to disentangle at both the macro and micro levels to improve the robustness and interpretability of the embeddings. However, it is difficult to generalize the VAE to a hybrid recommender system, due to various challenges from both the collaborative and content-based components (Fig. 1
). As users tend to vary in their activity levels and tastes, user embeddings learned through collaborative filtering bear different degrees of uncertainty, which hinders good recommendations for users with unreliable collaborative embeddings (e.g., users #4 and #5 in Fig. 1). The uncertainty mainly comes from three aspects: (1) Sparsity: For a user with sparser interactions (user #4), her associated embedding is more unreliable due to the insufficiency of information in her historical interactions, which makes the similarity measure induced by the embedding space less informative compared to users with denser interactions. (2) Diversity: Even if a user has denser interactions, we cannot safely conclude that we can estimate her preferences with more confidence, because her ratings may focus on a few types of items, which makes the collaborative information conveyed by different items correlated. If she rated items of more diverse types, however, we could estimate her preferences with more confidence, because the item space is more thoroughly explored. (3) Overlapping: The uncertainty of a user embedding may also be large if the items the user has interacted with are seldom visited by other users (user #5). Consider two users who click the same number of items: if the items the former user clicks are also clicked by many other users, while the latter user only clicks items that no other user has clicked, the embedding uncertainty of the latter would be larger than that of the former.
Although user/item features could be exploited to reduce the uncertainty incurred by the sparsity of ratings, the main obstacle to utilizing them lies in the heavy noise that may outweigh the useful information. Here, by noise, we mean any pattern that is irrelevant to the recommendation purpose, which should be distinguished from low-level noise such as image blur, audio aliasing, and textual typos. Consider, for example, recommending academic articles to researchers: the related work and empirical study sections are less informative than the abstract and methodology, and generally they should be regarded as noise in recommendation, although they both contain valuable information once a researcher gets attracted by the abstract and decides to delve deeper into the article. Moreover, since different users may consider varied aspects when rating an item, such noise exhibits a personalized characteristic that makes it difficult to eliminate [15, 54]. A similar analysis can be made for user features: the collection of certain user attributes, such as location, may raise privacy issues, and a widely adopted surrogate strategy is to empirically combine the features of the items that a user has interacted with to build up a user profile for recommendation. In fact, the consensus in the community is that collaborative filtering is more reliable than feature-based methods for large-scale web recommendations when user interactions are sufficient to leverage [39].
Therefore, a good hybrid recommender system should avoid unnecessary reliance on the noisy user/item features depending on the sufficiency level of the collaborative information in the ratings (e.g., users #1, #2, and #3 in Fig. 1), such that the noise in these features does not outweigh the useful information and hurt the model's generalization ability.
To address the above challenges, we propose an information-driven Bayesian generative model called the variational bandwidth autoencoder (VBAE) for hybrid recommendation. The model first jointly learns the generation of user features and ratings from latent collaborative and user-feature embeddings. These embeddings are modeled as Gaussian random variables and inferred via deep neural networks through the auto-encoding variational Bayes (AEVB) algorithm [21]. Furthermore, observing that the extracted user features can be extremely noisy, we consider the fusion of the collaborative and feature embeddings from an information-theoretic perspective, i.e., as a virtual communication channel. We then introduce a novel user-dependent channel that dynamically controls the amount of information that is allowed to be accessed from the user features based on the collaborative information already contained in the user ratings. A quantum-inspired uncertainty measurement of the hidden rating embeddings is proposed accordingly to infer the channel bandwidth by disentangling the uncertainty information in the ratings from the semantic information. Through this mechanism, sufficient auxiliary information can be accessed from user features when collaborative information is inadequate, while unnecessary dependence on noisy user features is avoided otherwise, allowing the model to generalize better. The main contributions of this article are summarized as follows:
We present VBAE, a unified information-driven recommendation framework where the generation of user ratings and features is parameterized via deep Bayesian networks and their fusion is modeled as a personalized virtual communication channel, such that the rating sparsity and feature noise problems can be simultaneously addressed.
A novel quantum-inspired uncertainty measurement of the hidden rating embedding is proposed to infer the bandwidth of the user-dependent channel, which enhances the model's generalization ability by dynamically controlling the information allowed to be accessed from user features based on the sufficiency level of the collaborative information.
Two kinds of channel implementations with different desired properties, i.e., Bernoulli and Beta channels, are thoroughly discussed, with the corresponding optimization objectives derived via distribution approximation and variational inference to make them amenable to stochastic gradient descent.
The proposed VBAE empirically outperforms state-of-the-art hybrid recommendation baselines. We also discover that the inferred bandwidth of the channel variable can well distinguish users with different sufficiency levels of collaborative information.
As a special kind of deep neural network, autoencoders aim to learn compact representations of inputs by reconstruction [2]. Since both user ratings and features are high-dimensional sparse vectors, which makes their direct manipulation in the original space difficult, much effort has recently been dedicated to improving representation learning strategies with autoencoders [37]. Generally, autoencoder-based recommenders can be categorized into two main classes based on whether the autoencoder tackles the user-side or the item-side information: user-oriented autoencoders (UAEs) [23, 49, 27, 36] and item-oriented autoencoders (IAEs) [37, 57].
The advent of IAEs predates that of UAEs: item content autoencoders are built on top of matrix factorization (MF) based collaborative backbones, such as weighted matrix factorization [32], to incorporate auxiliary content information into the factorized item collaborative embeddings. Two exemplar methods from this category are CDL [47] and CVAE [26], where an item offset variable is introduced to tightly couple the Bayesian stacked denoising autoencoder (SDAE) [42] or variational autoencoder (VAE) [21] with MF to enhance its performance; the MF and the item content autoencoder are trained in an iterative manner. Recently, autoencoders have also been exploited to capture item collaborative information. Among them, DICER [57] captures nonlinear item similarity based on user ratings and disentangles the content information from it for better generalization. Since in collaborative IAEs the input dimension and the number of trainable weights are proportional to the number of users, and the number of training samples equals the number of items, these methods generally require a large items-to-users ratio so that a good representation of items can be learned for satisfactory recommendations.
Compared with IAEs, UAEs have attracted more attention because they break the long-standing bottleneck of the linear collaborative modeling of MF and allow modeling users in a deeper manner [23, 49, 58]. Instead of factorizing the rating matrix into user and item embeddings, UAE-based recommenders take the historical ratings of users as inputs, embed them into hidden user representations with a deep encoder network, and reconstruct the ratings from these representations with a deep decoder network. The reconstructed ratings for unrated items are then ranked for recommendation. Since UAE-based recommenders eliminate the need to model item latent variables and reconstruct the whole rating vector directly from the user latent embedding, another advantage of UAEs over MF is that they can efficiently fold in new users for whom historical ratings have been recorded, since recommendations can be made with a single forward propagation. The first UAE-based recommender system is the collaborative denoising autoencoder (CDAE) [49], where the input ratings are randomly masked with zeros to simulate the rating missing process. Afterwards, Multi-VAE [27] was proposed, in which a VAE with a multinomial likelihood on the ratings is used instead of the DAE, demonstrating clear advantages. However, one key problem with these collaborative UAEs is that if the ratings of certain users are sparse, the recommendation performance can be severely degraded due to the lack of collaborative information.
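The UAE recipe described above (embed a user's historical ratings, reconstruct scores, rank unrated items) can be sketched as follows; the one-hidden-layer network with tanh activation is an illustrative assumption, not the exact architecture of CDAE or Multi-VAE:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_hidden = 6, 3

# Historical implicit feedback for one user (1 = interacted).
r = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])

# Illustrative encoder/decoder weights (learned by backpropagation in practice).
W_enc = rng.normal(scale=0.1, size=(n_items, n_hidden))
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_items))

h = np.tanh(r @ W_enc)      # hidden user representation
scores = h @ W_dec          # reconstructed preference scores for all items

# Rank only the items the user has not interacted with.
unrated = np.flatnonzero(r == 0)
ranked = unrated[np.argsort(-scores[unrated])]
```

Folding in a new user then amounts to a single forward pass over her recorded ratings, with no per-user optimization as in MF.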
Due to the wide availability of user features and the various methods that build user profiles from item features, hybridizing UAEs with auxiliary user features to address the sparsity problem has become a new trend in the recommendation community. A simple but effective method to incorporate user feature information into a UAE is to adopt an early fusion strategy similar to [22], where the user features are concatenated with the ratings as the input to the UAE, while only the ratings are reconstructed for recommendation. In this way, the first dense layer of the UAE can be viewed as calculating a weighted combination of user ratings and features, which may be over-parameterized and is susceptible to overfitting. A more sophisticated approach is the conditional VAE (CondVAE) [33], where the user features are exploited to calculate the conditional prior of the user latent variables, which is then updated into a posterior by the collaborative information, i.e., the ratings, for recommendation. However, all these methods treat the relative importance of collaborative and content information as fixed for all users, ignoring the individual differences both in the reliability of the extracted features and in the sufficiency level of the historical rating information [54]. This is problematic, since user features generally contain much irrelevant information and noise, and a good recommender system should avoid unnecessary dependence on these features when collaborative information is sufficient, so as to improve generalization. This motivates us to design the information-driven variational bandwidth autoencoder with a user-dependent channel to fuse the user feature and collaborative information.
The focus of this article is on recommendation with implicit feedback [16]. We define the rating matrix as R ∈ {0, 1}^{I×J}, where each row r_i is the bag-of-words vector denoting whether user i has visited each of the J items. R is obtained by keeping track of user activities for a certain amount of time. In addition, the user profiles are represented by a matrix X, where the row vector x_i is the extracted feature of the ith user. x_i can contain inherent user attributes, such as age, location, and self-description, or be built from the features of the items that the user has interacted with when such information is not available. Capital non-boldface letters are used to denote the corresponding random variables (the subscript i will be omitted for simplicity if no ambiguity exists). Note that the density of r_i, defined as the proportion of its non-zero entries, can vary dramatically across users, and x_i is generally high-dimensional and noisy. Given the partial observation of the records in R and the user features X, the problem is to predict the remaining ratings in R so as to recommend new relevant items to users.
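A minimal sketch of this setup, with assumed toy values for the rating matrix R and an item feature matrix F used to build user profiles when user attributes are unavailable:

```python
import numpy as np

# Binary implicit-feedback matrix R: rows are users, columns are items.
R = np.array([[1, 0, 1, 1, 0],
              [1, 0, 0, 0, 0],
              [0, 1, 1, 0, 1]])

# Per-user rating density: the fraction of items each user has visited.
density = R.sum(axis=1) / R.shape[1]

# Item feature matrix F (illustrative values). When user attributes are
# unavailable, a user profile can be built by averaging the features of
# the items the user has interacted with.
F = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5],
              [0.7, 0.3],
              [0.1, 0.9]])
X = (R @ F) / np.maximum(R.sum(axis=1, keepdims=True), 1)
```

Note how the density varies across users even in this tiny example, which is exactly the variation the bandwidth mechanism later exploits.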
VBAE
The probabilistic graphical model (PGM) in the left part of Fig. 2 shows the overall generative and inference process of VBAE. The details of VBAE are discussed in the following sections.
In this article, the collaborative and feature embeddings of users are assumed to lie in K-dimensional latent spaces. Therefore, VBAE starts by generating for each user the collaborative latent variable z_r from a standard normal prior:
z_r ~ N(0, I_K).
Since user features are extracted as auxiliary information to complement the collaborative information, they are embedded in a user-feature latent variable z_f, which is drawn from another Gaussian distribution:
z_f ~ N(0, I_K).
Traditional generative models for hybrid recommendation [47, 26, 8] directly add z_r and z_f via an offset variable to form the latent user embedding z, or learn fixed relative weights by direct concatenation. They consider neither that the uncertainty of z_r usually varies across users due to both explicit (differences in the sparsity and diversity of ratings) and implicit (differences in the number of overlapping ratings) factors, nor that z_f contains much irrelevant information that may distract the recommendation model. This can be problematic: for users who have denser, more diverse ratings and rate items that are also rated by many other users, the associated collaborative embeddings are more reliable, and z can afford to access less information from z_f, so that unnecessary dependence of the model on noisy user features is avoided. For users with sparser ratings, however, z has to access more information from z_f to enable a more personalized recommendation, even if the user features are noisy.
From an information-theoretic perspective, if we view the fusion of z_r and z_f into z as a virtual communication channel, the issue stems from the assumption that the channel is deterministic and independent of the information already contained in the collaborative embedding z_r, ignoring individual differences in the sufficiency level of collaborative information. Therefore, to address this problem, we design a user-dependent channel in VBAE by introducing a latent capacity variable b that determines the bandwidth of the channel from z_f to z. The bandwidth variable encodes, for each user, our belief about how much extra information is required from the user features given the information already contained in the observed interactions. Through this mechanism, the channel dynamically allocates the amount of information that is allowed to flow from z_f to z conditional on b. Two strategies are explored to implement the user-dependent channel with bandwidth b.
The first strategy is the "hard" channel, where the bandwidth b is achieved when z losslessly accesses z_f with probability b and accesses no information otherwise [10]. In the generative case, we can introduce an auxiliary channel variable γ and draw it from a Bernoulli distribution with probability b:
γ ~ Bernoulli(b).
Although the hard channel complies more strictly with the definition of bandwidth in information theory, it may result in training instability, because the bandwidth only appears as a statistical property, i.e., an expectation when the feature and collaborative embeddings of a user are repeatedly fused multiple times. For one user in one recommendation, however, the user latent embedding either accesses the user feature information or not, which is too coarse in granularity to distinguish users with different uncertainty levels of collaborative information. Therefore, we consider a second strategy, the "soft" channel, which is a relaxed version of the hard channel and resembles the variational attention approach [7] more than the variational information bandwidth theory [10]. This strategy assumes that the channel variable γ is drawn from a Beta distribution whose mean is determined by the bandwidth b:
γ ~ Beta(α, β), with E[γ] = α / (α + β) = b,
where the channel curtails or amplifies the weight of the user feature information based on the bandwidth b. Given only b, however, the Beta distribution for the channel is undetermined, as its variance remains to be specified to calculate both α and β. Since we care primarily about the bandwidth itself rather than its uncertainty, we fix the variance of the Beta, which is then treated as a nuisance parameter, to a small value for simplicity. Hereafter, we use the mean-variance parameterization of the Beta distribution unless otherwise specified, since it explicitly contains the bandwidth as its first parameter. We use VBAE-hard and VBAE-soft to distinguish the two channel implementation strategies; detailed comparisons between them are summarized in Fig. 3. After drawing γ, the user latent variable is deterministically calculated as
z = z_r + γ · z_f,
which defines the fusion of z_f into z via the user-dependent channel of VBAE-hard and VBAE-soft.
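The two channel variants can be sketched as follows; the additive gated fusion and the mean-variance-to-(α, β) conversion of the Beta are our assumptions based on the description above, not a verbatim transcription of the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_channel(b):
    """VBAE-hard: access the feature embedding losslessly with probability b."""
    return float(rng.random() < b)

def soft_channel(b, var=1e-3):
    """VBAE-soft: Beta-distributed gate with mean b and a small fixed variance."""
    nu = b * (1.0 - b) / var - 1.0          # valid when var < b * (1 - b)
    alpha, beta = b * nu, (1.0 - b) * nu
    return rng.beta(alpha, beta)

z_r = np.array([0.5, -0.2, 0.1])   # collaborative embedding
z_f = np.array([0.3, 0.4, -0.1])   # user-feature embedding
b = 0.8                            # inferred bandwidth for this user

gamma = soft_channel(b)
z = z_r + gamma * z_f              # gated fusion (our assumed additive form)
```

With the variance held small, the soft gate concentrates tightly around the bandwidth b, while the hard gate only matches b in expectation over repeated draws.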
To model the nonlinear generation process of user features and ratings from the corresponding latent variables, we parameterize the generative distributions as deep neural networks. The user feature x is generated from the user-feature latent variable z_f via a multilayer perceptron (MLP) f_x: if x is binary, we squash the output of f_x with the sigmoid function and draw x from Bernoulli(sigmoid(f_x(z_f))); if x is real-valued, we take the raw output of f_x as the mean of a Gaussian distribution and draw x from N(f_x(z_f), λ⁻¹ I). Finally, we put a multinomial likelihood on the ratings as in [27], and generate r from the latent user variable z via π(z) parameterized as softmax(f_r(z)), where f_r is another MLP-based generative network. The generation process of x and r from z_f and z is given as follows:
(1) For each layer l of the collaborative and the user-feature modules of the generation network:
(a) For each column n of the weight matrix W_l, draw W_l,n ~ N(0, λ_w⁻¹ I);
(b) Draw the bias vector b_l from N(0, λ_w⁻¹ I);
(c) For the hidden activation h_l of a user, draw h_l ~ δ(σ(W_l h_{l−1} + b_l)).
(2) For user features that are binary, draw x ~ Bernoulli(sigmoid(f_x(z_f)));
For user-item interactions, draw r ~ Mult(N_r, softmax(f_r(z))),
where λ_w is a hyperparameter, σ(·) is the intermediate activation function, and δ(·) is the Dirac delta function. Step 1.(c) can alternatively be viewed as putting Gaussian priors on the intermediate activations and setting the precision to infinity. The generative model of VBAE is described by the joint distribution of all observed and hidden variables:
p(r, x, z_r, z_f, γ | b) = p(z_r) p(z_f) p(γ | b) p_θ(x | z_f) p_θ(r | z_r, z_f, γ),   (1)
where θ denotes the set of trainable parameters that pertain to the generation network.
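The multinomial likelihood on the ratings can be sketched as below; the single-layer decoder with random weights is an illustrative stand-in for the MLP generative network:

```python
import numpy as np

rng = np.random.default_rng(1)
n_latent, n_items = 3, 5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W_dec = rng.normal(scale=0.1, size=(n_latent, n_items))  # stand-in for the decoder MLP
z = rng.normal(size=n_latent)        # fused user latent variable
pi = softmax(z @ W_dec)              # per-item probabilities, sums to 1

r = np.array([1.0, 0.0, 1.0, 1.0, 0.0])   # observed interactions
log_lik = float(r @ np.log(pi))           # multinomial log-likelihood (up to a constant)
```

Maximizing this log-likelihood pushes probability mass toward the items the user actually interacted with, which is what makes the reconstructed scores rankable.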
Given Eq. (1), however, it is intractable to calculate the posterior exactly, as the nonlinearity of the generative process precludes us from integrating over the latent space and calculating the marginal distribution of the observed evidence (r, x). Therefore, we resort to amortized variational inference [3], where we introduce a variational posterior q_φ, parameterized by an inference neural network, as an approximation to the true but intractable posterior. Using the conditional independence assumptions implied by VBAE, the joint variational posterior can be decomposed into the product of two factors that make up the two modules of the inference network: the collaborative module q_φ(z_r, γ | r) and the user feature module q_φ(z_f | x), which infer the user collaborative embedding, the bandwidth of the user-dependent channel, and the user feature embedding from the user ratings and features, respectively.
The collaborative module infers the collaborative latent user variable z_r and the personalized latent channel variable γ from the observed user-item interactions. Since the bandwidth of the channel denotes the sufficiency level of the collaborative information, an important role of this module is to disentangle the uncertainty and semantic information in the user ratings. To achieve this, we first use an MLP to embed the raw, sparse rating vector r into a compact hidden representation h for each user. Inspired by [24], we then use the length (L2-norm) of the hidden representation as the uncertainty measurement to calculate the channel bandwidth, and the direction (the L2-normalized hidden embedding) as the representation of the semantic information to infer the latent collaborative embedding. The introduced semantic and uncertainty measurements of the hidden rating embeddings are illustrated in Fig. 4; they draw inspiration from quantum mechanics. To see the link, we first liken the hidden rating embedding to a quantum superposition state of a physical system, which is represented by a complex vector. Such a superposition vector has the property that its norm is positively correlated with the probability that this superposition is observed when measuring the system (which depicts the fundamental uncertainty of quantum physics), while its direction distinguishes this superposition state from other states (which carries the state's semantic information). This uncertainty-semantic interpretation of the length and direction of a vector also works for the hidden rating embedding in recommendation, where the norm and direction of h can be associated with similar meanings. To gain intuition, we first decompose the calculation of the hidden rating embedding h from the raw rating vector r with a dense layer into two basic operations, embedding and element-wise sum, as follows:
h = W^T r = Σ_{j ∈ I_u} w_j,   (2)
where I_u denotes the set of items user u has interacted with and w_j is the embedding of item j (the jth row of the weight matrix W).
If we assume that each element of w_j, i.e., w_jk, is an independent and identically distributed (i.i.d.) Gaussian variable with zero mean and a small variance σ_0², the kth element of h, denoted h_k, is the sum of N_u independent Gaussian variables, where N_u is the number of items this user has interacted with, i.e., the density of the ratings. Therefore, h_k is also a Gaussian variable that follows N(0, N_u σ_0²). According to basic probability and statistics theory, the squared L2-norm of h, denoted ‖h‖₂², is the sum of squares of d Gaussian variables with zero mean and N_u σ_0² variance, which follows the scaled chi-square distribution N_u σ_0² · χ²_d and is equivalent to a Gamma(d/2, 2 N_u σ_0²) distribution [19]. The expected value of ‖h‖₂², according to the properties of the Gamma distribution, is d N_u σ_0², which is a monotonically increasing function of the number of interacted items N_u, i.e., the rating density, a main indicator of the sufficiency level of collaborative information. This property reveals that ‖h‖₂ is positively correlated with the sufficiency level of the collaborative information. Moreover, compared to the L2-norm of the sparse rating vector r, the hidden embedding h lies in a low-dimensional latent space where user collaborative representations are more compact, which better reflects the similarity of user rating patterns by eliminating the redundant information contained in similar items. Therefore, the L2-norm of h contains important information regarding interaction sparsity and interaction similarity that is suitable for inferring the bandwidth defined in the previous section. In addition, L2-normalization eliminates the negative influence of differences in the number of items users have interacted with by scaling the hidden rating embeddings of users with different activity levels onto the same sphere, which makes the direction of h a more suitable representation than the original h for inferring the user collaborative embedding. This is in contrast with previous autoencoder-based approaches such as Multi-VAE [27] and CondVAE [33], where the information regarding the sparsity level of the ratings is discarded after L2-normalization of the input ratings.
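The claim that the expected squared norm grows linearly with the number of interactions can be checked with a quick Monte Carlo simulation (the dimension d and the element-wise standard deviation σ₀ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma0 = 32, 0.1   # embedding dimension and element-wise std (illustrative)

def mean_sq_norm(n_interactions, trials=2000):
    # h is the sum of n i.i.d. N(0, sigma0^2) item embedding vectors, so
    # E[||h||^2] = d * n * sigma0^2 (a scaled chi-square with d degrees of freedom).
    h = rng.normal(scale=sigma0, size=(trials, n_interactions, d)).sum(axis=1)
    return float((h ** 2).sum(axis=1).mean())
```

Doubling the number of interactions should roughly double the mean squared norm, matching the d·N_u·σ₀² expectation derived above.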
In practice, we calculate the bandwidth by a linear transformation of ‖h‖₂ followed by a sigmoid activation function to squash the value into [0, 1]:
b = sigmoid(w_b ‖h‖₂ + c_b).   (3)
Since ‖h‖₂ is strictly non-negative, in backward propagation the weight w_b can only be updated in one direction for all the samples in a mini-batch. This is problematic, since the training loop would make b converge to zero or one, where the bandwidths for all users are identical, failing to discriminate users with different sufficiency levels of collaborative information. This is referred to as the modality collapsing problem in variational inference. In this article, the problem is addressed through batch normalization [17]. We denote a mini-batch of the lengths of the user hidden embeddings as {‖h⁽¹⁾‖₂, ..., ‖h⁽ᴮ⁾‖₂}. Before feeding each sample in the mini-batch into Eq. (3) to calculate the bandwidth, we renormalize it by:
‖h‖₂ ← (‖h‖₂ − μ_B) / sqrt(σ_B² + ε),   (4)
where μ_B and σ_B² are the sample mean and variance of ‖h‖₂ in the mini-batch, and ε is a small value that avoids division-by-zero errors. With batch normalization, we observe that the inferred bandwidth corresponds more consistently with the sufficiency level of the user collaborative information, and the training procedure becomes more stable as well. In the testing phase, μ_B and σ_B² are fixed to their running averages estimated from the training data.
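Eqs. (3)-(4) amount to the following per-mini-batch computation; the scalar weight w_b and bias c_b are illustrative, and the sign of the learned w_b determines whether longer embeddings map to larger or smaller bandwidth:

```python
import numpy as np

def bandwidth(h_norms, w_b=1.0, c_b=0.0, eps=1e-5):
    """Batch-normalize the embedding lengths (Eq. 4), then squash into (0, 1) (Eq. 3)."""
    mu, var = h_norms.mean(), h_norms.var()
    normed = (h_norms - mu) / np.sqrt(var + eps)
    return 1.0 / (1.0 + np.exp(-(w_b * normed + c_b)))

# Embedding lengths for a mini-batch of users with increasing activity levels.
h_norms = np.array([0.5, 1.0, 2.0, 4.0])
b = bandwidth(h_norms)
```

Because the normalized lengths are centered within each mini-batch, the gradient of w_b can take either sign across samples, which is exactly what prevents the bandwidth from collapsing to a constant.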
The user feature module infers the feature-based latent user embedding z_f from the extracted user features x by another MLP, which serves as the auxiliary information source to the collaborative information. The detailed inference process of the latent variables z_r, z_f, γ, and b through the inference network is described as follows:
(1) For each layer l of the collaborative and user-feature modules of the inference network:
(a) For each column n of the weight matrix W_l, draw W_l,n ~ N(0, λ_w⁻¹ I);
(b) Draw the bias vector b_l from N(0, λ_w⁻¹ I);
(c) For the hidden activation h_l of a user, draw h_l ~ δ(σ(W_l h_{l−1} + b_l)).
(2) For the user-dependent channel variable:
(a) Calculate the bandwidth b from its logits, which are inferred from the L2-norm of the hidden rating embedding h as follows: b = sigmoid(w_b ‖h‖₂ + c_b);
(b) For VBAE-hard, draw the Bernoulli channel variable: γ ~ Bernoulli(b);
(c) For VBAE-soft, draw the Beta channel variable: γ ~ Beta(b, s²), in the mean-variance parameterization with the variance s² fixed to a small value.
(3) For the collaborative and user feature latent variables:
(a) Draw the mean and standard deviation: [μ, log σ] ~ δ(g(·)), where g denotes the output layer of the corresponding module;
(b) Draw the sample of the latent variable: z ~ N(μ, diag(σ²)).
It is not trivial to draw samples from the Bernoulli or Beta channel variable such that gradients can be backpropagated to the trainable weights of the inference network; we defer the discussion of the solution, the reparameterization trick, to Section 3.7. In VBAE, b is inferred from the observed ratings via a nonlinear neural network, so the explicit uncertainty caused by sparsity can be directly captured from the input r. In addition, since the collaborative embedding variables are inferred in an amortized manner, the implicit uncertainty can be captured by the weights, which are actively learned and shared among all users. The user latent variable z simultaneously captures the user collaborative similarity based on user interactions and the user feature similarity from the extracted user features. The neural network implementation of VBAE is schematically illustrated in Fig. 2.
To jointly learn the parameters of the generative network and the inference network, we maximize the evidence lower bound (ELBO), which is an approximation of the marginal log-likelihood of the evidence log p(r, x):
L(θ, φ) = E_{q_φ}[log p_θ(r, x | z_r, z_f, γ)] − KL(q_φ(z_r, z_f, γ | r, x) ‖ p(z_r, z_f, γ)),   (5)
and the value of L, for a fixed θ, achieves the maximum log p_θ(r, x) if and only if the discrepancy between the variational approximation and the true posterior, measured by the Kullback-Leibler (KL) divergence, is zero (i.e., iff q_φ equals the true posterior).
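The sampling steps above require gradients to flow through the draws; the paper defers its reparameterization tricks to Section 3.7, but a common pair of choices, a Gaussian reparameterization for the latent embeddings and a Gumbel-sigmoid relaxation for the Bernoulli channel (an illustrative stand-in, not necessarily the paper's exact trick), can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_gaussian(mu, log_sigma):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and sigma."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(log_sigma) * eps

def relaxed_bernoulli(b, temperature=0.5):
    """Gumbel-sigmoid relaxation of gamma ~ Bernoulli(b) (illustrative stand-in)."""
    u = rng.uniform(1e-8, 1 - 1e-8)
    logistic_noise = np.log(u) - np.log(1.0 - u)
    logit_b = np.log(b) - np.log(1.0 - b)
    return 1.0 / (1.0 + np.exp(-(logit_b + logistic_noise) / temperature))

z = reparam_gaussian(np.zeros(3), np.full(3, -1.0))
gamma = relaxed_bernoulli(0.8)
```

In both cases the randomness is pushed into a parameter-free noise source, so the bandwidth and embedding parameters receive gradients through deterministic transforms.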
Although the collaborative and user feature modules of VBAE can be jointly trained with Eq. (5), this incurs extra computational and memory costs. In addition, under joint training, the model may converge to an undesirable suboptimum where it relies on a single information source for the recommendation; this is undesirable because the user ratings and features contain complementary information, both of which matter for recommendations. Therefore, we take an EM-like optimization approach, where we iteratively consider only one of the variational distributions and fix the random variables that concern the other (e.g., to their means or to previous estimates). Consequently, for the collaborative part, we fix the feature embedding to its estimated mean, and the objective becomes:
(6)  
Eq. (6) can alternatively be viewed as paying a cost, equal to the KL divergence from the prior, whenever the model tries to access the noisy user feature information. Moreover, the cost is dynamically decided by the urgency of introducing the extra user feature information, based on the sufficiency of the collaborative information, which prevents the model from depending unnecessarily upon the noisy features. After a one-step optimization of the collaborative part, we then fix the collaborative variables to their estimated values and maximize the following objective for the user feature part:
(7)  
Intuitively, for both optimization steps, the objective consists of two parts. The first part is the expected log-likelihood term, where the inferred hidden Gaussian embeddings and the latent channel variable are encouraged to best explain the extracted user features and the observed historical interactions. The second part comprises the KL-with-prior terms and the L2 weight decay terms, which act as regularizers to prevent overfitting and avoid polarization of embeddings due to the sparsity of interactions [34]. Liang et al. [27] have shown that the KL regularization in the collaborative part can be too strong and over-constrain the representational ability of the latent collaborative embeddings. As a solution, they introduced a scalar to control the weight of the KL term for the latent collaborative variable in Eq. (6), which has its theoretical foundation in both beta-VAE [14] and variational information bottleneck theory [1]. We anneal this scalar from 0 to 0.2 as in [27] in our implementation of VBAE, and have confirmed the effectiveness of KL annealing in our experiments. Under such a setting, the model learns to encode as much information as it can in the latent variable in the initial training stages, while gradually regularizing it by forcing it close to the prior as training proceeds [4].
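The KL annealing schedule described above can be sketched as follows. The linear ramp is an assumption borrowed from the annealing procedure of [27]; the text above only fixes the 0-to-0.2 range, not the exact shape.

```python
def kl_weight(step, anneal_steps, beta_cap=0.2):
    """Anneal the KL weight from 0 to beta_cap over anneal_steps training
    steps, then hold it constant (linear ramp assumed, as in MultiVAE)."""
    return min(beta_cap, beta_cap * step / anneal_steps)
```

The returned weight multiplies the KL term of the collaborative latent variable in the training objective, so early training is dominated by reconstruction and regularization strengthens gradually.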
In this section, we derive the gradients of the two objectives w.r.t. the trainable parameters of the generation and inference networks to make them amenable to SGD optimization. For Gaussian and Bernoulli distributions, the KL divergence from the prior can be computed analytically, so the minimization of the KL terms in Eqs. (6) and (7) w.r.t. the weights of the inference network can be calculated in closed form. However, the gradients of the expected log-likelihood terms need to be backpropagated through stochastic nodes, which precludes an analytic solution. Hence, we introduce Monte Carlo methods to form unbiased estimators of the gradients. For the generative network, as the generative distribution is explicit in expectation form, its gradient can be estimated by generating samples from the encoder distribution, calculating the gradients, and taking the average [35]. As for the gradients associated with the inference of user feature and collaborative embeddings, we use the reparameterization trick, which transforms the stochastic nodes into differentiable bivariate functions of their parameters and random noises so that gradients can pass through the distribution parameters [9]. Specifically, for the Gaussian embedding variables, with the vanilla reparameterization trick [21], samples can be reformulated via:
(8) z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
where the noise is drawn from a standard Gaussian. Eq. (8) can alternatively be viewed as injecting Gaussian noise into the hidden user collaborative and user feature variables, which is the main mechanism that previous autoencoder-based recommender systems adopt to address the rating and feature noise problems.
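A minimal NumPy sketch of the reparameterized Gaussian sampling in Eq. (8); the variable names are illustrative:

```python
import numpy as np

def reparameterize_gaussian(mu, log_sigma, rng):
    """Eq. (8): z = mu + sigma * eps with eps ~ N(0, I), so the sample is a
    differentiable function of (mu, sigma) and gradients pass through them."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
# with a near-zero sigma the sample collapses to the mean
z = reparameterize_gaussian(mu, log_sigma=np.log(1e-8) * np.ones(2), rng=rng)
```

Because the randomness is isolated in the noise term, the sampled embedding stays differentiable in the distribution parameters predicted by the inference network.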
Injecting Gaussian noise into the user latent collaborative and feature embeddings can only simulate the generation process of low-level noise. However, as argued in the Introduction, pervasive high-level noise, i.e., information that is irrelevant to the recommendation purpose, exists in the extracted user features. Therefore, we introduce the user-dependent channel to avoid excessive model reliance on these features for better generalization. For the channel in VBAE-hard, which follows a Bernoulli distribution, we note that sampling from it is equivalent to sampling a one-hot vector from a two-class Categorical distribution with the corresponding probability mass and discarding the second dimension. Therefore, we resort to the Gumbel-softmax trick [18] and reformulate samples of the channel via the Concrete distribution [31]:
(9) \gamma = \mathrm{sigmoid}\left(\frac{\log\alpha + g_1 - g_2}{\tau}\right), \quad g_1, g_2 \sim \mathrm{Gumbel}(0, 1)
where the temperature parameter controls the sharpness of the softmax and the sigmoid. When the temperature approaches zero, the samples provably become equivalent to samples drawn from the corresponding Bernoulli distribution. In practice, the temperature is generally annealed as training proceeds for more stable convergence.
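A sketch of the relaxed Bernoulli sampling in Eq. (9), assuming the standard binary-Concrete form in which the difference of two i.i.d. Gumbel noises perturbs the logits before a temperature-scaled sigmoid:

```python
import numpy as np

def sample_binary_concrete(logits, tau, rng):
    """Eq. (9) sketch: perturb the Bernoulli logits with the difference of two
    i.i.d. Gumbel noises, then apply a temperature-scaled sigmoid. As the
    temperature tau -> 0, samples concentrate on {0, 1}."""
    g1, g2 = rng.gumbel(size=(2,) + np.shape(logits))
    return 1.0 / (1.0 + np.exp(-(logits + g1 - g2) / tau))

rng = np.random.default_rng(0)
# with a small temperature, almost all samples land near 0 or 1
samples = sample_binary_concrete(np.zeros(1000), tau=0.01, rng=rng)
```

Unlike a hard Bernoulli draw, every sample here is a smooth function of the logits, so gradients can flow back to the bandwidth-inference weights.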
Similarly, we draw the Beta channel variable of VBAE-soft by keeping the first dimension of a sample from the corresponding two-class Dirichlet distribution. However, unlike Gaussian and Categorical distributions, there is no consensus regarding how to reparameterize a Dirichlet variable [20, 40, 7]. In this article, we eschew the commonly used reparameterization strategies that transform a uniformly distributed vector by the inverse of the Dirichlet cumulative distribution function, and instead derive the reparameterization via a logistic-normal approximation [40]. The reason is that the logistic-normal distribution converts the original concentration parameters of the Dirichlet (the values that would have to be predicted by the inference network) into the mean and standard deviation of a Gaussian distribution, such that convergence to a low-entropy area is smoother. Otherwise, reaching a low-variance area of the Dirichlet requires large concentration values, which are difficult for the inference network to learn and result in unstable training dynamics [7]. The relationship between the parameters of the logistic-normal and the corresponding Dirichlet is formulated as follows:
(10)
Note that we fix the variance of the logistic-normal to a small value, as only the mean of the Beta channel variable (i.e., the bandwidth) is important here, and a small variance prevents the Dirichlet distribution from getting stuck in a low-entropy area. This is also a common trick in most regression tasks, where the outputs are assumed to be Gaussian with a variance that is trivial to model. The sample from the logistic-normal is drawn according to
(11) \gamma = \mathrm{sigmoid}\left(\mu_{\gamma} + \sigma_{\gamma}(\epsilon_1 - \epsilon_2)\right), \quad \epsilon_1, \epsilon_2 \sim \mathcal{N}(0, 1)
where the noises are i.i.d. standard Gaussian. A close look at Eq. (11) shows that it bears great similarity to Eq. (9): the perturbed term can be viewed as pseudo logits of the bandwidth, and both equations add and subtract two i.i.d. random variables to the logits of the bandwidth before squashing the result into (0, 1) with the sigmoid function. The major difference is that in Eq. (9), a small temperature of the sigmoid pushes the value toward 0 or 1, i.e., the two extremities, such that for each user the channel is either open or closed in one iteration. In Eq. (11), in contrast, the temperature is effectively one, so the sample can take any value in (0, 1), which avoids abrupt swings of the gradient direction during training and smooths convergence.
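The logistic-normal sampling of Eq. (11) can be sketched analogously; the pseudo-logit of the bandwidth and the fixed small noise scale below are illustrative stand-ins for the parameters derived from the Dirichlet in Eq. (10):

```python
import numpy as np

def sample_soft_channel(logit_bw, sigma, rng):
    """Eq. (11) sketch: perturb the pseudo-logit of the bandwidth with the
    difference of two i.i.d. standard Gaussian noises scaled by a small fixed
    sigma, then squash with a sigmoid at temperature 1."""
    e1, e2 = rng.standard_normal(size=(2,) + np.shape(logit_bw))
    return 1.0 / (1.0 + np.exp(-(logit_bw + sigma * (e1 - e2))))

rng = np.random.default_rng(0)
# a logit of 0 corresponds to a mean bandwidth of 0.5; samples stay nearby
gamma = sample_soft_channel(np.zeros(1000), sigma=0.05, rng=rng)
```

With a small fixed sigma the samples concentrate tightly around the mean bandwidth, which is exactly the smooth, soft gating behavior contrasted with the near-binary samples of Eq. (9).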
With the stochastic user latent embedding variable and the channel variable reparameterized with the strategies we introduce above, the unbiased gradient estimator of the objective w.r.t. can be formulated as:
(12)  
where the symbol denotes that the RHS is an unbiased estimator of the LHS. Since the variance of gradients estimated by the reparameterization trick is low, previous work has demonstrated that as long as the batch size is large enough, a single sample per user suffices for the training to converge [9].
After the weights of the generative and inference networks of VBAE are learned, we turn to predicting new relevant items for users given their observed ratings and noisy features. For a user, we first calculate the mean of the collaborative embedding and the bandwidth from the ratings via the collaborative inference network, and the mean of the feature embedding from the user features via the feature inference network. The user latent variable can then be approximated as:
(13) 
To avoid the randomness of the channel at test time, for VBAE-hard we set the channel variable to a fixed sample to determine whether information from the user feature embeddings needs to be introduced to support the recommendation, whereas for VBAE-soft we use the mean of the Beta channel variable (approximated by the logistic-normal), i.e., the bandwidth itself. Finally, we calculate the multinomial probabilities of the remaining items as:
(14) 
where the approximation is due to the nonlinearity of the generation network, and the estimated logits of the probabilities of unobserved items are sorted to obtain the final ranked list of items for recommendation.
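The prediction procedure can be sketched as follows. The additive, bandwidth-gated fusion and the single linear layer standing in for the generation network are assumptions for illustration, since the exact forms of Eqs. (13) and (14) are elided in the text above:

```python
import numpy as np

def predict_top_k(mu_collab, mu_feat, bandwidth, W_dec, observed, k=5):
    """Prediction sketch (Eqs. 13-14): fuse the collaborative mean with the
    bandwidth-gated feature mean, decode logits over all items, mask the
    already-observed items, and return the top-k unobserved items."""
    z_user = mu_collab + bandwidth * mu_feat   # bandwidth-gated fusion
    logits = z_user @ W_dec                    # decoder logits over all items
    logits = np.where(observed.astype(bool), -np.inf, logits)  # drop seen items
    return np.argsort(-logits)[:k]             # ranked list of unobserved items

rng = np.random.default_rng(0)
mu_c, mu_f = rng.normal(size=(2, 8))
W_dec = rng.normal(size=(8, 20))
observed = np.zeros(20)
observed[[1, 3]] = 1                           # items the user already rated
top = predict_top_k(mu_c, mu_f, bandwidth=0.3, W_dec=W_dec, observed=observed)
```

Masking observed items with negative infinity before sorting guarantees they never appear in the recommendation list, matching the evaluation protocol on held-out items.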
Table 1: Statistics of the three datasets after preprocessing.

                  citeulike-a    citeulike-t    toys & games
# users           5,551          5,031          14,706
# items           16,980         21,133         11,722
% interactions    0.217%         0.114%         0.072%
max/min #visits   403/10         1,932/5        546/5
avg±std #visits
# features        8,000          20,000         8,000
In this section, we present and analyze the extensive experiments conducted on three real-world datasets to demonstrate the effectiveness of the proposed VBAE model for hybrid recommender systems.
We use three real-world datasets to evaluate model performance. Two of the datasets, citeulike-a [43] and citeulike-t [44], are from CiteULike, where scholars can add academic articles they are interested in to their libraries so that new relevant articles can be automatically recommended. The third dataset, toys & games, is collected by [12] from Amazon (https://nijianmo.github.io/amazon/index.html). In preprocessing, we randomly split the users by the ratio of 8:1:1 for training, validation, and testing. For each user, 80% of the interactions are selected as the observed interactions to learn the user collaborative embedding and the bandwidth of the channel, and the remaining 20% are held out for testing. The user profiles are built from the features of their interacted items. We represent each article in the citeulike datasets by the concatenation of its title and abstract, and each item in toys & games by combining all of its reviews. We then select discriminative words according to their tf-idf values and normalize the word counts of each item over the maximum occurrences of each word in all items. Finally, we calculate the element-wise maximum of the normalized word counts of the observed items for each user as the user features. Table 1 summarizes the details of the datasets after preprocessing. Fig. 5 illustrates the distributions of interaction density for different users; the distribution exhibits a clear long-tail characteristic, which reflects the uneven distribution of the sufficiency of collaborative information among users in all three datasets.
Two ranking-based metrics are used to evaluate the recommendation performance: Recall@K and the truncated normalized discounted cumulative gain (NDCG@K). We do not use the precision metric, since the rating matrices in all three datasets record implicit feedback, where a zero entry does not necessarily imply that the user is not interested in the item; it may simply mean that the user is unaware of its existence [16]. For a user u, we first obtain the rank of the held-out items by sorting their multinomial probabilities calculated as in Eq. (14). If we denote the item at rank k by ω(k) and the set of held-out items for the user by I_u, Recall@K is calculated as:
(15) \mathrm{Recall@}K(u) = \frac{\sum_{k=1}^{K} \mathbb{I}\left[\omega(k) \in I_u\right]}{\min(K, |I_u|)}
where \mathbb{I} in the numerator is the indicator function, and the denominator is the minimum of K and the number of held-out items. Recall@K has a maximum of 1, achieved when all relevant items are ranked among the top K positions. Truncated discounted cumulative gain (DCG@K) is computed as
(16) \mathrm{DCG@}K(u) = \sum_{k=1}^{K} \frac{2^{\mathbb{I}\left[\omega(k) \in I_u\right]} - 1}{\log_2(k+1)}
which, instead of uniformly weighting all positions, applies a logarithmic discount over the ranks so that larger weights are given to recommended items that appear at higher ranks [48]. NDCG@K is calculated by normalizing DCG@K to [0, 1] by the ideal DCG@K, in which all relevant items are ranked at the top.
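The two metrics can be implemented directly. The snippet below assumes binary relevance, under which the DCG gain 2^rel − 1 reduces to 1 for held-out items and 0 otherwise:

```python
import numpy as np

def recall_at_k(ranked, holdout, k):
    """Recall@K (Eq. 15): hits among the top-K, over min(K, #held-out items)."""
    hits = sum(1 for item in ranked[:k] if item in holdout)
    return hits / min(k, len(holdout))

def ndcg_at_k(ranked, holdout, k):
    """NDCG@K (Eq. 16): binary-relevance DCG with a log2 rank discount,
    normalized by the ideal DCG where all relevant items are ranked on top."""
    dcg = sum(1.0 / np.log2(r + 2) for r, item in enumerate(ranked[:k]) if item in holdout)
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(k, len(holdout))))
    return dcg / idcg

ranked = [7, 2, 9, 4, 5]   # items sorted by predicted multinomial probabilities
holdout = {2, 4, 8}        # held-out relevant items for this toy user
```

Both metrics reach their maximum of 1 exactly when every held-out item appears at the top of the ranked list.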
Since the datasets we consider vary in both scale and scope, we select the structure and hyperparameters of VBAE based on the evaluation metrics on validation users through grid search (due to space limits, please refer to the JSON files released with the code for the searched optimal hyperparameters and model architectures for each of the three datasets). In VBAE, the sigmoid is used as both the intermediate and output activation. The weights of the inference network are tied to the generation network in the same way as [26] to more effectively learn representations of user features. Specifically, to avoid the component-collapsing problem where the inferred bandwidth is identical for all users, batch normalization [17] is applied to the L2-norm of the latent feature representations so that they have zero mean and unit variance before the inference of the bandwidth; in addition, a larger decay rate is applied to the weights of the dense layer for bandwidth inference for regularization. We first layer-wise pretrain the user feature network as the initial starting point for VBAE, and then iteratively train the collaborative network and the user feature network for 100 epochs. Adam is used as the optimizer with a batch size of 500 users. We randomly split the datasets into ten train/val/test splits as described in Section 4.1. For each split, we keep the model with the best NDCG@100 on the validation users and report the mean metrics on the test users over all splits. In this section, we compare the proposed VBAE with the following state-of-the-art collaborative and hybrid recommendation baselines to demonstrate its effectiveness:
FM (Factorization Machine) is a widely employed algorithm for hybrid recommendation with sparse inputs [11]. We use Bayesian parameter search as suggested in [6] to find the optimal hyperparameters and loss function on the validation users.
CTR [43] learns the topics of item content via latent Dirichlet allocation (LDA) and couples it with probabilistic matrix factorization (PMF) for collaborative filtering. We find the optimal hyperparameters and latent dimension through grid search.
CDL [47] replaces the LDA in CTR with a stacked Bayesian denoising autoencoder (SDAE) [42] to learn the item content embeddings in an endtoend manner. We set the mask rate of SDAE to 0.3 and search its architecture the same way as VBAE.
CVAE [26] further improves over CDL by utilizing a VAE in place of the Bayesian SDAE, where self-adaptive Gaussian noise is introduced to corrupt the latent item embeddings instead of corrupting the input features with zero masks.
MultiVAE [27] goes beyond the linear collaborative modeling of PMF by using a VAE with a multinomial likelihood to capture the user collaborative information in the ratings for recommendations.
CoVAE [5] utilizes the nonlinear MultiVAE as the collaborative backbone and incorporates item feature information by treating feature co-occurrences as pseudo training samples to collectively train the MultiVAE with the user ratings.
CondVAE [33] builds a user conditional VAE where the user features are used as the conditions. We extend the original CondVAE by replacing the categorical user features with the ones we build from the interacted items, which we find performs better on all datasets.
DICER [57] is an item-oriented autoencoder (IAE)-based recommender system where the item content information is utilized to learn disentangled item embeddings from the user ratings to achieve more robust recommendations.
RecVAE [38] improves over the MultiVAE by designing a new encoder architecture with a composite prior for user collaborative latent variables that leads to a more stable training procedure.
Table 2 summarizes the comparison between VBAE and the selected baselines. As can be seen, Table 2 comprises three parts. The middle part shows four hybrid baselines with a linear collaborative filtering module, i.e., matrix factorization (MF). Generally, performance improves with the representational ability of the utilized item embedding model. Specifically, CVAE, which uses a VAE to encode the item content information into Gaussian variables, performs consistently better than CDL and CTR on all three datasets. However, we also observe that simple methods such as FM can outperform some deep learning-based baselines (e.g., CDL on the citeulike-a dataset) when their parameters are systematically searched with a Bayesian optimizer [6].
The bottom part shows baselines that utilize deep neural networks (DNNs) as the collaborative module. MultiVAE and RecVAE can capture nonlinear similarities among users, so they improve almost consistently over the linear hybrid baselines when the datasets are comparatively dense (e.g., the citeulike-a dataset), even though they use no user or item side information. When the datasets get sparser, however, they cannot perform on par with the PMF-based hybrid recommenders that are augmented with item side information, due to the lack of sufficient collaborative information. Moreover, although CoVAE augments user ratings with item feature co-occurrences as extra pseudo training samples, it does not consistently outperform MultiVAE on all three datasets, which suggests that item feature co-occurrences do not necessarily imply user co-purchases. Treating user feature embeddings as the condition for the user collaborative embeddings, CondVAE achieves the best performance among all the nonlinear UAE-based baselines on the two denser citeulike datasets and performs on par with CVAE on the sparser Amazon toys & games dataset. DICER, the IAE-based recommender we include for comparison, shows clear merits when the dataset has a large item-to-user ratio (e.g., the citeulike-a and citeulike-t datasets). The reason may be that for IAE-based recommenders, the number of training samples is proportional to the number of items, whereas the number of trainable weights is proportional to the number of users, so a large item-to-user ratio ensures sufficient training samples and a reasonable number of trainable weights, guaranteeing good model generalization.
By simultaneously addressing the uncertainty of user ratings and the noise in user features, VBAE-soft and VBAE-hard outperform all baselines on all three datasets. Although the Bayesian SDAE in CDL and the VAEs in CVAE and CondVAE also have a denoising ability, in that they corrupt the item features, latent item embeddings, or latent user embeddings via masked or self-adaptive Gaussian noise, the noise they address is not recommendation-oriented and is therefore inevitably low-level. However, high-level and personalized noise (information that is not relevant to the recommendation purpose) exists pervasively in recommendation tasks and cannot be addressed by these models. In contrast, through the introduction of a user-dependent channel variable, VBAE actively decides how much information should be accessed from the user features, based on the information already contained in the ratings, via the quantum-inspired collaborative uncertainty measurement mechanism. This preserves personalized recommendation quality when the ratings are sparse, by incorporating sufficient user feature information, while improving generalization by avoiding unnecessary dependence on noisy user features when the collaborative information is sufficient.
In this section, we further demonstrate the effectiveness of the established information regulation mechanism in VBAE by answering the following two research questions:
RQ1: How do VBAE-hard and VBAE-soft perform compared to VBAE-like models that, instead of explicitly considering the personalized differences in collaborative uncertainty and feature noise across users, treat the fusion of user features as a fixed procedure for all users?
RQ2: How well does the inferred bandwidth correspond to the sparsity of interactions and, therefore, the scarcity of collaborative information? The answer to this question shows the effectiveness of the proposed quantum-inspired collaborative uncertainty measurement in distinguishing users with varied sufficiency levels of collaborative information.
To answer the first research question, we design the following three baseline models as ablation studies:
DBAE-pass uses an "all-pass" channel to link the user collaborative and feature networks, where all the information in the user feature embeddings is losslessly transferred to the corresponding user latent variables, irrespective of individual differences in the sufficiency of collaborative information;
DBAE-stop uses a "stop" channel where the user feature information is entirely blocked, and only the collaborative information is exploited to calculate the user latent variables. The difference between DBAE-stop and MultiVAE is that MultiVAE imposes L2-normalization on the input ratings, whereas DBAE-stop imposes it on the hidden rating embeddings to make it comparable with VBAE.
VAE-concat concatenates the user ratings and features as the inputs to the MultiVAE to reconstruct the ratings, instead of viewing their fusion from an information-theoretic perspective. The fusion can be viewed as learning a fixed weighted combination of user features and ratings.
The collaborative network structure of DBAE-pass and DBAE-stop is set to be the same as VBAE for a fair comparison. The comparison results are listed in Table 3.
As can be seen, among the five models we compare, DBAE-stop performs the worst on all datasets. Since DBAE-stop can be viewed as an altered version of MultiVAE [27] where L2-normalization is applied to the hidden representations rather than the input ratings and extra L2 penalties are imposed on the network weights, this confirms the previous finding that hybrid recommendation methods augmented with feature information usually outperform purely collaborative methods when the ratings are sparse [47, 26]. Comparatively, DBAE-pass is much harder to beat than DBAE-stop, since the deficiency of collaborative information for the large number of users with sparse interactions makes the auxiliary user features valuable for personalized recommendation, even if the features are noisy. Still, the two VBAE-based methods achieve better performance on all three datasets, which demonstrates that constraining the feature information accessible to users with sufficient collaborative information can indeed improve model generalization. Although VAE-concat uses a dense layer to learn a weighted combination of user features and ratings, it is over-parameterized and prone to overfitting when the datasets are sparse. Moreover, in VAE-concat the weights are fixed for all users, which ignores individual differences in the sufficiency of collaborative information. Accordingly, the two VBAE models also outperform VAE-concat on all three datasets. The superiority of the user-dependent bandwidth over the all-pass and VAE-concat models indicates that for users with more informative interactions (i.e., dense and overlapping with the interactions of other users), the collaborative information in the ratings is per se reliable for recommendations, and the noise introduced by the fusion of user features may outweigh the useful information and degrade the recommendation performance.
The explanation for the superiority of VBAE-soft over VBAE-hard could be that VBAE-soft uses a Beta channel variable with its variance fixed to a small value, so the feature embeddings are stably and smoothly discounted based on the bandwidth inferred from the user ratings. In contrast, the Bernoulli channel in VBAE-hard decides whether or not to access the user features with the inferred bandwidth as the access probability, which is coarser in granularity and makes training less stable than the Beta channel in VBAE-soft.
To further investigate the effectiveness of the user-dependent channels for users with different activity levels, we divide the test users into quartiles and report the NDCG@100 of each group in Fig. 6. When comparing with DBAE-stop, we mainly focus on users with low activity levels, since for these users DBAE-stop accesses no information from the user features, while VBAE-hard and VBAE-soft infer a large bandwidth that allows more information to be accessed from the features. The leftmost bar group in Fig. 6 shows that VBAE-hard and VBAE-soft significantly outperform DBAE-stop on all three datasets. This confirms that incorporating auxiliary feature information can alleviate the uncertainty of collaborative embeddings and improve recommendation performance when the ratings are extremely sparse, even if the user features are noisy. When comparing with DBAE-pass, on the other hand, we focus on users with high activity levels. Although VBAE-hard and VBAE-soft access less information from the user features for these users, the rightmost bar group of Fig. 6 shows that their NDCG@100 improves consistently. This indicates that for users with dense interactions, the collaborative information in the ratings is per se reliable for recommendations, and the noise introduced by the fusion of user features may outweigh the useful information and lower the recommendation performance. The improvement is more significant on the citeulike-t and toys & games datasets. Table 1 shows that users in these two datasets span a wider spectrum of activity levels, and therefore the reliability of the collaborative embeddings varies drastically across users. In such a case, the channel can better distinguish these users and allocate each user a suitable budget of user feature information when calculating the user latent variables for recommendations. To answer the second research question, we calculate several statistics of the inferred bandwidth over all test users: its average value, its variability across users, and its Pearson correlation coefficient (PCC) with the rating density, and report them in Table 3.
Table 3 shows that the bandwidth inferred through the proposed quantum-inspired collaborative uncertainty measurement varies across users with different rating sparsity levels. Moreover, the bandwidth has a PCC of over 0.8 with the density of user interactions on all the datasets. These results indicate that the channel in VBAE-hard and VBAE-soft can distinguish users with different amounts of collaborative information in their ratings and dynamically control the extra amount of information accessed from the user features based on the inferred bandwidth, which further demonstrates the effectiveness of the user-dependent channel. In addition, the average bandwidth of VBAE-hard is significantly larger than that of VBAE-soft on all three datasets. The reason could be that a large bandwidth helps to maintain the stability of the Bernoulli channel during training.
Although this article demonstrates the effectiveness of VBAE through its application to recommender systems, VBAE is a general framework applicable to any heterogeneous information system in which one information source is comparatively reliable but may be missing, whereas another source is abundant but susceptible to noise. One typical example of such a system outside recommendation is the "audio-assisted action recognition in the dark" task [52], which aims to detect actions in under-illuminated videos. In this task, the visual information is the more reliable modality for action prediction but can be missing due to bad illumination, whereas the audio track always accompanies the video but may contain much information irrelevant to action recognition (e.g., background music). To apply VBAE to such new tasks, the only mandatory change is to design a suitable per-data-point uncertainty measurement of the first information source, in place of the quantum-inspired measurement proposed here, which is tailored to user ratings, to dynamically decide the information allowed to be accessed from the second source, so that the model does not overfit to the noise in the auxiliary modality. We therefore speculate that VBAE could have a broader potential impact in data mining and heterogeneous information systems beyond recommendation.
In this article, we developed an information-driven generative model, the collaborative variational bandwidth autoencoder (VBAE), to address the uncertainty and noise problems associated with two heterogeneous sources, i.e., ratings and user features in recommender systems. In VBAE, we establish an information regulation mechanism to fuse the collaborative and feature information, where a user-dependent channel variable is introduced to dynamically control how much information should be accessed from the user features given the information already contained in the collaborative embedding. The channel alleviates the uncertainty problem when the ratings are sparse while improving the model's generalization with respect to noisy user features. The effectiveness of VBAE is demonstrated by extensive experiments conducted on three real-world datasets.
Autoencoders, unsupervised learning, and deep architectures. In Proc. ICML Workshop, pp. 37–49.
Multivariate linear regression models. In Applied Multivariate Statistical Analysis, pp. 360–417.
The Concrete distribution: a continuous relaxation of discrete random variables. In Proc. ICLR.
Stochastic backpropagation and approximate inference in deep generative models. In Proc. ICML, vol. 32, pp. 1278–1286.
Efficient relevance feedback for content-based image retrieval by mining user navigation patterns. IEEE Trans. Knowl. Data Eng. 23(3), pp. 360–372.
Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11(Dec), pp. 3371–3408.
Content-collaborative disentanglement representation learning for enhanced recommendation. In Proc. RecSys, pp. 43–52.