1. Introduction
Personalizing recommendations is a key factor in successful recommender systems, and is thus of great industrial and academic interest. Challenges arise with regard to both efficiency and effectiveness, especially for large-scale systems with tens to hundreds of millions of items and users.
Recommendation approaches based on collaborative filtering (CF), content-based filtering, and their combinations have been investigated extensively (see the surveys in (adomavicius2005toward; shi2014collaborative)), with CF-based systems being one of the major methods in this area. CF-based systems learn directly from either implicit feedback (e.g., clicks) or explicit feedback (e.g., ratings), where matrix factorization approaches have traditionally worked well (bennett2007netflix; Koren:2009:MFT:1608565.1608614). CF learns latent user and item representations by factorizing the interaction matrix between users and items, e.g., based on their click or rating history, such that the inner product can be used for computing user-item relevance. However, for new, unseen items (i.e., cold-start items), standard CF methods are unable to learn meaningful representations, and thus cannot recommend those items (and similarly for cold-start users). To handle these cases, content-aware approaches are used when additional content information is available, such as textual descriptions, and have been shown to improve upon standard CF-based methods (lian2015content).
In large-scale recommendation settings, providing top-K recommendations among all existing items using an inner product is computationally costly, and thus constitutes a practical obstacle to employing these systems at scale. Hashing-based approaches solve this by generating binary user and item hash codes, such that user-item relevance can be computed using the Hamming distance (i.e., the number of bit positions where two bit strings differ). The Hamming distance has a highly efficient hardware-level implementation, and has been shown to allow real-time retrieval among a billion items (shan2018recurrent). Early work on hashing-based collaborative filtering systems (karatzoglou2010collaborative; zhou2012learning; zhang2014preference) learned real-valued user and item representations, which were discretized into binary hash codes in a later step. Subsequent work focuses on end-to-end approaches, which improve upon the two-stage approaches by reducing the discretization error through optimizing the hash codes directly (zhang2016discrete; Liu:2019:CCC:3331184.3331206). Recent content-aware hashing-based approaches (Lian:2017:DCM:3097983.3098008; Zhang:2018:DDL:3159652.3159688) have been shown to perform well in both standard and cold-start settings; however, they share the common problem of generating cold-start item hash codes differently from standard items, which we claim is unnecessary and limits their generalizability in cold-start settings.
We present a novel neural approach for content-aware hashing-based collaborative filtering (NeuHash-CF) that is robust in cold-start recommendation settings. NeuHash-CF consists of two joint hashing components for generating user and item hash codes, which are connected in a variational autoencoder architecture. Inspired by semantic hashing (salakhutdinov2009semantic), the item hashing component learns to map an item's content information directly to a hash code, while maximizing its ability to reconstruct the original content information input. The user hash codes are generated directly from the user's id through a learned user embedding matrix, and are jointly optimized with the item hash codes to maximize the log likelihood of observing each user-item rating in the training data. Through this end-to-end trainable architecture, all item hash codes are generated in the same way, independently of whether the items are seen during training or not. We experimentally compare NeuHash-CF to state-of-the-art baselines, obtaining significant performance improvements of up to 12% NDCG and 13% MRR in cold-start recommendation settings, and up to 4% in standard recommendation settings. NeuHash-CF requires 2-4x fewer bits to obtain the same or better performance than the state of the art, which corresponds to a notable storage reduction.
In summary, we contribute a novel content-aware hashing-based collaborative filtering approach (NeuHash-CF), which in contrast to existing state-of-the-art approaches generates item hash codes in a unified way (not distinguishing between standard and cold-start items).
2. Related Work
The seminal work of das2007google used a Locality-Sensitive Hashing (gionis1999similarity) scheme, called MinHashing, for efficiently searching Google News, where a Jaccard measure for item sharing between users was used to generate item and user hash codes. Following this, karatzoglou2010collaborative used matrix factorization to learn real-valued latent user and item representations, which were then mapped to binary codes using random projections. Inspired by this, zhou2012learning applied iterative quantization (gong2012iterative), originally proposed for efficient hashing-based image retrieval, as a way of rotating and binarizing the real-valued latent representations. However, since the magnitudes of the original real-valued representations are lost in the quantization, the Hamming distance between two hash codes might not correspond to the original relevance (the inner product of the real-valued vectors) of an item to a user. To solve this, zhang2014preference imposed a constant norm constraint on the real-valued representations, followed by a separate quantization.

Each of the above approaches led to improved recommendation performance; however, they can all be considered two-stage approaches, where the quantization is done as a post-processing step rather than being part of the hash code learning procedure. Furthermore, post-processing quantization has been shown to lead to large quantization errors (zhang2016discrete), which motivated the investigation of approaches that learn the hash codes directly.
Next, we review (1) hashing-based approaches for recommendation with explicit feedback; (2) content-aware hashing-based recommendation approaches designed for the cold-start setting of item recommendation; and (3) the related domain of semantic hashing, by which our approach is partly inspired.
2.1. Learning to Hash Directly
Discrete Collaborative Filtering (DCF) (zhang2016discrete) was the first approach towards learning item and user hash codes directly, rather than through a two-step approach. DCF is based on a matrix factorization formulation with additional constraints enforcing the discreteness of the generated hash codes. DCF also investigated balance and decorrelation constraints to improve generalization by better utilizing the Hamming space. Inspired by DCF, zhang2017discrete proposed Discrete Personalized Ranking (DPR), a method designed for collaborative filtering with implicit feedback (in contrast to the explicit feedback in the DCF case). DPR optimized a ranking objective through AUC and regularized the hash codes using both balance and decorrelation constraints similar to DCF. While these and the earlier two-stage approaches have enabled highly efficient and improved recommendations, they are still inherently constrained by the limited representational ability of binary codes (in contrast to real-valued representations). To this end, Compositional Coding for Collaborative Filtering (CCCF) (Liu:2019:CCC:3331184.3331206) was proposed as a hybrid between discrete and real-valued representations. CCCF considers each hash code as consisting of a number of blocks, each of which is associated with a learned real-valued scalar weight. The block weights are used for computing a weighted Hamming distance, following the intuition that not all parts of an item hash code are equally relevant for all users. While this hybrid approach led to improved performance, it incurs a significant storage overhead (due to each hash code's block weights) and an increased computational runtime, since the weighted Hamming distance cannot use the efficient hardware-supported Hamming distance.
2.2. Content-aware Hashing
A common problem for collaborative filtering approaches, both binary and real-valued, is the cold-start setting, where a number of items have not yet been rated by any user. In this setting, approaches based solely on traditional collaborative filtering cannot generate representations for the new items. Inspired by DCF, Discrete Content-aware Matrix Factorization (DCMF) (Lian:2017:DCM:3097983.3098008) was the first hashing-based approach that also handled the cold-start setting. DCMF optimizes a multi-objective loss function, which most importantly learns hash codes directly by minimizing the squared rating error. Secondly, it also learns a latent representation for each content feature (e.g., each word in the content vocabulary), which is multiplied by the content features to approximate the learned hash codes, such that it can be used for generating hash codes in a cold-start setting. DCMF uses an alternating optimization strategy and, similarly to DCF, includes constraints enforcing bit balance and decorrelation. Another approach, Discrete Deep Learning (DDL) (Zhang:2018:DDL:3159652.3159688), learns hash codes similarly to DCMF, through an alternating optimization strategy solving a relaxed optimization problem. However, instead of learning latent representations for each content feature to solve the cold-start problem, DDL trains a deep belief network (hinton2006fast) to approximate the already learned hash codes based on the content features.

DCMF and DDL both primarily learn hash codes that are not designed for cold-start settings, and then, as a sub-objective, learn how to map content features to new compatible hash codes for the cold-start setting. In practice, this is problematic, as it corresponds to learning cold-start item hash codes based on previously learned hash codes from standard items, which we claim is unnecessary and limits their generalizability in cold-start settings. In contrast, our proposed NeuHash-CF approach does not distinguish between the two settings when generating item hash codes, but rather always bases the item hash codes on the content features through a variational autoencoder architecture. As such, our approach can learn a better mapping from content features to hash codes, since the mapping is learned directly, as opposed to learning it in two steps by approximating existing hash codes that have already been generated.
2.3. Semantic Hashing
The related area of Semantic Hashing (salakhutdinov2009semantic) aims to map objects (e.g., images or text) to hash codes, such that similar objects have a short Hamming distance between them. Early work focused on two-step approaches based on learning real-valued latent representations followed by a rounding stage (weiss2009spectral; zhang2010laplacian; zhang2010self). Recent work has primarily used autoencoder-based approaches, either with a secondary rounding step (chaidaroon2017variational), or through direct optimization of binary codes using Bernoulli sampling and straight-through estimators for backpropagation during training (shen2018nash; Hansen:2019:UNG:3331184.3331255; hansensemhashsigir2020). We draw inspiration from the latter approaches in the design of the item hashing component of our approach, as substantial performance gains over rounding-based approaches have previously been observed in the semantic hashing literature.
3. Hashing-based Collaborative Filtering
Collaborative filtering learns real-valued latent user and item representations, such that the inner product between a user and an item representation corresponds to the item's relevance to that specific user, where the ground truth is denoted as a user-item rating $R_{u,i}$. Hashing-based collaborative filtering learns hash codes, corresponding to binary latent representations, for users and items. We denote the $m$-bit user and item hash codes as $z_u \in \{-1,1\}^m$ and $z_i \in \{-1,1\}^m$, respectively. For estimating an item's relevance to a specific user in the hashing setting, the Hamming distance is computed instead of the inner product, as:
(1)  $D_H(z_u, z_i) = \sum_{j=1}^{m} \mathbb{1}\!\left[ z_u^{(j)} \neq z_i^{(j)} \right]$
Thus, the Hamming distance corresponds to summing the differing bits between the codes, which can be implemented very efficiently using hardware-level bit operations through the bitwise XOR and popcount operations. The relation between the inner product and the Hamming distance of hash codes is simply:
(2)  $z_u^{\top} z_i = m - 2 D_H(z_u, z_i)$
meaning it is trivial to replace realvalued user and item representations with learned hash codes in an existing recommender system.
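As a concrete illustration of Eq. 1 and 2, the following Python sketch computes the Hamming distance with XOR and popcount on integer-packed codes, and checks the identity between the inner product of $\{-1,1\}$ codes and the Hamming distance (the packing helper is only for illustration):

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance via XOR + popcount on integer-packed codes (Eq. 1)."""
    return bin(a ^ b).count("1")

def pack(bits):
    """Pack a list of -1/+1 values into an integer (a +1 sets the bit)."""
    code = 0
    for j, b in enumerate(bits):
        if b == 1:
            code |= 1 << j
    return code

def inner_product(bits_a, bits_b):
    """Inner product of two codes represented as lists of -1/+1 values."""
    return sum(x * y for x, y in zip(bits_a, bits_b))

m = 8
z_u = [1, -1, 1, 1, -1, -1, 1, -1]
z_i = [1, 1, -1, 1, -1, 1, 1, 1]

d_h = hamming_distance(pack(z_u), pack(z_i))
# Eq. 2: inner product = m - 2 * Hamming distance
assert inner_product(z_u, z_i) == m - 2 * d_h
```

On modern hardware the XOR and popcount steps map to single instructions, which is what enables the real-time, billion-scale retrieval mentioned above.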
3.1. Content-aware Neural Hashing-based Collaborative Filtering (NeuHash-CF)
We first give an overview of our model, Content-aware Neural Hashing-based Collaborative Filtering (NeuHash-CF), and then detail its components. NeuHash-CF consists of two joint components for generating user and item hash codes. The item hashing component learns to derive item hash codes directly from the content features associated with each item. The item hashing component has two optimization objectives: (1) to maximize the likelihood of the observed user-item ratings, and (2) the unsupervised objective of reconstructing the original content features. Through this design, all item hash codes are based on content features, thus directly generating hash codes usable in both standard and cold-start recommendation settings. This contrasts with existing state-of-the-art models (Zhang:2018:DDL:3159652.3159688; Lian:2017:DCM:3097983.3098008), which separate how standard and cold-start item hash codes are generated. Through this choice, NeuHash-CF not only generates higher quality cold-start item hash codes, but also improves the representational power of already observed items by better incorporating content features.
The user hashing component learns user hash codes, located within the same Hamming space as the item hash codes, by maximizing the likelihood of the observed user-item ratings, which is a shared objective with the item hashing component. Maximizing the likelihood of the observed user-item ratings influences the model optimization in relation to both user and item hash codes, while the unsupervised feature reconstruction loss of the item hashing component is focused only on the item hash codes. The aim of this objective combination is to ensure that the hash code distances enforce user-item relevance, but also that items with similar content have similar hash codes.
Next, we describe the architecture of our variational autoencoder (Section 3.2), followed by how users and items are encoded into hash codes (Section 3.3), decoded for obtaining a target value (Section 3.4), and lastly the formulation of the final loss function (Section 3.5). We provide a visual overview of our model in Figure 1.
3.2. Variational Autoencoder Architecture
We propose a variational autoencoder architecture for generating user and item hash codes, where we initially define the likelihood functions of each user and item as:
(3)  $p(u) = \prod_{i \in I_u} p(R_{u,i})$

(4)  $p(i) = \prod_{u \in U_i} p(R_{u,i})$

where $I_u$ is the set of all items rated by user $u$, $U_i$ is the set of all users who have rated item $i$, and $p(c_i)$ is the probability of observing the content of item $i$. We denote by $c_i \in \mathbb{R}^d$ the $d$-dimensional content feature vector (a bag-of-words representation) associated with each item, and denote its non-zero entries as $\mathcal{W}_i$. Thus, we can define the content likelihood similarly to Eq. 3 and 4:

(5)  $p(c_i) = \prod_{w \in \mathcal{W}_i} p(c_{i,w})$
In order to maximize the likelihood of the users and items, we need to maximize the likelihood of the observed ratings, $p(R_{u,i})$, as well as the word probabilities $p(c_{i,w})$. Since they must be maximized based on the generated hash codes, we assume that $R_{u,i}$ is conditioned on both $z_u$ and $z_i$, and that $c_{i,w}$ is conditioned on $z_i$. For ease of derivation, we choose to maximize the log likelihood instead of the raw likelihoods, such that the log likelihood of the observed ratings and item content can be computed as:
(6)  $\log p(u) = \sum_{i \in I_u} \log \sum_{z_u, z_i \in \{-1,1\}^m} p(R_{u,i} \mid z_u, z_i)\, p(z_u)\, p(z_i)$

(7)  $\log p(c_i) = \sum_{w \in \mathcal{W}_i} \log \sum_{z_i \in \{-1,1\}^m} p(c_{i,w} \mid z_i)\, p(z_i)$
where the hash codes are sampled by repeating $m$ consecutive Bernoulli trials, each of which, as a prior, is assumed to have equal probability of sampling either $-1$ or $1$. Thus, $p(z_u)$ and $p(z_i)$ can be computed simply as:
(8)  $p(z) = \prod_{j=1}^{m} p_j^{\frac{1 + z^{(j)}}{2}} \left(1 - p_j\right)^{\frac{1 - z^{(j)}}{2}}$
where $z^{(j)}$ is the $j$'th bit of a hash code (either user or item), and where we set $p_j = 0.5$ for equal sampling probability of $-1$ and $1$. However, optimizing the log likelihoods directly is intractable, so instead we maximize their variational lower bounds (kingma2014auto):
(9)  $\log p(R_{u,i}) \geq \mathbb{E}_{q_\phi(z_i \mid i),\, q_\psi(z_u \mid u)}\!\left[ \log p(R_{u,i} \mid z_u, z_i) \right] - \mathrm{KL}\!\left( q_\phi(z_i \mid i) \,\|\, p(z_i) \right) - \mathrm{KL}\!\left( q_\psi(z_u \mid u) \,\|\, p(z_u) \right)$

(10)  $\log p(c_i) \geq \mathbb{E}_{q_\phi(z_i \mid i)}\!\left[ \log p(c_i \mid z_i) \right] - \mathrm{KL}\!\left( q_\phi(z_i \mid i) \,\|\, p(z_i) \right)$
where $q_\phi(z_i \mid i)$ and $q_\psi(z_u \mid u)$ are learned approximate posterior probability distributions (see Section 3.3), and KL is the Kullback-Leibler divergence. Intuitively, the conditional log likelihood within the expectation term can be considered a reconstruction term, which represents how well either the observed ratings or the item content can be decoded from the hash codes (see Section 3.4). The KL divergence can be considered a regularization term, punishing large deviations from the Bernoulli distribution with equal sampling probability of $-1$ and $1$; it is computed analytically as:
(11)  $\mathrm{KL}\!\left( q_\phi(z_i \mid i) \,\|\, p(z_i) \right) = \sum_{j=1}^{m} q_\phi^{(j)} \log \frac{q_\phi^{(j)}}{p_j} + \left(1 - q_\phi^{(j)}\right) \log \frac{1 - q_\phi^{(j)}}{1 - p_j}$
with $p_j = 0.5$ for equal sampling probability. The KL divergence is computed similarly for the user hash codes using $q_\psi(z_u \mid u)$. Next, we describe how to compute the learned approximate posterior probability distributions.
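The analytic KL regularizer of Eq. 11 is straightforward to compute per bit; a minimal sketch (with `qs` denoting the per-bit posterior probabilities) is:

```python
import math

def bernoulli_kl(qs, p=0.5):
    """KL( Bernoulli(q_j) || Bernoulli(p) ) summed over the bits (Eq. 11).

    When q_j is exactly 0 or 1, only the certain outcome's term contributes.
    """
    kl = 0.0
    for q in qs:
        if q > 0.0:
            kl += q * math.log(q / p)
        if q < 1.0:
            kl += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return kl

# The regularizer vanishes at the prior and grows with deviation from it.
assert bernoulli_kl([0.5, 0.5]) == 0.0
assert bernoulli_kl([0.9]) > bernoulli_kl([0.6]) > 0.0
```

This is why the term punishes posteriors that drift far from the equal-probability prior: each bit's contribution is zero only when its sampling probability is exactly 0.5.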
3.3. Encoder Functions
The learned approximate posterior distributions $q_\phi(z_i \mid i)$ and $q_\psi(z_u \mid u)$ can be considered encoder functions for items and users, respectively, and are both modeled through a neural network formulation. Their objective is to transform users and items into $m$-bit hash codes.

3.3.1. Item encoding
An item is encoded based on its content through multiple layers to obtain the sampling probabilities used for generating its hash code:

(12)  $l_1 = \mathrm{ReLU}\!\left( W_1 \left( c_i \odot E_{\mathrm{imp}} \right) + b_1 \right)$

(13)  $l_2 = \mathrm{ReLU}\!\left( W_2 l_1 + b_2 \right)$
where $W_1, W_2$ and $b_1, b_2$ are learned weights and biases, $\odot$ is element-wise multiplication, and $E_{\mathrm{imp}} \in \mathbb{R}^d$ is a learned importance weight for scaling the content words, which has been used similarly for semantic hashing (Hansen:2019:UNG:3331184.3331255). Next, we obtain the sampling probabilities by transforming the last layer, $l_2$, into an $m$-dimensional vector:
(14)  $q_\phi(z_i \mid i) = \sigma\!\left( W_3 l_2 + b_3 \right)$
where $\sigma$ is the sigmoid function that scales the output to be between 0 and 1, and $\phi = \{W_1, W_2, W_3, b_1, b_2, b_3, E_{\mathrm{imp}}\}$ is the set of parameters used for the item encoding. We can now sample the item hash code from a Bernoulli distribution, which can be computed for each bit as:

(15)  $z_i^{(j)} = 2\left\lceil q_\phi(z_i \mid i)^{(j)} - \mu^{(j)} \right\rceil - 1$
where $\mu$ is an $m$-dimensional vector with values sampled uniformly at random from $[0,1]$. The model is trained using randomly sampled $\mu$ vectors, since this encourages model exploration: the same item may be represented by multiple different hash codes during training. However, to produce a deterministic output for testing once the model is trained, we fix each value within $\mu$ to 0.5 instead of a randomly sampled value.
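The sampling step of Eq. 14-15 can be sketched as follows; the straight-through estimator used for gradients (see the end of Section 3.3) is omitted, so this shows only the forward pass:

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sample_hash_code(logits, deterministic=False, rng=random):
    """Sample an m-bit {-1, +1} code from per-bit Bernoulli probabilities (Eq. 15).

    Training: mu_j ~ U(0, 1), so the same item can map to different codes
    across batches, encouraging exploration.  Testing: mu_j is fixed to 0.5,
    making the code deterministic (bit = +1 iff sigmoid(logit) > 0.5).
    """
    code = []
    for logit in logits:
        q = sigmoid(logit)                      # sampling probability for this bit
        mu = 0.5 if deterministic else rng.random()
        code.append(2 * math.ceil(q - mu) - 1)  # ceil maps {0, 1} -> {-1, +1}
    return code

logits = [2.3, -1.7, 0.1, -0.2]
assert sample_hash_code(logits, deterministic=True) == [1, -1, 1, -1]
```

Repeated calls with `deterministic=False` illustrate the exploration effect: the same logits can yield different codes across training iterations.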
3.3.2. User encoding
The user hash codes are learned similarly to the item hash codes; however, since we do not have a user feature vector, the hash codes are learned using only the user id. Thus, the sampling probabilities are learned as:
(16)  $q_\psi(z_u \mid u) = \sigma\!\left( E_{\mathrm{user}}^{\top} \delta_u \right)$
where $E_{\mathrm{user}}$ is the learned user embedding matrix, and $\delta_u$ is a one-hot encoding of user $u$. Following the same approach as for the item encoding, we can sample the user hash code based on $q_\psi(z_u \mid u)$ for each bit as:

(17)  $z_u^{(j)} = 2\left\lceil q_\psi(z_u \mid u)^{(j)} - \mu^{(j)} \right\rceil - 1$
where $\psi = \{E_{\mathrm{user}}\}$ is the set of parameters for the user encoding. During training and testing, we use the same sampling strategy as for the item encoding. For both users and items, we use a straight-through estimator (bengio2013estimating) to compute the gradients for backpropagation through the sampled hash codes.
3.4. Decoder Functions
3.4.1. Useritem rating decoding
The first decoding step aims to reconstruct the original user-item rating $R_{u,i}$, which corresponds to computing the conditional log likelihood of Eq. 9, i.e., $\log p(R_{u,i} \mid z_u, z_i)$. We first transform the user-item rating into the same range as the inner product between the hash codes:
(18)  $\hat{R}_{u,i} = 2m \frac{R_{u,i} - R_{\min}}{R_{\max} - R_{\min}} - m$
Similarly to (liang2018variational; sachdeva2019sequential), we assume the ratings are Gaussian distributed around their true mean for each rating value, such that we can compute the conditional log likelihood as:

(19)  $\log p(R_{u,i} \mid z_u, z_i) = \log \mathcal{N}\!\left( \hat{R}_{u,i} \,;\; z_u^{\top} z_i,\; \sigma_r^2 \right)$
where the variance $\sigma_r^2$ is constant, thus providing an equal weighting of all ratings. However, the exact value of the variance is irrelevant, since maximizing Eq. 19 corresponds to simply minimizing the squared error of the mean term, i.e., $(\hat{R}_{u,i} - z_u^{\top} z_i)^2$. Thus, maximizing the log likelihood is equivalent to minimizing the mean squared error (MSE), as similarly done in related work (zhang2016discrete; Lian:2017:DCM:3097983.3098008; Zhang:2018:DDL:3159652.3159688). Lastly, note that due to the equivalence between the inner product and the Hamming distance (see Eq. 2), this directly optimizes the hash codes for the Hamming distance.

3.4.2. Item content decoding
The secondary decoding step aims to reconstruct the original content features given the generated item hash code, as in Eq. 10, i.e., $p(c_{i,w} \mid z_i)$. We compute this as a sum of word log likelihoods (based on Eq. 5) using a softmax:
(20)  $\log p(c_i \mid z_i) = \sum_{w \in \mathcal{W}_i} \log \frac{ \exp\!\left( E_{\mathrm{imp}}^{(w)}\, z_i^{\top} E_{\mathrm{word}}\, \delta_w + b_w \right) }{ \sum_{w' \in \mathcal{W}} \exp\!\left( E_{\mathrm{imp}}^{(w')}\, z_i^{\top} E_{\mathrm{word}}\, \delta_{w'} + b_{w'} \right) }$
where $\delta_w$ is a one-hot encoding for word $w$, $\mathcal{W}$ is the set of all vocabulary words of the content feature vectors, $E_{\mathrm{word}}$ is a learned word embedding matrix, $b_w$ is a word-level bias term, and the learned importance weight $E_{\mathrm{imp}}$ is the same as in Eq. 12. This softmax expression is maximized when the item hash codes are able to decode the original content words.
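A sketch of the content decoder of Eq. 20, with the importance weighting omitted for brevity; `word_vectors` plays the role of the learned word embeddings multiplied by the one-hot encodings:

```python
import math

def content_log_likelihood(z_i, word_vectors, biases, item_words):
    """log p(c_i | z_i): sum of per-word softmax log likelihoods (Eq. 20).

    z_i: m-bit {-1, +1} item code; word_vectors: one m-dim vector per
    vocabulary word; biases: one scalar per word; item_words: indices of
    the words occurring in item i's content.
    """
    scores = [sum(a * b for a, b in zip(z_i, v)) + c
              for v, c in zip(word_vectors, biases)]
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return sum(scores[w] - log_norm for w in item_words)

# Toy example: a 4-bit code and a 3-word vocabulary.
z = [1, -1, 1, 1]
vocab = [[1, -1, 1, 1], [-1, 1, -1, -1], [1, 1, 1, -1]]
bias = [0.0, 0.0, 0.0]
# The code aligned with word 0 assigns it a higher likelihood than word 1,
# which is what pushes items with similar content toward similar codes.
assert content_log_likelihood(z, vocab, bias, [0]) > content_log_likelihood(z, vocab, bias, [1])
assert content_log_likelihood(z, vocab, bias, [0]) < 0.0  # log of a probability
```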
3.4.3. Noise infusion for robustness
Previous work on semantic hashing has shown that infusing random noise into the hash codes before decoding increases robustness and leads to more generalizable hash codes (shen2018nash; chaidaroon2018deep; Hansen:2019:UNG:3331184.3331255). Thus, we apply Gaussian noise to both user and item hash codes before decoding:
(21)  $\tilde{z} = z + \epsilon, \qquad \epsilon \sim \mathcal{N}\!\left( 0, \sigma^2 I \right)$
where variance annealing is used to decrease the initial value of $\sigma^2$ in each training iteration.
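Putting Sections 3.4.1 and 3.4.3 together, the rating decoder with noise infusion can be sketched as follows. The exact form of the rating transform in Eq. 18 is an assumption here (a linear map of ratings in [r_min, r_max] onto [-m, m], the range of the inner product of two m-bit codes), and the annealing rate of 0.01% per batch follows Section 4.5:

```python
import random

def scale_rating(r, m, r_min=1.0, r_max=5.0):
    """Map a rating in [r_min, r_max] onto [-m, m] (assumed form of Eq. 18)."""
    return 2.0 * m * (r - r_min) / (r_max - r_min) - m

def noisy_squared_error(r, z_u, z_i, sigma2, rng=random):
    """Squared error between the scaled rating and the inner product of the
    noise-infused codes; minimizing this maximizes the Gaussian log
    likelihood of Eq. 19 (the noise follows Eq. 21)."""
    m = len(z_u)
    noisy_u = [b + rng.gauss(0.0, sigma2 ** 0.5) for b in z_u]
    noisy_i = [b + rng.gauss(0.0, sigma2 ** 0.5) for b in z_i]
    inner = sum(a * b for a, b in zip(noisy_u, noisy_i))
    return (scale_rating(r, m) - inner) ** 2

# Variance annealing: decrease sigma2 by 0.01% after every batch.
sigma2 = 1.0
for _ in range(3):
    sigma2 *= 1.0 - 0.0001

assert scale_rating(1.0, 8) == -8.0 and scale_rating(5.0, 8) == 8.0
assert scale_rating(3.0, 8) == 0.0
```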
3.5. Combined Loss Function
NeuHash-CF can be trained in an end-to-end fashion by maximizing the combination of the variational lower bounds from Eq. 9 and 10, corresponding to the following loss:
(22)  $\mathcal{L} = \mathcal{L}_{\mathrm{rating}} + \alpha\, \mathcal{L}_{\mathrm{content}}$
where $\mathcal{L}_{\mathrm{rating}}$ corresponds to the lower bound in Eq. 9, $\mathcal{L}_{\mathrm{content}}$ corresponds to the lower bound in Eq. 10, and $\alpha$ is a tunable hyperparameter controlling the importance of decoding the item content.
4. Experimental Evaluation
4.1. Datasets
We evaluate our approach on well-known and publicly available datasets with explicit feedback, where we follow the same preprocessing as related work (Lian:2017:DCM:3097983.3098008; Zhang:2018:DDL:3159652.3159688; wang2011collaborative), as described in the following. If a user has rated the same item multiple times, we keep only the last rating. Due to the very high sparsity of these types of datasets, we apply a filtering to densify the dataset: we remove users who have rated fewer than 20 items, as well as items that have been rated by fewer than 20 users. Since the removal of either a user or an item may violate the density requirement, we apply the filtering iteratively until all users and items satisfy it. The datasets are described below and summarized in Table 1:
- Yelp: from the Yelp Challenge (https://www.yelp.com/dataset/challenge), which consists of user ratings and textual reviews of locations such as hotels, restaurants, and shopping centers. User ratings range from 1 (worst) to 5 (best), and most ratings are accompanied by a textual review.
- Amazon (he2016ups): from a collection of book reviews from Amazon (http://jmcauley.ucsd.edu/data/amazon/). Similarly to Yelp, each user rates a number of books from 1 to 5, and most ratings are accompanied by a textual review as well.
Similarly to related work (Lian:2017:DCM:3097983.3098008; Zhang:2018:DDL:3159652.3159688; wang2011collaborative), to obtain content information related to each item, we use the textual reviews (when available) written by users for that item. We filter stop words, aggregate all textual reviews for each item into a single large text, and compute TF-IDF bag-of-words representations, where the top 8000 unique words are kept as the content vocabulary. We apply this preprocessing step separately on each dataset, thus obtaining two different vocabularies.
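A sketch of this preprocessing, under assumptions the text does not pin down: whitespace tokenization, vocabulary selection by document frequency, and the raw-count × log(N/df) TF-IDF variant:

```python
import math
from collections import Counter

def build_tfidf(item_texts, vocab_size=8000, stopwords=frozenset()):
    """Aggregate each item's reviews into one text, drop stop words, keep the
    vocab_size most frequent words, and compute sparse TF-IDF vectors."""
    docs = [[w for w in text.lower().split() if w not in stopwords]
            for text in item_texts]
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each word
    vocab = [w for w, _ in df.most_common(vocab_size)]
    index = {w: j for j, w in enumerate(vocab)}
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(w for w in doc if w in index)
        vectors.append({index[w]: c * math.log(n_docs / df[w])
                        for w, c in tf.items()})
    return vectors, vocab

texts = ["great food great service", "terrible food", "great prices"]
vectors, vocab = build_tfidf(texts, vocab_size=4, stopwords={"food"})
assert "food" not in vocab   # stop word removed
assert len(vocab) <= 4       # vocabulary capped
```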
Table 1. Dataset statistics.

Dataset   #users   #items   #ratings    sparsity
Yelp      27,147   20,266   1,293,247   99.765%
Amazon    35,736   38,121   1,960,674   99.856%
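The iterative filtering described in Section 4.1 can be sketched as follows (the rating set shrinks until a fixed point is reached):

```python
from collections import Counter

def iterative_core_filter(ratings, min_count=20):
    """Iteratively remove users and items with fewer than min_count ratings
    until all remaining users and items satisfy the threshold.

    ratings: iterable of (user, item) pairs; returns the retained set.
    """
    ratings = set(ratings)
    while True:
        user_counts = Counter(u for u, _ in ratings)
        item_counts = Counter(i for _, i in ratings)
        kept = {(u, i) for u, i in ratings
                if user_counts[u] >= min_count and item_counts[i] >= min_count}
        if kept == ratings:  # fixed point: nothing more to remove
            return ratings
        ratings = kept

# Toy example with a threshold of 2: user "c" has one rating, and removing
# it leaves item 2 unaffected, since it is still rated by users "a" and "b".
toy = {("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 2)}
assert iterative_core_filter(toy, min_count=2) == {("a", 1), ("a", 2), ("b", 1), ("b", 2)}
```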
4.2. Experimental Design
Following wang2011collaborative, we use two types of recommendation settings: (1) in-matrix regression, for estimating the relevance of known items with existing ratings, and (2) out-of-matrix regression, for estimating the relevance of cold-start items. These two recommendation types lead to different evaluation setups, as described next.
4.2.1. In-matrix regression
In-matrix regression can be considered the standard setup where all items (and users) are known at all times, and thus corresponds to the setting solvable by standard collaborative filtering. We split each user's items into training and testing sets using a 50/50 split, and use 15% of the training set as a validation set for hyperparameter tuning.
4.2.2. Out-of-matrix regression
Out-of-matrix regression is also known as a cold-start setting, where new items are to be recommended. In contrast to in-matrix regression, this task cannot be solved by standard collaborative filtering. We sort all items by their number of ratings, and then proportionally split them 50/50 into training and testing sets, such that each set has approximately the same number of items with a similar number of ratings. As in the in-matrix regression setting, we use 15% of the training items as a validation set for hyperparameter tuning.
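One way to realize this proportional split (an assumed concretization: alternating the rating-count-sorted items between the two sets):

```python
def out_of_matrix_split(item_rating_counts, val_frac=0.15):
    """Split items 50/50 into train/test with similar rating-count profiles,
    and hold out a fraction of the training items for validation.

    item_rating_counts: dict mapping item id -> number of ratings.
    Returns (train, validation, test) lists of item ids.
    """
    ranked = sorted(item_rating_counts, key=item_rating_counts.get, reverse=True)
    train, test = ranked[0::2], ranked[1::2]  # alternate down the ranking
    n_val = int(len(train) * val_frac)
    return train[n_val:], train[:n_val], test

counts = {f"item{k}": 100 - k for k in range(10)}
train, val, test = out_of_matrix_split(counts)
assert len(train) + len(val) == 5 and len(test) == 5
assert not (set(train) | set(val)) & set(test)  # train/val and test are disjoint
```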
4.3. Evaluation Metrics
We evaluate the effectiveness of our approach and the baselines as a ranking task, with the aim of placing the most relevant (i.e., highest rated) items at the top of a ranked list. As detailed in Section 4.2, each user has a number of rated test items, and the ranked list is produced by sorting each user's test items by the Hamming distance between the user and item hash codes. To measure the quality of the ranked list, we use Normalized Discounted Cumulative Gain (NDCG), which incorporates both ranking precision and the position of ratings. Secondly, we are interested in the position of the first item with the highest rating, as this ideally should be at the top. To this end, we compute the Mean Reciprocal Rank (MRR) of the highest ranked item carrying the highest rating among the user's test items.
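For reference, one standard formulation of the two metrics (the exact gain function used for NDCG is not spelled out in the text; this sketch uses the raw rating as gain with a log2 discount):

```python
import math

def ndcg_at_k(gains_in_predicted_order, k):
    """NDCG@k: DCG of the predicted ranking divided by the ideal DCG."""
    def dcg(gains):
        return sum(g / math.log2(rank + 2)
                   for rank, g in enumerate(gains[:k]))
    idcg = dcg(sorted(gains_in_predicted_order, reverse=True))
    return dcg(gains_in_predicted_order) / idcg if idcg > 0 else 0.0

def mrr_highest_rated(gains_in_predicted_order):
    """Reciprocal rank of the first item carrying the highest rating."""
    best = max(gains_in_predicted_order)
    return 1.0 / (gains_in_predicted_order.index(best) + 1)

assert ndcg_at_k([5, 4, 3], 3) == 1.0        # perfect ranking
assert ndcg_at_k([3, 4, 5], 3) < 1.0         # reversed ranking is penalized
assert mrr_highest_rated([3, 5, 4]) == 0.5   # top-rated item at rank 2
```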
4.4. Baselines
We compare NeuHash-CF against existing state-of-the-art content-aware hashing-based recommendation approaches, as well as hashing-based approaches that are not content-aware, to highlight the benefit of including content:
- DCMF: Discrete Content-aware Matrix Factorization (Lian:2017:DCM:3097983.3098008) (https://github.com/DefuLian/recsys/tree/master/alg/discrete/dcmf) is a content-aware matrix factorization technique, discretized and optimized by solving multiple mixed-integer subproblems. Similarly to our approach, its primary objective is to minimize the squared error between the observed rating and the rating estimated from the Hamming distance. It also learns a latent representation for each word in the text associated with each item, which is used for generating hash codes for cold-start items.
- DDL: Discrete Deep Learning (Zhang:2018:DDL:3159652.3159688) (https://github.com/yixianqianzy/ddl) also uses an alternating optimization strategy for solving multiple mixed-integer subproblems, where the primary objective is a mean squared error loss. In contrast to DCMF, DDL uses a deep belief network for generating cold-start item hash codes, which is trained by learning to map the content of known items into the hash codes generated in the first part of the approach.
- DCF: Discrete Collaborative Filtering (zhang2016discrete) (https://github.com/hanwangzhang/DiscreteCollaborativeFiltering) can be considered the predecessor of DCMF, but it is not content-aware, content-awareness being the primary novelty of DCMF.
- NeuHash-CF/no.C: a version of our NeuHash-CF that is not content-aware, obtained by learning item hash codes in the same way as user hash codes, i.e., without any content features.
For both DCMF and DDL, hash codes for cold-start items are a secondary objective, as they are generated differently from non-cold-start item hash codes. In contrast, our NeuHash-CF treats all items identically, as all item hash codes are generated from content features alone.
To provide a comparison to non-hashing-based approaches, which are notably more computationally expensive when making recommendations (see Section 4.7), we also include the following baselines:
- FM: Factorization Machines (rendle2010factorization) work on a concatenated feature vector consisting of the one-hot encoded user id, the one-hot encoded item id, and the content features. FM learns latent vectors, as well as scalar weights and biases, for each of the dimensions, and estimates the user-item relevance by computing a weighted sum over all non-zero entries and all pairwise interactions between non-zero entries of the concatenated vector. This results in a large number of inner product computations and a large storage cost for the latent representations and scalars. We use the fastFM implementation (JMLR:v17:15355) (https://github.com/ibayer/fastFM).
- MF: Matrix Factorization (Koren:2009:MFT:1608565.1608614) is a classic non-content-aware collaborative filtering approach, which learns real-valued item and user latent vectors, such that the inner product corresponds to the user-item relevance. MF is similar to a special case of FM without any feature interactions.
Table 2. NDCG@k on Yelp and Amazon for the in-matrix and out-of-matrix settings, using 16/32/64 dimensional codes. Methods without content features cannot handle the out-of-matrix setting and are omitted there.

Yelp (in-matrix)
                  | 16 dim.          | 32 dim.          | 64 dim.
NDCG              | @2    @6    @10  | @2    @6    @10  | @2    @6    @10
NeuHash-CF        | .662  .701  .752 | .681  .718  .766 | .697  .731  .776
DCMF              | .642  .678  .733 | .655  .691  .743 | .670  .701  .752
DDL               | .636  .674  .729 | .651  .686  .739 | .664  .698  .749
NeuHash-CF/no.C   | .634  .672  .727 | .655  .689  .741 | .666  .699  .749
DCF               | .639  .676  .730 | .649  .685  .738 | .671  .700  .750
MF (real-valued)  | .755  .763  .800 | .755  .763  .800 | .755  .763  .800
FM (real-valued)  | .754  .763  .801 | .750  .760  .798 | .744  .755  .794

Yelp (out-of-matrix)
                  | 16 dim.          | 32 dim.          | 64 dim.
NDCG              | @2    @6    @10  | @2    @6    @10  | @2    @6    @10
NeuHash-CF        | .646  .694  .747 | .687  .725  .772 | .702  .737  .780
DCMF              | .611  .647  .703 | .617  .655  .709 | .626  .664  .717
DDL               | .575  .615  .673 | .579  .622  .681 | .612  .646  .700
FM (real-valued)  | .731  .750  .789 | .724  .744  .785 | .719  .740  .781

Amazon (in-matrix)
                  | 16 dim.          | 32 dim.          | 64 dim.
NDCG              | @2    @6    @10  | @2    @6    @10  | @2    @6    @10
NeuHash-CF        | .759  .777  .810 | .780  .798  .827 | .786  .803  .831
DCMF              | .749  .767  .800 | .761  .777  .810 | .773  .788  .818
DDL               | .734  .755  .791 | .748  .768  .802 | .762  .779  .811
NeuHash-CF/no.C   | .748  .768  .802 | .760  .776  .808 | .771  .785  .816
DCF               | .745  .767  .802 | .759  .776  .809 | .774  .787  .818
MF (real-valued)  | .824  .826  .848 | .824  .826  .848 | .824  .826  .848
FM (real-valued)  | .821  .822  .845 | .817  .819  .843 | .813  .816  .841

Amazon (out-of-matrix)
                  | 16 dim.          | 32 dim.          | 64 dim.
NDCG              | @2    @6    @10  | @2    @6    @10  | @2    @6    @10
NeuHash-CF        | .758  .778  .809 | .769  .788  .818 | .787  .804  .831
DCMF              | .727  .748  .782 | .729  .749  .784 | .733  .752  .786
DDL               | .704  .728  .766 | .705  .729  .767 | .705  .727  .766
FM (real-valued)  | .792  .800  .827 | .785  .793  .821 | .780  .790  .819
Table 3. MRR for the in-matrix and out-of-matrix settings on Yelp and Amazon, using 16/32/64 dimensional codes ('-' marks settings a method does not support).

                  | Yelp (in-matrix)  | Yelp (out-of-matrix) | Amazon (in-matrix) | Amazon (out-of-matrix)
MRR               | 16    32    64    | 16    32    64       | 16    32    64     | 16    32    64
NeuHash-CF        | .646  .668  .687  | .628  .674  .692     | .749  .770  .779   | .750  .764  .782
DCMF              | .629  .644  .660  | .598  .604  .612     | .738  .753  .767   | .719  .721  .726
DDL               | .620  .638  .651  | .557  .562  .604     | .721  .741  .753   | .696  .694  .694
NeuHash-CF/no.C   | .621  .642  .656  | -     -     -        | .737  .752  .764   | -     -     -
DCF               | .626  .636  .664  | -     -     -        | .736  .751  .769   | -     -     -
MF (real-valued)  | .767  .767  .767  | -     -     -        | .826  .826  .826   | -     -     -
FM (real-valued)  | .761  .756  .750  | .730  .722  .717     | .824  .821  .815   | .792  .784  .780
4.5. Tuning
For training our NeuHash-CF approach, we use the Adam (kingma2014adam) optimizer, with the learning rate and batch size tuned on the validation set, where 0.0005 and 2000 were consistently chosen. We also tune the number of encoder layers and the number of neurons in each layer; most runs had the optimal validation performance with 2 layers and 1000 neurons. To improve the robustness of the codes, we add Gaussian noise before decoding the hash codes, where the variance is initially set to 1 and decreased by 0.01% after every batch. Lastly, we tune $\alpha$ in Eq. 22, where 0.001 was consistently chosen. The code is written in TensorFlow (abadi2016tensorflow) and is publicly available at https://github.com/casperhansen/NeuHashCF. For all baselines, we tune the hyperparameters on the validation set as described in the original papers.

4.6. Results
The experimental comparison is summarized in Tables 2 and 3 for NDCG@k and MRR, respectively. The tables are split into in-matrix and out-of-matrix evaluation settings for both datasets, and the methods can be categorized into four groups: (1) content-aware (NeuHash-CF, DCMF, DDL), (2) not content-aware (NeuHash-CF/no.C, DCF), (3) real-valued not content-aware (MF), and (4) real-valued content-aware (FM). For all methods, we compute hash codes (or latent representations for MF and FM) of length 16, 32, and 64. We use a two-tailed paired t-test for statistical significance testing against the best performing hashing-based baseline. Statistically significant improvements, at the 0.05 level, over the best performing hashing-based baseline per column are marked with a star (*), and the best performing hashing-based approach is shown in bold.

4.6.1. In-matrix regression
In the in-matrix setting, where all items have been rated in the training data, our NeuHash-CF significantly outperforms all hashing-based baselines. On Yelp, we observe improvements in NDCG of up to 0.03, corresponding to a 4.3% improvement. On Amazon, we observe improvements in NDCG of up to 0.02, corresponding to a 2.7% improvement. Similar improvements are noted on both datasets for MRR (1.6–4.1% improvements). On all datasets and across the evaluated dimensions, NeuHash-CF performs similarly to or better than state-of-the-art hashing-based approaches while using 2–4 times fewer bits, thus providing both a significant performance increase and a 2–4 times storage reduction. Interestingly, the performance gap between existing content-aware and not content-aware approaches is relatively small. When considering the relative performance increase of our NeuHash-CF with and without content features, we see the benefit of basing the item hash codes directly on the content. DCMF and DDL both utilize the content features for handling cold-start items, but not to the same degree for the in-matrix items, which we argue explains the primary performance increase observed for NeuHash-CF, since NeuHash-CF/no.C performs similarly to the baselines.
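The significance claims above rest on a two-tailed paired t-test over per-user metric scores. As an illustrative sketch only (the user scores below are synthetic stand-ins, not the paper's evaluation data), the test can be run with SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-user NDCG@10 scores for two systems evaluated on the
# same 200 users; pairing means each user contributes one score difference.
baseline = rng.normal(loc=0.77, scale=0.05, size=200)
ours = baseline + rng.normal(loc=0.02, scale=0.03, size=200)

# Two-tailed paired t-test on the per-user differences.
t_stat, p_value = stats.ttest_rel(ours, baseline)
print(p_value < 0.05)  # significant at the 0.05 level?
```

A paired (rather than unpaired) test is the right choice here because both systems are evaluated on exactly the same users, so per-user variance cancels out.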
We also include MF and FM as real-valued baselines to better gauge the discretization gap. As expected, the real-valued approaches outperform the hashing-based approaches, however as the number of bits increases the performance difference decreases. This is to be expected, since the real-valued approaches reach a potential representational limit sooner, beyond which more dimensions would not positively impact the ranking performance. In fact, for FM we observe a marginal performance drop when increasing its number of latent dimensions, thus indicating that it is overfitting. In contrast, MF keeps the same performance (differing only in distant decimals) independently of its number of latent dimensions.
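For reference, the NDCG@k and MRR values reported in the tables can be computed as follows. This is a generic sketch using the common exponential-gain DCG formulation; the exact gain/discount convention of the paper's evaluation script is an assumption here.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevances."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the predicted ranking divided by the ideal DCG."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(relevances, threshold=1):
    """Reciprocal rank of the first relevant item (0 if none is relevant)."""
    for rank, rel in enumerate(relevances, start=1):
        if rel >= threshold:
            return 1.0 / rank
    return 0.0

# relevances of items in predicted rank order (toy example)
ranked = [3, 0, 2, 1, 0]
print(round(ndcg_at_k(ranked, 10), 3))
print(mrr(ranked))
```

Both metrics are averaged over users to produce the table entries; NDCG rewards placing highly relevant items early, while MRR only looks at the position of the first relevant item.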
Table 4: out-of-matrix performance vs. fraction of items used for training (32-bit hash codes; 32 latent dimensions for FM). Cells show NDCG@10/MRR.

Yelp (out-of-matrix):

| Method | 10% | 20% | 30% | 40% | 50% |
|---|---|---|---|---|---|
| NeuHash-CF | .730/.603 | .750/.634 | .769/.666 | .771/.668 | .772/.674 |
| DCMF | .688/.572 | .693/.578 | .704/.593 | .710/.602 | .709/.604 |
| DDL | .678/.556 | .681/.562 | .687/.572 | .684/.571 | .681/.562 |
| FM (real-valued) | .766/.688 | .776/.707 | .778/.712 | .786/.724 | .785/.722 |

Amazon (out-of-matrix):

| Method | 10% | 20% | 30% | 40% | 50% |
|---|---|---|---|---|---|
| NeuHash-CF | .794/.727 | .812/.753 | .817/.761 | .818/.763 | .818/.764 |
| DCMF | .774/.710 | .778/.712 | .781/.717 | .784/.720 | .784/.721 |
| DDL | .770/.713 | .766/.689 | .767/.700 | .765/.693 | .767/.694 |
| FM (real-valued) | .806/.759 | .813/.771 | .817/.775 | .823/.786 | .821/.784 |
4.6.2. Out-of-matrix regression

We now consider the out-of-matrix setting, corresponding to recommending cold-start items. NeuHash-CF significantly outperforms the existing state-of-the-art hashing-based baselines, even more so than in the in-matrix setting. On Yelp, we observe the smallest NDCG increase for 16 bits at 0.035, which is doubled in most cases for 32 and 64 bits, corresponding to improvements of up to 12.1% over state-of-the-art baselines. We observe a similar trend on Amazon, where the lowest improvement of 0.027 NDCG is observed at 16 bits, but increasing the number of bits leads to consistently larger improvements of up to 7.4%. These results are also consistent with MRR, where increasing the number of bits provides increasingly larger performance increases, between +5% and +13.1% on Yelp and between +4.3% and +7.7% on Amazon. In all cases, the performance of NeuHash-CF on 16 bits is even better than that of the best baseline at 64 bits, thus verifying the high quality of the hash codes generated by NeuHash-CF.
For the real-valued FM baseline, we observe that it outperforms our approach and the existing baselines at 16 and 32 dimensions; however, at 64 dimensions NeuHash-CF outperforms FM on Amazon on NDCG (across all dimensions). When we consider Yelp, NeuHash-CF obtains an NDCG@10 within 0.01 of FM, but is worse on the other NDCG cut-offs and on MRR.
4.6.3. Out-of-matrix regression with limited training data

To evaluate how the content-aware approaches generalize to the cold-start setting depending on the number of training items, we furthermore create smaller versions of the 50/50 out-of-matrix split used previously. In addition to using 50% of the data for the training set, we also consider splits using 10%, 20%, 30%, and 40%. In all out-of-matrix settings, the validation and testing sets are identical, in order to compare the impact of the training size. The results can be seen in Table 4 for 32-bit hash codes and 32 latent dimensions in FM. Similarly to before, NeuHash-CF outperforms the hashing-based baselines in all cases, with gains similar to those observed previously. Most approaches, except DDL on Amazon, obtain the lowest performance using 10% of the data, and more training items generally improve the performance, although at 30–50% the pace of improvement slows down significantly. This indicates that the methods have observed close to a sufficient number of training items, and increasing the amount may not lead to better generalizability of the cold-start hash codes. Interestingly, NeuHash-CF obtains the largest improvement going from 10% to 50% on both NDCG and MRR, indicating that it generalizes better than the baselines. In contrast, DDL does not improve on Amazon by including more training items, which indicates that its ability to generalize to cold-start items is rather limited.
4.7. Computational Efficiency
To study the efficiency of using hash codes in comparison to real-valued vectors, we consider a setup of 100,000 users and 1,000–1,000,000 items. We randomly generate hash codes and real-valued vectors and measure the time taken to compute all Hamming distances (or inner products) from each user to all items, resulting in a total of 10^8–10^11 computations. We use a machine with a 64-bit instruction set (an Intel Xeon CPU E5-2670), and hence generate hash codes and vectors of length 64. We report the average runtime over 10 repetitions in Figure 2, and observe a speed-up by a factor of 40–50 for the Hamming distance, highlighting the efficiency benefit of hashing-based approaches. For FM, the dominating cost is its large number of inner product computations, which scales quadratically in the number of non-zero content features for a given item, thus making it highly intractable in large-scale settings.
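The speed-up comes from the fact that a 64-bit Hamming distance is a single XOR followed by a popcount, versus 64 multiply-adds for an inner product. A minimal sketch of the two relevance computations on randomly generated codes (illustrative only, not the benchmark code; the measured speed-up relies on hardware popcount over packed codes, whereas pure Python is shown here for clarity):

```python
import random

def hamming(a, b):
    """Hamming distance between two 64-bit hash codes stored as ints:
    XOR marks the differing bit positions, then the set bits are counted
    (a single popcount instruction on modern CPUs)."""
    return bin(a ^ b).count("1")

def inner_product(u, v):
    """Relevance for real-valued user/item vectors: 64 multiply-adds."""
    return sum(x * y for x, y in zip(u, v))

random.seed(42)
user_code = random.getrandbits(64)                      # 64-bit user hash code
item_codes = [random.getrandbits(64) for _ in range(1000)]

# Top-K retrieval: smaller Hamming distance = higher user-item relevance.
top10 = sorted(range(len(item_codes)),
               key=lambda i: hamming(user_code, item_codes[i]))[:10]
print(len(top10))  # 10
```

In a production setting the same idea is expressed with packed integer arrays and vectorized XOR/popcount, which is what enables real-time retrieval over very large item sets.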
4.8. Impact of Average Item Popularity per User
We now look at how different user characteristics impact the performance of the methods. We first compute the average item popularity of each user's list of rated items, and then order the users in ascending order of that average. An item's popularity is computed as the number of users who have rated that specific item, and thus the average item popularity of a user is representative of their attraction to popular content. Figure 3 plots the NDCG@10 for 32-dimensional representations using a mean-smoothing window of size 1000 (i.e., each shown value is averaged over the values within a window of 1000 users). Generally, all methods perform better for users who have a high average item popularity, where for Yelp we see an NDCG@10 difference of up to 0.25 from the lowest to the highest average popularity (0.2 for Amazon). This observation can be explained by highly popular items occurring more often in the training data, such that they have better learned representations. Additionally, the hashing-based approaches have a larger performance difference, compared to the real-valued MF and FM, which is especially due to their lower relative performance for users with a very low average item popularity (left side of the plots). In the out-of-matrix setting the same trend is observed, however with our NeuHash-CF performing very similarly to FM when excluding the users with the lowest average item popularity. We hypothesize that users with a low average item popularity have more specialized preferences, thus benefiting more from the higher representational power of real-valued representations.
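The analysis above can be sketched as follows; the ratings dictionary and NDCG values are hypothetical stand-ins for the per-user data and metrics computed on Yelp/Amazon, and a tiny smoothing window replaces the window of 1000 users.

```python
from collections import Counter

import numpy as np

# Hypothetical user -> rated-item ids (stand-in for the Yelp/Amazon data).
ratings = {"u1": [1, 2, 3], "u2": [2, 3], "u3": [3, 4]}

# Item popularity = number of users who rated the item.
popularity = Counter(item for items in ratings.values() for item in items)

# Average item popularity per user; users ordered ascending by that average.
avg_pop = {u: float(np.mean([popularity[i] for i in items]))
           for u, items in ratings.items()}
users_sorted = sorted(avg_pop, key=avg_pop.get)

def smooth(values, window):
    """Mean-smoothing: each point is the mean of `window` consecutive values."""
    return np.convolve(np.asarray(values, dtype=float),
                       np.ones(window) / window, mode="valid")

# Hypothetical per-user NDCG@10 values in user_sorted order, smoothed.
print(smooth([0.60, 0.65, 0.72], 2))
```

Plotting the smoothed per-user metric against the sorted user order reproduces the kind of curve shown in Figure 3.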
4.9. Impact of Number of Items per User
We now consider how the number of items each user has rated impacts performance. We order users by their number of rated items and plot NDCG@10 for 32-bit hash codes. Figure 4 plots this in the same way as Figure 3. Generally across all methods, we observe that performance initially increases, but then drops once a user has rated close to 100 items, depending on the dataset. While the hashing-based approaches keep steadily dropping in performance, MF and FM do so at a slower pace, and even increase for the users with the highest number of rated items in the in-matrix setting. The plots clearly show that the largest performance difference between the real-valued and hashing-based approaches occurs for the group of users with a high number of rated items, corresponding to users with potentially the highest diversity of interests. In this setting, the limited representational power of hash codes, as opposed to real-valued representations, may not be sufficient to encode users with largely varied interests. We observe very similar trends in the out-of-matrix setting for cold-start items, although the performance gap between our NeuHash-CF and the real-valued approaches is almost entirely located among the users with a high number of rated items.
5. Conclusion
We presented content-aware neural hashing for collaborative filtering (NeuHash-CF), a novel hashing-based recommendation approach, which is robust to cold-start recommendation problems (i.e., the setting where the items to be recommended have not been rated previously). NeuHash-CF is a neural approach that consists of two joint components for generating user and item hash codes. The user hash codes are learned through an embedding-based procedure using only the user's id, whereas the item hash codes are learned directly from associated content features (e.g., a textual item description). This contrasts with existing state-of-the-art content-aware hashing-based methods (Lian:2017:DCM:3097983.3098008; Zhang:2018:DDL:3159652.3159688), which generate item hash codes differently depending on whether they are cold-start items or not. NeuHash-CF is formulated as a variational autoencoder architecture, where both user and item hash codes are sampled from learned Bernoulli distributions to enable end-to-end trainability. We presented a comprehensive experimental evaluation of NeuHash-CF in both standard and cold-start settings, where NeuHash-CF outperformed state-of-the-art approaches by up to 12% NDCG and 13% MRR in cold-start recommendation (up to 4% in both NDCG and MRR in standard recommendation settings). In fact, the ranking performance of NeuHash-CF on 16-bit hash codes is better than that of state-of-the-art 32–64 bit hash codes, thus resulting not only in a significant effectiveness increase, but also in a 2–4x storage reduction. Analysis of our results showed that the largest performance difference between hashing-based and real-valued approaches occurs for users interested in the least popular items, and for the group of users with the highest number of rated items. Future work includes extending the architecture to accept richer item and user representations, such as (Hansen0ASL19; WangZLLZS20; RashedGS19; CostaD19).