Deep Retrieval: An End-to-End Learnable Structure Model for Large-Scale Recommendations

by Weihao Gao, et al.
ByteDance Inc.

One of the core problems in large-scale recommendations is to retrieve top relevant candidates accurately and efficiently, preferably in sub-linear time. Previous approaches are mostly based on a two-step procedure: first learn an inner-product model and then use maximum inner product search (MIPS) algorithms to search for top candidates, which can lose retrieval accuracy. In this paper, we present Deep Retrieval (DR), an end-to-end learnable structure model for large-scale recommendations. DR encodes all candidates into a discrete latent space. The latent codes for the candidates are model parameters and are learnt together with the other neural network parameters to maximize the same objective function. Once the model is learnt, a beam search over the latent codes is performed to retrieve the top candidates. Empirically, we show that DR, with sub-linear computational complexity, can achieve almost the same accuracy as the brute-force baseline.




1 Introduction

Recommendation systems have achieved great success in various commercial applications for decades. The objective of a recommendation system is to retrieve relevant candidates from an enormous corpus based on user features and historical behaviors. In the era of mobile internet, the number of candidates from content providers and the number of active users have both grown rapidly to tens or hundreds of millions, making it more challenging to design accurate recommendation systems. Scalability and efficiency are the main challenges for modern recommendation systems.

One of the early successful techniques in recommendation systems is collaborative filtering (CF), which makes predictions based on the simple idea that similar users may prefer similar items. Item-based collaborative filtering (Item-CF) Sarwar et al. (2001) extends the idea by considering the similarities between items, and later laid the foundation for Amazon’s recommendation system Linden et al. (2003).

Recently, vector-based recommendation algorithms have been widely adopted. The main idea is to embed users and items in a latent vector space, and use the inner product of vectors to represent the preference between users and items. Representative vector embedding methods include matrix factorization (MF) Mnih and Salakhutdinov (2008); Koren et al. (2009), factorization machines (FM) Rendle (2010), DeepFM Guo et al. (2017), Field-aware FM (FFM) Juan et al. (2016), etc. However, when the number of items is large, the complexity of brute-force computation of the inner product for all items can be prohibitive. Thus, maximum inner product search (MIPS) or approximate nearest neighbors (ANN) algorithms are usually used to retrieve items when the corpus is large. Efficient MIPS or ANN algorithms include tree-based algorithms Muja and Lowe (2014); Houle and Nett (2014), locality sensitive hashing (LSH) Shrivastava and Li (2014); Spring and Shrivastava (2017), product quantization (PQ) Jegou et al. (2010); Ge et al. (2013), hierarchical navigable small world graphs (HNSW) Malkov and Yashunin (2018), etc.

Despite their success in real-world applications, vector-based algorithms have two main deficiencies: (1) the objectives of learning vector representations and of learning a good MIPS structure are not perfectly aligned; (2) the dependency on inner products of user and item embeddings limits the capability of the model He et al. (2017). To overcome these limitations, tree-based models Zhu et al. (2019, 2018) have been proposed. These methods use a tree as the index and map each item to a leaf node of the tree. Learning objectives for model parameters and tree structures are well aligned to improve the accuracy. However, the tree structure itself can be difficult to learn: data available at the leaf level can be scarce and might not provide enough signal to learn a good tree at that finer level.

In this paper, we propose an end-to-end trainable structure model, Deep Retrieval (DR). Instead of using a tree structure, we propose to use a $K \times D$ matrix as in Figure 1(a) for indexing, motivated by Chen et al. (2018). Each item is indexed by one or more “codes” (or equivalently “paths”) with length $D$ and range $\{1, 2, \ldots, K\}$. For example, an item related to chocolate may be encoded by one such code and an item related to cake by another. There are $K^D$ possible paths and each path can be interpreted as a cluster of items: each path could contain multiple items and each item could also belong to multiple paths.

Figure 1: (a) Consider a structure with width $K$ and depth $D$. Assume an item is encoded by a length-$D$ vector, which is called a “code” or a “path”. The path denotes the indices of the $K \times D$ matrix to which the item is assigned, one index per layer. In the figure, arrows with the same color form a path. Different paths can intersect by sharing the same index at some layer. (b) Flow chart showing the process for constructing the probability of a path given the user input.


The advantage of DR is twofold. In training, the item paths can be learnt together with the neural network parameters of the structure model using an expectation-maximization (EM) type algorithm. The entire training process is therefore end-to-end and can be easily deployed on deep learning platforms. In terms of model capability, the many-to-many encoding scheme enables DR to learn more complicated relationships among users and items. Experiments show the benefit of being able to learn such relationships. The rest of the paper is organized as follows.

In Section 2, we describe the structure model and the structure objective function used in training in detail. We also introduce a beam search algorithm to find candidate paths in the inference stage.

In Section 3, we introduce the EM algorithm for training neural network parameters and paths of items jointly. We adopt a penalization on the sizes of the paths to prevent overfitting.

In Section 4, we demonstrate the performance of DR on two public datasets: MovieLens-20M and Amazon books. Experimental results show that DR achieves almost the brute-force accuracy with sub-linear computational complexity.

In Section 5, we conclude the paper and discuss several possible future research directions.

2 Structure Model

In this section, we introduce the structure model in Deep Retrieval in detail. First, we construct the probability for user $x$ to select path $c$ given the model parameters $\theta$ and present the training objective. Then we introduce a multi-path mechanism that enables the model to capture multi-aspect properties of items. In the inference stage, we introduce a beam search algorithm to select candidate paths based on user embeddings.

2.1 Structure Objective Function

The structure model contains $D$ layers with $K$ nodes in each layer, where each layer is a multi-layer perceptron (MLP) with skip connections and softmax as output. (Other neural network architectures such as recurrent neural networks (RNN) can also be applied here. Since $D$ is not very large in our settings, we use an MLP for simplicity.) Each layer takes an input vector and outputs a probability distribution over $\{1, 2, \ldots, K\}$ based on parameters $\theta$. Let $\mathcal{V} = \{1, \ldots, V\}$ be the labels of all items and $\pi: \mathcal{V} \to [K]^D$ be the mapping from items to paths. We assume $\pi$ is fixed in this section and will introduce the algorithm for learning $\pi$ together with $\theta$ in Section 3.

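Given a training sample $(x, y)$, which denotes a positive interaction (click, convert, like, etc.) between user $x$ and item $y$, as well as the path $c = (c_1, c_2, \ldots, c_D)$ associated with the item $y$, i.e. $\pi(y) = c$, the probability $p(c \mid x, \theta)$ is constructed layer by layer as follows (see Figure 1(b) for a flow chart).

  • The first layer takes the user embedding as input, and outputs a probability $p(c_1 \mid x, \theta_1)$ over the $K$ nodes of the first layer, based on parameters $\theta_1$.

  • From the second layer onward, we concatenate the user embedding and the embeddings of the nodes chosen in all the previous layers (called path embeddings) as the input of the MLP, and output $p(c_d \mid x, c_1, \ldots, c_{d-1}, \theta_d)$ over the $K$ nodes of layer $d$, based on parameters $\theta_d$.

  • The probability of $c$ is just the product of the probabilities of all the layers:

$$p(c \mid x, \theta) = \prod_{d=1}^{D} p(c_d \mid x, c_1, \ldots, c_{d-1}, \theta_d). \tag{1}$$

Given a set of $N$ training samples $\{(x_i, y_i)\}_{i=1}^{N}$, we maximize the log likelihood function of the structure model:

$$Q_{str}(\theta, \pi) = \sum_{i=1}^{N} \log p(c = \pi(y_i) \mid x_i, \theta). \tag{2}$$
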

The size of the input vector of layer $d$ is the embedding size times $d$, and the size of the output vector is $K$. The parameters of layer $d$ have a size of $\Theta(Kd)$. The parameters $\theta$ contain the parameters $\theta_d$ of all layers, as well as the path embeddings. The parameters in the entire model have an order of $\Theta(KD^2)$, which is significantly smaller than the number of possible paths $K^D$ when $D \ge 2$.
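The layer-by-layer construction can be illustrated with a small numerical sketch. This is not the authors' implementation: each layer is reduced to a single linear map followed by softmax (the paper uses an MLP with skip connections), and the names `path_log_prob`, `layer_weights` and `path_emb`, as well as the toy sizes, are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def path_log_prob(x, path, layer_weights, path_emb):
    """log p(c | x) as the sum of per-layer log-probabilities.

    x:             user embedding, shape (E,)
    path:          sequence of D node indices, each in [0, K)
    layer_weights: D matrices; layer d has shape (K, E * (d + 1))
    path_emb:      D embedding tables, each of shape (K, E)
    """
    inp = x
    log_p = 0.0
    for d, c_d in enumerate(path):
        probs = softmax(layer_weights[d] @ inp)   # distribution over K nodes
        log_p += np.log(probs[c_d])
        # The next layer sees the user embedding plus all chosen node embeddings.
        inp = np.concatenate([inp, path_emb[d][c_d]])
    return log_p

# Toy setup (hypothetical sizes): embedding size E, width K, depth D.
rng = np.random.default_rng(0)
E, K, D = 4, 5, 3
layer_weights = [rng.normal(size=(K, E * (d + 1))) for d in range(D)]
path_emb = [rng.normal(size=(K, E)) for _ in range(D)]
x = rng.normal(size=E)

# Because every layer is a softmax, the K**D path probabilities sum to one.
total = sum(
    np.exp(path_log_prob(x, (a, b, c), layer_weights, path_emb))
    for a in range(K) for b in range(K) for c in range(K)
)
```

The input of layer $d$ grows by one embedding per layer, which matches the statement that the input size is the embedding size times $d$.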

2.2 Multi-path Structure Objective

In tree-based deep models Zhu et al. (2018, 2019) as well as the structure model introduced above, each item belongs to only one cluster, which limits the capacity of the model to express multi-aspect information in real data. For example, an item related to kebab should belong to a cluster related to food. An item related to flowers should belong to a cluster related to gifts. However, an item related to chocolate or cakes should belong to both clusters in order to be recommended to users interested in either food or gifts. In real-world recommendation systems, a cluster might not have an explicit meaning such as food or gifts, but this example motivates us to assign each item to multiple clusters. In DR, we allow each item $y_i$ to be assigned to $J$ different paths $\{c_{i,1}, \ldots, c_{i,J}\}$. Let $\pi: \mathcal{V} \to [K]^{D \times J}$ be the mapping from items to multiple paths. The multi-path structure objective is defined as

$$Q_{str}(\theta, \pi) = \sum_{i=1}^{N} \log \Big( \sum_{j=1}^{J} p(c_{i,j} = \pi_j(y_i) \mid x_i, \theta) \Big). \tag{3}$$
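As a small worked example of the multi-path objective, the sketch below sums the probabilities of an item's paths inside the logarithm and then sums over samples. The function name `multi_path_objective` and the toy probabilities are illustrative assumptions, not values from the paper.

```python
import numpy as np

def multi_path_objective(path_probs):
    """Multi-path log-likelihood.

    path_probs[i, j] is the model probability of the j-th path assigned to
    the item of sample i, i.e. p(pi_j(y_i) | x_i, theta); shape (N, J).
    """
    path_probs = np.asarray(path_probs, dtype=float)
    per_sample = path_probs.sum(axis=1)      # sum over the J paths of the item
    return float(np.log(per_sample).sum())   # sum of logs over the N samples

# Two samples, three paths per item (toy numbers).
probs = [[0.10, 0.05, 0.05],
         [0.30, 0.10, 0.10]]
q = multi_path_objective(probs)              # log(0.2) + log(0.5)
```

An item is rewarded if any one of its paths receives high probability, which is what lets a single item serve several user interests.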


Beam search for inference. In the inference stage, we want to retrieve items from the structure model, given user embeddings as input. To this end, we use the beam search algorithm Reddy and others (1977) to retrieve multiple paths and merge the items in the retrieved paths. In each layer, the algorithm selects the top $B$ nodes from all successors of the nodes selected in the previous layer. Finally it returns the top $B$ paths in the final layer. The time complexity of the inference stage is $O(DKB \log B)$, which is sub-linear with respect to the number of items. The details of the beam search algorithm are given in Appendix B.

3 Learning

In the previous section, we introduced the structure model in Deep Retrieval and the structure objective to be optimized. The objective is continuous with respect to the parameters $\theta$, and hence can be optimized by any gradient-based optimizer. However, the objective also involves the mapping $\pi$ from items to paths, which is discrete and cannot be optimized by a gradient-based optimizer. This mapping acts as a “clustering” of items, which motivates us to use an EM-style algorithm to optimize the mapping and the continuous parameters jointly. In this section, we describe the EM algorithm in detail and introduce a penalization term to prevent overfitting.

3.1 EM Algorithm for Joint Training

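Given a user-item pair $(x_i, y_i)$ in the training data set, let the paths associated with the item, $\{\pi_1(y_i), \ldots, \pi_J(y_i)\}$, be the latent data in the EM algorithm. Along with the continuous parameters $\theta$, the objective function is given by

$$Q_{str}(\theta, \pi) = \sum_{i=1}^{N} \log \Big( \sum_{j=1}^{J} p(\pi_j(y_i) \mid x_i, \theta) \Big) = \sum_{v=1}^{V} \Big( \sum_{i: y_i = v} \log \Big( \sum_{j=1}^{J} p(\pi_j(v) \mid x_i, \theta) \Big) \Big). \tag{4}$$

We maximize the objective function over all possible mappings $\pi$. However, there are $K^D$ possible paths, so we cannot evaluate it for all $\pi_j(v)$'s. Instead, we only record the values for the top paths using beam search and leave the remaining paths with zero scores. (We also tried using a small value instead of zero, but it is difficult to decide what that value should be.) Unfortunately, this renders the function unstable due to the presence of $\log 0$ in Equation (4). To address this, we approximate it using the upper bound $\sum_{i=1}^{N} \log z_i \le N \big( \log \sum_{i=1}^{N} z_i - \log N \big)$ to obtain

$$Q_{str}(\theta, \pi) \le \bar{Q}_{str}(\theta, \pi) = \sum_{v=1}^{V} \Big( N_v \log \Big( \sum_{j=1}^{J} \sum_{i: y_i = v} p(\pi_j(v) \mid x_i, \theta) \Big) - N_v \log N_v \Big), \tag{5}$$

where $N_v$ denotes the number of occurrences of $v$ in the training set, which is independent of the mapping. (Since this is an upper bound on the true objective, maximizing it does not carry the guarantee that maximizing a lower-bound surrogate would. However, we still find it works well in practice.) We denote the score $s[v, c] \triangleq \sum_{i: y_i = v} p(c \mid x_i, \theta)$ for simplicity of notation. In practice, it is impossible to retain all scores as the number of possible paths is exponentially large, so we only retain a subset of paths with larger scores through beam search. In the M-step, we simply maximize $\sum_{j=1}^{J} s[v, \pi_j(v)]$ over each $\pi(v)$, which is equivalent to picking the $J$ highest scores $s[v, c]$ among the top paths from beam search. The EM training algorithm is summarized in Algorithm 1.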

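  Input: training set $\{(x_i, y_i)\}_{i=1}^{N}$; $T$ is some predefined epoch number.
  Initialize $\pi$ and $\theta$ randomly.
  for $t = 1$ to $T$ do
     Fix $\pi$, optimize parameter $\theta$ using a gradient-based optimizer to maximize the structure objective $Q_{str}(\theta, \pi)$.
     for $v = 1$ to $V$ do
        Compute $s[v, c]$ among the top paths from beam search.
        Let $\pi(v) \leftarrow$ the paths with the top $J$ scores of $s[v, c]$.
     end for
  end for
Algorithm 1 EM algorithm for structure learning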
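The unpenalized path-update step of Algorithm 1 can be sketched as follows. This is a minimal illustration under assumptions: beam-search results are supplied as plain Python tuples, and the helper name `m_step_top_j` and the toy data are hypothetical.

```python
from collections import defaultdict

def m_step_top_j(samples, J):
    """Pick, for each item, the J paths with the largest accumulated score.

    samples: iterable of (item, beam) pairs, one per training sample, where
             beam is a list of (path, prob) pairs returned by beam search
             for that sample's user input.
    Returns a dict mapping each item to its list of J best paths.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for item, beam in samples:
        for path, prob in beam:
            scores[item][path] += prob   # s[v, c]: sum of p(c | x_i) over samples with y_i = v
    assignment = {}
    for item, s in scores.items():
        ranked = sorted(s, key=s.get, reverse=True)
        assignment[item] = ranked[:J]
    return assignment

# Two training samples for the same item, each with its own beam.
samples = [
    ("cake", [((1, 2), 0.5), ((3, 4), 0.2)]),
    ("cake", [((3, 4), 0.4), ((0, 0), 0.1)]),
]
paths = m_step_top_j(samples, J=2)
```

Paths that never appear in any beam keep a score of zero and are therefore never selected, mirroring the truncation described above.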

3.2 Penalization on Size of Paths

Overfitting is likely to happen if we do not apply any penalization in Algorithm 1. Imagine that the structure model is overfitted and gives a particular path $c$ a very high probability for any input. Then in the M-step, all the items will be assigned to path $c$, making the model fail to cluster the items. In order to prevent overfitting, we introduce a penalization term on the size of the paths. The penalized function is given by

$$Q_{pen}(\theta, \pi) = Q_{str}(\theta, \pi) - \alpha \sum_{c \in [K]^D} f(|c|), \tag{6}$$

where $\alpha$ is the penalty factor, $|c|$ denotes the number of items in path $c$ and $f$ is an increasing and convex function. A quadratic function $f(|c|) = |c|^2 / 2$ controls the average size of paths, and higher-order polynomials penalize larger paths more heavily. In our experiments, we use $f(|c|) = |c|^4 / 4$. It is worth mentioning that the penalty is only applied in the M-step, not during training of the continuous parameters $\theta$.

Coordinate descent algorithm for path assignment. It is intractable to jointly optimize the penalized objective function (6) over all the path assignments, since the penalization term cannot be decomposed into a sum of per-item terms. We therefore use a coordinate descent algorithm in the M-step: we fix the assignments of all other items while optimizing the path assignments $\pi_j(v)$ of item $v$. The time complexity of coordinate descent is $O(VJS)$, where $S$ is the pool size of candidate paths. The details of the coordinate descent algorithm are given in Appendix A.

3.3 Multi-task Learning and Reranking with Softmax Models

In our experiments, we found that jointly training DR with a softmax classification model greatly improves the performance. We conjecture that this is because the paths for the items are randomly assigned in the beginning, which makes optimization more difficult. By sharing the inputs with an easy-to-train softmax model, we are able to give the structure model an uplift in the optimization direction. So the final objective we are maximizing is

$$Q = Q_{str} + Q_{softmax}.$$

After performing beam search to retrieve a set of candidate items using Algorithm 3, we use the softmax function to rerank those candidates to obtain the final top candidates.
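The reranking step can be sketched as scoring only the retrieved candidates with the softmax model and keeping the best ones. The sketch assumes the softmax model scores an item by the inner product of a user embedding and an item embedding, as in the brute-force baseline of Section 4.2; the function name `rerank` and the toy embeddings are hypothetical.

```python
import numpy as np

def rerank(user_emb, item_emb, candidates, top_n):
    """Rerank candidate item ids by softmax logit and keep the top_n.

    Softmax is monotone in the logits, so sorting by logit gives the same
    order as sorting by softmax probability over the candidate set.
    """
    candidates = np.asarray(candidates)
    logits = item_emb[candidates] @ user_emb        # one score per candidate
    order = np.argsort(-logits, kind="stable")[:top_n]
    return candidates[order].tolist()

# Four items with 2-dimensional embeddings (toy values).
item_emb = np.array([[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [0.7, 0.0]])
user_emb = np.array([1.0, 0.0])
# Item 1 has the highest score overall but was not retrieved by beam search.
top = rerank(user_emb, item_emb, candidates=[0, 2, 3], top_n=2)
```

Only the candidates are scored, so the cost of this step scales with the number of retrieved items rather than with the corpus size. The example also shows why the reranker caps performance: an item outside the retrieved paths cannot be recovered.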

3.4 Complexity Analysis

We summarize the time complexity of each stage in Deep Retrieval as follows. The time complexity of the inference stage is $O(DKB \log B)$ per sample, which is sub-linear with respect to the number of items. The time complexity of training the continuous parameters is $O(NJDK)$, where $N$ is the number of training samples in one epoch and $J$ is the multiplicity of paths. The time complexity of path assignment by the coordinate descent algorithm is $O(VJS)$, where $V$ is the corpus size and $S$ is the pool size of candidate paths.

4 Experiments

In this section, we study the performance of DR on two public recommendation datasets: MovieLens-20M Harper and Konstan (2015) and Amazon books He and McAuley (2016); McAuley et al. (2015). We compare the performance of DR with the brute-force algorithm, as well as several other recommendation baselines including the tree-based models TDM Zhu et al. (2018) and JTM Zhu et al. (2019). At the end of this section, we investigate the role of important hyperparameters in DR.

4.1 Datasets and Metrics

MovieLens-20M. This dataset contains rating and free-text tagging activities from a movie recommendation service called MovieLens. We use the 20M subset, which was created from the behaviors of 138,493 users between 1995 and 2015. Each user-movie interaction contains a user-id, a movie-id, a rating between 1.0 and 5.0, as well as a timestamp.

In order to make a fair comparison, we follow exactly the same data pre-processing procedure as TDM. We only keep records with a rating higher than or equal to 4.0, and only keep users with at least ten reviews. After pre-processing, the dataset contains 129,797 users, 20,709 movies and 9,939,873 interactions. Then we randomly sample 1,000 users and their records to construct the validation set, another 1,000 users to construct the test set, and use the remaining users to construct the training set. For each user, the first half of the reviews according to the timestamp are used as historical behavior features and the latter half are used as ground truths to be predicted.

Amazon books. This dataset contains user reviews of books from Amazon, where each user-book interaction contains a user-id, an item-id, and the corresponding timestamp. As with MovieLens-20M, we follow the same pre-processing procedure as JTM. The dataset contains 294,739 users, 1,477,922 items and 8,654,619 interactions. Notice that the Amazon books dataset has many more items but sparser interactions than MovieLens-20M. We randomly sample 5,000 users and their records as the test set, another 5,000 users as the validation set and use the remaining users as the training set. The construction procedures of behavior features and ground truths are the same as in MovieLens-20M.

Metrics. We use precision, recall and F-measure as metrics to evaluate the performance of each algorithm. We emphasize that the metrics are computed for each user individually, and averaged without weight across users, following the same setting as both TDM and JTM. We compute the metrics by retrieving the top 10 and top 200 items for each user in MovieLens-20M and Amazon books respectively.
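The per-user computation and unweighted averaging described above can be sketched as below. The helper names `precision_recall_f` and `macro_average` and the toy item ids are illustrative assumptions. F-measure is taken as the harmonic mean of precision and recall for each user before averaging.

```python
def precision_recall_f(retrieved, truth):
    """Precision, recall and F-measure for one user."""
    hits = len(set(retrieved) & set(truth))
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

def macro_average(per_user):
    """Average each metric across users without weighting."""
    n = len(per_user)
    return tuple(sum(m[k] for m in per_user) / n for k in range(3))

# User A: 2 of 4 retrieved items are relevant; user B: no hits.
user_a = precision_recall_f(retrieved=[1, 2, 3, 4], truth=[2, 4, 6])
user_b = precision_recall_f(retrieved=[5, 6], truth=[7])
avg_p, avg_r, avg_f = macro_average([user_a, user_b])
```

Because F-measure is averaged per user, the reported F-measure is in general not the harmonic mean of the reported average precision and average recall.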

Model and training. Here we present some details about the model and training procedure in the experiment. Since the dataset is split such that the users in the training set, validation set and test set are disjoint, we drop the user-id and only use the behavior sequence as input for DR. The behavior sequence is truncated to a length of 69 if it is longer than 69, and filled with a placeholder symbol if it is shorter than 69. A recurrent neural network with GRU is used to project the behavior sequence onto a fixed-dimension embedding as the input of DR. We adopt the multi-task learning framework, and rerank the items in the recalled paths with a softmax reranker. We train the embeddings of DR and softmax jointly for the initial two epochs, then freeze the embeddings of softmax and train the embeddings of DR for two more epochs. The reason is to prevent overfitting of the softmax model. In the inference stage, the number of items retrieved from beam search is not fixed because path sizes differ, but the variance is not large. Empirically we control the beam size such that the number of items from beam search is 5 to 10 times the number of finally retrieved items.
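The fixed-length preprocessing of behavior sequences can be sketched as follows. The text does not say which end of a long sequence is kept or what the placeholder symbol is, so keeping the most recent items and padding with id 0 on the right are assumptions of this sketch, as is the function name `fix_length`.

```python
def fix_length(seq, length=69, pad=0):
    """Truncate or pad a behavior sequence to exactly `length` entries.

    Assumptions (not stated in the text): truncation keeps the most recent
    `length` items, and padding appends the placeholder id `pad`.
    """
    seq = list(seq)[-length:]                  # keep the last `length` items
    return seq + [pad] * (length - len(seq))   # pad short sequences

long_seq = fix_length(range(100))          # truncated to the last 69 items
short_seq = fix_length([7, 8], length=4)   # padded with the placeholder
```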

4.2 Empirical Results

We compare the performance of DR with the following algorithms: Item-CF Sarwar et al. (2001), YouTube product DNN Covington et al. (2016), TDM and JTM. For a fair comparison, we directly use the numbers for Item-CF, YouTube DNN, TDM and JTM reported in the TDM and JTM papers. Among the different variants of TDM presented, we pick the one with the best performance. The result of JTM is only available for Amazon books. We also compare DR with a brute-force retrieval algorithm, which directly computes the inner product of the user embedding with all the item embeddings learnt in the softmax model and returns the top items. The brute-force algorithm is usually computationally prohibitive in practical large recommendation systems, but on small datasets it can serve as an upper bound for inner-product-based models.

Table 1 shows the performance of DR compared to other algorithms and brute-force for MovieLens-20M. Table 2 shows the results for Amazon books. For DR and brute-force, we independently train the same model 5 times and compute the mean and standard deviation of each metric. We draw the following conclusions: (1) DR performs better than the other methods, including tree-based retrieval algorithms such as TDM and JTM. (2) The performance of DR is very close to or on par with the performance of the brute-force method.

Algorithm      Precision@10      Recall@10         F-measure@10
Item-CF        8.25%             5.66%             5.29%
YouTube DNN    11.87%            8.71%             7.96%
TDM (best)     14.06%            10.55%            9.49%
DR             20.58 ± 0.47%     10.89 ± 0.32%     12.32 ± 0.36%
Brute-force    20.70 ± 0.16%     10.96 ± 0.32%     12.38 ± 0.32%
Table 1: Comparison of precision@10, recall@10 and F-measure@10 for DR, brute-force retrieval and other recommendation algorithms on MovieLens-20M.

Algorithm      Precision@200     Recall@200        F-measure@200
Item-CF        0.52%             8.18%             0.92%
YouTube DNN    0.53%             8.26%             0.93%
TDM (best)     0.56%             8.57%             0.98%
JTM            0.79%             12.45%            1.38%
DR             0.95 ± 0.01%      13.74 ± 0.14%     1.63 ± 0.02%
Brute-force    0.95 ± 0.01%      13.75 ± 0.10%     1.63 ± 0.02%
Table 2: Comparison of precision@200, recall@200 and F-measure@200 for DR, brute-force and other recommendation algorithms on Amazon Books.

4.3 Sensitivity of Hyperparameters

DR introduces some key hyperparameters which may affect the performance dramatically, including the width of the structure model $K$, the number of multiple paths $J$, the beam size $B$ and the penalty factor $\alpha$. A separate setting of $K$, $J$, $B$ and $\alpha$ is chosen for the MovieLens-20M experiment and for the Amazon books experiment.

Using the Amazon books dataset, we study the role of these hyperparameters and how they affect the performance. Figure 2 shows how recall@200 changes as these hyperparameters change. We keep the other hyperparameters fixed when varying one hyperparameter. Precision@200 and F-measure@200 follow similar trends, so we plot them in the appendix.

Figure 2: Relationship between recall@200 in the Amazon Books experiment and model width $K$, number of paths $J$, beam size $B$ and penalty factor $\alpha$, respectively.

  • Width of model $K$ controls the overall capacity of the model. If $K$ is too small, the number of clusters is too small for all the items; if $K$ is too large, training and inference become expensive, since their time complexity grows linearly with $K$. Moreover, a large $K$ may increase the possibility of overfitting. An appropriate $K$ should be chosen depending on the size of the corpus.

  • Number of paths $J$ enables the model to express multi-aspect information of items. The performance is the worst when $J = 1$, and keeps increasing as $J$ increases. A larger $J$ may bring little further gain in performance, while the time complexity of training grows linearly with $J$. In practice, choosing $J$ between 3 and 5 is recommended.

  • Beam size $B$ controls the number of candidate paths to be recalled. A larger $B$ leads to better performance as well as heavier computation in the inference stage.

  • Penalty factor $\alpha$ controls the number of items in each path. The best performance is achieved when $\alpha$ falls in a certain range. A smaller $\alpha$ leads to a larger path size (see Table 3) and hence heavier computation in the reranking stage. Beam size $B$ and penalty factor $\alpha$ should be chosen as a trade-off between model performance and inference speed.

Overall, DR is fairly robust to its hyperparameters, since a wide range of hyperparameter values leads to near-optimal performance.

Penalty factor    3e-10         3e-9         3e-8        3e-7        3e-6
Top path size     3837 ± 197    1948 ± 30    956 ± 29    459 ± 13    242 ± 1
Table 3: Relationship between the size of the path with the most items (called the top path) and the penalty factor $\alpha$.

5 Conclusion and Discussion

In this paper, we have proposed Deep Retrieval, an end-to-end learnable structure model for large-scale recommender systems. DR uses an EM-style algorithm to learn the model parameters and paths of items jointly. Experiments have shown that DR performs well compared with brute-force baselines in two public recommendation datasets.

There are several future research directions based on the current model design. Firstly, the structure model defines the probability distribution on paths based only on user-side information. A useful direction would be to incorporate item-side information more directly into the DR model. Secondly, we only make use of positive interactions such as click, convert or like between user and item. Negative interactions such as non-click, dislike or unfollow should also be considered in future work to improve the model performance. Finally, the current DR still uses a softmax model as a reranker, so its performance may be capped by that of the softmax model. We are actively working to address this issue as well.

Broader Impact

It is widely believed that practical recommender systems not only reflect user preferences but often shape them over time Krishnan et al. (2014); Adomavicius et al. (2019). This can lead to potential biases in decision making or user behavior due to the continuous feedback loop in the system. Our proposed method focuses on improving the core machine learning technology to better retrieve top relevant candidates based on historical user behavior data. Whether this method amplifies or mitigates the existing biases in real recommender systems needs to be further examined.


  • G. Adomavicius, J. Bockstedt, S. P. Curley, J. Zhang, and S. Ransbotham (2019) The hidden side effects of recommendation systems. MIT Sloan Management Review 60 (2), pp. 1. Cited by: Broader Impact.
  • T. Chen, M. R. Min, and Y. Sun (2018) Learning k-way d-dimensional discrete codes for compact embedding representations. arXiv preprint arXiv:1806.09464. Cited by: §1.
  • P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198. Cited by: §4.2.
  • T. Ge, K. He, Q. Ke, and J. Sun (2013) Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence 36 (4), pp. 744–755. Cited by: §1.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247. Cited by: §1.
  • F. M. Harper and J. A. Konstan (2015) The movielens datasets: history and context. Acm transactions on interactive intelligent systems (tiis) 5 (4), pp. 1–19. Cited by: §4.
  • R. He and J. McAuley (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pp. 507–517. Cited by: §4.
  • X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173–182. Cited by: §1.
  • M. E. Houle and M. Nett (2014) Rank-based similarity search: reducing the dimensional dependence. IEEE transactions on pattern analysis and machine intelligence 37 (1), pp. 136–150. Cited by: §1.
  • H. Jegou, M. Douze, and C. Schmid (2010) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §1.
  • Y. Juan, Y. Zhuang, W. Chin, and C. Lin (2016) Field-aware factorization machines for ctr prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 43–50. Cited by: §1.
  • Y. Koren, R. Bell, and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer 42 (8), pp. 30–37. Cited by: §1.
  • S. Krishnan, J. Patel, M. J. Franklin, and K. Goldberg (2014) A methodology for learning, analyzing, and mitigating social influence bias in recommender systems. In Proceedings of the 8th ACM Conference on Recommender systems, pp. 137–144. Cited by: Broader Impact.
  • G. Linden, B. Smith, and J. York (2003) Amazon. com recommendations: item-to-item collaborative filtering. IEEE Internet computing 7 (1), pp. 76–80. Cited by: §1.
  • Y. A. Malkov and D. A. Yashunin (2018) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1.
  • J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. Cited by: §4.
  • A. Mnih and R. R. Salakhutdinov (2008) Probabilistic matrix factorization. In Advances in neural information processing systems, pp. 1257–1264. Cited by: §1.
  • M. Muja and D. G. Lowe (2014) Scalable nearest neighbor algorithms for high dimensional data. IEEE transactions on pattern analysis and machine intelligence 36 (11), pp. 2227–2240. Cited by: §1.
  • D. R. Reddy et al. (1977) Speech understanding systems: a summary of results of the five-year research effort. Department of Computer Science. Carnegie-Mellon University, Pittsburgh, PA 17. Cited by: §2.2.
  • S. Rendle (2010) Factorization machines. In 2010 IEEE International Conference on Data Mining, pp. 995–1000. Cited by: §1.
  • B. Sarwar, G. Karypis, J. Konstan, and J. Riedl (2001) Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pp. 285–295. Cited by: §1, §4.2.
  • A. Shrivastava and P. Li (2014) Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pp. 2321–2329. Cited by: §1.
  • R. Spring and A. Shrivastava (2017) A new unbiased and efficient class of lsh-based samplers and estimators for partition function computation in log-linear models. arXiv preprint arXiv:1703.05160. Cited by: §1.
  • H. Zhu, D. Chang, Z. Xu, P. Zhang, X. Li, J. He, H. Li, J. Xu, and K. Gai (2019) Joint optimization of tree-based index and deep model for recommender systems. In Advances in Neural Information Processing Systems, pp. 3973–3982. Cited by: §1, §2.2, §4.
  • H. Zhu, X. Li, P. Zhang, G. Li, J. He, H. Li, and K. Gai (2018) Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1079–1088. Cited by: §1, §2.2, §4.


Appendix A Coordinate descent algorithm for penalized path assignment

In this section, we describe in detail the coordinate descent algorithm used for path assignment with penalty. Recall that in the penalized M-step, we want to maximize the following objective over all assignments $\pi_j(v)$:

$$\arg\max_{\{\pi_j(v)\}} \sum_{v=1}^{V} \Big( N_v \log \Big( \sum_{j=1}^{J} s[v, \pi_j(v)] \Big) - N_v \log N_v \Big) - \alpha \sum_{c \in [K]^D} f(|c|).$$

Now we apply the coordinate descent algorithm by fixing the assignments of all other items and focusing on item $v$. Notice that the term $-N_v \log N_v$ does not depend on $\pi$ and hence can be dropped. For each item $v$, the partial objective function can be written as

$$\arg\max_{\{\pi_j(v)\}_{j=1}^{J}} N_v \log \Big( \sum_{j=1}^{J} s[v, \pi_j(v)] \Big) - \alpha \sum_{j=1}^{J} f(|\pi_j(v)|).$$

The coordinate descent algorithm is given as follows. In practice, three to five iterations are enough for the algorithm to converge. The time complexity grows linearly with the vocabulary size $V$, the multiplicity of paths $J$ and the number of candidate paths $S$.

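  Input: Score functions $s[v, c]$. Number of iterations $T$.
  Initialize $|c| = 0$ for all paths $c$.
  for $t = 1$ to $T$ do
     for all items $v$ do
        $\mathrm{sum} \leftarrow 0$
        for $j = 1$ to $J$ do
           if $t > 1$ then
              $|\pi_j^{(t-1)}(v)| \leftarrow |\pi_j^{(t-1)}(v)| - 1$
           end if
           for all candidate paths $c$ of item $v$ such that $c \notin \{\pi_l^{(t)}(v)\}_{l=1}^{j-1}$ do
              Compute penalized scores $\tilde{s}[v, c] = N_v \log(s[v, c] + \mathrm{sum}) - \alpha \big( f(|c| + 1) - f(|c|) \big)$
           end for
           $\pi_j^{(t)}(v) \leftarrow \arg\max_c \tilde{s}[v, c]$
           $\mathrm{sum} \leftarrow \mathrm{sum} + s[v, \pi_j^{(t)}(v)]$
           $|\pi_j^{(t)}(v)| \leftarrow |\pi_j^{(t)}(v)| + 1$
        end for
     end for
  end for
  Output: path assignments $\{\pi_j^{(T)}(v)\}$.
Algorithm 2 Coordinate descent algorithm for penalized path assignment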
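A minimal Python sketch of penalized path assignment by coordinate descent is given below, under stated assumptions. Candidate scores are assumed to be strictly positive and supplied as dictionaries. The quartic penalty `f` is an assumed form. For simplicity, an item is removed from all of its previous paths at once before its new paths are chosen greedily. The names `assign_paths`, `scores` and `counts` are hypothetical.

```python
import math
from collections import defaultdict

def f(n):
    """Penalty on a path holding n items (quartic form assumed here)."""
    return n ** 4 / 4.0

def assign_paths(scores, counts, J, alpha, iters=3):
    """Greedy coordinate descent over items.

    scores[v]: dict mapping each candidate path c of item v to s[v, c] > 0
    counts[v]: N_v, the number of training samples whose item is v
    Returns a dict mapping each item to its list of assigned paths.
    """
    size = defaultdict(int)                  # |c|: number of items on path c
    assign = {v: [] for v in scores}
    for _round in range(iters):
        for v, cand in scores.items():
            for c in assign[v]:              # take item v out of its old paths
                size[c] -= 1
            chosen, total = [], 0.0
            for _slot in range(min(J, len(cand))):
                best, best_val = None, -math.inf
                for c, s in cand.items():
                    if c in chosen:
                        continue
                    # Gain in the log term minus the increase in penalty.
                    val = counts[v] * math.log(total + s) \
                        - alpha * (f(size[c] + 1) - f(size[c]))
                    if val > best_val:
                        best, best_val = c, val
                chosen.append(best)
                total += cand[best]
                size[best] += 1
            assign[v] = chosen
    return assign

# Two items that both prefer path "a"; the penalty pushes them apart.
scores = {0: {"a": 0.6, "b": 0.5}, 1: {"a": 0.6, "b": 0.5}}
counts = {0: 1, 1: 1}
no_penalty = assign_paths(scores, counts, J=1, alpha=0.0)
with_penalty = assign_paths(scores, counts, J=1, alpha=1.0)
```

Without a penalty both items land on the same path, which is the collapse described in Section 3.2; with a large enough penalty the second item moves to its next-best path.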

Appendix B Beam search algorithm for inference

In this section, we present the beam search algorithm for inference in detail. Given a user embedding and a structure model with parameter $\theta$, the beam search algorithm (1) picks the top $B$ nodes at the first layer; (2) at each subsequent layer, picks the top $B$ nodes among the successors of the nodes chosen at the previous layer; (3) outputs the final $B$ nodes at the final layer. The algorithm is shown in Algorithm 3. In each layer, choosing the top $B$ from $K \times B$ candidates has a time complexity of $O(KB \log B)$. The total complexity is $O(DKB \log B)$.

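  Input: user $x$, structure model with parameter $\theta$, beam size $B$.
  Let $C_1$ be the top $B$ entries of $\{p(c_1 \mid x, \theta) : c_1 \in \{1, \ldots, K\}\}$.
  for $d = 2$ to $D$ do
     Let $C_d$ be the top $B$ entries of the set of all successors of $C_{d-1}$, defined as $\{p(c_1, \ldots, c_{d-1} \mid x, \theta) \, p(c_d \mid x, c_1, \ldots, c_{d-1}, \theta) : (c_1, \ldots, c_{d-1}) \in C_{d-1},\ c_d \in \{1, \ldots, K\}\}$.
  end for
  Output: $C_D$, a set of $B$ paths.
Algorithm 3 Beam search algorithm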
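Beam search over the layered structure can be sketched as follows. As a simplification, each layer is a single linear map followed by softmax rather than the MLP used in the paper, and the candidates are fully sorted at each layer instead of being selected with a heap. The names `beam_search`, `layer_weights` and `path_emb` and the toy sizes are assumptions.

```python
import numpy as np

def beam_search(x, layer_weights, path_emb, B):
    """Return up to B (path, log-probability) pairs, best first.

    x:             user embedding, shape (E,)
    layer_weights: D matrices; layer d has shape (K, E * (d + 1))
    path_emb:      D embedding tables, each of shape (K, E)
    """
    beams = [((), 0.0, x)]           # (path so far, log-prob, next-layer input)
    for W, emb in zip(layer_weights, path_emb):
        expanded = []
        for path, lp, inp in beams:
            logits = W @ inp
            log_probs = logits - np.logaddexp.reduce(logits)   # log-softmax
            for k, lk in enumerate(log_probs):
                nxt = np.concatenate([inp, emb[k]])
                expanded.append((path + (k,), lp + lk, nxt))
        # Keep the B most probable partial paths among all successors.
        expanded.sort(key=lambda t: t[1], reverse=True)
        beams = expanded[:B]
    return [(path, lp) for path, lp, _ in beams]

# Toy setup (hypothetical sizes): embedding size E, width K, depth D.
rng = np.random.default_rng(0)
E, K, D = 4, 5, 3
layer_weights = [rng.normal(size=(K, E * (d + 1))) for d in range(D)]
path_emb = [rng.normal(size=(K, E)) for _ in range(D)]
x = rng.normal(size=E)

top = beam_search(x, layer_weights, path_emb, B=2)
full = beam_search(x, layer_weights, path_emb, B=K ** D)   # exhaustive check
```

At most $K \times B$ successors are scored per layer, so the work does not depend on the number of items. Replacing the full sort with a size-$B$ heap would give the per-layer selection cost stated above.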

Appendix C Precision and F-measure against hyperparameters for Amazon books experiment

Here we plot precision@200 and F-measure@200 against the hyperparameters on the Amazon books dataset. The results for precision@200 are shown in Figure 3 and the results for F-measure@200 are shown in Figure 4. Both the precision and the F-measure follow the same trend as the recall shown in Section 4.3.

Figure 3: Relationship between precision@200 in the Amazon Books experiment and model width $K$, number of paths $J$, beam size $B$ and penalty factor $\alpha$, respectively.
Figure 4: Relationship between F-measure@200 in the Amazon Books experiment and model width $K$, number of paths $J$, beam size $B$ and penalty factor $\alpha$, respectively.