Joint Optimization of Tree-based Index and Deep Model for Recommender Systems

by   Han Zhu, et al.

Large-scale industrial recommender systems are usually confronted with computational problems due to the enormous corpus size. To retrieve and recommend the most relevant items to users under response time limits, resorting to an efficient index structure is an effective and practical solution. Tree-based Deep Model (TDM) for recommendation (Zhu et al., 2018) greatly improves recommendation accuracy using a tree index. By indexing items in a tree hierarchy and training a user-node preference prediction model satisfying a max-heap like property in the tree, TDM provides logarithmic computational complexity w.r.t. the corpus size, enabling the use of arbitrary advanced models in candidate retrieval and recommendation. In tree-based recommendation methods, the quality of both the tree index and the trained user preference prediction model largely determines the recommendation accuracy. We argue that the learning of the tree index and the learning of the user preference model are interdependent. Our purpose in this paper is to develop a method to jointly learn the index structure and the user preference prediction model. In our proposed joint optimization framework, the learning of the index and of the user preference prediction model is carried out under a unified performance measure. In addition, we propose a novel hierarchical user preference representation that utilizes the tree index hierarchy. Experimental evaluations with two large-scale real-world datasets show that the proposed method improves recommendation accuracy significantly. Online A/B test results at Taobao display advertising also demonstrate the effectiveness of the proposed method in production environments.





1. Introduction

Recommendation has become an increasingly popular means to help users acquire information from content providers. Personalized recommendation methods have been extensively studied and adopted in various kinds of applications like video streaming (Davidson et al., 2010; Covington et al., 2016), news recommendation (Okura et al., 2017) and e-commerce (Zhu et al., 2018).

The recommendation problem is essentially to retrieve a set of the most relevant or preferred items for each user request from the entire corpus. In the practice of large-scale recommendation, the algorithm design should strike a balance between accuracy and efficiency. In a corpus with tens or hundreds of millions of items, methods that need to linearly scan each item’s preference score for every single user request are not computationally tractable. To solve the computational problem, an index structure is commonly used to accelerate the retrieval process. In early recommender systems, item-based collaborative filtering (item-CF) along with the inverted index is a popular solution to overcome the calculation barrier (Linden et al., 2003). In item-CF based systems, the pre-calculated item similarity is used to build an inverted index, in which items that are most similar to the user’s historical behaviors can be retrieved quickly and precisely. However, the scope of the candidate set is limited, because only those items similar to the user’s historical behaviors can ultimately be recommended.

Recently, vector representation based methods like matrix factorization (Salakhutdinov and Mnih, 2007; Koren et al., 2009), factorization machine (Rendle, 2010) and deep learning models (Covington et al., 2016; Okura et al., 2017; Beutel et al., 2018) have been actively researched. These methods learn vector representations of users and items, the inner product of which represents user-item preference. For systems that use vector representation based methods, the recommendation set generation is equivalent to the k-nearest neighbor (kNN) search problem. Quantization-based index (Liu et al., 2005; Johnson et al., 2017) for approximate kNN search is widely adopted to accelerate the retrieval process. However, in the above solution, the vector representation learning and the kNN search index construction are optimized towards different objectives individually. The vector representation learning aims to minimize the estimation error of user-item preference, while the index construction usually minimizes the quantization error. The divergence between these two objectives leads to suboptimal vector representations and index structure (Cao et al., 2016). An even more important problem is that the dependence on a vector kNN search index requires an inner-product form of user preference modeling, which limits the model capability (He et al., 2017). For example, models like Deep Interest Network (Zhou et al., 2018b), Deep Interest Evolution Network (Zhou et al., 2018a) and xDeepFM (Lian et al., 2018), which have been proven to be effective in user preference prediction, cannot be used to generate candidates in recommendation.

In order to break the inner-product form limitation and make arbitrary advanced user preference models computationally tractable for retrieving candidates from the entire corpus, our previous work Tree-based Deep Model (TDM) (Zhu et al., 2018) creatively uses a tree structure as index and greatly improves the recommendation accuracy. TDM uses a tree hierarchy to organize items, and each leaf node in the tree corresponds to an item. Like a max-heap, TDM assumes that each user-node preference is the largest preference among that node’s children. In the training stage, a user-node preference prediction model is trained to fit the max-heap like preference distribution. Unlike vector kNN search based methods, where the index structure requires an inner-product form of user preference modeling, there is no restriction on the form of the preference model in TDM. In prediction, preference scores given by the trained model are used to perform layer-wise beam search in the tree index to retrieve the candidate items. The time complexity of beam search in the tree index is logarithmic w.r.t. the corpus size without restriction on model structures, which is a prerequisite for making advanced user preference models feasible for candidate retrieval in recommendation.

The index structure plays different roles in kNN search based methods and tree-based methods. In kNN search based methods, the vector representations of users and items are learnt first, and the vector search index is built afterwards. In tree-based methods, by contrast, the tree index’s hierarchy also affects the retrieval model training. Therefore, how to learn the tree index and user preference model jointly is an important problem. Tree-based methods are also an active research topic in the literature of extreme classification (Weston et al., 2013; Agrawal et al., 2013; Prabhu and Varma, 2014; Choromanska and Langford, 2015; Daumé III et al., 2017; Han et al., 2018; Prabhu et al., 2018), which is sometimes considered the same as recommendation (Jain et al., 2016; Prabhu et al., 2018). In the existing tree-based methods, the tree structure is learnt for a better hierarchy in the sample or label space. However, the objective of the sample or label partitioning task in the tree learning stage is not fully consistent with the ultimate target, i.e., accurate recommendation. The inconsistency between the objectives of index learning and prediction model training leaves the overall system in a suboptimal state. To address this challenge and facilitate better cooperation of the tree index and user preference prediction model, we focus on developing a way to simultaneously learn the tree hierarchy and user preference prediction model by optimizing a unified performance measure. The main contributions of this paper are summarized as follows:

  • We propose a joint optimization framework to learn the tree structure and user preference prediction model in tree-based recommendation, where a unified performance measure, i.e., the accuracy of user preference prediction is optimized.

  • We demonstrate that the proposed tree structure learning algorithm is equivalent to the weighted maximum matching problem of bipartite graph, and give an approximate algorithm to learn the tree structure.

  • We propose a novel method that makes better use of tree index to generate hierarchical user representation, which can help learn more accurate user preference prediction model.

  • We show that both the tree structure learning and hierarchical user representation can improve recommendation accuracy. These two modules can even mutually improve each other to achieve more significant performance promotion.

The remainder of this paper is organized as follows: in Section 2, we will compare some large-scale recommendation methods to show their differences. In Section 3, we firstly give a brief introduction to our previous work TDM to make this paper self-contained, and then describe the proposed joint learning method in detail. In Section 4, experimental results of both offline comparison and online A/B test are given to show the effectiveness of proposed methods. At last, we give a conclusion of our work in Section 5.

2. Related Work

In real-world applications, the recommendation process usually has two stages: candidate generation and ranking (Davidson et al., 2010; Zhu et al., 2018, 2017). Model-based large-scale recommendation methods are usually confronted with computational restrictions in the candidate generation stage. To overcome the calculation barrier, there are mainly three kinds of approaches: 1) Pre-calculate item or user similarities and use inverted index to accelerate the retrieval (Linden et al., 2003); 2) Convert user preference to distance of embedding vectors, and use approximate kNN search in retrieval (Covington et al., 2016); 3) Use tree or ensemble of trees to perform efficient retrieval (Zhu et al., 2018).

Industrial recommender systems typically adopt vector kNN search to achieve fast retrieval, e.g., YouTube video recommendation (Covington et al., 2016; Beutel et al., 2018), Yahoo news recommendation (Okura et al., 2017) and extensions that use recurrent neural networks to model the user behavior sequence (Hidasi et al., 2015; Tan et al., 2016; Wu et al., 2017). Such approaches use either a traditional deep neural network (DNN) or a recurrent neural network (RNN) to learn embedding representations of users and items based on various user behavioral and contextual data. However, because retrieval depends on approximate kNN search index structures, user preference models that use attention networks or cross features (Zhou et al., 2018b, a; Cheng et al., 2016) are difficult to apply.

Tree-based methods are also studied and adopted in real-world applications. Label Partitioning for Sublinear Ranking (LPSR) (Weston et al., 2013) uses k-means clustering with data points’ features to learn the tree hierarchy and then assigns labels to leaf nodes. In the prediction stage, the test sample is passed down along the tree to a leaf node according to its distance to each node’s cluster center, and the 1-vs-All base classifier is used to rank all labels belonging to the retrieved leaf node. Partitioned Label Trees (Parabel) (Prabhu et al., 2018) also uses recursive clustering to build the tree hierarchy, but the tree is built to partition the labels according to label similarities. Multi-label Random Forest (MLRF) (Agrawal et al., 2013) and FastXML (Prabhu and Varma, 2014) learn an ensemble of sample partitioning trees (a forest), and a ranked list of the most frequent labels in all the leaf nodes retrieved from the forest is returned in prediction. MLRF optimizes the Gini index when splitting nodes, and FastXML optimizes a combined loss function including a binary classification loss and a label ranking loss. In all the above methods, the tree structure remains unchanged in training and prediction once built, which makes it hard for the tree to adapt fully to the retrieval model.

Our previous work TDM (Zhu et al., 2018) introduces a tree-based model for large-scale recommendation that differs from existing tree-based methods through its max-heap like user-node preference formulation. In TDM, the tree is used as a hierarchical index (Kraska et al., 2018), and an attention model (Zhou et al., 2018b) is trained to predict user-node preference. Different from most tree-based methods, where non-leaf nodes are used to route decision-making to leaves, TDM explicitly formulates user-node preference for all the nodes to facilitate hierarchical beam search in the tree index. Despite this remarkable progress, the joint optimization problem of index and model is not yet well solved, because the model and the tree are learnt alternately under different objectives.

Figure 1. Tree-based deep recommendation model. (a) User preference prediction model. We firstly hierarchically abstract the user behaviors with nodes in corresponding layers. Then the abstract user behaviors and the target node, together with other features such as the user profile, are used as the input of the model. (b) Tree hierarchy. Each item is firstly assigned to a different leaf node with a projection function $\pi(\cdot)$. Red nodes (items) in the leaf level are selected as the candidate set.

3. Joint Optimization of Tree-based Index and Deep Model

In this section, we first give a brief review of TDM (Zhu et al., 2018). TDM uses a tree hierarchy as index and allows an arbitrary advanced deep model to serve as the user preference prediction model in recommendation. Then we propose the joint learning framework of the tree-based index and deep model. It alternately optimizes the index and the prediction model under a global loss function. A greedy tree learning algorithm is proposed to optimize the index. In the last subsection, we specify the hierarchical user preference representation used in model training.

3.1. Tree-based Deep Recommendation Model

A recommender system needs to return a candidate set containing items that a user is interested in from the item corpus. In practice, how to make effective and efficient retrieval from a large item corpus is a challenging problem. TDM uses a tree as index and creatively proposes a max-heap like probability formulation on the tree, where the user preference for each non-leaf node $n$ in level $l$ is derived as:

$$p^{(l)}(n|u) = \frac{\max_{n_c \in \{n\text{'s children in level } l+1\}} p^{(l+1)}(n_c|u)}{\alpha^{(l)}} \qquad (1)$$

where $p^{(l)}(n|u)$ is the ground truth probability that the user $u$ prefers the node $n$, and $\alpha^{(l)}$ is a layer normalization term. The above formulation means that the ground truth user-node probability on a node equals the maximum user-node probability of its children divided by a normalization term. Therefore, the top-k nodes in level $l+1$ must be contained in the children of the top-k nodes in level $l$, and the retrieval for top-k leaf items can be restricted to the top-k nodes in each layer without losing accuracy. Based on this, TDM turns the recommendation task into a hierarchical retrieval problem. By a top-down retrieval process, the candidate items are selected gradually from coarse to detailed. The candidate generating process of TDM is shown in Fig 1.
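As an illustration of why a max-heap like formulation allows top-k retrieval to be restricted layer by layer, the following minimal Python sketch builds a toy complete binary tree and checks the containment property. It assumes a binary tree with a power-of-two number of leaves, distinct leaf preferences, and a normalization term equal to 1 in every layer (normalization rescales a whole layer and does not change the ranking inside it). The function names are ours and are not taken from TDM.

```python
def build_levels(leaf_scores):
    """Return per-level preference lists, root level first.

    Each parent takes the maximum preference of its two children; the layer
    normalization term is assumed to be 1, which leaves rankings unchanged.
    """
    levels = [list(leaf_scores)]
    while len(levels[0]) > 1:
        child = levels[0]
        parent = [max(child[2 * i], child[2 * i + 1]) for i in range(len(child) // 2)]
        levels.insert(0, parent)
    return levels


def top_k(scores, k):
    """Positions of the k largest scores within one level."""
    return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])


def containment_holds(levels, k):
    """True if, for every pair of adjacent levels, the parents of the top-k
    nodes in the lower level all belong to the top-k of the upper level."""
    for upper, lower in zip(levels, levels[1:]):
        parents = {i // 2 for i in top_k(lower, k)}
        if not parents <= top_k(upper, k):
            return False
    return True


toy_levels = build_levels([0.05, 0.30, 0.12, 0.02, 0.20, 0.08, 0.15, 0.10])
print(toy_levels[0], containment_holds(toy_levels, k=2))
```

With tied preferences the check may depend on tie-breaking, which is why the toy data uses distinct values.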

Each item in the corpus is firstly assigned to a leaf node of a tree hierarchy $\mathcal{T}$. The non-leaf nodes can be seen as a coarser abstraction of their children. In retrieval, the user information combined with the node to score is firstly vectorized to a user preference representation as the input of a deep neural network $\mathcal{M}$ (e.g. fully connected networks). Then the probability that the user is interested in the node is returned by $\mathcal{M}$, as shown in Fig 1(a). While retrieving the top-k items (leaf nodes), a top-down beam search strategy is carried out level by level, as shown in Fig 1(b). In level $l$, only the children of nodes with top-k probabilities in level $l-1$ are scored and sorted to pick $k$ candidate nodes. This process continues until $k$ leaf items are reached.
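The level-by-level retrieval described above can be sketched as follows. This is a minimal illustration rather than the TDM implementation: it assumes a complete binary tree stored heap-style (node 0 is the root, node n has children 2n+1 and 2n+2), and the callable `score` is a stand-in for the preference model's output for one user.

```python
def beam_search(score, max_level, k):
    """Layer-wise beam search on a complete binary tree stored heap-style.

    Only children of the current beam are scored, so the model is evaluated
    at most 2*k times per level.
    """
    beam = [0]  # start from the root
    for _ in range(max_level):
        children = [c for n in beam for c in (2 * n + 1, 2 * n + 2)]
        children.sort(key=score, reverse=True)
        beam = children[:k]
    return beam


# Toy preferences: 8 leaves (node ids 7..14); inner nodes take the max of their children.
leaf_pref = [0.05, 0.30, 0.12, 0.02, 0.20, 0.08, 0.15, 0.10]
node_pref = [0.0] * 7 + leaf_pref
for n in range(6, -1, -1):
    node_pref[n] = max(node_pref[2 * n + 1], node_pref[2 * n + 2])

calls = []


def toy_score(node):
    calls.append(node)  # record every model evaluation
    return node_pref[node]


top_leaves = beam_search(toy_score, max_level=3, k=2)
print(top_leaves, len(calls))
```

In this toy run the two best leaves are reached after scoring 10 of the 15 nodes; the saving grows with the corpus because the number of scored nodes scales with k times the tree depth rather than with the number of leaves.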

With the tree index, the overall retrieval complexity for a user request is reduced from linear to logarithmic w.r.t. the size of the item corpus, without restrictions on the preference model structure. This lets TDM break the inner-product restriction on user preference modeling imposed by vector kNN search indexes and enables arbitrary advanced deep models to retrieve candidates from the entire corpus, which greatly raises the recommendation accuracy.

3.2. Joint Optimization Framework

According to the retrieval process, the recommendation accuracy of TDM is determined by the quality of the user preference model $\mathcal{M}$ and tree index $\mathcal{T}$. Given $n$ pairs of positive training data $\{(u_i, c_i)\}_{i=1}^{n}$, which means the user $u_i$ is interested in the target item $c_i$, $\mathcal{T}$ determines which non-leaf nodes $\mathcal{M}$ should select to reach $c_i$ for $u_i$. Instead of separately learning $\mathcal{M}$ and $\mathcal{T}$ as in previous and related works, we propose to jointly learn $\mathcal{M}$ and $\mathcal{T}$ with a global loss function. As we will see in experiments, jointly optimizing $\mathcal{M}$ and $\mathcal{T}$ could improve the ultimate recommendation accuracy.

Denote $p(\pi(c)|u;\pi)$ as user $u$’s preference probability over leaf node $\pi(c)$ given a user-item pair $(u, c)$, where $\pi(\cdot)$ is a projection function that projects an item to a leaf node in $\mathcal{T}$. Note that the projection function $\pi(\cdot)$ actually determines the item hierarchy in the tree, as shown in Fig 1(b). The model $\mathcal{M}$ is used to estimate and output the user-node preference $\hat{p}(\pi(c)|u;\theta,\pi)$, given $\theta$ as model parameters. If the pair $(u, c)$ is a positive sample, we have the ground truth preference $p(\pi(c)|u;\pi) = 1$ following the multi-class setting (Covington et al., 2016; Beutel et al., 2018). According to the max-heap property, the user preference probability of all $\pi(c)$’s ancestor nodes, i.e., $\{p(b_j(\pi(c))|u;\pi)\}_{j=0}^{l_{max}}$, should also be $1$, in which $b_j(\cdot)$ is the projection from a node to its ancestor node in level $j$ and $l_{max}$ is the max level in $\mathcal{T}$. To fit such a user-node preference distribution, the global loss function is formulated as

$$\mathcal{L}(\theta, \pi) = -\sum_{i=1}^{n} \sum_{j=0}^{l_{max}} \log \hat{p}\left(b_j(\pi(c_i)) \,|\, u_i; \theta, \pi\right) \qquad (2)$$

where we sum up the negative logarithm of predicted user-node preference probability on all the positive training samples and their ancestor user-node pairs as the global empirical loss.
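To make the structure of the global loss concrete, the sketch below accumulates the negative log-probability over each positive pair's leaf node and all of its ancestors. It is a minimal illustration under our own assumptions: the tree is a complete binary tree stored heap-style, `pi` is a plain dictionary from item to leaf node id, and `p_hat` is a hypothetical stand-in for the trained preference model.

```python
import math


def ancestors(leaf, max_level):
    """Node ids on the path from the root (level 0) down to `leaf` (level max_level)."""
    path = [leaf]
    for _ in range(max_level):
        path.append((path[-1] - 1) // 2)  # heap-style parent
    return path[::-1]


def global_loss(samples, pi, p_hat, max_level):
    """Sum of -log p_hat(user, node) over every positive (user, item) pair
    and every node on the root-to-leaf path of the item's leaf."""
    loss = 0.0
    for user, item in samples:
        for node in ancestors(pi[item], max_level):
            loss -= math.log(p_hat(user, node))
    return loss


# Toy tree with max level 2: leaves are node ids 3..6.
toy_pi = {"a": 3, "b": 6}
toy_samples = [("u1", "a"), ("u2", "b")]
toy_loss = global_loss(toy_samples, toy_pi, lambda user, node: 0.5, max_level=2)
print(toy_loss)
```

Because the path of each item depends on `pi`, moving an item to another leaf changes which user-node pairs enter the sum, which is how the tree enters the same objective as the model parameters.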

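Input:  Loss function $\mathcal{L}(\theta, \pi)$, initial deep model $\mathcal{M}$ and initial tree $\mathcal{T}$
1:  for $t = 0, 1, 2, \ldots$ do
2:     Solve $\min_{\theta} \mathcal{L}(\theta, \pi)$ by optimizing the model $\mathcal{M}$.
3:     Solve $\max_{\pi} -\mathcal{L}(\theta, \pi)$ by rebuilding the tree hierarchy with Algorithm 2
4:  end for
Output:  Learned model $\mathcal{M}$ and tree $\mathcal{T}$
Algorithm 1 Joint learning framework of the tree index and deep model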
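Algorithm 1 has the shape of an alternating minimization. The skeleton below is a minimal sketch of that loop, not the authors' implementation: `train_model` and `learn_tree` are hypothetical callables standing in for model training and for Algorithm 2, and the toy objective at the bottom (a continuous variable coupled with a discrete one) is purely illustrative.

```python
def joint_learn(loss, train_model, learn_tree, theta, pi, max_iters=20, tol=1e-9):
    """Alternate between a model step and a tree step under one shared loss.

    Returns the final (theta, pi) and the loss value recorded after every
    full iteration, starting with the initial loss.
    """
    history = [loss(theta, pi)]
    for _ in range(max_iters):
        theta = train_model(theta, pi)  # decrease the loss w.r.t. the model
        pi = learn_tree(theta, pi)      # decrease the loss w.r.t. the index
        history.append(loss(theta, pi))
        if history[-2] - history[-1] < tol:
            break
    return theta, pi, history


# Toy stand-ins: theta is a real number, pi is an integer "index" in 0..5.
toy_loss = lambda theta, pi: (theta - 3) ** 2 + (pi - theta) ** 2
toy_train = lambda theta, pi: (3 + pi) / 2  # exact minimizer over theta
toy_tree = lambda theta, pi: min(range(6), key=lambda p: (p - theta) ** 2)

theta, pi, history = joint_learn(toy_loss, toy_train, toy_tree, theta=0.0, pi=0)
print(history)
```

Since each step can only lower the shared loss and the loss is bounded below, the recorded values form a non-increasing sequence, which mirrors the convergence argument given for the framework. In practice the model step is approximate (e.g. a few epochs of gradient training), so monotonicity holds only to the extent that each step actually decreases the loss.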

Since optimizing the projection function $\pi(\cdot)$ is a combinatorial optimization problem, it can hardly be optimized simultaneously with $\theta$ using gradient-based algorithms. To overcome this, we propose a joint learning framework as shown in Algorithm 1. It alternately optimizes the loss function (2) with respect to the user preference model and the tree hierarchy. The consistency of the training loss in model training and tree learning promotes the convergence of the framework. In fact, Algorithm 1 converges if both the model training and the tree learning decrease the value of (2), since $\{\mathcal{L}(\theta_t, \pi_t)\}$ is then a decreasing sequence lower bounded by $0$. In model training, $\min_{\theta} \mathcal{L}(\theta, \pi)$ is to learn a user-node preference model for each layer. Benefiting from the tree hierarchy, $\min_{\theta} \mathcal{L}(\theta, \pi)$ is converted to learning the user-node preference distribution, so an arbitrary advanced deep model can be used, and the problem can be solved by popular optimization algorithms for neural networks such as SGD (Bottou, 2010) and Adam (Kingma and Ba, 2014). In the normalized user preference setting (Covington et al., 2016; Beutel et al., 2018), since the number of nodes increases exponentially with the node level, Noise-contrastive estimation (Gutmann and Hyvärinen, 2010) is used to estimate $\hat{p}(b_j(\pi(c))|u;\theta,\pi)$, which avoids calculating the normalization term by means of a sampling strategy. The task of tree learning is to solve $\max_{\pi} -\mathcal{L}(\theta, \pi)$ given $\theta$, which is a combinatorial optimization problem. In fact, given the tree structure, $\max_{\pi} -\mathcal{L}(\theta, \pi)$ amounts to finding the optimal matching between items in the corpus $\mathcal{C}$ and the leaf nodes of $\mathcal{T}$. (For convenience, we assume $\mathcal{T}$ is a given complete binary tree. It is worth mentioning that the proposed algorithm can be naturally extended to multi-way trees.) Furthermore, we have the following remark.

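Input:  Gap $d$, max tree level $l_{max}$, original projection $\pi_{old}$
Output:  Optimized projection $\pi_{new}$
1:  Set current level $l \leftarrow d$, initialize $\pi_{new} \leftarrow \pi_{old}$
2:  while True do
3:     for each node $n_i$ in level $l - d$ do
4:        Denote $\mathcal{C}_{n_i}$ as the item set, $\forall c \in \mathcal{C}_{n_i}$, $b_{l-d}(\pi_{new}(c)) = n_i$
5:        Find a projection $\pi^{*}$ that maximizes $\sum_{c \in \mathcal{C}_{n_i}} \mathcal{L}^{l-d+1,\,l}_{c,\pi(c)}$, s.t. $\forall c \in \mathcal{C}_{n_i}$, $b_{l-d}(\pi(c)) = n_i$
6:        Update $\pi_{new}$. Set $\pi_{new}(c) \leftarrow \pi^{*}(c)$, $\forall c \in \mathcal{C}_{n_i}$
7:     end for
8:     if $l = l_{max}$ then
9:        Break the loop
10:     end if
11:     if $l + d > l_{max}$ then
12:        $d \leftarrow l_{max} - l$
13:     end if
14:     $l \leftarrow l + d$
15:  end while
Algorithm 2 Tree learning algorithm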
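Remark 1.

$\max_{\pi} -\mathcal{L}(\theta, \pi)$ is essentially an assignment problem to find a maximum weighted matching on a weighted bipartite graph.

Proof. Suppose the $k$-th item $c_k$ is assigned to the $m$-th leaf node $n_m$, i.e. $\pi(c_k) = n_m$, the following weight value can be computed:

$$\mathcal{L}_{c_k, n_m} = \sum_{(u, c) \in \mathcal{A}_k} \sum_{j=0}^{l_{max}} \log \hat{p}\left(b_j(n_m) \,|\, u; \theta, \pi\right)$$

where $\mathcal{A}_k$ contains all positive sample pairs $(u, c)$ whose target item $c$ is $c_k$.

If we take leaf nodes in $\mathcal{T}$ and items in corpus $\mathcal{C}$ as vertices and the full connection between leaf nodes and items as edges, we can construct a weighted bipartite graph $V$ with $\mathcal{L}_{c_k, n_m}$ as the weight of the edge between $c_k$ and $n_m$. Furthermore, we can see that each assignment $\pi(\cdot)$ between items and leaf nodes equals a matching of $V$. Given an assignment $\pi(\cdot)$, the total loss (2) can be computed by

$$\mathcal{L}(\theta, \pi) = -\sum_{k=1}^{|\mathcal{C}|} \mathcal{L}_{c_k, \pi(c_k)}$$

where $|\mathcal{C}|$ is the corpus size. Therefore, $\max_{\pi} -\mathcal{L}(\theta, \pi)$ amounts to finding the maximum weighted matching of $V$. ∎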
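The assignment view can be checked on a tiny example. The sketch below is a brute-force illustration under our own assumptions (three items, three leaf nodes, made-up log-probability weights); it enumerates all one-to-one assignments, which is only feasible for very small inputs and is not the method proposed for large corpora.

```python
from itertools import permutations


def best_assignment(weights):
    """weights[k][m] is the weight of placing item k on leaf m.

    Returns the one-to-one assignment (leaf index per item) with the largest
    total weight, together with that total.
    """
    n = len(weights)
    total = lambda perm: sum(weights[k][perm[k]] for k in range(n))
    best = max(permutations(range(n)), key=total)
    return list(best), total(best)


# Hypothetical weights: sums of log-probabilities, hence all non-positive.
toy_weights = [
    [-1.0, -3.0, -2.0],
    [-1.5, -0.5, -4.0],
    [-0.2, -2.5, -0.3],
]
assignment, total_weight = best_assignment(toy_weights)
print(assignment, total_weight)
```

In this example item 2 individually prefers leaf 0 (weight -0.2), yet the best matching places it on leaf 2, because leaf 0 is worth more to item 0. This is the sense in which tree learning is a matching problem rather than an independent per-item choice.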

Traditional algorithms for assignment problems, such as the classic Hungarian algorithm, are hard to apply to a large corpus because of their high complexity. Even for the simplest greedy algorithm, which greedily chooses the unassigned pair with the largest weight $\mathcal{L}_{c_k, n_m}$ (the weight of assigning item $c_k$ to leaf node $n_m$), a big weight matrix needs to be computed and stored in advance, which is not acceptable. To overcome this issue, we propose a segmented tree learning algorithm.

Instead of assigning items directly to leaf nodes, we assign the items every $d$ levels from top to bottom. Denote the partial weight of $\mathcal{L}_{c_k, \pi(c_k)}$ from level $s$ to level $d$ given projection function $\pi(\cdot)$ as

$$\mathcal{L}^{s,d}_{c_k, \pi(c_k)} = \sum_{(u, c) \in \mathcal{A}_k} \sum_{j=s}^{d} \log \hat{p}\left(b_j(\pi(c_k)) \,|\, u; \theta, \pi\right)$$

We firstly find an assignment to maximize $\sum_{k=1}^{|\mathcal{C}|} \mathcal{L}^{1,d}_{c_k, \pi(c_k)}$ w.r.t. the projection function $\pi(\cdot)$, which is equivalent to assigning all the items to nodes in level $d$. For a complete binary tree $\mathcal{T}$ with max level $l_{max}$, each node in level $d$ is assigned no more than $2^{l_{max}-d}$ items. This is also a maximum matching problem, which can be efficiently solved by a greedy algorithm, since the number of possible locations for each item is largely decreased if $d$ is well chosen (for a gap $d$, the number is $2^{d}$). Keeping each item $c$’s corresponding ancestor node in level $d$ (which is $b_d(\pi(c))$) unchanged, we then successively maximize the next $d$ levels. The recursion stops when each item is assigned to a leaf node. The proposed algorithm is detailed in Algorithm 2.
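The top-to-bottom segmentation can be made explicit with a small helper that lists, for a given gap and tree depth, which level is held fixed and which level is being assigned at each step. This is our own illustrative sketch for a complete binary tree; the function name and the returned tuple layout are assumptions.

```python
def level_schedule(d, l_max):
    """List of (fixed_level, target_level, capacity) steps from root to leaves.

    At each step, items keep their ancestor in `fixed_level` and are assigned
    to nodes in `target_level`; each target node of a complete binary tree
    can hold at most 2 ** (l_max - target_level) items.
    """
    steps, level = [], 0
    while level < l_max:
        target = min(level + d, l_max)  # the last step may be shorter than d
        steps.append((level, target, 2 ** (l_max - target)))
        level = target
    return steps


print(level_schedule(3, 8))
```

For a gap of 3 and a tree of max level 8, the items are first spread over level 3, then over level 6 within each level-3 subtree, and finally over the leaves, where every node holds exactly one item. Each step only compares an item against the nodes below its current ancestor, so no corpus-by-leaves weight matrix is ever materialized.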

In line 5 of Algorithm 2, we use a greedy algorithm with a rebalance strategy to solve the sub-problem. Each item $c \in \mathcal{C}_{n_i}$ is firstly assigned to the descendant of $n_i$ in level $l$ with the largest weight $\mathcal{L}^{l-d+1,\,l}_{c,\cdot}$ (the partial weight from level $l-d+1$ to level $l$). Then, to guarantee that each such node is assigned no more than $2^{l_{max}-l}$ items, a rebalance process is applied. To promote the stability of tree learning and facilitate the convergence of the whole framework, for nodes that have more than $2^{l_{max}-l}$ items, we give priority to keeping those items that have the same assignment in level $l$ as in the former iteration (i.e., $b_l(\pi'(c)) = b_l(\pi_{old}(c))$). The other items assigned to the node are sorted in descending order of weight $\mathcal{L}^{l-d+1,\,l}_{c,\cdot}$, and the excess items are moved to other nodes that still have spare capacity, according to the descending order of each item’s weight $\mathcal{L}^{l-d+1,\,l}_{c,\cdot}$. Algorithm 2 helps us avoid storing a single big weight matrix. Furthermore, each sub-task can run in parallel to further improve the efficiency.
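The greedy step with rebalancing can be sketched as below. This is a simplified reading of the description, not the authors' code: `weight[c][n]` is a hypothetical lookup for the partial weight of placing item `c` under node `n`, `previous` maps items to their node in the former iteration, and overflowing items are simply sent to the best-scoring node that still has room. It assumes the total capacity is at least the number of items.

```python
def greedy_rebalance(items, nodes, weight, capacity, previous=None):
    """Assign each item to one node, at most `capacity` items per node."""
    previous = previous or {}
    buckets = {n: [] for n in nodes}

    # Greedy step: every item goes to its highest-weight node.
    for c in items:
        buckets[max(nodes, key=lambda n: weight[c][n])].append(c)

    # Rebalance step: inside each node, items that keep their former
    # assignment come first, the rest follow in descending weight order;
    # whatever exceeds the capacity is collected as overflow.
    overflow = []
    for n in nodes:
        buckets[n].sort(key=lambda c: (previous.get(c) != n, -weight[c][n]))
        overflow += buckets[n][capacity:]
        buckets[n] = buckets[n][:capacity]

    # Overflowing items move to the best node that still has spare room.
    for c in overflow:
        free = [n for n in nodes if len(buckets[n]) < capacity]
        buckets[max(free, key=lambda n: weight[c][n])].append(c)

    return {c: n for n, cs in buckets.items() for c in cs}


toy_weight = {
    "a": {"L": -1.0, "R": -5.0},
    "b": {"L": -2.0, "R": -3.0},
    "c": {"L": -1.5, "R": -4.0},
    "d": {"L": -6.0, "R": -1.0},
}
fresh = greedy_rebalance("abcd", ["L", "R"], toy_weight, capacity=2)
stable = greedy_rebalance("abcd", ["L", "R"], toy_weight, capacity=2, previous={"b": "L"})
print(fresh, stable)
```

In the toy data three items prefer node L. Without history, the weakest of them (b) is moved to R; when b was already under L in the former iteration, it is kept and c is moved instead, which illustrates the stability preference described above.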

3.3. Hierarchical User Preference Representation

As shown in Section 3.1, TDM is a hierarchical retrieval model that generates the candidate items hierarchically from coarse to detailed. In retrieval, a top-down beam search is carried out level by level through the tree index by the user preference prediction model $\mathcal{M}$. Therefore, the tasks of $\mathcal{M}$ in different levels are heterogeneous. Based on this, a layer-specific input of $\mathcal{M}$ is necessary to raise the recommendation accuracy.

A series of related work (Zhang et al., 2017; Davidson et al., 2010; Linden et al., 2003; Koren et al., 2009; Zhou et al., 2018b; Zhu et al., 2017, 2018) has shown that the user’s historical behaviors play a key role in predicting the user’s interests. Besides, since each item in user behaviors is a one-hot ID feature, a common way to generate the deep model’s input is to first embed each item into a continuous feature space. Based on the fact that a non-leaf node is an abstraction of its children in the tree hierarchy, given a user behavior sequence $c = \{c_1, c_2, \cdots, c_m\}$ where $c_i$ is the $i$-th item the user interacts with, we propose to use $c^{l} = \{b_l(\pi(c_1)), b_l(\pi(c_2)), \cdots, b_l(\pi(c_m))\}$ together with the target node and other possible features such as the user profile to generate the input of $\mathcal{M}$ in layer $l$ to predict the user-node preference, as shown in Fig 1(a). In this way, the ancestor nodes of items the user interacts with are used as the abstract user behaviors, with which we replace the original user behavior sequence in training for the corresponding layer. Generally, the hierarchical user preference representation brings two main benefits:

  1. Layer independence. Sharing item embeddings between layers, as is commonly done, brings noise into the training of the user preference prediction model, because the targets differ from layer to layer. An explicit way to solve this is to attach to each item an independent embedding for each layer to generate the input of $\mathcal{M}$. However, this will greatly increase the number of parameters and make the system hard to optimize and apply. The proposed abstract user behaviors generate the input of $\mathcal{M}$ with node embeddings in the corresponding layer and achieve layer independence in training without increasing the number of parameters.

  2. Precise description. $\mathcal{M}$ generates the candidate items hierarchically through the tree index. As the retrieval level increases, the candidate nodes in each level describe the ultimate recommended items from coarsely to precisely until the leaf level is reached. The proposed hierarchical user preference representation captures the nature of the retrieval process and gives a precise description of user behaviors with nodes in the corresponding layer, which promotes the predictability of user preference by reducing the confusion brought by a too detailed or too coarse description. For example, $\mathcal{M}$’s task in upper layers is to coarsely select a candidate set, and the user behaviors are also coarsely described with homogeneous node embeddings in the same upper layers in training and prediction.
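The abstraction of user behaviors at a given layer can be sketched as a simple mapping from each behavior item to its ancestor node in that layer. The code below is an illustration under our own assumptions (a complete binary tree stored heap-style and a dictionary `pi` from item to leaf node id); the embedding lookup that would follow is omitted.

```python
def ancestor_at_level(leaf, leaf_level, level):
    """Ancestor of `leaf` in `level` for a heap-style complete binary tree."""
    node = leaf
    for _ in range(leaf_level - level):
        node = (node - 1) // 2
    return node


def abstract_behaviors(behaviors, pi, leaf_level, level):
    """Replace every behavior item by its ancestor node in the given level."""
    return [ancestor_at_level(pi[c], leaf_level, level) for c in behaviors]


# Toy tree with leaf level 3: leaves are node ids 7..14.
toy_pi = {"x": 7, "y": 8, "z": 14}
toy_behaviors = ["x", "y", "z"]
for level in range(3, -1, -1):
    print(level, abstract_behaviors(toy_behaviors, toy_pi, leaf_level=3, level=level))
```

Items x and y sit under the same parent, so from level 2 upward they collapse to the same node id and would share one node embedding there, while z stays distinct until the root. No per-layer item embeddings are needed, since the node embeddings of each layer are reused.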

4. Experimental Study

We study the performance of our proposed method in this section both offline and online. In offline experiments, we use two large-scale real-world datasets to evaluate different methods: Amazon Books dataset (McAuley et al., 2015; He and McAuley, 2016) and UserBehavior dataset (Zhu et al., 2018). We firstly compare the overall performance of the proposed method with other existing recommendation models to show the effectiveness of the joint learning framework. And then, ablation study results are given to help comprehend how each part of the framework works in detail. At last, we evaluate the proposed framework in Taobao display advertising platform with real online traffic.

4.1. Datasets

The offline experiments are conducted in two large-scale real-world datasets: 1) user-book review dataset from Amazon; 2) user-item behavior dataset from Taobao called UserBehavior. The details are as follows:

Amazon Books: This dataset is made up of product reviews from Amazon. Here we use its largest subset Books and only keep users who have reviewed no less than books. Each review record forms a user-book pair, with the format of user ID, book ID, rating and timestamp.

UserBehavior: It is a subset of Taobao user behavior data, containing about million randomly sampled users who have behaviors from November to December , . Similar to Amazon Books, only users with at least behaviors are kept. Each user-item behavior consists of user ID, item ID, item’s unique category ID, behavior type and timestamp. All behavior types are treated equally in our experiments.

Table 1 summarizes the details of the above two datasets after preprocessing.

                   Amazon Books   UserBehavior
# of users              294,739        969,529
# of items            1,477,922      4,162,024
# of categories           3,700          9,439
# of records          8,654,619    100,020,395
Table 1. Details of the two datasets after preprocessing. One record is a user-item pair that represents user feedback.

4.2. Compared Methods and Experiment Settings


                        Amazon Books                      UserBehavior
Method                  Precision  Recall   F-Measure     Precision  Recall   F-Measure
Item-CF                 0.52%      8.18%    0.92%         1.56%      6.75%    2.30%
YouTube product-DNN     0.53%      8.26%    0.93%         2.25%      10.15%   3.36%
TDM                     0.51%      7.58%    0.89%         2.23%      10.84%   3.40%
TDM-A                   0.56%      8.57%    0.98%         2.81%      13.45%   4.23%
JTM                     0.79%      12.45%   1.38%         3.06%      14.54%   4.61%


Table 2. Comparison results of different methods in Amazon Books and UserBehavior.

To evaluate the performance of the proposed framework, we compare the following methods:

  • Item-CF (Sarwar et al., 2001) is a basic collaborative filtering method and is widely used for personalized recommendation especially for large-scale corpus (Linden et al., 2003).

  • YouTube product-DNN (Covington et al., 2016) is a practical method used in YouTube video recommendation. It’s the representative work of vector kNN search based methods. The inner-product of the learnt user and item’s vector representation reflects the preference. And we use the exact kNN search to retrieve candidates in our experiments.

  • TDM (Zhu et al., 2018) is the tree-based deep model for recommendation. It enables arbitrary advanced models to retrieve user interests using the tree index. We use the proposed DNN version of TDM without tree learning.

  • TDM-A is a variant of TDM without the tree index. The only difference is that it directly learns a user-item preference model and linearly scans all items in prediction to retrieve the top-k candidates. TDM-A is not computationally tractable in an online system but is a strong baseline in offline comparison.

  • JTM is the proposed joint learning framework of the tree index and user preference prediction model.

We follow the settings of TDM (Zhu et al., 2018) to split the dataset. Considering the user amount of two datasets, we randomly sample disjoint users to create Amazon Books’ validation set and testing set, while disjoint users are selected as UserBehavior’s validation and testing set each. Other users in two datasets compose training set accordingly. For each user in validation and testing set, we take the first half of behaviors along the time line as known features and the latter half as ground truth.

We implement YouTube product-DNN in Alibaba’s deep learning platform X-DeepLearning (XDL) and the source code is given. TDM, TDM-A and JTM are also implemented in XDL, and we use the same user preference prediction model for them. The user preference model is a three-layer plain-DNN, each layer of which has , and hidden units respectively with PReLU (Xu et al., 2015) activation function. For all compared methods except Item-CF, we use the same user behavior feature as input. Each user behavior sequence has at most user-item pairs. To utilize the sequential information, input user behaviors are divided into time windows in time order. The ultimate user feature is the concatenation of each time window’s average item embedding vector. YouTube product-DNN uses the inner product of the learnt user and candidate item vector representations to reflect user preference, while other methods compute user-item preference with a DNN using the concatenation of the user feature and the candidate item’s embedding as input. We deploy negative sampling for all methods except Item-CF and use the same negative sampling ratio. One implicit feedback has negative samples in Amazon Books and in UserBehavior.
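The construction of the user feature from time windows can be sketched as follows. This is an illustration under our own assumptions: behaviors are already ordered by time and embedded, windows are formed by splitting the sequence into equally sized chunks, and empty windows are padded with zeros. The number of windows is passed as a parameter (`num_windows`, a name we introduce), and the actual windowing rule in XDL may differ.

```python
def user_feature(behavior_embeddings, num_windows):
    """Concatenate the average embedding of each time window, oldest first."""
    dim = len(behavior_embeddings[0])
    size = -(-len(behavior_embeddings) // num_windows)  # ceiling division
    feature = []
    for w in range(num_windows):
        window = behavior_embeddings[w * size:(w + 1) * size]
        if window:
            avg = [sum(v[i] for v in window) / len(window) for i in range(dim)]
        else:
            avg = [0.0] * dim  # zero padding for empty windows (an assumption)
        feature.extend(avg)
    return feature


toy_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
toy_feature = user_feature(toy_embeddings, num_windows=2)
print(toy_feature)
```

The resulting vector has a fixed length of `num_windows` times the embedding dimension regardless of how many behaviors the user has, which is what allows it to be concatenated with a candidate embedding and fed to a plain DNN.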

TDM and JTM require an initial tree before the training process. Amazon Books uses a random tree, in which items are randomly arranged in the leaf layer, since there are no finer categories under books for about of books. By taking advantage of the item-category relation of UserBehavior, a category tree can be created, where items from the same category aggregate in the leaf layer. The tree learning layer gap is set to in all experiments that have joint optimization.

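Precision, Recall and F-Measure are three general metrics, and we use them to evaluate the performance of different methods. For a user $u$, suppose $\mathcal{P}_u$ ($|\mathcal{P}_u| = M$) is the recalled set and $\mathcal{G}_u$ is the ground truth set. The equations of the three metrics are

$$\mathrm{Precision}@M(u) = \frac{|\mathcal{P}_u \cap \mathcal{G}_u|}{M}, \qquad \mathrm{Recall}@M(u) = \frac{|\mathcal{P}_u \cap \mathcal{G}_u|}{|\mathcal{G}_u|},$$

$$\mathrm{F\text{-}Measure}@M(u) = \frac{2 \cdot \mathrm{Precision}@M(u) \cdot \mathrm{Recall}@M(u)}{\mathrm{Precision}@M(u) + \mathrm{Recall}@M(u)}.$$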
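For a single user, the three metrics can be computed as in the sketch below. It uses the standard set-based definitions; the function name `precision_recall_f` and the convention of returning 0 for empty inputs are assumptions made for the illustration.

```python
from typing import Iterable, Tuple


def precision_recall_f(recalled: Iterable[int], truth: Iterable[int]) -> Tuple[float, float, float]:
    """Per-user Precision, Recall and F-Measure for a recalled set and a ground truth set."""
    recalled_set, truth_set = set(recalled), set(truth)
    hits = len(recalled_set & truth_set)
    # Assumption: a metric with an empty denominator is reported as 0.
    precision = hits / len(recalled_set) if recalled_set else 0.0
    recall = hits / len(truth_set) if truth_set else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


p, r, f = precision_recall_f([1, 2, 3, 4], [2, 4, 6])
print(round(p, 4), round(r, 4), round(f, 4))  # 0.5 0.6667 0.5714
```

Dataset-level numbers such as those in the tables would then be averages of these per-user values over the test users.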

4.3. Comparison Results

Table 2 shows the quantitative results of all methods in two datasets. It clearly shows that our proposed JTM outperforms other baselines in all metrics. Compared with the previous best model TDM-A in two datasets, JTM achieves and recall lift in Amazon Books and UserBehavior respectively.

As mentioned in Section 4.2, though computationally intractable in online system, TDM-A is a significantly strong baseline for offline comparison and the theoretical upper-bound for YouTube product-DNN and tree-based models (TDM and JTM) on the condition of similar DNN user preference model. Comparison results of TDM-A and other methods give insights in many aspects.

Firstly, the results of YouTube product-DNN and TDM-A indicate the limitation of the inner-product form. These two methods adopt the same input. The difference is that YouTube product-DNN produces the ranked list with the inner product of the learnt user and item vectors, while TDM-A computes the score with a DNN that takes the concatenation of the user and item vectors as input. Such a slight change brings an apparent improvement, which verifies the effectiveness of the neural network over the inner-product form.

Next, TDM performs worse than TDM-A as a result of the tree hierarchy. The tree hierarchy takes effect in both the training and prediction processes. User-node samples are generated along the tree to fit a max-heap like preference distribution, and layer-wise beam search is deployed in the tree index at prediction time. Without a well-defined tree hierarchy, the user preference prediction model may converge to a suboptimal version because the generated samples are confusing, and targets may be lost in the non-leaf layers so that inaccurate candidate sets are returned. Especially in sparse data like Amazon Books, the learnt embedding of each node in the tree hierarchy is not distinguishable enough, so TDM with a random tree does not perform as well as other baselines. This phenomenon illustrates the influence of the tree and the necessity of learning a more advanced tree.
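Layer-wise beam search, and the way a target can be lost in a non-leaf layer, can be illustrated with a small sketch. The array layout (node `i` has children `2*i + 1` and `2*i + 2`), the function name `beam_search`, and the toy scores are assumptions chosen for the illustration; in the real system the score would come from the user preference prediction model.

```python
from typing import Callable, List


def beam_search(num_nodes: int, score: Callable[[int], float], k: int) -> List[int]:
    """Layer-wise beam search over a complete binary tree stored as an array.

    Node 0 is the root and node i has children 2*i + 1 and 2*i + 2.
    Returns up to k leaf nodes, best first.
    """
    beam = [0]
    leaves: List[int] = []
    while beam:
        children: List[int] = []
        for node in beam:
            kids = [c for c in (2 * node + 1, 2 * node + 2) if c < num_nodes]
            if kids:
                children.extend(kids)
            else:
                leaves.append(node)  # this node has no children, so it is a leaf
        # Keep only the k highest-scoring nodes of the next layer.
        beam = sorted(children, key=score, reverse=True)[:k]
    return sorted(leaves, key=score, reverse=True)[:k]


# Toy scores that violate the max-heap like property: leaf 5 scores 0.95,
# but its parent (node 2) scores only 0.2.
scores = {0: 1.0, 1: 0.9, 2: 0.2, 3: 0.1, 4: 0.8, 5: 0.95, 6: 0.3}
print(beam_search(7, scores.__getitem__, k=1))  # [4]
print(beam_search(7, scores.__getitem__, k=2))  # [5, 4]
```

With a beam of width 1, node 2 is pruned at the second layer, so the best leaf (node 5) is never reached. This is the failure mode described above when the tree and the model do not agree.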

By jointly learning the tree index and the user preference prediction model, JTM outperforms TDM-A on all metrics in the two datasets. A more precise user preference prediction model and a more advanced tree hierarchy are obtained, as well as better item set selection. Hierarchical user preference representation alleviates the data sparsity problem in upper layers, because the feature space of the user behavior feature is much smaller while the number of samples stays the same. It also helps model training in a layer-wise way, which reduces the propagation of noise between layers. Besides, tree hierarchy learning makes similar items aggregate in the leaf layer, so that the internal layer models get training samples with a more consistent and unambiguous distribution. Thanks to these two factors, a unified optimization of the hierarchical user preference model and the tree index makes it possible for JTM to provide better item set selection than TDM-A. A more specific analysis of each part can be found in Section 4.4.

4.4. Ablation Study

Hierarchical User Preference Representation

To explore why the proposed hierarchical user preference representation works, we perform additional experiments on three variants of user preference representation in the tree-based model on the two datasets. The tree-based model samples target nodes from all layers of the tree and uses the concatenated embeddings of the user behaviors and the target node as input. The three variants differ in their user behavior features, and all of them use a fixed initial tree as explained in Section 4.2, with no tree learning algorithm adopted. More details are as follows:

  • TDM is the basic tree-based model introduced in Section 4.2. Each node has only one embedding. The user behavior feature for one user is exactly the same for samples of all layers.

  • JTM-HI is an advanced version of TDM which uses layer-independent feature space. More specifically, the user behavior features are directly mapped to different embedding spaces when training different layers’ models. Compared to TDM, the parameter size increases multiple times according to the height of the tree.

  • JTM-H is TDM with hierarchical user preference representation. It simplifies JTM-HI by taking advantage of the tree hierarchy. As in TDM, each node in the tree has only one embedding. User behaviors in the leaf layer map to the corresponding layers naturally through the embeddings of their ancestors in the tree. No additional parameters are needed.
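The mapping used by JTM-H, where each behavior item is replaced by its ancestor in the layer being trained, can be sketched as follows. The array layout of the tree (node `i` has parent `(i - 1) // 2`) and all function names are assumptions made for the illustration; the paper does not prescribe a storage format.

```python
from typing import List


def level_of(node: int) -> int:
    """Depth of a node in a complete binary tree stored as an array (root = level 0)."""
    return (node + 1).bit_length() - 1


def ancestor_at_level(node: int, level: int) -> int:
    """Walk up from a node until reaching the requested level."""
    while level_of(node) > level:
        node = (node - 1) // 2  # parent index in the array layout
    return node


def hierarchical_behaviors(behaviors: List[int], level: int) -> List[int]:
    """Replace each leaf-level behavior by its ancestor in the given level."""
    return [ancestor_at_level(leaf, level) for leaf in behaviors]


# A tree with 15 nodes has its leaves at indices 7..14 (level 3).
behaviors = [7, 8, 12]
print(hierarchical_behaviors(behaviors, 3))  # [7, 8, 12]
print(hierarchical_behaviors(behaviors, 2))  # [3, 3, 5]
print(hierarchical_behaviors(behaviors, 1))  # [1, 1, 2]
```

Because the mapped behaviors are existing tree nodes, their embeddings are the node embeddings the model already has, which is why no extra parameters are introduced. Distinct leaves also collapse onto the same ancestor in upper layers, shrinking the feature space there.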


Dataset        Method   Precision   Recall    F-Measure
Amazon Books   TDM      0.51%       7.58%     0.89%
               JTM-HI   0.59%       8.53%     1.02%
               JTM-H    0.69%       10.71%    1.22%
UserBehavior   TDM      2.23%       10.84%    3.40%
               JTM-HI   2.40%       11.44%    3.62%
               JTM-H    2.66%       12.93%    4.02%


Table 3. Evaluations of hierarchical representation for user preference model in tree-based models in Amazon Books and UserBehavior.

From Table 3, we have several observations. JTM-HI outperforms TDM in both datasets, which indicates that the layer-independent feature space indeed reduces the noise brought by sharing the embedding space of the user behavior feature across all layers of the tree. JTM-H achieves higher performance than JTM-HI with fewer parameters, which demonstrates that hierarchical user preference representation works well. On the one hand, the tree hierarchy provides a natural hierarchical representation. Node embeddings in the same layer of the tree are homogeneous, so it is easier to capture latent feature crosses within the same layer than between leaf and non-leaf layers. On the other hand, with hierarchical user preference representation, the parameter space of the user behavior feature shrinks considerably in upper layers, which partially solves the data sparsity problem.

In UserBehavior, the recall metric rises from 10.84% to 11.44% with the layer-independent feature, as a result of alleviating feature confusion between layers. Another recall improvement, from 11.44% to 12.93%, comes from the homogeneity and appropriate granularity of features inside each layer. The relative improvements are more significant in Amazon Books, because the data sparsity problem is more serious and the randomly initialized tree introduces more inter-layer noise, which can be well solved by the proposed method.

Iterative Joint Learning

[Figure 2: six panels. (a) Precision@200, (b) Recall@200, (c) F-Measure@200 in Amazon Books; (d) Precision@200, (e) Recall@200, (f) F-Measure@200 in UserBehavior.]

Figure 2. Results of iterative joint learning in two datasets. 2(a), 2(b), 2(c) are results in Amazon Books and 2(d), 2(e), 2(f) show the performance in UserBehavior. The horizontal axis of each figure represents the number of iterations, and the vertical axis denotes the value of the corresponding metric. Joint Learning represents results learnt with the proposed tree learning algorithm and Clustering indicates the results of using k-means clustering to learn the tree. Both models here adopt hierarchical user preference representation.

The tree hierarchy decides sample generation and the search path. A suitable tree benefits model training and inference a great deal. Fig 2 compares iterative joint learning with the clustering-based tree learning algorithm proposed in TDM (Zhu et al., 2018) against iterative joint learning with the proposed tree learning algorithm.

The proposed tree learning algorithm has two merits: 1) it can converge to an optimal tree stably; 2) tree learning and user preference prediction model training share the same goal, guaranteeing the accuracy of recommendation. From Fig 2, we can see that results increase iteratively on all three metrics. Besides, though the clustering method has better results at early iterations, the proposed tree learning algorithm makes the model stably converge to a better result through joint learning in both datasets. The above results demonstrate the effectiveness of iterative joint learning. It helps optimize the information maintained in the tree hierarchy, thus facilitating training and inference.

Joint Performance of User Preference Prediction Model and Tree Learning

To further study the contribution of each part and their joint performance, we perform some contrast experiments on each part of JTM. Detailed descriptions are as follows:

  • TDM is the basic tree-based model. It uses a plain-DNN for user preference model and applies no tree learning.

  • JTM-J learns tree hierarchy and user preference prediction model jointly and iteratively. It adopts the same user preference prediction model as TDM.

  • JTM-H deploys hierarchical representation in user preference prediction model with the fixed initial tree hierarchy.

  • JTM optimizes the user preference prediction model with hierarchical representations and the tree hierarchy alternately in a joint framework.

The corresponding results of the above variants are in Table 4. (Note that JTM-J and JTM jointly optimize the user preference model and the tree hierarchy iteratively; here we only list the converged results of these two methods.) Take the recall metric as an example. In Amazon Books, the recall increases from 7.58% to 10.71% with hierarchical representation. Limited by the confusion and difficulty from the initial feature constitution and the random tree hierarchy, only a slight increase occurs after joint learning of the tree hierarchy and the former user preference prediction model. However, recall rises to 12.45% by adopting the joint learning framework of more expressive features and tree learning. A more obvious comparison can be seen in UserBehavior. Tree learning and hierarchical representation of user preference bring 0.88% (TDM 10.84% → JTM-J 11.72%) and 2.09% (TDM 10.84% → JTM-H 12.93%) absolute gain separately on the recall metric. Furthermore, an improvement of up to 3.70% (TDM 10.84% → JTM 14.54%) absolute recall is achieved by simultaneous optimization of both, which is more than the sum of 0.88% and 2.09%.


Dataset        Method   Precision   Recall    F-Measure
Amazon Books   TDM      0.51%       7.58%     0.89%
               JTM-J    0.51%       7.60%     0.89%
               JTM-H    0.69%       10.71%    1.22%
               JTM      0.79%       12.45%    1.38%
UserBehavior   TDM      2.23%       10.84%    3.40%
               JTM-J    2.48%       11.72%    3.73%
               JTM-H    2.66%       12.93%    4.02%
               JTM      3.06%       14.54%    4.61%


Table 4. Performances of joint learning in Amazon Books and UserBehavior.

The results in Table 4 clearly show the effectiveness of hierarchical representation and tree learning, as well as of the joint learning framework. The joint learning of the user preference model with hierarchical representation and the learnt tree brings a larger improvement than the arithmetic sum of the two individual improvements on all metrics. Thus it is beneficial to optimize the user preference model with hierarchical representation and learn the tree hierarchy in a unified framework.
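The "more than the sum" claim can be checked directly from the numbers in Table 4. The short script below is only a verification aid; the dictionary layout and the function name `gains` are assumptions, while the values are copied from the table.

```python
from typing import Dict, Tuple

METRICS = ("Precision", "Recall", "F-Measure")

# Values in percent, copied from Table 4.
TABLE_4 = {
    "Amazon Books": {
        "TDM": (0.51, 7.58, 0.89), "JTM-J": (0.51, 7.60, 0.89),
        "JTM-H": (0.69, 10.71, 1.22), "JTM": (0.79, 12.45, 1.38),
    },
    "UserBehavior": {
        "TDM": (2.23, 10.84, 3.40), "JTM-J": (2.48, 11.72, 3.73),
        "JTM-H": (2.66, 12.93, 4.02), "JTM": (3.06, 14.54, 4.61),
    },
}


def gains(dataset: str) -> Dict[str, Tuple[float, float, float]]:
    """Absolute gains over TDM for tree learning, hierarchical representation, and both."""
    rows = TABLE_4[dataset]
    base = rows["TDM"]
    result = {}
    for idx, metric in enumerate(METRICS):
        tree = round(rows["JTM-J"][idx] - base[idx], 2)
        hier = round(rows["JTM-H"][idx] - base[idx], 2)
        joint = round(rows["JTM"][idx] - base[idx], 2)
        result[metric] = (tree, hier, joint)
    return result


for dataset in TABLE_4:
    for metric, (tree, hier, joint) in gains(dataset).items():
        print(dataset, metric, tree, hier, joint, joint > round(tree + hier, 2))
```

For UserBehavior recall, for example, the gains are 0.88 and 2.09 points separately and 3.70 points jointly, and the final column is `True` for every dataset and metric.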

4.5. Online Results

The proposed JTM is also evaluated in production environments. We conduct the experiments in the display advertising scenario of the Guess What You Like column of the Taobao App Homepage. We use click-through rate (CTR) and revenue per mille (RPM) to measure the performance, which are the key performance indicators for online display advertising. The equations of the online metrics are:

$$\mathrm{CTR} = \frac{\#\text{ of clicks}}{\#\text{ of impressions}}, \qquad \mathrm{RPM} = \frac{\text{Ad revenue}}{\#\text{ of impressions}} \times 1000.$$

In our advertising systems, advertisers bid on plenty of granularities, such as ad clusters, items, shops, etc. Several simultaneously running recommendation approaches in all granularities produce candidate sets, and their combination is passed to subsequent stages, like CTR prediction (Zhou et al., 2018b, a), ranking (Zhu et al., 2017; Jin et al., 2018), etc. Our baseline is such a combination of all running recommendation methods. To assess the effectiveness of JTM, we deploy JTM to replace Item-CF, which is one of the major candidate-generation approaches at the granularity of items in our systems. TDM is evaluated in the same way as JTM. The corpus contains tens of millions of items. Each comparison bucket has of all online traffic. We have completed the first version of JTM and evaluated its performance online. The results are presented in Table 5.

Method CTR RPM
Baseline 0.0% 0.0%
TDM +5.4% +7.6%
JTM +11.3% +12.9%
Table 5. Online results from Jan 21 to Jan 27, 2019 in Guess What You Like column of Taobao App Homepage.

Table 5 reveals the lift on the two online metrics. The 11.3% growth in CTR shows that more precise items have been recommended with JTM. As for RPM, it has a 12.9% improvement, indicating that JTM can bring more income for the Taobao advertising platform. The evaluated JTM only uses a basic fully connected network to capture user preference. Under the joint optimization framework, a more advanced user preference model can achieve more significant performance improvements. Note that TDM is a strong baseline with significant improvement; nevertheless, JTM still shows a further gain over TDM in both CTR and RPM in exactly the same scenario.

5. Conclusion

Recommender systems play a key role in various kinds of applications such as video streaming and e-commerce. In this paper, we propose a joint learning framework of the tree index and the user preference prediction model used in tree-based deep recommendation models. The tree index and the deep model are alternately optimized under a global loss function. An efficient greedy algorithm is proposed for tree learning. Besides, a novel hierarchical user preference representation is proposed to describe user behaviors precisely by utilizing the tree hierarchy. Both online and offline experimental results show the advantages of the proposed framework over other related large-scale recommendation models.


  • Agrawal et al. (2013) Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web. ACM, 13–24.
  • Beutel et al. (2018) Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H Chi. 2018. Latent Cross: Making Use of Context in Recurrent Recommender Systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 46–54.
  • Bottou (2010) Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010. Springer, 177–186.
  • Cao et al. (2016) Yue Cao, Mingsheng Long, Jianmin Wang, Han Zhu, and Qingfu Wen. 2016. Deep Quantization Network for Efficient Image Retrieval. In AAAI. 3457–3463.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
  • Choromanska and Langford (2015) Anna E Choromanska and John Langford. 2015. Logarithmic time online multiclass prediction. In Advances in Neural Information Processing Systems. 55–63.
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In ACM Conference on Recommender Systems. 191–198.
  • Daumé III et al. (2017) Hal Daumé III, Nikos Karampatziakis, John Langford, and Paul Mineiro. 2017. Logarithmic time one-against-some. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 923–932.
  • Davidson et al. (2010) James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. 2010. The YouTube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 293–296.
  • Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. 297–304.
  • Han et al. (2018) Lei Han, Yiheng Huang, and Tong Zhang. 2018. Candidates vs. Noises Estimation for Large Multi-Class Classification Problem. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 1890–1899.
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web. International World Wide Web Conferences Steering Committee, 507–517.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
  • Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
  • Jain et al. (2016) Himanshu Jain, Yashoteja Prabhu, and Manik Varma. 2016. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 935–944.
  • Jin et al. (2018) Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. 2018. Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising. arXiv preprint arXiv:1802.09756 (2018).
  • Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
  • Kraska et al. (2018) Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 489–504.
  • Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. arXiv preprint arXiv:1803.05170 (2018).
  • Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 7, 1 (2003), 76–80.
  • Liu et al. (2005) Ting Liu, Andrew W Moore, Ke Yang, and Alexander G Gray. 2005. An investigation of practical approximate nearest neighbor algorithms. In Advances in neural information processing systems. 825–832.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
  • Okura et al. (2017) Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based news recommendation for millions of users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1933–1942.
  • Prabhu et al. (2018) Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. 2018. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 993–1002.
  • Prabhu and Varma (2014) Yashoteja Prabhu and Manik Varma. 2014. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 263–272.
  • Rendle (2010) Steffen Rendle. 2010. Factorization Machines. In IEEE International Conference on Data Mining. 995–1000.
  • Salakhutdinov and Mnih (2007) Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In International Conference on Neural Information Processing Systems. 1257–1264.
  • Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In International Conference on World Wide Web. 285–295.
  • Tan et al. (2016) Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 17–22.
  • Weston et al. (2013) Jason Weston, Ameesh Makadia, and Hector Yee. 2013. Label partitioning for sublinear ranking. In International Conference on Machine Learning. 181–189.
  • Wu et al. (2017) Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. 2017. Recurrent recommender networks. In Proceedings of the tenth ACM international conference on web search and data mining. ACM, 495–503.
  • Xu et al. (2015) Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853 (2015).
  • Zhang et al. (2017) Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. (2017).
  • Zhou et al. (2018a) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2018a. Deep Interest Evolution Network for Click-Through Rate Prediction. arXiv preprint arXiv:1809.03672 (2018).
  • Zhou et al. (2018b) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018b. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1059–1068.
  • Zhu et al. (2017) Han Zhu, Junqi Jin, Chang Tan, Fei Pan, Yifan Zeng, Han Li, and Kun Gai. 2017. Optimized Cost Per Click in Taobao Display Advertising. In Proceedings of the 23rd ACM SIGKDD Conference. ACM, 2191–2200.
  • Zhu et al. (2018) Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning Tree-based Deep Model for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1079–1088.