1. Introduction
Recommendation has become an increasingly popular means of helping users acquire information from content providers. Personalized recommendation methods have been extensively studied and adopted in applications such as video streaming (Davidson et al., 2010; Covington et al., 2016), news recommendation (Okura et al., 2017) and e-commerce (Zhu et al., 2018).
The recommendation problem is essentially to retrieve a set of the most relevant or preferred items for each user request from the entire corpus. In large-scale recommendation practice, algorithm design must strike a balance between accuracy and efficiency. For a corpus with tens or hundreds of millions of items, methods that linearly scan every item’s preference score for each user request are not computationally tractable. To solve this computational problem, an index structure is commonly used to accelerate the retrieval process. In early recommender systems, item-based collaborative filtering (item-CF) combined with an inverted index was a popular solution to overcome the calculation barrier (Linden et al., 2003). In item-CF based systems, the pre-calculated item similarity is used to build an inverted index, from which the items most similar to a user’s historical behaviors can be retrieved quickly and precisely. However, the scope of the candidate set is limited, because only items similar to the user’s historical behaviors can ultimately be recommended.
In recent years, vector representation based methods such as matrix factorization (Salakhutdinov and Mnih, 2007; Koren et al., 2009), factorization machines (Rendle, 2010) and deep learning models (Covington et al., 2016; Okura et al., 2017; Beutel et al., 2018) have been actively researched. These methods learn vector representations of users and items, whose inner product represents the user-item preference. For systems that use vector representation based methods, generating the recommendation set is equivalent to the k-nearest neighbor (kNN) search problem. Quantization-based indexes (Liu et al., 2005; Johnson et al., 2017) for approximate kNN search are widely adopted to accelerate the retrieval process. However, in this solution, the vector representation learning and the kNN search index construction are optimized separately toward different objectives. Vector representation learning aims to minimize the estimation error of user-item preference, while index construction usually minimizes the quantization error. The divergence between these two objectives leads to suboptimal vector representations and index structure (Cao et al., 2016). An even more important problem is that the dependence on a vector kNN search index requires an inner-product form of user preference modeling, which limits model capability (He et al., 2017). For example, models such as Deep Interest Network (Zhou et al., 2018b), Deep Interest Evolution Network (Zhou et al., 2018a) and xDeepFM (Lian et al., 2018), which have proven effective in user preference prediction, cannot be used to generate candidates in recommendation.
To break the inner-product form limitation and make arbitrary advanced user preference models computationally tractable for retrieving candidates from the entire corpus, our previous work Tree-based Deep Model (TDM) (Zhu et al., 2018) creatively uses a tree structure as the index and greatly improves recommendation accuracy. TDM uses a tree hierarchy to organize items, and each leaf node in the tree corresponds to an item. Like a max-heap, TDM assumes that each user-node preference is the largest among the preferences of all the node’s children. In the training stage, a user-node preference prediction model is trained to fit the max-heap like preference distribution. Unlike vector kNN search based methods, where the index structure requires an inner-product form of user preference modeling, TDM places no restriction on the form of the preference model. In prediction, preference scores given by the trained model are used to perform layer-wise beam search in the tree index to retrieve the candidate items. The time complexity of beam search in the tree index is logarithmic w.r.t. the corpus size, without restriction on model structures, which is a prerequisite for making advanced user preference models feasible for candidate retrieval in recommendation.
The index structure plays different roles in kNN search based methods and tree-based methods. In kNN search based methods, the vector representations of users and items are learnt first, and the vector search index is built afterwards. In tree-based methods, by contrast, the tree index’s hierarchy also affects the retrieval model training. Therefore, how to learn the tree index and the user preference model jointly is an important problem. Tree-based methods are also an active research topic in the extreme classification literature (Weston et al., 2013; Agrawal et al., 2013; Prabhu and Varma, 2014; Choromanska and Langford, 2015; Daumé III et al., 2017; Han et al., 2018; Prabhu et al., 2018), a task that is sometimes considered the same as recommendation (Jain et al., 2016; Prabhu et al., 2018). In the existing tree-based methods, the tree structure is learnt to obtain a better hierarchy in the sample or label space. However, the objective of the sample or label partitioning task in the tree learning stage is not fully consistent with the ultimate target, i.e., accurate recommendation. This inconsistency between the objectives of index learning and prediction model training leaves the overall system in a suboptimal state. To address this challenge and facilitate better cooperation between the tree index and the user preference prediction model, we focus on developing a way to simultaneously learn the tree hierarchy and the user preference prediction model by optimizing a unified performance measure. The main contributions of this paper are summarized as follows:

We propose a joint optimization framework to learn the tree structure and the user preference prediction model in tree-based recommendation, where a unified performance measure, i.e., the accuracy of user preference prediction, is optimized.

We demonstrate that the proposed tree structure learning problem is equivalent to the maximum weighted matching problem on a bipartite graph, and give an approximate algorithm to learn the tree structure.

We propose a novel method that makes better use of the tree index to generate a hierarchical user representation, which helps learn a more accurate user preference prediction model.

We show that both the tree structure learning and the hierarchical user representation improve recommendation accuracy. The two modules can even reinforce each other to achieve a more significant performance improvement.
The remainder of this paper is organized as follows. In Section 2, we compare several large-scale recommendation methods to show their differences. In Section 3, we first give a brief introduction to our previous work TDM to make this paper self-contained, and then describe the proposed joint learning method in detail. In Section 4, experimental results of both offline comparison and an online A/B test are given to show the effectiveness of the proposed methods. Finally, we conclude our work in Section 5.
2. Related Work
In real-world applications, the recommendation process usually has two stages: candidate generation and ranking (Davidson et al., 2010; Zhu et al., 2018, 2017). Model-based large-scale recommendation methods are usually confronted with computational restrictions in the candidate generation stage. To overcome the calculation barrier, there are mainly three kinds of approaches: 1) pre-calculate item or user similarities and use an inverted index to accelerate retrieval (Linden et al., 2003); 2) convert user preference to the distance between embedding vectors, and use approximate kNN search in retrieval (Covington et al., 2016); 3) use a tree or an ensemble of trees to perform efficient retrieval (Zhu et al., 2018).
Industrial recommender systems typically adopt vector kNN search to achieve fast retrieval, e.g., YouTube video recommendation (Covington et al., 2016; Beutel et al., 2018), Yahoo news recommendation (Okura et al., 2017) and extensions that use recurrent neural networks to model user behavior sequences (Hidasi et al., 2015; Tan et al., 2016; Wu et al., 2017). Such approaches use either a traditional deep neural network (DNN) or a recurrent neural network (RNN) to learn embedding representations of users and items from various user behavioral and contextual data. However, because retrieval depends on approximate kNN search index structures, user preference models that use attention networks or cross features (Zhou et al., 2018b, a; Cheng et al., 2016) are difficult to apply.
Tree-based methods are also studied and adopted in real-world applications. Label Partitioning for Sublinear Ranking (LPSR) (Weston et al., 2013) uses k-means clustering on data points’ features to learn the tree hierarchy and then assigns labels to leaf nodes. In the prediction stage, the test sample is passed down the tree to a leaf node according to its distance to each node’s cluster center, and a 1-vs-All base classifier is used to rank all labels belonging to the retrieved leaf node. Partitioned Label Trees (Parabel) (Prabhu et al., 2018) also uses recursive clustering to build the tree hierarchy, but the tree is built to partition the labels according to label similarities. Multi-label Random Forest (MLRF) (Agrawal et al., 2013) and FastXML (Prabhu and Varma, 2014) learn an ensemble of sample partitioning trees (a forest), and in prediction return a ranked list of the most frequent labels in all the leaf nodes retrieved from the forest. MLRF optimizes the Gini index when splitting nodes, and FastXML optimizes a combined loss function including a binary classification loss and a label ranking loss. In all the above methods, the tree structure stays unchanged in training and prediction once built, so the tree can hardly adapt to the retrieval model dynamically.
Our previous work TDM (Zhu et al., 2018) introduces a tree-based model for large-scale recommendation that differs from existing tree-based methods through a max-heap like user-node preference formulation. In TDM, the tree is used as a hierarchical index (Kraska et al., 2018), and an attention model (Zhou et al., 2018b) is trained to predict user-node preference. Unlike most tree-based methods, where non-leaf nodes are used to route decision-making to leaves, TDM explicitly formulates user-node preference for all nodes to facilitate hierarchical beam search in the tree index. Despite the remarkable progress it achieves, the joint optimization problem of index and model is not yet well solved, because its alternating learning of model and tree optimizes different objectives.
3. Joint Optimization of Tree-based Index and Deep Model
In this section, we first give a brief review of TDM (Zhu et al., 2018). TDM uses a tree hierarchy as the index and allows arbitrary advanced deep models as the user preference prediction model in recommendation. We then propose the joint learning framework of the tree-based index and deep model. It alternately optimizes the index and the prediction model under a global loss function. A greedy-based tree learning algorithm is proposed to optimize the index. In the last subsection, we specify the hierarchical user preference representation used in model training.
3.1. Tree-based Deep Recommendation Model
A recommender system needs to return a candidate set containing items that a user is interested in from the item corpus. In practice, effective and efficient retrieval from a large item corpus is a challenging problem. TDM uses a tree as the index and creatively proposes a max-heap like probability formulation on the tree, where the user preference for each non-leaf node $n$ in level $l$ is derived as:
$$p^{(l)}(n \mid u) = \frac{\max_{n_c \in \{n\text{'s children in level } l+1\}} p^{(l+1)}(n_c \mid u)}{\alpha^{(l)}} \qquad (1)$$
where $p^{(l)}(n \mid u)$ is the ground truth probability that the user $u$ prefers the node $n$, and $\alpha^{(l)}$ is a layer normalization term. The above formulation means that the ground truth user-node probability on a node equals the maximum user-node probability of its children divided by a normalization term. Therefore, the top-k nodes in level $l$ must be contained in the children of the top-k nodes in level $l-1$, and the retrieval of the top-k leaf items can be restricted to the top-k nodes in each layer without losing accuracy. Based on this, TDM turns the recommendation task into a hierarchical retrieval problem. By a top-down retrieval process, the candidate items are selected gradually from coarse to detailed. The candidate generating process of TDM is shown in Fig. 1.
Each item in the corpus is first assigned to a leaf node of a tree hierarchy $\mathcal{T}$. The non-leaf nodes can be seen as a coarser abstraction of their children. In retrieval, the user information combined with the node to be scored is first vectorized into a user preference representation, which serves as the input of a deep neural network $\mathcal{M}$ (e.g. fully connected networks). The probability that the user is interested in the node is then returned by $\mathcal{M}$, as shown in Fig. 1(a). When retrieving the top-k items (leaf nodes), a top-down beam search is carried out level by level, as shown in Fig. 1(b). In level $l$, only the children of the nodes with top-k probabilities in level $l-1$ are scored and sorted to pick $k$ candidate nodes. This process continues until $k$ leaf items are reached.
With the tree index, the overall retrieval complexity for a user request is reduced from linear to logarithmic w.r.t. the size of the item corpus, with no restrictions on the preference model structure. This allows TDM to break the inner-product restriction on user preference modeling imposed by the vector kNN search index and enables arbitrary advanced deep models to retrieve candidates from the entire corpus, which greatly raises recommendation accuracy.
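The layer-wise beam search described above can be sketched as follows. This is a minimal illustration rather than the TDM implementation: it assumes a complete binary tree stored in heap order (node 0 is the root, the children of node i are 2i+1 and 2i+2), and `score` is a hypothetical stand-in for the preference model's output for one fixed user. The helper `heap_scores` builds toy ground-truth preferences that satisfy the max-heap property with the normalization term set to 1.

```python
from typing import Callable, List


def heap_scores(leaf_pref: List[float]) -> List[float]:
    """Toy ground truth: each internal node takes the max of its two children."""
    n_leaves = len(leaf_pref)  # assumed to be a power of two
    scores = [0.0] * (n_leaves - 1) + list(leaf_pref)
    for node in range(n_leaves - 2, -1, -1):
        scores[node] = max(scores[2 * node + 1], scores[2 * node + 2])
    return scores


def beam_search(score: Callable[[int], float], max_level: int, k: int) -> List[int]:
    """Return up to k leaf node ids found by top-down beam search."""
    beam = [0]  # start from the root in level 0
    for _ in range(max_level):
        # Only children of the nodes kept in the previous level are scored.
        children = [c for n in beam for c in (2 * n + 1, 2 * n + 2)]
        children.sort(key=score, reverse=True)
        beam = children[:k]
    return beam


# Eight leaves (node ids 7..14) in a tree whose max level is 3.
scores = heap_scores([0.1, 0.9, 0.3, 0.2, 0.8, 0.05, 0.4, 0.7])
top2 = beam_search(lambda n: scores[n], max_level=3, k=2)
print(top2)  # [8, 11], the two leaves with the highest preference
```

At most 2k nodes are scored per level in this binary-tree sketch, so the number of model evaluations grows with k times the tree depth rather than with the number of leaves.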
3.2. Joint Optimization Framework
According to the retrieval process, the recommendation accuracy of TDM is determined by the quality of the user preference model $\mathcal{M}$ and the tree index $\mathcal{T}$. Given $n$ pairs of positive training data $\{(u_i, c_i)\}_{i=1}^{n}$, where each pair means that user $u_i$ is interested in target item $c_i$, $\mathcal{T}$ determines which non-leaf nodes $\mathcal{M}$ should select to reach $c_i$ for $u_i$. Instead of learning $\mathcal{M}$ and $\mathcal{T}$ separately as in previous and related works, we propose to learn them jointly with a global loss function. As we will see in the experiments, jointly optimizing $\mathcal{M}$ and $\mathcal{T}$ improves the ultimate recommendation accuracy.
Denote $p(\pi(c) \mid u; \pi)$ as user $u$’s preference probability over leaf node $\pi(c)$ given a user-item pair $(u, c)$, where $\pi(\cdot)$ is a projection function that projects an item to a leaf node in $\mathcal{T}$. Note that $\pi(\cdot)$ actually determines the item hierarchy in the tree, as shown in Fig. 1(b). The model $\mathcal{M}$ is used to estimate and output the user-node preference $\hat{p}(\pi(c) \mid u; \theta, \pi)$, given $\theta$ as model parameters. If the pair $(u, c)$ is a positive sample, we have the ground truth preference $p(\pi(c) \mid u; \pi) = 1$ following the multi-class setting (Covington et al., 2016; Beutel et al., 2018). According to the max-heap property, the user preference probability of all of $\pi(c)$’s ancestor nodes, i.e., $\{p(b_j(\pi(c)) \mid u; \pi)\}_{j=0}^{l_{max}}$, should also be $1$, in which $b_j(\cdot)$ is the projection from a node to its ancestor node in level $j$ and $l_{max}$ is the max level in $\mathcal{T}$. To fit such a user-node preference distribution, the global loss function is formulated as
$$L(\theta, \pi) = -\sum_{i=1}^{n} \sum_{j=0}^{l_{max}} \log \hat{p}\left(b_j(\pi(c_i)) \mid u_i; \theta, \pi\right) \qquad (2)$$
where we sum up the negative logarithm of the predicted user-node preference probability over all positive training samples and their ancestor user-node pairs as the global empirical loss.
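To make the structure of the global loss concrete, the sketch below sums the negative log of a predicted preference over each positive user-item pair and every ancestor of the item's leaf, from the root down to the leaf itself. It is an illustration under stated assumptions, not the authors' code: the tree is a complete binary tree in heap order, `pi` is a plain dictionary from items to leaf ids, and `p_hat` is a hypothetical callable standing in for the trained model.

```python
import math
from typing import Callable, Dict, Hashable, List, Tuple


def path_from_root(leaf: int) -> List[int]:
    """Nodes from the root (level 0) down to `leaf` in a heap-ordered binary tree."""
    path = [leaf]
    while path[-1] != 0:
        path.append((path[-1] - 1) // 2)  # parent of node i is (i - 1) // 2
    return path[::-1]


def global_loss(
    samples: List[Tuple[Hashable, Hashable]],
    pi: Dict[Hashable, int],
    p_hat: Callable[[Hashable, int], float],
) -> float:
    """Negative log-likelihood summed over each positive pair and its ancestors."""
    total = 0.0
    for user, item in samples:
        for node in path_from_root(pi[item]):
            total -= math.log(p_hat(user, node))
    return total


pi = {"a": 3, "b": 6}  # two items placed on leaves of a tree with max level 2
samples = [("u1", "a"), ("u2", "b")]
loss = global_loss(samples, pi, lambda user, node: 0.5)
print(loss)  # 2 samples x 3 levels, each term contributing log(2)
```

Because every term is the negative log of a probability, the loss is never negative, which is the lower bound used in the convergence argument for the alternating procedure.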
Since optimizing the projection function $\pi(\cdot)$ is a combinatorial optimization problem, it can hardly be optimized simultaneously with $\theta$ using gradient-based algorithms. To overcome this, we propose a joint learning framework as shown in Algorithm 1. It alternately optimizes the loss function (2) with respect to the user preference model and the tree hierarchy. The consistency of the training loss between model training and tree learning promotes the convergence of the framework. In fact, Algorithm 1 surely converges if both the model training and the tree learning decrease the value of (2), since $\{L(\theta_t, \pi_t)\}$ is then a decreasing sequence lower bounded by $0$. In model training, $\min_{\theta} L(\theta, \pi)$ is to learn a user-node preference model for each layer. Benefiting from the tree hierarchy, $\min_{\theta} L(\theta, \pi)$ is converted into learning the user-node preference distribution, so arbitrary advanced deep models can be used, and the problem can be solved by popular optimization algorithms for neural networks such as SGD (Bottou, 2010) and Adam (Kingma and Ba, 2014). In the normalized user preference setting (Covington et al., 2016; Beutel et al., 2018), since the number of nodes increases exponentially with the node level, noise-contrastive estimation (Gutmann and Hyvärinen, 2010) is used to estimate $\hat{p}(b_j(\pi(c)) \mid u; \theta, \pi)$, which avoids calculating the normalization term by means of a sampling strategy. The task of tree learning is to solve $\max_{\pi} -L(\theta, \pi)$ given $\theta$, which is a combinatorial optimization problem. In fact, given the tree structure, $\max_{\pi} -L(\theta, \pi)$ amounts to finding the optimal matching between the items in the corpus $\mathcal{C}$ and the leaf nodes of $\mathcal{T}$. Furthermore, we have the following remark. (For convenience, we assume $\mathcal{T}$ is a given complete binary tree. It is worth mentioning that the proposed algorithm can be naturally extended to multi-way trees.)
Remark 1. $\max_{\pi} -L(\theta, \pi)$ is essentially an assignment problem: finding a maximum weighted matching on a weighted bipartite graph.
Proof.
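Suppose the $k$-th item $c_k$ is assigned to the $m$-th leaf node $n_m$, i.e. $\pi(c_k) = n_m$. Then the following weight value can be computed:
$$L_{c_k, n_m} = \sum_{(u, c) \in \mathcal{A}_k} \sum_{j=0}^{l_{max}} \log \hat{p}\left(b_j(\pi(c_k)) \mid u; \theta, \pi\right) \qquad (3)$$
where $\mathcal{A}_k$ contains all positive sample pairs $(u, c)$ whose target item $c$ is $c_k$.
If we take the leaf nodes in $\mathcal{T}$ and the items in corpus $\mathcal{C}$ as vertices, and the full connection between leaf nodes and items as edges, we can construct a weighted bipartite graph $V$ with $L_{c_k, n_m}$ as the weight of the edge between $c_k$ and $n_m$. Furthermore, each assignment $\pi(\cdot)$ between items and leaf nodes corresponds to a matching of $V$. Given an assignment $\pi(\cdot)$, the total loss (2) can be computed by
$$L(\theta, \pi) = -\sum_{i=1}^{|\mathcal{C}|} L_{c_i, \pi(c_i)}$$
where $|\mathcal{C}|$ is the corpus size. Therefore, $\max_{\pi} -L(\theta, \pi)$ is equivalent to finding the maximum weighted matching of $V$. ∎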
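The equivalence between tree learning and maximum weighted matching can be checked on a toy instance. The sketch below is only an illustration: `weights[k][m]` is a hypothetical, hand-written table playing the role of the edge weight between item k and leaf m (a sum of log-probabilities, hence non-positive), and the search enumerates every matching, which is feasible only for a handful of items.

```python
from itertools import permutations
from typing import Optional, Sequence, Tuple


def best_assignment(
    weights: Sequence[Sequence[float]],
) -> Tuple[Optional[Tuple[int, ...]], float]:
    """Exhaustively find the item-to-leaf matching with maximum total weight.

    perm[k] is the leaf assigned to item k. The search visits all n!
    matchings, so it is a correctness reference, not a practical solver.
    """
    n = len(weights)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(weights[k][perm[k]] for k in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# Hypothetical weights for 3 items and 3 leaves.
weights = [
    [-1.0, -3.0, -2.0],
    [-2.0, -0.5, -4.0],
    [-3.0, -1.0, -0.2],
]
assignment, total_weight = best_assignment(weights)
loss = -total_weight  # the total loss is the negated sum of matched weights
print(assignment, loss)
```

Maximizing the matched weight and minimizing the loss select the same assignment, since the loss is the negated sum of the selected edge weights.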
Traditional algorithms for assignment problems, such as the classic Hungarian algorithm, are hard to apply to a large corpus because of their high complexity. Even the simplest greedy algorithm, which repeatedly chooses the unassigned pair $(c_k, n_m)$ with the largest weight $L_{c_k, n_m}$, requires a big weight matrix to be computed and stored in advance, which is not acceptable. To overcome this issue, we propose a segmented tree learning algorithm.
Instead of assigning items directly to leaf nodes, we assign the items every $d$ levels from top to bottom. Denote the partial weight of $L_{c_k, \pi(c_k)}$ from level $s$ to level $d$ given projection function $\pi(\cdot)$ as
$$L^{s,d}_{c_k, \pi(c_k)} = \sum_{(u, c) \in \mathcal{A}_k} \sum_{j=s}^{d} \log \hat{p}\left(b_j(\pi(c_k)) \mid u; \theta, \pi\right)$$
We first find an assignment $\pi^*$ that maximizes $\sum_{i=1}^{|\mathcal{C}|} L^{1,d}_{c_i, \pi(c_i)}$ w.r.t. the projection function $\pi(\cdot)$, which is equivalent to assigning all the items to nodes in level $d$. For a complete binary tree $\mathcal{T}$ with max level $l_{max}$, each level-$d$ node is assigned no more than $2^{l_{max}-d}$ items. This is also a maximum matching problem, but it can be solved efficiently by a greedy algorithm, since the number of possible locations for each item is greatly reduced if $d$ is well chosen (each item has only $2^d$ candidate nodes in level $d$). Keeping each item $c$’s corresponding ancestor node in level $d$ (which is $b_d(\pi^*(c))$) unchanged, we then successively maximize the next $d$ levels. The recursion stops when each item is assigned to a leaf node. The proposed algorithm is detailed in Algorithm 2.
In line 5 of Algorithm 2, we use a greedy algorithm with a rebalance strategy to solve the sub-problem. Each item $c$ currently under a node $n_i$ is first assigned to the child node of $n_i$ in level $l$ for which its partial weight $L^{l-d+1,l}_{c,\cdot}$ is largest. Then, to guarantee that each child is assigned no more than $2^{l_{max}-l}$ items, a rebalance process is applied. To promote the stability of tree learning and facilitate the convergence of the whole framework, for nodes that have more than $2^{l_{max}-l}$ items, we give priority to keeping those items whose assignment in level $l$ is the same as in the former iteration (i.e., $b_l(\pi'(c)) = b_l(\pi_{old}(c))$). The other items assigned to the node are sorted in descending order of weight $L^{l-d+1,l}_{c,\cdot}$, and the excess items are moved to other nodes that still have spare capacity, following the descending order of each item’s weight $L^{l-d+1,l}_{c,\cdot}$. Algorithm 2 helps us avoid storing a single big weight matrix. Furthermore, each sub-task can run in parallel to further improve efficiency.
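One sub-problem of the segmented tree learning, i.e. the greedy assignment followed by the rebalance step, might look like the sketch below. It is a simplified reading of the description above and not a transcription of Algorithm 2: `weights`, `capacity` and `previous` are hypothetical inputs, every item is assumed to carry a weight for every candidate node, and evicted items are processed in eviction order, each moving to the free node where its own weight is largest.

```python
from typing import Dict, Hashable, List


def assign_with_rebalance(
    weights: Dict[Hashable, Dict[int, float]],
    capacity: int,
    previous: Dict[Hashable, int],
) -> Dict[Hashable, int]:
    """Greedy item-to-node assignment with a capacity-aware rebalance pass.

    weights[item][node]: partial weight of placing `item` under `node`.
    capacity: maximum number of items a node may hold.
    previous[item]: node the item occupied in the former iteration, if any.
    """
    nodes = sorted({node for w in weights.values() for node in w})
    assert capacity * len(nodes) >= len(weights), "not enough room for all items"

    # Step 1: every item greedily picks its highest-weight node.
    members: Dict[int, List[Hashable]] = {node: [] for node in nodes}
    for item, w in weights.items():
        members[max(nodes, key=lambda node: w[node])].append(item)

    # Step 2: trim overfull nodes. Items that keep their former node are
    # kept first; the rest are ranked by descending weight on this node.
    evicted: List[Hashable] = []
    for node in nodes:
        ranked = sorted(
            members[node],
            key=lambda item: (previous.get(item) != node, -weights[item][node]),
        )
        members[node], extra = ranked[:capacity], ranked[capacity:]
        evicted.extend(extra)

    # Step 3: move each evicted item to the free node it prefers most.
    for item in evicted:
        free = [node for node in nodes if len(members[node]) < capacity]
        members[max(free, key=lambda node: weights[item][node])].append(item)

    return {item: node for node, items in members.items() for item in items}


# Four items, two candidate nodes, at most two items per node.
w = {
    "a": {0: -1.0, 1: -5.0},
    "b": {0: -2.0, 1: -3.0},
    "c": {0: -1.5, 1: -4.0},
    "d": {0: -6.0, 1: -1.0},
}
print(assign_with_rebalance(w, capacity=2, previous={"b": 0}))
print(assign_with_rebalance(w, capacity=2, previous={}))
```

In the first call item "b" stays on node 0 because it was there in the former iteration, even though its weight is the lowest of the three items competing for that node; without that history, "b" is the one moved to node 1.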
3.3. Hierarchical User Preference Representation
As shown in Section 3.1, TDM is a hierarchical retrieval model that generates the candidate items hierarchically from coarse to detailed. In retrieval, a top-down beam search is carried out level by level through the tree index by the user preference prediction model $\mathcal{M}$. Therefore, $\mathcal{M}$’s tasks in different levels are heterogeneous. Based on this, a layer-specific input of $\mathcal{M}$ is necessary to raise recommendation accuracy.
A series of related work (Zhang et al., 2017; Davidson et al., 2010; Linden et al., 2003; Koren et al., 2009; Zhou et al., 2018b; Zhu et al., 2017, 2018) has shown that the user’s historical behaviors play a key role in predicting the user’s interests. Besides, since each item in the user behaviors is a one-hot ID feature, a common way of generating the deep model’s input is to first embed each item into a continuous feature space. Based on the fact that a non-leaf node is an abstraction of its children in the tree hierarchy, given a user behavior sequence $c = \{c_1, c_2, \cdots, c_m\}$, where $c_i$ is the $i$-th item the user interacts with, we propose to use $c^l = \{b_l(\pi(c_1)), b_l(\pi(c_2)), \cdots, b_l(\pi(c_m))\}$ together with the target node and other possible features such as the user profile to generate the input of $\mathcal{M}$ in layer $l$ to predict the user-node preference, as shown in Fig. 1(a). In this way, the ancestor nodes of the items the user interacts with are used as abstract user behaviors, which replace the original user behavior sequence when training the corresponding layer. Generally, the hierarchical user preference representation brings two main benefits:

Layer independence. Sharing item embeddings between layers, as is commonly done, introduces noise when training $\mathcal{M}$ as the user preference prediction model for all layers, because the targets differ between layers. An explicit way to solve this is to attach to each item an independent embedding for each layer to generate the input of $\mathcal{M}$. However, this greatly increases the number of parameters and makes the system hard to optimize and apply. The proposed abstract user behaviors generate the input of $\mathcal{M}$ with node embeddings in the corresponding layer and achieve layer independence in training without increasing the number of parameters.

Precise description. $\mathcal{M}$ generates the candidate items hierarchically through the tree index. As the retrieval level increases, the candidate nodes in each level describe the ultimately recommended items from coarsely to precisely, until the leaf level is reached. The proposed hierarchical user preference representation captures the nature of the retrieval process and gives a precise description of user behaviors with nodes in the corresponding layer, which improves the predictability of user preference by reducing the confusion caused by a description that is too detailed or too coarse. For example, $\mathcal{M}$’s task in upper layers is to coarsely select a candidate set, and the user behaviors are likewise coarsely described with homogeneous node embeddings from the same upper layers in training and prediction.
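The replacement of a behavior sequence by its ancestors in a given layer can be sketched as below. The example is illustrative only: it assumes a complete binary tree in heap order, and `pi`, the item names and the function names are hypothetical. The returned node ids would then be looked up in the single node embedding table, which is why no extra parameters are needed.

```python
from typing import Dict, Hashable, List


def ancestor_at_level(leaf: int, level: int, max_level: int) -> int:
    """Ancestor of `leaf` in `level` for a heap-ordered complete binary tree."""
    node = leaf
    for _ in range(max_level - level):
        node = (node - 1) // 2  # step up to the parent
    return node


def abstract_behaviors(
    behaviors: List[Hashable],
    pi: Dict[Hashable, int],
    level: int,
    max_level: int,
) -> List[int]:
    """Map each behavior item to the ancestor of its leaf in `level`."""
    return [ancestor_at_level(pi[c], level, max_level) for c in behaviors]


# Leaves occupy node ids 7..14 when the max level is 3.
pi = {"shoes": 7, "boots": 8, "novel": 13}
history = ["shoes", "boots", "novel"]
for level in range(3, -1, -1):
    print(level, abstract_behaviors(history, pi, level, max_level=3))
```

In this toy tree, "shoes" and "boots" collapse onto the same node one level above the leaves, so the upper-layer model sees a coarser but denser description of the same history.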
4. Experimental Study
In this section, we study the performance of our proposed method both offline and online. In the offline experiments, we use two large-scale real-world datasets to evaluate different methods: the Amazon Books dataset (McAuley et al., 2015; He and McAuley, 2016) and the UserBehavior dataset (Zhu et al., 2018). We first compare the overall performance of the proposed method with other existing recommendation models to show the effectiveness of the joint learning framework. Ablation study results are then given to help understand how each part of the framework works in detail. Finally, we evaluate the proposed framework on the Taobao display advertising platform with real online traffic.
4.1. Datasets
The offline experiments are conducted on two large-scale real-world datasets: 1) a user-book review dataset from Amazon; 2) a user-item behavior dataset from Taobao called UserBehavior. The details are as follows:
Amazon Books (http://jmcauley.ucsd.edu/data/amazon/): This dataset is made up of product reviews from Amazon. Here we use its largest subset, Books, and only keep users who have reviewed no less than books. Each review record forms a user-book pair, with the format of user ID, book ID, rating and timestamp.
UserBehavior (https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1): It is a subset of Taobao user behavior data, containing about million randomly sampled users who have behaviors from November to December , . Similar to Amazon Books, only users with at least behaviors are kept. Each user-item behavior consists of user ID, item ID, item’s unique category ID, behavior type and timestamp. All behavior types are treated equally in our experiments.
Table 1 summarizes the details of the above two datasets after preprocessing.
| | Amazon Books | UserBehavior |
|---|---|---|
| # of users | 294,739 | 969,529 |
| # of items | 1,477,922 | 4,162,024 |
| # of categories | 3,700 | 9,439 |
| # of records | 8,654,619 | 100,020,395 |
4.2. Compared Methods and Experiment Settings


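Table 2: Comparison results of different methods on Amazon Books and UserBehavior.

| Method | Amazon Books Precision | Amazon Books Recall | Amazon Books F-Measure | UserBehavior Precision | UserBehavior Recall | UserBehavior F-Measure |
|---|---|---|---|---|---|---|
| Item-CF | 0.52% | 8.18% | 0.92% | 1.56% | 6.75% | 2.30% |
| YouTube product-DNN | 0.53% | 8.26% | 0.93% | 2.25% | 10.15% | 3.36% |
| TDM | 0.51% | 7.58% | 0.89% | 2.23% | 10.84% | 3.40% |
| TDM-A | 0.56% | 8.57% | 0.98% | 2.81% | 13.45% | 4.23% |
| JTM | 0.79% | 12.45% | 1.38% | 3.06% | 14.54% | 4.61% |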

To evaluate the performance of the proposed framework, we compare the following methods:

YouTube product-DNN (Covington et al., 2016) is a practical method used in YouTube video recommendation. It is the representative work of vector kNN search based methods. The inner product of the learnt user and item vector representations reflects the preference. We use exact kNN search to retrieve candidates in our experiments.

TDM (Zhu et al., 2018) is the tree-based deep model for recommendation. It enables arbitrary advanced models to retrieve user interests using the tree index. We use the proposed DNN version of TDM without tree learning.

TDM-A is a variant of TDM without the tree index. The only difference is that it directly learns a user-item preference model and linearly scans all items in prediction to retrieve the top-k candidates. TDM-A is not computationally tractable in an online system, but it is a strong baseline for offline comparison.

JTM is the proposed joint learning framework of the tree index and user preference prediction model.
We follow the settings of TDM (Zhu et al., 2018) to split the datasets. Considering the number of users in the two datasets, we randomly sample disjoint users to create Amazon Books’ validation set and testing set, while disjoint users are selected as UserBehavior’s validation and testing set each. The remaining users in each dataset form the corresponding training set. For each user in the validation and testing sets, we take the first half of the behaviors along the timeline as known features and the latter half as ground truth.
We implement YouTube product-DNN in Alibaba’s deep learning platform X-DeepLearning (XDL), and the source code is available (https://github.com/alibaba/xdeeplearning/tree/master/xdlalgorithmsolution/TDM/script/tdm_ub_vector_ubuntu). TDM, TDM-A and JTM are also implemented in XDL (https://github.com/alibaba/xdeeplearning/tree/master/xdlalgorithmsolution/TDM/script/tdm_ub_att_ubuntu), and we use the same user preference prediction model for them. The user preference model is a three-layer plain-DNN, each layer of which has , and hidden units respectively with PReLU (Xu et al., 2015) activation function. For all compared methods except Item-CF, we use the same user behavior feature as input. Each user behavior sequence has at most user-item pairs. To utilize the sequential information, input user behaviors are divided into time windows in time order. The ultimate user feature is the concatenation of each time window’s average item embedding vector. YouTube product-DNN uses the inner product of the learnt user and candidate item vector representations to reflect user preference, while the other methods compute user-item preference with a DNN that takes the concatenation of the user feature and the candidate item’s embedding as input. We deploy negative sampling for all methods except Item-CF and use the same negative sampling ratio. One implicit feedback has negative samples in Amazon Books and in UserBehavior.
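The user feature construction described above (time windows, per-window averaging, concatenation) can be sketched as follows. This is a rough illustration, not the XDL implementation: the number of windows is a free parameter here, behaviors are assumed to be split into consecutive chunks of equal size, and an empty window is filled with zeros. All three choices are assumptions rather than details given in the text.

```python
from typing import List


def window_features(embeddings: List[List[float]], n_windows: int) -> List[float]:
    """Concatenate the mean item embedding of each time window.

    embeddings: item embedding vectors of one user, already in time order.
    n_windows: number of time windows (hypothetical parameter).
    """
    dim = len(embeddings[0])
    size = -(-len(embeddings) // n_windows)  # ceiling division
    feature: List[float] = []
    for w in range(n_windows):
        chunk = embeddings[w * size:(w + 1) * size]
        if chunk:
            mean = [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        else:
            mean = [0.0] * dim  # assumption: pad empty windows with zeros
        feature.extend(mean)
    return feature


history = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(window_features(history, n_windows=2))  # [2.0, 0.0, 0.0, 3.0]
```

The resulting vector has a fixed length of `n_windows` times the embedding dimension regardless of how many behaviors a user has, which is what allows it to be concatenated with a candidate embedding and fed to a plain DNN.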
TDM and JTM require an initial tree before the training process. Amazon Books uses a random tree, in which items are randomly arranged in the leaf layer, since there are no finer categories under books for about of books. By taking advantage of the item-category relation in UserBehavior, a category tree can be created, in which items from the same category aggregate in the leaf layer. The tree learning layer gap is set to in all experiments that have joint optimization.
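The two kinds of initial tree can be illustrated with the sketch below. It only shows one plausible way to lay items out on the leaves of a complete binary tree in heap order: `initial_projection` and its arguments are hypothetical names, the random tree shuffles the items, and the category tree simply sorts items by category id so that items of the same category occupy neighbouring leaves. The actual construction used in the experiments may differ.

```python
import random
from typing import Dict, Hashable, List, Optional, Tuple


def initial_projection(
    items: List[Hashable],
    category: Optional[Dict[Hashable, int]] = None,
    seed: int = 0,
) -> Tuple[Dict[Hashable, int], int]:
    """Return (item -> leaf id, max level) for a non-empty item list."""
    order = list(items)
    if category is None:
        random.Random(seed).shuffle(order)           # random tree
    else:
        order.sort(key=lambda item: category[item])  # category tree
    max_level = (len(order) - 1).bit_length()        # smallest depth that fits
    first_leaf = 2 ** max_level - 1                  # id of the leftmost leaf
    pi = {item: first_leaf + i for i, item in enumerate(order)}
    return pi, max_level


cats = {"a": 2, "b": 1, "c": 2, "d": 1}
pi_cat, depth = initial_projection(["a", "b", "c", "d"], cats)
pi_rand, _ = initial_projection(["a", "b", "c", "d"])
print(pi_cat, depth)
```

With the category layout, items "b" and "d" share a parent node, as do "a" and "c", so the upper levels of the initial tree already group related items.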
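Precision, Recall and F-Measure are three general metrics, and we use them to evaluate the performance of different methods. For a user $u$, suppose $\mathcal{P}_u$ ($|\mathcal{P}_u| = M$) is the recalled set and $\mathcal{G}_u$ is the ground truth set. The equations of the three metrics are
$$\text{Precision@}M(u) = \frac{|\mathcal{P}_u \cap \mathcal{G}_u|}{M}, \qquad \text{Recall@}M(u) = \frac{|\mathcal{P}_u \cap \mathcal{G}_u|}{|\mathcal{G}_u|}$$
$$\text{F-Measure@}M(u) = \frac{2 \cdot \text{Precision@}M(u) \cdot \text{Recall@}M(u)}{\text{Precision@}M(u) + \text{Recall@}M(u)}$$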
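For reference, the three metrics can be computed per user as in the short sketch below, using the standard set-based definitions of precision, recall and their harmonic mean. The function name is hypothetical, and returning an F-Measure of zero when there are no hits is an assumption made to avoid division by zero; results over a test set would typically be obtained by averaging these per-user values.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f(
    recalled: Iterable[Hashable], truth: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Per-user Precision, Recall and F-Measure for one recalled set."""
    recalled_set, truth_set = set(recalled), set(truth)
    hits = len(recalled_set & truth_set)
    precision = hits / len(recalled_set)
    recall = hits / len(truth_set)
    if hits == 0:
        return precision, recall, 0.0  # assumption: define F as 0 with no hits
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


p, r, f = precision_recall_f(recalled=[1, 2, 3, 4], truth=[2, 4, 6])
print(p, r, f)  # 2 hits out of 4 recalled and 3 relevant items
```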
4.3. Comparison Results
Table 2 shows the quantitative results of all methods on the two datasets. It clearly shows that our proposed JTM outperforms the other baselines in all metrics. Compared with TDM-A, the previous best model on both datasets, JTM achieves relative recall lifts of about 45.3% and 8.1% in Amazon Books and UserBehavior respectively (12.45% vs. 8.57% and 14.54% vs. 13.45% in Table 2).
As mentioned in Section 4.2, though computationally intractable in an online system, TDM-A is a significantly strong baseline for offline comparison and, given a similar DNN user preference model, the theoretical upper bound for YouTube product-DNN and the tree-based models (TDM and JTM). Comparing TDM-A with the other methods gives insights in several respects.
Firstly, the results of YouTube product-DNN and TDM-A indicate the limitation of the inner-product form. These two methods adopt the same input. The difference is that YouTube product-DNN ranks items by the inner product of the learnt user and item vectors, while TDM-A computes the score with a DNN that takes the concatenation of the user and item vectors as input. Such a slight change brings a clear improvement, which verifies the advantage of the neural network over the inner-product form.
Next, TDM performs worse than TDM-A as a result of the tree hierarchy. The tree hierarchy takes effect in both the training and prediction processes. User-node samples are generated along the tree to fit the max-heap like preference distribution, and layer-wise beam search is deployed in the tree index during prediction. Without a well-defined tree hierarchy, the user preference prediction model may converge to a suboptimal version because the generated samples are confusing, and targets may be lost in the non-leaf layers so that inaccurate candidate sets are returned. Especially on sparse data like Amazon Books, the learnt embedding of each node in the tree hierarchy is not distinguishable enough, so TDM with a random tree performs worse than the other baselines. This phenomenon illustrates the influence of the tree and the necessity of learning a better tree.
By jointly learning the tree index and the user preference prediction model, JTM outperforms TDM-A on all metrics on the two datasets. A more precise user preference prediction model and a better tree hierarchy are obtained, as well as better item set selection. The hierarchical user preference representation alleviates the data sparsity problem in upper layers, because the feature space of the user behavior feature is much smaller while the number of samples stays the same. It also helps model training in a layer-wise way by reducing the propagation of noise between layers. Besides, tree hierarchy learning makes similar items aggregate in the leaf layer, so that the internal layer models get training samples with a more consistent and unambiguous distribution. Thanks to these two factors, a unified optimization of the hierarchical user preference model and the tree index makes it possible for JTM to provide better item set selection than TDM-A. A more specific analysis of each part is given in Section 4.4.
4.4. Ablation Study
Hierarchical User Preference Representation
To explore why the proposed hierarchical user preference representation works, we perform additional experiments on three variants of user preference representation in the tree-based model on two datasets. The tree-based model samples target nodes from all layers of the tree and uses the concatenation of the embeddings of the user behaviors and the target node as input. The three variants differ in their user behavior features; all of them use a fixed initial tree as explained in Section 4.2, with no tree learning algorithm adopted. More details are as follows:

TDM is the basic tree-based model introduced in Section 4.2. Each node has only one embedding. For samples from all layers, the user behavior features of a given user are exactly the same.

JTMHI is an advanced version of TDM that uses a layer-independent feature space. More specifically, the user behavior features are mapped directly to different embedding spaces when training the models of different layers. Compared to TDM, the parameter size increases several times over, in proportion to the height of the tree.

JTMH is TDM with hierarchical user preference representation. It simplifies JTMHI by taking advantage of the tree hierarchy. As in TDM, each node in the tree has only one embedding. User behaviors in the leaf layer are naturally mapped to the corresponding layer through the embeddings of their ancestors in the tree. No additional parameters are needed.
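The mapping used by the hierarchical representation can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes a complete binary tree stored as an array (root 0, parent of node n at `(n - 1) // 2`), and the function names `level_of`, `ancestor_at_level`, and `layer_sample` are hypothetical.

```python
from typing import List, Tuple


def level_of(node: int) -> int:
    """Depth of a node in an array-laid-out complete binary tree (root 0 is level 0)."""
    return (node + 1).bit_length() - 1


def ancestor_at_level(node: int, level: int) -> int:
    """Walk up parent links until the requested level is reached."""
    if not 0 <= level <= level_of(node):
        raise ValueError("level must lie between the root and the node's own level")
    while level_of(node) > level:
        node = (node - 1) // 2  # parent in the array layout
    return node


def layer_sample(behaviors: List[int], target_leaf: int, level: int) -> Tuple[List[int], int]:
    """Build the input for the model of one layer: every behavior item and the
    target item are replaced by their ancestors in that layer."""
    mapped = [ancestor_at_level(b, level) for b in behaviors]
    return mapped, ancestor_at_level(target_leaf, level)


# Toy tree of height 3: leaves (items) are nodes 7..14.
behaviors = [8, 11, 14, 7]
for level in (3, 2, 1):
    print(level, layer_sample(behaviors, target_leaf=12, level=level))
```

Because the mapped features are ordinary node ids, every layer can look them up in the single node embedding table, which is why no extra parameters are required. The number of distinct ids a layer can produce also shrinks toward the root, in line with the smaller feature space in upper layers.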


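Table 3. Results of the three user preference representation variants.

| Dataset | Method | Precision | Recall | F-Measure |
|---|---|---|---|---|
| Amazon Books | TDM | 0.51% | 7.58% | 0.89% |
| Amazon Books | JTMHI | 0.59% | 8.53% | 1.02% |
| Amazon Books | JTMH | 0.69% | 10.71% | 1.22% |
| UserBehavior | TDM | 2.23% | 10.84% | 3.40% |
| UserBehavior | JTMHI | 2.40% | 11.44% | 3.62% |
| UserBehavior | JTMH | 2.66% | 12.93% | 4.02% |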

From Table 3, we have several observations. JTMHI outperforms TDM in both datasets, which indicates that the layer-independent feature space indeed reduces the noise introduced by sharing the embedding space of user behavior features across all layers of the tree. JTMH achieves higher performance than JTMHI with fewer parameters, which demonstrates that hierarchical user preference representation works well. On the one hand, the tree hierarchy provides a natural hierarchical representation. Node embeddings in the same layer of the tree are homogeneous, so it is easier to capture latent feature crosses within the same layer than between leaf and non-leaf layers. On the other hand, with hierarchical user preference representation, the parameter space of the user behavior features shrinks considerably in upper layers, which partially solves the data sparsity problem.
In UserBehavior, the recall metric rises from 10.84% to 11.44% with layer-independent features, as a result of reduced feature confusion between layers. Another recall improvement from 11.44% to 12.93% comes from the homogeneity and appropriate granularity of features inside each layer. The relative improvements are more significant in Amazon Books, because the data sparsity problem is more serious and the randomly initialized tree introduces more inter-layer noise, which is well addressed by the proposed method.
Iterative Joint Learning
The tree hierarchy decides sample generation and the search path. A suitable tree benefits model training and inference a great deal. Fig. 2 compares the results of iterative joint learning with the clustering-based tree learning algorithm proposed in TDM (Zhu et al., 2018) and with the proposed tree learning algorithm.
The proposed tree learning algorithm has two merits: 1) it can converge to an optimal tree stably; 2) tree learning and user preference prediction model training share the same goal, which benefits the accuracy of recommendation. From Fig. 2, we can see that the results improve iteratively on all three metrics. Besides, although the clustering method has better results at early iterations, the proposed tree learning algorithm makes the model converge stably to a better result through joint learning in both datasets. These results demonstrate the effectiveness of iterative joint learning. It helps optimize the information maintained in the tree hierarchy, thus facilitating training and inference.
Joint Performance of User Preference Prediction Model and Tree Learning
To further study the contribution of each part and their joint performance, we perform some contrast experiments on each part of JTM. Detailed descriptions are as follows:

TDM is the basic tree-based model. It uses a plain DNN as the user preference model and applies no tree learning.

JTMJ learns tree hierarchy and user preference prediction model jointly and iteratively. It adopts the same user preference prediction model as TDM.

JTMH deploys hierarchical representation in user preference prediction model with the fixed initial tree hierarchy.

JTM optimizes the user preference prediction model with hierarchical representations and the tree hierarchy alternately in a joint framework.
The corresponding results of the above variants are in Table 4. (Note that JTMJ and JTM optimize the user preference model and the tree hierarchy jointly and iteratively; here we list only the converged results of these two methods.) Take the recall metric as an example. In Amazon Books, recall increases from 7.58% to 10.71% with hierarchical representation. Limited by the confusion and difficulty arising from the initial feature constitution and the random tree hierarchy, only a slight increase (from 7.58% to 7.60%) occurs after joint learning of the tree hierarchy and the original user preference prediction model. However, recall rises to 12.45% when the joint learning framework combines the more expressive features with tree learning. A clearer comparison can be seen in UserBehavior. Tree learning and hierarchical representation of user preference bring 0.88% (TDM → JTMJ) and 2.09% (TDM → JTMH) absolute gains in recall, respectively. Furthermore, a larger improvement of 3.70% (TDM → JTM) in absolute recall is achieved by optimizing both simultaneously, which is more than the sum of 0.88% and 2.09%.


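Table 4. Results of the user preference prediction model and tree learning variants.

| Dataset | Method | Precision | Recall | F-Measure |
|---|---|---|---|---|
| Amazon Books | TDM | 0.51% | 7.58% | 0.89% |
| Amazon Books | JTMJ | 0.51% | 7.60% | 0.89% |
| Amazon Books | JTMH | 0.69% | 10.71% | 1.22% |
| Amazon Books | JTM | 0.79% | 12.45% | 1.38% |
| UserBehavior | TDM | 2.23% | 10.84% | 3.40% |
| UserBehavior | JTMJ | 2.48% | 11.72% | 3.73% |
| UserBehavior | JTMH | 2.66% | 12.93% | 4.02% |
| UserBehavior | JTM | 3.06% | 14.54% | 4.61% |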

The results in Table 4 clearly show the effectiveness of hierarchical representation and tree learning, as well as of the joint learning framework. The joint learning of the user preference model with hierarchical representation and the learnt tree brings more improvement than the arithmetic sum of the two individual improvements on all metrics. Thus it is beneficial to optimize the user preference model with hierarchical representation and to learn the tree hierarchy in a unified framework.
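The "more than the arithmetic sum" statement can be checked directly against the numbers reported in Table 4. The short script below is only a worked-arithmetic sketch; the dictionary layout and the names `table4`, `absolute_gains`, and `is_superadditive` are choices made for this illustration.

```python
# Values copied from Table 4, in percent: (Precision, Recall, F-Measure).
table4 = {
    "Amazon Books": {
        "TDM": (0.51, 7.58, 0.89),
        "JTMJ": (0.51, 7.60, 0.89),
        "JTMH": (0.69, 10.71, 1.22),
        "JTM": (0.79, 12.45, 1.38),
    },
    "UserBehavior": {
        "TDM": (2.23, 10.84, 3.40),
        "JTMJ": (2.48, 11.72, 3.73),
        "JTMH": (2.66, 12.93, 4.02),
        "JTM": (3.06, 14.54, 4.61),
    },
}
METRICS = ("Precision", "Recall", "F-Measure")


def absolute_gains(rows):
    """Absolute gain of each variant over TDM, in percentage points."""
    base = rows["TDM"]
    return {
        method: tuple(round(v - b, 2) for v, b in zip(values, base))
        for method, values in rows.items()
        if method != "TDM"
    }


def is_superadditive(rows):
    """True if the JTM gain exceeds the JTMJ gain plus the JTMH gain on every metric."""
    g = absolute_gains(rows)
    return all(g["JTM"][i] > g["JTMJ"][i] + g["JTMH"][i] for i in range(len(METRICS)))


for dataset, rows in table4.items():
    print(dataset, absolute_gains(rows), is_superadditive(rows))
```

For UserBehavior recall, this gives gains of 0.88 (JTMJ), 2.09 (JTMH), and 3.70 (JTM) percentage points, and the check holds for every metric in both datasets.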
4.5. Online Results
The proposed JTM is also evaluated in production environments. We conduct the experiments in the display advertising scenario of the Guess What You Like column of the Taobao App Homepage. We use click-through rate (CTR) and revenue per mille (RPM) to measure the performance, which are the key performance indicators for online display advertising. The equations of the online metrics are:

\[
\mathrm{CTR} = \frac{\#\text{ of clicks}}{\#\text{ of impressions}}, \qquad \mathrm{RPM} = \frac{\text{Ad revenue}}{\#\text{ of impressions}} \times 1000
\]
In our advertising systems, advertisers bid on many granularities, such as ad clusters, items, and shops. Several recommendation approaches run simultaneously at all granularities to produce candidate sets, and their combination is passed to subsequent stages such as CTR prediction (Zhou et al., 2018b, a) and ranking (Zhu et al., 2017; Jin et al., 2018). Our baseline is such a combination of all running recommendation methods. To assess the effectiveness of JTM, we deploy JTM to replace ItemCF, which is one of the major candidate-generation approaches at the item granularity in our systems. TDM is evaluated in the same way as JTM. The corpus contains tens of millions of items. Each comparison bucket receives a fraction of all online traffic. We have completed a first version of JTM and evaluated its performance online. The results are presented in Table 5.
Table 5. Online results.

| Method | CTR | RPM |
|---|---|---|
| Baseline | 0.0% | 0.0% |
| TDM | +5.4% | +7.6% |
| JTM | +11.3% | +12.9% |
Table 5 reveals the lift on the two online metrics. The 11.3% growth in CTR shows that more precise items have been recommended with JTM. RPM improves by 12.9%, indicating that JTM can bring more income for the Taobao advertising platform. The JTM evaluated here only uses a basic fully connected network to capture user preference; under the joint optimization framework, a more advanced user preference model may achieve larger performance improvements. Note that TDM is a strong baseline with significant improvement of its own, yet JTM still achieves further gains in both CTR and RPM over TDM in exactly the same scenario.
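As a worked illustration of the online metrics, the sketch below computes CTR and RPM using their standard definitions (clicks per impression, and revenue per thousand impressions) and then derives the lift of JTM relative to TDM from Table 5. It assumes that both lifts in Table 5 are relative changes against the same baseline bucket; the function names and the traffic counts are hypothetical.

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions


def rpm(ad_revenue: float, impressions: int) -> float:
    """Revenue per mille: ad revenue per thousand impressions."""
    return ad_revenue / impressions * 1000


def relative_gain(lift_new: float, lift_old: float) -> float:
    """Lift (in percent) of one method over another, given each method's lift
    (in percent) over a shared baseline. Assumes the lifts are relative changes."""
    return ((1 + lift_new / 100) / (1 + lift_old / 100) - 1) * 100


# Hypothetical traffic counts, only to show the units of the two metrics.
example_ctr = ctr(clicks=250, impressions=10_000)         # fraction of impressions clicked
example_rpm = rpm(ad_revenue=36.0, impressions=10_000)    # revenue per 1000 impressions

# Lifts over the baseline taken from Table 5.
ctr_gain_over_tdm = relative_gain(11.3, 5.4)
rpm_gain_over_tdm = relative_gain(12.9, 7.6)
print(round(ctr_gain_over_tdm, 1), round(rpm_gain_over_tdm, 1))
```

Under that assumption, JTM's gain over TDM comes out at roughly 5.6% in CTR and 4.9% in RPM. If the lifts were instead read as simple differences, the gaps would be 5.9 and 5.3 percentage points.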
5. Conclusion
Recommender systems play a key role in various kinds of applications such as video streaming and e-commerce. In this paper, we propose a joint learning framework for the tree index and the user preference prediction model used in tree-based deep recommendation models. The tree index and the deep model are optimized alternately under a global loss function. An efficient greedy algorithm is proposed for tree learning. Besides, a novel hierarchical user preference representation is proposed to describe user behaviors precisely by utilizing the tree hierarchy. Both online and offline experimental results show the advantages of the proposed framework over other related large-scale recommendation models.
References
 Agrawal et al. (2013) Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web. ACM, 13–24.
 Beutel et al. (2018) Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H Chi. 2018. Latent Cross: Making Use of Context in Recurrent Recommender Systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 46–54.

 Bottou (2010) Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010. Springer, 177–186.
 Cao et al. (2016) Yue Cao, Mingsheng Long, Jianmin Wang, Han Zhu, and Qingfu Wen. 2016. Deep Quantization Network for Efficient Image Retrieval. In AAAI. 3457–3463.
 Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
 Choromanska and Langford (2015) Anna E Choromanska and John Langford. 2015. Logarithmic time online multiclass prediction. In Advances in Neural Information Processing Systems. 55–63.
 Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In ACM Conference on Recommender Systems. 191–198.
 Daumé III et al. (2017) Hal Daumé III, Nikos Karampatziakis, John Langford, and Paul Mineiro. 2017. Logarithmic time one-against-some. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 923–932.
 Davidson et al. (2010) James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. 2010. The YouTube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 293–296.

 Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. 297–304.
 Han et al. (2018) Lei Han, Yiheng Huang, and Tong Zhang. 2018. Candidates vs. Noises Estimation for Large Multi-Class Classification Problem. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 1890–1899.
 He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 507–517.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
 Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
 Jain et al. (2016) Himanshu Jain, Yashoteja Prabhu, and Manik Varma. 2016. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 935–944.
 Jin et al. (2018) Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. 2018. Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising. arXiv preprint arXiv:1802.09756 (2018).
 Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
 Kraska et al. (2018) Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 489–504.
 Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. arXiv preprint arXiv:1803.05170 (2018).
 Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 7, 1 (2003), 76–80.
 Liu et al. (2005) Ting Liu, Andrew W Moore, Ke Yang, and Alexander G Gray. 2005. An investigation of practical approximate nearest neighbor algorithms. In Advances in neural information processing systems. 825–832.
 McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
 Okura et al. (2017) Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based news recommendation for millions of users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1933–1942.
 Prabhu et al. (2018) Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. 2018. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 993–1002.
 Prabhu and Varma (2014) Yashoteja Prabhu and Manik Varma. 2014. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 263–272.
 Rendle (2010) Steffen Rendle. 2010. Factorization Machines. In IEEE International Conference on Data Mining. 995–1000.
 Salakhutdinov and Mnih (2007) Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In International Conference on Neural Information Processing Systems. 1257–1264.
 Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In International Conference on World Wide Web. 285–295.
 Tan et al. (2016) Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 17–22.
 Weston et al. (2013) Jason Weston, Ameesh Makadia, and Hector Yee. 2013. Label partitioning for sublinear ranking. In International Conference on Machine Learning. 181–189.
 Wu et al. (2017) Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. 2017. Recurrent recommender networks. In Proceedings of the tenth ACM international conference on web search and data mining. ACM, 495–503.
 Xu et al. (2015) Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853 (2015).
 Zhang et al. (2017) Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. (2017).
 Zhou et al. (2018a) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2018a. Deep Interest Evolution Network for Click-Through Rate Prediction. arXiv preprint arXiv:1809.03672 (2018).
 Zhou et al. (2018b) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018b. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1059–1068.
 Zhu et al. (2017) Han Zhu, Junqi Jin, Chang Tan, Fei Pan, Yifan Zeng, Han Li, and Kun Gai. 2017. Optimized Cost Per Click in Taobao Display Advertising. In Proceedings of the 23rd ACM SIGKDD Conference. ACM, 2191–2200.
 Zhu et al. (2018) Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning Tree-based Deep Model for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1079–1088.