1. Introduction
Discovering valuable products or services for users among the massive options available on the Internet has become a fundamental functionality of modern online applications such as e-commerce (Ni et al., 2018; Chen et al., 2019; Sun et al., 2019; Wen et al., 2019), social networking (Golbeck et al., 2006; Naruchitparames et al., 2011), and advertising (Zhou et al., 2019b, 2018). The Recommendation System (RS), as a widely used information filtering tool (Zhu et al., 2018; Feng et al., 2019; Lv et al., 2019), serves this role by providing accurate, timely, and even personalized services to users. Taking online recommendation on an e-commerce platform as an example, there are usually two phases, i.e., the System Recommendation Phase and the User Action Phase, as shown in Figure 1. After analysing a user's long- and short-term behaviours, the RS first generates a large number of related items. Then, these items are ranked and delivered to the user according to the estimated Click-Through Rate (CTR) (Zhou et al., 2019b, 2018) and Conversion Rate (CVR) (Wen et al., 2019; Ma et al., 2018a), which aims to maximize the probabilities of clicking and buying the impression items. On the user side, after receiving the impression item information, users click on the items they are interested in and may finally buy some of them. Obviously, accurately estimating CTR and CVR is crucial for providing expected products to users as well as for increasing sales. To this end, the logs of user clicking and buying behaviours, which provide valuable feedback, are used to further improve the RS.
However, two critical issues make the aforementioned estimation intractable, i.e., Sample Selection Bias (SSB) (Zadrozny, 2004) and Data Sparsity (DS) (Lee et al., 2012). SSB refers to the sample bias between the model training phase and the inference phase, i.e., conventional CVR models are trained only on clicked samples while being used for inference on all impression samples. Since clicked samples are only a very small portion of the impression samples, SSB places a heavy burden on inference accuracy. Besides, after clicking items, users eventually buy only very few of them. This leads to the DS problem: samples from the sequential behaviour path Click→Buy are insufficient to train a CVR model with strong representative ability. As illustrated in Figure 2, dealing with the SSB and DS problems is crucial for the accuracy of a CVR model.
Several studies have been carried out to tackle these challenges. Pan et al. propose a negative example weighting and sampling method to deal with the absence of negative examples in conventional recommendation systems (Pan et al., 2008). Although it can reduce the side effect of SSB by introducing negative examples, it also leads to underestimated predictions. Zhang et al. propose a model-free learning framework by fitting the underlying distribution in the context of advertising (Zhang et al., 2016). However, it may encounter numerical instability when weighting samples. Lee et al. propose to build several hierarchical estimators with different features whose distribution parameters are estimated individually (Lee et al., 2012). However, it relies on prior knowledge to construct the hierarchical structures, which is difficult to apply in recommendation systems with tens of millions of users and items. Recently, Ma et al. propose an approach named Entire Space Multi-task Model (ESMM) to model over the entire space by considering the path Impression→Click→Buy (Ma et al., 2018a). ESMM models the post-view click-through rate (pCTR) and the post-view click-through conversion rate (pCTCVR) together in a multi-task framework. Consequently, the post-click conversion rate (pCVR) can be derived from pCTR and pCTCVR over the entire space. In this way, ESMM addresses the SSB and DS issues by making abundant use of all the training samples and the supervisory signals from two auxiliary tasks based on the paths Impression→Click and Impression→Buy, respectively.
Although ESMM achieves better performance than conventional methods, the DS problem still exists since the training samples on the path Click→Buy are still much fewer. To deal with this problem, we observe that users who eventually buy an item almost always take some specific actions after clicking it, e.g., Add to Cart (Cart) or Add to Wish List (Wish). Therefore, we can change the original path Click→Buy to Click→Cart→Buy or Click→Wish→Buy, etc., where the data sparsity on the intermediate paths Click→Cart or Cart→Buy is alleviated compared with the original path Click→Buy. Moreover, we can model CVR on these new paths by exploiting the extra training samples with supervisory signals from these specific actions. Motivated by this observation, and different from the prior ESMM work modeling on the path Impression→Click→Buy, we insert parallel disjoint Deterministic Action (DAction) and Other Action (OAction) nodes between Click and Buy to elaborately model over the entire space, which specifically changes the conventional path Impression→Click→Buy to the novel path Impression→Click→DAction/OAction→Buy by decomposing the post-click behaviour.
Specifically, we propose a novel deep neural recommendation algorithm named Elaborated Entire Space Supervised Multi-task Model (ESM²), which consists of three modules: 1) a shared embedding module (SEM), 2) a decomposed prediction module (DPM), and 3) a sequential composition module (SCM). First, SEM embeds the one-hot feature vectors of ID features into dense representations through a linear fully connected layer. Then, these embeddings are fed into the subsequent DPM, where individual prediction networks estimate the probabilities of the decomposed targets on the paths Impression→Click, Click→DAction, DAction→Buy, and OAction→Buy, respectively. Finally, SCM integrates them together sequentially according to the defined behaviour paths to calculate the final pCVR and some auxiliary probabilities, including pCTR, pCTCVR, etc. In a nutshell, the proposed method addresses the SSB and DS problems simultaneously, and essentially improves the final prediction accuracy by employing a multi-task learning framework and supervisory signals from the intermediate actions between Click and Buy.
The main contributions of this paper are as follows:
To the best of our knowledge, this is the first work to introduce the idea of Post-Click Behaviour Decomposition to model CVR over the entire space, which specifically changes the conventional path Impression→Click→Buy to the novel path Impression→Click→DAction/OAction→Buy by decomposing the post-click behaviour.
We propose a novel deep neural recommendation algorithm named ESM² based on the above idea, which is able to simultaneously predict the probabilities of the decomposed targets and sequentially compose them together to calculate the final pCVR and some auxiliary targets. This multi-task learning model can efficiently address the SSB and DS problems.
Our proposed model achieves better performance on a real-world offline dataset than representative state-of-the-art methods. To further demonstrate its effectiveness in industrial applications, we successfully deploy ESM² in our online recommendation module and achieve significant improvement.
The rest of this paper is organized as follows. Section 2 presents a brief survey of related work, followed by the details of the proposed model in Section 3. Experimental results and analysis are presented in Section 4. Finally, we conclude the paper in Section 5.
2. Related Work
Generally speaking, recommendation methods include content-based methods (Gopalan et al., 2014; Van den Oord et al., 2013; Wilson et al., 2019), collaborative filtering based methods (Zhang et al., 2019; Thakkar et al., 2019; Zhou et al., 2019a), and hybrid-strategy based methods (Logesh and Subramaniyaswamy, 2019; Tsai et al., 2019). Content-based methods recommend items similar to a user's past interests, collaborative filtering based methods recommend to a user the items that people with similar tastes preferred in the past, and hybrid recommendations integrate two or more types of recommendation strategies. Our method falls into the collaborative filtering based category and specifically tackles the post-click conversion rate prediction problem using a multi-task learning framework via post-click behaviour decomposition. Therefore, we briefly review the most related work from the following two aspects: 1) conversion rate prediction; 2) multi-task learning.
Conversion Rate Prediction: Conversion rate prediction is a key component of many online applications, such as search engines (Dupret and Piwowarski, 2008; Zhang et al., 2014), recommendation systems (Guo et al., 2017; Qu et al., 2016), and online advertising (Graepel et al., 2010; He et al., 2014). However, CVR prediction is very challenging since conversions are extremely rare events: only a very small portion of impression items are eventually clicked and bought. Prior work studies both linear and nonlinear models, including logistic regression (Effendi and Ali, 2017; Wen et al., 2019; Zhou and Feng, [n.d.]) and factorization machines (Rendle, 2010; Xiao et al., 2017). Recently, deep neural networks have achieved significant progress in many areas due to their strong abilities in feature representation and end-to-end modeling (Hinton et al., 2006; Graves et al., 2013; Krizhevsky et al., 2012; Shen et al., 2014; Feng et al., 2019; Lv et al., 2019). Following these works, we also adopt a deep neural network to embed user-item features and predict the conversion rate.
Multi-Task Learning: Due to the temporal multi-stage nature of users' purchasing behaviour, i.e., Impression, Click, and Buy, prior work attempts to formulate conversion rate prediction as a multi-task learning problem. For example, Hadash et al. propose a multi-task learning based recommendation system by modeling the ranking and rating prediction tasks simultaneously (Hadash et al., 2018). Ma et al. propose a multi-task learning approach named Multi-gate Mixture-of-Experts to explicitly learn the task relationships from data (Ma et al., 2018b). Gao et al. propose a neural multi-task recommendation model named NMTR to learn the cascading relationship among different types of behaviours (Gao et al., 2019). In contrast, we model the CTR and CVR tasks simultaneously by associating them with users' sequential behaviour paths, where the task relationship is explicitly defined by conditional probabilities (see Section 3). Ni et al. propose to learn universal user representations across multiple tasks for more effective personalization (Ni et al., 2018). We also adopt this idea by sharing the embedded features across different tasks.
Recently, Ma et al. propose the Entire Space Multi-task Model (ESMM) for estimating the post-click conversion rate (Ma et al., 2018a). ESMM uses the pCTR task and the pCTCVR task as parallel auxiliary tasks of the main pCVR task. Our method follows ESMM but has the following difference: we propose to decompose the post-click behaviour into DAction and OAction parts and insert them into the original sequential path Click→Buy. In this way, our model can leverage the supervisory signals from users' post-click behaviours, which consequently alleviates the data sparsity issue.
3. Proposed Method
3.1. Motivation
There exist multiple kinds of behaviour paths leading to Buy, e.g., Impression first, sequentially followed by Click, Cart, and Buy, as shown in Figure 3(a). After analyzing our online real-world data, we use a digraph to describe the purchasing process, as shown in Figure 3(b). First, some items are displayed to users. Then, when users click an item, they will always take some specific actions before buying it eventually. For example, 10% of them add the item into their carts (Add to Cart, Cart for short) so that they can buy it together with other items later (12% of them buy it eventually). 3.5% of them add the item into their wish lists (Add to Wish List, Wish for short) because they like it but cannot buy it instantly for some reason, e.g., they cannot afford it now or are waiting for a promotion (31% of them buy it eventually). How can these sequential behaviours be exploited to predict CVR more accurately? One possible solution is to exhaust all possible disjoint behaviour paths and then use the intermediate supervisory signals within individual paths. However, 1) if we distinguish these behaviour paths thoroughly, the data on each individual path become even more sparse; 2) it is nearly impossible to exhaust all the behaviour paths due to the intricacy of user behaviours; and 3) it is not straightforward to integrate the final prediction target, such as CVR, from individual paths.
As shown in Figure 3(c), instead of treating these behaviour paths individually, we define a single node named Deterministic Action (DAction) to integrate specific actions, such as Cart or Wish. It can be seen that DAction has the following two properties: 1) it has deterministic supervisory signals, i.e., 1 for taking some actions and 0 for none; 2) it further alleviates the DS problem by integrating multiple kinds of actions, e.g., 13% of users take some specific actions after they click an item, and 9% of them eventually buy it (due to the overlapping of actions, the number is 13% instead of 13.5%). Therefore, we can use these abundant training samples on the intermediate paths Click→DAction and DAction→Buy to supervise the CVR model. We also add a node named Other Action (OAction for short) between Click and Buy to deal with the other cases except DAction. In this way, the original path Impression→Click→Buy is changed to a more elaborated one: Impression→Click→DAction/OAction→Buy. We call the above idea Post-Click Behaviour Decomposition and devise our CVR model accordingly. The details are presented in the following sections.
3.2. Probability Decomposition of CVR
In this section, we present the probability decomposition of conversion rate according to the digraph defined in Figure 3(c).
First, the post-view click-through rate of the $i$-th item, denoted as $p_{\mathrm{CTR}}$, is defined as the probability of the item being clicked given that it has been viewed, which depicts the path Impression→Click in the digraph. Mathematically, it can be written as:
(1)    $p_{\mathrm{CTR}} = p(c_i = 1 \mid v_i = 1),$
where $c_i \in \{0, 1\}$ denotes whether the $i$-th item is clicked, $\mathcal{C} = \{c_i\}_{i=1}^{N}$ is the label space of all the items being clicked or not, and $N$ is the number of items. Similarly, $v_i \in \{0, 1\}$ denotes whether the $i$-th item is viewed (i.e., Impression), and $\mathcal{V} = \{v_i\}_{i=1}^{N}$ is the label space of all the items being viewed or not. $p_{\mathrm{CTR}}$ is a surrogate symbol for $p(c_i = 1 \mid v_i = 1)$ for simplicity.
Then, the click-through DAction conversion rate of the $i$-th item, denoted as $p_{\mathrm{CTAVR}}$, is defined as the probability of DAction being taken on the item given that it has been viewed, which depicts the path Impression→Click→DAction in the digraph. Mathematically, it can be written as:
(2)    $p_{\mathrm{CTAVR}} = p(a_i = 1 \mid v_i = 1) = p(c_i = 1 \mid v_i = 1)\, p(a_i = 1 \mid c_i = 1) = p_{\mathrm{CTR}} \cdot p_{\mathrm{AVR}},$
where $a_i \in \{0, 1\}$ denotes whether some specific actions defined in Section 3.1 are taken on the $i$-th item, and $\mathcal{A} = \{a_i\}_{i=1}^{N}$ is the label space of all the items being taken some specific actions or not. $p_{\mathrm{AVR}}$, depicting the path Click→DAction, is a surrogate symbol for $p(a_i = 1 \mid c_i = 1)$ for simplicity, as $p_{\mathrm{CTR}}$ is. It is trivial that $p(v_i = 1) = 1$ since all the samples are impression samples. It is noteworthy that Eq. (2) holds due to the fact that no action occurs without being clicked, i.e., $p(a_i = 1, c_i = 0 \mid v_i = 1) = 0$.
Next, the conversion rate of the $i$-th item, denoted as $p_{\mathrm{CVR}}$, is defined as the probability of the item being bought given that it has been clicked, which depicts the paths Click→DAction/OAction→Buy in the digraph. Mathematically, it can be written as:
(3)    $p_{\mathrm{CVR}} = p(b_i = 1 \mid c_i = 1) = p(a_i = 1 \mid c_i = 1)\, p(b_i = 1 \mid a_i = 1) + p(a_i = 0 \mid c_i = 1)\, p(b_i = 1 \mid a_i = 0),$
where $b_i \in \{0, 1\}$ denotes whether the $i$-th item is bought, and $\mathcal{B} = \{b_i\}_{i=1}^{N}$ is the label space of all the items being bought or not. $p(b_i = 1 \mid a_i = 1)$ and $p(b_i = 1 \mid a_i = 0)$ are surrogate symbols for simplicity, depicting the paths DAction→Buy and OAction→Buy in the digraph, respectively.
The click-through conversion rate of the $i$-th item, denoted as $p_{\mathrm{CTCVR}}$, is defined as the probability of the item being bought given that it has been viewed, which depicts the complete path Impression→Click→DAction/OAction→Buy in the digraph. Mathematically, it can be written as:
(4)    $p_{\mathrm{CTCVR}} = p(b_i = 1 \mid v_i = 1) = p(b_i = 1, c_i = 1 \mid v_i = 1) = p(c_i = 1 \mid v_i = 1)\, p(b_i = 1 \mid c_i = 1, v_i = 1) = p_{\mathrm{CTR}} \cdot p_{\mathrm{CVR}}.$
Here, we use $p(b_i = 1 \mid c_i = 1)$ to replace $p(b_i = 1 \mid c_i = 1, v_i = 1)$ in the last equality for simplicity without causing any ambiguity. It is noteworthy that the second equality holds due to the fact that no item will be bought without being clicked, i.e., $p(b_i = 1, c_i = 0 \mid v_i = 1) = 0$. Indeed, Eq. (4) can be derived by decomposing the path Impression→Click→DAction/OAction→Buy into Impression→Click and Click→DAction/OAction→Buy, and integrating Eq. (1) and Eq. (3) together according to the chain rule.
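To make the decomposition concrete, the chain-rule composition in Eqs. (1)–(4) can be checked numerically. The probability values below are purely illustrative, not taken from our data:

```python
def compose_probabilities(p_ctr, p_avr, p_br_da, p_br_oa):
    """Compose the decomposed targets along the path
    Impression->Click->DAction/OAction->Buy.
    p_ctr:   p(c=1|v=1), Eq. (1)
    p_avr:   p(a=1|c=1), the Click->DAction factor of Eq. (2)
    p_br_da: p(b=1|a=1), DAction->Buy
    p_br_oa: p(b=1|a=0), OAction->Buy
    """
    p_ctavr = p_ctr * p_avr                              # Eq. (2)
    p_cvr = p_avr * p_br_da + (1.0 - p_avr) * p_br_oa    # Eq. (3)
    p_ctcvr = p_ctr * p_cvr                              # Eq. (4)
    return p_ctavr, p_cvr, p_ctcvr

# Illustrative inputs: 6% CTR, 13% of clicks take a DAction, etc.
p_ctavr, p_cvr, p_ctcvr = compose_probabilities(0.06, 0.13, 0.09, 0.002)
```

Since every factor lies in [0, 1], the composed pCTCVR is guaranteed to be a valid probability no larger than pCVR.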
3.3. Elaborated Entire Space Supervised Multi-task Model
Given the users' behaviour logs, we can easily obtain the ground truth labels of $c_i$, $a_i$, and $b_i$ defined in the above section. In other words, we can model them simultaneously using a multi-task learning framework. To this end, we propose a novel deep neural recommendation algorithm named Elaborated Entire Space Supervised Multi-task Model (ESM²) for conversion rate prediction. ESM² gets its name since: 1) $p_{\mathrm{CTR}}$, $p_{\mathrm{CTAVR}}$, and $p_{\mathrm{CTCVR}}$ are modeled over the entire space, i.e., using all the impression samples; and 2) the $p_{\mathrm{CVR}}$ derived from Eq. (3) also benefits from the entire space multi-task modeling, which will be validated in the experiment part. As shown in Figure 4, the proposed ESM² is built on a deep neural network and consists of three key modules: 1) a shared embedding module, 2) a decomposed prediction module, and 3) a sequential composition module. We present each of them in detail as follows.
Shared Embedding Module (SEM): First, we use SEM to embed all the sparse ID features and dense numerical features coming from the user field, the item field, and the user-item cross field. The user features include users' IDs, ages, genders, and purchasing powers, etc. The item features include items' IDs, prices, and accumulated CTR and CVR from historical logs, etc. The user-item features include users' historical preference scores on items, etc. Dense numerical features are first discretized based on their boundary values and then represented as one-hot vectors. Here, we use $x_i^k$ to denote the $k$-th one-hot feature vector of the $i$-th training sample, where $k \in \mathcal{K}$ and $\mathcal{K}$ denotes the index set of all kinds of features. Due to the sparse nature of one-hot encoding, we employ linear fully connected layers to embed them into dense representations, which can be formulated as:
(5)    $e_i^k = W^k x_i^k,$
where $W^k$ denotes the embedding matrix for the $k$-th kind of features, and $\theta_{emb} = \{W^k \mid k \in \mathcal{K}\}$ represents the network parameters.
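As a minimal sketch of Eq. (5): because $x_i^k$ is one-hot, multiplying by the embedding matrix reduces to a row lookup. The field names, vocabulary sizes, and embedding dimension below are hypothetical, not the values used in the real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes and embedding dimension; the real model
# keeps one embedding matrix W^k per feature field.
vocab_sizes = {"user_id": 1000, "item_id": 5000, "price_bucket": 50}
emb_dim = 8
W = {k: rng.normal(0.0, 0.01, size=(n, emb_dim)) for k, n in vocab_sizes.items()}

def embed(sample):
    """Eq. (5): e^k = W^k x^k; with x^k one-hot, the matrix product is
    simply the row of W^k indexed by the active feature value."""
    return np.concatenate([W[k][idx] for k, idx in sample.items()])

e = embed({"user_id": 42, "item_id": 4999, "price_bucket": 3})
```

The lookup form avoids materializing the one-hot vectors, which is why production systems store embeddings as tables rather than dense layers.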
Decomposed Prediction Module (DPM): Then, once all the feature embeddings are obtained, they are concatenated together and fed into the decomposed prediction module, where the concatenated embedding is shared by each of the subsequent networks. Each individual prediction network in DPM estimates the probability of the decomposed target on the paths Impression→Click, Click→DAction, DAction→Buy, and OAction→Buy, respectively. In this paper, we employ a Multi-Layer Perceptron (MLP) as the prediction network. All the nonlinear activation functions are ReLU except in the output layer, where we use a Sigmoid function to map the output into a probability taking real values from 0 to 1. Mathematically, it can be formulated as:
(6)    $\hat{y} = \sigma\left(f\left(e_i; \theta_{mlp}\right)\right),$
where $e_i$ denotes the concatenation of all the feature embeddings of the $i$-th sample, $\sigma(\cdot)$ denotes the Sigmoid function, $f(\cdot)$ denotes the mapping function learned by the MLP, and $\theta_{mlp}$ denotes its network parameters. For example, as shown in the first MLP in Figure 4, it outputs the estimated probability $p(c_i = 1 \mid v_i = 1)$, which is indeed the post-view click-through rate.
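A forward pass of one DPM branch (Eq. (6)) can be sketched as follows. The layer sizes are scaled down from the real configuration, the weights are random, and we use a single sigmoid output for simplicity, so this only illustrates the ReLU/Sigmoid structure:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """ReLU on all hidden layers; a Sigmoid on the output layer maps the
    final logit into a probability in (0, 1), as in Eq. (6)."""
    h = x
    for Wl, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ Wl + b)
    return sigmoid(h @ weights[-1] + biases[-1])

rng = np.random.default_rng(0)
dims = [24, 16, 8, 1]  # toy layer sizes (the paper uses 512-256-128-32-2)
weights = [rng.normal(0.0, 0.1, (dims[i], dims[i + 1]))
           for i in range(len(dims) - 1)]
biases = [np.zeros(d) for d in dims[1:]]
p = mlp_forward(rng.normal(size=24), weights, biases)
```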
Sequential Composition Module (SCM): Finally, SCM composes the above predicted probabilities sequentially according to Eqs. (1)–(4) to calculate the conversion rate $p_{\mathrm{CVR}}$ and some auxiliary targets, including the post-view click-through rate $p_{\mathrm{CTR}}$, the click-through DAction conversion rate $p_{\mathrm{CTAVR}}$, and the click-through conversion rate $p_{\mathrm{CTCVR}}$, respectively. As shown in the top part of Figure 4, SCM is a parameter-free feed-forward network which represents the underlying conditional probabilities defined by the purchasing decision digraph in Figure 3(c).
3.4. Training Objective
We use $\mathcal{D} = \{(x_i, c_i, a_i, b_i)\}_{i=1}^{N}$ to denote the training set, where $c_i$, $a_i$, and $b_i$ represent the ground truth labels of whether the $i$-th impression sample is clicked, has deterministic actions taken, and is bought, respectively. Then, we can define the joint post-view click-through probability of all training samples as follows:
(7)    $p(\mathcal{C} \mid \mathcal{V}) = \prod_{i \in \mathcal{C}^{+}} \hat{p}_{\mathrm{CTR},i} \prod_{j \in \mathcal{C}^{-}} \left(1 - \hat{p}_{\mathrm{CTR},j}\right),$
where $\mathcal{C}^{+}$ and $\mathcal{C}^{-}$ denote the positive and negative samples in the label space $\mathcal{C}$, respectively, and $\hat{p}_{\mathrm{CTR},i}$ is the predicted $p_{\mathrm{CTR}}$ of the $i$-th sample. After taking the negative logarithm of Eq. (7), we obtain the log-loss of $p_{\mathrm{CTR}}$, which is widely used in recommendation systems, i.e.,
(8)    $\mathcal{L}_{\mathrm{CTR}} = -\sum_{i=1}^{N} \left[ c_i \log \hat{p}_{\mathrm{CTR},i} + (1 - c_i) \log\left(1 - \hat{p}_{\mathrm{CTR},i}\right) \right].$
Similarly, we can obtain the loss functions of $p_{\mathrm{CTAVR}}$ and $p_{\mathrm{CTCVR}}$ as follows:
(9)    $\mathcal{L}_{\mathrm{CTAVR}} = -\sum_{i=1}^{N} \left[ a_i \log \hat{p}_{\mathrm{CTAVR},i} + (1 - a_i) \log\left(1 - \hat{p}_{\mathrm{CTAVR},i}\right) \right]$
and
(10)    $\mathcal{L}_{\mathrm{CTCVR}} = -\sum_{i=1}^{N} \left[ b_i \log \hat{p}_{\mathrm{CTCVR},i} + (1 - b_i) \log\left(1 - \hat{p}_{\mathrm{CTCVR},i}\right) \right].$
The final training objective to be minimized is defined as:
(11)    $\mathcal{L}(\theta) = w_{\mathrm{CTR}} \mathcal{L}_{\mathrm{CTR}} + w_{\mathrm{CTAVR}} \mathcal{L}_{\mathrm{CTAVR}} + w_{\mathrm{CTCVR}} \mathcal{L}_{\mathrm{CTCVR}},$
where $\theta$ denotes all the network parameters in ESM², and $w_{\mathrm{CTR}}$, $w_{\mathrm{CTAVR}}$, and $w_{\mathrm{CTCVR}}$ are the loss weights of $\mathcal{L}_{\mathrm{CTR}}$, $\mathcal{L}_{\mathrm{CTAVR}}$, and $\mathcal{L}_{\mathrm{CTCVR}}$, respectively.
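The objective in Eqs. (8)–(11) is a weighted sum of three log-losses, which can be sketched as follows. The clipping epsilon and the unit loss weights are implementation choices of this sketch, not values from the paper:

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-7):
    """Eqs. (8)-(10): negative log-likelihood of Bernoulli labels.
    Predictions are clipped away from {0, 1} for numerical stability."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def total_loss(labels, preds, w_ctr=1.0, w_ctavr=1.0, w_ctcvr=1.0):
    """Eq. (11): weighted sum of the three entire-space losses.
    labels/preds are dicts keyed by task name; the weights are hypothetical."""
    return (w_ctr   * log_loss(labels["ctr"],   preds["ctr"]) +
            w_ctavr * log_loss(labels["ctavr"], preds["ctavr"]) +
            w_ctcvr * log_loss(labels["ctcvr"], preds["ctcvr"]))
```

Note that each loss is computed over all impression samples, which is precisely what "entire space" supervision means here.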
4. Experiments
To evaluate the effectiveness of the proposed model, we conducted extensive experiments on both an offline dataset collected from real-world e-commerce scenarios and an online deployment. ESM² is compared with some representative state-of-the-art (SOTA) methods, including GBDT (Friedman, 2001), DNN (Hinton et al., 2006), DNN using the oversampling idea (Pan et al., 2008), and ESMM (Ma et al., 2018a). First, we present the evaluation settings, including the dataset preparation, evaluation metrics, a brief description of these SOTA methods, and the implementation details. Then, we present the comparison results and analysis. Ablation studies are presented next, followed by the performance analysis on different post-click behaviours.
4.1. Evaluation Settings
4.1.1. Dataset Preparation
To the best of our knowledge, there are no public benchmark datasets with sequential behaviour labels, e.g., Cart or Wish, for entire space modeling of CVR prediction. To address this issue, we collected the transaction logs over several consecutive days in September 2019 from our online e-commerce platform, which is one of the largest third-party retail platforms in the world. More than 300 million instances with user/item/user-item features and behaviour labels are filtered out. They are further divided into three disjoint sets, i.e., a training set, a validation set, and a test set.
Table 1. Statistics of the offline dataset.
Category  #User       #Item       #Impression
Number    13,383,415  10,399,095  326,325,042
Category  #Click      #Buy        #Action
Number    20,637,192  226,918     2,501,776
The statistics of this offline dataset are listed in Table 1. For example, only 6% of the items are clicked after being viewed. In addition, among these clicked items, only 1% are eventually bought, which is fairly sparse. However, when restricted to items on which specific actions have been taken, more than 9% are eventually bought; the volume of positive supervisory signal thus increases by about 9 times. Therefore, our ESM² can benefit from the extra supervisory signals and the post-click behaviour decomposition (see Section 3.1), as will be validated in the following experiments.
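The sparsity figures quoted above can be reproduced directly from the counts in Table 1:

```python
# Counts taken from Table 1 of the offline dataset.
n_impression = 326_325_042
n_click = 20_637_192
n_buy = 226_918

click_rate = n_click / n_impression  # ~6% of impressions are clicked
buy_rate = n_buy / n_click           # ~1% of clicks convert to a purchase
```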
Moreover, we also deploy each model in our online recommendation system and carry out an A/B test to compare their performance in a real-world scenario. The details are presented in Section 4.2.2.
4.1.2. Evaluation Metrics
To comprehensively evaluate the effectiveness of the proposed model and compare it with SOTA methods, we adopt three metrics widely used in recommendation and advertising systems, i.e., Area Under Curve (AUC), GAUC (Zhou et al., 2018; Zhu et al., 2017), and the F1 score. AUC reflects the ranking ability and is defined as:
(12)    $\mathrm{AUC} = \frac{1}{|\mathcal{D}^{+}||\mathcal{D}^{-}|} \sum_{x^{+} \in \mathcal{D}^{+}} \sum_{x^{-} \in \mathcal{D}^{-}} \mathbb{I}\left(f(x^{+}) > f(x^{-})\right),$
where $\mathcal{D}^{+}$ and $\mathcal{D}^{-}$ denote the sets of positive and negative samples, $|\mathcal{D}^{+}|$ and $|\mathcal{D}^{-}|$ denote the numbers of samples in $\mathcal{D}^{+}$ and $\mathcal{D}^{-}$, $f(\cdot)$ is the prediction function, and $\mathbb{I}(\cdot)$ is the indicator function, respectively.
GAUC (Zhu et al., 2017) is calculated as follows. First, all the test data are partitioned into different groups according to individual user IDs. Then, the AUC is calculated within each group. Finally, these per-user AUCs are averaged with user-specific weights. Mathematically, GAUC is defined as:
(13)    $\mathrm{GAUC} = \frac{\sum_{u} w_{u} \cdot \mathrm{AUC}_{u}}{\sum_{u} w_{u}},$
where $w_{u}$ denotes the weight for user $u$ (set as 1 in our offline evaluations), and $\mathrm{AUC}_{u}$ denotes the AUC for user $u$.
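A reference implementation of Eqs. (12) and (13) on small arrays is sketched below. The pairwise AUC is O(|D+||D−|), so it is only for illustration, not for hundreds of millions of samples; skipping user groups that contain a single class is a common convention we assume here, since AUC is undefined for them:

```python
import numpy as np

def auc(scores, labels):
    """Eq. (12): fraction of (positive, negative) pairs ranked correctly,
    counting ties as one half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

def gauc(scores, labels, user_ids, weights=None):
    """Eq. (13): weighted average of per-user AUCs."""
    num, den = 0.0, 0.0
    for u in np.unique(user_ids):
        m = user_ids == u
        if labels[m].min() == labels[m].max():
            continue  # AUC undefined for single-class groups; skip them
        w = 1.0 if weights is None else weights[u]
        num += w * auc(scores[m], labels[m])
        den += w
    return num / den
```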
Moreover, the F1 score is defined as:
(14)    $F_1 = \frac{2 \cdot P \cdot R}{P + R},$
where $P$ and $R$ denote the precision and recall, respectively. They are defined as:
(15)    $P = \frac{TP}{TP + FP}$
and
(16)    $R = \frac{TP}{TP + FN},$
where $TP$, $FP$, and $FN$ denote the numbers of true positive, false positive, and false negative predictions, respectively.
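Eqs. (14)–(16) can be sketched and sanity-checked as:

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

def f1_score(tp, fp, fn):
    """Eq. (14): harmonic mean of precision (Eq. 15) and recall (Eq. 16)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```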
4.1.3. Brief Description of Comparison Methods
The representative state-of-the-art methods used for comparison with the proposed ESM² are described as follows.
GBDT (Friedman, 2001): The gradient boosting decision tree (GBDT) model follows the idea of the gradient boosting machine (GBM) and is able to produce competitive, highly robust, and interpretable procedures for regression and classification tasks (Wen et al., 2019). In this paper, we use it as a representative of non-deep-learning methods and a strong baseline.
DNN (Hinton et al., 2006): We also design a deep neural network baseline model, which has exactly the same structure and hyperparameters as each of the individual branches in our model. Different from our model, it is trained using samples on the path Click→Buy for the conversion rate or on the path Impression→Click for the click-through rate, respectively.
DNN-OS (Pan et al., 2008): Due to the data sparsity on the paths Click→Buy and Impression→Buy, it is hard to train a deep neural network with good generalization ability. One strategy to address this issue is to augment the positive samples during training, which is called oversampling. In this paper, we leverage this oversampling strategy to train another deep model named DNN-OS, which has the same structure and hyperparameters as the aforementioned DNN model.
ESMM (Ma et al., 2018a): For a fair comparison, we use the same backbone structure as in the above deep models. ESMM uses multi-task learning to predict $p_{\mathrm{CTR}}$ and $p_{\mathrm{CTCVR}}$ over the entire space, where the feature representation is shared by both tasks. However, it directly models the conversion rate on the path Impression→Click→Buy without considering the post-click behaviours. Therefore, its performance may be degraded due to the data sparsity issue.
The first three methods learn to predict $p_{\mathrm{CTR}}$ and $p_{\mathrm{CVR}}$ using samples on the paths Impression→Click and Click→Buy, and then multiply them together to derive the click-through conversion rate $p_{\mathrm{CTCVR}}$. As for ESMM and our proposed ESM², they directly predict $p_{\mathrm{CTR}}$ and $p_{\mathrm{CTCVR}}$ by modeling over the entire space.
4.1.4. Hyperparameters Settings
For the GBDT model, the number of trees, the tree depth, the minimum instance number for splitting a node, the sampling rate of the training set per iteration, the sampling rate of features per iteration, and the type of loss function are set to 150, 8, 20, 0.6, 0.6, and logistic loss, respectively, chosen according to the AUC score on the validation set. The deep neural network based models are implemented in TensorFlow and trained on GPU using the Adam optimizer for two epochs. The learning rate is set to 0.0005 and the mini-batch size to 1000. Logistic loss is used as the loss function of each prediction task in all the models. There are 5 layers in the MLP, whose dimensions are set to 512, 256, 128, 32, and 2, respectively. These hyperparameters are summarized in Table 2.

Table 2. Hyperparameter choices of the deep models.
Hyperparameter  Choice

Loss function  Logistic Loss 
Optimizer  Adam 
Number of layers in MLP  5 
Dimensions of layers in MLP  [512,256,128,32,2] 
Batch size  1000 
Learning rate  0.0005 
Dropout ratio  0.5 
4.2. Main Results
4.2.1. Comparisons on Offline Data Set
Table 3. AUC and GAUC results on the offline test set.
Method  CVR AUC  CTCVR AUC  CTCVR GAUC
GBDT    0.7823   0.8059     0.7747
DNN     0.8065   0.8161     0.7864
DNN-OS  0.8124   0.8192     0.7893
ESMM    0.8398   0.8270     0.7906
ESM²    0.8486   0.8371     0.8051
Table 4. Precision, recall, and F1 scores of CVR prediction at different thresholds.
          CVR@top0.1%                    CVR@top0.6%                    CVR@top1%
Method    Recall   Precision  F1-Score   Recall   Precision  F1-Score   Recall   Precision  F1-Score
GBDT      4.382%   14.348%    6.714%     16.328%  9.894%     12.322%    27.384%  7.384%     11.631%
DNN       4.938%   15.117%    7.445%     17.150%  10.495%    13.021%    28.481%  8.196%     12.729%
DNN-OS    5.383%   15.837%    8.034%     17.38%   10.839%    13.353%    29.032%  8.423%     13.058%
ESMM      5.813%   16.295%    8.570%     18.585%  11.577%    14.267%    29.789%  8.961%     13.777%
ESM²      6.117%   17.145%    9.017%     23.492%  10.574%    14.584%    30.032%  9.034%     13.890%
Table 5. Precision, recall, and F1 scores of CTCVR prediction at different thresholds.
          CTCVR@top0.1%                  CTCVR@top0.6%                  CTCVR@top1%
Method    Recall   Precision  F1-Score   Recall   Precision  F1-Score   Recall   Precision  F1-Score
GBDT      2.937%   0.701%     1.132%     4.870%   0.649%     1.145%     8.894%   0.531%     1.002%
DNN       3.168%   0.851%     1.341%     5.269%   0.768%     1.340%     9.461%   0.643%     1.204%
DNN-OS    3.382%   0.871%     1.385%     5.369%   0.801%     1.395%     9.863%   0.673%     1.260%
ESMM      3.858%   0.915%     1.479%     5.504%   0.828%     1.439%     10.088%  0.691%     1.294%
ESM²      4.219%   1.001%     1.618%     5.987%   0.900%     1.566%     10.991%  0.753%     1.410%
In this subsection, we report the AUC, GAUC, and F1 scores of all the competitors on the offline test set. Table 3 summarizes the AUC and GAUC results. It can be seen that the DNN method achieves gains of 0.0242, 0.0102, and 0.0117 in CVR AUC, CTCVR AUC, and CTCVR GAUC over the baseline GBDT model, respectively, which demonstrates the strong representation ability of deep neural networks. Different from the vanilla DNN, DNN-OS utilizes an oversampling strategy to augment the sparse positive samples and thereby address the data sparsity issue; it achieves better performance than DNN. As for ESMM, it models $p_{\mathrm{CTR}}$ and $p_{\mathrm{CTCVR}}$ on the path Impression→Click→Buy, which addresses the SSB and DS issues simultaneously. Benefiting from the abundant training samples, it outperforms DNN-OS. However, ESMM neglects the impact of post-click behaviour, which is further exploited by the proposed ESM². ESM² efficiently models $p_{\mathrm{CVR}}$ and several related targets, including $p_{\mathrm{CTR}}$, $p_{\mathrm{CTAVR}}$, and $p_{\mathrm{CTCVR}}$, together in a multi-task learning framework, benefiting from the extra supervisory signals obtained by decomposing the post-click behaviours and integrating them into the behaviour paths, as described in Section 3.1 and Section 3.2. As can be seen, it obtains the best scores among all the methods. For example, its gains over ESMM are 0.0088, 0.0101, and 0.0145 in CVR AUC, CTCVR AUC, and CTCVR GAUC, respectively. It is worth mentioning that a gain of 0.01 in offline AUC generally means a significant increase in revenue in our online recommendation system (Ma et al., 2018a; Wen et al., 2019).
As for the F1 score, we report several values obtained by setting different thresholds for CVR and CTCVR, respectively. First, we sort all the instances in descending order of the predicted CVR or CTCVR score. Then, due to the sparsity of the CVR task (about 1% of the predicted samples are positive), we choose three thresholds, namely top@0.1%, top@0.6%, and top@1%, to split the predictions into positive and negative groups accordingly. Finally, we calculate the precision, recall, and F1 scores of the predictions at these thresholds. The results are summarized in Table 4 and Table 5, respectively. A similar trend to Table 3 can be observed. Again, the proposed ESM² achieves the best performance at all settings.
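The top@k% evaluation protocol described above can be sketched as follows, on toy data (the real evaluation ranks the full test set):

```python
import numpy as np

def f1_at_top(scores, labels, frac):
    """Label the top `frac` fraction of ranked predictions as positive
    (e.g. frac=0.001 for top@0.1%), then compute precision/recall/F1."""
    k = max(1, int(round(frac * len(scores))))
    order = np.argsort(-scores)          # descending by predicted score
    pred = np.zeros(len(scores), dtype=int)
    pred[order[:k]] = 1
    tp = int(np.sum((pred == 1) & (labels == 1)))
    fp = int(np.sum((pred == 1) & (labels == 0)))
    fn = int(np.sum((pred == 0) & (labels == 1)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 0.0 if precision + recall == 0 else \
        2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Shrinking `frac` trades recall for precision, which is why the tables report all three thresholds.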
In summary, we make the following observations after analyzing the gains of the proposed ESM² over the other models: 1) the deep neural network has stronger representation ability than the decision-tree based GBDT; 2) the multi-task learning framework over the entire sample space serves as an efficient tool to address the SSB and DS problems simultaneously; 3) decomposing the post-click behaviours, integrating them into the behaviour paths, and modeling over the entire sample space further alleviate the DS problem and lead to better performance.
4.2.2. Comparisons on Online Deployment
It is not an easy job to deploy deep network models in our recommendation system, since our online system serves hundreds of millions of users every day; there can be more than 100 million users per second at a traffic peak. Therefore, a practical model is required to make real-time CVR predictions with high throughput and low latency. For example, hundreds of recommendation items for each visitor should be predicted in less than 100 milliseconds in our system. To make the online evaluation fair, confident, and comparable, each deployed method in the A/B test should serve the same number of users, i.e., millions of users. To this end, we carefully conducted the A/B test in our online recommendation system over seven consecutive days in September 2019. The results are summarized in Figure 5, where we use the GBDT model as the baseline. As can be seen, DNN, DNN-OS, and ESMM have similar performance but significantly outperform the baseline GBDT model. As for the proposed ESM², there is a significant margin between it and the above methods, which clearly demonstrates its superiority. Besides, it contributes up to a 3% CVR improvement compared with ESMM, which indicates significant business value for the e-commerce platform.
4.3. Ablation Studies
In this part, we present detailed ablation studies covering the hyperparameter settings of the deep neural network, the effectiveness of sampling important numerical features, of embedding dense numerical features, and of decomposing post-click behaviours, and the influence of including non-deterministic supervisory signals, respectively.
4.3.1. Hyperparameters of Deep Neural Networks
Here, we take three critical parameters, namely the dropout ratio, the number of hidden layers, and the dimension of item feature embeddings, as examples to illustrate the process of hyperparameter selection in our model.
Dropout (Srivastava et al., 2014) is a regularization technique that randomly drops some neurons during training. It strengthens a deep neural network's generalization ability by introducing randomness. We try different dropout ratios from 0.2 to 0.7 in our model. As shown in Figure 6(a), a dropout ratio of 0.5 leads to the best performance. Therefore, we set the dropout ratio to 0.5 in all experiments unless otherwise specified.
Increasing the depth of the network can enhance the capacity of deep models but also potentially leads to overfitting. Therefore, we carefully set this hyperparameter according to the AUC scores on the validation set. As can be seen from Figure 6(b), at the beginning, i.e., from 2 to 5 layers, increasing the number of hidden layers consistently improves the model's performance. However, performance saturates at 5 layers, and adding more layers even marginally decreases the AUC scores, suggesting that the model overfits. Therefore, we stack 5 hidden layers in all experiments unless otherwise specified.
The dimension of item feature embeddings is a critical parameter: high-dimensional embeddings preserve more information but also introduce potential noise and higher model complexity. We try different settings of this parameter and plot the results in Figure 6(c). As can be seen, increasing the dimension generally improves the performance. The gain saturates at 128, and doubling the dimension brings no further improvement. Therefore, to trade off model capacity against complexity, we set the dimension of item feature embeddings to 128 in all experiments unless otherwise specified.
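Under the settings selected above (dropout ratio 0.5, five hidden layers, 128-dimensional item embeddings), the forward pass of one prediction tower can be sketched as below. The layer widths, ReLU activations, and inverted-dropout details are illustrative assumptions, not the exact architecture of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 128                     # item feature embedding dimension
HIDDEN = [256, 128, 64, 32, 16]   # five hidden layers (widths are assumptions)
DROPOUT = 0.5                     # best ratio found in Figure 6(a)

def forward(x, weights, train=True):
    """Forward pass of an MLP tower with inverted dropout after each ReLU."""
    h = x
    for w, b in weights:
        h = np.maximum(h @ w + b, 0.0)             # ReLU
        if train:
            mask = rng.random(h.shape) >= DROPOUT  # drop half the units
            h = h * mask / (1.0 - DROPOUT)         # rescale to keep expectation
    return h

# Build weights for EMB_DIM -> 256 -> 128 -> 64 -> 32 -> 16
dims = [EMB_DIM] + HIDDEN
weights = [(rng.normal(0.0, 0.01, (i, o)), np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]

item_emb = rng.normal(size=(4, EMB_DIM))  # a toy batch of 4 item embeddings
out = forward(item_emb, weights, train=True)
```

At inference time `train=False` disables dropout; the inverted-dropout rescaling during training keeps the expected activations consistent between the two modes.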
4.3.2. Effectiveness of Sampling Important Numerical Features
In decision-tree-based models such as GBDT, a common practice is to iteratively select the features with the largest statistical information gain and combine the most useful features to fit the model. Inspired by this, we hypothesize that sampling important numerical features to train the proposed model may also lead to better performance while reducing model complexity. To validate this hypothesis, we employ a GBDT model to evaluate the importance of all numerical features and choose the top-K of them, together with the embedded ID features, as the input of our model. The results for different settings of K are summarized in Table 6. As can be seen, keeping the top-64 features achieves the best performance. Therefore, we set this hyperparameter to 64 in all experiments unless otherwise specified.
Table 6. CVR AUC with different numbers K of sampled numerical features.

K        500     256     128     64      32      8
CVR AUC  0.8479  0.8481  0.8483  0.8486  0.8441  0.8385
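The selection step can be sketched as below. The importance scores here are toy placeholders; in practice they would come from a trained GBDT (for instance, the `feature_importances_` attribute exposed by common implementations):

```python
# Sketch: keep the top-K numerical features ranked by GBDT importance.
# Feature names and importance values are illustrative assumptions.

def top_k_features(importance, k):
    """Return the names of the k features with the largest importance."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

importance = {"price": 0.31, "ctr_7d": 0.22, "age": 0.05,
              "sales_30d": 0.18, "rating": 0.12, "stock": 0.02}
selected = top_k_features(importance, 3)
```

The selected numerical features are then concatenated with the embedded ID features to form the model input, as described above.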
4.3.3. Effectiveness of Embedding Dense Numerical Features
After selecting the most important dense numerical features, a common practice is to first discretize them into one-hot vectors and then concatenate them with the ID features, which are then embedded into dense features through a linear projection layer as described in Section 3.3. However, we hypothesize that the one-hot representation of numerical features may lose precision during discretization. Therefore, we try another solution: we first normalize the numerical features and then embed them using a tanh activation function, i.e.,

x̂_j = tanh((x_j − μ_j) / σ_j),    (17)

where μ_j and σ_j denote the mean and standard deviation of the j-th kind of features. Then, we concatenate the embedded features with the embedded ID features as the input of our model. In our experiments, this embedding achieves a gain of 0.004 AUC over discretization.

4.3.4. Effectiveness of Decomposing Post-Click Behaviours
When decomposing the post-click behaviours, we can integrate different behaviours into the DAction node, e.g., only Cart, only Wish, or both Cart and Wish (Cart∪Wish). Here, we evaluate the effectiveness of decomposing post-click behaviours by choosing different combinations of actions. The results are summarized in Table 7. As can be seen, the combination of both actions achieves the best AUC scores. This is reasonable since the data sparsity issue is less severe than in the other two cases. For example, only 10% (3.5%) of clicked items are added to the cart (wish list), while the ratio becomes 13% if we adopt their combination.
Table 7. Results with different combinations of post-click behaviours integrated into the DAction node.

Combination       CVR AUC  CTCVR AUC  CTCVR GAUC
Cart              0.8457   0.8359     0.7996
Wish              0.8403   0.8319     0.7962
Cart∪Wish         0.8486   0.8371     0.8051
Cart∪Wish∪Intent  0.8462   0.8350     0.8013
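The label construction for each combination can be sketched as below; the field names on each clicked sample are illustrative assumptions:

```python
# Sketch: derive the DAction label of a clicked sample from its post-click
# behaviours under different decomposition choices. "CartWish" denotes the
# union Cart ∪ Wish, which mitigates sparsity relative to either alone.

def daction_label(sample, combination):
    """Return 1 if the sample is a positive DAction under `combination`."""
    if combination == "Cart":
        return int(sample["cart"])
    if combination == "Wish":
        return int(sample["wish"])
    if combination == "CartWish":
        return int(sample["cart"] or sample["wish"])
    raise ValueError(f"unknown combination: {combination}")

clicks = [{"cart": True, "wish": False},
          {"cart": False, "wish": True},
          {"cart": False, "wish": False}]
labels = [daction_label(s, "CartWish") for s in clicks]
```

On this toy batch, the union produces two positives where either single behaviour alone would produce only one, mirroring how the combination raises the positive ratio from 10% (or 3.5%) to 13% in our data.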
Apart from the specific behaviours integrated into the DAction node, there are other behaviours, such as Browsing the Detail Page or Click Again, that imply a high intent to buy. We can also merge these behaviours into the DAction node and add supervisory signals for them. However, in contrast to the specific behaviours with explicit and deterministic supervisory signals, it is not straightforward to assign deterministic labels to them. Instead, we predict an intent score based on the user's historical behaviours on the item, select the samples with high intent scores as positive actions, and add supervisory signals on them. To distinguish them from the deterministic signals, we call them non-deterministic supervisory signals in this paper. The corresponding results are listed in the last row of Table 7. As can be seen, the performance degrades compared with using only deterministic supervisory signals, i.e., Cart∪Wish. This implies that the decomposition of post-click behaviours indeed matters: 1) specific behaviours with deterministic signals should preferably be integrated into the DAction node; 2) non-deterministic supervisory signals may confuse the model.
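The non-deterministic labeling described above can be sketched as follows; the scoring function and threshold are illustrative assumptions, not the model actually used to predict intent:

```python
# Sketch: turn weak post-click behaviours into non-deterministic labels by
# thresholding a predicted intent score. Both the toy scoring function and
# the threshold value are hypothetical.

def nondeterministic_labels(samples, score_fn, threshold=0.8):
    """Label a sample positive only when its intent score is high enough."""
    return [1 if score_fn(s) >= threshold else 0 for s in samples]

def toy_intent_score(sample):
    # e.g. how deeply the detail page was browsed, plus repeated clicks
    return 0.6 * sample["detail_depth"] + 0.4 * min(sample["re_clicks"], 1)

samples = [{"detail_depth": 1.0, "re_clicks": 2},   # strong intent
           {"detail_depth": 0.2, "re_clicks": 0}]   # weak intent
labels = nondeterministic_labels(samples, toy_intent_score)
```

Because these labels depend on a learned score rather than an observed action, they carry noise, which is consistent with the degradation observed in the last row of Table 7.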
4.4. Performance Analysis on Different Post-Click Behaviours
To understand the performance of our model and its difference from ESMM, we further partition the test set into four groups according to the number of each user's purchasing behaviours within the last three months, i.e., [0,10], [11,20], [21,50], and [50,+∞). We report the AUC scores of CVR and CTCVR for both methods on the different groups. The results are plotted in Figure 7. As can be seen, the CVR AUC (CTCVR AUC) of both methods decreases as the number of purchasing behaviours increases. However, we observe that the gain of our model over ESMM increases from group to group, i.e., 0.72%, 0.81%, 1.13%, 1.30%. Generally, users with more purchasing behaviours also have more active post-click behaviours such as Cart and Wish. Our model deals with such post-click behaviours by adding a DAction node supervised with deterministic signals. Therefore, it has better representation ability on those samples than ESMM and achieves better performance on users with high-frequency purchasing behaviours.
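The per-group evaluation above can be sketched as below, using a rank-based AUC computed from scratch on synthetic rows; the grouping boundaries follow the four intervals used in the text:

```python
# Sketch: partition test samples by the user's purchase count over the last
# three months, then compute AUC separately within each group. Data rows
# are synthetic placeholders.

def auc(scores, labels):
    """Probability that a random positive is ranked above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def group_of(purchases):
    if purchases <= 10:
        return "[0,10]"
    if purchases <= 20:
        return "[11,20]"
    if purchases <= 50:
        return "[21,50]"
    return "[50,+)"

# (purchase count, predicted CVR, conversion label) -- synthetic rows
rows = [(3, 0.9, 1), (5, 0.2, 0), (15, 0.7, 1), (18, 0.4, 0),
        (30, 0.6, 0), (40, 0.8, 1), (80, 0.3, 1), (90, 0.5, 0)]

by_group = {}
for n, score, label in rows:
    by_group.setdefault(group_of(n), []).append((score, label))
group_auc = {g: auc(*zip(*v)) for g, v in by_group.items()}
```

Comparing `group_auc` between two models on the same partition gives the per-group gains reported in Figure 7.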
Table 8. AUC scores of both methods on samples split by post-click behaviour path.

Method Name  CVR OAction  CVR DAction  CTCVR OAction  CTCVR DAction
ESMM         0.8802       0.7419       0.8510         0.7074
Ours         0.8851       0.7463       0.8578         0.7241
To further validate the above analysis, we also report the AUC scores of CVR and CTCVR for both methods on the respective paths, such as Impression→Click→DAction→Buy and Impression→Click→OAction→Buy, by splitting the test samples along these paths. The results are listed in Table 8. As can be seen, our model outperforms ESMM on both paths, and the improvement of CTCVR on the path Impression→Click→DAction→Buy is much more significant than on the path Impression→Click→OAction→Buy.
5. Conclusion
In this paper, we propose an Elaborated Entire Space Supervised Multi-task Model for online recommendation. By introducing the idea of Post-Click Behaviour Decomposition, it efficiently addresses the sample selection bias and data sparsity problems. Three modules, namely a shared embedding module, a decomposed prediction module, and a sequential composition module, are devised to construct the deep neural network and to model over the entire space via multi-task learning. The conversion rate prediction benefits from the abundant training samples derived from the decomposed behaviours, as well as from the related auxiliary tasks, including the post-view click-through rate, the click-through DAction conversion rate, and the click-through conversion rate. Extensive experiments in both offline and online environments demonstrate the superiority of the proposed model over state-of-the-art models.
References
 Chen et al. (2019) Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba. arXiv preprint arXiv:1905.06874 (2019).
 Dupret and Piwowarski (2008) Georges E Dupret and Benjamin Piwowarski. 2008. A user browsing model to predict search engine click data from past observations.. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 331–338.
 Effendi and Ali (2017) Muhammad Junaid Effendi and Syed Abbas Ali. 2017. Click through rate prediction for contextual advertisment using linear regression. arXiv preprint arXiv:1701.08744 (2017).
 Feng et al. (2019) Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep Session Interest Network for Click-Through Rate Prediction. arXiv preprint arXiv:1905.06482 (2019).
 Friedman (2001) Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.
 Gao et al. (2019) Chen Gao, Xiangnan He, Dahua Gan, Xiangning Chen, Fuli Feng, Yong Li, TatSeng Chua, and Depeng Jin. 2019. Neural MultiTask Recommendation from MultiBehavior Data. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 1554–1557.
 Golbeck et al. (2006) Jennifer Golbeck, James Hendler, et al. 2006. Filmtrust: Movie recommendations using trust in web-based social networks. In Proceedings of the IEEE Consumer Communications and Networking Conference, Vol. 96. Citeseer, 282–286.
 Gopalan et al. (2014) Prem K Gopalan, Laurent Charlin, and David Blei. 2014. Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems. 3176–3184.
 Graepel et al. (2010) Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. Omnipress.
 Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 6645–6649.
 Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
 Hadash et al. (2018) Guy Hadash, Oren Sar Shalom, and Rita Osadchy. 2018. Rank and rate: multi-task learning for recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 451–454.
 He et al. (2014) Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 1–9.
 Hinton et al. (2006) Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18, 7 (2006), 1527–1554.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
 Lee et al. (2012) Kuang-chih Lee, Burkay Orten, Ali Dasdan, and Wentong Li. 2012. Estimating conversion rate in display advertising from past performance data. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 768–776.
 Logesh and Subramaniyaswamy (2019) R Logesh and V Subramaniyaswamy. 2019. Exploring hybrid recommender systems for personalized travel applications. In Cognitive informatics and soft computing. Springer, 535–544.
 Lv et al. (2019) Fuyu Lv, Taiwei Jin, Changlong Yu, Fei Sun, Quan Lin, Keping Yang, and Wilfred Ng. 2019. SDM: Sequential Deep Matching Model for Online Large-scale Recommender System. arXiv preprint arXiv:1909.00385 (2019).
 Ma et al. (2018b) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018b. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1930–1939.
 Ma et al. (2018a) Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018a. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 1137–1140.
 Naruchitparames et al. (2011) Jeff Naruchitparames, Mehmet Hadi Güneş, and Sushil J Louis. 2011. Friend recommendations in social networks using genetic algorithms and network topology. In 2011 IEEE Congress of Evolutionary Computation (CEC). IEEE, 2207–2214.
 Ni et al. (2018) Yabo Ni, Dan Ou, Shichen Liu, Xiang Li, Wenwu Ou, Anxiang Zeng, and Luo Si. 2018. Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 596–605.
 Pan et al. (2008) Rong Pan, Yunhong Zhou, Bin Cao, Nathan N Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 502–511.
 Qu et al. (2016) Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1149–1154.
 Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
 Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 101–110.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
 Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. arXiv preprint arXiv:1904.06690 (2019).
 Thakkar et al. (2019) Priyank Thakkar, Krunal Varma, Vijay Ukani, Sapan Mankad, and Sudeep Tanwar. 2019. Combining User-Based and Item-Based Collaborative Filtering Using Machine Learning. In Information and Communication Technology for Intelligent Systems. Springer, 173–180.
 Tsai et al. (2019) Chun-Hua Tsai, Peter Brusilovsky, and Behnam Rahdari. 2019. Exploring User-Controlled Hybrid Recommendation in a Conference Context. In IUI Workshops.
 Van den Oord et al. (2013) Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Advances in Neural Information Processing Systems. 2643–2651.
 Wen et al. (2019) Hong Wen, Jing Zhang, Quan Lin, Keping Yang, and Pipei Huang. 2019. Multi-Level Deep Cascade Trees for Conversion Rate Prediction in Recommendation System. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 338–345.
 Wilson et al. (2019) Nathan R Wilson, Emily A Hueske, Thomas C Copeman, Evan Favermann Eisert, Jana B Eggers, Raymond J Plante, and Michael D Houle. 2019. Systems and methods for providing recommendations based on collaborative and/or content-based nodal interrelationships. US Patent App. 14/687,742.
 Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and TatSeng Chua. 2017. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617 (2017).
 Zadrozny (2004) Bianca Zadrozny. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-first International Conference on Machine Learning. ACM, 114.
 Zhang et al. (2019) Feng Zhang, Victor E Lee, Ruoming Jin, Saurabh Garg, Kim-Kwang Raymond Choo, Michele Maasberg, Lijun Dong, and Chi Cheng. 2019. Privacy-aware smart city: A case study in collaborative filtering recommender systems. J. Parallel and Distrib. Comput. 127 (2019), 145–159.
 Zhang et al. (2016) Weinan Zhang, Tianxiong Zhou, Jun Wang, and Jian Xu. 2016. Bid-aware gradient descent for unbiased learning with censored data in display advertising. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 665–674.
 Zhang et al. (2014) Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and TieYan Liu. 2014. Sequential click prediction for sponsored search with recurrent neural networks. In TwentyEighth AAAI Conference on Artificial Intelligence.
 Zhou et al. (2019b) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019b. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
 Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1059–1068.
 Zhou et al. (2019a) Wang Zhou, Jianping Li, Yongluan Zhou, and Muhammad Hammad Memon. 2019a. Bayesian pairwise learning to rank via one-class collaborative filtering. Neurocomputing (2019).
 Zhou and Feng (2017) Zhi-Hua Zhou and Ji Feng. 2017. Deep forest: Towards an alternative to deep neural networks. arXiv preprint arXiv:1702.08835 (2017).
 Zhu et al. (2017) Han Zhu, Junqi Jin, Chang Tan, Fei Pan, Yifan Zeng, Han Li, and Kun Gai. 2017. Optimized cost per click in taobao display advertising. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2191–2200.
 Zhu et al. (2018) Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning Tree-based Deep Model for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1079–1088.