Conversion Rate Prediction via Post-Click Behaviour Modeling

10/15/2019 ∙ by Hong Wen, et al. ∙ 0

Effective and efficient recommendation is crucial for modern e-commerce platforms. It consists of two indispensable components named Click-Through Rate (CTR) prediction and Conversion Rate (CVR) prediction, where the latter is an essential factor contributing to the final purchasing volume. Existing methods specifically predict CVR using the clicked and purchased samples, which has limited performance affected by the well-known sample selection bias and data sparsity issues. To address these issues, we propose a novel deep CVR prediction method by considering the post-click behaviors. After grouping deterministic actions together, we construct a novel sequential path, which elaborately depicts the post-click behaviors of users. Based on the path, we define the CVR and several related probabilities including CTR, etc., and devise a deep neural network with multiple targets involved accordingly. It takes advantage of the abundant samples with deterministic labels derived from the post-click actions, leading to a significant improvement of CVR prediction. Extensive experiments on both offline and online settings demonstrate its superiority over representative state-of-the-art methods.




1. Introduction

Discovering valuable products or services for users from the massive number of options available on the Internet has become a fundamental functionality of modern online applications such as e-commerce (Ni et al., 2018; Chen et al., 2019; Sun et al., 2019; Wen et al., 2019), social networking (Golbeck et al., 2006; Naruchitparames et al., 2011), and advertising (Zhou et al., 2019b, 2018). The Recommendation System (RS), a widely used information filtering tool (Zhu et al., 2018; Feng et al., 2019; Lv et al., 2019), serves this role by providing accurate, timely, and personalized services to users. Taking online recommendation on an e-commerce platform as an example, there are usually two phases, i.e., the System Recommendation Phase and the User Action Phase, as shown in Figure 1. After analysing users’ long- and short-term behaviours, the RS first generates a large number of related items. These items are then ranked and delivered to users according to the estimated Click-Through Rate (CTR) (Zhou et al., 2019b, 2018) and Conversion Rate (CVR) (Wen et al., 2019; Ma et al., 2018a), with the aim of maximizing the probabilities that the impression items are clicked and bought. On the user side, after receiving the impression items, users click on the ones they are interested in and may finally buy some of them. Accurately estimating CTR and CVR is therefore crucial for providing the expected products to users as well as for increasing sales. To this end, the logs of users’ clicking and buying behaviours, which provide valuable feedback to the RS, are used to improve it further.

Figure 1. The architecture of online recommendation on an e-commerce platform, which comprises two fundamental components, i.e., the System Recommendation Phase and the User Action Phase.

However, two critical issues make the aforementioned estimation challenging, i.e., Sample Selection Bias (SSB) (Zadrozny, 2004) and Data Sparsity (DS) (Lee et al., 2012). SSB refers to the sample bias between the model training phase and the inference phase, i.e., conventional CVR models are trained only on clicked samples but are used for inference on all impression samples. Since clicked samples are only a very small portion of the impression samples, SSB places a severe burden on inference accuracy. Besides, after clicking items, users eventually buy only very few of them. This leads to the DS problem: samples from the sequential behaviour path Click→Buy are insufficient to train a CVR model with strong representation ability. As illustrated in Figure 2, dealing with the SSB and DS problems is crucial for the accuracy of a CVR model.

Figure 2. Illustration of the sample selection bias problem in conventional CVR prediction, where the training space consists only of clicked samples, while the inference space is the entire space of all impression samples. The data volume gradually decreases from Impression to Buy.

Several studies have been carried out to tackle these challenges. Pan et al. propose a negative example weighting and sampling method to deal with the absence of negative examples in conventional recommendation systems (Pan et al., 2008). Although it can reduce the side effect of SSB by introducing negative examples, it also leads to underestimated predictions. Zhang et al. propose a model-free learning framework that fits the underlying distribution in the context of advertising (Zhang et al., 2016). However, it may encounter numerical instability when weighting samples. Lee et al. propose to build several hierarchical estimators with different features, whose distribution parameters are estimated individually (Lee et al., 2012). This approach relies on prior knowledge to construct the hierarchical structures, which makes it difficult to apply in recommendation systems with tens of millions of users and items. Recently, Ma et al. propose an approach named Entire Space Multi-task Model (ESMM), which models over the entire space by considering the path Impression→Click→Buy (Ma et al., 2018a). ESMM models the post-view click-through rate (pCTR) and the post-view click-through conversion rate (pCTCVR) together in a multi-task framework. Consequently, the post-click conversion rate (pCVR) can be derived from pCTR and pCTCVR over the entire space. In this way, ESMM addresses the SSB and DS issues by making abundant use of all the training samples and of the supervisory signals from two auxiliary tasks based on the paths Impression→Click and Impression→Buy, respectively.

Although ESMM achieves better performance than conventional methods, the DS problem still exists since the training samples on the path Click→Buy remain scarce. To deal with this problem, we observe that users usually take some specific actions after clicking items if they eventually buy them, e.g., Add to Cart (Cart) or Add to Wish List (Wish). Therefore, we can change the original path Click→Buy to Click→Cart→Buy, Click→Wish→Buy, etc., where data sparsity on the intermediate paths Click→Cart or Cart→Buy is alleviated compared with the original path Click→Buy. Moreover, we can model CVR on these new paths by exploiting the extra training samples with supervisory signals from these specific actions. Motivated by this observation, and different from prior ESMM work that models the path Impression→Click→Buy, we insert two parallel, disjoint nodes, Deterministic Action (DAction) and Other Action (OAction), between Click and Buy to model the entire space more elaborately. This changes the conventional path Impression→Click→Buy to the novel path Impression→Click→DAction/OAction→Buy by decomposing the post-click behaviour.

Specifically, we propose a novel deep neural recommendation algorithm named Elaborated Entire Space Supervised Multi-task Model (ESM²), which consists of three modules: 1) a shared embedding module (SEM), 2) a decomposed prediction module (DPM), and 3) a sequential composition module (SCM). First, SEM embeds the one-hot feature vectors of ID features into dense representations through a linear fully connected layer. Then, these embeddings are fed into the subsequent DPM, where individual prediction networks estimate the probabilities of the decomposed targets on the paths Impression→Click, Click→DAction, DAction→Buy, and OAction→Buy, respectively. Finally, SCM integrates them sequentially according to the defined behaviour paths to calculate the final pCVR and some auxiliary probabilities including pCTR, pCTCVR, etc. In a nutshell, the proposed method addresses the SSB and DS problems simultaneously, and improves the final prediction accuracy by employing a multi-task learning framework and supervisory signals from intermediate actions between Click and Buy.

The main contributions of this paper are as follows:

To the best of our knowledge, this is the first work to introduce the idea of Post-Click Behaviour Decomposition to model CVR over the entire space, which changes the conventional path Impression→Click→Buy to the novel path Impression→Click→DAction/OAction→Buy by decomposing the post-click behaviour.

We propose a novel deep neural recommendation algorithm named ESM² based on the above idea, which simultaneously predicts the probabilities of the decomposed targets and sequentially composes them to calculate the final pCVR and some auxiliary targets. This multi-task learning model can efficiently address the SSB and DS problems.

Our proposed model achieves better performance on a real-world offline dataset than representative state-of-the-art methods. To further demonstrate its efficiency in industrial applications, we successfully deploy it in our online recommendation module and achieve significant improvement.

The rest of this paper is organized as follows. Section 2 presents a brief survey of related work, followed by the details of the proposed model in Section 3. Experimental results and analysis are presented in Section 4. Finally, we conclude the paper in Section 5.

2. Related Work

Generally speaking, recommendation methods include content-based methods (Gopalan et al., 2014; Van den Oord et al., 2013; Wilson et al., 2019), collaborative filtering based methods (Zhang et al., 2019; Thakkar et al., 2019; Zhou et al., 2019a), and hybrid methods (Logesh and Subramaniyaswamy, 2019; Tsai et al., 2019). Content-based methods recommend items similar to a user’s past interests, collaborative filtering based methods recommend items that people with similar tastes preferred in the past, and hybrid methods integrate two or more types of recommendation strategies. Our method falls into the collaborative filtering based category and specifically tackles the post-click conversion rate prediction problem using a multi-task learning framework via post-click behaviour decomposition. Therefore, we briefly review the most related work from the following two aspects: 1) conversion rate prediction; 2) multi-task learning.

Conversion Rate Prediction: Conversion rate prediction is a key component of many online applications, such as search engines (Dupret and Piwowarski, 2008; Zhang et al., 2014), recommendation systems (Guo et al., 2017; Qu et al., 2016), and online advertising (Graepel et al., 2010; He et al., 2014). However, CVR prediction is very challenging since conversions are extremely rare events: only a very small portion of impression items are eventually clicked and bought. Prior work studies both linear and non-linear models, including logistic regression (Effendi and Ali, 2017), decision trees (Wen et al., 2019; Zhou and Feng, [n.d.]), and factorization machines (Rendle, 2010; Xiao et al., 2017). Recently, deep neural networks have achieved significant progress in many areas due to their strong ability in feature representation and end-to-end modeling (Hinton et al., 2006; Graves et al., 2013; Krizhevsky et al., 2012; Shen et al., 2014; Feng et al., 2019; Lv et al., 2019). Following these works, we also adopt a deep neural network to embed user-item features and predict the conversion rate.

Multi-Task Learning: Due to the temporal multi-stage nature of users’ purchasing behaviour, e.g., Impression, Click, and Buy, prior work attempts to formulate conversion rate prediction within a multi-task learning framework. For example, Hadash et al. propose a multi-task learning based recommendation system that models the ranking and rating prediction tasks simultaneously (Hadash et al., 2018). Ma et al. propose a multi-task learning approach named Multi-gate Mixture-of-Experts to explicitly learn the task relationship from data (Ma et al., 2018b). Gao et al. propose a neural multi-task recommendation model named NMTR to learn the cascading relationship among different types of behaviours (Gao et al., 2019). In contrast, we model the CTR and CVR tasks simultaneously by associating them with users’ sequential behaviour paths, where the task relationship is explicitly defined by conditional probabilities (see Section 3). Ni et al. propose to learn universal user representations across multiple tasks for more effective personalization (Ni et al., 2018). We also adopt this idea by sharing the embedded features across different tasks.

Recently, Ma et al. propose an entire space multi-task model (ESMM) for estimating the post-click conversion rate (Ma et al., 2018a). ESMM uses the pCTR task and the pCTCVR task as parallel auxiliary tasks of the main pCVR task. Our method follows ESMM but differs in the following way: we decompose the post-click behaviour into DAction and OAction parts and insert them into the original sequential path Click→Buy. In this way, the model can leverage the supervisory signals from users’ post-click behaviours, which consequently alleviates the data sparsity issue.

3. Proposed Method

3.1. Motivation

Figure 3. Several kinds of supervisory signals are introduced on the path Click→Buy. (a) The multiple paths leading to Buy, such as Impression→Click→Cart→Buy. (b) A digraph is used to describe the purchasing process, where the numbers above the edges represent the sparsity of the corresponding paths. (c) These supervisory signals are integrated into a unified node DAction. Here, DAction represents the union of Cart and Wish.

There exist multiple kinds of behaviour paths leading to Buy, such as Impression followed sequentially by Click, Cart, and Buy, as shown in Figure 3(a). After analyzing our online real-world data, we use a digraph to describe the purchasing process, as shown in Figure 3(b). First, some items are displayed to users. Then, when users click an item, they usually take some specific actions before eventually buying it. For example, 10% of them add the item to their carts (Add to Cart, Cart for short) so that they can buy it together with other items later (12% of them eventually buy it). 3.5% of them add the item to their wish lists (Add to Wish List, Wish for short) because they like it but cannot buy it immediately for some reason, e.g., they cannot afford it now or are waiting for a promotion (31% of them eventually buy it). How can these sequential behaviours make CVR prediction more accurate? One possible solution is to enumerate all possible disjoint behaviour paths and then use the intermediate supervisory signals within the individual paths. However, 1) distinguishing these behaviour paths thoroughly makes the data on each individual path even sparser; 2) it is nearly impossible to enumerate all the behaviour paths due to the intricacy of user behaviours; and 3) it is not straightforward to integrate the final prediction target, such as CVR, from the individual paths.

As shown in Figure 3(c), instead of treating these behaviour paths individually, we define a single node named Deterministic Action (DAction) to integrate specific actions such as Cart or Wish. DAction has the following two properties: 1) It has deterministic supervisory signals, i.e., 1 for taking some action and 0 for none. 2) It further alleviates the DS problem by integrating multiple kinds of actions, e.g., 13% of the users take some specific action after they click an item, and 9% of them eventually buy it (because the actions overlap, the number is 13% instead of 13.5%). Therefore, we can use these abundant training samples on the intermediate paths Click→DAction and DAction→Buy to supervise the CVR model. We also add a node named Other Action (OAction for short) between Click and Buy to cover all cases other than DAction. In this way, the original path Impression→Click→Buy is changed to the more elaborate path Impression→Click→DAction/OAction→Buy. We call this idea Post-Click Behaviour Decomposition and devise our CVR model accordingly. The details are presented as follows.
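The DAction label described above can be sketched as a simple union of post-click actions. The snippet below is a minimal illustration, not the authors' pipeline: the function name `daction_label` and the toy log are hypothetical, and the counts are chosen only to mirror the quoted rates (10% Cart, 3.5% Wish, 13% DAction because of overlap).

```python
def daction_label(clicked, cart, wish):
    """Return 1 if a clicked item received any deterministic action, else 0."""
    # No post-click action can occur without a click.
    if not clicked:
        return 0
    return 1 if (cart or wish) else 0

# Hypothetical toy log of 1000 clicked impressions (illustrative counts only).
clicks = (
    [{"cart": True, "wish": True}] * 5        # both actions (the overlap)
    + [{"cart": True, "wish": False}] * 95
    + [{"cart": False, "wish": True}] * 30
    + [{"cart": False, "wish": False}] * 870  # these fall into OAction
)

cart_rate = sum(r["cart"] for r in clicks) / len(clicks)
wish_rate = sum(r["wish"] for r in clicks) / len(clicks)
daction_rate = sum(
    daction_label(True, r["cart"], r["wish"]) for r in clicks
) / len(clicks)

print(cart_rate, wish_rate, daction_rate)
```

Because DAction is a union, its rate is smaller than the sum of the individual action rates whenever a user performs both actions on the same item.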

3.2. Probability Decomposition of CVR

In this section, we present the probability decomposition of conversion rate according to the digraph defined in Figure 3(c).

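First, the post-view click-through rate of an item $x_i$, denoted as $p_i^{ctr}$, is defined as the probability of the item being clicked given that it has been viewed, which depicts the path Impression→Click in the digraph. Mathematically, it can be written as:

$$p_i^{ctr} = p(c_i = 1 \mid v_i = 1) \triangleq y_{1i}, \tag{1}$$

where $c_i \in \{0, 1\}$ denotes whether the $i$-th item $x_i$ is clicked, $C = \{c_i\}_{i=1}^{N}$ is the label space of all the items being clicked or not, and $N$ is the number of items. Similarly, $v_i \in \{0, 1\}$ denotes whether the item $x_i$ is viewed (i.e., Impression), and $V = \{v_i\}_{i=1}^{N}$ is the label space of all the items being viewed or not. $y_{1i}$ is a surrogate symbol for simplicity.

Then, the click-through DAction conversion rate of an item $x_i$, denoted as $p_i^{ctavr}$, is defined as the probability of a DAction being taken on the item given that it has been viewed, which depicts the path Impression→Click→DAction in the digraph. Mathematically, it can be written as:

$$p_i^{ctavr} = p(a_i = 1 \mid v_i = 1) = \sum_{c_i \in \{0,1\}} p(a_i = 1 \mid v_i = 1, c_i)\, p(c_i \mid v_i = 1) = p(a_i = 1 \mid v_i = 1, c_i = 1)\, p(c_i = 1 \mid v_i = 1) = y_{1i}\, y_{2i}, \tag{2}$$

where $a_i \in \{0, 1\}$ denotes whether some specific action defined in Section 3.1 is taken on the item $x_i$, and $A = \{a_i\}_{i=1}^{N}$ is the label space of all the items on which specific actions are taken or not. $y_{2i}$, depicting the path Click→DAction, is a surrogate symbol for simplicity, defined as $y_{2i} \triangleq p(a_i = 1 \mid c_i = 1)$. It is trivial that $p(a_i = 1 \mid v_i = 1, c_i = 1) = p(a_i = 1 \mid c_i = 1)$ since all the samples are impression samples (i.e., $v_i = 1$). It is noteworthy that Eq. (2) holds because no action occurs without a click, i.e., $p(a_i = 1 \mid v_i = 1, c_i = 0) = 0$.

Next, the conversion rate of an item $x_i$, denoted as $p_i^{cvr}$, is defined as the probability of the item being bought given that it has been clicked, which depicts the paths Click→DAction/OAction→Buy in the digraph. Mathematically, it can be written as:

$$p_i^{cvr} = p(b_i = 1 \mid c_i = 1) = \sum_{a_i \in \{0,1\}} p(b_i = 1 \mid c_i = 1, a_i)\, p(a_i \mid c_i = 1) = y_{2i}\, y_{3i} + (1 - y_{2i})\, y_{4i}, \tag{3}$$

where $b_i \in \{0, 1\}$ denotes whether the item $x_i$ is bought, and $B = \{b_i\}_{i=1}^{N}$ is the label space of all the items being bought or not. $y_{3i}$ and $y_{4i}$ are surrogate symbols for simplicity, defined as $y_{3i} \triangleq p(b_i = 1 \mid c_i = 1, a_i = 1)$ and $y_{4i} \triangleq p(b_i = 1 \mid c_i = 1, a_i = 0)$. $y_{3i}$ or $y_{4i}$ depicts the path DAction→Buy or OAction→Buy in the digraph, respectively.

The click-through conversion rate of an item $x_i$, denoted as $p_i^{ctcvr}$, is defined as the probability of the item being bought given that it has been viewed, which depicts the complete path Impression→Click→DAction/OAction→Buy in the digraph. Mathematically, it can be written as:

$$\begin{aligned}
p_i^{ctcvr} &= p(b_i = 1 \mid v_i = 1) \\
&= \sum_{c_i \in \{0,1\}} p(b_i = 1 \mid v_i = 1, c_i)\, p(c_i \mid v_i = 1) \\
&= \sum_{c_i \in \{0,1\}} p(b_i = 1 \mid c_i)\, p(c_i \mid v_i = 1) \\
&= p(b_i = 1 \mid c_i = 1)\, p(c_i = 1 \mid v_i = 1) \\
&= y_{1i} \big( y_{2i}\, y_{3i} + (1 - y_{2i})\, y_{4i} \big).
\end{aligned} \tag{4}$$

Here, we use $p(b_i = 1 \mid c_i)$ to replace $p(b_i = 1 \mid v_i = 1, c_i)$ in the third equality for simplicity without causing any ambiguity. It is noteworthy that the fourth equality holds because no item is bought without being clicked, i.e., $p(b_i = 1 \mid c_i = 0)$ equals zero. Indeed, Eq. (4) can be derived by decomposing the path Impression→Click→DAction/OAction→Buy into Impression→Click and Click→DAction/OAction→Buy, and integrating Eq. (1) and Eq. (3) together according to the chain rule.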
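The composition described in this section can be sketched in a few lines. In the snippet below, `compose` is a hypothetical helper, and `y1` to `y4` stand for the four conditional probabilities on the paths Impression→Click, Click→DAction, DAction→Buy, and OAction→Buy. The example inputs loosely follow the rates quoted in the text (6% click, 13% DAction, 9% Buy after DAction); the value for OAction→Buy is an assumption made purely for illustration.

```python
def compose(y1, y2, y3, y4):
    """Compose path probabilities into pCTR, pCTAVR, pCVR and pCTCVR.

    y1 = p(click | view), y2 = p(DAction | click),
    y3 = p(buy | click, DAction), y4 = p(buy | click, OAction).
    """
    for y in (y1, y2, y3, y4):
        if not 0.0 <= y <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    p_ctr = y1
    p_ctavr = y1 * y2                       # Impression -> Click -> DAction
    p_cvr = y2 * y3 + (1.0 - y2) * y4       # Click -> DAction/OAction -> Buy
    p_ctcvr = y1 * p_cvr                    # chain rule over the full path
    return {"ctr": p_ctr, "ctavr": p_ctavr, "cvr": p_cvr, "ctcvr": p_ctcvr}

# y4 = 0.002 is a made-up value; the other inputs echo rates quoted in the text.
probs = compose(y1=0.06, y2=0.13, y3=0.09, y4=0.002)
print(probs)
```

Since every composed quantity is a product or convex combination of probabilities, the outputs stay in [0, 1] without any extra normalisation.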

Figure 4. The diagram of the ESM² model over the entire space, which consists of three key modules: 1) a Shared Embedding Module (SEM), 2) a Decomposed Prediction Module (DPM), and 3) a Sequential Composition Module (SCM). SEM embeds sparse features into dense representations. DPM predicts the individual decomposed targets on the paths Impression→Click, Click→DAction, DAction→Buy, and OAction→Buy, respectively. SCM integrates them sequentially to calculate the final CVR.

3.3. Elaborated Entire Space Supervised Multi-task Model

Given the users’ behaviour logs, we can easily obtain the ground-truth labels for the post-view click-through rate (pCTR), the click-through DAction conversion rate (pCTAVR), and the click-through conversion rate (pCTCVR) defined in the above section. In other words, we can model them simultaneously by using a multi-task learning framework. To this end, we propose a novel deep neural recommendation algorithm named Elaborated Entire Space Supervised Multi-task Model (ESM²) for conversion rate prediction. ESM² gets its name since: 1) pCTR, pCTAVR, and pCTCVR are modeled over the entire space, i.e., using all the impression samples; 2) the pCVR derived from Eq. (3) also benefits from the entire space multi-task modeling, which will be validated in the experiment part. As shown in Figure 4, the proposed ESM² is modeled by a deep neural network and consists of three key modules: 1) a shared embedding module, 2) a decomposed prediction module, and 3) a sequential composition module. We present each of them in detail as follows.

Shared Embedding Module (SEM): First, we use SEM to embed all the sparse ID features and dense numerical features coming from the user field, the item field, and the user-item cross field. The user features include users’ IDs, ages, genders, purchasing power, etc. The item features include items’ IDs, prices, accumulated CTR and CVR from historical logs, etc. The user-item features include users’ historical preference scores on items, etc. Dense numerical features are first discretized based on their boundary values and then represented as one-hot vectors. Here, we use $f_i = \{f_{ij}\}_{j \in \Lambda}$ to denote the one-hot features of the $i$-th training sample, where $\Lambda$ denotes the index set of all kinds of features. Due to the sparse nature of one-hot encoding, we employ linear fully connected layers to embed them into dense representations, which can be formulated as:

$$g_{ij} = P_j^{\top} f_{ij}, \tag{5}$$

where $P_j$ denotes the embedding matrix for the $j$-th kind of features, whose entries are the network parameters of SEM.
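The embedding step can be sketched as follows: a dense numerical feature is discretized by boundary values, turned into a one-hot vector, and multiplied by an embedding matrix, which is equivalent to a row lookup. This is a minimal pure-Python illustration; the price boundaries, the matrix `P_price`, and the helper names are all hypothetical and not taken from the paper.

```python
import bisect

def bucketize(value, boundaries):
    """Index of the bucket that `value` falls into, given sorted boundaries."""
    return bisect.bisect_right(boundaries, value)

def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def embed(one_hot_vec, matrix):
    """Linear layer without bias: P^T f, with P of shape (vocab, dim)."""
    dim = len(matrix[0])
    return [
        sum(one_hot_vec[v] * matrix[v][d] for v in range(len(matrix)))
        for d in range(dim)
    ]

price_boundaries = [10.0, 50.0, 200.0]                        # hypothetical
P_price = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]    # 4 buckets x 2 dims

idx = bucketize(75.0, price_boundaries)   # 75.0 lies between 50.0 and 200.0
f = one_hot(idx, len(P_price))
g = embed(f, P_price)
print(idx, f, g)
```

In practice the multiplication by a one-hot vector is usually implemented as a table lookup, which gives the same result without materialising the sparse vector.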

Decomposed Prediction Module (DPM): Then, once all the feature embeddings are obtained, they are concatenated and fed into the decomposed prediction module, where they are shared by each of the subsequent networks. Each individual prediction network in DPM estimates the probability of the decomposed target on the path Impression→Click, Click→DAction, DAction→Buy, or OAction→Buy, respectively. In this paper, we employ a Multi-Layer Perceptron (MLP) as the prediction network. All the non-linear activation functions are ReLU except in the output layer, where we use a Sigmoid function to map the output to a probability between 0 and 1. Mathematically, it can be formulated as:

$$y_{ki} = \sigma\big(\varphi_{\theta_k}^{k}(g_i)\big), \quad k = 1, \dots, 4, \tag{6}$$

where $\sigma$ denotes the Sigmoid function, $g_i$ denotes the concatenated embeddings of the $i$-th sample, $\varphi_{\theta_k}^{k}$ denotes the mapping function learned by the $k$-th MLP, and $\theta_k$ denotes its network parameters. For example, the first MLP in Figure 4 outputs the estimated probability $y_{1i}$, which is indeed the post-view click-through rate.
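A minimal sketch of the four prediction networks sharing one embedding is given below. It is only illustrative: the layer sizes, the random initialisation, and the single-logit output are assumptions chosen to keep the example small, and they do not reproduce the layer dimensions reported in the experiments.

```python
import math
import random

def dense(x, weights, bias):
    """Affine layer: weights[j] holds the input weights of output unit j."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_tower(sizes, rng):
    """Randomly initialise one MLP; sizes = [input_dim, hidden..., 1]."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def tower_forward(x, layers):
    """ReLU on hidden layers, Sigmoid on the single output logit."""
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = layers[-1]
    return sigmoid(dense(x, weights, bias)[0])

rng = random.Random(0)
shared_embedding = [0.5, -1.0, 0.25, 2.0]     # hypothetical concatenated embedding
towers = [init_tower([4, 8, 4, 1], rng) for _ in range(4)]

# One probability per path: Impression->Click, Click->DAction,
# DAction->Buy, OAction->Buy.
y1, y2, y3, y4 = (tower_forward(shared_embedding, t) for t in towers)
print(y1, y2, y3, y4)
```

All four towers read the same input vector, so gradients from every task would flow back into the shared embedding during training.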

Sequential Composition Module (SCM): Finally, SCM composes the above predicted probabilities sequentially according to Eq. (1) to Eq. (4) to calculate the conversion rate (pCVR) and some auxiliary targets, namely the post-view click-through rate (pCTR), the click-through DAction conversion rate (pCTAVR), and the click-through conversion rate (pCTCVR). As shown in the top part of Figure 4, SCM is a parameter-free feed-forward neural network which represents the underlying conditional probabilities defined by the purchasing decision digraph in Figure 3(c).


3.4. Training Objective

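We use $S = \{(c_i, a_i, b_i; f_i)\}_{i=1}^{N}$ to denote the training set, where $c_i$, $a_i$, and $b_i$ represent the ground-truth labels indicating whether the $i$-th impression sample is clicked, has a deterministic action taken on it, and is bought, respectively, and $f_i$ denotes its features. Let $y_{1i}, \dots, y_{4i}$ be the outputs of the four prediction networks in DPM for this sample. Then, we can define the joint post-view click-through probability of all training samples as follows:

$$p^{ctr} = \prod_{i \in C_+} y_{1i} \prod_{j \in C_-} (1 - y_{1j}), \tag{7}$$

where $C_+$ and $C_-$ denote the positive and negative samples in the label space $C$, respectively. After taking the negative logarithm of Eq. (7), we obtain the logloss of $p^{ctr}$, which is widely used in recommendation systems, i.e.,

$$L_{ctr} = -\sum_{i \in C_+} \log y_{1i} - \sum_{j \in C_-} \log (1 - y_{1j}). \tag{8}$$

Similarly, we can obtain the loss functions of $p^{ctavr}$ and $p^{ctcvr}$ as follows:

$$L_{ctavr} = -\sum_{i \in A_+} \log (y_{1i}\, y_{2i}) - \sum_{j \in A_-} \log (1 - y_{1j}\, y_{2j}), \tag{9}$$

$$L_{ctcvr} = -\sum_{i \in B_+} \log \big[ y_{1i} \big( y_{2i}\, y_{3i} + (1 - y_{2i})\, y_{4i} \big) \big] - \sum_{j \in B_-} \log \big[ 1 - y_{1j} \big( y_{2j}\, y_{3j} + (1 - y_{2j})\, y_{4j} \big) \big], \tag{10}$$

where $A_+$, $A_-$ and $B_+$, $B_-$ denote the positive and negative samples in the label spaces $A$ and $B$, respectively.

The final training objective to be minimized is defined as:

$$L(\Theta) = w_{ctr}\, L_{ctr} + w_{ctavr}\, L_{ctavr} + w_{ctcvr}\, L_{ctcvr}, \tag{11}$$

where $\Theta$ denotes all the network parameters in ESM², and $w_{ctr}$, $w_{ctavr}$, $w_{ctcvr}$ are the loss weights of $L_{ctr}$, $L_{ctavr}$, $L_{ctcvr}$, respectively.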
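The training objective can be sketched as a weighted sum of three log losses, each computed on a composed probability over all impression samples. The snippet below is a minimal illustration under stated assumptions: `esm2_loss` and `log_loss` are hypothetical helper names, the default loss weights of 1.0 are placeholders rather than values from the paper, and the clipping constant `eps` is added only for numerical safety.

```python
import math

def log_loss(p, label, eps=1e-12):
    """Negative log-likelihood of a binary label under probability p."""
    p = min(max(p, eps), 1.0 - eps)     # clip to avoid log(0)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)

def esm2_loss(samples, w_ctr=1.0, w_ctavr=1.0, w_ctcvr=1.0):
    """Weighted sum of the three entire-space log losses.

    Each sample is ((y1, y2, y3, y4), (c, a, b)): the four path
    probabilities and the click / DAction / buy labels.
    """
    total = 0.0
    for (y1, y2, y3, y4), (c, a, b) in samples:
        p_ctr = y1
        p_ctavr = y1 * y2
        p_ctcvr = y1 * (y2 * y3 + (1.0 - y2) * y4)
        total += (
            w_ctr * log_loss(p_ctr, c)
            + w_ctavr * log_loss(p_ctavr, a)
            + w_ctcvr * log_loss(p_ctcvr, b)
        )
    return total

# One impression that was clicked but received no action and no purchase.
batch = [((0.5, 0.5, 0.5, 0.5), (1, 0, 0))]
loss = esm2_loss(batch)
print(loss)
```

Note that no loss is attached to pCVR directly: it is supervised only through the composed targets, all of which are defined for every impression sample.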

4. Experiments

To evaluate the effectiveness of the proposed ESM² model, we conducted extensive experiments on both an offline dataset collected from real-world e-commerce scenarios and an online deployment. ESM² is compared with some representative state-of-the-art (SOTA) methods including GBDT (Friedman, 2001), DNN (Hinton et al., 2006), DNN using the over-sampling idea (Pan et al., 2008), and ESMM (Ma et al., 2018a). First, we present the evaluation settings, including the dataset preparation, evaluation metrics, a brief description of these SOTA methods, and the implementation details. Then, we present the comparison results and analysis. Ablation studies are presented next, followed by the performance analysis on different post-click behaviours.

4.1. Evaluation Settings

4.1.1. Dataset Preparation

To the best of our knowledge, there are no public benchmark datasets with sequential behaviour labels, e.g., Cart or Wish, for entire space modeling of CVR prediction. To address this issue, we collect the transaction logs over several consecutive days in September 2019 from our online e-commerce platform, which is one of the largest third-party retail platforms in the world. More than 300 million instances with user, item, and user-item features and behaviour labels are extracted. They are further divided into three disjoint sets, i.e., a training set, a validation set, and a test set.

Category       Number
#User          13,383,415
#Item          10,399,095
#Impression    326,325,042
#Click         20,637,192
#Buy           226,918
#Action        2,501,776
Table 1. Statistics of the offline dataset.

The statistics of this offline dataset are listed in Table 1. For example, only 6% of items have been clicked after being viewed. In addition, among these clicked items, only 1% have eventually been bought, which is fairly sparse. However, among the items on which specific actions have been taken, more than 9% have eventually been bought. The data volume increases by about 9 times in relative terms. Therefore, our ESM² can benefit from the extra supervisory signals and the post-click behaviour decomposition (see Section 3.1), as will be validated in the following experiments.
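The percentages quoted above can be checked directly from the counts in Table 1. The short computation below is only a sanity check; the variable names are ours, and the last ratio simply divides #Buy by #Action, as the text does.

```python
# Counts copied from Table 1.
impressions = 326_325_042
clicks = 20_637_192
actions = 2_501_776
buys = 226_918

click_rate = clicks / impressions     # share of impressions that are clicked
buy_per_click = buys / clicks         # share of clicks that end in a purchase
action_per_click = actions / clicks   # share of clicks followed by a DAction
buy_per_action = buys / actions       # #Buy relative to #Action

for name, value in [
    ("click / impression", click_rate),
    ("buy / click", buy_per_click),
    ("action / click", action_per_click),
    ("buy / action", buy_per_action),
]:
    print(f"{name}: {value:.2%}")
```

The ratios come out at roughly 6%, 1%, 12%, and 9%, which matches the orders of magnitude stated in the text.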

Moreover, we also deploy each model in our online recommendation system and carry out an A/B test to compare their performance in a real-world scenario. The details will be presented in Section 4.2.2.

4.1.2. Evaluation Metrics

To comprehensively evaluate the effectiveness of the proposed model and compare it with SOTA methods, we adopt three metrics widely used in recommendation and advertising systems, i.e., Area Under Curve (AUC), GAUC (Zhou et al., 2018; Zhu et al., 2017), and the F1 score. AUC reflects the ranking ability and is defined as:

$$AUC = \frac{1}{|S_+|\,|S_-|} \sum_{x^+ \in S_+} \sum_{x^- \in S_-} I\big(\phi(x^+) > \phi(x^-)\big), \tag{12}$$

where $S_+$ and $S_-$ denote the sets of positive and negative samples, respectively, $|S_+|$ and $|S_-|$ denote the number of samples in $S_+$ and $S_-$, $\phi(\cdot)$ is the prediction function, and $I(\cdot)$ is the indicator function.

GAUC (Zhu et al., 2017) is calculated as follows. First, all the test data are partitioned into groups according to the individual user ID. Then, the AUC is calculated within each group. Finally, we take the weighted average of these AUC values. Mathematically, GAUC is defined as:

$$GAUC = \frac{\sum_{u} w_u \cdot AUC_u}{\sum_{u} w_u}, \tag{13}$$

where $w_u$ denotes the weight for user $u$ (set to 1 for our offline evaluations), and $AUC_u$ denotes the AUC for user $u$.
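The two ranking metrics can be sketched directly from their definitions. The snippet below is a small, quadratic-time illustration rather than an efficient implementation. Two conventions are assumptions on our part: ties between a positive and a negative score count as misordered, following the strict indicator in the AUC definition, and users whose samples are all positive or all negative are skipped in GAUC because their AUC is undefined.

```python
from collections import defaultdict

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None                      # undefined without both classes
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))

def gauc(user_ids, labels, scores, weights=None):
    """Weighted average of per-user AUC; weight defaults to 1 per user."""
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    num = den = 0.0
    for u, (ys, ss) in groups.items():
        user_auc = auc(ys, ss)
        if user_auc is None:
            continue                     # assumption: skip single-class users
        w = 1.0 if weights is None else weights[u]
        num += w * user_auc
        den += w
    return num / den if den else None

users = ["u1", "u1", "u1", "u2", "u2"]
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.3, 0.6]

overall_auc = auc(labels, scores)            # 4 of 6 pairs ordered correctly
grouped_auc = gauc(users, labels, scores)    # mean of 1.0 (u1) and 0.0 (u2)
print(overall_auc, grouped_auc)
```

The toy data shows why the two metrics can disagree: the global ranking mixes scores across users, whereas GAUC only compares items shown to the same user.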

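Moreover, the F1 score is defined as:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}, \tag{14}$$

where $P$ and $R$ denote the precision and recall, respectively. They are defined as:

$$P = \frac{TP}{TP + FP}, \tag{15}$$

$$R = \frac{TP}{TP + FN}, \tag{16}$$

where $TP$, $FP$, and $FN$ denote the number of true positive, false positive, and false negative predictions, respectively.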

4.1.3. Brief Description of Comparison Methods

The representative state-of-the-art methods compared with the proposed ESM² are described as follows.

GBDT (Friedman, 2001): The gradient boosting decision tree (GBDT) model follows the idea of the gradient boosting machine (GBM) and is able to produce competitive, highly robust, and interpretable procedures for regression or classification tasks (Wen et al., 2019). In this paper, we use it as the representative non-deep-learning method and as a strong baseline.

DNN (Hinton et al., 2006): We also design a deep neural network baseline model, which has exactly the same structure and hyper-parameters as each of the individual branches in our ESM² model. Different from our model, it is trained using samples on the path Click→Buy or Impression→Click to predict the conversion rate (pCVR) or the click-through rate (pCTR), respectively.

DNN-OS (Pan et al., 2008): Due to the data sparsity on the path Click→Buy or Impression→Buy, it is hard to train a deep neural network with good generalization ability. To address this issue, one strategy is to augment the positive samples during training, which is called over-sampling. In this paper, we leverage this over-sampling strategy to train another deep model named DNN-OS, which has the same structure and hyper-parameters as the aforementioned DNN model.
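One simple way to realise over-sampling is to replicate each positive sample a fixed number of times before shuffling. The sketch below illustrates that idea only; the function name `oversample_positives`, the replication factor, and the toy data are hypothetical, and the exact augmentation scheme used for DNN-OS is not specified in the text.

```python
import random

def oversample_positives(samples, factor, rng):
    """Return a shuffled copy in which each positive sample appears `factor` times."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    augmented = []
    for features, label in samples:
        copies = factor if label == 1 else 1
        augmented.extend([(features, label)] * copies)
    rng.shuffle(augmented)
    return augmented

# Toy clicked samples with 1% positives (one purchase out of 100 clicks).
clicked = [({"item": i}, 1 if i == 0 else 0) for i in range(100)]
train = oversample_positives(clicked, factor=10, rng=random.Random(0))

positives = sum(label for _, label in train)
print(len(train), positives)
```

Replicating positives changes the class prior seen during training, so predicted probabilities from such a model generally need recalibration before being read as rates.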

ESMM (Ma et al., 2018a): For a fair comparison, we use the same backbone structure as the above deep models. It uses multi-task learning to predict pCTR and pCTCVR over the entire space, where the feature representation is shared by both tasks. However, it directly models the conversion rate on the path Impression→Click→Buy without considering the post-click behaviours. Therefore, its performance may be degraded due to the data sparsity issue.

The first three methods learn to predict pCTR and pCVR using samples on the paths Impression→Click and Click→Buy, respectively, and then multiply them together to derive the click-through conversion rate pCTCVR. As for ESMM and our proposed ESM², they obtain pCTCVR and pCVR directly by modeling over the entire space.

4.1.4. Hyper-parameters Settings

For the GBDT model, the number of trees, the tree depth, the minimum number of instances for splitting a node, the sampling rate of the training set for each iteration, the sampling rate of features for each iteration, and the type of loss function are set to 150, 8, 20, 0.6, 0.6, and logistic loss, respectively, which are chosen according to the AUC score on the validation set. The deep neural network based models are implemented in TensorFlow and trained using the Adam optimizer for two epochs. The learning rate is set to 0.0005, and the mini-batch size is set to 1000. Logistic loss is used as the loss function of each prediction task in all the models. There are 5 layers in the MLP, where the dimensions of the layers are set to 512, 256, 128, 32, and 2, respectively. These hyper-parameters are summarized in Table 2.

Hyper-parameter                 Choice
Loss function                   Logistic loss
Optimizer                       Adam
Number of layers in MLP         5
Dimensions of layers in MLP     [512, 256, 128, 32, 2]
Batch size                      1000
Learning rate                   0.0005
Dropout ratio                   0.5
Table 2. Hyper-parameters of deep neural network based models including DNN, DNN-OS, ESMM, and ESM².

4.2. Main Results

4.2.1. Comparisons on Offline Data Set

Method    CVR AUC   CTCVR AUC   CTCVR GAUC
GBDT      0.7823    0.8059      0.7747
DNN       0.8065    0.8161      0.7864
DNN-OS    0.8124    0.8192      0.7893
ESMM      0.8398    0.8270      0.7906
ESM²      0.8486    0.8371      0.8051
Table 3. The AUC and GAUC scores of all the methods.

          CVR@top0.1%                      CVR@top0.6%                      CVR@top1%
Method    Recall   Precision  F1-Score     Recall   Precision  F1-Score     Recall   Precision  F1-Score
GBDT      4.382%   14.348%    6.714%       16.328%  9.894%     12.322%      27.384%  7.384%     11.631%
DNN       4.938%   15.117%    7.445%       17.150%  10.495%    13.021%      28.481%  8.196%     12.729%
DNN-OS    5.383%   15.837%    8.034%       17.38%   10.839%    13.353%      29.032%  8.423%     13.058%
ESMM      5.813%   16.295%    8.570%       18.585%  11.577%    14.267%      29.789%  8.961%     13.777%
ESM²      6.117%   17.145%    9.017%       23.492%  10.574%    14.584%      30.032%  9.034%     13.890%
Table 4. The Precision, Recall, and F1 scores of all the methods for CVR.

          CTCVR@top0.1%                    CTCVR@top0.6%                    CTCVR@top1%
Method    Recall   Precision  F1-Score     Recall   Precision  F1-Score     Recall   Precision  F1-Score
GBDT      2.937%   0.701%     1.132%       4.870%   0.649%     1.145%       8.894%   0.531%     1.002%
DNN       3.168%   0.851%     1.341%       5.269%   0.768%     1.340%       9.461%   0.643%     1.204%
DNN-OS    3.382%   0.871%     1.385%       5.369%   0.801%     1.395%       9.863%   0.673%     1.260%
ESMM      3.858%   0.915%     1.479%       5.504%   0.828%     1.439%       10.088%  0.691%     1.294%
ESM²      4.219%   1.001%     1.618%       5.987%   0.900%     1.566%       10.991%  0.753%     1.410%
Table 5. The Precision, Recall, and F1 scores of all the methods for CTCVR.

In this subsection, we report the AUC, GAUC, and F1 scores of all the competitors on the offline test set. Table 3 summarizes the AUC and GAUC results. It can be seen that the DNN method achieves gains of 0.0242, 0.0102, and 0.0117 in CVR AUC, CTCVR AUC, and CTCVR GAUC over the baseline GBDT model, respectively. This demonstrates the strong representation ability of deep neural networks. Different from the vanilla DNN, DNN-OS uses an over-sampling strategy to up-weight the sparse positive samples and thereby address the data sparsity issue. It achieves better performance than DNN. As for ESMM, it models pCTR and pCTCVR on the path Impression→Click→Buy, which tries to address the SSB and DS issues simultaneously. Benefiting from the abundant training samples, it outperforms DNN-OS. However, ESMM neglects the impact of post-click behaviour, which is further exploited by the proposed ESM². ESM² efficiently models pCVR and several related targets, including pCTR, pCTAVR, and pCTCVR, together under a multi-task learning framework, which benefits from the extra supervisory signals obtained by decomposing the post-click behaviours and integrating them into the behaviour paths as described in Section 3.1 and Section 3.2. As can be seen, it obtains the best scores among all the methods. For example, the gains over ESMM are 0.0088, 0.0101, and 0.0145 in CVR AUC, CTCVR AUC, and CTCVR GAUC, respectively. It is worth mentioning that a gain of 0.01 in offline AUC always means a significant increment in revenue in our online recommendation system (Ma et al., 2018a; Wen et al., 2019).

As for the F1 score, we report several values obtained by setting different thresholds for CVR and CTCVR, respectively. First, we sort all the instances in descending order according to the predicted CVR or CTCVR score. Then, due to the sparsity of the CVR task (about 1% of the predicted samples are positive), we choose three thresholds, namely top@0.1%, top@0.6% and top@1%, to split the predictions into positive and negative groups. Finally, we calculate the precision, recall and F1 scores of the predictions at each threshold. The results are summarized in Table 4 and Table 5, respectively. A trend similar to that in Table 3 can be observed: the proposed method again achieves the best performance in all settings.
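The thresholding procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used in the paper: the function name `metrics_at_top_fraction`, the toy scores and labels, and the choice to predict at least one positive are all hypothetical.

```python
def metrics_at_top_fraction(scores, labels, fraction):
    """Precision, recall and F1 when the top `fraction` of instances
    (ranked by predicted score, descending) are predicted positive."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Rank instance indices by predicted score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Number of instances predicted positive (assumption: at least one).
    k = max(1, int(len(scores) * fraction))
    true_pos = sum(labels[i] for i in order[:k])
    total_pos = sum(labels)
    precision = true_pos / k
    recall = true_pos / total_pos if total_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): 10 instances, 2 of them positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
for fraction in (0.1, 0.2, 0.5):
    p, r, f1 = metrics_at_top_fraction(scores, labels, fraction)
    print(f"top@{fraction:.0%}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

As a consistency check on the reported numbers, the first three values in the last row of Table 5 (recall 4.219%, precision 1.001%) give 2PR/(P+R) ≈ 1.618%, which matches the listed F1 score.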

In conclusion, we draw the following conclusions from the gains of the proposed model over the other models: 1) The deep neural network has stronger representation ability than the decision-tree-based GBDT. 2) The multi-task learning framework over the entire sample space serves as an efficient tool to address the SSB and DS problems simultaneously. 3) Decomposing the post-click behaviours, integrating them into the behaviour paths, and modeling the entire sample space further alleviates the DS problem and leads to better performance.

4.2.2. Comparisons on Online Deployment

It is not an easy task to deploy deep network models in our recommendation system, since our online system serves hundreds of millions of users every day. It can be more than 100 million users per second at a traffic peak. Therefore, a practical model is required to make real-time CVR predictions with high throughput and low latency. For example, hundreds of recommendation items for each visitor should be scored in less than 100 milliseconds in our system. To make the online evaluation fair, confident and comparable, each method deployed during an A/B test should cover the same number of users, i.e., millions of users. To this end, we carefully conducted the A/B test in our online recommendation system over seven consecutive days in September 2019. The results are summarized in Figure 5, where we use the GBDT model as the baseline. As can be seen, DNN, DNN-OS and ESMM perform similarly and significantly outperform the baseline GBDT model. There is a significant margin between the proposed model and the above methods, which clearly demonstrates its superiority. Besides, it contributes up to a 3% CVR improvement over ESMM, which indicates significant business value for the e-commerce platform.

4.3. Ablation Studies

In this part, we present the detailed ablation studies including hyper-parameter settings of deep neural network, effectiveness of sampling important numerical features, embedding dense numerical features, decomposing post-click behaviours, and the influence of including non-deterministic supervisory signals, respectively.

Figure 5. The results of A/B test for CVR by deploying different models in our recommendation system.
Figure 6. The results of different hyper-parameter settings in the proposed model.

4.3.1. Hyper-parameters of Deep Neural Networks

Here, we take three critical parameters, namely the dropout ratio, the number of hidden layers and the dimension of item feature embeddings, as examples to illustrate the process of parameter selection in our model.

Dropout (Srivastava et al., 2014) refers to the regularization technique which randomly drops some neural nodes during training. It can strengthen deep neural networks’ generalization ability by introducing randomness. We try different choices of the dropout ratio from 0.2 to 0.7 in our model. As shown in Figure 6(a), a dropout ratio of 0.5 leads to the best performance. Therefore, we set the dropout ratio as 0.5 in all the experiments if not specified.
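To make the mechanism concrete, the following is a minimal sketch of dropout applied to a vector of activations. It uses the common "inverted" variant, which rescales the surviving units during training so that nothing needs to change at inference time; this variant, the function name `dropout`, and the toy input are assumptions for illustration and are not taken from the paper.

```python
import random


def dropout(activations, drop_ratio, training=True, rng=random):
    """Inverted dropout: zero each unit with probability `drop_ratio` during
    training and rescale survivors so the expected activation is unchanged."""
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")
    if not training or drop_ratio == 0.0:
        # At inference time the activations pass through untouched.
        return list(activations)
    keep_prob = 1.0 - drop_ratio
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Toy check (hypothetical data): with a ratio of 0.5, about half of the
# units are zeroed and the mean activation stays close to the original 1.0.
rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, drop_ratio=0.5, rng=rng)
zero_fraction = sum(1 for a in dropped if a == 0.0) / len(dropped)
print(f"zeroed fraction: {zero_fraction:.3f}")
print(f"mean activation: {sum(dropped) / len(dropped):.3f}")
```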

Increasing the depth of the network can enhance the capacity of deep models but may also lead to over-fitting. Therefore, we carefully set this hyper-parameter according to the AUC scores on the validation set. As can be seen from Figure 6(b), in the initial stage, i.e., from 2 layers to 5 layers, increasing the number of hidden layers consistently improves the model’s performance. However, the performance saturates at 5 layers: adding more layers even marginally decreases the AUC scores, which suggests that the model may be over-fitted. Therefore, we stack 5 hidden layers in all experiments if not specified.

The dimension of item feature embeddings is a critical parameter: high-dimensional features retain more information but may also introduce noise and increase model complexity. We try different settings of this parameter and plot the results in Figure 6(c). As can be seen, increasing the dimension generally improves the performance, which finally saturates at 128; doubling the dimension brings no further gain. Therefore, to make a trade-off between model capacity and complexity, we set the dimension of item feature embeddings to 128 in all the experiments if not specified.

4.3.2. Effectiveness of Sampling Important Numerical Features

In decision-tree-based models such as GBDT, a common practice is to iteratively select the features with the largest statistical information gain and combine the most useful features to fit the model. Inspired by this, we hypothesize that sampling important features from the numerical features to train the proposed model may also lead to better performance while reducing the model complexity. To validate the hypothesis, we employ a GBDT model to evaluate the importance of all numerical features and choose the top K of them, together with the embedded ID features, as the input of our model. The results for different settings of K are summarized in Table 6. As can be seen, keeping the top-64 features achieves the best performance. Therefore, we set this hyper-parameter to 64 in all the experiments if not specified.

K 500 256 128 64 32 8
CVR AUC 0.8479 0.8481 0.8483 0.8486 0.8441 0.8385
Table 6. Comparison results for different settings of top-K numerical features.
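The selection step can be sketched as follows, assuming the per-feature importance scores have already been obtained from a trained GBDT model. The feature names, the importance values, and the helper names `select_top_k_features` and `project_row` are hypothetical and serve only to illustrate the top-K filtering.

```python
def select_top_k_features(importances, k):
    """Return the names of the k features with the highest importance."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def project_row(row, selected):
    """Keep only the selected numerical features of one sample."""
    return {name: row[name] for name in selected}


# Hypothetical importance scores, e.g. as exported from a trained GBDT.
importances = {
    "price": 0.31,
    "ctr_7d": 0.27,
    "cart_cnt_30d": 0.22,
    "stock": 0.12,
    "weight": 0.08,
}
selected = select_top_k_features(importances, k=3)

# One hypothetical sample; only the selected features are passed on.
row = {"price": 19.9, "ctr_7d": 0.04, "cart_cnt_30d": 3.0,
       "stock": 120.0, "weight": 0.5}
print(selected)
print(project_row(row, selected))
```

In the setting of Table 6, K would be swept over the listed values and the one with the best validation AUC kept.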

4.3.3. Effectiveness of Embedding Dense Numerical Features

After selecting the most important dense numerical features, a common practice is to discretize them into one-hot vectors first and then concatenate them with the ID features, which are then embedded into dense features through a linear projection layer as described in Section 3.3. However, we hypothesize that the one-hot representation of numerical features may lose precision during discretization. Therefore, we try another solution: we normalize the numerical features first and then embed them using a tanh activation function, i.e.,

$$\tilde{x}_{j} = \tanh\left(\frac{x_{j} - \mu_{j}}{\sigma_{j}}\right),$$

where $x_{j}$ is the raw value of a feature of the $j$-th kind, and $\mu_{j}$ and $\sigma_{j}$ denote the mean and standard deviation of the $j$-th kind of features. Then, we concatenate the embedded features with the embedded ID features as the input of our model. In our experiment, this achieves a gain of 0.004 AUC over the discretization.
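The normalize-then-tanh embedding described above can be sketched as follows. This is an illustration under assumptions: the use of the population standard deviation, the small `eps` guard against a zero standard deviation, the helper names, and the toy price column are not specified in the paper.

```python
import math
import statistics


def fit_stats(column):
    """Mean and (population) standard deviation of one kind of feature."""
    return statistics.mean(column), statistics.pstdev(column)


def tanh_embed(value, mean, std, eps=1e-8):
    """Standardize a raw value, then squash it into (-1, 1) with tanh.
    `eps` is an assumed guard against division by zero."""
    return math.tanh((value - mean) / (std + eps))


# Toy column (hypothetical): item prices with one large outlier.
prices = [5.0, 10.0, 15.0, 20.0, 1000.0]
mean, std = fit_stats(prices)
embedded = [tanh_embed(v, mean, std) for v in prices]
print([round(e, 3) for e in embedded])
```

Unlike one-hot discretization, the mapping is continuous and order-preserving, and the tanh bounds the influence of outliers such as the last value in the toy column.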

4.3.4. Effectiveness of Decomposing Post-Click Behaviours

When decomposing the post-click behaviours, we can integrate different behaviours into the DAction node, e.g., only Cart, only Wish, or both Cart and Wish (CartWish). Here, we evaluate the effectiveness of decomposing post-click behaviours by choosing different combinations of actions. The results are summarized in Table 7. As can be seen, the combination of both actions achieves the best AUC scores. This is reasonable since the data sparsity issue is less severe than in the other two cases. For example, only 10% (3.5%) of clicked items are added to the cart (wish list), while the proportion rises to 13% if we combine the two actions.

Cart 0.8457 0.8359 0.7996
Wish 0.8403 0.8319 0.7962
CartWish 0.8486 0.8371 0.8051
CartWishIntent 0.8462 0.8350 0.8013
Table 7. Comparison results for different choices on decomposing post-click behaviours.

Apart from the specific behaviours which are integrated into the DAction node, there are other behaviours, such as Browsing The Detail Page or Click Again, which indicate a high intent to buy. We can also merge these behaviours into the DAction node and add supervisory signals for them. However, in contrast to the specific behaviours with explicit and deterministic supervisory signals, it is not straightforward to assign deterministic labels to them. Instead, we predict an intent score based on users’ historical behaviours on the item, select the samples with high intent scores as positive actions, and add supervisory signals on them. To distinguish them from the deterministic signals, we call them non-deterministic supervisory signals in this paper. The corresponding results are listed in the last row of Table 7. As can be seen, the performance of the proposed model degrades compared with the variant that uses only deterministic supervisory signals, i.e., CartWish. This implies that the decomposition of post-click behaviours indeed matters: 1) the specific behaviours with deterministic signals should preferably be categorized into the DAction node; 2) the non-deterministic supervisory signals may confuse the model.

4.4. Performance Analysis on Different Post-Click Behaviours

To understand the performance of the proposed model and its difference from ESMM, we further partition the test set into four groups according to the number of users’ purchasing behaviours within the last three months, i.e., [0,10], [11,20], [21,50], [50,+∞). We report the AUC scores of CVR and CTCVR for both methods in each group. The results are plotted in Figure 7. As can be seen, the CVR AUC (CTCVR AUC) of both methods decreases as the number of purchasing behaviours increases. However, we observe that the gain of the proposed model over ESMM increases from group to group, i.e., 0.72%, 0.81%, 1.13%, 1.30%. Generally, users with more purchasing behaviours have more active post-click behaviours such as Cart and Wish. Our model deals with such post-click behaviours by adding a DAction node that is supervised with deterministic signals. Therefore, it has better representation ability on those samples than ESMM and achieves better performance on users with high-frequency purchasing behaviours.
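The group-wise evaluation can be sketched as below: each test sample is assigned to a bucket by the user's purchase count, and AUC is computed per bucket. The pairwise AUC implementation is quadratic and meant for clarity only; the sample tuples, the function names, and the decision to place a count of exactly 50 in the [21,50] bucket (the stated ranges overlap at 50) are assumptions for illustration.

```python
from collections import defaultdict


def auc(scores, labels):
    """AUC as the probability that a random positive is scored above a
    random negative (ties count as half). Returns None if undefined."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def purchase_bucket(n_purchases):
    """Map a user's purchase count in the last three months to a group."""
    if n_purchases <= 10:
        return "[0,10]"
    if n_purchases <= 20:
        return "[11,20]"
    if n_purchases <= 50:
        return "[21,50]"
    return ">50"


def auc_by_bucket(samples):
    """samples: iterable of (n_purchases, predicted_score, label)."""
    groups = defaultdict(lambda: ([], []))
    for n_purchases, score, label in samples:
        scores, labels = groups[purchase_bucket(n_purchases)]
        scores.append(score)
        labels.append(label)
    return {bucket: auc(s, y) for bucket, (s, y) in groups.items()}


# Toy samples (hypothetical): (purchase count, predicted score, label).
samples = [(3, 0.9, 1), (5, 0.2, 0), (8, 0.4, 0),
           (15, 0.3, 1), (18, 0.6, 0),
           (30, 0.7, 1), (40, 0.7, 0),
           (70, 0.8, 1)]
result = auc_by_bucket(samples)
for bucket, value in result.items():
    print(bucket, value)
```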

Figure 7. The AUC scores of CVR and CTCVR for the proposed model and ESMM at different groups. Please refer to Section 4.4.
                 CVR                  CTCVR
Method           OAction   DAction    OAction   DAction
ESMM             0.8802    0.7419     0.8510    0.7074
Proposed model   0.8851    0.7463     0.8578    0.7241
Table 8. Comparison of path-wise AUC results for ESMM and the proposed model.

To further validate the above analysis, we also report the AUC scores of CVR and CTCVR for both methods on the respective paths, such as Click→DAction→Buy or Impression→Click→OAction→Buy, by splitting the test samples accordingly. The results are listed in Table 8. As can be seen, our model outperforms ESMM on both paths, and the improvement in CTCVR on the path Impression→Click→DAction→Buy is much larger than on the path Impression→Click→OAction→Buy.

5. Conclusion

In this paper, we propose an Elaborated Entire Space Supervised Multi-task Model for online recommendation. By introducing the idea of Post-Click Behaviour Decomposition, it efficiently addresses the sample selection bias and data sparsity problems. Three modules, namely a shared embedding module, a decomposed prediction module, and a sequential composition module, are devised to construct the deep neural network and model over the entire space by employing multi-task learning. The conversion rate prediction benefits from the abundant training samples derived from the decomposed behaviours, as well as from the related auxiliary tasks, including the post-view click-through rate, the click-through DAction conversion rate, and the click-through conversion rate. Extensive experiments in both offline and online environments demonstrate the superiority of the proposed model over state-of-the-art models.


  • Chen et al. (2019) Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba. arXiv preprint arXiv:1905.06874 (2019).
  • Dupret and Piwowarski (2008) Georges E Dupret and Benjamin Piwowarski. 2008. A user browsing model to predict search engine click data from past observations. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 331–338.
  • Effendi and Ali (2017) Muhammad Junaid Effendi and Syed Abbas Ali. 2017. Click through rate prediction for contextual advertisment using linear regression. arXiv preprint arXiv:1701.08744 (2017).
  • Feng et al. (2019) Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep Session Interest Network for Click-Through Rate Prediction. arXiv preprint arXiv:1905.06482 (2019).
  • Friedman (2001) Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.
  • Gao et al. (2019) Chen Gao, Xiangnan He, Dahua Gan, Xiangning Chen, Fuli Feng, Yong Li, Tat-Seng Chua, and Depeng Jin. 2019. Neural Multi-Task Recommendation from Multi-Behavior Data. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 1554–1557.
  • Golbeck et al. (2006) Jennifer Golbeck, James Hendler, et al. 2006. Filmtrust: Movie recommendations using trust in web-based social networks. In Proceedings of the IEEE Consumer communications and networking conference, Vol. 96. Citeseer, 282–286.
  • Gopalan et al. (2014) Prem K Gopalan, Laurent Charlin, and David Blei. 2014. Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems. 3176–3184.
  • Graepel et al. (2010) Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. Omnipress.
  • Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 6645–6649.
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
  • Hadash et al. (2018) Guy Hadash, Oren Sar Shalom, and Rita Osadchy. 2018. Rank and rate: multi-task learning for recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 451–454.
  • He et al. (2014) Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 1–9.
  • Hinton et al. (2006) Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation 18, 7 (2006), 1527–1554.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • Lee et al. (2012) Kuang-chih Lee, Burkay Orten, Ali Dasdan, and Wentong Li. 2012. Estimating conversion rate in display advertising from past performance data. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 768–776.
  • Logesh and Subramaniyaswamy (2019) R Logesh and V Subramaniyaswamy. 2019. Exploring hybrid recommender systems for personalized travel applications. In Cognitive informatics and soft computing. Springer, 535–544.
  • Lv et al. (2019) Fuyu Lv, Taiwei Jin, Changlong Yu, Fei Sun, Quan Lin, Keping Yang, and Wilfred Ng. 2019. SDM: Sequential Deep Matching Model for Online Large-scale Recommender System. arXiv preprint arXiv:1909.00385 (2019).
  • Ma et al. (2018b) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018b. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1930–1939.
  • Ma et al. (2018a) Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018a. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 1137–1140.
  • Naruchitparames et al. (2011) Jeff Naruchitparames, Mehmet Hadi Güneş, and Sushil J Louis. 2011. Friend recommendations in social networks using genetic algorithms and network topology. In 2011 IEEE Congress of Evolutionary Computation (CEC). IEEE, 2207–2214.
  • Ni et al. (2018) Yabo Ni, Dan Ou, Shichen Liu, Xiang Li, Wenwu Ou, Anxiang Zeng, and Luo Si. 2018. Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 596–605.
  • Pan et al. (2008) Rong Pan, Yunhong Zhou, Bin Cao, Nathan N Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 502–511.
  • Qu et al. (2016) Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1149–1154.
  • Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
  • Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management. ACM, 101–110.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929–1958.
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. arXiv preprint arXiv:1904.06690 (2019).
  • Thakkar et al. (2019) Priyank Thakkar, Krunal Varma, Vijay Ukani, Sapan Mankad, and Sudeep Tanwar. 2019. Combining User-Based and Item-Based Collaborative Filtering Using Machine Learning. In Information and Communication Technology for Intelligent Systems. Springer, 173–180.
  • Tsai et al. (2019) Chun-Hua Tsai, Peter Brusilovsky, and Behnam Rahdari. 2019. Exploring User-Controlled Hybrid Recommendation in a Conference Context. In IUI Workshops.
  • Van den Oord et al. (2013) Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Advances in neural information processing systems. 2643–2651.
  • Wen et al. (2019) Hong Wen, Jing Zhang, Quan Lin, Keping Yang, and Pipei Huang. 2019. Multi-Level Deep Cascade Trees for Conversion Rate Prediction in Recommendation System. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 338–345.
  • Wilson et al. (2019) Nathan R Wilson, Emily A Hueske, Thomas C Copeman, Evan Favermann Eisert, Jana B EGGERS, Raymond J PLANTE, and Michael D Houle. 2019. Systems and methods for providing recommendations based on collaborative and/or content-based nodal interrelationships. US Patent App. 14/687,742.
  • Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617 (2017).
  • Zadrozny (2004) Bianca Zadrozny. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning. ACM, 114.
  • Zhang et al. (2019) Feng Zhang, Victor E Lee, Ruoming Jin, Saurabh Garg, Kim-Kwang Raymond Choo, Michele Maasberg, Lijun Dong, and Chi Cheng. 2019. Privacy-aware smart city: A case study in collaborative filtering recommender systems. J. Parallel and Distrib. Comput. 127 (2019), 145–159.
  • Zhang et al. (2016) Weinan Zhang, Tianxiong Zhou, Jun Wang, and Jian Xu. 2016. Bid-aware gradient descent for unbiased learning with censored data in display advertising. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 665–674.
  • Zhang et al. (2014) Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and Tie-Yan Liu. 2014. Sequential click prediction for sponsored search with recurrent neural networks. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
  • Zhou et al. (2019b) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019b. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
  • Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1059–1068.
  • Zhou et al. (2019a) Wang Zhou, Jianping Li, Yongluan Zhou, and Muhammad Hammad Memon. 2019a. Bayesian pairwise learning to rank via one-class collaborative filtering. Neurocomputing (2019).
  • Zhou and Feng ([n.d.]) ZH Zhou and J Feng. [n.d.]. Deep forest: Towards an alternative to deep neural networks. arXiv 2017. arXiv preprint arXiv:1702.08835 ([n. d.]).
  • Zhu et al. (2017) Han Zhu, Junqi Jin, Chang Tan, Fei Pan, Yifan Zeng, Han Li, and Kun Gai. 2017. Optimized cost per click in taobao display advertising. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2191–2200.
  • Zhu et al. (2018) Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning Tree-based Deep Model for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1079–1088.