Click-through rate (CTR) prediction is a core problem in the multi-billion dollar online advertising industry. To improve the accuracy of CTR prediction, more and more data are involved, making CTR prediction a large-scale learning problem with massive samples and high-dimensional features.
The traditional solution is to apply a linear logistic regression (LR) model, trained in a parallel manner (Brendan et al. 2013, Andrew & Gao 2007). An LR model with L1 regularization can generate a sparse solution, making it fast for online prediction. Unfortunately, CTR prediction is a highly nonlinear problem. In particular, user-click generation involves many complex factors, such as ad quality, context information, user interests, and complex interactions of these factors. To help the LR model capture this nonlinearity, feature engineering techniques are explored, which is both time- and labor-consuming.
Another direction is to capture the nonlinearity with well-designed models. Facebook (He et al. 2014) uses a hybrid model which combines decision trees with logistic regression: the decision trees play a nonlinear feature-transformation role, and their output is fed to the LR model. However, tree-based methods are not suitable for very sparse and high-dimensional data (Safavian & Landgrebe 1990). (Rendle 2010) introduces Factorization Machines (FM), which involve interactions among features using 2nd-order (or other fixed-order) functions. However, FM cannot fit general nonlinear patterns in data beyond the chosen order.
In this paper, we present a piece-wise linear model and its training algorithm for large-scale data. We name it the Large Scale Piecewise Linear Model (LS-PLM). LS-PLM follows the divide-and-conquer strategy: it first divides the feature space into several local regions, then fits a linear model in each region, producing output as a combination of weighted linear predictions. Note that these two steps are learned simultaneously in a supervised manner, aiming to minimize the prediction loss. LS-PLM is superior for web-scale data mining in three aspects:
Nonlinearity. With enough divided regions, LS-PLM can fit any complex nonlinear function.
Scalability. Similar to the LR model, LS-PLM is scalable both to massive samples and to high-dimensional features. We design a distributed system which can train the model on hundreds of machines in parallel. In our online product systems, dozens of LS-PLM models with tens of millions of parameters are trained and deployed every day.
Sparsity. As pointed out in (Brendan et al. 2013), model sparsity is a practical issue for online serving in industrial settings. We show that LS-PLM with L1 and L2,1 regularizers can achieve good sparsity.
The learning of LS-PLM with sparsity regularizers can be transformed into a non-convex, non-differentiable optimization problem, which is difficult to solve. We propose an efficient optimization method for such problems, based on directional derivatives and a quasi-Newton method. Due to its ability to capture nonlinear patterns and its scalability to massive data, LS-PLM has become the main CTR prediction model in the online display advertising system at Alibaba, serving hundreds of millions of users since early 2012. It is also applied in recommendation systems, search engines and other product systems.
The paper is structured as follows. In Section 2, we present the LS-PLM model in detail, including formulation, regularization and optimization issues. In Section 3 we introduce our parallel implementation structure. In Section 4, we evaluate the model carefully and demonstrate the advantage of LS-PLM compared with LR. Finally, in Section 5, we close with conclusions.
We focus on the large-scale CTR prediction application. It is a binary classification problem, with dataset {(x_t, y_t) | t = 1, …, n}, where the feature vector x_t ∈ R^d is usually high dimensional and sparse, and y_t ∈ {0, 1} is the click label.
To model the nonlinearity of massive-scale data, we employ a divide-and-conquer strategy, similar to (Jordan & Jacobs 1994). We divide the whole feature space into several local regions. For each region we employ an individual generalized linear classification model. In this way, we tackle the nonlinearity with a piece-wise linear model. We give our model as Eq. (1):

p(y = 1 | x) = g( Σ_{j=1}^m σ(u_j^T x) η(w_j^T x) )
Here Θ = {u_1, …, u_m, w_1, …, w_m} ∈ R^{d×2m} denotes the model parameters: u_j are the parameters of the dividing function σ(·), and w_j are the parameters of the fitting function η(·). Given an instance x, our prediction model p(y | x) consists of two parts: the first part σ(u_j^T x) divides the feature space into m (a hyper-parameter) different regions, and the second part η(w_j^T x) gives the prediction in each region. The function g(·) ensures that our model satisfies the definition of a probability function.
Special Case. Taking softmax (Kivinen & Warmuth 1998) as the dividing function, sigmoid (Hilbe 2009) as the fitting function, and g(z) = z, we get the specific formulation of Eq. (2):

p(y = 1 | x) = Σ_{i=1}^m [ exp(u_i^T x) / Σ_{j=1}^m exp(u_j^T x) ] · [ 1 / (1 + exp(−w_i^T x)) ]
In this case, our mixture model can be seen as an FOE model (Jordan & Jacobs 1994, Wang & Puterman 1998), as in Eq. (3):

p(y | x) = Σ_{i=1}^m p(z = i | x) p(y | z = i, x)
Eq. (2) is the most commonly used formulation in our real applications. In the remainder of the paper, unless otherwise stated, we take Eq. (2) as our prediction model. Figure 1 compares the model with LR on a demo dataset, which shows clearly that LS-PLM can capture the nonlinear pattern of the data.
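To make the formulation concrete, here is a minimal NumPy sketch of the Eq. (2) prediction, with a softmax gate over the m regions multiplied by per-region sigmoid fitters. The matrices U, W and the instance x are hypothetical toy values, not parameters from the paper:

```python
import numpy as np

def ls_plm_predict(x, U, W):
    """Sketch of Eq. (2): softmax over u_i^T x gates m local sigmoid fitters.
    U and W are d x m parameter matrices (toy values here)."""
    gate_logits = U.T @ x                       # u_i^T x for each region i
    gate = np.exp(gate_logits - gate_logits.max())
    gate /= gate.sum()                          # softmax dividing function
    fit = 1.0 / (1.0 + np.exp(-(W.T @ x)))      # sigmoid fitting function
    return float(gate @ fit)                    # mixture of local predictions

rng = np.random.default_rng(0)
d, m = 8, 4
U, W = rng.normal(size=(d, m)), rng.normal(size=(d, m))
x = rng.normal(size=d)
p = ls_plm_predict(x, U, W)
assert 0.0 < p < 1.0   # a valid probability
```

With m = 1 the gate is trivially 1 and the model reduces to plain logistic regression, matching the intuition that LS-PLM generalizes LR.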
The objective function of the LS-PLM model is formalized as Eq. (4):

arg min_Θ f(Θ) = loss(Θ) + λ ‖Θ‖_{2,1} + β ‖Θ‖_1

Here loss(Θ), defined in Eq. (5), is the negative log-likelihood loss function:

loss(Θ) = − Σ_{t=1}^n [ y_t log p(y_t = 1 | x_t, Θ) + (1 − y_t) log p(y_t = 0 | x_t, Θ) ]

and ‖Θ‖_{2,1} and ‖Θ‖_1 are two regularization terms providing different properties. First, the L2,1 regularization (‖Θ‖_{2,1} = Σ_{i=1}^d sqrt( Σ_{j=1}^{2m} θ_{ij}^2 )) is employed for feature selection. As in our model each feature dimension is associated with 2m parameters, the L2,1 regularization is expected to push all 2m parameters of one feature dimension to zero together, that is, to suppress the less important features. Second, the L1 regularization (‖Θ‖_1 = Σ_{ij} |θ_{ij}|) is employed for sparsity. Besides the feature-selection property, L1 regularization also forces the parameters of the remaining features to zero as much as possible, which improves both the interpretability and the generalization performance of the model.
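The two regularizers can be written down directly. The sketch below (our illustration, with a hypothetical 3-feature parameter matrix) shows how the L2,1 norm groups the 2m parameters of each feature row, while the L1 norm treats all entries individually:

```python
import numpy as np

def l1_norm(theta):
    # ||Θ||_1: sum of absolute values of all entries
    return np.abs(theta).sum()

def l21_norm(theta):
    # ||Θ||_{2,1}: L2 norm over the 2m parameters of each feature row,
    # summed over the d feature dimensions
    return np.sqrt((theta ** 2).sum(axis=1)).sum()

# Toy parameter matrix: d = 3 features, 2m = 4 parameters each (hypothetical)
theta = np.array([[0.0,  0.0, 0.0,  0.0],   # a feature fully pruned by L2,1
                  [1.0, -2.0, 2.0,  0.0],
                  [0.0,  3.0, 0.0, -4.0]])
print(l1_norm(theta))    # 12.0
print(l21_norm(theta))   # 8.0  (0 + sqrt(9) + sqrt(25))
```

Note how the all-zero first row contributes nothing to either norm: driving whole rows to zero is exactly what the L2,1 term rewards.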
However, both the L1 norm and the L2,1 norm are non-smooth functions. This causes the objective function of Eq. (4) to be non-convex and non-smooth, making it difficult to employ traditional gradient-descent optimization methods (Andrew & Gao 2007, Zhang 2004, Bertsekas 2003) or the EM method (Wang & Puterman 1998).
Note that, while (Wang & Puterman 1998) gives the same mixture-model formulation as Eq. (3), our model is more general in that it admits different kinds of prediction functions. Besides, we propose a different objective function for large-scale industrial data, taking feature sparsity into consideration explicitly. This is crucial for real-world applications, as prediction speed and memory usage are two key indicators for online model serving. Furthermore, we give a more efficient optimization method to solve the large-scale non-convex problem, which is described in the following section.
Before we present our optimization method, we establish some notation and definitions that will be used in the remainder of the paper. Let ∂f(Θ)/∂θ_{ij}+ denote the right partial derivative of f at Θ with respect to θ_{ij}:

∂f(Θ)/∂θ_{ij}+ = lim_{α→0+} [ f(Θ + α e_{ij}) − f(Θ) ] / α

where e_{ij} is the (i, j)-th standard basis vector. The directional derivative of f at Θ in direction d is denoted f′(Θ; d), which is defined as:

f′(Θ; d) = lim_{α→0+} [ f(Θ + α d) − f(Θ) ] / α
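As a quick numerical illustration of this definition (ours, not from the paper), the one-sided limit can be approximated by finite differences. At Θ = 0 the L1 norm has no gradient, yet its directional derivative along d exists and equals ‖d‖_1:

```python
import numpy as np

def directional_derivative(f, theta, d, eps=1e-6):
    # one-sided finite-difference estimate of f'(theta; d)
    return (f(theta + eps * d) - f(theta)) / eps

# The L1 norm is non-smooth at 0, yet its directional derivative exists:
f = lambda t: np.abs(t).sum()
theta = np.zeros(3)
d = np.array([1.0, -2.0, 0.5])
est = directional_derivative(f, theta, d)
assert abs(est - 3.5) < 1e-9   # exact value is ||d||_1 = 3.5 at theta = 0
```

This is exactly the property Lemma 1 below relies on: the limit exists even where the gradient does not.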
A vector d is regarded as a descent direction if f′(Θ; d) < 0. sgn(·) is the sign function, taking values in {−1, 0, 1}. The projection function

π_{ij}(x; y) = x_{ij} if sgn(x_{ij}) = sgn(y_{ij}), and 0 otherwise,

means projecting x onto the orthant defined by y.
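A sketch of this projection in NumPy (illustrative values; the function name is ours):

```python
import numpy as np

def project(x, y):
    # π(x; y): keep x_ij where its sign agrees with y_ij, zero elsewhere,
    # i.e. project x onto the orthant defined by y
    return np.where(np.sign(x) == np.sign(y), x, 0.0)

x = np.array([1.5, -2.0,  0.3, -0.7])
y = np.array([1.0,  1.0, -1.0, -1.0])
# signs agree only at positions 0 and 3
print(project(x, y))   # [ 1.5  0.   0.  -0.7]
```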
2.2.1 Choose descent direction
As discussed above, our objective function for the large-scale CTR prediction problem is both non-convex and non-smooth. Here we propose a general and efficient optimization method to solve such non-convex problems. Since the negative gradient of our objective function does not exist for all Θ, we instead take the direction which minimizes the directional derivative of f at Θ. The directional derivative f′(Θ; d) exists for any Θ and direction d, which is stated as Lemma 1.
Lemma 1. When an objective function is composed of a smooth loss function together with L1 and L2,1 norms, for example the objective function given in Eq. (4), the directional derivative f′(Θ; d) exists for any Θ and direction d.
We leave the proof to Appendix A. Since the directional derivative always exists, we choose as the descent direction the direction which minimizes the directional derivative when the negative gradient of f does not exist. The following Proposition 2 explicitly gives this direction.
Proposition 2. Given a smooth loss function loss(Θ) and the objective function f(Θ) = loss(Θ) + λ ‖Θ‖_{2,1} + β ‖Θ‖_1, the bounded direction d which minimizes the directional derivative f′(Θ; d) is given by Eq. (9):

d_{ij} = s_{ij} − λ θ_{ij} / ‖θ_i‖,                       if ‖θ_i‖ ≠ 0
d_{ij} = ( max{ ‖s_i‖ − λ, 0 } / ‖s_i‖ ) s_{ij},           if ‖θ_i‖ = 0

where

s_{ij} = −∇loss(Θ)_{ij} − β sgn(θ_{ij}),                          if θ_{ij} ≠ 0
s_{ij} = max{ |∇loss(Θ)_{ij}| − β, 0 } · sgn(−∇loss(Θ)_{ij}),     if θ_{ij} = 0

and ‖θ_i‖ denotes the L2 norm of the 2m parameters of the i-th feature dimension.
More details of the proof can be found in Appendix B. From the proof, we can see that the negative pseudo-gradient defined in Gao's work (Andrew & Gao 2007) is a special case of our descent direction. Our proposed method is a more general way of finding descent directions for such non-smooth, non-convex objective functions.
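A NumPy sketch of the Eq. (9) direction, under our reading of the Appendix B derivation: an OWLQN-style per-entry soft-threshold s, followed by a group soft-threshold by λ on all-zero feature rows. Treat it as an illustration rather than the authors' exact production formula:

```python
import numpy as np

def descent_direction(grad, theta, beta, lam):
    """grad: gradient of the smooth loss, shape (d, 2m); theta: parameters.
    Returns a direction minimizing the directional derivative of
    loss + lam * ||.||_{2,1} + beta * ||.||_1 (up to a positive scale)."""
    # per-entry part: pseudo-gradient where theta != 0, soft-threshold at 0
    s = np.where(theta != 0,
                 -grad - beta * np.sign(theta),
                 np.maximum(np.abs(grad) - beta, 0.0) * np.sign(-grad))
    row_norm = np.linalg.norm(theta, axis=1, keepdims=True)
    s_norm = np.linalg.norm(s, axis=1, keepdims=True)
    # group soft-threshold for feature rows that are entirely zero
    shrink = np.maximum(s_norm - lam, 0.0) / np.maximum(s_norm, 1e-12)
    return np.where(row_norm > 0,
                    s - lam * theta / np.maximum(row_norm, 1e-12),
                    shrink * s)
```

Two sanity checks: with β = λ = 0 and nonzero θ the direction is the plain negative gradient, and a zero parameter whose gradient magnitude is below β stays at zero.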
Based on the direction in Eq. (9), we update the model parameters along a descent direction calculated by a limited-memory quasi-Newton (LBFGS) method, which approximates the inverse Hessian matrix of Eq. (4) on a given orthant. Motivated by the OWLQN method (Andrew & Gao 2007), we also restrict the signs of the model parameters from changing within each iteration. Given the chosen direction d and the old Θ_k, we constrain the orthant of the current iteration as Eq. (10):

ξ_{ij} = sgn(θ_{ij}),   if θ_{ij} ≠ 0
ξ_{ij} = sgn(d_{ij}),   if θ_{ij} = 0

When θ_{ij} ≠ 0, the new θ_{ij} will not change sign in the current iteration. When θ_{ij} = 0, we choose for the new θ_{ij} the orthant decided by the selected direction d.
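A one-line sketch of this orthant choice (illustrative, with toy values):

```python
import numpy as np

def choose_orthant(theta, d):
    # ξ_ij = sgn(θ_ij) if θ_ij != 0 else sgn(d_ij): keep the sign of a
    # nonzero parameter; for zero parameters, take the orthant the chosen
    # descent direction d points into
    return np.where(theta != 0, np.sign(theta), np.sign(d))

theta = np.array([ 0.5, -1.0, 0.0,  0.0])
d     = np.array([-2.0,  3.0, 1.0, -4.0])
print(choose_orthant(theta, d))   # [ 1. -1.  1. -1.]
```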
2.2.2 Update direction constraint and line search
Given the descent direction d_k, we approximate the inverse Hessian matrix H_k using the LBFGS method with a sequence of recent update pairs {s_k = Θ_{k+1} − Θ_k, y_k = d_k − d_{k+1}}. The tentative update direction is then H_k d_k. Here we apply two adjustments to the update direction. First, we constrain the update direction to the orthant with respect to d_k. Second, as our objective function is non-convex, we cannot guarantee that H_k is positive-definite. We use the condition y_k^T s_k > 0 to ensure that H_k is a positive-definite matrix; if y_k^T s_k ≤ 0, we switch to d_k as the update direction. The final update direction p_k is defined as follows:

p_k = π(H_k d_k; d_k),   if y_k^T s_k > 0
p_k = d_k,               otherwise
Given the update direction, we use backtracking line search to find a proper step size α. As in OWLQN, we project each new Θ onto the orthant decided by Eq. (10).
A pseudo-code description of optimization is given in Algorithm 1. In fact, only a few steps of the standard LBFGS algorithm need to change. These modifications are:
The direction which minimizes the directional derivative of the non-convex objective is used in place of the negative gradient.
The update direction is constrained to the orthant defined by the chosen direction d_k; we switch to d_k itself when H_k is not guaranteed positive-definite.
During the line search, each search point is projected onto the orthant of the previous point.
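The modified steps can be sketched together as a projected backtracking line search. This is our simplified illustration (the sufficient-decrease test is reduced to a plain decrease check), not the production implementation:

```python
import numpy as np

def project(x, y):
    # keep entries of x whose sign agrees with y, zero the rest
    return np.where(np.sign(x) == np.sign(y), x, 0.0)

def projected_line_search(f, theta, p, xi, alpha=1.0, shrink=0.5, max_tries=20):
    """Backtracking line search where every trial point is projected onto
    the orthant xi chosen by Eq. (10), in the spirit of OWLQN."""
    f0 = f(theta)
    for _ in range(max_tries):
        candidate = project(theta + alpha * p, xi)
        if f(candidate) < f0:        # simple decrease check
            return candidate
        alpha *= shrink              # backtrack
    return theta                     # no improving step found

# toy smooth objective, used only to exercise the search
f = lambda t: ((t - 1.0) ** 2).sum()
theta0 = np.array([0.5, 0.5])
p = np.array([1.0, 1.0])             # a descent direction for f at theta0
xi = np.sign(theta0)
res = projected_line_search(f, theta0, p, xi)
assert f(res) < f(theta0)
```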
In this section, we first present our parallel implementation of the LS-PLM model for large-scale data, then introduce an important trick which greatly accelerates the training procedure.
3.1 Parallel implementation
To apply Algorithm 1 in large-scale settings, we implement it with a distributed learning framework, as illustrated in Figure 2. It is a variant of the parameter server. In our implementation, each computation node runs both a server node and a worker node, aiming to:

Maximize the utility of CPU computation power. In the traditional parameter-server setting, server nodes work as a distributed key-value store with push and pull interfaces, which involve little computation. Co-locating them with worker nodes makes fuller use of the computation power.

Maximize the utility of memory. Machines today usually have large memory, for example 128 GB. Running on the same computation node, the server node and the worker node can share and utilize this memory better.
In brief, there are two roles in the framework. The first role is the worker node: each worker stores a part of the training data and a local model, which saves only the model parameters used by the local training data. The second role is the server node: each server stores a mutually-exclusive part of the global model. In each iteration, all worker nodes first calculate the loss and the descent direction with their local model and local data in parallel (data parallelism). Then the server nodes aggregate the loss and the direction, together with the corresponding entries needed to calculate the revised gradient (model parallelism). After finishing the calculation of the steepest descent direction in Step 1, workers synchronize the corresponding entries and then perform Steps 2–6 locally.
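The data-parallel part of the iteration can be illustrated with a toy, single-process simulation. Here the "workers" are shards of a sample matrix, the "server" aggregation is a plain sum, and a logistic loss stands in for the full LS-PLM objective:

```python
import numpy as np

def worker_step(w, X_shard, y_shard):
    # logistic loss and gradient on the worker's local shard
    p = 1.0 / (1.0 + np.exp(-(X_shard @ w)))
    loss = -np.sum(y_shard * np.log(p) + (1 - y_shard) * np.log(1 - p))
    grad = X_shard.T @ (p - y_shard)
    return loss, grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)
w = rng.normal(size=5)

shards = np.array_split(np.arange(100), 4)            # 4 "workers"
parts = [worker_step(w, X[idx], y[idx]) for idx in shards]
loss = sum(l for l, _ in parts)                       # "server" aggregation
grad = sum(g for _, g in parts)

# the aggregated result equals the single-machine computation
l_full, g_full = worker_step(w, X, y)
assert np.isclose(loss, l_full) and np.allclose(grad, g_full)
```

Because both the loss and the gradient decompose as sums over samples, sharding the data changes nothing in the result, which is what makes the data-parallel design valid.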
3.2 Common Feature Trick
In addition to the general-purpose parallel implementation, we also optimized the implementation for the online advertising context. Training samples in CTR prediction tasks usually share a common feature pattern. Take display advertising as an example: as illustrated in Figure 3, during each page view a user sees several different ads at the same time. For example, user U1 in Figure 3 sees three ads in one visit session, thus generating three samples. In this situation, the features of user U1 can be shared across these three samples. These features include user profiles (sex, age, etc.) and user behavior histories during visits to Alibaba's e-commerce websites, for example his/her shopping item IDs, preferred brands or favorite shop IDs.
Recall the model defined in Eq. (2): most of the computation cost is concentrated in the inner products u_i^T x and w_i^T x. By employing the common feature trick, we can split each instance as x = (x_c, x_{nc}), with a common part x_c and a non-common part x_{nc}, and rewrite the inner products as follows:

u_i^T x = u_i^T x_c + u_i^T x_{nc},    w_i^T x = w_i^T x_c + w_i^T x_{nc}

Hence, for the common feature part we need to calculate u_i^T x_c and w_i^T x_c only once, then index the results, which are reused by the following samples.
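The split above can be illustrated with a small NumPy sketch; the dimensions and the session of three ads are toy assumptions:

```python
import numpy as np

# Toy sketch of the common-feature trick: for samples from one page view,
# the inner product over the shared (common) features is computed once and
# reused; only the non-common part is recomputed per sample.
rng = np.random.default_rng(0)
d_c, d_nc, m = 6, 4, 3
U_c, U_nc = rng.normal(size=(d_c, m)), rng.normal(size=(d_nc, m))

x_common = rng.normal(size=d_c)          # user features shared by samples
cached = U_c.T @ x_common                # computed once per session

for _ in range(3):                       # three ads -> three samples
    x_nc = rng.normal(size=d_nc)         # ad-specific features
    logits = cached + U_nc.T @ x_nc      # u_i^T x = u_i^T x_c + u_i^T x_nc
    full = np.concatenate([U_c, U_nc]).T @ np.concatenate([x_common, x_nc])
    assert np.allclose(logits, full)     # same result, less computation
```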
Specifically, we optimize the parallel implementation with common features trick in the following three aspects:
Group training samples with common features and make sure these samples are stored in the same worker.
Save memory by storing common features shared by multiple samples only once.
Speed up iteration by updating loss and gradient w.r.t. common features only once.
Due to the common feature pattern of our production data, employing the common feature trick greatly improves the performance of the training procedure, as will be shown in Section 4.3.
In this section, we evaluate the performance of LS-PLM. Our datasets are generated from Alibaba's mobile display advertising product system. As shown in Table 1, we collect seven datasets over continuous sequential periods, aiming to evaluate the consistent performance of the proposed model, which is important for online product serving. In each dataset, training/validation/testing samples are disjointly collected from different days, with proportions of about 7:1:1. The AUC (Fawcett 2006) metric is used to evaluate model performance.
4.1 Effectiveness of division number
LS-PLM is a piece-wise linear model, with the division number m controlling the model capacity. We evaluate the effect of the division number on the model's performance. The experiment is executed on dataset 1 and the results are shown in Figure 4.
Generally speaking, a larger m means more parameters and thus larger model capacity. But the training cost, in both time and memory, will also increase. Hence, in real applications we have to balance model performance against training cost.
Figure 4 shows the training and testing AUC with different division numbers m. Among the values tried, the testing AUC for m = 12 is markedly better than that for smaller m, while the improvement for larger m is relatively gentle. Thus, in all the following experiments, the parameter m is set to 12 for the LS-PLM model.
4.2 Effectiveness of regularization
As stated in Section 2, in order to keep our model simpler and better generalized, we prefer to constrain the model parameters to be sparse via both the L1 and L2,1 norms. Here we evaluate the strength of both regularization terms.
(Table 2: #features, #non-zero parameters, and testing AUC for each regularization setting.)
Table 2 gives the results. As expected, both the L1 and L2,1 norms push the model to be sparse. The model trained with the L2,1 norm has only 9.4% of non-zero parameters left, with 18.7% of features retained, while in the L1 norm case only 1.9% of non-zero parameters remain. Combining the two norms, we get the sparsest result. Meanwhile, models trained with the different norms achieve different AUC performance; combining the two norms also yields the best AUC performance.
In this experiment, the hyper-parameter m is set to 12. The parameters λ and β are selected by grid search: a common set of candidate values is tried for both norms in all the cases, and the combination of λ and β with the best performance is chosen.
4.3 Effectiveness of common feature trick
We now verify the effectiveness of the common feature trick. Specifically, we set up the experiments with multiple workers, each using several CPU cores, with up to 110 GB of memory in total. As shown in Table 3, compressing instances with the common feature trick does not affect the actual dimension of the feature space. However, in practice it significantly reduces memory usage (to roughly one third) and speeds up the calculation (around 12 times faster) compared to training without the common feature trick.
|Cost|Without CF.|With CF.|Cost saving|
|---|---|---|---|
|Memory cost/node|89.2 GB|31 GB|65.2%|
4.4 Comparison with LR
We now compare LS-PLM with LR, the widely used CTR prediction model in production settings. Both models are trained using our distributed implementation, running on hundreds of machines for speed-up. The λ and β parameters for LS-PLM and the L1 parameter for LR are chosen by grid search over a common set of candidate values, and the best parameters found for each model are used in the comparison.
As shown in Figure 5, LS-PLM clearly outperforms LR. The average improvement of AUC over LR is 1.44%, which has a significant impact on overall online ad system performance. Moreover, the improvement is stable across datasets, which ensures that LS-PLM can be safely deployed to the daily online production system.
In this paper, a piece-wise linear model, LS-PLM, is proposed for the CTR prediction problem. It can capture nonlinear patterns from sparse data and saves us from heavy feature-engineering work, which is crucial for real industrial applications. Additionally, powered by our distributed and optimized implementation, our algorithm can handle problems with billions of samples and tens of millions of parameters, a typical industrial data volume. Regularization terms of L1 and L2,1 are utilized to keep the model sparse. Since 2012, LS-PLM has been the main CTR prediction model in Alibaba's online display advertising system, serving hundreds of millions of users every day.
We would like to thank Xingya Dai and Yanghui Yan for their help for this work.
Appendix A Proof of Lemma 1
The definition of f′(Θ; d) is given as follows:

f′(Θ; d) = lim_{α→0+} [ f(Θ + α d) − f(Θ) ] / α
         = lim_{α→0+} [ loss(Θ + α d) − loss(Θ) ] / α
         + λ lim_{α→0+} [ ‖Θ + α d‖_{2,1} − ‖Θ‖_{2,1} ] / α
         + β lim_{α→0+} [ ‖Θ + α d‖_1 − ‖Θ‖_1 ] / α

As the gradient of the loss function exists for any Θ, the directional derivative of the first part is ∇loss(Θ)^T d.

For the second part, if ‖θ_i‖ ≠ 0, the L2,1 norm's partial derivatives exist in that row, so the directional derivative is Σ_j θ_{ij} d_{ij} / ‖θ_i‖.

However, when ‖θ_i‖ = 0, it means θ_{ij} = 0 for all j. Then the directional derivative of that row can be written as lim_{α→0+} ‖α d_i‖ / α = ‖d_i‖.

Similarly to the second part, the directional derivative of the third part is sgn(θ_{ij}) d_{ij} when θ_{ij} ≠ 0, and |d_{ij}| when θ_{ij} = 0. Summing the three parts shows that f′(Θ; d) exists for any Θ and d, which completes the proof.
Appendix B Proof of Proposition 2
Finding the expected direction turns into an optimization problem, formulated as follows:

min_d f′(Θ; d)   s.t. ‖d‖² ≤ C

Here the direction is bounded by a constant scalar C. To solve this problem, we employ the Lagrange function to combine the objective function and the inequality constraint:

L(d, μ) = f′(Θ; d) + μ ( ‖d‖² − C )

Here μ ≥ 0 is the Lagrange multiplier. Setting the partial derivative of L with respect to d_{ij} to zero gives three cases.

a. When θ_{ij} ≠ 0, it implies that

2μ d_{ij} = −∇loss(Θ)_{ij} − λ θ_{ij} / ‖θ_i‖ − β sgn(θ_{ij})

b. When θ_{ij} = 0 and ‖θ_i‖ ≠ 0, it is easy to obtain

2μ d_{ij} = max{ |∇loss(Θ)_{ij}| − β, 0 } · sgn(−∇loss(Θ)_{ij})

c. We give more details for the case θ_{ij} = 0 and ‖θ_i‖ = 0. Writing G_{ij} = ∇loss(Θ)_{ij} for simplicity, for L(d, μ) we have

G_{ij} + λ d_{ij} / ‖d_i‖ + β sgn(d_{ij}) + 2μ d_{ij} = 0

Then we get

( λ / ‖d_i‖ + 2μ ) d_{ij} = −G_{ij} − β sgn(d_{ij}),

which implies that sgn(d_{ij}) = sgn(−G_{ij}). When |G_{ij}| ≤ β, it implies d_{ij} = 0; inversely, we have d_{ij} ≠ 0 when |G_{ij}| > β. So we define s_{ij} = max{ |G_{ij}| − β, 0 } · sgn(−G_{ij}) and obtain

( λ / ‖d_i‖ + 2μ ) d_{ij} = s_{ij}

Taking the L2 norm of both sides over the i-th row gives λ + 2μ ‖d_i‖ = ‖s_i‖, i.e. ‖d_i‖ = ( ‖s_i‖ − λ ) / (2μ) when ‖s_i‖ > λ, and d_i = 0 otherwise. Thus

2μ d_{ij} = ( max{ ‖s_i‖ − λ, 0 } / ‖s_i‖ ) s_{ij}

The Lagrange multiplier 2μ is a positive scalar with the same influence on all d_{ij}. We can see that the optimal direction bounded by C points in the same direction as defined in Eq. (9), up to the constant scalar 2μ. This finishes the proof.
-  Andrew G. and Gao J. (2007) Scalable Training of L1-Regularized Log-Linear Models. Proceedings of the 24th International Conference on Machine Learning.
-  Bertsekas, D. (2003) Nonlinear Programming. Springer US, 51–88.
-  Brendan H., Holt G., Sculley D., Young M., Ebner D., Grady J., Nie L., Phillips T., Davydov E., Golovin D., Chikkerur S., Liu D., Wattenberg M., Hrafnkelsson A., Boulos T., Kubica J. (2013) Ad Click Prediction: a View from the Trenches. Proceedings of the 19th KDD.
-  Fawcett T. (2006) An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.
-  Friedman J. (1999) Greedy Function Approximation: A Gradient Boosting Machine. Technical Report, Dept. of Statistics, Stanford University.
-  Hilbe M. (2009) Logistic regression models. CRC Press.
-  He X., Pan J., Jin O., Xu T., Liu B., Xu T., Shi Y., Atallah A., Herbrich R., Bowers S., Candela J. (2014) Practical Lessons from Predicting Clicks on Ads at Facebook. Proceedings of the 20th KDD.
-  Jordan I., Jacobs A (1994) Hierarchical mixtures of experts and the EM algorithm. Neural computation, 6(2): 181-214.
-  Kivinen J., Warmuth M K. (1998) Relative Loss Bounds for Multidimensional Regression Problems. Machine Learning, 45(3):301-329.
-  Rendle S. (2010) Factorization Machines. Proceedings of the 10th IEEE International Conference on Data Mining.
-  Roth S., Black M. J. (2009) Fields of Experts. International Journal of Computer Vision, 82(2): 205–229.
-  Safavian S. R., Landgrebe D. (1990) A survey of decision tree classifier methodology.
-  Wang P.-M and Puterman M. (1998) Mixed Logistic Regression Models. Journal of Agricultural, Biological, and Environmental Statistics, 3(2), 175–200.
-  Zhang T. (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 116.
-  Gai K. http://club.alibabatech.org/resource_detail.htm?topicId=106