We present a unified framework for Batch Online Learning (OL) for Click Prediction in Search Advertisement. Machine Learning models, once deployed, show non-trivial accuracy and calibration degradation over time due to model staleness. It is therefore necessary to update models regularly, and to do so automatically. This paper presents two paradigms of Batch Online Learning: one that incrementally updates the model parameters via an early stopping mechanism, and another that does so through proximal regularization. We argue that both schemes naturally trade off between old and new data. We then show, theoretically and empirically, that these two seemingly different schemes are closely related. Through extensive experiments, we demonstrate the utility of our OL framework, how the two OL schemes relate to each other, and how they trade off between new and historical data. We then compare batch OL to full model retrains and show that online learning is more robust to data issues. We also demonstrate the long-term impact of Online Learning, the role of the initial models in OL, and the impact of delays in the update, and we conclude with some implementation details and challenges in deploying a real-world online learning system in production. While this paper mostly focuses on click prediction for search advertisement, we hope that the lessons learned here can be carried over to other problem domains.
Click prediction is a central component of any online advertisement system. Predicting the probability of a click and the click-through rate is central to sponsored search advertising and display advertising, and several downstream systems, including our auction mechanism, rely on being able to predict the probability of click accurately and reliably. Most click prediction systems are modeled via the standard machine learning classification framework. We design features relevant to the user, ad and query, with the goal of predicting whether a given user will click a given ad for a given search query. A training data period is selected, and the click prediction model is trained and validated. Machine learning scientists then analyze the model performance offline and, if the results look good, deploy the models in production.
The main caveat of this approach is that the models get stale quickly and their performance degrades over time. The distribution of users, queries and ads changes over time, so models are evaluated on a data distribution different from the one they were trained on. For this reason, we need to retrain the model periodically. Figure 1 demonstrates that retraining the model after just one month yields a significant gain in model metrics.
This paper addresses the issue of model staleness via online learning. We investigate a unified model adaptation framework, where the models continuously and gradually adapt over time to learn the distribution changes. A natural alternative is to run an automated experimental pipeline that continuously retrains the model from scratch. We show not only that gradual learning outperforms this alternative, but also that continuous adaptation lets us control how much the model adapts and ensures that the model does not overfit to a single distribution. For example, we know that the distribution of queries, ads and users on Labor Day is quite different from that on other weekdays. Training a model from scratch on data containing Labor Day can be problematic, since the model will learn a potentially different input distribution that is not seen on other days. This problem is intensified when there are data or system corruption issues. Our comprehensive evaluation shows that our batch online learning framework, via gradual and continuous model adaptation, not only improves performance but is also more reliable and safe than automatic model retraining.
The problem of online learning has been heavily studied in the theoretical machine learning community. Most of this work has revolved around regret minimization, where the regret of an online learning algorithm is defined as the gap in performance between the online algorithm and the solution obtained from an offline algorithm that has access to all the data in hindsight [Zinkevich2003, Shalev-Shwartz and others2012, Shalev-Shwartz and Singer2007]. Since several machine learning problems naturally involve solving convex optimization problems, online machine learning can naturally be posed as online convex optimization. This is the case with all linear models, such as logistic regression and SVMs. Zinkevich [Zinkevich2003] was one of the first to study online gradient descent (OGD) for online learning and to show that OGD enjoys low regret. Following this seminal work, several papers have extended this paradigm (see [Shalev-Shwartz and others2012, Shalev-Shwartz and Singer2007] for a survey).

While most literature on online learning has focused on proving theoretical bounds, a few of these methods have proved successful in real-world problems. [McMahan et al.2013, McMahan2011] proposed a Follow the Regularized Leader (FTRL) scheme for online learning on a logistic regression model. Their problem setting consists of a sparse feature set (with more than a million features) and an L1-regularized logistic regression model. The authors argue that their framework naturally handles both L1 and L2 regularization and, in the case of L2 regularization, boils down to Online Gradient Descent. The authors provide extensive empirical validation of their framework and some hints on the deployment of such a large-scale system for serving ads at Google. Following this, [He et al.2014] from Facebook provide a framework for Online Learning with a combined Decision Tree and Logistic Regression model. Similar to [McMahan et al.2013], the authors go into considerable detail on the deployment of a real-world online learning system in production. [Ciaramita, Murdock, and Plachouras2008] propose an online learning click prediction system based on multi-layer neural networks. Another large-scale Online Learning system for Click Prediction was proposed by [Graepel et al.2010], who build an Online Learning system on Bayesian Probit Regression models and provide compelling details on deploying such a system in practice. Similarly, [Liu et al.2017] describe the click prediction system at Tencent, which uses a Bayesian Online Learning scheme similar to [Graepel et al.2010]. [Cheng and Cantú-Paz2010] investigate the role of personalization in click prediction systems. [McMahan and Streeter2014] investigate a distributed online learning framework for large-scale click prediction problems.

Beyond Click Prediction in Search Advertisement, Online Learning schemes have been used in other scenarios as well. [Chapelle and Li2011, Chapelle, Manavoglu, and Rosales2015] propose a Thompson Sampling based contextual bandit scheme for Display advertisement. They propose a Proximal Update Algorithm similar to the one discussed in this paper. Similarly, [Kirkpatrick et al.] look into the problem of learning from a new domain while simultaneously not forgetting the previous domain, and [Ma et al.2009] investigate online learning for identifying suspicious URLs.

The following are our main contributions.
This paper studies two different views of Online Learning: One which performs iterative training (like Online Gradient Descent, FTRL etc.) with early stopping, and another, which minimizes at every round, a proximal regularized objective function (which ensures the current solution does not move too much from the previous solution). Both approaches provide tradeoff between historical and new data, via the learning rate and number of iterations in the early stopping, and the regularization parameter in the proximal scheme.
We empirically and theoretically show that both these paradigms are closely related to each other. In particular, we show that with a right choice of these parameters (learning rate, number of iterations and proximal regularization), the two OL paradigms achieve very similar solutions.
We next demonstrate the benefit of incremental learning schemes by showing how they can substantially improve upon simple retraining of models. We argue that online learning not only ensures automatic model updates, but can also improve model metrics because it retains a longer history. Moreover, we show that it is much more robust to data corruption and other distributional changes than simple model retraining.
We then look into several important challenges of production systems and study the effect of online learning with data delays, how different initializations affect the performance of OL, and we conclude by discussing engineering issues in deploying such a model in production systems serving Search Ads to Hundreds of Millions of Users.
In this section, we go over our modeling framework, features, evaluation metrics and our system overview. Given a user, ad and query, our task is to accurately predict the probability that the user will click on this ad. It is not only important to rank the ads correctly; the resulting probability must also be calibrated (in that the predicted probability must match the true click-through rate). For this reason, we compare both the Area Under the Curve (AUC), which measures the ranking of the ads, and the Relative Information Gain (RIG), which measures the calibration. The RIG of a model $M$ can be defined as
$$ \mathrm{RIG}(M) = \frac{LL_{\mathrm{CTR}} - LL(M)}{LL_{\mathrm{CTR}}}, \tag{1} $$
where $LL(M)$ is the LogLoss of the model and $LL_{\mathrm{CTR}}$ is the LogLoss of the empirical CTR of the data. Since $LL_{\mathrm{CTR}}$ is a constant for a given dataset, RIG is a decreasing linear function of the LogLoss of the model, so a higher RIG corresponds to a lower LogLoss.
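As a small illustration of the RIG metric, the sketch below computes it in plain Python. It assumes the common form RIG = (LL_CTR - LL(M)) / LL_CTR, where LL_CTR is the log loss of predicting the empirical CTR for every instance; the function names and the toy labels and predictions are hypothetical and are not taken from the paper's code.

```python
import math

def log_loss(labels, probs, eps=1e-12):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def relative_information_gain(labels, probs):
    """RIG = (LL_ctr - LL_model) / LL_ctr (assumed form of the metric)."""
    ctr = sum(labels) / len(labels)
    baseline = log_loss(labels, [ctr] * len(labels))  # log loss of the empirical CTR
    return (baseline - log_loss(labels, probs)) / baseline

# Toy data (hypothetical): 2 clicks out of 8 impressions
labels = [1, 0, 0, 0, 1, 0, 0, 0]
constant = [0.25] * len(labels)  # always predicts the empirical CTR
sharper = [0.7, 0.1, 0.1, 0.2, 0.6, 0.1, 0.2, 0.1]
rig_constant = relative_information_gain(labels, constant)
rig_sharper = relative_information_gain(labels, sharper)
```

Under this definition, a model that only predicts the empirical CTR has a RIG of zero, and a model with lower log loss has a positive RIG.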
Next, we go over the features for our problem. Our features include Ad, Query and User features. Ad features include Ad Title, Ad id, Decoration information etc. Query features include query category, query text etc. User features include IP address, Browser, Location, age/gender information etc. We encode our features as Counting features [Ling et al.2017], representing the click-through rate for that feature. We resort to two of the most popular choices of supervised learning techniques, namely gradient boosted decision trees (GBDT) and Neural Networks. Both of these techniques outperform other non-linear and generalized linear models on our data. To incrementally train models over time, however, it is more natural to work with generalized linear models. We achieve the best of both worlds by training a generalized linear model over features extracted from non-linear models. For example, we can extract tree and leaf value features from a GBDT (shown in Figure 2) and train a Logistic Regression model on these features. The same can be done with a Neural Network as the feature extractor, for instance by extracting features from the last layer. In this paper, we focus on Online Learning over a Logistic Regression model using features extracted from a fixed GBDT model.
In Click Prediction systems, we get near-instant feedback from users based on whether they click on an ad or not. Assume we have a base LR model $w_0$, trained on a given dataset $D_0$ (say, for example, one week of data). Once a user searches for a query on a search engine, the system sees a feature vector $x$. Using the model currently in production, the system then predicts the probability of click (pClick). The auction then ranks ads based on the pClick, bid and other factors, finally creating a set of ads which are shown to the user. We then receive the feedback on whether the user clicks or not. This data is then collected in batches. Denote by $D_1, D_2, \dots, D_T$ the different batches of data (each batch is, for example, a day of data or four hours). The predictions in batch $D_t$ are made using the model from the previous batch, i.e. $w_{t-1}$.

The most important piece of this story is how we update the model. A critical challenge here is to learn from the incremental data coming in and yet not forget what was learned in the past. In the sections below, we describe two schemes for incremental updates of the models.
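The batch protocol described above (score each batch with the model from the previous round, then update on that batch) can be sketched as follows. This is a minimal illustration, not the production system: the function names are hypothetical, and the toy "model" is a single CTR estimate used only to make the loop runnable.

```python
def run_batch_online_learning(base_model, batches, update, evaluate):
    """Score each batch with the model from the previous round, then update on it."""
    model = base_model
    metrics = []
    for batch in batches:
        metrics.append(evaluate(model, batch))  # predictions use the previous model
        model = update(model, batch)            # incremental update on the new batch
    return model, metrics

# Toy illustration (hypothetical): the "model" is a single CTR estimate
def toy_update(ctr, batch, rate=0.5):
    """Move the estimate part of the way towards the CTR of the new batch."""
    return ctr + rate * (sum(batch) / len(batch) - ctr)

def toy_evaluate(ctr, batch):
    """Absolute error between the predicted CTR and the CTR of the batch."""
    return abs(ctr - sum(batch) / len(batch))

batches = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
final_model, errors = run_batch_online_learning(0.25, batches, toy_update, toy_evaluate)
```

Passing the update rule as a function keeps the loop independent of how the model is adapted, so the same skeleton can host either of the two update schemes described next.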
The first scheme is what we call the early stopping scheme, abbreviated as ES. We initialize the model with the base model $w_0$. At round $t$, we initialize the incremental learning algorithm Alg with the model from the previous round, $w_{t-1}$, and limit the number of passes over the data to $p$. We denote this by
$$ w_t = \mathrm{Alg}(w_{t-1}, D_t, p). \tag{2} $$
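A minimal sketch of one ES round, assuming plain gradient descent with a fixed learning rate as Alg and a logistic regression model with average log loss. The helper names and the toy batch are hypothetical; the only point is that the update starts from the previous weights and stops after a fixed number of iterations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logloss_gradient(w, batch):
    """Gradient of the average log loss of a logistic regression model."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(batch)
    return grad

def early_stopping_update(w_prev, batch, learning_rate, num_iters):
    """ES round: start from the previous weights and take only num_iters GD steps."""
    w = list(w_prev)
    for _ in range(num_iters):
        g = logloss_gradient(w, batch)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w

# Toy batch (hypothetical): each example is (feature vector, click label)
batch = [([1.0, 0.5], 1), ([1.0, -1.0], 0), ([1.0, 2.0], 1), ([1.0, -0.5], 0)]
w_prev = [0.0, 0.0]
w_few = early_stopping_update(w_prev, batch, learning_rate=0.1, num_iters=5)
w_many = early_stopping_update(w_prev, batch, learning_rate=0.1, num_iters=500)
```

With few iterations the weights stay close to the previous model; with many iterations they move much further towards fitting the new batch, which is the trade-off the number of iterations controls.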
LBFGS/TRON: One example of Alg is LBFGS [Liu and Nocedal1989]. The limited-memory BFGS (LBFGS) algorithm belongs to the family of quasi-Newton methods and approximates the BFGS algorithm using limited memory. The BFGS algorithm itself is an iterative technique in which an approximation of the Hessian matrix is updated at every iteration using past gradient evaluations. BFGS requires storing a dense approximation of the inverse Hessian matrix, while LBFGS stores only the most recent updates of the positions and gradients and uses them for the updates. In practice, the number of stored updates is kept small. Another example of a similar algorithm is Trust Region Newton (TRON) [Lin, Weng, and Keerthi2007].
OGD/SGD/GD: Another choice of Alg is Online Gradient Descent [Zinkevich2003] or Stochastic Gradient Descent [Bottou, Curtis, and Nocedal2018]. This is akin to a Gradient Descent scheme, except that the (stochastic) gradient is computed based on a single example or a minibatch, rather than the entire batch. There are three flavors of this: a fixed learning rate, a decaying learning rate, or an adaptive learning rate (as in AdaGrad [Duchi, Hazan, and Singer2011]). In this paper, we focus on the simplest version, fixed learning rates for SGD or GD. The main hyper-parameters under consideration for Early Stopping algorithms are the number of iterations $p$ and the learning rate $\eta$, which together determine the tradeoff between the new data and history. Too large a $p$ implies that we overfit to the distribution in the current batch, thereby generalizing poorly to the next batch. Too small a $p$ implies that we might learn the data changes too slowly. We demonstrate the interplay between these quantities in detail in our experiments. Similarly, a large learning rate can cause the incremental learning to diverge, and a small learning rate could mean slow learning. One can also have a per-coordinate learning rate [McMahan et al.2013, He et al.2014]. One way to define a per-coordinate rate is to set $\eta_{t,i} = \alpha / \sqrt{n_{t,i}}$, where $\alpha$ is a base learning rate and $n_{t,i}$ is the total number of times feature $i$ is seen till round $t$ [He et al.2014].

FTRL: Follow The Regularized Leader (FTRL) [McMahan et al.2013] can be seen as another instance of this paradigm. In the case of L2 regularization, FTRL updates are equivalent to those of OGD.
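The per-coordinate learning rate can be sketched as below, assuming the count-based rule in which the rate of a feature is a base rate divided by the square root of the number of times that feature has been seen. The sparse dictionary representation, the feature names and the base rate of 0.1 are hypothetical choices for illustration.

```python
import math

def coordinate_rate(alpha, count):
    """Per-coordinate learning rate alpha / sqrt(n), n = times the feature was seen."""
    return alpha / math.sqrt(count)

def per_coordinate_sgd_step(weights, counts, features, label, alpha=0.1):
    """One SGD step of logistic regression on a sparse example.

    weights and counts are dicts keyed by feature name; features maps the
    active feature names of this example to their values.
    """
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    error = 1.0 / (1.0 + math.exp(-z)) - label
    for name, value in features.items():
        counts[name] = counts.get(name, 0) + 1
        rate = coordinate_rate(alpha, counts[name])
        weights[name] = weights.get(name, 0.0) - rate * error * value

# Toy stream (hypothetical feature names): the bias is seen often, the others rarely
weights, counts = {}, {}
stream = [({"bias": 1.0, "query=shoes": 1.0}, 1),
          ({"bias": 1.0}, 0),
          ({"bias": 1.0}, 0),
          ({"bias": 1.0, "ad=123": 1.0}, 0)]
for features, label in stream:
    per_coordinate_sgd_step(weights, counts, features, label)
```

Frequently seen features end up with smaller steps, so their weights change more conservatively than those of rarely seen features.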
Given a batch $D_t$, the proximal incremental learning scheme, abbreviated as Prox, minimizes the following objective function:
$$ w_t = \arg\min_{w} \; L(w, D_t) + \frac{\lambda}{2} \| w - w_{t-1} \|_2^2, \tag{3} $$
where $L(w, D_t)$ is the loss on the current batch and $w_{t-1}$ is the model from the previous round. This formulation ensures that we minimize the objective function on the current batch while not moving too far from the previous solution. Here, $\lambda$ controls the tradeoff between the new data and the history. If $\lambda$ is too small, we overfit completely to the current data (similar to a large number of iterations in the early stopping scheme). Similarly, if $\lambda$ is too large, we will not move much from the initial model.
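A minimal sketch of one Prox round for logistic regression, assuming the objective "average log loss on the batch plus (reg / 2) times the squared distance to the previous weights". The objective is minimized here with plain gradient descent run to convergence purely for simplicity; the helper names, the step-size rule and the toy batch are assumptions of this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logloss_gradient(w, batch):
    """Gradient of the average log loss of a logistic regression model."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(batch)
    return grad

def proximal_update(w_prev, batch, reg, tol=1e-10, max_iters=100000):
    """Prox round: minimize logloss(w) + (reg / 2) * ||w - w_prev||^2 to completion."""
    # Step size from a smoothness bound: 0.25 * max ||x||^2 for log loss, plus reg
    smoothness = 0.25 * max(sum(xj * xj for xj in x) for x, _ in batch) + reg
    w = list(w_prev)
    for _ in range(max_iters):
        g = [gi + reg * (wi - pi)
             for gi, wi, pi in zip(logloss_gradient(w, batch), w, w_prev)]
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break  # gradient of the regularized objective is (numerically) zero
        w = [wi - gi / smoothness for wi, gi in zip(w, g)]
    return w

# Toy batch (hypothetical): each example is (feature vector, click label)
batch = [([1.0, 0.5], 1), ([1.0, -1.0], 0), ([1.0, 2.0], 1), ([1.0, -0.5], 0)]
w_prev = [0.0, 0.0]
w_weak = proximal_update(w_prev, batch, reg=0.1)     # weak pull towards w_prev
w_strong = proximal_update(w_prev, batch, reg=10.0)  # strong pull towards w_prev
```

A larger regularization parameter keeps the solution closer to the previous weights, mirroring the effect of fewer iterations in the early stopping scheme.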
In Equation 3, each coordinate has the same weight $\lambda$. Often, however, we want some coordinates to move less than others. For example, coordinates that have covered many training examples in the recent history can have a higher penalty for change than parameters covering relatively few examples. The Proximal update equation is the same as Equation 3, except that we have a per-coordinate regularization $\lambda_i$:
$$ w_t = \arg\min_{w} \; L(w, D_t) + \frac{1}{2} \sum_i \lambda_i \left( w^{(i)} - w_{t-1}^{(i)} \right)^2, \tag{4} $$
where $w^{(i)}$ is the $i$th coordinate of the weight vector. This looks similar to the per-coordinate learning rate in the Online Gradient Descent scheme above. One way of setting the per-coordinate regularization parameter is to use the diagonal of the Fisher Information of the data [Kirkpatrick et al.]. This scheme is called Elastic Weight Consolidation (EWC) in [Kirkpatrick et al.]. It arises naturally as an approximation of the posterior of the weights, which contains information about the parameters important to the historical data. In the case of Logistic Regression, this is exactly the diagonal of the second derivative (Hessian) of the negative log-likelihood function. Incidentally, this scheme was also proposed as an online learning scheme with Thompson Sampling for Click Prediction problems [Chapelle and Li2011].
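For logistic regression, the diagonal of the Fisher Information has a simple closed form, which the sketch below computes: for each coordinate it averages p(1 - p) times the squared feature value, where p is the predicted click probability. Averaging over the batch (rather than summing) and the toy data are assumptions of this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fisher_diagonal(w, batch):
    """Diagonal of the Fisher Information of a logistic regression model.

    For logistic regression this equals the diagonal of the Hessian of the
    average log loss: mean over examples of p * (1 - p) * x_j ** 2.
    """
    diag = [0.0] * len(w)
    for x, _ in batch:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j, xj in enumerate(x):
            diag[j] += p * (1.0 - p) * xj * xj / len(batch)
    return diag

# Toy batch (hypothetical); labels are unused because the Hessian does not depend on them
batch = [([1.0, 0.5], 1), ([1.0, -1.0], 0), ([1.0, 2.0], 1), ([1.0, 0.0], 0)]
per_coordinate_reg = fisher_diagonal([0.0, 0.0], batch)
```

Coordinates whose features carry more information in the historical data receive a larger value and would therefore be penalized more for moving in the per-coordinate proximal update.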
The individual optimization problems of the L2 Proximal Update (Equation 3) and the Per coordinate one (EWC) are convex optimization problems and can be optimized via methods like LBFGS [Liu and Nocedal1989] or Trust Region Newton [Lin, Weng, and Keerthi2007].
We next study the relationship between the early stopping algorithms and the proximal update scheme. For simplicity, we shall analyze the case when the ES algorithm is Gradient Descent with a fixed Learning rate. The learning rate and the number of iterations for ES, and the regularization parameter for Prox determine the trade-off and performance. We show here that there is a close relationship between the two.
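Assume we initialize both ES and Prox with the same weights $w^0$, i.e. Prox minimizes $F_\lambda(w) = L(w) + \frac{\lambda}{2} \| w - w^0 \|_2^2$, where $L(w)$ is the loss on the current batch. Denote by $w^1, \dots, w^p$ the weights obtained via an ES scheme. For this analysis, we assume we use gradient descent, i.e. $w^{i+1} = w^i - \eta \nabla L(w^i)$. We then show the following result.

Theorem 1. Denote by $w^*_\lambda$ the optimal solution of the Prox objective function $F_\lambda$ with regularization parameter $\lambda$. Denote by $w^p$ the solution obtained by running ES on $L$ with a learning rate $\eta$ for $p$ iterations. If $\lambda$, $\eta$ and $p$ satisfy $\lambda = 1/(\eta p)$, then the solution $w^p$ satisfies
$$ F_\lambda(w^p) - F_\lambda(w^*_\lambda) \le \frac{(p+1)\,\epsilon}{2} \, \| w^p - w^*_\lambda \|_2, \tag{5} $$
where $\epsilon = \max_{0 \le i < p} \| \nabla L(w^{i+1}) - \nabla L(w^i) \|_2$.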
The above theorem shows that as long as the gradients of the loss function $L$ do not change much from iteration $i$ to $i+1$ in the early stopping scheme, $w^p$ is close to the optimal solution of Prox, provided the parameters satisfy $\lambda = 1/(\eta p)$. The proof of this result is in the Appendix.

Theorem 1 can be extended to show a relationship between a per-coordinate learning rate and a per-coordinate regularization. In particular, a per-coordinate learning rate $\eta_j$ is closely related to a per-coordinate regularization $\lambda_j$ if $\lambda_j = 1/(\eta_j p)$.
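Corollary 1. Denote by $w^*$ the optimal solution of the Prox objective function $F(w) = L(w) + \frac{1}{2} \sum_j \lambda_j (w_j - w^0_j)^2$ with per-coordinate regularization $\lambda_j$. Denote by $w^p$ the solution obtained by running ES on $L$ with a per-coordinate learning rate $\eta_j$ for $p$ iterations. If $\lambda_j$, $\eta_j$ and $p$ satisfy $\lambda_j = 1/(\eta_j p)$ for every coordinate $j$, then the solution $w^p$ satisfies
$$ F(w^p) - F(w^*) \le \frac{p+1}{2} \sum_j \epsilon_j \, | w^p_j - w^*_j |, \tag{6} $$
where $\epsilon_j = \max_{0 \le i < p} | \nabla_j L(w^{i+1}) - \nabla_j L(w^i) |$.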
We make several important remarks about Theorem 1. Firstly, as noted earlier, $w^p$ (obtained via $p$ iterations of the ES scheme) is close to the optimal solution of the Proximal scheme if $\epsilon$ is small. The quantity $\epsilon$ being small implies that the gradients of subsequent iterations of the early stopping scheme are close to each other. Secondly, notice that the bound also depends on $p$. We expect the bound to be looser if $p$ is large (everything else remaining the same). In the next section, we investigate this relationship empirically. We observe that for several parameter values of $\lambda$, $\eta$ and $p$ satisfying the condition of Theorem 1, the ES and Prox methods obtain similar solutions. We show that in those cases, the gradient differences are small. We also show cases where the solutions of Prox and those of ES are not close to each other, and argue that in those cases the bound from Theorem 1 is weaker.
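The relationship between the two schemes can be checked numerically on a toy problem. The sketch below makes several assumptions: the Prox objective is the average log loss plus (lam / 2) times the squared distance to the initial weights, ES is gradient descent with a fixed learning rate, the parameters are tied by lam = 1 / (eta * p), and the quantity compared against the objective gap is ((p + 1) / 2) * eps * ||w_es - w_prox||, with eps the largest change between successive ES gradients. The data and parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, batch):
    """Average log loss of a logistic regression model."""
    total = 0.0
    for x, y in batch:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(batch)

def grad(w, batch):
    """Gradient of the average log loss."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / len(batch)
    return g

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def prox_objective(w, w0, batch, lam):
    return loss(w, batch) + 0.5 * lam * sum((a - b) ** 2 for a, b in zip(w, w0))

batch = [([1.0, 0.5], 1), ([1.0, -1.0], 0), ([1.0, 2.0], 0), ([1.0, -0.5], 1)]
w0 = [0.0, 0.0]
eta, p = 0.1, 5
lam = 1.0 / (eta * p)  # assumed coupling between the ES and Prox parameters

# ES: p gradient descent steps, recording the gradient at every iterate
w_es, grads = list(w0), []
for _ in range(p):
    grads.append(grad(w_es, batch))
    w_es = [wi - eta * gi for wi, gi in zip(w_es, grads[-1])]
grads.append(grad(w_es, batch))
eps = max(norm([a - b for a, b in zip(grads[i + 1], grads[i])]) for i in range(p))

# Prox: minimize the regularized objective (numerically) to completion
w_prox = list(w0)
for _ in range(5000):
    g = [gi + lam * (wi - w0i) for gi, wi, w0i in zip(grad(w_prox, batch), w_prox, w0)]
    w_prox = [wi - 0.2 * gi for wi, gi in zip(w_prox, g)]

gap = prox_objective(w_es, w0, batch, lam) - prox_objective(w_prox, w0, batch, lam)
bound = 0.5 * (p + 1) * eps * norm([a - b for a, b in zip(w_es, w_prox)])
```

On this toy problem the objective gap between the ES and Prox solutions is non-negative and stays below the computed bound, and both shrink as the successive ES gradients become more similar.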
This section provides details of our extensive evaluation of our online learning framework, with the goal of providing a better understanding of the model performance in various scenarios and of the theoretical results discussed above. The experiments shared below have been run for over a year in our production systems. We have evaluated the models on various feature sets, at various times of the year (holiday and regular time periods), and with various parameter choices. The model performance is consistent over all these experiments. In the interest of space, we provide only a summary of the results below. The results do not change drastically with different batch sizes (daily, four-hourly etc.); smaller batch sizes only ensure quicker model updates. All our results are on batch sizes of one day. Also, all the experiments below were conducted over a span of 15 days to three months with daily updates. Each day of data consists of around 2 million instances. We show the results as time series graphs to demonstrate the gains of online learning over time. Our C++ code (built on top of [Iyer, Halloran, and Wei2018]) and the dataset used for our experiments are available at https://github.com/rishabhk108/jensen-ol.
Figure 3 demonstrates the gains of online learning by comparing the model metrics to those of the stale model. We show the gains in both AUC (ranking) and RIG. We see relative RIG gains of about 0.5% in the middle of the period and close to 1.5% towards the end. We also observe AUC gains of around 0.1%. These experiments are run over a span of three months. Both of these are significant gains in our system, and are better than the gains we would expect from retraining the models (we compare both in later sections).
This section investigates the critical trade-off parameters for early stopping, namely the choice of the ES algorithm, the learning rate and the number of iterations. Figure 4 shows the results.
We first compare the different ES algorithms. In this setup, we compare LBFGS, SGD and Gradient Descent. SGD and GD are gradient descent style algorithms, and in both cases we use a fixed learning rate and compare their performance for varying numbers of iterations. LBFGS adapts the learning rate as the algorithm proceeds. We see that incremental training with GD and with SGD perform similarly to one another for the same learning rate and number of iterations; we run both algorithms with the same fixed learning rate and compare them at matching numbers of iterations. LBFGS, however, performs worse than both (even with a small iteration budget, LBFGS already overfits to the new data). The added benefit of SGD and GD comes from the flexibility of a fixed learning rate, whereas LBFGS tries to minimize the objective function completely as quickly as possible. In our case, we do not want to overfit to the new data, and it is desirable to have the right knobs to trade off between the historical and new data. This consideration does not favor LBFGS as the algorithm for use in an ES scheme. This comparison is shown in the top two graphs in Figure 4.
We next compare different numbers of iterations. We set the ES algorithm to be SGD and fix the learning rate. With a small number of iterations, the model does not learn enough from the new data, while with a large number of iterations, the model overfits to the new data and we see a loss in performance. The optimal performance is achieved at an intermediate number of iterations. This parameter, along with the learning rate, needs to be tuned for each model, depending on the amount of historical data and incremental data. We see the RIG and AUC gains in the third and the fourth graphs (from top) in Figure 4.
Finally, we compare different learning rates. A large learning rate causes the weights to diverge, while with a small learning rate the learning is slow. We achieve the best results with an intermediate learning rate. The RIG and AUC gains are shown in the last two plots (from top) in Figure 4.
We next compare the effect of different regularization parameters for Prox. Using a small regularization parameter tends to make the model overfit to the new data, while with a large regularization parameter the model hardly learns. The optimal performance comes from intermediate values of the regularization parameter. Again, this parameter will need to be tuned depending on the amount of historical and new data. The results are shown in Figure 5.
In this section, we look into the early stopping and proximal updates and their connection. The goal of this exercise is to compare several ES schemes (for different learning rates and numbers of iterations) and different Prox schemes (for different regularization parameters). The results are in Figure 6. Firstly, we compare four pairs of ES and Prox schemes whose parameters all satisfy the condition of Theorem 1. We see that the Prox and ES gains are very similar to each other (the blue, orange, red and dark gray lines in Figure 6). The results hold for both the AUC and RIG gains. We next consider two additional ES settings, one with a larger learning rate and one with more iterations, together with the corresponding Prox scheme. We see here that there is a gap between the Prox and ES schemes, with the Prox method consistently outperforming the ES schemes in both cases (the three green lines in Figure 6).
To understand this better, we plot the difference in the Prox objective between the ES and Prox solutions, and the upper bound from Theorem 1, in Figure 7. We see that the first group of settings has small values of the objective difference, as expected. We see that the upper bound estimate is also small in this case (around 1e-02). However, with the two additional settings, we see a larger difference between the objective values at the Prox and the ES solutions. Note that all of these settings satisfy the condition of Theorem 1. We also see that the upper bound estimate is larger. With a larger learning rate, the gradient difference between subsequent iterations will be larger, and so will the bound. In the second case, the learning rate is smaller, but we run it for more iterations. Correspondingly, the bound (which depends on both the gradient difference and the number of iterations) is larger.

ES and Prox yield very similar update rules if we choose the right set of hyper-parameters. Fortunately, in practice, we observe that the optimal performance comes from smaller values of the learning rate and the number of iterations, and in those settings the Prox and the ES schemes coincide. Unlike the Prox scheme, the ES scheme does not require solving a convex optimization problem to completion. We just need to run a few iterations of SGD with the right learning rate. On the flip side, the ES scheme has more hyper-parameters (the learning rate and the number of iterations), which makes it slightly harder to tune than Prox, where we just need to tune the regularization. In the rest of the paper, we fix the learning rate and the number of iterations and use the ES update scheme.
We have illustrated how batch OL can help resolve the issue of model staleness that a fixed model suffers from. Another obvious alternative to resolve this problem is to automatically retrain and update the base model periodically. To compare these two approaches, we set up a baseline where the model was completely retrained daily on the previous week's data and evaluated on just the next day. This was compared against a batch OL model being incrementally trained daily. Figure 9 demonstrates the results.
Firstly, we notice that both models follow the same trend over time. Next, we see that on several days, online learning outperforms the moving window baseline. This can be attributed to the fact that online learning is done incrementally in each batch, so the model has seen more data than just the previous period. This gives the OL model the ability to generalize better to trends it has learnt in earlier batches, while also having the ability to learn and adapt to the recent trends. The retrained model, however, learns solely from the last few days and may overfit to this distribution. We also see that the moving window baseline is more sensitive to data corruption and distributional changes than online learning. The reader will notice large drops in AUC and RIG owing to distributional changes (see the two large drops with the moving window approach). This effect can be even more pronounced with data corruption issues (like system bugs or livesite incidents). Since online learning adapts slowly while remembering the past historical data, it does not overfit as much. Moreover, it is also easier to implement validation checks and safeguards to ensure it does not learn from corrupt data.
We next study the effect of different initializations on online learning. We consider different starting points of online learning. In this experiment, we train four different base models with one week of initial data each. We start four different OL schemes, each of which begins one week apart. We see that after about a month of online learning, all four models converge to roughly the same performance with respect to a fixed base model. Figure 8 shows the results.
In this section, we investigate the effect of delayed predictions made by online learning models. In other words, we fix the model evaluation period and compare the performance of multiple OL models trained till different time periods. The most recent model was trained on daily batches till one day before this period. The other models in comparison are trained till one week, 15 days and so on up till more than 2 months before the evaluation period. We also compare the performance of a base model trained from scratch approximately 1 month before the evaluation period. The results are in Figure 10. Here we can see that for both AUC and RIG, the performance degrades with increased delay. This inference is intuitive since the delayed models haven’t seen the latest trends in data closer to the evaluation period. The reduction in performance however is small as we move to older models. Even the model trained till 03/05, which is more than 2 months before the evaluation period, retains most of the gains in AUC and RIG over the base model. We also compare these delayed models to a fixed baseline model trained to completion around one month before the evaluation period (marked as April Retrained, trained on 04/01 to 04/07). Notice that there are a few OL model snapshots that have been updated till before this time-period, namely till 03/25 and 03/05. As seen in the figure, even these OL models perform better than the retrained baseline, even though the baseline model is trained closer to the evaluation period. The reason for this is that the OL models are trained incrementally and have actually seen data across several months, hence they generalize better. The fixed model on the other hand trains on just one week of data and stands to learn just from the distribution in this time period. This again underscores the point that online learning models are superior compared to simple retrained models with a data refresh.
This paper presents a unified framework for online learning by showing how two seemingly different views of online learning, namely an iterative early stopping scheme and a Proximal Update algorithm (both of which have been extensively studied in the literature for this problem), are closely related. We provide conditions under which the two algorithms achieve the same updates and empirically validate them. We present several results demonstrating the benefit of online learning, by studying the tradeoff between historical and new data and the impact of initializations and delays in the system, and by showing that Online Learning is a superior and more stable method compared to model retraining.
Finally, we discuss some important validation and safeguard mechanisms required for online learning systems in production systems. This is important since models are getting automatically updated. Some of our validation checks include:
Check daily differences in model metrics such as RIG and AUC (day over day differences). We do not expect the day over day differences to be large due to the incremental nature of our online learning schemes.
Check differences to the base model. We expect to see non-trivial improvements compared to the stale initial model.
Compare against the moving window baseline to ensure that online learning is no worse than a retrained baseline model.
Run day-over-day CTR and other data checks to ensure that we do not incrementally train models over data affected by livesite incidents. For data checks, we check the volume of the training data over various slices, as we do not expect a drastic difference in the volume of the input data.
We also have monitoring dashboards to monitor daily model metrics, input data volumes, CTR etc. These dashboards allow us to monitor the daily performance of the models, and investigate potential issues. In case of model issues, it is also easy to rollback to previous snapshots of the model.
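The validation checks listed above can be sketched as a simple gate that runs before a daily update is promoted. This is an illustrative sketch only: the function name, the dictionary layout, the thresholds and the example numbers are all hypothetical and are not values from the production system.

```python
def validate_update(today, yesterday, base, max_metric_jump=0.01, max_volume_change=0.5):
    """Return the list of failed checks for a candidate daily model update.

    today, yesterday and base are dicts with keys "rig", "auc" and "volume".
    The thresholds are hypothetical and would need tuning per system.
    """
    failures = []
    for metric in ("rig", "auc"):
        # Day-over-day differences should be small for incremental updates
        if abs(today[metric] - yesterday[metric]) > max_metric_jump:
            failures.append("day_over_day_" + metric)
        # The updated model should not be worse than the stale base model
        if today[metric] < base[metric]:
            failures.append("worse_than_base_" + metric)
    # Input data volume should not change drastically from one day to the next
    change = abs(today["volume"] - yesterday["volume"]) / yesterday["volume"]
    if change > max_volume_change:
        failures.append("volume")
    return failures

base = {"rig": 0.19, "auc": 0.79, "volume": 2000000}
yesterday = {"rig": 0.200, "auc": 0.800, "volume": 1950000}
healthy = validate_update({"rig": 0.201, "auc": 0.801, "volume": 2000000}, yesterday, base)
suspicious = validate_update({"rig": 0.15, "auc": 0.80, "volume": 500000},
                             {"rig": 0.20, "auc": 0.80, "volume": 2000000}, base)
```

When any check fails, the update can be skipped and the previous snapshot kept in production, in line with the rollback option mentioned above.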
Proof of Theorem 1.
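To prove this result, we note that ES produces a sequence of weights $w^0, w^1, \dots, w^p$. Given that we run ES as gradient descent with a fixed learning rate $\eta$, it is easy to see that $w^{i+1} = w^i - \eta \nabla L(w^i)$. Also note that the Prox objective $F_\lambda(w) = L(w) + \frac{\lambda}{2} \| w - w^0 \|_2^2$ has gradient $\nabla F_\lambda(w) = \nabla L(w) + \lambda (w - w^0)$. From the above, $w^p - w^0 = -\eta \sum_{i=0}^{p-1} \nabla L(w^i)$. Therefore,
$$ \nabla F_\lambda(w^p) = \nabla L(w^p) - \lambda \eta \sum_{i=0}^{p-1} \nabla L(w^i). \tag{7} $$
If $\lambda \eta p = 1$, then we have,
\begin{align}
\nabla F_\lambda(w^p) &= \nabla L(w^p) - \frac{1}{p} \sum_{i=0}^{p-1} \nabla L(w^i) \tag{8} \\
&= \frac{1}{p} \sum_{i=0}^{p-1} \left( \nabla L(w^p) - \nabla L(w^i) \right). \tag{9}
\end{align}
It then follows that
$$ \| \nabla F_\lambda(w^p) \|_2 \le \frac{1}{p} \sum_{i=0}^{p-1} \| \nabla L(w^p) - \nabla L(w^i) \|_2. \tag{10} $$
Given that $\| \nabla L(w^{i+1}) - \nabla L(w^i) \|_2 \le \epsilon$ for all $i$, we can show by the triangle inequality that $\| \nabla L(w^p) - \nabla L(w^i) \|_2 \le (p - i)\,\epsilon$. Adding this up, we get that $\| \nabla F_\lambda(w^p) \|_2 \le \frac{1}{p} \sum_{i=0}^{p-1} (p - i)\,\epsilon = \frac{(p+1)\,\epsilon}{2}$. Since $F_\lambda$ is convex, we have that,
$$ F_\lambda(w^p) - F_\lambda(w^*_\lambda) \le \nabla F_\lambda(w^p)^\top (w^p - w^*_\lambda) \le \| \nabla F_\lambda(w^p) \|_2 \, \| w^p - w^*_\lambda \|_2. \tag{11} $$
From which, we have $F_\lambda(w^p) - F_\lambda(w^*_\lambda) \le \frac{(p+1)\,\epsilon}{2} \| w^p - w^*_\lambda \|_2$. ∎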
The proof of Corollary 1 follows exactly as above, except that we consider a co-ordinate wise sum for each of the expressions.