1 Introduction
With the growing popularity of the Internet as a medium, new technologies for targeting advertisements in the digital domain, a discipline generally referred to as computational advertising, have opened up new business models for publishers and advertisers to finance their services and sell their products. Online advertising entails using banner ads as a means to attract user attention towards a certain brand or product. Clicks on a banner, known as clickthroughs, take a user to a website specified by the advertiser and generate revenue for the page displaying the banner, which we call the publisher.
In real-time bidding (RTB), banner ads are determined and placed in real time based on an auction initiated by the publisher between all potential advertisers, asking them to place a bid of what they are willing to pay for the current impression (displaying the ad), given information about the page, the user engaging the page, and a description of the banner format and placement on the page. The advertiser with the highest bid wins the auction and their banner is displayed to the user. RTB thus requires advertisers, or more commonly the demand-side platforms (DSPs) acting on behalf of the advertisers, to be able to estimate the potential value of an impression, given the available information. A key measure for evaluating the potential value of an impression is the clickthrough rate (CTR), calculated as the ratio of the number of clicks to the total number of impressions in a specific context. What we investigate in the present work is a model for predicting CTRs, even for contexts without any previous clicks and/or with very few impressions available, such that the empirical CTR is unknown or very poorly estimated.
1.1 Dyadic prediction
We frame our main objective of estimating clickthrough rates for web banner advertisements in the general scope of a dyadic prediction task. Dyadic prediction concerns the task of predicting an outcome (or label) for a dyad, $(i,j)$, whose members are uniquely identified by $i$ and $j$, but which may include additional attributes of the dyad being observed.
In this paper we are interested in predicting the binary labels being either click or not click, in general referred to as clickthrough rate prediction, given the pair of a domain and a web banner advertisement. In the following, we give a formal introduction of this problem.
We are given a transaction log of banner advertisements being shown to users. In the logs, various dimensions are recorded, including a banner ID and a domain ID, as well as a number of other attributes, which we elaborate on later. For each record in the log, henceforth called a view, it is recorded whether the banner was clicked or was displayed without any subsequent click (nonclick). Let $i$ index the banner dimension and $j$ the domain dimension. We can then construct a matrix, $X$, summarizing the records in the log in terms of empirical clickthrough rates, i.e., let the entries of the matrix be defined by
(1.1) $X_{ij} = \begin{cases} C_{ij} / V_{ij} & \text{if } V_{ij} > 0 \\ ? & \text{otherwise} \end{cases}$
Here $C_{ij}$ is the number of clicks and $V_{ij}$ is the number of views involving dyad $(i,j)$. Note that by definition, both clicks and nonclicks count as views, so we must always have $V_{ij} \geq C_{ij}$. The “?” denotes unobserved pairs, where there is no historical data in the log, hence $X_{ij}$ is undefined for such dyads.
With this formulation, our clickthrough rate prediction task is to learn models estimating the click probability of each dyad $(i,j)$. Naturally, any such model should be able to predict the missing entries “?”, as well as to smooth predictions, such that the model does not get overconfident in situations with too few views. For instance, an empirical CTR computed from only a handful of views is probably too extreme when it contains a click, and equally so when it contains none, where the natural assumption should rather be that not enough views have yet been observed.
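As a minimal sketch of Eq. (1.1), assume the log is available as (banner, domain, clicked) tuples. This layout and the function name `empirical_ctr` are hypothetical and only serve the illustration; `None` plays the role of the “?” entries.

```python
from collections import defaultdict

def empirical_ctr(log, n_banners, n_domains):
    """Return the matrix of Eq. (1.1); None stands in for the "?" entries."""
    clicks = defaultdict(int)
    views = defaultdict(int)
    for banner, domain, clicked in log:
        views[(banner, domain)] += 1            # clicks and nonclicks both count as views
        clicks[(banner, domain)] += int(clicked)
    return [
        [clicks[(i, j)] / views[(i, j)] if (i, j) in views else None
         for j in range(n_domains)]
        for i in range(n_banners)
    ]

# Toy log of (banner, domain, clicked) records.
toy_log = [(0, 0, True), (0, 0, False), (0, 1, False), (1, 1, False)]
X = empirical_ctr(toy_log, n_banners=2, n_domains=2)
```

In the toy log, the dyads with a single nonclick view get an empirical CTR of exactly zero, which illustrates why raw ratios from very few views should not be trusted as they are.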
One possible approach to the above is where additional features about the entities $i$ and $j$ are known. This side-information can then be used as predictors in a supervised learning model, such as logistic regression. We refer to this approach as feature-based.
In the complete lack of side-information, one can instead borrow ideas from collaborative filtering. The classic setup in collaborative filtering, e.g., the Netflix movie rating problem [4], is where dyads are (user, item) pairs and each observed pair is labeled with a rating, for instance on the scale 1 to 5. The task is then to predict the ratings for unobserved pairs, which can then be used as a component in a recommender system. In our case we can identify a similar collaborative filtering task, but where instead of ratings we have binary outcomes and the dyads are (banner, domain) pairs. The assumption in collaborative filtering is that for a particular (banner, domain) pair, other row objects (other banners) as well as other column objects (other domains) contain predictive information. That is, we assume that some information is shared between entities and we need to learn a model of this shared information.
In this work we investigate a model that fuses ideas from collaborative filtering via matrix factorization and a mapping to valid probabilities in $[0,1]$, called a latent feature log-linear model (LFL), with a feature-based model for explicit features, which we refer to as a side-information model.
1.2 Related work
The model that we investigate in this work was introduced in [11] and builds on the latent feature log-linear model (LFL) from [12]. Our work can be seen as a supplement to [11], as we think that work is lacking in details, which we try to provide. We also offer different conclusions about the applicability of this model to a dataset of our own, but for the same application as [11]. The authors of [11] do not share any of their data, so we unfortunately cannot reproduce their results.
The modeling of clickthrough rates has been extensively investigated in the domain of search engine advertising, i.e., the sponsored advertisements that appear in web search engines as a result of user queries for relevant content retrieval. Many methods proposed in this domain are feature-based, e.g., [14, 5, 7], based on logistic regression. Other techniques are maximum likelihood based [6, 3], i.e., they operate directly on the empirically observed counts, which makes it a problem to predict in cold-start settings. Since in search engines the user directly reveals an intent through his or her query, most of these studies in some way include predicting clickthrough rates of (word, ad) pairs, which could indeed also be modeled using the LFL framework [12], but to our knowledge this has yet to be investigated.
In the setting that we are looking at, namely placement of banner ads, there is no direct query from the user, so modeling clickthrough rates cannot be based on (word, ad) pairs and has to be based on other, often much weaker, predictors of intent. Feature-based approaches are also popular in this setting, see e.g. [10]. Latent feature models are not much explored in this area, hence a motivation for this work is to combine the best of latent and explicit features and share our findings.
2 Response prediction
The model we apply for response prediction is based on the work in [11], a collaborative filtering technique based on matrix factorization, which is a special case of the latent feature log-linear model (LFL) [12] for dyadic prediction with binary labels. Menon et al. demonstrate that their model, incorporating side-information, hierarchies and an EM-inspired iterative refinement procedure, overcomes many collaborative filtering challenges, such as sparsity and cold-start problems, and they show superior performance to models based purely on side-information, most notably the LMMH model [1]. In the following we introduce the confidence-weighted latent factor model from [11].
2.1 Confidence-weighted factorization
A binary classification problem of the probability of click given a dyadic observation for page $j$ and banner $i$, $(i,j)$, can be modeled with the logistic function and a single weight, $\omega_{ij}$, per dyad, i.e., $p(\text{click} \mid i,j) = \sigma(\omega_{ij})$ with $\sigma(x) = 1/(1+\exp(-x))$. However, such a model is only capable of classifying dyads already observed in training data and cannot be applied to unseen combinations of pages and banners. Therefore we assume a factorization of $\omega_{ij}$ into the factors $\beta_j$ and $\alpha_i$, each representing latent feature vectors of the page and banner dimensions, respectively, such that $\omega_{ij} = \alpha_i^\top \beta_j$. Henceforth, we will refer to this estimator as $p_{ij} = \sigma(\alpha_i^\top \beta_j)$.
With data being $N$ observations of dyads, $(i_n, j_n)$, with binary labels, $y_n \in \{0,1\}$, learning can be formulated as the regular logistic regression optimization problem:
(2.1) $\min_{\alpha,\beta} \; -\sum_{n=1}^{N} \left[ y_n \log p_{i_n j_n} + (1-y_n) \log(1 - p_{i_n j_n}) \right]$
i.e., a maximum-likelihood solution for Bernoulli-distributed output variables using the logistic function to nonlinearly map continuous values to probabilities. With the latent variables $\alpha_i$ and $\beta_j$ being indexed by $(i,j)$, we can rewrite Eq. (2.1) to a confidence-weighted factorization:
(2.2) $\min_{\alpha,\beta} \; -\sum_{(i,j)} \left[ C_{ij} \log p_{ij} + (V_{ij} - C_{ij}) \log(1 - p_{ij}) \right]$
where $C_{ij}$ is the number of clicks ($y_n = 1$) involving dyad $(i,j)$ and $V_{ij} - C_{ij}$ the number of nonclicks ($y_n = 0$) involving dyad $(i,j)$ in the training data. This reformulation can be a significant performance optimization, since the number of distinct dyads can be much smaller than the total number of observations. E.g., in the case of clickthrough data, we can easily have many thousands of click and (particularly) nonclick observations per dyad, hence the number of operations involved in the summation of Eq. (2.2) is significantly reduced compared to Eq. (2.1).
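The equivalence between Eq. (2.1) and Eq. (2.2) can be checked numerically. The sketch below assumes observations stored as (banner, page, label) tuples and toy latent factors; all names and values are illustrative and not taken from the experiments.

```python
import math
from collections import Counter

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(alpha_i, beta_j):
    """Estimator sigma(alpha_i^T beta_j) for one dyad."""
    return sigmoid(sum(a * b for a, b in zip(alpha_i, beta_j)))

def loss_per_observation(obs, alpha, beta):
    """Eq. (2.1): one term per logged observation (i, j, y)."""
    total = 0.0
    for i, j, y in obs:
        p = predict(alpha[i], beta[j])
        total -= math.log(p) if y else math.log(1.0 - p)
    return total

def loss_per_dyad(obs, alpha, beta):
    """Eq. (2.2): one term per distinct dyad, weighted by its counts."""
    clicks = Counter((i, j) for i, j, y in obs if y)
    views = Counter((i, j) for i, j, _ in obs)
    total = 0.0
    for (i, j), v in views.items():
        p = predict(alpha[i], beta[j])
        c = clicks[(i, j)]
        total -= c * math.log(p) + (v - c) * math.log(1.0 - p)
    return total

# Toy factors (rank 2) and observations, chosen arbitrarily.
alpha = [[0.3, -0.2], [0.1, 0.4]]
beta = [[0.5, 0.1], [-0.3, 0.2]]
obs = [(0, 0, 1), (0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 0)]
per_obs = loss_per_observation(obs, alpha, beta)
per_dyad = loss_per_dyad(obs, alpha, beta)
```

Both functions return the same value up to floating-point error, but the second evaluates the logistic function once per distinct dyad instead of once per observation.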
2.1.1 Regularization, learning and bias weights
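Optimization of Eq. (2.2) is jointly nonconvex in $\alpha$ and $\beta$, but convex in $\alpha$ with $\beta$ fixed, and vice versa. In practice that means we can only converge to a local minimum. Introducing regularization into the problem alleviates some nonconvexity by excluding some local minima from the feasible set and additionally helps control overfitting. [11] suggests an $\ell_2$-norm penalty, thereby effectively smoothing the latent factors:
(2.3) $\min_{\alpha,\beta} \; -\sum_{(i,j)} \left[ C_{ij} \log p_{ij} + (V_{ij} - C_{ij}) \log(1 - p_{ij}) \right] + \frac{\lambda_\alpha}{2} \sum_i \|\alpha_i\|_2^2 + \frac{\lambda_\beta}{2} \sum_j \|\beta_j\|_2^2$
where $\lambda_\alpha, \lambda_\beta \geq 0$. In this work we also try optimization with an $\ell_1$-norm regularizer:
(2.4) $\min_{\alpha,\beta} \; -\sum_{(i,j)} \left[ C_{ij} \log p_{ij} + (V_{ij} - C_{ij}) \log(1 - p_{ij}) \right] + \lambda_\alpha \sum_i \|\alpha_i\|_1 + \lambda_\beta \sum_j \|\beta_j\|_1$
with $\lambda_\alpha, \lambda_\beta \geq 0$, thereby promoting sparse latent features.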
For the $\ell_2$-regularized problem Eq. (2.3), a batch solver such as L-BFGS (see [13, Chapter 9]) can be invoked. For the $\ell_1$-regularized problem Eq. (2.4), special care must be taken due to nondifferentiability. The quasi-Newton method OWL-QN [2] can be used instead in this setting. For really large problems, an online learning framework such as stochastic gradient descent (SGD) is more scalable, again requiring special handling of the $\ell_1$ regularizer; see [15] for details.
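To make the $\ell_2$-regularized objective concrete, the sketch below evaluates it on per-dyad (clicks, views) counts and takes plain gradient-descent steps. Gradient descent stands in for L-BFGS purely to keep the example dependency-free. A single shared regularization weight, the toy counts, the step size and all function names are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(counts, alpha, beta, lam):
    """Confidence-weighted log loss plus an l2 penalty with shared weight lam."""
    loss = 0.0
    for (i, j), (c, v) in counts.items():
        p = sigmoid(dot(alpha[i], beta[j]))
        loss -= c * math.log(p) + (v - c) * math.log(1.0 - p)
    penalty = sum(x * x for row in alpha + beta for x in row)
    return loss + 0.5 * lam * penalty

def gradient_step(counts, alpha, beta, lam, lr):
    """One plain gradient-descent step on all latent factors."""
    g_alpha = [[lam * x for x in row] for row in alpha]
    g_beta = [[lam * x for x in row] for row in beta]
    for (i, j), (c, v) in counts.items():
        r = v * sigmoid(dot(alpha[i], beta[j])) - c   # d(loss) / d(alpha_i^T beta_j)
        for k in range(len(alpha[i])):
            g_alpha[i][k] += r * beta[j][k]
            g_beta[j][k] += r * alpha[i][k]
    alpha = [[x - lr * g for x, g in zip(row, grow)] for row, grow in zip(alpha, g_alpha)]
    beta = [[x - lr * g for x, g in zip(row, grow)] for row, grow in zip(beta, g_beta)]
    return alpha, beta

# Toy problem: dyad -> (clicks, views), rank-2 factors with small nonzero init.
counts = {(0, 0): (1, 10), (0, 1): (0, 5), (1, 0): (3, 20)}
alpha = [[0.1, -0.1], [0.2, 0.1]]
beta = [[0.1, 0.2], [-0.1, 0.1]]
start = objective(counts, alpha, beta, lam=0.1)
for _ in range(200):
    alpha, beta = gradient_step(counts, alpha, beta, lam=0.1, lr=0.01)
end = objective(counts, alpha, beta, lam=0.1)
```

The factors are initialized away from zero because the all-zero point is a stationary point of the objective, from which a gradient method cannot move.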
In general, for classification problems with skewed class probabilities, e.g., observing many more nonclicks than clicks, we can add bias terms to capture such baseline effects. We follow the suggestion from [12] and add separate bias terms for each row and column object, i.e., in our case per-page and per-banner bias weights. Hence, without loss of generality, when we refer to $\alpha_i$ and $\beta_j$, we assume they have been appended $[\alpha_{i0}, 1]^\top$ and $[1, \beta_{j0}]^\top$, respectively, thereby catering for the biases. Furthermore, when we speak of a rank $K$ latent feature model, we actually refer to a rank $K+2$ model consisting of $K$ latent features as well as the two bias dimensions.
2.2 Feature-based response prediction
A different approach to response prediction is a model based on explicit features available in each observation. In the case of clickthrough data, such information could for instance be attributes exposed by the browser (e.g., browser name and version, OS, screen resolution, etc.), time-of-day, day-of-week, as well as user profiles based on a particular user’s previous engagements with pages, banners, and with the ad server in general.
Again, we can use logistic regression to learn a model of binary labels: For observations $n = 1, \dots, N$ we introduce feature vectors, $x_n$, and model the probability of click given features with the logistic function, i.e., $p(y_n = 1 \mid x_n) = \sigma(w^\top x_n)$. The optimization problem for learning the weights $w$ becomes
(2.5) $\min_{w} \; -\sum_{n=1}^{N} \left[ y_n \log \sigma(w^\top x_n) + (1-y_n) \log(1 - \sigma(w^\top x_n)) \right] + \lambda_w \|w\|_1$
where $\lambda_w \|w\|_1$ is added to control overfitting and produce sparse solutions. As discussed in Section 2.1.1, adding bias terms can account for skewed target distributions, and may be included in this type of model, e.g., as a global intercept, by appending an all-one feature to all observations. Alternatively, if we want to mimic the per-page and per-banner biases of the latent factor model, we do so by including the page indices and banner indices in the feature vectors, encoded as one-hot binary vectors with one entry per page and one entry per banner, respectively.
2.3 Combining models
With dyadic response prediction as introduced in Section 2.1, the model can be extended to take into account side-information available to each dyadic observation. That is, introducing an order-3 tensor $\mathcal{X}$ with entries $x_{ij}$ being the feature vectors of side-information available to the dyad $(i,j)$, we follow [11] and model the confidence-weighted factorization with side-information as $p_{ij} = \sigma(\alpha_i^\top \beta_j + w^\top x_{ij})$.
Learning such a model by jointly optimizing $\alpha$, $\beta$, and $w$ is nonconvex and may result in bad local minima [12]. To avoid such bad minima, [12, 11] suggest a simple heuristic: first learn a latent-feature model as the one detailed in Section 2.1, then train the side-information model as in Section 2.2, but given the log-odds ($\alpha_i^\top \beta_j$) from the latent-feature model as input. That is, $p_{ij}$ can be rewritten as
(2.6) $p_{ij} = \sigma\left( \begin{bmatrix} 1 & w^\top \end{bmatrix} \begin{bmatrix} \alpha_i^\top \beta_j \\ x_{ij} \end{bmatrix} \right)$
hence, having first learned $\alpha$ and $\beta$, cf. Section 2.1, $w$ can be learned by extending the input features with the log-odds of the latent-feature model and fixing the corresponding weight to one. This training heuristic is a type of residual fitting, where the side-information model is learning from the differences between observed data and the predictions by the latent-feature model.
In practice we have found the above procedure to be insufficient for obtaining the best performance. Instead we need to alternate between fitting the latent features and fitting the side-information model, each while holding the predictions of the other model fixed. This leaves open how to train the latent feature model using the current side-information model predictions as fixed parameters. In the following we show how this can be achieved.
For Eq. (2.3), which we will use as the working example, the observations are summarized for each unique dyad, $(i,j)$, in terms of the click and nonclick counts, regardless of the side-information in each of those observations. Therefore we now address the question: given $w$ from Section 2.2, how do we obtain the quantities $w^\top x_{ij}$ in Eq. (2.6)?
Initially, we define the notation $x_{ij}^{(n)}$ as indexing the $n$th explicit feature vector involving the dyad $(i,j)$. Hence, for dyad $(i,j)$ there are (potentially) $V_{ij}$ different feature vectors $x_{ij}^{(n)}$, $n = 1, \dots, V_{ij}$, involved. Assuming a model learned on the explicit features alone according to Eq. (2.5) from Section 2.2, the overall predicted clickthrough rate for the observations involving dyad $(i,j)$ becomes
(2.7) $\bar{p}_{ij} = \frac{1}{V_{ij}} \sum_{n=1}^{V_{ij}} \sigma\left(w^\top x_{ij}^{(n)}\right)$
which follows from the fact that the sum calculates the predicted number of clicks and $V_{ij}$ is the empirical number of observations. That is, Eq. (2.7) is just the average predicted clickthrough rate taken over the observations involving dyad $(i,j)$. Using this result we can now make sure the combined model yields $\bar{p}_{ij}$ when either $\alpha_i = 0$ or $\beta_j = 0$ (or both) by fixing the term $w^\top x_{ij}$ according to the log-odds of $\bar{p}_{ij}$. Hence,
(2.8) $w^\top x_{ij} := \log \frac{\bar{p}_{ij}}{1 - \bar{p}_{ij}}$
should be used as fixed inputs while learning the latent-factor model and thus accounts for the predictions of the feature-based model the same way as bias terms account for baseline effects.
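The two steps above, averaging the predicted clickthrough rate per dyad as in Eq. (2.7) and converting it to a log-odds offset as in Eq. (2.8), can be sketched as follows. The observation layout (banner, page, feature vector), the weights and the function name `dyad_offsets` are hypothetical.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dyad_offsets(observations, w):
    """Map each dyad (i, j) to the log-odds of its average predicted CTR."""
    sums = defaultdict(float)
    views = defaultdict(int)
    for i, j, x in observations:
        sums[(i, j)] += sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
        views[(i, j)] += 1
    offsets = {}
    for dyad, total in sums.items():
        p_bar = total / views[dyad]                       # Eq. (2.7)
        offsets[dyad] = math.log(p_bar / (1.0 - p_bar))   # Eq. (2.8)
    return offsets

# Toy side-information model and observations (banner, page, features).
w = [0.5, -1.0]
observations = [(0, 0, [1.0, 0.0]), (0, 1, [1.0, 0.0]), (0, 1, [0.0, 1.0])]
offsets = dyad_offsets(observations, w)
```

For a dyad with a single observation the offset reduces to the linear score of that observation. With several observations it is the log-odds of the mean probability, which in general differs from the mean of the individual scores.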
3 Data and experiments
We will run experiments on datasets extracted from ad transaction logs in Adform, an international online advertising technology provider. Due to the sequential nature of these data, we will report results from training a model on 7 consecutive days and then testing on the 8th. For measuring performance we report area under the ROC curve (AUC) scores as well as the logistic loss, evaluated on the held-out data (i.e., the last day). We evaluate different instantiations of the model over a period of in total 23 test days, each using the previous 7 days for training, and can therefore also report on the consistency of the results.
The data consists of observations labeled either click or not click, and in each observation the domain and the banner ID are always known. The additional features, i.e., the features for the side-information, include various categorical variables, all encoded as one-of-K. These include the web browser's User-Agent string (a weak fingerprint of a user), an indicator vector of the top-50k websites (for a single country) the user has visited in the past 30 days (URLs visited), the full URL of the site being visited (top-50k indicator vector, per country), a binned frequency of the times the user has clicked before (never, low, mid, high), as well as cross-features of the above and each banner ID, thereby tailoring coefficients for each ad. The resulting number of features is between 500k and 600k and the number of positive observations is around 250k, i.e., the problem is overcomplete in features, and thus $\ell_1$ regularization on the side-information model is added as a means of feature selection. The negative class is downsampled by a factor of 100, bringing it down to around 1.5M to 2.5M observations. The resulting model can be corrected in the intercept [8]. In our experience, downsampling the negative class this drastically and calibrating the model by intercept correction does not impact downstream performance.
We train different variations of the models to investigate in particular the usefulness of latent features in addition to a model using only explicit features. The different models are:
LR
Logistic regression on the side-information alone. The corresponding regularization strength we call $\lambda_w$.
LFL$_0$
The latent feature log-linear model using only the bias features, i.e., corresponding to a logistic regression on the two indicator features for domain and banner, respectively. The corresponding regularization weights we refer to as $\lambda_{\alpha_0}$ and $\lambda_{\beta_0}$, but as we describe later, in practice we use the same weight, $\lambda_0$, for both.
LFL$_K$
The latent feature log-linear model with $K > 0$ latent features, including bias features. The corresponding regularization weights we call $\lambda_\alpha$ and $\lambda_\beta$, which we also find in practice can be set to be equal, and for this introduce the weight $\lambda_{\alpha\beta}$.
LR+LFL$_K$
The combined model, with model order $K$ for the LFL model combined with the side-information model.
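The intercept correction for the downsampled negative class mentioned above can be sketched as follows. When only a fraction `keep_rate` of the nonclicks is retained, the odds learned on the sampled data are inflated by a factor of `1 / keep_rate`, so adding `log(keep_rate)` to the log-odds restores the original scale. The function name and the numbers are illustrative assumptions.

```python
import math

def corrected_probability(logit_sampled, keep_rate):
    """Undo negative downsampling: true odds = sampled odds * keep_rate."""
    corrected_logit = logit_sampled + math.log(keep_rate)
    return 1.0 / (1.0 + math.exp(-corrected_logit))

# A prediction of 0.5 on data where 1 in 100 nonclicks was kept
# corresponds to odds of 1:100 on the original data.
p = corrected_probability(0.0, keep_rate=0.01)
```

Since the correction shifts all log-odds by the same constant, it changes the calibration of the predictions but not their ranking.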
3.1 Tuning hyperparameters
The combined model with both latent features and side-information we dub LR+LFL$_K$. This model has up to 5 (!) hyperparameters that need tuning by cross-validation: $\lambda_w$, $\lambda_\alpha$, $\lambda_\beta$, $\lambda_{\alpha_0}$, and $\lambda_{\beta_0}$. [12] does not report whether they use individual $\lambda_\alpha$, $\lambda_\beta$, $\lambda_{\alpha_0}$, and $\lambda_{\beta_0}$ weights, but we consider tuning all of them individually highly infeasible. What we have found to be most effective is to use the same weight $\lambda_{\alpha\beta}$ for the latent dimensions as well as a shared bias weight $\lambda_0$, which narrows the search space down to three hyperparameters that must be tuned.
Tuning three hyperparameters is still a cumbersome task, in particular for large datasets, where an exhaustive search over a reasonable grid of three parameters becomes too time consuming. Instead we have had success using the following heuristic strategy for tuning these parameters:

(1) First run experiments for the logistic regression model alone and find a suitable regularization weight $\lambda_w$.
(2) Run experiments for the LFL$_0$ model (i.e., bias weights only) and find a suitable $\lambda_0$.
(3) Run experiments for a number of LFL$_K$ models with $K > 0$, with bias weights regularized by $\lambda_0$ fixed from (2), and find a suitable $\lambda_{\alpha\beta}$.
(4) Finally, train the combined LR+LFL$_K$ model with different $K$ and $\lambda_0$ fixed, but varying $\lambda_w$ as well as $\lambda_{\alpha\beta}$, both in the neighborhood of the values found in (1) and (3). If the results indicate performance could be improved in any direction away from that region, we run more experiments with the hyperparameters set in that direction.
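The staged strategy above can be sketched as a small search routine. It assumes a user-supplied `evaluate` callback that trains a model for a configuration and returns a validation score where higher is better. The callback, the configuration keys and the toy scoring function are hypothetical stand-ins, not part of the pipeline used in this work.

```python
import math

def neighbors(grid, value):
    """Grid values adjacent to (and including) the given value."""
    idx = grid.index(value)
    return grid[max(0, idx - 1): idx + 2]

def staged_search(evaluate, grid_w, grid_0, grid_ab, ranks):
    """Tune one group of hyperparameters at a time, as in steps (1) to (4)."""
    # (1) Side-information model alone.
    best_w = max(grid_w, key=lambda lw: evaluate({"lam_w": lw}))
    # (2) Bias-only latent model.
    best_0 = max(grid_0, key=lambda l0: evaluate({"rank": 0, "lam_0": l0}))
    # (3) Latent models with the bias weight fixed from step (2).
    best_ab = max(grid_ab, key=lambda lab: max(
        evaluate({"rank": k, "lam_0": best_0, "lam_ab": lab}) for k in ranks))
    # (4) Combined model, searching only near the values from (1) and (3).
    candidates = [
        {"rank": k, "lam_0": best_0, "lam_w": lw, "lam_ab": lab}
        for k in ranks
        for lw in neighbors(grid_w, best_w)
        for lab in neighbors(grid_ab, best_ab)
    ]
    return max(candidates, key=evaluate)

def toy_score(cfg):
    """Synthetic stand-in for a validation score with a known optimum."""
    score = 0.0
    if "lam_w" in cfg:
        score -= (math.log10(cfg["lam_w"]) + 2.0) ** 2    # best at 1e-2
    if "lam_0" in cfg:
        score -= (math.log10(cfg["lam_0"]) + 1.0) ** 2    # best at 1e-1
    if "lam_ab" in cfg:
        score -= math.log10(cfg["lam_ab"]) ** 2           # best at 1.0
    return score

grid = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
best = staged_search(toy_score, grid, grid, grid, ranks=[5, 10, 20])
```

The sketch omits the final check described in step (4), where the search region is extended if the best configuration lies on its boundary.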
To verify the validity of this approach, we have run experiments with the hyperparameters set to their optimal settings as per the above procedure, varying one at a time, including separate weights for the latent features and biases. In this way we do not find any increase in performance along any single direction in the space of hyperparameters.
4 Results and discussion
4.1 Validation set results and initialization
We use the first 8 days, i.e., train on 7 and test on the 8th, to find a reasonable range of hyperparameters that we will test over the entire period. That is, we use the first test day of a total of 23 days (30 days' worth of data, where the first 7 are only used for training) as our validation set. At the same time we initialize models with different parameters, which we use for warm starting the training on subsequent data. In the following we provide our results where we are testing performance on a single day, thereby gaining insights into hyperparameter values, model order ($K$) and regularization type ($\ell_1$ and $\ell_2$) for the latent features.
In Fig. 1 we show the results using $\ell_2$ regularization (see Eq. (2.3)), varying $\lambda_0$ with an LFL$_0$ model in (a), and in (b-c) varying $\lambda_{\alpha\beta}$ with $\lambda_0$ fixed, for different model orders and no side-information model. Results are not shown for experiments where the regularization weights were varied in larger grids, i.e., these plots focus on where the performances peak. What we also learn from plots (b-c) is that higher model orders are advantageous, but that this increase levels off at the higher model orders. This is in contrast to the model order reported as advantageous in [11]. We have also run experiments, not shown here, with even higher model orders and seen no further increase, and if anything a slight decrease in performance.
The same experiments using $\ell_1$ regularization are summarized in Fig. 2. Neither the experiments for the bias regularization (a) nor those for the latent factors (b-c) show as good performances as in the case of $\ell_2$ regularization, and the advantage of adding latent dimensions is harder to distinguish. Furthermore, the regions of interest seem more concentrated, i.e., the optima are more peaked. This leads us to the conclusion that smoothness in the latent dimensions ($\ell_2$) is preferable to sparsity ($\ell_1$), and thus we do not report further results using $\ell_1$ regularization.
In Fig. 3 we show experiments with varying regularization weights, as well as different model orders, for combined LR+LFL$_K$ models. We confirm a trend towards better performance using higher $K$, but again saturating at the higher model orders. We further notice that peak performances in terms of AUC do not necessarily agree completely with those for logistic loss (LL). There may be other explanations as to why that is, but we believe this is a consequence of the LL being sensitive to probabilities being improperly calibrated, while the AUC is not. Inspection of the different models seems to confirm this: where the models perform better in terms of logistic loss, the predicted clickthrough rate for the test set is (slightly) closer to the empirical one than for those models which maximize AUC. We expect that a post-calibration of the models beyond just an intercept correction could be beneficial for the reported logistic losses, but also note that this would not change the AUCs.
As mentioned in Section 2.3, we find that alternating between fitting the latent model and the side-information model is necessary. For the experiments in Fig. 3, we have alternated 7 times, which we have confirmed in practice ensures the performance has leveled off. An example supporting this claim is shown in Fig. 4 and serves to illustrate the general observation we make in all our experiments.
4.2 Results on 22 consecutive days
With just a subset of LR+LFL models hand-picked (marked by small x's in Fig. 3) from the experiments on the first test day, we run experiments on a daily basis while initializing with the models from the previous day. This sequential learning process reflects how modeling would also be run in production at Adform, and by warm starting the models from the previous day's coefficients, we do not expect that running multiple epochs of alternated fitting is required, i.e., this only needs to be done once for initialization.
In the following, the AUCs and logistic losses we report are daily averages of the performances for each banner. As opposed to the performances over an entire test data set that we have reported up until now, making daily averages per banner prevents the performance numbers from being entirely dominated by a single or a few banners, and instead assigns per-banner performances equal weights.
Figure 5: Daily per-banner average AUC of LR+LFL models using the optimal settings for different model orders (colored lines), relative to the side-information model alone. Shaded gray lines in the background trace all of the different configurations tested. In the upper panel the averages include every banner with 1 or more clicks on a day (between 900 and 1000 banners qualify each day), while in the lower panel all banners with fewer than 10 clicks on a particular day are filtered from the averages (between 400 and 500 banners qualify each day), hence decreasing the variance in the measures.
Reporting performances based on slices of data per banner further allows analysis of under which circumstances the latent feature models add a statistically significant improvement. In Fig. 5 we show the difference in AUC banner averages per day over the total of 22 days we use for testing. The upper panel shows the performances for all the banners with 1 or more clicks in each test set (day), while the lower shows averaged daily performances for only the banners with 10 or more clicks. It is apparent from these two figures that AUC scores based on very few clicks add significant variance to the daily averages, and the difference between the model orders is hard to spot. We also note that since we cannot evaluate AUC scores for banners without clicks in the test set, these are ignored entirely. For logistic loss, however, we can still report a performance for banners without clicks in the test set. The LR model used as the reference (0.0) in Fig. 5 uses the regularization strength $\lambda_w$ that we found to be optimal over the entire 22-day period when testing a grid of values.
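A minimal sketch of the per-banner averaging described above, assuming test records are available as (banner, label, score) tuples. The pairwise AUC computation and the helper names are illustrative and quadratic in the number of records, so they are only meant for small examples.

```python
def auc(labels, scores):
    """Probability that a click is scored above a nonclick (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None   # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def banner_averaged_auc(records, min_clicks=1):
    """Average the AUC over banners with at least min_clicks clicks."""
    by_banner = {}
    for banner, y, s in records:
        labels, scores = by_banner.setdefault(banner, ([], []))
        labels.append(y)
        scores.append(s)
    values = []
    for labels, scores in by_banner.values():
        if sum(labels) < min_clicks:
            continue  # banners with too few clicks are filtered out
        value = auc(labels, scores)
        if value is not None:
            values.append(value)
    return sum(values) / len(values) if values else None

# Toy day: banner "a" is ranked perfectly, "b" is ranked wrongly, "c" has no clicks.
records = [("a", 1, 0.9), ("a", 0, 0.1), ("a", 0, 0.4),
           ("b", 1, 0.2), ("b", 0, 0.6),
           ("c", 0, 0.3)]
daily_auc = banner_averaged_auc(records, min_clicks=1)
```

Each qualifying banner contributes equally to the average regardless of its number of views, which is the property motivating the per-banner reporting.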
In order to further quantify and investigate the impact different model orders have on performance, we summarize in Fig. 6 the relative differences over the 22 test days in box plots. Again, we show performances relative to the side-information model and for different inclusion criteria, based on the number of clicks in the test sets. In all cases, we see an increase in performance as the model order is increased, and this increase levels off at the higher model orders. The notches on the boxes are 5k-sample bootstraps of the medians, hence based on these we can say something about the statistical significance of these results. That is, nonoverlapping notches indicate a statistically significant difference between medians in a two-tailed null-hypothesis test. First of all, all model orders improve performances compared to the side-information model alone.
For both the AUCs and the logistic losses we see wide confidence intervals on the medians when banners with very few clicks per day are included. We still observe an increase in performance as the model order increases, but only in the case of logistic loss do the notches of some model orders barely clear overlapping with those of another.
When including only banners with more than 10 clicks in the summary statistics, the confidence intervals of the medians shrink, in particular in the case of logistic loss. However, the relative gains (means and medians) are also slightly lower. That is, there is a trend, albeit barely statistically significant, that there are higher gains among the banners with few clicks in the test sets than for those with more. Apart from this, there are now also statistically significant differences between the medians of some model orders, for AUC as well as for logistic loss.
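The bootstrapped confidence intervals on the medians can be sketched with a percentile bootstrap as below. The 5000 resamples match the notches described above, while the 95% level, the fixed seed and the toy daily differences are assumptions made for the example.

```python
import random
import statistics

def bootstrap_median_interval(values, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for the median of a small sample."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = int((1.0 + level) / 2.0 * n_boot) - 1
    return medians[lo_idx], medians[hi_idx]

# Toy relative AUC differences for 22 test days (made-up numbers).
daily_diffs = [0.001 * k for k in range(-3, 19)]
sample_median = statistics.median(daily_diffs)
lower, upper = bootstrap_median_interval(daily_diffs)
```

Two model orders whose intervals do not overlap would be read as having significantly different medians, in the same spirit as the notches in the box plots.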
It is worth noting that, regardless of the slice based on number of clicks in the test sets, the results agree that using the LR+LFL model yields higher performance than the LR model alone.
For the results in Fig. 6, while we find evidence that supports that latent features improve clickthrough prediction, the question remains how much this improves real-world performances. Indeed, the increments which the latent features introduce in the two measures we report here seem very small. When measuring AUC scores in particular, however, we are not the first to report small but significant improvements on the third decimal. As McMahan [9] (on web search ads) puts it:
The improvement is more significant than it first appears. A simple model with only features based on where the ads were shown achieves an AUC of nearly 0.80, and the inherent uncertainty in the clicks means that even predicting perfect probabilities would produce an AUC significantly less than 1.0, perhaps 0.85.[9, p.532]
Our data as well as our experiences in web banner ads support this statement, and we often also identify new features or model changes that yield these low levels of improvement, but which nevertheless remain consistent.
Another possibility, as an alternative to offline measures on held-out data such as AUC and logistic loss, is live A/B testing. Yet, before taking new features, models or technology into production, a prerequisite, to us at least, is to demonstrate consistent performance improvements on offline data. For the present work, we have not had the opportunity to test it live.
5 Conclusion
In this work we have reviewed a method for clickthrough rate prediction which combines collaborative filtering and matrix factorization with a side-information model and fuses the outputs to proper probabilities in $[0,1]$. We have provided details about this particular setup that are not found elsewhere and shared results from numerous experiments highlighting both the strengths and the weaknesses of the approach.
We test the model on multiple consecutive days of clickthrough data from Adform ad transaction logs in a manner which reflects a real-world pipeline and show that predictive performance can be increased using higher-order latent dimensions. We do see the performance level off at a model order different from the one suggested in another work [11], but this may be due to differences in the data sets, in particular how many side-information features are available and used.
Our numerous experiments detail a very involved phase for finding proper regions for the various hyperparameters of the combined model. This is particularly complicated, since the latent feature model and the sideinformation model need to be trained in several alternating steps, for each combination of hyperparameters. This we think is one of the most severe weaknesses of this modeling approach. We circumvent some of the complexity of finding good hyperparameters by using shared regularization strengths for both entities of the latent model and demonstrate, that in a sequential learning pipeline, it is only for initialization of the model, i.e., on the first training set, that we need multiple alternating steps.
For future studies, it would be particularly useful if the hyperparameters could instead be inferred from data. Yet, as we also show in our results, the objective differences (i.e., the evidence) that separate good models from bad ones are small, hence we expect that any technique, such as Type II maximum likelihood, would struggle to properly navigate such a landscape.
References
[1] Agarwal, D., Agrawal, R., Khanna, R., and Kota, N. Estimating rates of rare events with multiple hierarchies through scalable log-linear models. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’10), page 213, 2010.
[2] Andrew, G. and Gao, J. Scalable training of L1-regularized log-linear models. Proceedings of the 24th international conference on Machine learning, pages 33–40, 2007.
[3] Ashkan, A., Clarke, C. L. A., Agichtein, E., and Guo, Q. Estimating Ad Clickthrough Rate through Query Intent Analysis. 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, pages 222–229, 2009.
[4] Bennett, J. and Lanning, S. The Netflix prize. In KDD Cup and Workshop in conjunction with KDD, 2007.
[5] Chakrabarti, D., Agarwal, D., and Josifovski, V. Contextual advertising by combining relevance with click feedback. Proceedings of the 17th international conference on World Wide Web, pages 417–426, 2008.
[6] Dembczynski, K., Kotlowski, W., and Weiss, D. Predicting ads’ clickthrough rate with decision rules. Workshop on Targeting and Ranking in Online Advertising, 2008.
[7] Graepel, T., Borchert, T., Herbrich, R., and Com, R. M. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine. Search, (April 2009), 2010.
[8] King, G. and Zeng, L. Explaining Rare Events in International Relations. International Organization, 55(3):693–715, September 2001.
[9] McMahan, H. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. International Conference on Artificial Intelligence and Statistics, 2011.
[10] McMahan, H., Holt, G., and Sculley, D. Ad click prediction: a view from the trenches. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013.
[11] Menon, A., Chitrapura, K., and Garg, S. Response prediction using collaborative filtering with hierarchies and side-information. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), 2011.
[12] Menon, A. A. K. and Elkan, C. A log-linear model with latent features for dyadic prediction. 2010 IEEE International Conference on Data Mining, pages 364–373, December 2010.
[13] Nocedal, J. and Wright, S. J. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer-Verlag, New York, 1999.
[14] Richardson, M., Dominowska, E., and Ragno, R. Predicting clicks: estimating the click-through rate for new ads. Proceedings of the 16th international conference on World Wide Web, pages 521–530, 2007.
[15] Tsuruoka, Y., Tsujii, J., and Ananiadou, S. Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 (ACL-IJCNLP ’09), 1:477, 2009.