In advertising, one is interested in segmenting people and targeting ads based on those segments. With the rapid growth of the Web as a publishing platform, new advertising technologies have evolved, offering greater reach and new possibilities for targeted advertising. One such innovation is real-time bidding (RTB), where upon a user’s request for a specific URL, an online real-time auction is started amongst numerous participants, competing to serve their advertisement. The participants are allotted a limited time on the order of 100ms to query their data sources and come up with a bid, and the winner gets to display their advertisement. Thus, the lower the computational cost of each decision, the more complex the decision process that fits within the time limit. In this work, we evaluate how dimensionality reduction can be used to simplify predictors of click-through rate.
We focus on three techniques for dimensionality reduction of the large bipartite graph of user-website interactions, namely Singular Value Decomposition (SVD), Non-negative Matrix Factorization (NMF), and the Infinite Relational Model (IRM). We are interested in how the different levels of sparsity of the output features imposed by each of the models affect the performance in a click-through rate prediction task. In the RTB setup, where low latency and high throughput are both of crucial importance, database queries need to require as little I/O as possible, and computing model predictions needs to involve as few operations as possible. It is therefore attractive to “compress” very high-cardinality features using dimensionality reduction techniques and at the same time potentially benefit from recommender effects. This presents a trade-off between how much to compress in order to speed up I/O and calculations versus retaining, or exceeding, the performance of a high-cardinality feature.
By investigating the SVD, NMF, and the IRM, we essentially vary the compression of a high-cardinality feature (user-website engagements). The SVD produces dense singular vectors, thus requiring the most I/O as well as computation. The NMF is known to produce sparse components, meaning that zeros need not be stored, retrieved nor used in computations, and thus requires less I/O and computation. The IRM offers the most sparse representation, in that it produces hard cluster assignments, hence I/O and computation are reduced to a single weight per mode.
We present results that use the outputs of each of the dimensionality reduction techniques as predictors for a click-through rate prediction task. Our experiments show that a compact representation based on the NMF outperforms the other two options. If one however wants to use as little I/O and as simple computations as possible, the very compact representation from the IRM offers an interesting alternative. While incurring a limited loss of lift relative to the NMF, the IRM based predictors yield the fastest training speed of the downstream logistic regression classifier and also result in the most economical usage of features and the fastest possible computations at run-time. The IRM further has the advantage that it alleviates the need for model order selection, which is required in NMF. While the dense features produced by SVD are also useful in terms of predictive performance, they slow down the logistic regression training, and if low database I/O as well as fast computation of predictions is a priority, the SVD will not be of great use.
A key enabling factor in running the IRM with the data we present in this work is a sampler written for the graphics processing unit (GPU), without which learning of the IRM would not be feasible, at least not on a day-by-day schedule. To demonstrate the feasibility of the IRM as a large-scale sparse dimensionality reduction technique, we run final tests on a full-scale click-through rate data set and compare the performance with not using any dimensionality reduction.
1.1 Related work
Within the area of online advertising, computational targeting techniques are often faced with the challenge of very few observations per feature, particularly of the positive label (i.e., click, action, buy). A common approach to alleviate such label sparsity is to use collaborative filtering type algorithms, where one allows similar objects to “borrow” training data and thus constrains the related objects to have similar predicted behaviour. Studies of this kind are common for sponsored search advertising, where the objects of interest are query-ad pairs [5, 7], but the problem is similar to that of the user-website pairs that we study. To our knowledge we are the first to report on the usage of the IRM for co-clustering of user-website pairs, and the results should be applicable for query-ad click-through rate prediction as well.
By representing users in a compressed or latent space based on the user-website graph, we are essentially building profiles of users based on their behaviour and using those profiles for targeted advertising. This approach is well studied with many other types of profiles based on various types of information: For using explicit features available for predicting click-through rates,  is a good resource: Latent factor models have been proposed to model click-through rates in online advertising, see e.g. : For examples of using dimensionality reduction techniques in the construction of click-through rate models, such as the NMF, see . We believe our contribution to have applications in many such setups, either as an additional predictor or for incorporation as a priori information (priors, constraints, etc.) which can help with identifiability of the models.
We regard the problem of predicting click-through rates as a supervised learning task, i.e., given historical observations with features (or predictors) available about the user, webpage, and ad, along with the labels of actions (in our case click (1) or not-click (0)), the task is to learn a classifier for predicting unseen observations, given the features. This is the approach taken also by e.g.,. As in , we build a probabilistic model based on logistic regression for predicting click-through rates. What we add is additional features based on dimensionality reduction, as well as a sparsity inducing constraint based on the $\ell_1$-norm.
We are interested in estimation of features which can improve click-through rate predictions. In this work, we focus on introducing features from different dimensionality reduction techniques based on a bipartite graph of users and websites (URLs), and using them in a simple probabilistic model for click-through rate prediction, namely logistic regression. In the following, we introduce the dimensionality reduction techniques which we evaluate.
2.1 Dimensionality reduction techniques
2.1.1 Singular value decomposition
The singular value decomposition (SVD) of a rank-$R$ matrix $X$ is given as the factorization $X = U \Sigma V^\top$, where $U$ and $V$ are unitary matrices and hold the left and right singular vectors of $X$, respectively. The diagonal matrix $\Sigma$ contains the singular values of $X$. By selecting only the $K$ largest singular values of $\Sigma$, i.e., truncating all other singular values to zero, one obtains the approximation $\tilde{X} = U \tilde{\Sigma} V^\top$, which is the rank-$K$ optimal solution to $\arg\min_{\tilde{X}} \| X - \tilde{X} \|_F^2$. This truncation corresponds to disregarding the $R - K$ dimensions with the least variances of the bases $U$ and $V$ as noise.
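The truncation described above can be illustrated with a minimal NumPy sketch. The toy matrix, the function name `truncated_svd`, and the choice of two components are assumptions made for illustration only; for a graph of realistic size a sparse solver would be needed instead of a dense SVD.

```python
import numpy as np

def truncated_svd(X, k):
    """Return the factors (U_k, s_k, Vt_k) of the rank-k truncation of X."""
    # full_matrices=False gives the economy-size SVD; NumPy returns the
    # singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
# Hypothetical toy user-by-URL matrix (binary), standing in for the real graph.
X = (rng.random((8, 6)) < 0.4).astype(float)

U_k, s_k, Vt_k = truncated_svd(X, k=2)
X_k = U_k @ np.diag(s_k) @ Vt_k            # rank-2 approximation of X
err = np.linalg.norm(X - X_k, "fro") ** 2  # squared Frobenius error

# The squared error equals the sum of the discarded squared singular values.
s_all = np.linalg.svd(X, compute_uv=False)
discarded = float(np.sum(s_all[2:] ** 2))
```

The rows of `U_k` and the columns of `Vt_k` play the role of the dense per-user and per-URL features.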
2.1.2 Non-negative matrix factorization
Non-negative matrix factorization (NMF) received its name as well as its popularity in . NMF is a matrix factorization comparable to SVD, the crucial difference being that NMF decomposes into non-negative factors and imposes no orthogonality constraints. Given a non-negative input matrix $X$ with dimensions $I \times J$, NMF approximates the decomposition $X \approx WH$, where $W$ is an $I \times K$ non-negative matrix, $H$ a $K \times J$ non-negative matrix, and $K$ is the number of components. By selecting $K \ll \min(I, J)$ one approximates the decomposition of $X$, thereby disregarding some residual (unconstrained) matrix $E = X - WH$ as noise.
2.1.3 Infinite relational model
The Infinite Relational Model (IRM) has been proposed as a Bayesian generative model for graphs. Generative models can provide accurate predictions, and through inference of relevant latent variables they can inform the user about mesoscale structure. The IRM can be cast as a co-clustering approach for bipartite networks where the nodes of each mode are grouped simultaneously. A benefit of the IRM over existing co-clustering approaches is that the model explicitly exploits the statistical properties of binary graphs and allows the number of components of each mode to be inferred from the data.
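The generative process of the relational model with a finite number of clusters is given by:
1. Sample the row cluster probabilities, i.e., $\mu^{(1)} \sim \mathrm{Dirichlet}\left(\frac{\alpha^{(1)}}{K^{(1)}} e^{(1)}\right)$.
2. Sample row cluster assignments, i.e., $z_i^{(1)} \sim \mathrm{Multinomial}(\mu^{(1)})$.
3. Sample the column cluster probabilities, i.e., $\mu^{(2)} \sim \mathrm{Dirichlet}\left(\frac{\alpha^{(2)}}{K^{(2)}} e^{(2)}\right)$.
4. Sample column cluster assignments, i.e., $z_j^{(2)} \sim \mathrm{Multinomial}(\mu^{(2)})$.
5. Sample between cluster relations, i.e., $\eta_{kl} \sim \mathrm{Beta}(\beta^{+}, \beta^{-})$ for $k = 1, \dots, K^{(1)}$ and $l = 1, \dots, K^{(2)}$.
6. Generate links, i.e., $A_{ij} \sim \mathrm{Bernoulli}\left(\eta_{z_i^{(1)} z_j^{(2)}}\right)$ for all rows $i$ and columns $j$.
Here $K^{(1)}$ and $K^{(2)}$ denote the number of row and column clusters, respectively, whereas $e^{(1)}$ and $e^{(2)}$ are vectors of ones with size $K^{(1)}$ and $K^{(2)}$. The limits $K^{(1)} \to \infty$ and $K^{(2)} \to \infty$ lead to the Infinite Relational Model (IRM), which has an analytic solution given by the Chinese Restaurant Process (CRP) [15, 4, 17].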
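The generative steps above can be sketched for a finite number of clusters as follows. This is a forward sampler only, not the inference procedure; the symmetric Dirichlet, Beta, and Bernoulli forms and all parameter names (`alpha`, `beta_pos`, `beta_neg`) are assumptions based on the standard formulation of the model.

```python
import numpy as np

def sample_relational_model(n_rows, n_cols, k_rows, k_cols,
                            alpha=1.0, beta_pos=1.0, beta_neg=1.0, seed=0):
    """Draw a binary bipartite graph from a finite relational model."""
    rng = np.random.default_rng(seed)
    # Cluster probabilities for each mode from symmetric Dirichlet priors.
    mu_rows = rng.dirichlet(np.full(k_rows, alpha / k_rows))
    mu_cols = rng.dirichlet(np.full(k_cols, alpha / k_cols))
    # Hard cluster assignment for every row node and every column node.
    z_rows = rng.choice(k_rows, size=n_rows, p=mu_rows)
    z_cols = rng.choice(k_cols, size=n_cols, p=mu_cols)
    # Link probability between each pair of (row cluster, column cluster).
    eta = rng.beta(beta_pos, beta_neg, size=(k_rows, k_cols))
    # Links: Bernoulli draws with the probability of the two nodes' clusters.
    link_prob = eta[z_rows][:, z_cols]
    A = (rng.random((n_rows, n_cols)) < link_prob).astype(np.int8)
    return A, z_rows, z_cols, eta

A, z_rows, z_cols, eta = sample_relational_model(30, 20, k_rows=4, k_cols=3)
```

Each node carries a single cluster index, which is why the resulting features reduce to one weight per mode at prediction time.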
Rather than collapsing the parameters of the model, we apply blocked sampling that allows for parallel GPU computation. Moreover, the CRP is approximated by the truncated stick breaking construction (TSB), and the truncation error becomes insignificant when the model is estimated for large values of the truncation levels $K^{(1)}$ and $K^{(2)}$ (the maximum numbers of row and column clusters), see also .
2.2 Supervised learning using logistic regression
For learning a model capable of predicting click-through rates trained on historical data, we employ logistic regression with sparsity constraints; for further details see for instance [19, 20]. Given data consisting of $N$ observations with $P$-dimensional feature vectors $x_n$ and binary labels $y_n \in \{0, 1\}$, the probability of a positive event can be modeled with the logistic function and a single weight per feature, i.e., $p(y_n = 1 \mid x_n, w) = \frac{1}{1 + \exp(-w^\top x_n)}$, referred to as $p_n$ in the following. The optimization problem for learning the weights becomes

$$\hat{w} = \arg\min_{w} \; -\sum_{n=1}^{N} \left[ y_n \log p_n + (1 - y_n) \log(1 - p_n) \right] + \lambda \| w \|_1, \qquad (1)$$

where the regularization term $\lambda \| w \|_1$ is added to control overfitting and produce sparse solutions. For skewed target distributions, an intercept term $w_0$ may be included in the model by appending an all-one feature to all observations. The corresponding regularization term then needs to be fixed to zero.
For training the logistic regression model, one can use gradient-descent type optimizers, and quasi-Newton based algorithms are a popular choice. With the $\ell_1$-penalty, however, a little care must be taken, since off-the-shelf Newton-based solvers require the objective function to be differentiable, which (1) is not, because the penalty function is not differentiable at zero. In this work we base our logistic regression training on OWL-QN for batch learning. For online learning using stochastic gradient descent with $\ell_1$-penalization, see .
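To make the role of the non-differentiable penalty concrete, the sketch below minimizes an $\ell_1$-penalized logistic regression objective with proximal gradient descent (soft-thresholding). This is deliberately not OWL-QN; it is a simpler substitute chosen for brevity, and the function names, step-size rule, and iteration count are illustrative assumptions. The intercept is left unpenalized.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1: shrink towards zero, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, y, w, b, lam):
    """Summed negative Bernoulli log-likelihood plus lam * ||w||_1."""
    z = X @ w + b
    return float(np.sum(np.logaddexp(0.0, z) - y * z) + lam * np.sum(np.abs(w)))

def fit_l1_logreg(X, y, lam, n_iter=500):
    """X: (N, P) features, y: (N,) labels in {0, 1}. Returns (w, b)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    # Step 1/L, where L = 0.25 * sigma_max([X, 1])^2 bounds the curvature
    # of the smooth (likelihood) part of the objective.
    Xa = np.hstack([X, np.ones((n, 1))])
    step = 4.0 / (np.linalg.norm(Xa, 2) ** 2)
    for _ in range(n_iter):
        resid = 1.0 / (1.0 + np.exp(-(X @ w + b))) - y
        # Gradient step on the likelihood, then shrinkage for the penalty.
        w = soft_threshold(w - step * (X.T @ resid), step * lam)
        b = b - step * resid.sum()  # intercept: plain gradient step
    return w, b

# Hypothetical toy data: only the first feature is informative.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)
w_fit, b_fit = fit_l1_logreg(X_toy, y_toy, lam=1.0)
```

The shrinkage step is what sets weights exactly to zero, which is the source of the sparse solutions discussed in the text.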
Performing predictions with a logistic regression model is as simple as computing the logistic function on the features of a test observation $x$, $p(y = 1 \mid x, w) = \frac{1}{1 + \exp(-w^\top x)}$. In terms of speed, however, it matters how the features of $x$ are represented. In particular, for a binary feature vector $x \in \{0, 1\}^P$,

$$\frac{1}{1 + \exp(-w^\top x)} = \frac{1}{1 + \prod_{i : x_i = 1} \exp(-w_i)}. \qquad (2)$$

I.e., predicting for binary feature vectors scales in the number of non-zero elements of the feature vector, which makes computations considerably faster. Additionally, using the right-hand side of (2), the exponentiation $\exp(-w_i)$ can be performed when storing the weights in memory or a database, which saves further processing power. This has two consequences: 1) binary features are more desirable for making real-time predictions and 2) the sparser the features, the less computation time and I/O from databases is required.
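A small sketch of prediction with binary features may help here. It assumes the weights are kept in a dictionary keyed by feature identifier (a stand-in for a database or in-memory store) and that the exponentiated weights are precomputed at storage time; the names `precompute_factors` and `predict_binary` are hypothetical.

```python
import math

def precompute_factors(weights):
    """Map feature id -> exp(-w), computed once when the weights are stored."""
    return {f: math.exp(-w) for f, w in weights.items()}

def predict_binary(active_features, factors, intercept_factor=1.0):
    """Probability of a click given the ids of the non-zero binary features.

    intercept_factor is exp(-w0) for an intercept w0 (1.0 means no intercept).
    """
    prod = intercept_factor
    for f in active_features:
        # A feature without a stored weight has w = 0, i.e. a factor of 1.
        prod *= factors.get(f, 1.0)
    return 1.0 / (1.0 + prod)

weights = {"banner42|example.com": 0.5, "user_cluster_7": -1.2, "url_cluster_3": 2.0}
factors = precompute_factors(weights)
p_click = predict_binary(["banner42|example.com", "url_cluster_3"], factors)
```

Only one lookup and one multiplication are needed per active feature. With very many active features the product can overflow or underflow, in which case summing the weights and applying a single exponential is the safer variant.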
The data we use for our experiments originate from Adform’s ad transaction logs. In each transaction, e.g., when an ad is served, the URL where the ad is being displayed and a unique identifier of the user’s web browser are stored along with an identifier of the ad. Likewise, a transaction is logged when a user clicks an ad. From these logs, we prepare a data set over a period of time, using the final day for testing and the rest for training.
As a pre-processing step, all URLs in the transaction log are stripped of any query-string that might be trailing the URL (i.e., anything trailing a “?” in a URL, including the “?”); the log data are otherwise unprocessed.
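The pre-processing step can be sketched in a single function. The function name and the example URLs are hypothetical.

```python
def strip_query_string(url: str) -> str:
    """Drop everything from the first "?" onwards, including the "?" itself."""
    # str.partition splits at the first occurrence of the separator and
    # returns the whole string as the head when the separator is absent.
    return url.partition("?")[0]

cleaned = strip_query_string("http://example.com/news/article?id=7&ref=home")
```

URLs without a query-string pass through unchanged.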
3.1 Dimensionality reduction
From the training set transactions, we produce a binary bipartite graph of users in the first mode and URLs in the second mode. This is an unweighted, undirected graph where edges represent which URLs a user has seen, i.e., we do not use the number of times the user has engaged each URL. The graph we obtain has 9,304,402 unique users and 7,056,152 unique URLs. We denote this full graph UL.
As we will be repeating numerous supervised learning experiments, each of which can be quite time consuming for the entire training set, we do our main analysis based on experiments from a subset of transactions. As an inclusion criterion, we select the top 99,854 users based on the number of URLs they have seen, and URLs with visits from at least 100 unique users, resulting in 70,436 URLs being included. Based on those subsets of users and URLs, we produce a smaller transaction log, from which we also construct a bipartite graph, the sampled UL.
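The graph construction and the inclusion criterion can be sketched as below. The data layout (an iterable of `(user_id, url)` pairs) and the function names are assumptions, and so is the choice to count unique users per URL over the full graph before the user selection; the text does not state the order in which the two filters are applied.

```python
from collections import defaultdict

def build_graph(transactions):
    """transactions: iterable of (user_id, url). Returns user -> set of URLs."""
    seen = defaultdict(set)
    for user, url in transactions:
        seen[user].add(url)  # a set ignores how often a URL was visited
    return seen

def subsample(seen, n_top_users, min_users_per_url):
    """Keep the most active users and the URLs with enough unique visitors."""
    url_users = defaultdict(set)
    for user, urls in seen.items():
        for url in urls:
            url_users[url].add(user)
    keep_urls = {u for u, users in url_users.items()
                 if len(users) >= min_users_per_url}
    # Rank users by the number of distinct URLs they have seen.
    top_users = sorted(seen, key=lambda u: len(seen[u]), reverse=True)[:n_top_users]
    return {u: seen[u] & keep_urls for u in top_users}

log = [("u1", "a"), ("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c")]
graph = build_graph(log)
small = subsample(graph, n_top_users=2, min_users_per_url=2)
```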
3.1.1 Method details
For the sampled data for unsupervised learning, UL, we use the different dimensionality reduction techniques presented in Section 2 to obtain new per-user and per-URL features.
For obtaining the SVD-based dense left and right singular vectors, we use SVDS included with Matlab to compute the 500 largest singular values with their corresponding singular vectors. In the supervised learning, by joining our data by user and URL with the left and right singular vectors, respectively, we can use anything from 1 to 500 of the leading singular vectors for each modality as features.
We use the NMF Matlab Toolbox from  to decompose the sampled UL into non-negative factors. We use the original algorithm introduced in  with the least-squares objective and multiplicative updates (nmfrule option in the NMF Toolbox). With NMF we need to decide the model order, i.e., the number of components to fit in each of the non-negative factors. Hence, to investigate the influence of the NMF model order, we train NMF with model orders of 100, 300, and 500 components. We run the toolbox with the default configurations for convergence tolerance and maximum number of iterations.
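For readers unfamiliar with multiplicative updates, the following is a minimal NumPy sketch of the least-squares variant. It is not the NMF Toolbox implementation; the random initialization, the small constant `eps` guarding against division by zero, and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative X (I x J) as W (I x k) times H (k x J)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Element-wise multiplicative updates for the least-squares objective;
        # factors that start non-negative remain non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
# Hypothetical toy binary user-by-URL matrix.
X_toy = (rng.random((20, 15)) < 0.3).astype(float)
W, H = nmf_multiplicative(X_toy, k=4)
```

Rows of `W` and columns of `H` then serve as the per-user and per-URL loadings.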
As detailed in Section 2.1.3, we use the GPU sampling scheme from  for massively speeding up the computation of the IRM. The IRM estimation infers the number of components (i.e., clusters) separately for each modality; however, it does require that we input a maximum number of components for users and URLs. For the sampled UL, we run with a maximum of 500 components for both modalities and terminate the estimation after 500 iterations. The IRM infers 216 user clusters and 175 URL clusters for the sampled UL, i.e., well below the maximum we specify.
For the full dataset UL, we have only completed the dimensionality reduction using the IRM, thanks to our access to the aforementioned GPU sampling code. Again we run the IRM for 500 iterations, with a maximum of 500 components for each modality. The IRM infers 408 user clusters and 380 URL clusters for the full UL; again well below the specified maximum.
Running the SVD and NMF for a data set the size of the full UL within acceptable times (i.e., within a day or less) is in itself a challenge and requires specialized software, either utilizing GPUs or distributed computation (or both). As we have not had immediate access to any implementations capable of this, the SVD and NMF decompositions of the full UL remain as future work. Hence, for click-through rate prediction on the full data set, we demonstrate only the benefit of using the IRM cluster features over not using any dimensionality reduction.
3.2 Supervised learning
For testing the various dimensionality reductions, we construct several training and testing data sets from RTB logs with observations labeled as click (1) or non-click (0). The features we use are summarized in Table 1.
Table 1: Features used in the supervised learning experiments.
| Feature | Description |
|---|---|
| (BannerId, Url) | A one-of-K encoding of the cross-features between BannerId and Url, which indicates where a request has been made. This serves as a baseline predictor in all of our experiments. |
| UrlsVisited | A vector representation (zeros and ones) of URLs that a specific user has visited in the past. |
| UserCluster | A one-of-K encoding of which IRM cluster a specific user belongs to. |
| UrlCluster | A one-of-K encoding of which IRM cluster a specific URL belongs to. |
| UserSVDLoading | The continuous-valued $K$-dimensional left singular vector of a specific user from the SVD. |
| UrlSVDLoading | The continuous-valued $K$-dimensional right singular vector of a specific URL from the SVD. |
| UserNMFLoading | The continuous-valued cluster assignment vector of a specific user according to the NMF-$K$ decomposition. |
| UrlNMFLoading | The continuous-valued cluster assignment vector of a specific URL according to the NMF-$K$ decomposition. |
Based on the full set of users and URLs as well as the sub-sampled sets, detailed in Section 3.1, we prepare training and testing data sets based on the features of Table 1 for our logistic regression classifier. We denote these supervised learning data sets SL, in a full and a sampled version. The data are represented as matrices, i.e., with columns being features and rows being observations.
3.2.1 Method details
From the predictors of Table 1, we train a number of logistic regression classifiers, using $\ell_1$-penalization for sparsity, see also Section 2.2. For the stopping criterion, we run until the change of the objective value between iterations falls below 1e-6. As the classes (clicks vs. non-clicks) are highly unbalanced, we also learn an unpenalized intercept term. In order not to introduce any advantages (or disadvantages) to some predictors over others, we do not normalize the input features for any of the predictors in any way. Rather, we first select one regularization strength, $\lambda_1$, for the baseline predictor only, (BannerId, Url), and fix that through all other trials. In each experiment, we then use the dimensionality reduction predictors in addition to (BannerId, Url) and select another regularization strength, $\lambda_3$, jointly regularizing those predictors, but with $\lambda_1$ still fixed for (BannerId, Url). We compare to using UrlsVisited regularized by $\lambda_2$ in addition to (BannerId, Url) and henceforth refer to this model as NODR, short for no dimensionality reduction.
For each trained model, we measure the performance in terms of the negative Bernoulli log-likelihood (LL), which measures the mismatch between the observations and the predictions of the model, i.e., the lower, the better. The likelihoods we report are normalized with respect to the baseline likelihood of the click-through rate evaluated on the test set, such that in order to outperform the baseline, they should fall between 0 and 1.
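The evaluation measure can be sketched as follows. The reading that the baseline is a constant predictor equal to the empirical click-through rate of the test set, and that lift is the relative reduction in normalized likelihood, is an interpretation of the text and the reported numbers; the function names are hypothetical.

```python
import numpy as np

def neg_log_likelihood(y, p, eps=1e-12):
    """Negative Bernoulli log-likelihood of labels y under probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def normalized_ll(y, p):
    """Model likelihood divided by that of always predicting the test-set CTR."""
    ctr = np.full(len(y), y.mean())
    return neg_log_likelihood(y, p) / neg_log_likelihood(y, ctr)

def percent_lift(ll_model, ll_reference):
    """Relative improvement of ll_model over ll_reference, in percent."""
    return 100.0 * (ll_reference - ll_model) / ll_reference

# Hypothetical toy test set with two clicks in ten impressions.
y_toy = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=float)
score = normalized_ll(y_toy, np.where(y_toy == 1, 0.9, 0.1))
lift = percent_lift(88.15, 93.83)
```

A normalized value below 1 means the model beats the constant click-through rate predictor.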
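Table 2: Dimensionality, number of non-zeros (nnz), and sparsity of the predictors on the sampled data set.
| Feature | Dimension | nnz | Sparsity |
|---|---|---|---|
| (BannerId, Url) | 44086 | 143120 | 1 - 2.3e-5 |
| UrlsVisited | 42910 | 8824491 | 1 - 1.4e-3 |
| UserCluster | 216 | 143120 | 1 - 4.6e-3 |
| UrlCluster | 175 | 143120 | 1 - 5.7e-3 |
| UserSVDLoading, UrlSVDLoading | 100 / 300 / 500 | | 0 |
| UserNMFLoading | 100 / 300 / 500 | 4745568 / 9780078 / 13993847 | 0.67 / 0.77 / 0.80 |
| UrlNMFLoading | 100 / 300 / 500 | 4174552 / 14363612 / 23712222 | 0.71 / 0.67 / 0.67 |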
3.3 Results on SL
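Table 3: Results on the sampled data set.
| Model | $\lambda_1$ | $\lambda_2$ | $\lambda_3$ | Time (s) | nnz (all) | nnz (UrlsVisited) | LL | % Lift |
|---|---|---|---|---|---|---|---|---|
| Baseline: (BannerId, Url) | 0.8 | - | - | 9 | 3612 | - | 93.83 | 0.00 |
| NODR: baseline + UrlsVisited | 0.8 | 10.6 | - | 91 | 3943 | 760 | 88.15 | 6.05 |
| Baseline + 1 IRM predictor | 0.8 | - | 6.0e-4 | 13 | 3653 | - | 90.19 | 3.88 |
| Baseline + 1 IRM predictor (best IRM) | 0.8 | - | 7.0e-4 | 16 | 3674 | - | 89.84 | 4.25 |
| Best IRM + UrlsVisited | 0.8 | 15.4 | 7.0e-4 | 76 | 3861 | 366 | 87.78 | 6.45 |
| Baseline + 1 SVD predictor | 0.8 | - | 0.1 | 19 | 3479 | - | 89.87 | 4.22 |
| Baseline + 1 SVD predictor | 0.8 | - | 0.3 | 29 | 3502 | - | 89.73 | 4.37 |
| Baseline + 1 SVD predictor | 0.8 | - | 0.3 | 56 | 3552 | - | 89.73 | 4.37 |
| Baseline + user and URL SVD-100 | 0.8 | - | 7.0e-4 | 649 | 3409 | - | 89.15 | 4.99 |
| Baseline + user and URL SVD-300 (best SVD) | 0.8 | - | 7.0e-4 | 2487 | 3702 | - | 88.92 | 5.23 |
| Baseline + user and URL SVD-500 | 0.8 | - | 1.2e-3 | 4082 | 4027 | - | 89.55 | 4.56 |
| Best SVD + UrlsVisited | 0.8 | 10.8 | 7.0e-4 | 3291 | 4063 | 484 | 87.90 | 6.32 |
| Baseline + 1 NMF predictor | 0.8 | - | 6.0e-3 | 30 | 3453 | - | 89.38 | 4.74 |
| Baseline + 1 NMF predictor | 0.8 | - | 3.0e-3 | 40 | 3467 | - | 89.15 | 4.99 |
| Baseline + 1 NMF predictor | 0.8 | - | 2.0e-3 | 45 | 3521 | - | 88.68 | 5.49 |
| Baseline + user and URL NMF-100 | 0.8 | - | 5.0e-3 | 151 | 3389 | - | 89.05 | 5.09 |
| Baseline + user and URL NMF-300 (best NMF) | 0.8 | - | 6.0e-3 | 392 | 3468 | - | 87.89 | 6.33 |
| Baseline + user and URL NMF-500 | 0.8 | - | 4.0e-3 | 740 | 3635 | - | 93.59 | 0.26 |
| Best NMF + UrlsVisited | 0.8 | 11.2 | 6.0e-3 | 641 | 3973 | 680 | 86.91 | 7.38 |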
For the sampled data the number of observations are as follows: $N_{\text{train}}$=138,847 and $N_{\text{test}}$=4,273. In order to give the reader an idea about the dimensionalities of the features as well as their sparsity, in Table 2 we summarize some numbers on the predictors on the sampled data set. For the features (BannerId, Url), UserCluster, and UrlCluster, the number of non-zeros (nnz) and sparsities are somewhat trivial, since these are categorical features represented as one-of-K binary vectors. For the SVD features, UserSVDLoading and UrlSVDLoading, we see that the feature vectors become completely dense. For the NMF features, however, we can confirm the method’s ability to produce sparse components, i.e., only between 20-33% of the components turn up as non-zeros, yet they are far from the sparsities of the IRM cluster features, UserCluster and UrlCluster.
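Table 4: Results on the full data set.
| Model | $\lambda_1$ | $\lambda_2$ | $\lambda_3$ | Time (s) | nnz (all) | nnz (UrlsVisited) | LL | % Lift |
|---|---|---|---|---|---|---|---|---|
| Baseline: (BannerId, Url) | 0.7 | - | - | 34 | 14152 | - | 91.76 | 0.00 |
| NODR: baseline + UrlsVisited | 0.7 | 10.2 | - | 195 | 15673 | 3010 | 88.71 | 3.32 |
| Baseline + IRM | 0.7 | - | 1.2e-3 | 51 | 13604 | - | 89.35 | 2.63 |
| IRM + UrlsVisited | 0.7 | 10.2 | 1.2e-3 | 293 | 16018 | 2939 | 88.19 | 3.89 |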
In Table 3, we report the normalized likelihoods, lifts, and test-set optimal regularization strengths for UrlsVisited ($\lambda_2$) and for the dimensionality reduction features ($\lambda_3$), with varying features used for training. The lifts are all relative to the baseline model using only (BannerId, Url). The penalization strength $\lambda_1$ is selected as the one maximizing the performance of the classifier using only (BannerId, Url), and is kept fixed for all the other classifiers. Note that generalization of the penalization terms is an issue we do not currently address. The times reported in the table are the seconds it takes to train the logistic regression classifier. The two nnz columns give the number of non-zero weights of the resulting classifier for all the features and for the UrlsVisited feature only, respectively.
In order to further elaborate on the pros and cons of using the various dimensionality reduction techniques as features in the logistic regression classifier, we carry out another set of experiments for the best model of each technique in Table 3. We fix the values of $\lambda_1$ and $\lambda_3$ to the values from the best IRM, SVD, and NMF models, respectively, append UrlsVisited as an additional feature to each model, and then tune its regularization strength $\lambda_2$. The results are shown in the rows of Table 3 that have values for all three regularization strengths.
The final experiment we run is with the full data set, where we only evaluate the IRM based features and compare those to not using any dimensionality reduction. The number of observations for train and test are $N_{\text{train}}$=5,460,229 and $N_{\text{test}}$=188,867. We select the regularization terms as in the previous experiments. The results are reported in Table 4.
From Table 3 we first concentrate on the best model from each dimensionality reduction technique, i.e., the one with the highest lift among the models that do not use UrlsVisited. Comparing the lifts, we see that the NMF-300 features perform roughly one %-point better than the SVD-300 features, which in turn perform roughly another %-point better than the IRM cluster features. Comparing to the classifier using just (BannerId, Url) and UrlsVisited, i.e., no dimensionality reduction, we see that only the NMF-based classifier achieves slightly higher lift. Hence, using SVD or IRM based features as a replacement for the UrlsVisited feature would result in worse predictions. The number of non-zero weights drops from 3943 using UrlsVisited to 3468 using both NMF-300 features, which indicates that the NMF offers a more economical representation that can replace UrlsVisited while not sacrificing performance. We expect that the performance gain of NMF-300 is achieved by the implicit data grouping effects of NMF, i.e., recommender effects.
In terms of training speed, the IRM based features fare worst in terms of lift, but because each mode is a categorical value represented as a one-of-K binary vector, the input matrix is very sparse. This speeds up the training of our classifier significantly: the model trains at least an order of magnitude faster than with the other dimensionality reduction techniques and even significantly faster than the NODR model. Hence, if fast training is a priority, either no dimensionality reduction should be used or the IRM based features can be used, but at the cost of slightly lower lift.
We now turn to the results for the three models in Table 3 that combine UrlsVisited with the best IRM, SVD, and NMF features. Here we investigate how the learning of weights for the high-cardinality feature UrlsVisited is affected when combined with each of the optimal settings from the reduced dimension experiments. Again, observing the lifts, the NMF-300 based features combined with UrlsVisited obtain the highest lift. However, the IRM based features now outperform the SVD ones, and using any of the techniques in combination with UrlsVisited, we are able to obtain higher lifts than using UrlsVisited only.
For the training speed, we again see that training with IRM features is by far the fastest compared to SVD and NMF, and it is still faster than using UrlsVisited only. What is more interesting is the resulting number of non-zero weights, both in total and in the UrlsVisited feature alone. Of all the different dimensionality reductions as well as NODR, using the IRM based representation requires the fewest non-zero weights at its optimal settings. Additionally, recalling from Section 2.2 that predictions can be made computationally very efficient when the input features are binary indicator vectors, the IRM becomes all the more tractable. By combining the IRM based features with the explicit predictors (BannerId, Url) and UrlsVisited, our classifier is able to improve the lift over not using dimensionality reduction while reducing the need for fetching many weights for predictions, and with only a small reduction in lift compared to the more computationally expensive classifiers based on NMF and SVD.
Finally, in Table 4 we have run experiments using just the IRM based predictors with the full data set. The results confirm our findings from Table 3 and at the same time demonstrate both the feasibility of processing very large bipartite graphs using the IRM as well as the application of the user and URL clusters as predictors of click-through rates.
We have presented results that demonstrate the use of three bimodal dimensionality reduction techniques, SVD, NMF, and IRM, and their applications as predictors in a click-through rate data set. We show that the compact representation based on the NMF is, in terms of predictive performance, the best option. For applications where fast predictions are required, however, we show that the binary representation from the IRM is a viable alternative. The IRM based predictors yield the fastest training speed in the supervised learning stage, produce the sparsest model, and offer the fastest computations at run-time, while incurring only a limited loss of lift relative to the NMF. In applications such as real-time bidding, where fast database I/O and few computations are key to success, we recommend using IRM based features as predictors.
- [1] C Apte, E Bibelnieks, and R Natarajan. Segmentation-based modeling for advanced targeted marketing. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001.
- [2] G Golub and W Kahan. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial & Applied Mathematics, 1965.
- [3] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
- [4] C. Kemp, J.B. Tenenbaum, T.L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Artificial Intelligence, Proceedings of the National AAAI Conference on, 2006.
- [5] D Hillard, E Manavoglu, H Raghavan, C Leggetter, E Cantú-Paz, and R Iyer. The sum of its parts: reducing sparsity in click estimation with query segments. Information Retrieval, 14(3):315–336, February 2011.
- [6] TJ Hansen, M Morup, and LK Hansen. Non-parametric co-clustering of large scale sparse bipartite networks on the GPU. Machine Learning for Signal Processing (MLSP), 2011 IEEE International Workshop on, 2011.
- [7] KS Dave and V Varma. Learning the click-through rate for rare/new ads from similar ads. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 897–898, 2010.
- [8] M Richardson, E Dominowska, and R Ragno. Predicting clicks: estimating the click-through rate for new ads. Proceedings of the 16th international conference on World Wide Web, pages 521–530, 2007.
- [9] AK Menon, KP Chitrapura, S Garg, D Agarwal, and N Kota. Response prediction using collaborative filtering with hierarchies and side-information. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’11, page 141, 2011.
- [10] Y Chen and M Kapralov. Factor modeling for advertisement targeting. Advances in Neural Information Processing Systems, 2009.
- [11] MW Berry and M Browne. Email surveillance using non-negative matrix factorization. Computational & Mathematical Organization Theory, 11(3):249–264, 2005.
- [12] W Xu, X Liu, and Y Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, page 273. ACM, 2003.
- [13] BO Wahlgreen and LK Hansen. Large scale topic modeling made practical. In Machine Learning for Signal Processing (MLSP), 2011 IEEE International Workshop on, volume 1, pages 1–6. IEEE, September 2011.
- [14] QW Dong, XL Wang, and L Lin. Application of latent semantic analysis to protein remote homology detection. Bioinformatics, 22(3):285–290, 2006.
- [15] Zhao Xu, Volker Tresp, Kai Yu, and Hans-Peter Kriegel. Learning infinite hidden relational models. Uncertainty in Artificial Intelligence (UAI2006), 2006.
- [16] K. Nowicki and T.A.B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
- [17] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.
- [18] Z. Xu, V. Tresp, S. Yu, K. Yu, and H.-P. Kriegel. Fast inference in infinite hidden relational models. Mining and Learning with Graphs (MLG’07), 2007.
- [19] CM Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2007.
- [20] G Andrew and J Gao. Scalable training of L1-regularized log-linear models. Proceedings of the 24th international conference on Machine learning, pages 33–40, 2007.
- [21] Y Tsuruoka, J Tsujii, and S Ananiadou. Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 1:477, 2009.
- [22] Y Li and A Ngom. The non-negative matrix factorization toolbox for biological data mining. Source code for biology and medicine, 8(1):10, January 2013.