1 Introduction
In advertising, one is interested in segmenting people and targeting ads based on segments [1]. With the rapid growth of the Web as a publishing platform, new advertising technologies have evolved, offering greater reach and new possibilities for targeted advertising. One such innovation is real-time bidding (RTB), where upon a user's request for a specific URL, an online real-time auction is started amongst numerous participants competing to serve their advertisement. The participants are allotted a limited time on the order of 100 ms to query their data sources and come up with a bid, and the winner gets to display their advertisement. Thus, if the computational complexity can be reduced, more complex decision processes can be invoked. In this work, we evaluate how dimensionality reduction can be used to simplify predictors of click-through rate.
We focus on three techniques for dimensionality reduction of the large bipartite graph of user-website interactions, namely Singular Value Decomposition (SVD) [2], Non-negative Matrix Factorization (NMF) [3], and the Infinite Relational Model (IRM) [4]. We are interested in how the different levels of sparsity of the output features imposed by each of the models affect the performance in a click-through rate prediction task. In the RTB setup, where low latency and high throughput are both of crucial importance, database queries need to require as little I/O as possible, and computing model predictions needs to involve as few operations as possible. A good idea is therefore to "compress" very high-cardinality features using dimensionality reduction techniques and at the same time potentially benefit from recommender effects [5]. This presents a trade-off between how much to compress in order to speed up I/O and calculations versus retaining, or exceeding, the performance of a high-cardinality feature. By investigating the SVD, NMF, and the IRM, we essentially vary the compression of a high-cardinality feature (user-website engagements). The SVD produces dense singular vectors, thus requiring the most I/O as well as computation. The NMF is known to produce sparse components [3], meaning that zeros need not be stored, retrieved, nor used in computations, and thus requires less I/O and computation. The IRM offers the sparsest representation, in that it produces hard cluster assignments; hence I/O and computation are reduced to a single weight per mode.
We present results that use each of the dimensionality reduction techniques' outputs as predictors in a click-through rate prediction task. Our experiments show that a compact representation based on the NMF outperforms the other two options. If one, however, wants to use as little I/O and as simple computations as possible, the very compact representation from the IRM model offers an interesting alternative. While incurring a limited loss of lift relative to the NMF, the IRM-based predictors yield the fastest training speed of the downstream logistic regression classifier and also result in the most economical usage of features and the fastest possible computations at runtime. The IRM further has the advantage that it alleviates the need for model order selection, which is required for the NMF. While the dense features produced by the SVD also offer predictive performance, they inhibit the logistic regression training time, and if low database I/O as well as fast computation of predictions is a priority, the SVD will not be of great use.
A key enabling factor in running the IRM with the data we present in this work is a sampler written for the graphics processing unit (GPU) [6], without which learning of the IRM model would not be feasible, at least not on a day-by-day schedule. To demonstrate the feasibility of the IRM as a large-scale sparse dimensionality reduction, we run final tests on a full-scale click-through rate data set and compare the performance with using no dimensionality reduction.
1.1 Related work
Within the area of online advertising, computational targeting techniques are often faced with the challenge of very few observations per feature, particularly of the positive label (i.e., click, action, buy). A common approach to alleviating such label sparsity is to use collaborative filtering type algorithms, where one allows similar objects to "borrow" training data and thus constrains the related objects to have similar predicted behaviour. Such studies are common for sponsored search advertising, where the objects of interest are query-ad pairs [5, 7], but the problem is similar to that of the user-website pairs that we study. To our knowledge, we are the first to report on the usage of the IRM co-clustering of user-website pairs, and the results should be applicable to query-ad click-through rate prediction as well.
By representing users in a compressed or latent space based on the user-website graph, we are essentially building profiles of users based on their behaviour and using those profiles for targeted advertising. This approach is well studied with many other types of profiles based on various types of information: for predicting click-through rates using explicit features, [8] is a good resource; latent factor models have been proposed to model click-through rates in online advertising, see e.g. [9]; for examples of using dimensionality reduction techniques, such as the NMF, in the construction of click-through rate models, see [10]. We believe our contribution to have applications in many such setups, either as an additional predictor or for incorporation as a priori information (priors, constraints, etc.), which can help with identifiability of the models.
We regard the problem of predicting click-through rates as a supervised learning task, i.e., given historical observations with features (or predictors) available about the user, webpage, and ad, along with the labels of actions (in our case click (1) or non-click (0)), the task is to learn a classifier for predicting unseen observations, given the features. This is also the approach taken by e.g. [8]. As in [8], we build a probabilistic model based on logistic regression for predicting click-through rates. What we add are additional features based on dimensionality reduction, as well as a sparsity-inducing constraint based on the $\ell_1$ norm.
2 Methods
We are interested in the estimation of features which can improve click-through rate predictions. In this work, we focus on introducing features from different dimensionality reduction techniques based on a bipartite graph of users and websites (URLs), and using them in a simple probabilistic model for click-through rate prediction, namely logistic regression. In the following, we introduce the dimensionality reduction techniques which we evaluate.
2.1 Dimensionality reduction techniques
2.1.1 Singular value decomposition
The singular value decomposition (SVD) of a rank-$r$ matrix $A$ is given as the factorization $A = U \Sigma V^\top$, where the unitary matrices $U$ and $V$ hold the left and right singular vectors of $A$, respectively. The diagonal matrix $\Sigma$ contains the singular values of $A$. By selecting only the $d$ largest singular values of $\Sigma$, i.e., truncating all other singular values to zero, one obtains the approximation $A_d = U_d \Sigma_d V_d^\top$, which is the rank-$d$ optimal solution to $\min_{A_d} \|A - A_d\|_F$. This truncation corresponds to disregarding the dimensions with the least variance of the bases $U$ and $V$ as noise.
2.1.2 Non-negative matrix factorization
Non-negative matrix factorization (NMF) received its name as well as its popularity in [3]. NMF is a matrix factorization comparable to the SVD, the crucial difference being that NMF decomposes into non-negative factors and imposes no orthogonality constraints. Given a non-negative input matrix $X$ with dimensions $m \times n$, NMF approximates the decomposition $X \approx WH$, where $W$ is an $m \times d$ non-negative matrix, $H$ a $d \times n$ non-negative matrix, and $d$ is the number of components. By selecting $d < \min(m, n)$ one approximates the decomposition of $X$, thereby disregarding some residual (unconstrained) matrix as noise.
2.1.3 Infinite relational model
The Infinite Relational Model (IRM) has been proposed as a Bayesian generative model for graphs. Generative models can provide accurate predictions, and through inference of relevant latent variables they can inform the user about mesoscale structure. The IRM model can be cast as a co-clustering approach for bipartite networks where the nodes of each mode are grouped simultaneously. A benefit of the IRM model over existing co-clustering approaches is that the model explicitly exploits the statistical properties of binary graphs and allows the number of components of each mode to be inferred from the data.
The generative process for the Relational Model [4, 15, 16] is given by:
1. Sample the row cluster probabilities, i.e., $\mu \sim \mathrm{Dirichlet}(\frac{\alpha}{K} 1_K)$.
2. Sample row cluster assignments, i.e., $z_i \sim \mathrm{Categorical}(\mu)$.
3. Sample the column cluster probabilities, i.e., $\nu \sim \mathrm{Dirichlet}(\frac{\beta}{L} 1_L)$.
4. Sample column cluster assignments, i.e., $w_j \sim \mathrm{Categorical}(\nu)$.
5. Sample between-cluster relations, i.e., $\eta_{kl} \sim \mathrm{Beta}(a, b)$ for $k = 1, \dots, K$ and $l = 1, \dots, L$.
6. Generate links, i.e., $A_{ij} \sim \mathrm{Bernoulli}(\eta_{z_i w_j})$ for each row $i$ and column $j$.
Here $K$ and $L$ denote the number of row and column clusters, respectively, whereas $1_K$ and $1_L$ are vectors of ones of size $K$ and $L$. The limits $K \to \infty$ and $L \to \infty$ lead to the Infinite Relational Model (IRM), which has an analytic solution given by the Chinese Restaurant Process (CRP) [15, 4, 17].
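As a concrete illustration, the generative steps above can be sketched for a finite truncation as follows; the sizes, truncation levels, and hyperparameters (alpha, beta, a, b) are hypothetical choices for illustration, not the settings used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def irm_generate(n_rows, n_cols, K, L, alpha=1.0, beta=1.0, a=1.0, b=1.0):
    """Sample a synthetic bipartite graph from the (finite) IRM generative
    process: cluster probabilities, cluster assignments, between-cluster
    link probabilities, and finally Bernoulli links."""
    mu = rng.dirichlet(np.full(K, alpha / K))        # row cluster probabilities
    z = rng.choice(K, size=n_rows, p=mu)             # row cluster assignments
    nu = rng.dirichlet(np.full(L, beta / L))         # column cluster probabilities
    w = rng.choice(L, size=n_cols, p=nu)             # column cluster assignments
    eta = rng.beta(a, b, size=(K, L))                # between-cluster link probabilities
    A = rng.random((n_rows, n_cols)) < eta[z][:, w]  # Bernoulli links per (i, j)
    return A.astype(np.int8), z, w, eta

A, z, w, eta = irm_generate(200, 150, K=5, L=4)
```

Inference reverses this process: given the observed links A, one samples the assignments z and w and the relations eta from their posterior.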
Rather than collapsing the parameters of the model, we apply blocked sampling, which allows for parallel GPU computation [6]. Moreover, the CRP is approximated by the truncated stick-breaking construction (TSB), and the truncation error becomes insignificant when the model is estimated with sufficiently large truncation levels $K$ and $L$, see also [18].
2.2 Supervised learning using logistic regression
For learning a model capable of predicting click-through rates from historical data, we employ logistic regression with sparsity constraints; for further details see for instance [19, 20]. Given data consisting of $N$ observations with $D$-dimensional feature vectors $x_n$ and binary labels $y_n \in \{0, 1\}$, the probability of a positive event can be modeled with the logistic function and a single weight per feature, i.e., $p(y = 1 \mid x) = 1 / (1 + \exp(-w^\top x))$, referred to as $\sigma(w^\top x)$ in the following. The optimization problem for learning the weights becomes

  $\min_w \; -\sum_{n=1}^{N} \left[ y_n \log \sigma(w^\top x_n) + (1 - y_n) \log(1 - \sigma(w^\top x_n)) \right] + \lambda \|w\|_1$   (1)

where the penalty term $\lambda \|w\|_1$ is added to control overfitting and produce sparse solutions. For skewed target distributions, an intercept term may be included in the model by appending an all-one feature to all observations; the corresponding regularization term then needs to be fixed to zero.

For training the logistic regression model, one can use gradient-descent type optimizers, and quasi-Newton based algorithms are a popular choice. With the $\ell_1$ penalty, however, a little care must be taken, since off-the-shelf Newton-based solvers require the objective function to be differentiable, which (1) is not, as the penalty function is not differentiable at zero. In this work we base our logistic regression training on OWL-QN [20] for batch learning. For online learning using stochastic gradient descent with $\ell_1$ penalization, see [21].

Performing predictions with a logistic regression model is as simple as computing the logistic function on the features of a test observation, $\sigma(w^\top x)$. In terms of speed, however, it matters how the features of $x$ are represented. In particular, for a binary feature vector

  $w^\top x = \sum_{j=1}^{D} w_j x_j = \sum_{j : x_j = 1} w_j$   (2)

I.e., predicting for binary feature vectors scales in the number of non-zero elements of the feature vector, which makes computations considerably faster. Additionally, using the right-hand side of (2), the lookup of only the required weights can be performed when storing the weights in memory or a database, hence saving further processing power. This has two consequences: 1) binary features are more desirable for making real-time predictions, and 2) the sparser the features, the less computation time and I/O from databases is required.
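To illustrate the point of Eq. (2), a minimal sketch of predicting from only the indices of the active binary features; the function and variable names are our own, not from any particular implementation:

```python
import numpy as np

def predict_binary(weights, active_idx, intercept=0.0):
    """CTR prediction for a binary feature vector, using the right-hand
    side of Eq. (2): sum only the weights of the active features."""
    s = intercept + sum(weights[j] for j in active_idx)
    return 1.0 / (1.0 + np.exp(-s))

w = np.array([0.5, -1.2, 0.0, 2.0])
# Only two weight lookups instead of a full dot product over all features.
p_sparse = predict_binary(w, active_idx=[0, 3])
```

With weights stored in a key-value database, only the keys of the active features need to be fetched per prediction, which is exactly why sparser feature encodings reduce I/O.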
3 Experiments
The data we use for our experiments originate from Adform's ad transaction logs. In each transaction, e.g., when an ad is served, the URL where the ad is being displayed and a unique identifier of the user's web browser are stored along with an identifier of the ad. Likewise, a transaction is logged when a user clicks an ad. From these logs, we prepare a data set over a period of time, using the final day for testing and the rest for training.
As a preprocessing step, all URLs in the transaction log are stripped of any query-string that might be trailing the URL (query-string: anything trailing a "?" in a URL, including the "?"); the log data are otherwise unprocessed.
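A minimal sketch of such query-string stripping using Python's standard library; the actual preprocessing pipeline is not specified beyond this rule, and note that this sketch also drops any trailing "#fragment":

```python
from urllib.parse import urlsplit, urlunsplit

def strip_querystring(url: str) -> str:
    """Drop the query string (anything trailing '?', including the '?')."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))

strip_querystring("http://example.com/page?utm=x&id=7")  # -> "http://example.com/page"
```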
3.1 Dimensionality reduction
From the training set transactions, we produce a binary bipartite graph with users in the first mode and URLs in the second mode. This is an unweighted, undirected graph where edges represent which URLs a user has seen, i.e., we do not use the number of times the user has engaged each URL. The graph we obtain has 9,304,402 unique users and 7,056,152 unique URLs. We denote this graph UL.
As we will be repeating numerous supervised learning experiments, each of which can be quite time consuming for the entire training set, we do our main analysis based on experiments on a subset of transactions. As inclusion criteria, we select the top 99,854 users based on the number of URLs they have seen, and URLs with visits from at least 100 unique users, resulting in 70,436 URLs being included. Based on those subsets of users and URLs, we produce a smaller transaction log, from which we also construct a bipartite graph, denoted UL.
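The construction of such a binary bipartite graph can be sketched as follows; the function name and the tiny (user, URL) transaction format are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_user_url_graph(transactions):
    """Binary user-URL adjacency matrix from (user_id, url) pairs.
    Repeat visits collapse to a single unweighted edge."""
    users = {u: i for i, u in enumerate(sorted({u for u, _ in transactions}))}
    urls = {l: j for j, l in enumerate(sorted({l for _, l in transactions}))}
    rows = [users[u] for u, _ in transactions]
    cols = [urls[l] for _, l in transactions]
    data = np.ones(len(transactions))
    A = csr_matrix((data, (rows, cols)), shape=(len(users), len(urls)))
    A.data[:] = 1.0  # duplicate entries were summed; clip back to binary
    return A, users, urls
```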
3.1.1 Method details
For the sampled data for unsupervised learning, UL, we use the different dimensionality reduction techniques presented in Section 2 to obtain new per-user and per-URL features.
For obtaining the SVD-based dense left and right singular vectors, we use SVDS included with Matlab to compute the 500 largest singular values with their corresponding singular vectors. In the supervised learning, by joining our data by user and URL with the left and right singular vectors, respectively, we can use anything from 1 to 500 of the leading singular vectors for each modality as features.
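A comparable computation in Python, where SciPy's svds plays the role of Matlab's SVDS; the random sparse matrix and the number of components are placeholders for the real user-URL graph:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical stand-in for the binary user-URL matrix.
A = sparse_random(1000, 800, density=0.01, format="csr", random_state=0)

k = 50                    # number of leading singular triplets to keep
U, s, Vt = svds(A, k=k)   # SciPy returns singular values in ascending order
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

user_features = U * s     # dense per-user features (rows of U scaled by the singular values)
url_features = Vt.T * s   # dense per-URL features
```

Note that every row of these feature matrices is fully dense, which is the root of the SVD's I/O cost discussed above.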
We use the NMF Matlab Toolbox from [22] to decompose UL into non-negative factors. We use the original algorithm introduced in [3] with the least-squares objective and multiplicative updates (the nmfrule option in the NMF Toolbox). With NMF we need to decide the model order, i.e., the number of components to fit in each of the non-negative factors. Hence, to investigate the influence of the NMF model order, we train the NMF using model orders of 100, 300, and 500 components. We run the toolbox with the default configurations for convergence tolerance and maximum number of iterations.
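For reference, the multiplicative-update algorithm of [3] can be sketched in a few lines of numpy; this is a simplified stand-in for the toolbox's nmfrule implementation, with illustrative defaults:

```python
import numpy as np

def nmf_multiplicative(X, d, n_iter=200, eps=1e-9, seed=0):
    """Least-squares NMF with the multiplicative updates of Lee & Seung [3]:
    X (m x n) is approximated by W (m x d) @ H (d x n), both non-negative."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, d)) + eps
    H = rng.random((d, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # update H with W fixed
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # update W with H fixed
    return W, H
```

Because the updates are multiplicative, entries driven to zero stay at zero, which is one intuition for why NMF tends to produce the sparse components we exploit here.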
As detailed in Section 2.1.3, we use the GPU sampling scheme from [6] for massively speeding up the computation of the IRM model. The IRM estimation infers the number of components (i.e., clusters) separately for each modality; however, it does require that we input a maximum number of components for users and URLs. For UL, we run with a maximum of 500 components for both modalities and terminate the estimation after 500 iterations. The IRM infers 216 user clusters and 175 URL clusters for UL, i.e., well below the maximum we specify.
For the full dataset UL, we have only completed the dimensionality reduction using the IRM, which is thanks to our access to the aforementioned GPU sampling code. Again we run the IRM for 500 iterations, with a maximum of 500 components for each modality. The IRM infers 408 user clusters and 380 URL clusters for the full dataset, again well below the maximum.
Running the SVD and NMF for a data set of the full size within acceptable time (i.e., a day or less) is in itself a challenge and requires specialized software, either utilizing GPUs or distributed computation (or both). As we have not had immediate access to any implementation capable of this, the SVD and NMF decompositions of the full dataset remain future work. Hence, for click-through rate prediction on the full data set, we demonstrate only the benefit of using the IRM cluster features over not using any dimensionality reduction.
3.2 Supervised learning
For testing the various dimensionality reductions, we construct several training and testing data sets from RTB logs with observations labeled as click (1) or non-click (0). The features we use are summarized in Table 1.
Feature(s) | Description
(BannerId, Url) | A one-of-K encoding of the cross-feature between BannerId and Url, which indicates where a request has been made. This serves as a baseline predictor in all of our experiments.
UrlsVisited | A vector representation (zeros and ones) of URLs that a specific user has visited in the past.
UserCluster | A one-of-K encoding of which IRM cluster a specific user belongs to.
UrlCluster | A one-of-K encoding of which IRM cluster a specific URL belongs to.
UserSVDLoading | The continuous-valued left singular vector of a specific user from the SVD.
UrlSVDLoading | The continuous-valued right singular vector of a specific URL from the SVD.
UserNMFLoading | The continuous-valued cluster assignment vector of a specific user according to the NMF decomposition.
UrlNMFLoading | The continuous-valued cluster assignment vector of a specific URL according to the NMF decomposition.
Based on the full set of users and URLs as well as the subsampled sets, detailed in Section 3.1, we prepare training and testing data sets based on the features of Table 1 for our logistic regression classifier. We denote the full dataset SL and the sampled SL. The data are represented as matrices, i.e., with columns being features and rows being observations.
3.2.1 Method details
From the predictors of Table 1, we train a number of logistic regression classifiers, using $\ell_1$ penalization for sparsity, see also Section 2.2. As stopping criterion, we run until the change of the objective value between iterations falls below 1e-6. As the classes (clicks vs. non-clicks) are highly unbalanced, we also learn an unpenalized intercept term. In order not to introduce any advantages (or disadvantages) to some predictors over others, we do not normalize the input features for any of the predictors in any way. Rather, we first select one regularization strength, $\lambda_1$, for the baseline predictor (BannerId, Url) only, and fix it through all other trials. In each experiment, we then use other predictors in addition to the baseline and select another regularization strength, $\lambda_3$, jointly regularizing those predictors, with $\lambda_1$ still fixed for the baseline. We compare to using UrlsVisited, regularized by $\lambda_2$, in addition to the baseline, and henceforth refer to this model as NODR, short for no dimensionality reduction.
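A simplified sketch of such per-group $\ell_1$ regularization, using proximal gradient descent as a stand-in for OWL-QN [20]; all names, learning-rate settings, and the synthetic data are illustrative assumptions:

```python
import numpy as np

def train_grouped_l1(X, y, lam, group, n_iter=500, lr=0.1):
    """Logistic regression with a separate L1 strength per feature group.
    `lam` maps group id -> regularization strength; `group[j]` is the group
    of feature j. The intercept b is left unpenalized, as in our setup."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    thresh = lr * np.array([lam[g] for g in group])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        b -= lr * np.mean(p - y)                  # unpenalized intercept step
        w -= lr * (X.T @ (p - y) / n)             # gradient step on the loss
        w = np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)  # soft-threshold
    return w, b
```

Heavily penalized groups are driven exactly to zero by the soft-threshold step, mimicking the weight sparsity reported in the tables below.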
For each trained model, we measure the performance in terms of the negative Bernoulli log-likelihood (LL), which measures the mismatch between the observations and the predictions of the model, i.e., the lower, the better. The likelihoods we report are normalized with respect to the baseline likelihood of the click-through rate evaluated on the test set, such that in order to outperform the baseline, they should fall between 0 and 1.
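This normalization can be sketched as follows; the function name and the clipping constant are our own:

```python
import numpy as np

def normalized_log_likelihood(y_true, p_pred):
    """Negative Bernoulli log-likelihood, normalized by that of a constant
    predictor equal to the test-set CTR. Values below 1 beat the baseline."""
    eps = 1e-12
    p = np.clip(p_pred, eps, 1 - eps)
    ll_model = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    ctr = np.mean(y_true)
    ll_base = -np.mean(y_true * np.log(ctr) + (1 - y_true) * np.log(1 - ctr))
    return ll_model / ll_base
```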
Feature | Dimension | nnz | Sparsity
(BannerId, Url) | 44086 | 143120 | 1 − 2.3e-5
UrlsVisited | 42910 | 8824491 | 1 − 1.4e-3
UserCluster | 216 | 143120 | 1 − 4.6e-3
UrlCluster | 175 | 143120 | 1 − 5.7e-3
UserSVDLoading, UrlSVDLoading | 100 / 300 / 500 | – | 0
UserNMFLoading | 100 / 300 / 500 | 4745568 / 9780078 / 13993847 | 0.67 / 0.77 / 0.80
UrlNMFLoading | 100 / 300 / 500 | 4174552 / 14363612 / 23712222 | 0.71 / 0.67 / 0.67
3.3 Results on SL
Model ( λ1 , λ2 , λ3 ) | Time (s) | nnz | nnz (UrlsVisited) | LL | % Lift
( 0.8 , – , – ) | 9 | 3612 | – | 93.83 | 0.00
NODR
( 0.8 , 10.6 , – ) | 91 | 3943 | 760 | 88.15 | 6.05
IRM
( 0.8 , – , 6.0e-4 ) | 13 | 3653 | – | 90.19 | 3.88
( 0.8 , – , 7.0e-4 ) | 16 | 3674 | – | 89.84 | 4.25
( 0.8 , 15.4 , 7.0e-4 ) | 76 | 3861 | 366 | 87.78 | 6.45
SVD
( 0.8 , – , 0.1 ) | 19 | 3479 | – | 89.87 | 4.22
( 0.8 , – , 0.3 ) | 29 | 3502 | – | 89.73 | 4.37
( 0.8 , – , 0.3 ) | 56 | 3552 | – | 89.73 | 4.37
( 0.8 , – , 7.0e-4 ) | 649 | 3409 | – | 89.15 | 4.99
( 0.8 , – , 7.0e-4 ) | 2487 | 3702 | – | 88.92 | 5.23
( 0.8 , – , 1.2e-3 ) | 4082 | 4027 | – | 89.55 | 4.56
( 0.8 , 10.8 , 7.0e-4 ) | 3291 | 4063 | 484 | 87.90 | 6.32
NMF
( 0.8 , – , 6.0e-3 ) | 30 | 3453 | – | 89.38 | 4.74
( 0.8 , – , 3.0e-3 ) | 40 | 3467 | – | 89.15 | 4.99
( 0.8 , – , 2.0e-3 ) | 45 | 3521 | – | 88.68 | 5.49
( 0.8 , – , 5.0e-3 ) | 151 | 3389 | – | 89.05 | 5.09
( 0.8 , – , 6.0e-3 ) | 392 | 3468 | – | 87.89 | 6.33
( 0.8 , – , 4.0e-3 ) | 740 | 3635 | – | 93.59 | 0.26
( 0.8 , 11.2 , 6.0e-3 ) | 641 | 3973 | 680 | 86.91 | 7.38
For the sampled data, the number of observations is 138,847 for training and 4,273 for testing. In order to give the reader an idea about the dimensionalities of the features as well as their sparsity, we summarize some numbers on the predictors of the sampled data set in Table 2. For the one-of-K encoded features, the numbers of non-zeros (nnz) and sparsities are somewhat trivial, since these are categorical features represented as binary indicator vectors. For the SVD features, we see that the feature vectors are completely dense. For the NMF features, however, we can confirm the method's ability to produce sparse components, i.e., only between 20 and 33% of the components turn up as non-zeros, yet they are far from the sparsities of the IRM cluster features.
Model ( λ1 , λ2 , λ3 ) | Time (s) | nnz | nnz (UrlsVisited) | LL | % Lift
( 0.7 , – , – ) | 34 | 14152 | – | 91.76 | 0.00
NODR
( 0.7 , 10.2 , – ) | 195 | 15673 | 3010 | 88.71 | 3.32
IRM
( 0.7 , – , 1.2e-3 ) | 51 | 13604 | – | 89.35 | 2.63
( 0.7 , 10.2 , 1.2e-3 ) | 293 | 16018 | 2939 | 88.19 | 3.89
In Table 3, we report the normalized likelihoods, lifts, and test-set optimal regularization strengths, with varying features used for training. The lifts are all relative to the baseline model. The penalization strength $\lambda_1$ is selected as the one maximizing the performance of the classifier using only the baseline feature, and is kept fixed for all the other classifiers. Note that generalization of the penalization terms is an issue we do not currently address. The times reported in the table are the seconds it takes to train the logistic regression classifier. The two nnz columns are the respective numbers of non-zero weights of the resulting classifier, for all the features and for the UrlsVisited feature only.
In order to further elaborate on the pros and cons of using the various dimensionality reduction techniques as features in the logistic regression classifier, we carry out another set of experiments for the best models in Table 3. We fix $\lambda_1$ and $\lambda_3$ to the values from the best IRM, SVD, and NMF models, respectively, append UrlsVisited as an additional feature with each model, and then tune the regularization strength $\lambda_2$. The results are shown in the corresponding combined rows of Table 3.
The final experiment we run is with the full data set, where we only evaluate the IRM-based features and compare those to not using any dimensionality reduction. The number of observations is 5,460,229 for training and 188,867 for testing. We select the regularization terms as in the previous experiments. The results are reported in Table 4.
4 Discussion
From Table 3, we first concentrate on the best models from each dimensionality reduction. Comparing the lifts, we see that the NMF300 features perform roughly one percentage point better than the SVD300 features, which in turn perform roughly another percentage point better than the IRM cluster features. Comparing to the classifier using the baseline and UrlsVisited features, i.e., no dimensionality reduction, we see that only the NMF-based classifier achieves a slightly higher lift. Hence, using SVD- or IRM-based features as a replacement for the UrlsVisited feature would result in worse predictions. Seeing the number of non-zero weights drop from 3943 using UrlsVisited to 3468 using both NMF300 features indicates that the NMF offers a more economical representation, which can replace UrlsVisited while not sacrificing performance. We expect that the performance gain of NMF300 is achieved by the implicit data grouping effects of the NMF, i.e., recommender effects.
In terms of training speed, we see that while the IRM-based features fare worst in terms of lift, the fact that each mode is a categorical value represented as a one-of-K binary vector makes the input matrix very sparse. This speeds up the training of our classifier significantly: the model trains at least an order of magnitude faster than with the other dimensionality reduction techniques, and even significantly faster than the NODR model. Hence, if fast training is a priority, either no dimensionality reduction should be used, or the IRM-based features can be used, at the cost of slightly lower lift.
We now turn to the results for the combined models in Table 3. Here we investigate how the learning of weights for the high-cardinality UrlsVisited feature is affected when combined with each of the optimal settings from the reduced-dimension experiments. Again observing the lifts, the NMF300-based features combined with UrlsVisited obtain the highest lift. However, the IRM-based features now outperform the SVD ones, and using either of the techniques in combination with UrlsVisited, we are able to obtain higher lifts than using UrlsVisited alone.
For the training speed, we again see that training using the IRM features is by far the fastest amongst the dimensionality reduction techniques, and it is still faster than using UrlsVisited only. What is more interesting is the resulting number of non-zero weights, both in total and in the UrlsVisited feature alone. Of all the different dimensionality reductions as well as NODR, the IRM-based representation requires the fewest non-zero weights at its optimal settings. Additionally, recalling from Section 2.2 that predictions can be made computationally very efficient when the input features are binary indicator vectors, the IRM becomes all the more attractive. By combining the IRM-based features with the explicit predictors, our classifier is able to improve the lift over not using dimensionality reduction, while reducing the need for fetching many weights for predictions, and with only a small reduction in lift compared to the more computationally expensive classifiers based on NMF and SVD.
Finally, in Table 4, we report experiments using just the IRM-based predictors with the full data set. The results confirm our findings from Table 3 and at the same time demonstrate both the feasibility of processing very large bipartite graphs using the IRM and the applicability of the user and URL clusters as predictors of click-through rates.
5 Conclusion
We have presented results that demonstrate the use of three bimodal dimensionality reduction techniques, SVD, NMF, and IRM, and their application as predictors on a click-through rate data set. We show that the compact representation based on the NMF is, in terms of predictive performance, the best option. For applications where fast predictions are required, however, we show that the binary representation from the IRM model is a viable alternative. The IRM-based predictors yield the fastest training speed in the supervised learning stage, produce the sparsest model, and offer the fastest computations at runtime, while incurring only a limited loss of lift relative to the NMF. In applications such as real-time bidding, where fast database I/O and few computations are key to success, we recommend using IRM-based features as predictors.
References
 [1] C. Apte, E. Bibelnieks, and R. Natarajan. Segmentation-based modeling for advanced targeted marketing. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001.
 [2] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial & Applied Mathematics, 1965.
 [3] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
 [4] C. Kemp, J.B. Tenenbaum, T.L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the National AAAI Conference on Artificial Intelligence, 2006.
 [5] D. Hillard, E. Manavoglu, H. Raghavan, C. Leggetter, E. Cantú-Paz, and R. Iyer. The sum of its parts: reducing sparsity in click estimation with query segments. Information Retrieval, 14(3):315–336, February 2011.
 [6] T.J. Hansen, M. Mørup, and L.K. Hansen. Non-parametric co-clustering of large scale sparse bipartite networks on the GPU. Machine Learning for Signal Processing (MLSP), 2011 IEEE International Workshop on, 2011.
 [7] K.S. Dave and V. Varma. Learning the click-through rate for rare/new ads from similar ads. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 897–898, 2010.
 [8] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. Proceedings of the 16th international conference on World Wide Web, pages 521–530, 2007.
 [9] A.K. Menon, K.P. Chitrapura, S. Garg, D. Agarwal, and N. Kota. Response prediction using collaborative filtering with hierarchies and side-information. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '11), page 141, 2011.
 [10] Y. Chen and M. Kapralov. Factor modeling for advertisement targeting. Advances in Neural Information Processing Systems, 2009.
 [11] M.W. Berry and M. Browne. Email surveillance using non-negative matrix factorization. Computational & Mathematical Organization Theory, 11(3):249–264, 2005.
 [12] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, page 273. ACM, 2003.
 [13] B.O. Wahlgreen and L.K. Hansen. Large scale topic modeling made practical. In Machine Learning for Signal Processing (MLSP), 2011 IEEE International Workshop on, pages 1–6. IEEE, September 2011.
 [14] Q.W. Dong, X.L. Wang, and L. Lin. Application of latent semantic analysis to protein remote homology detection. Bioinformatics, 22(3):285–290, 2006.
 [15] Zhao Xu, Volker Tresp, Kai Yu, and Hans-Peter Kriegel. Learning infinite hidden relational models. Uncertainty in Artificial Intelligence (UAI 2006), 2006.
 [16] K. Nowicki and T.A.B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
 [17] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.
 [18] Z. Xu, V. Tresp, S. Yu, K. Yu, and H.-P. Kriegel. Fast inference in infinite hidden relational models. Mining and Learning with Graphs (MLG'07), 2007.
 [19] C.M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2007.
 [20] G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. Proceedings of the 24th international conference on Machine learning, pages 33–40, 2007.
 [21] Y. Tsuruoka, J. Tsujii, and S. Ananiadou. Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 1:477, 2009.
 [22] Y. Li and A. Ngom. The non-negative matrix factorization toolbox for biological data mining. Source Code for Biology and Medicine, 8(1):10, January 2013.