Distributed Flexible Nonlinear Tensor Factorization

04/27/2016 ∙ by Shandian Zhe, et al. ∙ University of Cambridge Purdue University 0

Tensor factorization is a powerful tool to analyse multi-way data. Compared with traditional multi-linear methods, nonlinear tensor factorization models are capable of capturing more complex relationships in the data. However, they are computationally expensive and may suffer severe learning bias in case of extreme data sparsity. To overcome these limitations, in this paper we propose a distributed, flexible nonlinear tensor factorization model. Our model can effectively avoid the expensive computations and structural restrictions of the Kronecker-product in existing TGP formulations, allowing an arbitrary subset of tensorial entries to be selected to contribute to the training. At the same time, we derive a tractable and tight variational evidence lower bound (ELBO) that enables highly decoupled, parallel computations and high-quality inference. Based on the new bound, we develop a distributed inference algorithm in the MapReduce framework, which is key-value-free and can fully exploit the memory cache mechanism in fast MapReduce systems such as SPARK. Experimental results fully demonstrate the advantages of our method over several state-of-the-art approaches, in terms of both predictive performance and computational efficiency. Moreover, our approach shows a promising potential in the application of Click-Through-Rate (CTR) prediction for online advertising.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


This project collects the different accepted papers and their link to Arxiv or Gitxiv

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Tensors, or multidimensional arrays, are generalizations of matrices (from binary interactions) to high-order interactions between multiple entities. For example, we can extract a three-mode tensor (user, advertisement, context) from online advertising data. To analyze tensor data, people usually turn to factorization approaches that use a set of latent factors to represent each entity and model how the latent factors interact with each other to generate tensor elements. Classical tensor factorization models include Tucker  [24] and CANDECOMP/PARAFAC (CP) [8] decompositions, which have been widely used in real-world applications. However, because they all assume a multi-linear interaction between the latent factors, they are unable to capture more complex, nonlinear relationships. Recently,  Xu et al. [25] proposed Infinite Tucker decomposition (InfTucker), which generalizes the Tucker model to infinite feature space using a Tensor-variate Gaussian process (TGP) and thus is powerful to model intricate nonlinear interactions. However, InfTucker and its variants [28, 29] are computationally expensive, because the Kronecker product between the covariances of all the modes requires the TGP to model the entire tensor structure. In addition, they may suffer from the extreme sparsity of real-world tensor data, i.e., when the proportion of the nonzero entries is extremely low. As is often the case, most of the zero elements in real tensors are meaningless: they simply indicate missing or unobserved entries. Incorporating all of them in the training process may affect the factorization quality and lead to biased predictions.

To address these issues, in this paper we propose a distributed, flexible nonlinear tensor factorization model, which has several important advantages. First, it can capture highly nonlinear interactions in the tensor, and is flexible enough to incorporate arbitrary subset of (meaningful) tensorial entries for the training. This is achieved by placing Gaussian process priors over tensor entries, where the input is constructed by concatenating the latent factors from each mode and the intricate relationships are captured by using the kernel function. By using such a construction, the covariance function is then free of the Kronecker-product structure, and as a result users can freely choose any subset of tensor elements for the training process and incorporate prior domain knowledge. For example, one can choose a combination of balanced zero and nonzero elements to overcome the learning bias. Second, the tight variational evidence lower bound (ELBO) we derived using functional derivatives and convex conjugates subsumes optimal variational posteriors, thus evades inefficient, sequential E-M updates and enables highly efficient, parallel computations as well as improved inference quality. Moreover, the new bound allows us to develop a distributed, gradient-based optimization algorithm. Finally, we develop a simple yet very efficient procedure to avoid the data shuffling operation, a major performance bottleneck in the (key-value) sorting procedure in MapReduce

. That is, rather than sending out key-value pairs, each mapper simply calculates and sends a global gradient vector without keys. This key-value-free procedure is general and can effectively prevent massive disk IOs and fully exploit the memory cache mechanism in fast

MapReduce systems, such as SPARK.

Evaluation using small real-world tensor data have fully demonstrated the superior prediction accuracy of our model in comparison with InfTucker and other state-of-the-art. On large tensors with millions of nonzero elements, our approach is significantly better than, or at least as good as two popular large-scale nonlinear factorization methods based on TGP: one uses hierarchical modeling to perform distributed infinite Tucker decomposition [28]; the other further enhances InfTucker by using Dirichlet process mixture prior over the latent factors and employs an online learning scheme [29]. Our method also outperforms GigaTensor [11], a typical large-scale CP factorization algorithm, by a large margin. In addition, our method achieves faster training speed and enjoys almost linear scalability on the number of computational nodes. We apply our model to CTR prediction for online advertising and achieves a significant,

improvement over the popular logistic regression and linear SVM approaches.

2 Background

We first introduce the background knowledge. For convenience, we will use the same notations in [25]. Specifically, we denote a -mode tensor by , where the -th mode is of dimension . The tensor entry at location () is denoted by . To introduce Tucker decomposition, we need to generalize matrix-matrix products to tensor-matrix products. Specifically, a tensor can multiply with a matrix at mode when its dimension at mode- is consistent with the number of columns in , i.e., . The product is a new tensor, with size . Each element is calculated by

The Tucker decomposition model uses a latent factor matrix in each mode and a core tensor and assumes the whole tensor is generated by . Note that this is a multilinear function of and . It can be further simplified by restricting and the off-diagonal elements of to be . In this case, the Tucker model becomes CANDECOMP/PARAFAC (CP).

The infinite Tucker decomposition (InfTucker) generalizes the Tucker model to infinite feature space via a tensor-variate Gaussian process (TGP) [25]. Specifically, in a probabilistic framework, we assign a standard normal prior over each element of the core tensor , and then marginalize out

to obtain the probability of the tensor given the latent factors:


where is the vectorized whole tensor, and is the Kronecker-product. Next, we apply the kernel trick to model nonlinear interactions between the latent factors: Each row of the latent factors is replaced by a nonlinear feature transformation and thus an equivalent nonlinear covariance matrix is used to replace , where is the covariance function. After the nonlinear feature mapping, the original Tucker decomposition is performed in an (unknown) infinite feature space. Further, since the covariance of is a function of the latent factors , Equation (1) actually defines a Gaussian process (GP) on tensors, namely tensor-variate GP (TGP) [25], where the input are based on . Finally, we can use different noisy models to sample the observed tensor . For example, we can use Gaussian models and Probit models for continuous and binary observations, respectively.

3 Model

Despite being able to capture nonlinear interactions, InfTucker may suffer from the extreme sparsity issue in real-world tensor data sets. The reason is that its full covariance is a Kronecker-product between the covariances over all the modes— (see Equation (1)). Each is of size and the full covariance is of size . Thus TGP is projected onto the entire tensor with respect to the latent factors , including all zero and nonzero elements, rather than a (meaningful) subset of them. However, the real-world tensor data are usually extremely sparse, with a huge number of zero entries and a tiny portion of nonzero entries. On one hand, because most zero entries are meaningless—they are either missing or unobserved, using them can adversely affect the tensor factorization quality and lead to biased predictions; on the other hand, incorporating numerous zero entries into GP models will result in large covariance matrices and high computational costs. Although Zhe et al. [28, 29]

proposed to improve the scalability by modeling subtensors instead, the sampled subtensors can still be very sparse. Even worse, because subtensors are typically restricted to a small dimension due to the efficiency considerations, it is often possible to encounter one that does not contain any nonzero entry. This may further incur numerical instabilities in model estimation.

To address these issues, we propose a flexible Gaussian process tensor factorization model. While inheriting the nonlinear modeling power, our model disposes of the Kronecker-product structure in the full covariance and can therefore select an arbitrary subset of tensor entries for training.

Specifically, given a tensor , for each tensor entry (), we construct an input by concatenating the corresponding latent factors from all the modes: , where is the -th row in the latent factor matrix for mode . We assume that there is an underlying function such that . This function is unknown and can be complex and nonlinear. To learn the function, we assign a Gaussian process prior over : for any set of tensor entries , the function values

are distributed according to a multivariate Gaussian distribution with mean

and covariance determined by :

where is a (nonlinear) covariance function.

Because , there is no Kronecker-product structure constraint and so any subset of tensor entries can be selected for training. To prevent the learning process to be biased toward zero, we can use a set of entries with balanced zeros and nonzeros. Furthermore, useful domain knowledge can also be incorporated to select meaningful entries for training. Note, however, that if we still use all the tensor entries and intensionally impose the Kronecker-product structure in the full covariance, our model is reduced to InfTucker. Therefore, from the modeling perspective, the proposed model is more general.

We further assign a standard normal prior over the latent factors . Given the selected tensor entries , the observed entries are sampled from a noise model . In this paper, we deal with both continuous and binary observations. For continuous data, we use the Gaussian model, and the joint probability is


where . For binary data, we use the Probit model in the following manner. We first introduce augmented variables and then decompose the Probit model into and where is the indicator function. Then the joint probability is


4 Distributed Variational Inference

Real-world tensor data often comprise a large number of entries, say, millions of non-zeros and billions of zeros. Even by only using nonzero entries for training, exact inference of the proposed model may still be intractable. This motivates us to develop a distributed variational inference algorithm, presented as follows.

4.1 Tractable Variational Evidence Lower Bound

Since the GP covariance term — (see Equations (2) and (3)) intertwines all the latent factors, exact inference in parallel is difficult. Therefore, we first derive a tractable variational evidence lower bound (ELBO), following the sparse Gaussian process framework by Titsias [23]. The key idea is to introduce a small set of inducing points and latent targets (). Then we augment the original model with a joint multivariate Gaussian distribution of the latent tensor entries and targets ,

where , , and . We use Jensen’s inequality and conditional Gaussian distributions to construct the ELBO. Using a very similar derivation to [23], we can obtain a tractable ELBO for our model on continuous data, , where


Here , is the variational posterior for the latent targets and , where and . Note that is decomposed into a summation of terms involving individual tensor entries . The additive form enables us to distribute the computation across multiple computers.

For binary data, we introduce a variational posterior and make the mean-field assumption that . Following a similar derivation to the continuous case, we can obtain a tractable ELBO for binary data, , where


One can simply use the standard Expectation-maximization (EM) framework to optimize (

4) and (5) for model inference, i.e., the E step updates the variational posteriors and the M step updates the latent factors , the inducing points and the kernel parameters. However, the sequential E-M updates can not fully exploit the paralleling computing resources. Due to the strong dependencies between the E step and the M step, the sequential E-M updates may take a large number of iterations to converge. Things become worse for binary case: in the E step, the updates of and are also dependent on each other, making a parallel inference even less efficient.

4.2 Tight and Parallelizable Variational Evidence Lower Bound

In this section, we further derive tight(er) ELBOs that subsume the optimal variational posteriors for and . Thereby we can avoid the sequential E-M updates to perform decoupled, highly efficient parallel inference. Moreover, the inference quality is very likely to be improved using tighter bounds. Due to the space limit, we only present key ideas and results here. Detailed discussions are given in Section 1 of the supplementary material.

Tight ELBO for continuous tensors. We take functional derivative of with respect to in (4). By setting the derivative to zero, we obtain the optimal (which is a Gaussian distribution) and then substitute it into , manipulating the terms, we achieve the following tighter ELBO.

Theorem 4.1.

For continuous data, we have


where is Frobenius norm, and

Tight ELBO for binary tensors. The binary case is more difficult because and are coupled together (see (5)). We use the following steps: we first fix and plug the optimal in the same way as the continuous case. Then we obtain an intermediate ELBO that only contains . However, a quadratic term in , , intertwines all in , making it infeasible to analytically derive or parallelly compute the optimal . To overcome this difficulty, we exploit the convex conjugate of the quadratic term to introduce an extra variational parameter to decouple the dependences between . After that, we are able to derive the optimal using functional derivatives and to obtain the following tight ELBO.

Theorem 4.2.

For binary data, we have



is the cumulative distribution function of the standard Gaussian.

As we can see, due to the additive forms of the terms in and , such as , , and , the computation of the tight ELBOs and their gradients can be efficiently performed in parallel. The derivation of the full gradient is given in Section 2 of the supplementary material.

4.3 Distributed Inference on Tight Bound

4.3.1 Distributed Gradient-based Optimization

Given the tighter ELBOs in (6) and (7), we develop a distributed algorithm to optimize the latent factors , the inducing points , the variational parameters (for binary data) and the kernel parameters. We distribute the computations over multiple computational nodes (Map step) and then collect the results to calculate the ELBO and its gradient (Reduce step). A standard routine, such as gradient descent and L-BFGS, is then used to solve the optimization problem.

For binary data, we further find that can be updated with a simple fixed point iteration:


where .

Apparently, the updating can be efficiently performed in parallel (due to the additive structure of and ). Moreover, the convergence is guaranteed by the following lemma. The proof is given in Section 3 of the supplementary material.

Lemma 4.3.

Given and , we have and the fixed point iteration (8) always converges.

In our experience, the fixed-point iterations are much more efficient than general search strategies (such as line-search) to identity an appropriate step length along the gradient direction. To use it, before we calculate the gradients with respect to and , we first optimize using the fixed point iteration (in an inner loop). In the outer control, we then employ gradient descent or L-BFGS to optimize and . This will lead to an even tighter bound for our model: . Empirically, this converges must faster than feeding the optimization algorithms with , and altogether.

4.3.2 Key-Value-Free MapReduce

In this section we present the detailed design of MapReduce procedures to fulfill our distributed inference. Basically, we first allocate a set of tensor entries on each Mapper such that the corresponding components of the ELBO and the gradients are calculated. Then the Reducer aggregates local results from each Mapper to obtain the integrated, global ELBO and gradient.

We first consider the standard (key-value) design. For brevity, we take the gradient computation for the latent factors as an example. For each tensor entry on a Mapper, we calculate the corresponding gradients and then send out the key-value pairs , where the key indicates the mode and the index of the latent factors. The Reducer aggregates gradients with the same key to recover the full gradient with respect to each latent factor.

Although the (key-value) MapReduce has been successfully applied in numerous applications, it relies on an expensive data shuffling operation: the Reduce step has to sort the Mappers’ output by the keys before aggregation. Since the sorting is usually performed on disk due to significant data size, intensive disk I/Os and network communications will become serious computational overheads. To overcome this deficiency, we devise a key-value-free Map-Reduce scheme to avoid on-disk data shuffling operations. Specifically, on each Mapper, a complete gradient vector is maintained for all the parameters, including , and the kernel parameters. However, only relevant components of the gradient, as specified by the tensor entries allocated to this Mapper, will be updated. After updates, each Mapper will then send out the full gradient vector, and the Reducer will simply sum them up together to obtain a global gradient vector without having to perform any extra data sorting. Note that a similar procedure can also be used to perform the fixed point iteration for (in binary tensors).

Efficient MapReduce systems, such as SPARK [27], can fully optimize the non-shuffling Map and Reduce, where most of the data are buffered in memory and disk I/Os are circumvented to the utmost; by contrast, the performance with data shuffling degrades severely [6]. This is verified in our evaluations: on a small tensor of size , our key-value-free MapReduce gains times speed acceleration over the traditional key-value process. Therefore, our algorithm can fully exploit the memory-cache mechanism to achieve fast inference.

4.4 Algorithm Complexity

Suppose we use tensor entries for training, with inducing points and Mapper, the time complexity for each Mapper node is . Since is a fixed constant ( in our experiments), the time complexity is linear in the number of tensor entries. The space complexity for each Mapper node is

, in order to store the latent factors, their gradients, the covariance matrix on inducing points, and the indices of the latent factors for each tensor entry. Again, the space complexity is linear in the number of tensor entries. In comparison, InfTucker utilizes the Kronecker-product properties to calculate the gradients and has to perform eigenvalue decomposition of the covariance matrices in each tensor mode. Therefor it has a higher time and space complexity (see  

[25] for details) and is not scalable to large dimensions.

5 Related work

Classical tensor factorization models include Tucker [24] and CP [8], based on which many excellent works have been proposed [19, 5, 22, 1, 9, 26, 18, 21, 10, 17]. To deal with big data, several distributed factorization algorithms have been recently developed, such as GigaTensor [11] and DFacTo [4]. Despite the widespread success of these methods, their underlying multilinear factorization structure may limit their capability to capture more complex, nonlinear relationship in real-world applications. Infinite Tucker decomposition [25], and its distributed or online extensions [28, 29] address this issue by modeling tensors or subtensors via a tensor-variate Gaussian process (TGP). However, these methods may suffer from the extreme sparsity in real-world tensor data, because the Kronecker-product structure in the covariance of TGP requires modeling the entire tensor space no matter the elements are meaningful (non-zeros) or not. By contrast, our flexible GP factorization model eliminates the Kronecker-product restriction and can model an arbitrary subset of tensor entries. In theory, all such nonlinear factorization models belong to the random function prior models [14] for exchangeable multidimensional arrays.

Our distributed variational inference algorithm is based on sparse GP [16], an efficient approximation framework to scale up GP models. Sparse GP uses a small set of inducing points to break the dependency between random function values. Recently, Titsias [23] proposed a variational learning framework for sparse GP, based on which Gal et al. [7] derived a tight variational lower bound for distributed inference of GP regression and GPLVM [13]. The derivation of the tight ELBO in our model for continuous tensors is similar to [7]. However, the gradient calculation is substantially different, because the input to our GP factorization model is the concatenation of the latent factors. Many tensor entries may partly share the same latent factors, causing a large amount of key-value pair to be sent during the distributed gradient calculation. This will incur an expensive data shuffling procedure that takes place on disk. To improve the computational efficiency, we develop a non-key-value Map-Reduce to avoid data shuffling and fully exploit the memory-cache mechanism in efficient MapReduce systems. This strategy is also applicable to other Map-Reduce based learning algorithms. In addition to continuous data, we also develop a tight ELBO for binary data on optimal variational posteriors. By introducing extra variational parameters with convex conjugates ( is the number of inducing points), our inference can be performed efficiently in a distributed manner, which avoids explicit optimization on a large number of variational posteriors for the latent tensor entries and inducing targets. Our method can also be useful for GP classification problem.

6 Experiments

6.1 Evaluation on Small Tensor Data

For evaluation, we first compared our method with various existing tensor factorization methods. To this end, we used four small real datasets where all methods are computationally feasible: (1) Alog, a real-valued tensor of size , representing a three-way interaction (user, action, resource) in a file access log. It contains nonzero entries.(2) AdClick, a real-valued tensor of size , describing (user, publisher, advertisement) clicks for online advertising. It contains nonzero entries. (3) Enron, a binary tensor extracted from the Enron email dataset (www.cs.cmu.edu/~./enron/) depicting the three-way relationship (sender, receiver, time). It contains elements, of which are nonzero. (4) NellSmall, a binary tensor extracted from the NELL knowledge base (rtw.ml.cmu.edu/rtw/resources), of size . It depicts the knowledge predicates (entity, relationship, entity). The data set contains nonzero elements.

We compared with CP, nonnegative CP (NN-CP) [19], high order SVD (HOSVD) [12], Tucker, infinite Tucker (InfTucker) Xu et al. [25] and its extension (InfTuckerEx) which uses the Dirichlet process mixture (DPM) prior to model latent clusters and local TGP to perform scalable, online factorization [29]. Note that InfTucker and InfTuckerEx are nonlinear factorization approaches.

For testing, we used the same setting as in [29]. All the methods were evaluated via a 5-fold cross validation. The nonzero entries were randomly split into folds: folds were used for training and the remaining non-zero entries and zero entries were used for testing so that the number of non-zero entries is comparable to the number of zero entries. By doing this, zero and nonzero entries are treated equally important in testing, and so the evaluation will not be dominated by large portion of zeros. For InfTucker and InfTuckerEx, we carried out extra cross-validations to select the kernel form (e.g., RBF, ARD and Matern kernels) and the kernel parameters. For InfTuckerEx, we randomly sampled subtensors and tuned the learning rate following  [29]. For our model, the number of inducing points was set to , and we used a balanced training set generated as follows: in addition to nonzero entries, we randomly sampled the same number of zero entries and made sure that they would not overlap with the testing zero elements.

Our model used ARD kernel and the kernel parameters were estimated jointly with the latent factors. Thus, the expensive parameter selection procedure was not needed. We implemented our distributed inference algorithm with two optimization frameworks, gradient descent and L-BFGS (denoted by Ours-GD and Ours-LBFGS respectively). For a comprehensive evaluation, we also examined CP on balanced training entries generated in the same way as our model, denoted by CP-2. The mean squared error (MSE) is used to evaluate predictive performance on Alog and Click and area-under-curve (AUC) on Enron and Nell. The averaged results from the -fold cross validation are reported.

Our model achieves a higher prediction accuracy than InfTucker, and a better or comparable accuracy than InfTuckerEx (see Figure 1). A -test shows that our model outperforms InfTucker significantly () in almost all situations. Although InfTuckerEx uses the DPM prior to improve factorization, our model still obtains significantly better predictions on Alog and AdClick and comparable or better performance on Enron and NellSmall. This might be attributed to the flexibility of our model in using balanced training entries to prevent the learning bias (toward numerous zeros). Similar improvements can be observed from CP to CP-2. Finally, our model outperforms all the remaining methods, demonstrating the advantage of our nonlinear factorization approach.

6.2 Scalability Analysis

To examine the scalability of the proposed distributed inference algorithm, we used the following large real-world datasets: (1) ACC, A real-valued tensor describing three-way interactions (user, action, resource) in a code repository management system [29]. The tensor is of size , where are nonzero. (2) DBLP: a binary tensor depicting a three-way bibliography relationship (author, conference, keyword) [29]. The tensor was extracted from DBLP database and contains elements, where are nonzero entries. (3) NELL: a binary tensor representing the knowledge predicates, in the form of (entity, entity, relationship) [28]. The tensor size is and are nonzero.

The scalability of our distributed inference algorithm was examined with regard to the number of machines on ACC dataset. The number of latent factors was set to 3. We ran our algorithm using the gradient descent. The results are shown in Figure 2(a). The Y-axis shows the reciprocal of the running time multiplied by a constant—which corresponds to the running speed. As we can see, the speed of our algorithm scales up linearly to the number of machines.

(a) Alog
(b) AdClick
(c) Enron
(d) NellSmall
Figure 1: The prediction results on small datasets. The results are averaged over 5 runs.
(a) Scalability
(b) ACC
(c) DBLP
(d) NELL
Figure 2: Prediction accuracy (averaged on test datasets) on large tensor data and the scalability.

6.3 Evaluation on Large Tensor Data

We then compared our approach with three state-of-the-art large-scale tensor factorization methods: GigaTensor [11], Distributed infinite Tucker decomposition (DinTucker) [28], and InfTuckerEx [29]. Both GigaTensor and DinTucker are developed on Hadoop, while InfTuckerEx uses online inference. Our model was implemented on SPARK. We ran Gigatensor, DinTucker and our approach on a large YARN cluster and InfTuckerEx on a single computer.

We set the number of latent factors to for ACC and DBLP data set, and for NELL data set. Following the settings in [29, 28], we randomly chose of nonzero entries for training, and then sampled test data sets from the remaining entries. For ACC and DBLP, each test data set comprises nonzero elements and zero elements; for NELL, each test data set contains nonzero elements and zero elements. The running of GigaTensor was based on the default settings of the software package. For DinTucker and InfTuckerEx, we randomly sampled subtensors for distributed or online inference. The parameters, including the number and size of the subtensors and the learning rate, were selected in the same way as [29]. The kernel form and parameters were chosen by a cross-validation on the training tensor. For our model, we used the same setting as in the small data. We set Mappers for GigaTensor, DinTucker and our model.

Figure 2(b)-(d) shows the predictive performance of all the methods. We observe that our approach consistently outperforms GigaTensor and DinTucker on all the three datasets; our approach outperforms InfTuckerEx on ACC and DBLP and is slightly worse than InfTuckerEx on NELL. Note again that InfTuckerEx uses DPM prior to enhance the factorization while our model doesn’t; finally, all the nonlinear factorization methods outperform GigaTensor, a distributed CP factorization algorithm by a large margin, confirming the advantages of nonlinear factorizations on large data. In terms of speed, our algorithm is much faster than GigaTensor and DinTucker. For example, on DBLP dataset, the average per-iteration running time were 1.45, 15.4 and 20.5 minutes for our model, GigaTensor and DinTucker, respectively. This is not surprising, because (1) our model uses the data sparsity and can exclude numerous, meaningless zero elements from training; (2) our algorithm is based on SPARK, a more efficient MapReduce system than Hadoop; (3) our algorithm gets rid of data shuffling and can fully exploit the memory-cache mechanism of SPARK.

6.4 Application on Click-Through-Rate Prediction

In this section, we report the results of applying our nonlinear tensor factorization approach on Click-Through-Rate (CTR) prediction for online advertising.

We used the online ads click log from a major Internet company, from which we extracted a four mode tensor (user, advertisement, publisher, page-section). We used the first three days’s log on May 2015, trained our model on one day’s data and used it to predict the click behaviour on the next day. The sizes of the extracted tensors for the three days are , and respectively. These tensors are very sparse ( nonzeros on average). In other words, the observed clicks are very rare. However, we do not want our prediction completely bias toward zero (i.e., non-click); otherwise, ads ranking and recommendation will be infeasible. Thus we sampled non-clicks of the same quantity as the clicks for training and testing. Note that training CTR prediction models with comparable clicks and non-click samples is common in online advertising systems [2]. The number of training and testing entries used for the three days are , and respectively.

We compared with popular methods for CTR prediction, including logistic regression and linear SVM, where each tensor entry is represented by a set of binary features according to the indices of each mode in the entry.

The results are reported in Table 1, in terms of AUC. It shows that our model improves logistic regression and linear SVM by a large margin, on average and respectively. Therefore, although we have not incorporated side features, such as user profiles and advertisement attributes, our tentative experiments have shown a promising potential of our model on CTR prediction task.

Method 1-2 2-3 3-4
Logistic regression 0.7360 0.7337 0.7538
Linear SVM 0.7414 0.7332 0.7540
Our model 0.8925 0.8903 0.9054
Table 1: CTR prediction accuracy on the first three days of May 2015. ”1-2” means using May 1st’s data for training and May 2nd’s data for testing; similar are ”2-3” and ”3-4”.

7 Conclusion

In this paper, we have proposed a new nonlinear and flexible tensor factorization model. By disposing of the Kronecker-product covariance structure, the model can properly exploit the data sparsity and is flexible to incorporate any subset of meaningful tensor entries for training. Moreover, we have derived a tight ELBO for both continuous and binary problems, based on which we further developed an efficient distributed variational inference algorithm in MapReduce framework. In the future, we will consider applying asynchronous inference on the tight ELBO, such as [20], to further improve the scalability of our model.


  • Acar et al. [2011] Acar, E., Dunlavy, D. M., Kolda, T. G., & Morup, M. (2011). Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1), 41–56.
  • Agarwal et al. [2014] Agarwal, D., Long, B., Traupman, J., Xin, D., & Zhang, L. (2014). Laser: A scalable response prediction platform for online advertising. In Proceedings of the 7th ACM international conference on Web search and data mining, (pp. 173–182). ACM.
  • Bishop [2006] Bishop, C. M. (2006). Pattern recognition and machine learning. springer.
  • Choi & Vishwanathan [2014] Choi, J. H. & Vishwanathan, S. (2014). Dfacto: Distributed factorization of tensors. In Advances in Neural Information Processing Systems, (pp. 1296–1304).
  • Chu & Ghahramani [2009] Chu, W. & Ghahramani, Z. (2009). Probabilistic models for incomplete multi-dimensional arrays. AISTATS.
  • Davidson & Or [2013] Davidson, A. & Or, A. (2013). Optimizing shuffle performance in spark. University of California, Berkeley-Department of Electrical Engineering and Computer Sciences, Tech. Rep.
  • Gal et al. [2014] Gal, Y., van der Wilk, M., & Rasmussen, C. (2014). Distributed variational inference in sparse gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems, (pp. 3257–3265).
  • Harshman [1970] Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Model and conditions for an”explanatory”multi-mode factor analysis. UCLA Working Papers in Phonetics, 16, 1–84.
  • Hoff [2011] Hoff, P. (2011). Hierarchical multilinear models for multiway data. Computational Statistics & Data Analysis, 55, 530–543.
  • Hu et al. [2015] Hu, C., Rai, P., & Carin, L. (2015). Zero-truncated poisson tensor factorization for massive binary tensors. In UAI.
  • Kang et al. [2012] Kang, U., Papalexakis, E., Harpale, A., & Faloutsos, C. (2012). Gigatensor: scaling tensor analysis up by 100 times-algorithms and discoveries. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, (pp. 316–324). ACM.
  • Lathauwer et al. [2000] Lathauwer, L. D., Moor, B. D., & Vandewalle, J. (2000).

    A multilinear singular value decomposition.

    SIAM J. Matrix Anal. Appl, 21, 1253–1278.
  • Lawrence [2004] Lawrence, N. D. (2004).

    Gaussian process latent variable models for visualisation of high dimensional data.

    Advances in neural information processing systems, 16(3), 329–336.
  • Lloyd et al. [2012] Lloyd, J. R., Orbanz, P., Ghahramani, Z., & Roy, D. M. (2012). Random function priors for exchangeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems 24, (pp. 1007–1015).
  • Minka [2000] Minka, T. P. (2000). Old and new matrix algebra useful for statistics. See www. stat. cmu. edu/minka/papers/matrix. html.
  • Quiñonero-Candela & Rasmussen [2005] Quiñonero-Candela, J. & Rasmussen, C. E. (2005). A unifying view of sparse approximate gaussian process regression. The Journal of Machine Learning Research, 6, 1939–1959.
  • Rai et al. [2015] Rai, P., Hu, C., Harding, M., & Carin, L. (2015). Scalable probabilistic tensor factorization for binary and count data.
  • Rai et al. [2014] Rai, P., Wang, Y., Guo, S., Chen, G., Dunson, D., & Carin, L. (2014). Scalable Bayesian low-rank decomposition of incomplete multiway tensors. In Proceedings of the 31th International Conference on Machine Learning (ICML).
  • Shashua & Hazan [2005] Shashua, A. & Hazan, T. (2005).

    Non-negative tensor factorization with applications to statistics and computer vision.

    In Proceedings of the 22th International Conference on Machine Learning (ICML), (pp. 792–799).
  • Smyth et al. [2009] Smyth, P., Welling, M., & Asuncion, A. U. (2009). Asynchronous distributed learning of topic models. In Advances in Neural Information Processing Systems, (pp. 81–88).
  • Sun et al. [2015] Sun, W., Lu, J., Liu, H., & Cheng, G. (2015). Provable sparse tensor decomposition. arXiv preprint arXiv:1502.01425.
  • Sutskever et al. [2009] Sutskever, I., Tenenbaum, J. B., & Salakhutdinov, R. R. (2009). Modelling relational data using bayesian clustered tensor factorization. In Advances in neural information processing systems, (pp. 1821–1828).
  • Titsias [2009] Titsias, M. K. (2009). Variational learning of inducing variables in sparse gaussian processes. In

    International Conference on Artificial Intelligence and Statistics

    , (pp. 567–574).
  • Tucker [1966] Tucker, L. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31, 279–311.
  • Xu et al. [2012] Xu, Z., Yan, F., & Qi, Y. (2012). Infinite Tucker decomposition: Nonparametric Bayesian models for multiway data analysis. In Proceedings of the 29th International Conference on Machine Learning (ICML).
  • Yang & Dunson [2013] Yang, Y. & Dunson, D. (2013). Bayesian conditional tensor factorizations for high-dimensional classification. Journal of the Royal Statistical Society B, revision submitted.
  • Zaharia et al. [2012] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., & Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, (pp. 2–2). USENIX Association.
  • Zhe et al. [2013] Zhe, S., Qi, Y., Park, Y., Molloy, I., & Chari, S. (2013). Dintucker: Scaling up gaussian process models on multidimensional arrays with billions of elements. arXiv preprint arXiv:1311.2663.
  • Zhe et al. [2015] Zhe, S., Xu, Z., Chu, X., Qi, Y., & Park, Y. (2015). Scalable nonparametric multiway data analysis. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, (pp. 1125–1134).

Supplementary Material

In this extra material, we provide the details about the derivation of the tight variational evidence lower bound of our proposed GP factorization model (Section 1) as well as its gradient calculation (Section 2). Moreover, we give the convergence proof of the fixed point iteration used in our distributed inference algorithm for binary tensor (Section 3).

1 Tight Variational Evidence Lower Bound

The naive variational evidence lower bound (ELBO) derived from the sparse Gaussian process framework (see Section 4.1 of the main paper) is given by


for continuous tensor and


for binary tensor, where and . Our goal is to further obtain a tight ELBO that subsumes the optimal variational posterior (i.e., and ) so as to prevent the sequential E-M procedure for efficient parallel training and to improve the inference quality.

1.1 Continuous Tensor

First, let us consider the continuous data. Given and , we use functional derivatives [3] to calculate the optimal . The functional derivative of with respect to is given by


is a probability density function, we use Lagrange multipliers to impose the constraint and obtain the optimal

by solving

Though simple algebraic manipulations, we can obtain the optimal to be the following form


Now substituting in with , we obtain the tight ELBO presented in Theorem 4.1 of the main paper:


where is Frobenius norm, and

1.2 Binary Tensor

Next, let us look at the binary data. The case for binary tensors is more complex, because we have the additional variational posterior . Furthermore, and are coupled in the original ELBO (see (10)). To eliminate and , we use the following steps. We first fix , calculate the optimal and plug it into (this is similar to the continuous case) to obtain an intermediate bound,


where denotes the expectation under the variational posteriors. Note that has a similar form to in (11).

Now we consider to calculate the optimal for . To this end, we calculate the functional derivative of with respect to each :

where and .

Solving being with Lagrange multipliers, we find that the optimal is a truncated Gaussian,

This expression is unfortunately not analytical. Even if we can explicitly update each , the updating will depend on all the other variational posteriors , making distributed calculation very difficult. This arises from the quadratic term
in (12), which couples all .

To resolve this issue, we introduce an extra variational parameter to decouple the dependencies between using the following lemma.

Lemma 1.1.

For any symmetric positive definite matrix ,


The equality is achieved when .


Define the function and it is easy to see that is convex because . Then using the convex conjugate, we have and . Then by maximizing , we can obtain . Thus, . Since is a free parameter, we can use to replace and obtain the inequality (13). Further, we can verify that when the equality is achieved. ∎

We now apply the inequality on the term in (12). Note that the quadratic term regarding all now vanishes, and instead a linear term is introduced so that these annoying dependencies between are eliminated. We therefore obtain a more friendly intermediate ELBO,


The functional derivative with respect to is then given by

Now solving , we see that the optimal variational posterior has an analytical form:

Plugging each into (14), we finally obtain the tight ELBO as presented in Theorem 4.2 of the main paper: