I. Introduction
Consider the setting where we are given a partially observed dataset of discrete samples, and we are interested in predicting the missing entries. This scenario often arises in recommender systems, where we are interested in predicting user preferences pertaining to news, movies, or music, based on a user's history as well as the history of other users.
Among the various approaches used in recommender systems, data completion using matrix and (more recently) tensor factorization is one of the most pervasive. The premise of factorization-based recommendation approaches is that the data approximately follow a low-rank model, i.e., there are few basic types of customers (and movies, songs, or news items), and every customer (movie, song, story) is a linear combination of the respective types. Thus a low-rank model is appropriate for the data, and can be used for completion.
In this paper, we propose a fundamentally different approach. From a statistical inference point of view, having access to the joint distribution of all variables of interest is the 'gold standard'. Given the joint Probability Mass Function (PMF) of a set of discrete random variables, it is possible to compute any marginal or conditional probability for subsets of these variables, and use it to solve regression or classification problems. For example, one may be interested in finding the Maximum A Posteriori (MAP) estimate of an unobserved entry, or its conditional expectation given a number of observed variables. In practice, however, it is often not possible to learn a joint PMF of all random variables without making restrictive assumptions, for computational or statistical reasons; the number of free parameters grows exponentially in the number of variables. In addition, when the dataset is incomplete, an imputation mechanism is needed.
In this work, we propose modeling a joint PMF of a set of random variables using a low-rank nonnegative tensor (multiway array) factorization model. In effect, we propose using a low-rank model of the joint PMF, as opposed to using a low-rank model of the raw data. Tensor factorization techniques are widely used in numerous diverse fields such as signal processing, computer vision, chemistry, and, more recently, machine learning and data mining [1]. Canonical Polyadic Decomposition (CPD), also known as PARAllel FACtor analysis (PARAFAC) [2, 3], and the Tucker decomposition [4] are the two most widely used factorization models. In this work, we focus on the CPD model, which is known to be unique under mild conditions. Any joint PMF of size $I_1 \times I_2 \times \cdots \times I_N$ can be regarded as a nonnegative CPD model of nonnegative rank at most $\min_n \prod_{m \neq n} I_m$ -- this is easy to see, using the same argument as for real-valued CPD in [1], and a trivial factorization. Symmetric CPD has been considered for the related (but different) problem of modeling co-occurrence data [5, 6]. The most relevant prior work is [7] (and [8]), where CPD (resp. Tucker) was used to model the joint PMF of multivariate categorical data, assuming access to the full joint PMF.
We do not assume access to the full joint PMF. The reason is that, when dealing with many random variables (large $N$), the probability of encountering any particular realization decays very fast (usually exponentially) as a function of $N$. That makes joint PMF estimation, even from complete samples, essentially intractable: one needs very long data records (a very large number of samples) for reliable estimates of the joint PMF values. In this paper, we show how we can utilize information regarding lower-order marginals of subsets of the random variables to provably infer the full joint PMF. Estimates of lower-order marginals are easier to compute even in the case of missing data, and a low-rank CPD model has the advantage of reducing the dimension of the parameter space, which becomes linear in the number of variables. We formulate the problem as a coupled tensor factorization problem and derive an algorithmic approach to solve it. We illustrate the method using synthetic data and give a motivating example using real data for rating prediction.
I-A. Notation
A bold lowercase letter $\mathbf{x}$ denotes a vector, a bold uppercase letter $\mathbf{X}$ denotes a matrix, and an underlined bold uppercase letter $\underline{\mathbf{X}}$ denotes a tensor. The outer product of $N$ vectors $\mathbf{a}_1 \circ \mathbf{a}_2 \circ \cdots \circ \mathbf{a}_N$ is an $N$-way tensor with elements $\mathbf{a}_1(i_1)\mathbf{a}_2(i_2)\cdots\mathbf{a}_N(i_N)$. The Khatri-Rao (column-wise Kronecker) product of two matrices $\mathbf{A}$ and $\mathbf{B}$ is denoted $\mathbf{A} \odot \mathbf{B}$. The Hadamard (element-wise) product of commensurate matrices is denoted $\mathbf{A} \ast \mathbf{B}$. $\mathrm{vec}(\mathbf{X})$ is the vector obtained by vertically stacking the columns of matrix $\mathbf{X}$. $\mathrm{Diag}(\mathbf{x})$ denotes the diagonal matrix with the elements of vector $\mathbf{x}$ on its diagonal.

II. Tensor Decomposition Preliminaries
An $N$-way tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ admits a CPD of rank $F$ if it can be decomposed as a sum of $F$ rank-1 tensors

$$\underline{\mathbf{X}} = \sum_{f=1}^{F} \lambda(f)\, \mathbf{A}_1(:,f) \circ \mathbf{A}_2(:,f) \circ \cdots \circ \mathbf{A}_N(:,f), \quad (1)$$

where $\mathbf{A}_n \in \mathbb{R}^{I_n \times F}$, $\boldsymbol{\lambda} \in \mathbb{R}^{F}$, and $F$ is the smallest number for which such a decomposition exists (Fig. 1). We then write $\underline{\mathbf{X}} = [\![\boldsymbol{\lambda}; \mathbf{A}_1, \ldots, \mathbf{A}_N]\!]$, with elements

$$\underline{\mathbf{X}}(i_1, \ldots, i_N) = \sum_{f=1}^{F} \lambda(f) \prod_{n=1}^{N} \mathbf{A}_n(i_n, f). \quad (2)$$

We denote the mode-1 matrix unfolding of $\underline{\mathbf{X}}$ as the matrix $\mathbf{X}^{(1)}$ of size $\prod_{n=2}^{N} I_n \times I_1$. We have that $\mathbf{X}^{(1)} = \mathbf{H}_1 \mathrm{Diag}(\boldsymbol{\lambda}) \mathbf{A}_1^{T}$, where

$$\mathbf{H}_1 = \mathbf{A}_N \odot \mathbf{A}_{N-1} \odot \cdots \odot \mathbf{A}_2. \quad (3)$$

More generally, the mode-$n$ matrix unfolding can be expressed as

$$\mathbf{X}^{(n)} = \mathbf{H}_n \mathrm{Diag}(\boldsymbol{\lambda}) \mathbf{A}_n^{T}, \quad (4)$$

where

$$\mathbf{H}_n = \mathbf{A}_N \odot \cdots \odot \mathbf{A}_{n+1} \odot \mathbf{A}_{n-1} \odot \cdots \odot \mathbf{A}_1. \quad (5)$$

We can also express a tensor in a vectorized form. Let

$$\mathbf{H} = \mathbf{A}_N \odot \mathbf{A}_{N-1} \odot \cdots \odot \mathbf{A}_1. \quad (6)$$

The vectorized form of the tensor can then be expressed as

$$\mathrm{vec}(\underline{\mathbf{X}}) = \mathbf{H} \boldsymbol{\lambda}. \quad (7)$$
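As a quick numerical sanity check (a sketch of ours, not from the paper; the `khatri_rao` helper and the small dimensions are illustrative assumptions), the element-wise CPD formula (2), the mode-1 unfolding (3)-(4), and the vectorization identity (6)-(7) can all be verified on a small random three-way tensor:

```python
import numpy as np

def khatri_rao(A, B):
    # column-wise Kronecker product: (A ⊙ B)[:, f] = kron(A[:, f], B[:, f])
    F = A.shape[1]
    return np.vstack([np.kron(A[:, f], B[:, f]) for f in range(F)]).T

rng = np.random.default_rng(0)
I1, I2, I3, F = 4, 3, 5, 2
A1, A2, A3 = (rng.random((I, F)) for I in (I1, I2, I3))
lam = rng.random(F)

# build the tensor element-wise from the CPD model
X = np.einsum('f,if,jf,kf->ijk', lam, A1, A2, A3)

# vec(X), first index varying fastest (column-major), equals (A3 ⊙ A2 ⊙ A1) λ
vecX = X.reshape(-1, order='F')
H = khatri_rao(A3, khatri_rao(A2, A1))
assert np.allclose(vecX, H @ lam)

# mode-1 unfolding of size I2*I3 x I1 equals (A3 ⊙ A2) Diag(λ) A1^T
X1 = X.transpose(1, 2, 0).reshape(I2 * I3, I1, order='F')
assert np.allclose(X1, khatri_rao(A3, A2) @ np.diag(lam) @ A1.T)
```

Note that the Khatri-Rao ordering must match the (column-major) vectorization convention; flipping the order of the factors silently breaks both identities.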
III. Nonnegative Tensor Factorization and Latent Variable Models
It has been shown that the joint PMF of a finite set of discrete random variables satisfying the naive Bayes hypothesis can be regarded as a nonnegative CPD model [5, 9]. More specifically, let $X_1, \ldots, X_N$ denote discrete random variables that can take one of $I_n$, $n = 1, \ldots, N$, distinct values respectively. We define an $N$-way tensor $\underline{\mathbf{X}}$ that models the joint PMF of the random variables, i.e., $\underline{\mathbf{X}}(i_1, \ldots, i_N) = \Pr(X_1 = i_1, \ldots, X_N = i_N)$. Suppose that the discrete random variables satisfy the naive Bayes hypothesis, that is, they are conditionally independent given a hidden variable $H$ that can take $F$ distinct values (Fig. 2). Then, the joint PMF can be decomposed as

$$\Pr(i_1, \ldots, i_N) = \sum_{f=1}^{F} \Pr(f) \prod_{n=1}^{N} \Pr(i_n \mid f), \quad (8)$$

where $\Pr(f)$ is the prior distribution of the latent variable and $\Pr(i_n \mid f)$, $n = 1, \ldots, N$, are conditional distributions. In fact, every joint PMF can be decomposed as in Equation (8) for $F$ large enough, as we explained in the introduction; the naive Bayes hypothesis is just an interpretation. Notice the similarity of equations (2) and (8). In the CPD model of the joint PMF, each column of the CPD factor matrices is a conditional PMF, i.e., $\mathbf{A}_n(i_n, f) = \Pr(i_n \mid f)$, and the vector $\boldsymbol{\lambda}$ contains the prior probabilities of the latent variable, $\lambda(f) = \Pr(f)$. For the case $N = 2$, the model described by Equation (8) is equivalent to Probabilistic Latent Semantic Indexing (PLSI) [10], a popular method for document clustering based on dyadic co-occurrence data, which is known to be closely related to NMF with KL divergence [11]. In the following, we focus on the general case where $N$ can be larger than two, and we are interested in cases where the PMF can be approximated by a low-rank CPD model

$$\underline{\mathbf{X}} \approx [\![\boldsymbol{\lambda}; \mathbf{A}_1, \ldots, \mathbf{A}_N]\!], \quad (9)$$

ideally for some relatively small $F$.
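To make the correspondence concrete, here is a small numerical sketch (our own construction; `random_stochastic` and the dimensions are illustrative assumptions): drawing simplex-constrained factors and a prior and combining them as in the naive Bayes model (8) always yields a valid joint PMF.

```python
import numpy as np

rng = np.random.default_rng(1)
I1, I2, I3, F = 4, 3, 5, 2

def random_stochastic(I, F):
    # columns are conditional PMFs Pr(i_n | f): nonnegative, summing to 1
    A = rng.random((I, F))
    return A / A.sum(axis=0, keepdims=True)

A1, A2, A3 = (random_stochastic(I, F) for I in (I1, I2, I3))
lam = rng.random(F); lam /= lam.sum()   # prior Pr(f) on the simplex

# Pr(i1,i2,i3) = sum_f Pr(f) Pr(i1|f) Pr(i2|f) Pr(i3|f)
X = np.einsum('f,if,jf,kf->ijk', lam, A1, A2, A3)
assert np.all(X >= 0) and np.isclose(X.sum(), 1.0)   # X is a valid joint PMF
```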
IV. Problem Formulation
In practice, it is not always possible to have "point" estimates of $\underline{\mathbf{X}}$. When $N$ is large, a very large number of samples is needed in order to obtain a reliable empirical estimate of the PMF. Furthermore, much (most) of the data may be missing, as in collaborative filtering applications. An alternative approach is to extract estimates of lower-order marginals of subsets of random variables, which can be viewed as linear measurements (sums over the remaining modes, i.e., lower-dimensional projections) of the complete tensor, and seek a joint PMF that is consistent with this information.
For brevity, we focus on the case where we have estimates of marginal distributions corresponding to every possible combination of triples of random variables, i.e., for every $1 \leq j < k < l \leq N$ we are given estimates which we put in a tensor $\underline{\mathbf{Z}}_{jkl}$ with elements

$$\underline{\mathbf{Z}}_{jkl}(i_j, i_k, i_l) = \Pr(X_j = i_j, X_k = i_k, X_l = i_l). \quad (10)$$
The method can be easily generalized to any type of lower-order marginals. Under the assumption of a low-rank CPD model as in (9), every marginal distribution of three random variables can be decomposed as follows

$$\underline{\mathbf{Z}}_{jkl} = [\![\boldsymbol{\lambda}; \mathbf{A}_j, \mathbf{A}_k, \mathbf{A}_l]\!]. \quad (11)$$

This is a direct consequence of the law of total probability. Marginalizing with respect to the $n$-th random variable we have that

$$\sum_{i_n = 1}^{I_n} \Pr(i_1, \ldots, i_N) = \sum_{f=1}^{F} \Pr(f) \prod_{m \neq n} \Pr(i_m \mid f), \quad (12)$$
since $\sum_{i_n = 1}^{I_n} \Pr(i_n \mid f) = 1$. Therefore, in order to compute an estimate of the full joint PMF, we propose solving the following optimization problem

$$\min_{\boldsymbol{\lambda}, \{\mathbf{A}_n\}_{n=1}^{N}} \ \sum_{j<k<l} \left\| \underline{\mathbf{Z}}_{jkl} - [\![\boldsymbol{\lambda}; \mathbf{A}_j, \mathbf{A}_k, \mathbf{A}_l]\!] \right\|_F^2 \quad (13)$$

$$\text{subject to} \quad \boldsymbol{\lambda} \geq \mathbf{0}, \ \mathbf{1}^T \boldsymbol{\lambda} = 1, \ \mathbf{A}_n \geq \mathbf{0}, \ \mathbf{1}^T \mathbf{A}_n = \mathbf{1}^T, \ n = 1, \ldots, N,$$

where the constraints ensure that $\boldsymbol{\lambda}$ and the columns of each $\mathbf{A}_n$ are valid PMFs. The optimization problem in (13) is an instance of coupled tensor factorization. Coupled tensor/matrix factorization has attracted a lot of interest lately, especially in data mining, as a way of combining various datasets that share dimensions and corresponding matrix factors [12, 13]. Notice that in the case where we have estimates of pairwise marginals, the optimization problem in (13) corresponds to coupled matrix factorization.
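As a small numerical illustration of why (11) holds (a sketch under illustrative dimensions of our own choosing), marginalizing a four-way low-rank PMF over one variable leaves a three-way CPD with the same prior and the corresponding factors:

```python
import numpy as np

rng = np.random.default_rng(2)
dims, F = (3, 4, 2, 5), 3

def random_stochastic(I, F):
    # columns are conditional PMFs: nonnegative, summing to 1
    A = rng.random((I, F))
    return A / A.sum(axis=0, keepdims=True)

A = [random_stochastic(I, F) for I in dims]
lam = rng.random(F); lam /= lam.sum()

# full four-way joint PMF under the low-rank model
X = np.einsum('f,if,jf,kf,lf->ijkl', lam, *A)

# marginalizing out variable 4 gives the triple marginal of (X1, X2, X3), ...
Z_123 = X.sum(axis=3)
# ... which equals the 3-way CPD with the same prior and the remaining factors
assert np.allclose(Z_123, np.einsum('f,if,jf,kf->ijk', lam, A[0], A[1], A[2]))
```

The sum over the last mode collapses each conditional PMF to 1, which is exactly the law-of-total-probability argument behind (12).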
An important question that arises is whether the parameters of the model in Equation (8) are identifiable from the lower-order marginals. We know that this is not the case when only first-order marginals are given, unless the random variables are independent. But what about the case where third-order marginals are given? The answer in this case is affirmative. In fact, it is possible to derive combinatorially many identifiability results here, so we restrict ourselves to two illustrative ones. Uniqueness conditions for coupled CPD of third-order tensors with one common factor have been provided in [14], which showed that the coupling between several CPDs can in fact enhance identifiability relative to considering individual CPDs. The result in [14] can be applied to derive identifiability conditions in our context. Another possibility is outlined next. Consider the third-order marginals for random variables (RVs) 1, 2, and a third RV. Using the mode-3 unfolding and stacking the marginal distributions $\underline{\mathbf{Z}}_{12n}$, $n = 3, \ldots, N$, we obtain a single third-order tensor

$$\underline{\mathbf{Z}} = [\![\mathbf{A}_1, \mathbf{A}_2, \tilde{\mathbf{A}}]\!], \quad \tilde{\mathbf{A}} := \left[ \mathbf{A}_3^T \ \cdots \ \mathbf{A}_N^T \right]^T \mathrm{Diag}(\boldsymbol{\lambda}), \quad (14)$$

where we have absorbed the scaling $\boldsymbol{\lambda}$ in $\tilde{\mathbf{A}}$. Let $C_2(\mathbf{A})$ denote the second compound matrix [1] of $\mathbf{A}$. If $\mathrm{rank}(\tilde{\mathbf{A}}) = F$ and $\mathrm{rank}(C_2(\mathbf{A}_1) \odot C_2(\mathbf{A}_2)) = \binom{F}{2}$, then the rank of the tensor is $F$ and the decomposition is essentially unique -- cf. Theorem 6 in [1]. If $\sum_{n=3}^{N} I_n \geq F$, the factors are drawn from an absolutely continuous distribution, and $F(F-1) \leq I_1(I_1 - 1) I_2(I_2 - 1)/2$, then the rank of the tensor is $F$ and the decomposition is unique almost surely -- cf. Theorem 7 in [1]. We have simply scratched the surface here; there are many more possibilities for proving identifiability under further relaxed conditions. For example, one may further exploit the coupling between the factors by also accounting for the marginals that depend on $\mathbf{A}_1$, regarding $\mathbf{A}_1$ as the common factor of two CPDs.
V. Alternating Optimization Based on ADMM
Alternating Optimization (AO) is one of the most commonly used methods for computing a constrained CPD model. In order to solve the optimization problem in (13), we develop an AO algorithm in which we cyclically update the variables $\mathbf{A}_n$ and $\boldsymbol{\lambda}$ while fixing the remaining variables at their last updated values. Problem (13) is nonconvex, but it becomes convex with respect to each variable if we fix the remaining ones. Assume that we fix estimates of $\boldsymbol{\lambda}$ and $\mathbf{A}_n$, $n = 2, \ldots, N$. Then, the optimization problem with respect to $\mathbf{A}_1$ becomes

$$\min_{\mathbf{A}_1} \ \sum_{1 < k < l} \left\| \underline{\mathbf{Z}}_{1kl} - [\![\boldsymbol{\lambda}; \mathbf{A}_1, \mathbf{A}_k, \mathbf{A}_l]\!] \right\|_F^2 \quad (15)$$

$$\text{subject to} \quad \mathbf{A}_1 \geq \mathbf{0}, \ \mathbf{1}^T \mathbf{A}_1 = \mathbf{1}^T.$$

Note that we have dropped the terms that do not depend on $\mathbf{A}_1$. The number of marginals that depend on the $n$-th variable is $\binom{N-1}{2}$. By using the mode-1 unfolding of each tensor $\underline{\mathbf{Z}}_{1kl}$, the problem can be equivalently written as

$$\min_{\mathbf{A}_1} \ \sum_{1 < k < l} \left\| \mathbf{Z}_{1kl}^{(1)} - (\mathbf{A}_l \odot \mathbf{A}_k) \mathrm{Diag}(\boldsymbol{\lambda}) \mathbf{A}_1^T \right\|_F^2 \quad (16)$$

$$\text{subject to} \quad \mathbf{A}_1 \geq \mathbf{0}, \ \mathbf{1}^T \mathbf{A}_1 = \mathbf{1}^T,$$

which is a least-squares problem with respect to matrix $\mathbf{A}_1$ under probability simplex constraints on its columns. Similar expressions can be derived for each factor due to symmetry. Finally, in order to update $\boldsymbol{\lambda}$ we solve the following optimization problem

$$\min_{\boldsymbol{\lambda}} \ \sum_{j < k < l} \left\| \mathrm{vec}(\underline{\mathbf{Z}}_{jkl}) - (\mathbf{A}_l \odot \mathbf{A}_k \odot \mathbf{A}_j) \boldsymbol{\lambda} \right\|_2^2 \quad (17)$$

$$\text{subject to} \quad \boldsymbol{\lambda} \geq \mathbf{0}, \ \mathbf{1}^T \boldsymbol{\lambda} = 1.$$
We use the alternating direction method of multipliers (ADMM) for solving problems (16), (17). Let $\mathcal{C}$ be the convex set that represents the probability simplex constraints on the factors, and define

$$\iota_{\mathcal{C}}(\mathbf{A}) = \begin{cases} 0, & \mathbf{A} \in \mathcal{C}, \\ +\infty, & \text{otherwise}, \end{cases} \quad (18)$$

which is the indicator function of set $\mathcal{C}$. We reformulate Problem (16) and write it as

$$\min_{\mathbf{A}_1, \tilde{\mathbf{A}}_1} \ \sum_{1 < k < l} \left\| \mathbf{Z}_{1kl}^{(1)} - (\mathbf{A}_l \odot \mathbf{A}_k) \mathrm{Diag}(\boldsymbol{\lambda}) \mathbf{A}_1^T \right\|_F^2 + \iota_{\mathcal{C}}(\tilde{\mathbf{A}}_1) \quad (19)$$

$$\text{subject to} \quad \mathbf{A}_1 = \tilde{\mathbf{A}}_1.$$

It is easy to adapt the ADMM algorithm [15] and derive the following updates

$$\begin{aligned} \mathbf{A}_1 &\leftarrow (2\mathbf{M} + \rho(\tilde{\mathbf{A}}_1 - \mathbf{U}))(2\mathbf{G} + \rho \mathbf{I})^{-1}, \\ \tilde{\mathbf{A}}_1 &\leftarrow \Pi_{\mathcal{C}}(\mathbf{A}_1 + \mathbf{U}), \\ \mathbf{U} &\leftarrow \mathbf{U} + \mathbf{A}_1 - \tilde{\mathbf{A}}_1, \end{aligned} \quad (20)$$

where $\mathbf{U}$ is the scaled dual variable, $\rho > 0$ is the ADMM penalty parameter, and

$$\mathbf{W}_{kl} = (\mathbf{A}_l \odot \mathbf{A}_k) \mathrm{Diag}(\boldsymbol{\lambda}), \quad (21)$$

$$\mathbf{G} = \sum_{1 < k < l} \mathbf{W}_{kl}^T \mathbf{W}_{kl}, \quad (22)$$

$$\mathbf{M} = \sum_{1 < k < l} \left( \mathbf{Z}_{1kl}^{(1)} \right)^T \mathbf{W}_{kl}. \quad (23)$$

The update of $\tilde{\mathbf{A}}_1$ is the proximity operator of the function $\iota_{\mathcal{C}}$, which is the indicator function of a convex set, and is thus a projection operator. To project onto the probability simplex we use a simple sorting-based algorithm [16]. Note that in order to efficiently compute matrix $\mathbf{G}$ we use a property of the Khatri-Rao product,

$$(\mathbf{A} \odot \mathbf{B})^T (\mathbf{A} \odot \mathbf{B}) = (\mathbf{A}^T \mathbf{A}) \ast (\mathbf{B}^T \mathbf{B}). \quad (24)$$
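This Khatri-Rao Gram identity is easy to verify numerically (a quick sketch; the dimensions are arbitrary):

```python
import numpy as np

def khatri_rao(A, B):
    # column-wise Kronecker product
    F = A.shape[1]
    return np.vstack([np.kron(A[:, f], B[:, f]) for f in range(F)]).T

rng = np.random.default_rng(3)
A, B = rng.random((4, 3)), rng.random((5, 3))
KR = khatri_rao(A, B)
# Gram matrix of a Khatri-Rao product = Hadamard product of the Gram matrices
assert np.allclose(KR.T @ KR, (A.T @ A) * (B.T @ B))
```

The identity lets one form the small $F \times F$ Gram matrix without ever materializing the tall Khatri-Rao product.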
Very efficient algorithms also exist for the computation of matrix $\mathbf{M}$, which is a sum of Matricized Tensor Times Khatri-Rao Product (MTTKRP) terms [17, 18]. Similarly, we can derive updates for $\boldsymbol{\lambda}$:

$$\begin{aligned} \boldsymbol{\lambda} &\leftarrow (2\mathbf{G}_\lambda + \rho \mathbf{I})^{-1} (2\mathbf{m} + \rho(\tilde{\boldsymbol{\lambda}} - \mathbf{u})), \\ \tilde{\boldsymbol{\lambda}} &\leftarrow \Pi_{\mathcal{C}}(\boldsymbol{\lambda} + \mathbf{u}), \\ \mathbf{u} &\leftarrow \mathbf{u} + \boldsymbol{\lambda} - \tilde{\boldsymbol{\lambda}}. \end{aligned} \quad (25)$$

In this case we need to compute the quantities

$$\mathbf{W}_{jkl} = \mathbf{A}_l \odot \mathbf{A}_k \odot \mathbf{A}_j, \quad (26)$$

$$\mathbf{G}_\lambda = \sum_{j < k < l} (\mathbf{A}_j^T \mathbf{A}_j) \ast (\mathbf{A}_k^T \mathbf{A}_k) \ast (\mathbf{A}_l^T \mathbf{A}_l), \quad (27)$$

$$\mathbf{m} = \sum_{j < k < l} \mathbf{W}_{jkl}^T \mathrm{vec}(\underline{\mathbf{Z}}_{jkl}). \quad (28)$$

Matrix $\mathbf{G}_\lambda$ can be computed using the property (24) of the Khatri-Rao product. The vector $\mathbf{m}$ can be efficiently computed without explicitly forming the Khatri-Rao products, and also by exploiting sparsity in the tensors $\underline{\mathbf{Z}}_{jkl}$. We run the ADMM algorithm for each subproblem until the primal and dual residuals are below a certain threshold [15] or a maximum number of iterations has been reached.
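A compact sketch of one such ADMM subproblem is given below. This is our own illustrative implementation, not the authors' code: `project_simplex` follows the sorting-based method of [16], `admm_stochastic_ls` solves a generic problem of the form $\min_{\mathbf{A}} \sum_s \|\mathbf{Z}_s - \mathbf{W}_s \mathbf{A}^T\|_F^2$ with simplex constraints on the columns of $\mathbf{A}$, and the problem sizes are arbitrary.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex (sorting-based, cf. [16])
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def admm_stochastic_ls(Zs, Ws, A0, rho=1.0, iters=200):
    # Scaled ADMM with splitting A = A_tilde for:
    #   min_A sum_s ||Z_s - W_s A^T||_F^2  s.t. each column of A on the simplex
    F = A0.shape[1]
    G = sum(W.T @ W for W in Ws)               # Gram term
    M = sum(Z.T @ W for Z, W in zip(Zs, Ws))   # MTTKRP-like term
    Linv = np.linalg.inv(2 * G + rho * np.eye(F))
    A_t, U = A0.copy(), np.zeros_like(A0)
    for _ in range(iters):
        A = (2 * M + rho * (A_t - U)) @ Linv                  # least-squares step
        A_t = np.apply_along_axis(project_simplex, 0, A + U)  # projection step
        U = U + A - A_t                                       # dual update
    return A_t

# small synthetic instance: the true factor is column-stochastic, data is noiseless
rng = np.random.default_rng(4)
I1, F = 6, 3
A_true = rng.random((I1, F)); A_true /= A_true.sum(axis=0, keepdims=True)
Ws = [rng.random((20, F)) for _ in range(3)]
Zs = [W @ A_true.T for W in Ws]
A0 = np.full((I1, F), 1.0 / I1)
A_hat = admm_stochastic_ls(Zs, Ws, A0)
assert np.allclose(A_hat.sum(axis=0), 1.0) and A_hat.min() >= -1e-9
```

The returned iterate is always feasible because it is the output of the projection step; the least-squares step reuses a single precomputed $F \times F$ inverse, mirroring the cheap updates in (20).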
VI. Numerical Results
In this section, we evaluate our method on both synthetic and real datasets.
VI-A. Synthetic Dataset
We generate a low-rank five-way tensor with randomly drawn factor matrices $\mathbf{A}_1, \ldots, \mathbf{A}_5$. The elements of each factor and of the vector $\boldsymbol{\lambda}$ are drawn i.i.d. from a uniform distribution between zero and one and are normalized so that the tensor elements sum up to one. We simulate scenarios where we are given different noiseless marginal distributions of the PMF as input, that is, we observe only projections of the original tensor. The different types of input are pairwise marginals, marginals of triples, and marginals of quadruples of the random variables. We run 20 Monte Carlo simulations with randomly generated tensors and compute the mean relative error of the factors as well as the mean relative error of the recovered tensor, which are defined as follows

$$\mathrm{MRE}_{\mathrm{fact}} = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{n=1}^{N} \frac{\left\| \mathbf{A}_n^{(t)} - \hat{\mathbf{A}}_n^{(t)} \boldsymbol{\Pi}^{(t)} \right\|_F}{\left\| \mathbf{A}_n^{(t)} \right\|_F}, \quad (29)$$
TABLE I: Mean relative factor error and mean relative tensor error for pairwise, triple, and quadruple marginals as input, for different choices of the rank.
$$\mathrm{MRE}_{\mathrm{ten}} = \frac{1}{T} \sum_{t=1}^{T} \frac{\left\| \underline{\mathbf{X}}^{(t)} - \hat{\underline{\mathbf{X}}}^{(t)} \right\|_F}{\left\| \underline{\mathbf{X}}^{(t)} \right\|_F}, \quad (30)$$
where $\mathbf{A}_n^{(t)}$ is the $t$-th realization of the $n$-th factor, $\underline{\mathbf{X}}^{(t)}$ is the $t$-th tensor realization, $\boldsymbol{\Pi}^{(t)}$ is a permutation matrix to fix the inherent permutation ambiguity, and $\hat{\underline{\mathbf{X}}}^{(t)}$, $\hat{\mathbf{A}}_n^{(t)}$ are the estimated tensor and the corresponding factors. We run the alternating optimization algorithm based on ADMM until a maximum number of iterations is reached. Table I shows the mean relative factor and tensor error for the different types of input and different choices of rank. We observe that using triples or quadruples of random variables we are able to identify the true model parameters. On the other hand, using every combination of pairwise marginals we are not able to identify the true model parameters.
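The permutation-resolved factor error can be computed with a small helper like the following (our own sketch; the paper does not specify how the permutation matrix is chosen, so here we match columns with the Hungarian algorithm as one reasonable option):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def rel_factor_error(A_true, A_est):
    # Resolve the CPD column-permutation ambiguity before measuring factor error.
    F = A_true.shape[1]
    cost = np.array([[np.linalg.norm(A_true[:, i] - A_est[:, j])
                      for j in range(F)] for i in range(F)])
    _, cols = linear_sum_assignment(cost)      # best column matching
    return np.linalg.norm(A_true - A_est[:, cols]) / np.linalg.norm(A_true)

# a column-permuted copy of a factor should have (numerically) zero error
rng = np.random.default_rng(5)
A_demo = rng.random((6, 3))
err = rel_factor_error(A_demo, A_demo[:, [2, 0, 1]])
assert err < 1e-9
```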
We repeated the above experiments using a slightly perturbed five-way tensor. After generating the low-rank five-way tensor, we add zero-mean white Gaussian noise. The resulting tensor is then projected onto the probability simplex. Table II shows the mean relative factor and tensor error for the noisy data. Similarly to the noiseless case, we observe that using triples or quadruples of random variables we are able to achieve low relative tensor and factor errors.

VI-B. Real Datasets
Next, we evaluate the performance of our method on real datasets. More specifically, we test our method on the task of rating prediction. We compute a low-rank CPD model of a joint PMF by using lower-order marginals of pairs, triples, and quadruples of variables, which we estimate using a training set. Then, we use the estimated PMF in order to compute the expected value of the users' ratings that we do not observe, given the ones we observe. As a baseline algorithm we use the Biased Matrix Factorization (BMF) method [19]. We test our method using two collaborative filtering datasets, MovieLens and Jester.
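Given a fitted model $(\boldsymbol{\lambda}, \{\mathbf{A}_n\})$, the conditional-expectation prediction described above can be sketched as follows (an illustrative implementation of ours; `predict_rating`, its arguments, and the toy model are assumptions, not the authors' code):

```python
import numpy as np

def predict_rating(lam, A, observed, j, values):
    # Posterior over the latent variable given the observed entries:
    #   Pr(f | obs) ∝ Pr(f) * prod_{n in obs} Pr(x_n | f)
    post = lam.copy()
    for n, x_n in observed.items():
        post = post * A[n][x_n, :]
    post = post / post.sum()
    pmf_j = A[j] @ post           # conditional PMF of variable j given the observations
    return float(values @ pmf_j)  # its expected value, used as the predicted rating

# toy model: 4 items, 5 rating levels (1..5), latent dimension 2
rng = np.random.default_rng(6)
F, I = 2, 5
A = [rng.random((I, F)) for _ in range(4)]
A = [M / M.sum(axis=0, keepdims=True) for M in A]   # columns are conditional PMFs
lam = rng.random(F); lam /= lam.sum()
values = np.arange(1, I + 1, dtype=float)

pred = predict_rating(lam, A, observed={0: 4, 2: 1}, j=3, values=values)
assert values.min() <= pred <= values.max()
```

Because `pmf_j` is a convex combination of conditional PMFs, the prediction always lies within the rating scale.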
MovieLens [20] is a collaborative filtering dataset that contains 5-star movie ratings with 0.5-star increments. In order to test our algorithm we select three different subsets of the full dataset. Three different categories (action, animation, and romance) are selected first. From each category we extract a small and relatively dense submatrix by keeping the 10 most-rated movies.
TABLE II: Mean relative factor error and mean relative tensor error in the noisy case, for pairwise, triple, and quadruple marginals as input and different choices of the rank.
Jester [21] is a collaborative filtering dataset that contains continuous ratings (-10 to +10) of 100 jokes. Again, we extract a dense submatrix corresponding to jokes that have been rated by almost all users. The resulting dataset is processed such that the ratings correspond to integers (-10 to +10), by rounding each continuous rating.
For each dataset we randomly hide a portion of the ratings to use as a test set and another portion to use as a validation set; the remaining data are used as a training set. We run Monte Carlo simulations using the alternating optimization ADMM and BMF algorithms until no improvement is observed in the cost function. At each iteration we calculate the Root Mean Squared Error (RMSE) on the validation set, and after the algorithm converges we return the model that reports the best RMSE on the validation set.
TABLE III: RMSE and MAE of CP with pairwise, triple, and quadruple marginals, the global-average, user-average, and item-average baselines, and BMF, on the three MovieLens subsets and the Jester dataset.
Table III shows the performance of the algorithms in terms of the RMSE and Mean Absolute Error (MAE). For our method, we tested different values of the rank parameter and report the model that performed best on the test set. Similarly, for BMF we tested different values of both the rank and the regularization parameter and report the model that performed best on the test set. In addition, we present the RMSE and MAE obtained when we use as our prediction the global average of the ratings, the user average, and the item average. We observe that the performance of our method is better than the baselines when we use information corresponding to marginals of triples and quadruples of the random variables.
Figure 3 shows the performance of our method for different values of the rank. We observe that the behavior of the algorithm is similar across the different datasets: as the rank increases, the RMSE drops until it reaches a plateau.
VII. Conclusion
In this work, we proposed a method based on tensor decomposition for computing a parsimonious model of a joint PMF using lower-order marginals. We formulated the problem as a coupled tensor factorization and described an algorithmic approach to solve it. Using synthetic data, we showed that when the joint PMF admits a low-rank decomposition it is possible to recover the true factors. Finally, we showed some preliminary results on collaborative filtering datasets for rating prediction.
References
 [1] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” IEEE Transactions on Signal Processing, overview paper, to appear in 2017; arXiv:1607.01668, July 2016.
 [2] J. D. Carroll and J.-J. Chang, "Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition," Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.
 [3] R. A. Harshman and M. E. Lundy, “PARAFAC: Parallel factor analysis,” Computational Statistics & Data Analysis, vol. 18, no. 1, pp. 39–72, Aug. 1994.
 [4] L. R. Tucker, "Some mathematical notes on three-mode factor analysis," Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
 [5] A. Shashua and T. Hazan, “Nonnegative tensor factorization with applications to statistics and computer vision,” in Proceedings of the 22nd international conference on Machine learning, 2005, pp. 792–799.
 [6] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, “Tensor decompositions for learning latent variable models.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 2773–2832, 2014.
 [7] D. B. Dunson and C. Xing, “Nonparametric Bayes modeling of multivariate categorical data,” Journal of the American Statistical Association, vol. 104, no. 487, pp. 1042–1051, 2009.
 [8] Y. Chi, S. Zhu, Y. Gong, and Y. Zhang, “Probabilistic polyadic factorization and its application to personalized recommendation,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008, pp. 941–950.
 [9] L.-H. Lim and P. Comon, "Nonnegative approximations of nonnegative tensors," Journal of Chemometrics, vol. 23, no. 7-8, pp. 432–441, July 2009.
 [10] T. Hofmann, “Probabilistic latent semantic indexing,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50–57.
 [11] E. Gaussier and C. Goutte, “Relation between PLSA and NMF and implications,” in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 601–602.
 [12] A. Beutel, P. P. Talukdar, A. Kumar, C. Faloutsos, E. E. Papalexakis, and E. P. Xing, "FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop," in Proceedings of the 2014 SIAM International Conference on Data Mining, 2014, pp. 109–117.
 [13] E. Acar, T. G. Kolda, and D. M. Dunlavy, "All-at-once optimization for coupled matrix and tensor factorizations," in Proceedings of Mining and Learning with Graphs, Aug. 2011.
 [14] M. Sørensen and L. De Lathauwer, "Coupled canonical polyadic decompositions and (coupled) decompositions in multilinear rank terms—Part I: Uniqueness," SIAM Journal on Matrix Analysis and Applications, vol. 36, no. 2, pp. 496–522, 2015.
 [15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, Jan. 2010.
 [16] W. Wang and M. A. Carreira-Perpiñán, "Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application," arXiv preprint arXiv:1309.1541, 2013.
 [17] B. W. Bader and T. G. Kolda, "Efficient MATLAB computations with sparse and factored tensors," SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 205–231, 2008.
 [18] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, "SPLATT: Efficient and parallel sparse tensor-matrix multiplication," in 2015 IEEE International Parallel and Distributed Processing Symposium, May 2015, pp. 61–70.
 [19] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, Aug. 2009.
 [20] F. M. Harper and J. A. Konstan, “The MovieLens datasets: History and context,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, pp. 1–19, Dec. 2016.
 [21] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins, “Eigentaste: A constant time collaborative filtering algorithm,” Information Retrieval, vol. 4, no. 2, pp. 133–151, July 2001.