Completing a joint PMF from projections: a low-rank coupled tensor factorization approach

by   Nikos Kargas, et al.

There has recently been considerable interest in completing a low-rank matrix or tensor given only a small fraction (or few linear combinations) of its entries. Related approaches have found considerable success in the area of recommender systems, under machine learning. From a statistical estimation point of view, the gold standard is to have access to the joint probability distribution of all pertinent random variables, from which any desired optimal estimator can be readily derived. In practice high-dimensional joint distributions are very hard to estimate, and only estimates of low-dimensional projections may be available. We show that it is possible to identify higher-order joint PMFs from lower-order marginalized PMFs using coupled low-rank tensor factorization. Our approach features guaranteed identifiability when the full joint PMF is of low-enough rank, and effective approximation otherwise. We provide an algorithmic approach to compute the sought factors, and illustrate the merits of our approach using rating prediction as an example.



There are no comments yet.


page 1

page 2

page 3

page 4


Recovery of Joint Probability Distribution from one-way marginals: Low rank Tensors and Random Projections

Joint probability mass function (PMF) estimation is a fundamental machin...

Tensor Ranks for the Pedestrian for Dimension Reduction and Disentangling Interactions

A tensor is a multi-way array that can represent, in addition to a data ...

Random Projections and Dimension Reduction

This paper, broadly speaking, covers the use of randomness in two main a...

Low-Rank Matrix Approximations with Flip-Flop Spectrum-Revealing QR Factorization

We present Flip-Flop Spectrum-Revealing QR (Flip-Flop SRQR) factorizatio...

Recovering Joint Probability of Discrete Random Variables from Pairwise Marginals

Learning the joint probability of random variables (RVs) lies at the hea...

Tensors, Learning, and 'Kolmogorov Extension' for Finite-alphabet Random Vectors

Estimating the joint probability mass function (PMF) of a set of random ...

Information-theoretic Feature Selection via Tensor Decomposition and Submodularity

Feature selection by maximizing high-order mutual information between th...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Consider the setting where we are given a partially observed dataset of discrete samples , and we are interested in predicting the missing entries. This scenario often arises in recommender systems where we are interested in predicting user preferences, pertaining to news, movies or music, based on a user’s history as well as the history of other users.

Among the various approaches used in recommender systems, data completion using matrix and (more recently) tensor factorization is one of the most pervasive. The premise of factorization-based recommendation approaches is that the data approximately follow a low-rank model, i.e., there are few basic types of customers (and movies, songs, or news items), and every customer (movie, song, story) is a linear combination of the respective types. Thus a low-rank model is appropriate for the data, and can be used for completion.

In this paper, we propose a fundamentally different approach. From a statistical inference point of view, having access to the joint distribution of all variables of interest is the ‘gold standard’. Given the joint Probability Mass Function (PMF) of a set of discrete random variables, it is possible to compute any marginal or conditional probability for subsets of these variables, and use it to solve regression or classification problems. For example, one may be interested in finding the Maximum A Posteriori (MAP) estimate of an unobserved entry, or its conditional expectation given a number of observed variables. In practice however, it is often not possible to learn a joint PMF of all random variables without making restrictive assumptions, due to computational or statistical reasons; the number of free parameters grows exponentially in the number of variables. In addition, when the dataset is incomplete an imputation mechanism is needed.

In this work, we propose modeling a joint PMF of a set of random variables using a low-rank non-negative tensor (multi-way array) factorization model. In effect, we propose using a low-rank model of the joint PMF, as opposed to using a low-rank model of the raw data. Tensor factorization techniques are widely used in numerous diverse fields such as signal processing, computer vision, chemistry and more recently in machine learning and data mining 

[1]. Canonical Polyadic Decomposition (CPD) also known as PARAllel FACtor analysis (PARAFAC) [2, 3] and the Tucker decomposition [4] are the two most widely used factorization models. In this work, we focus on the CPD model which is known to be unique under mild conditions.

Any joint PMF of size can be regarded as a non-negative CPD model of non-negative rank – this is easy to see, using the same argument as for real-valued CPD in [1], and the trivial factorization . Symmetric CPD has been considered for the related (but different) problem of modeling co-occurrence data [5, 6]. The most relevant prior work is [7] (and [8]), where CPD (resp. Tucker) was used to model the joint PMF of multivariate categorical data, assuming access to the full joint PMF.

We do not assume access to the full joint PMF. The reason is that, when dealing with many random variables (large ), the probability of encountering any particular realization decays very fast (usually exponentially) as a function of . That makes joint PMF estimation, even from complete samples, essentially intractable. One needs very long data records (very high ) for reliable estimates of the joint PMF values. In this paper, we show how we can utilize information regarding lower-order marginals of subsets of the random variables to provably infer the full joint PMF. Estimates of lower-order marginals are easier to compute even in the case of missing data, and a low-rank CPD model has the advantage of reducing the dimension of the parameter space, which becomes linear in the number of variables. We formulate the problem as a coupled tensor factorization problem and derive an algorithmic approach to solve it. We illustrate the method using synthetic data and give a motivating example using real data for rating prediction.

I-a Notation

denotes a vector,

denotes a matrix and denotes a tensor. The outer product of vectors is a -way tensor with elements . The Khatri-Rao (columnwise Kronecker) product of two matrices and is . The Hadamard (elementwise) product of commensurate matrices is denoted . is the vector obtained by vertically stacking the columns of matrix . denotes the diagonal matrix with the elements of vector on its diagonal.

Ii Tensor Decomposition Preliminaries

-way tensor admits a CPD of rank if it can be decomposed as a sum of rank-1 tensors


where and is the smallest number for which such a decomposition exists (Fig. 1). We then write , with elements

Fig. 1: CPD model.

We denote the mode- matrix unfolding of as the matrix of size . We have that , where


The mode- matrix unfolding can be expressed as




We can also express a tensor in a vectorized form. , where


The vectorized form of a tensor can be expressed as

Fig. 2: Naive Bayes model.

Iii Non-Negative Tensor Factorization and Latent Variable Models

It has been shown that the joint PMF of a finite set of discrete random variables satisfying the naive Bayes hypothesis can be regarded as a non-negative CPD model [5, 9]. More specifically, let denote discrete random variables that can take one of , , distinct values respectively. We define a -way tensor that models the joint PMF of the random variables i.e., . Suppose that the discrete random variables satisfy the naive Bayes hypothesis, that is, they are conditionally independent given a hidden variable that can take distinct values (Fig. 2). Then, the joint PMF can be decomposed as


where is the prior distribution of the latent variable and , are conditional distributions. In fact, every joint PMF can be decomposed as in Equation (8) for large enough, as we explained in the introduction; the naive Bayes hypothesis is just an interpretation. Notice the similarity of equations (2) and (8). In the CPD model of the joint PMF, each column of the CPD factor matrices is a conditional PMF; and the vector

contains the prior probabilities of the latent variable


For the case , the model described by Equation (8) is equivalent to Probabilistic Latent Semantic Indexing (PLSI) [10], a popular method for document clustering based on dyadic co-occurence data which is known to be closely related to NMF with K-L divergence [11]. In the following, we focus on the general case where can be larger than two and we are interested in cases where the PMF can be approximated by a low-rank CPD model


ideally for some .

Iv Problem Formulation

In practice it is not always possible to have “point” estimates of . When is large, a very large number of samples is needed in order to obtain a reliable empirical estimate of the PMF. Furthermore, much (most) of the data may be missing, as in collaborative filtering applications. An alternative approach is to extract estimates for lower-order marginals of subsets of random variables, which can be viewed as linear measurements (sums over the remaining modes, i.e., lower-dimensional projections) of the complete tensor, and seek a joint PMF that is consistent with this information.

For brevity, we focus on the case where we have estimates of marginal distributions corresponding to every possible combination of triples of random variables, i.e., we are given estimates , , which we put in a tensor


The method can be easily generalized to any type of lower-order marginals. Under the assumption of a low-rank CPD model as in  (9), every marginal distribution of three random variables can be decomposed as follows


This is a direct consequence of the law of total probability. Marginalizing with respect to the

-th random variable we have that


since . Therefore, in order to compute an estimate of the full joint PMF, we propose solving the following optimization problem

subject to

where , , . The optimization problem in (13) is an instance of coupled tensor factorization. Coupled tensor/matrix factorization has attracted a lot of interest lately, especially in data mining as a way of combining various datasets that share dimensions and corresponding matrix factors [12, 13]. Notice that in the case where we have estimates of pair-wise marginals, the optimization problem in (13) corresponds to coupled matrix factorization.

An important question that arises is whether the parameters of the model in Equation (8) are identifiable from the lower-order marginals. We know that this is not the case when only first-order marginals are given, unless the random variables are independent. But what about the case where third-order marginals are given? The answer in this case is affirmative. In fact is is possible to derive combinatorially-many identifiability results here, so we restrict ourselves to two illustrative ones. Uniqueness conditions for coupled CPD of third-order tensors with one common factor have been provided in [14], which showed that the coupling between several CPDs can in fact enhance identifiability relative to considering individual CPDs. The result in [14] can be applied to derive identifiability conditions in our context. Another possibility is outlined next. Consider the third-order marginals for random variables (RVs) 1, 2, and a third RV. Using the mode- unfolding and stacking the marginal distributions


where we have absorbed the scaling in . Let denote the compound matrix [1] of . If rank() and rank( then the rank of the tensor is and the decomposition is essentially unique – cf. Theorem 6 in [1]. If , , and , then the rank of the tensor is and the decomposition is unique almost surely – cf. Theorem 7 in [1]. We have simply scratched the surface here; there are many more possibilities for proving identifiability under further relaxed conditions. For example, further exploiting the coupling between the factors by also accounting for the marginals that depend on , regarding as the common factor of two CPDs.

V Alternating Optimization Based on ADMM.

Alternating Optimization (AO) is one of the most commonly used methods for computing a constrained CPD model. In order to solve the optimization problem in (13) we develop an AO algorithm in which we cyclically update variables and while fixing the remaining variables at their last updated values. Problem (13) is non-convex but it becomes convex with respect to each variable if we fix the remaining ones. Assume that we fix estimates , . Then, the optimization problem with respect to becomes

subject to

Note that we have dropped the terms that do not depend on . The number of marginals that depend on the -th variable is . By using the mode-1 unfolding of each tensor , the problem can be equivalently written as

subject to

which is a least-squares problem with respect to matrix under probability simplex constraints on its columns. Similar expressions can be derived for each factor due to symmetry. Finally, in order to update we solve the following optimization problem

subject to

We use the alternating direction method of multipliers (ADMM) for solving problems (16), (17). Let be the convex set that represents the probability simplex constraints on the factors and define


which is the indicator function of set . We reformulate Problem (16) and write it as

subject to

It is easy to adopt the ADMM algorithm [15] and derive the following updates




The update of is called proximity operator of the function which is the indicator function of a convex set, and thus a projection operator. To project onto the probability simplex we use a simple complexity algorithm [16]. Note that in order to efficiently compute matrix we use a property of the Khatri-Rao product.


Very efficient algorithms also exist for the computation of matrix which is a sum of Matricized Tensor Times Khatri-Rao Product (MTTKRP) terms [17, 18]. Similarly, we can derive updates for .


In this case we need to compute matrices


Matrix can be computed using the property of the Khatri-Rao product. Matrix can be efficiently computed without explicitly forming the Khatri-Rao products and also by exploiting sparsity in . We run the ADMM algorithm for each subproblem until the primal and dual residuals are below a certain threshold [15] or a maximum number of iterations has been reached.

Vi Numerical Results

In this section, we evaluate our method on both synthetic and real datasets.

Vi-a Synthetic Dataset

We generate a low-rank five-way tensor with factor matrices , , , and . The elements of each factor and the vector

are drawn from an i.i.d uniform distribution between zero and one and are normalized so that the tensor elements sum up to one. We simulate scenarios where we are given different noiseless marginal distributions of the PMF as input, that is we observe only projections of the original tensor. The different types of input are pair-wise marginals, marginals of triples and quadruples of the random variables. We run 20 Monte Carlo simulations with randomly generated tensors and compute the mean relative error of the factors as well as the mean relative error of the recovered tensor which are defined as follows

Rank Rel. Fact. Error Rel. Ten. Error

TABLE I: Relative factor and tensor error for different choices of rank in noiseless data.

where is the -th realization of the -th factor, is the -th tensor realization, is a permutation matrix to fix the inherent permutation ambiguity and , are the estimated tensor and the corresponding factors. We run the alternating optimization algorithm based on ADMM until the maximum number of iterations is met which was set to . Table I shows the mean relative factor and tensor error for the different types of input and different choices of rank. We observe that using triples or quadruples of random variables we are able to identify the true model parameters. On the other hand, using every combination of pair-wise marginals we observe that we are not able to identify the true model parameters.

We repeated the above experiments using a slightly perturbed five-way tensor. After generating the low-rank five-way tensor, we add white Gaussian noise with standard deviation

. The low-rank tensor is then projected onto the probability simplex. Table II shows the mean relative factor and tensor error for the noisy data. Similarly to the noiseless case, we observe that using triples or quadruples of random variables we are able to achieve low relative tensor and factor errors.

Vi-B Real Datasets

Next, we evaluate the performance of our method in real datasets. More specifically, we test our method in the task of rating prediction. We compute a low-rank CPD model of a joint PMF by using lower-order marginals of pairs, triples and quadruples of variables that we estimate using a training set. Then, we use the estimated PMF in order to compute the expected value of users’ ratings that we do not observe given the ones we observe. As a baseline algorithm we use the Biased Matrix Factorization (BMF) method [19]. We test our method using two collaborative filtering datasets, Movielens and Jester.

MovieLens [20] is a collaborative filtering dataset that contains 5-star movie ratings with 0.5 star increments. In order to test our algorithm we select three different subsets of the full dataset. Three different categories (action, animation and romance) are selected first. From each category we extract a small and relatively dense submatrix by keeping the 10 most rated movies.

Rank Rel. Fact. Error Rel. Ten. Error

TABLE II: Relative factor and tensor error for different choices of rank in noisy data with .

Jester [21] is a collaborative filtering dataset that contains continuous ratings (- to +) of 100 jokes. Again, we extract a dense submatrix corresponding to jokes that have been rated from almost all users. The resulting dataset is processed such that the ratings correspond to integers ( to ) by rounding each continuous rating.

For each dataset we randomly hide ratings that we use as a test set, ratings that we use as a validation set and the remaining dataset is used as a training set. We run Monte Carlo simulations using the alternating optimization ADMM and BMF algorithms until no improvement is observed in the cost function. At each iteration we calculate the Root Mean Squared Error (RMSE) based on the validation set and after the algorithm converges we return the model that reports the best RMSE on the validation set.

MovieLens Dataset 1 MovieLens Dataset 2 MovieLens Dataset 3 Jester Dataset
CP (Pairs)
CP (Triples)
CP (Quadruples)
Global Average
User Average
Item Average
TABLE III: RMSE and MAE of different algorithms on MovieLens (Ratings are in the range [0.5-5] ) and Jester (Ratings are in the range [1-21]) datasets.
(a) MovieLens dataset 1.
(b) MovieLens dataset 2.
(c) MovieLens dataset 3.
(d) Jester dataset.
Fig. 3: RMSE as a function of rank.

Table III shows the performance of the two algorithms in terms of the RMSE and Mean Absolute Error (MAE). For our method, we tested different values for the rank parameter and report the model that performed the best on the test set. Similarly, for BMF we tested different values for both the rank and the regularization parameter and reported the model that performed the best on the test set. In addition, we present the RMSE and MAE obtained when we use as our prediction the global average of the ratings, the user average and the item average. We observe that the performance of our method is better than the baselines when we use information corresponding to marginals of triples and quadruples of the random variables.

Figure 3 shows the performance of our method for different values of rank. We observe that the behavior of the algorithm for the different datasets is similar. As rank increases the RMSE drops until it reaches a plateau.

Vii Conclusion

In this work, we proposed a method based on tensor decomposition for computing a parsimonious model of a joint PMF using lower-order marginals. We formulated the problem as coupled tensor factorization and described an algorithmic approach to solve it. When the joint PMF admits a low-rank decomposition we showed that it is possible to recover the true factors using synthetic data. Finally, we showed some preliminary results in collaborative filtering datasets for rating prediction.


  • [1] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” IEEE Transactions on Signal Processing, overview paper, to appear in 2017; arXiv:1607.01668, July 2016.
  • [2] J. D. Carroll and J.-J. Chang, “Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition,” Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.
  • [3] R. A. Harshman and M. E. Lundy, “PARAFAC: Parallel factor analysis,” Computational Statistics & Data Analysis, vol. 18, no. 1, pp. 39–72, Aug. 1994.
  • [4] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
  • [5] A. Shashua and T. Hazan, “Non-negative tensor factorization with applications to statistics and computer vision,” in Proceedings of the 22nd international conference on Machine learning, 2005, pp. 792–799.
  • [6] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, “Tensor decompositions for learning latent variable models.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 2773–2832, 2014.
  • [7] D. B. Dunson and C. Xing, “Nonparametric Bayes modeling of multivariate categorical data,” Journal of the American Statistical Association, vol. 104, no. 487, pp. 1042–1051, 2009.
  • [8] Y. Chi, S. Zhu, Y. Gong, and Y. Zhang, “Probabilistic polyadic factorization and its application to personalized recommendation,” in Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008, pp. 941–950.
  • [9] L.-H. Lim and P. Comon, “Nonnegative approximations of nonnegative tensors,” Journal of Chemometrics, vol. 23, no. 7-8, pp. 432–441, July 2009.
  • [10] T. Hofmann, “Probabilistic latent semantic indexing,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50–57.
  • [11] E. Gaussier and C. Goutte, “Relation between PLSA and NMF and implications,” in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 601–602.
  • [12] A. Beutel, P. P. Talukdar, A. Kumar, C. Faloutsos, E. E. Papalexakis, and E. P. Xing, “FlexiFact: Scalable flexible factorization of coupled tensors on hadoop,” in Proceedings of the 2014 SIAM International Conference on Data Mining, 2014, pp. 109–117.
  • [13] E. Acar, T. G. Kolda, and D. M. Dunlavy, “All-at-once optimization for coupled matrix and tensor factorizations,” in Proceedings of Mining and Learning with Graphs, Aug. 2011.
  • [14] M. Sørensen and L. D. De Lathauwer, “Coupled canonical polyadic decompositions and (coupled) decompositions in multilinear rank- terms—Part I: Uniqueness,” SIAM Journal on Matrix Analysis and Applications, vol. 36, no. 2, pp. 496–522, 2015.
  • [15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, Jan. 2010.
  • [16] W. Wang and M. A. Carreira-Perpinán, “Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application,” ArXiv preprint arXiv:1309.1541, 2013.
  • [17] B. W. Bader and T. G. Kolda, “Efficient matlab computations with sparse and factored tensors,” SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 205–231, 2008.
  • [18] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, “SPLATT: Efficient and parallel sparse tensor-matrix multiplication,” in 2015 IEEE International Parallel and Distributed Processing Symposium, May 2015, pp. 61–70.
  • [19] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, Aug. 2009.
  • [20] F. M. Harper and J. A. Konstan, “The MovieLens datasets: History and context,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, pp. 1–19, Dec. 2016.
  • [21] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins, “Eigentaste: A constant time collaborative filtering algorithm,” Information Retrieval, vol. 4, no. 2, pp. 133–151, July 2001.