Estimating a joint Probability Mass Function (PMF) of a set of random variables is of great interest in numerous applications in the fields of machine learning, data mining and signal processing. In many cases, we are given partial observations and/or statistics of the data, i.e., incomplete data, marginalized lower-dimensional distributions, or lower-order moments of the data, and our goal is to estimate the missing data. If the full joint PMF of all variables of interest were known, this would have been a straightforward task. A classical example is in recommender systems, where users rate only a small fraction of the total items (e.g., movies) and the objective is to make item recommendations to users according to predicted ratings. If the joint PMF of the item ratings is known, such recommendation is readily implementable based on the conditional expectation or mode of the unobserved ratings given the observed ratings. A closely related problem is top-recommendation, where the goal is to predict the items that a user is most likely to buy next. When the joint PMF of the items is known, it is easy to identify the items with the highest individual or joint (‘bundle’) conditional probability given the observed user ratings. Another example is data classification. If the joint PMF of the features and the label is known, then given a test sample it is easy to infer the label according to the Maximum a Posteriori
(MAP) principle. In fact, the joint PMF can be used to infer any of the features (or subsets of them), which is useful in imputing incomplete information in surveys or databases.
Despite its importance in signal and data analytics, estimating the joint PMF is often considered mission impossible in general, if no structure or relationship between the variables (e.g., a tree structure or a Markovian structure) can be assumed. This is true even when the problem size is merely moderate. The reason is that the number of unknown parameters is exponential in the number of variables. Consider a simple scenario of variables taking distinct values each. The number of parameters we need to estimate in this case is . The ‘naive’ approach for joint PMF estimation is counting the occurences of the joint variable realizations. In practice, however, when dealing with even moderately large sets of random variables, the probability of encountering any particular realization is very low. Therefore, only a small portion of the empirical distribution will be non-zero given a reasonable amount of data samples – this makes the approach very inaccurate.
In many applications, different workarounds have been proposed to circumvent this sample complexity problem. For example, in recommender systems, instead of trying to estimate the joint PMF of the ratings (which would be the estimation-theoretic gold standard), the most popular approach is based on low-rank matrix completion [2, 3, 4]
. The idea is that the users can be roughly clustered into several types, and users of the same type would rate different movies similarly. Consequently, the user-rating matrix is approximately low rank and this is used as prior information to infer the missing ratings. In classification, parsimonious function approximations are employed to model the relationship (or the conditional probability function) between the features and the label. Successful methods that fall into this category are support vector machines (linear function approximation), logistic regression (log-linear function approximation) and more recently kernels and neural networks (nonlinear function approximation).
The above mentioned methods are nice and elegant, have triggered a tremendous amount of theoretical research and practical applications, and have been successful in many ways. However, these workarounds have not yet answered our question of interest: Can we ever reliably estimate the joint PMF of variables given limited data? This question is very well-motivated in practice, since knowledge of the joint PMF is indeed the gold standard: it enables optimal estimation under a variety of well-established criteria, such as mean-square error and minimum probability of error or Bayes risk. Knowing the joint PMF can facilitate a large variety of applications including recommender systems and classification in a unified and statistically optimal way, instead of resorting to often ad-hoc modeling tools.
This paper shows, perhaps surprisingly, that if the joint PMF of any three variables can be estimated, then the joint PMF of all the variables can be provably recovered under relatively mild conditions. The result is reminiscent of Kolmogorov’s extension theorem – consistent specification of lower-dimensional distributions induces a unique probability measure for the entire process. The difference is that for processes of limited complexity (rank of the high-dimensional PMF) it is possible to obtain complete characterization from only three-dimensional distributions. In fact not all three-dimensional PMFs are needed; and under more stringent conditions even two-dimensional will do. The rank condition on the high-dimensional joint PMF has an interesting interpretation: loosely speaking, it means that the random variables are ‘reasonably (in)dependent’. This makes sense, because estimation problems involving fully independent or fully dependent regressors and unknowns are contrived – it is the middle ground that is interesting. It is also important to note that the marginal PMFs of triples can be reliably estimated at far smaller sample complexity than the joint PMF of all variables. For example, for user-movie ratings, the marginal PMF of three given variables (movies) can be estimated by counting the co-occurrences of the given ratings (values of the variables) of the three given movies; but no user can rate all movies.
Contributions Our specific contributions are as follows:
We propose a novel framework for joint PMF estimation given limited and possibly very incomplete data samples. Our method is based on a nice and delicate connection between the Canonical Polyadic Decomposition (CPD) [6, 7]
and the naive Bayes model. The CPD model, sometimes referred to as the Parallel Factor Analysis (PARAFAC) model, is a popular analytical tool from multiway linear algebra. The CPD model has been used to model and analyze tensor data (data with more than two indices) in signal processing and machine learning, and it has found many successful applications, such as speech separation, blind CDMA detection , array processing , spectrum sensing and unmixing in cognitive radio , topic modeling , and community detection  – see the recent overview paper in . Nevertheless, CPD has never been considered as a statistical learning tool for recovering a general joint PMF and our work is the first to establish the exciting connection 111There are works that considered using CPD to model a joint PMF for some specific problems . However, these works rely on specific physical interpretation of the associated model, which is sharply different to our setup – in which we employ the CPD model to explain a general joint PMF without assuming any physical model..
We present detailed identifiability analysis of the proposed approach. We first show that, any joint PMF can be represented by a naive Bayes model with a finite-alphabet latent variable – and the size of the latent alphabet (which happens to be the rank of the joint PMF tensor, as we will see) is bounded by a function of the alphabet sizes of the (possibly intermittently) observed variables. We further show that, if the latent alphabet size is under a certain threshold, then the joint PMF of an arbitrary number of random variables can be identified from three-dimensional marginal distributions. We prove this identifiability result by relating the joint PMF and marginal PMFs to the CPD model, which is known for its uniqueness even when the tensor rank is much larger than its outer dimensions.
In addition to the novel formulation and identifiability results, we also propose an easily implementable joint PMF recovery algorithm. Our identification criterion can be considered as a coupled simplex-constrained tensor factorization problem, and we propose a very efficient alternating optimization-based algorithm to handle it. To deal with the probability simplex constraints that arise for PMF estimation, the celebrated Alternating Direction Method of Multipliers (ADMM) algorithm is employed, resulting in lightweight iterations. Judiciously designed simulations and real experiments on movie recommendation and classification tasks are used to showcase the effectiveness of the approach.
Preliminary version of part of this work appeared at ITA 2017 . This journal version includes new and stronger identifiability theorems and interpretations, detailed analysis of the theorems, and insightful experiments on a number of real datasets.
Bold, lowercase and uppercase letters denote vectors and matrices respectively. Bold, underlined, uppercase letters denote -way () tensors. Uppercase (lowercase) letters denote scalar random variables (realizations thereof, respectively). The outer product of vectors is a -way tensor with elements . The Kronecker product of matrices and is denoted as . The Khatri-Rao (column-wise Kronecker) product of matrices and is denoted as . The Hadamard (element-wise) product of matrices and is denoted as . We define the vector obtained by vertically stacking the elements of a tensor into a vector. Additionally, denotes the diagonal matrix with the elements of vector on its diagonal. The set of integers is denoted as and denotes the cardinality of the set .
Ii Problem Statement
Consider a set of random variables, i.e., . Assume that each can take discrete values and only the joint PMFs of variable triples, i.e., ’s, are available. Can we identify the joint PMF of , i.e., , from the three-dimensional marginals? This question lies at the heart of statistical learning. To see this, consider a classification problem and let represent the set of observed features, and the sought label. If
is known, then given a specific realization of the features, one can easily compute the posterior probability
and predict the label according the MAP principle (here is shorthand for and likewise for . In recommender systems, given a set of observed item ratings one can compute the conditional expectation of an unobserved rating given the observed ones
At this point the reader may wonder why we consider recovery from three-dimensional joint PMFs and not from one- or two-dimensional PMFs. It is well-known that recovery from one-dimensional marginal PMFs is possible when all random variables are known to be independent. In this case, the joint PMF is equal to the product of the individual one-dimensional marginals. Interestingly, recovery from one-dimensional marginals is also possible when the random variables are known to be fully dependent i.e., one is completely determined by the other. In this case, the joint PMF can be recovered if each one-dimensional marginal is a unique permutation of the other.
However, complete (in)dependence is unrealistic in statistical estimation and learning practice. In general it is not possible to recover a joint PMF from one-dimensional marginals. An illustration for two variables is shown in Figure 2: can be represented as a matrix, and , are ‘projections’ of the matrix along the row and column directions using the projector and , respectively: and . In this case, if we denote the matrix such that , then if and are not independent. From basic linear algebra, one can see that knowing and is not enough for recovering in general – since this is equivalent to solving a very underdetermined system of linear equations with variables but only equations.
What if we know two-dimensional marginals? When the given random variables obey a probabilistic graphical model, and a genie reveals that model to us, then estimating a high-dimensional joint PMF from two-dimensional marginals may be possible. An example is shown in Figure 3. If we know a priori that random variables and are conditionally independent given , one can verify that knowledge of and is sufficient to recover . However, this kind of approach hinges on knowing the probabilistic graph structure. Unfortunately, genies are hard to come by in real life, and learning the graph structure from data is itself a very challenging problem in statistical learning .
In our problem setup, we do not assume any a priori knowledge of the graph structure, and in this sense we have a ‘blind’ joint PMF recovery problem. Interestingly, under certain conditions, this is no hindrance.
Our framework is heavily based on low-rank tensor factorization and its nice identifiability properties. To facilitate our later discussion, we briefly introduce pertinent aspects of tensors in this section.
Iii-a Rank Decomposition
An -way tensor is a data array whose elements are indexed by indices. A two-way tensor is a matrix, whose elements have two indices; i.e., denotes the -th element of the matrix . If a matrix has rank , it admits a rank decomposition where we have and denotes the outer product of two vectors, i.e., . Similarly, if an -way tensor has rank , it admits the following rank decomposition:
where and is the smallest number for which such a decomposition exists. For convenience, we use the notation to denote the decomposition. The above rank decomposition is also called the Canonical Polyadic Decomposition (CPD) or Parallel Factor Analysis (PARAFAC) model of a tensor. It is critical to note that every tensor admits a CPD, and that the rank is not necessarily smaller than – the latter is in sharp contrast to the matrix case .
In the matrix case, it is easy to see that . Similarly, for an -way tensor we have . Sometimes one wishes to restrict the columns of ’s to have unit norm (e.g., as in SVD). Therefore, the tensors can be represented as
where for a certain , , and with is employed to ‘absorb’ the norms of columns. An illustration of a three-way tensor and its CPD is shown in Figure 4. Under such cases, we denote the -way tensor as – again, in this expression, we have automatically assumed that , and a certain . We will refer to the decomposition of into nonnegative factors , as nonnegative decomposition.
The following definitions will prove useful in the rest of the paper. We define the mode- matrix unfolding of as the matrix of size . We have that , where
In terms of the CPD factors, the mode- matrix unfolding can be expressed as
We can also express a tensor in a vectorized form , where
In terms of the CPD factors, the vectorized form of a tensor can be expressed as
Iii-B Uniqueness of Rank Decomposition
A distinctive feature of tensors is that they have essentially unique CPD under mild conditions – even when is much larger than . To continue our discussion, let us first formally define what we mean by essential uniqueness of rank decomposition of tensors.
(Essential uniqueness) For a tensor of (nonnegative) rank , we say that a nonnegative decomposition , , is essentially unique if the factors are unique up to a common permutation. This means that if there exists another nonnegative decomposition , then, there exists a permutation matrix such that
In other words, if a tensor has an essentially unique nonnegative CPD, then the only ambiguity is column permutation of the column-normalized factors , which simply amounts to a permutation of the rank-one ‘chicken feet’ outer products (rank-one tensors) in Fig. 4, that is clearly unavoidable222Generally, there is also column scaling / counter-scaling ambiguity : a red column can be multiplied by and the corresponding yellow column divided by without any change in the outer product. There is no scaling ambiguity for nonnegative column-normalized representation , where there is obviously no sign ambiguity and all scaling is ‘absorbed’ in .. Regarding the essential uniqueness of tensors, let us consider the three-way case first. The following is arguably the most well-known uniqueness condition that was revealed by Kruskal in 1977.
 Let , where , , . If then and the decomposition of is essentially unique.
Here, denotes the Kruskal rank of the matrix which is equal to the largest integer such that every subset of columns are linearly independent. Lemma 1 implies the following generic result: The decomposition is essentially unique, almost surely, if
This is because with probability one if the elements of are generated following a certain absolutely continuous distribution. More relaxed and powerful uniqueness conditions have been proven in recent years.
 Let , where , , , . Let be the largest integers such that and . If then the decomposition of is essentially unique almost surely. The condition also implies that if , then has a unique decomposition almost surely.
There are many more different uniqueness conditions for CPD. The take-home point here is that the CPD model is essentially generically unique even if is much larger than – so long it is less than maximal possible rank. For example, in Lemma 3, can be as large as (but not equal to ), and the CPD model is still unique.
We should mention that the above identifiability results are derived for tensors under a noiseless setup333In this context, noise will typically come from insufficient sample averaging in empirical frequency estimation.. In addition, although the results are stated for real factor matrices, they are very general and also cover nonnegative ’s due to the fact that the nonnegative orthant has positive measure. It follows that if a tensor is generated using random nonnegative factor matrices then under the noiseless setup, a plain CPD can recover the true nonnegative factors. On the other hand, in practice, instead of considering exact tensor decomposition, often low-rank tensor approximation is of interest, because of limited sample size and other factors. The best low-rank tensor approximation might not even exist in this case; fortunately, adding structural constraints on the latent factors can mitigate this, see . In this work, our interest lies in revealing the fundamental limits of joint PMF estimation. Therefore, our analysis will be leveraging exact decomposition results, e.g., Lemmas 2-3. However, since the formulated problem naturally involves nonnegative latent factors, our computational framework utilizes this structural prior knowledge to enhance performance in practice.
Iv Naive Bayes Model: A Rank-decomposition Perspective
We will show that any joint PMF admits a naive Bayes model representation
, i.e., it can be generated from a latent variable model with just one hidden variable. The naive Bayes model postulates that there is a hidden discrete random variabletaking possible values, such that given the discrete random variables are conditionally independent. It follows that the joint PMF of can be decomposed as
where is the prior distribution of the latent variable and are the conditional distributions (Fig. 5). The naive Bayes model in (7) is also referred to as the latent class model  and is the simplest form of a Bayesian network . It has been employed in diverse applications such as classification , density estimation  and crowdsourcing , just to name a few.
An interesting observation is that the naive Bayes model can be interpreted as a special nonnegative polyadic decomposition. This was alluded to in [24, 25] but not exploited for identifying the joint PMF from lower-dimensional marginals, as we do. Consider the element-wise representation in (3) and compare it with (7): each column of the factor matrices can represent a conditional PMF and the vector
contains the prior probabilities of the latent variable, i.e.,
This is a special nonnegative polyadic decomposition model because it restricts . There is a subtle point however: the maximal rank in a CPD (canonical polyadic decomposition) model is bounded, but the number of latent states (latent alphabet size) for the naive Bayes model may exceed this bound. Even if the number of latent states is under the maximal rank bound, a naive Bayes model may be reducible, in the sense that there exists a naive Bayes model with fewer latent states that generates the same joint PMF. The net result is that every joint PMF admits a naive Bayes model interpretation with bounded , and every naive Bayes model is or can be reduced to a special CPD model. We have the following result.
The maximum needed to represent an arbitrary PMF as a naive Bayes model is bounded by the following inequality
Let denote a joint PMF of three random variables i.e., . We define the following matrices
where and have used MATLAB notation to denote the frontal slabs of the tensor . Additionally,
denotes the identity matrix of sizeand is a vector of all ’s of size . Then every frontal slab of the tensor can be synthesized as Upon normalizing the columns of matrix such that they sum to one and absorbing the scaling in , i.e., we can decompose the tensor as . The number of columns of each factor is . Due to role symmetry, by permuting the modes of the tensor it follows that we need at most columns for each factor for exact decomposition.
The result is easily generalized to a four-way tensor by noticing that each slab is a three-way tensor and thus can be decomposed as as before. We define
The four-way tensor can therefore be decomposed as . Due to symmetry, the number of columns of each factor is at most . By the same argument it follows that for a -way tensor the bound on the nonnegative rank is . ∎
The proof of Proposition 1
employs the same type of argument used to prove the upper bound on tensor rank. The main difference is in the normalization – latent nonnegativity follows from data nonnegativity “for free” since the latent factors used for constructing the CPD are either fibers drawn from the joint PMF itself, or from identity matrices or Kronecker products thereof. While the proof is fairly straightforward for someone versed in tensor analysis, the implication of this proposition to probability theory is significant: it asserts thatevery joint PMF can be represented by a naive Bayes model with a bounded number of latent states
. In fact, the connection between a naive Bayes model and CPD was utilized to approach some machine learning problems such as community detection and Gaussian Mixture Model (GMM) estimation in. However, in those cases, the hidden variable has a specific physical meaning (e.g., represents the th community in community detection) and thus connection was established using a specific data generative model. Here, we emphasize that even when there is no physically meaningful or presumed generative model, one can always represent an arbitrary joint PMF, possibly corresponding to a very complicated probabilistic graphical model, as a “simple” naive Bayes model with a bounded number of latent states . This result is very significant, also because it spells out that the latent structure of a probabilistic graphical model cannot be identified by simply assuming few hidden nodes; one has to limit the number of hidden node states as well.
We should remark that although any joint PMF admits a naive Bayes representation, this does not mean that such representation is unique. Clearly, needs to be strictly smaller than the upper bound in (9) to guarantee uniqueness (cf. Lemmas 1-3). Fortunately, many joint PMFs that we encounter in practice are relatively low-rank tensors, since random variables in the real world are only moderately dependent. This leads to an interesting connection between linear dependence/independence and statistical dependence/independence. To explain, let us consider the simplest case where . In this case, we have
The two-way model corresponds to Nonnegative Matrix Factorization (NMF) and is related to Probabilistic Latent Semantic Indexing (PLSI) , . For the two-way model, independence of the variables implies that the probability matrix is rank-. On the other hand, when the variables are fully dependent i.e., the value of one variable exactly determines the value of the other, the probability matrix is full-rank. However, low-rank does not necessarily mean that the variables are close to being independent as shown in Figure 6. There, a low rank probability matrix () can also model highly dependent random variables. In practice, we expect that random variables will be neither independent nor fully dependent and we are interested in cases where the rank of the joint PMF is lower (and ideally much lower) than the upper bound given in Proposition 1.
As a sanity check, we conducted preliminary experiments on some real-life data. As anticipated, we verified that many joint PMFs are indeed low-rank tensors in practice. Table I shows interesting results: The joint PMF of three movies over rating values was first estimated, using data from the MovieLens project. The joint PMF is then factored using a nonnegative CPD model with different rank values. One can see that with rank as low as , the modeling error in terms of the relative error is quite small, meaning that the low-rank modeling is fairly accurate. The same applies to two more datasets drawn from the UCI repository.
V Joint PMF Recovery
V-a General Procedures
The key observation that enables our approach is that the marginal distribution of any subset of random variables is also a nonnegative CPD model. This is a direct consequence of the law of total probability. Marginalizing with respect to the-th random variable we have that
Consider the model in (7) and assume that the marginal distributions denoted for brevity, are available and perfectly known. Then, there exists an exact decomposition of the form
The marginal distributions of triples of random variables satisfy , where and are defined as in (8) and they satisfy , , , , , , , . Based on the connection between the naive Bayes model of lower-dimensional marginals and the joint PMF, we propose the following steps to recover the complete joint PMF from three-dimensional marginals.
Procedure: Joint PMF Recovery From Triples [S1] Estimate from data; [S2] Jointly factor to estimate using a CPD model with rank ; [S3] Synthesize the joint PMF via , w/ , .
One can see from step [S2], that if the individual factorization of at least one is unique, then the joint PMF is readily identifiable via [S3]. This is already very interesting. However, as we will show in Sec. VI, we may identify the joint PMF even when the marginal tensors do not have unique CPD. The reason is that many marginal tensors share factors and we can exploit this to come up with much stronger identifiability results.
V-B Algorithm: Coupled Matrix/Tensor Factorization
Before we discuss theoretical results such as identifiability of the joint PMF using three or higher-dimensional marginals, we first propose an implementation of [S2] in the proposed procedure. For brevity, we assume we have estimates of three-dimensional marginal distributions, i.e., we are given empirical estimates , which we put in a tensor i.e., .
The method can be easily generalized to any type of low-dimensional marginal distributions. Under the assumption of a low-rank CPD model, every empirical marginal distribution of three random variables can be approximated as follows
Therefore, in order to compute an estimate of the full joint PMF, we propose solving the following optimization problem
The optimization problem in (14) is an instance of coupled tensor factorization. Coupled tensor/matrix factorization is usually used as a way of combining various datasets that share dimensions and corresponding factor matrices [28, 29]. Notice that in the case where we have estimates of two-dimensional marginals, the optimization problem in (14) corresponds to coupled matrix factorization. The optimization problem per se is very challenging and deserves developing sophisticated algorithms for handling it: first, when the number of random variables () gets large, there is a large number of optimization variables (i.e., ) to be determined in (14) – and each is an matrix where (the alphabet size of the -th random variable) can be large. In addition, the probability simplex constraints impose some extra computational burden. Nevertheless, we found that, by carefully re-arranging terms, the formulated problem can be recast in convenient form and handled in a way that is reminiscent of the classical alternating least squares algorithm with constraints.
The idea is that we cyclically update variables and while fixing the remaining variables at their last updated values. Assume that we fix estimates , . Then, the optimization problem with respect to becomes
Note that we have dropped the terms that do not depend on . By using the mode- matrix unfolding of each tensor , the problem can be equivalently written as
which is a least-squares problem with respect to matrix under probability simplex constraints on its columns. The optimization problem has the same form for each factor due to role symmetry. In order to update we solve the following optimization problem
Both Problems (16) and (17) are linearly constrained quadratic programs, and can be solved to optimality by many standard solvers. Here, we propose to employ the Alternating Direction Method of Multipliers (ADMM) to solve these two sub-problems because of its flexibility and effectiveness in handling large-scale tensor decomposition [30, 31]. Details of the ADMM algorithm for solving Problems (16)-(17) can be found in the Appendix B. The whole procedure is listed in Algorithm 1. As mentioned, the algorithm is easily modified to cover the cases where higher-dimensional marginals or pairwise marginals are given, and thus these cases are omitted.
Vi Joint PMF Identifiability Analysis
In this section, we study the conditions under which we can identify from marginalized lower-dimensional distributions. For brevity, we focus on three-dimensional as lower-dimensional distributions, and even though many more results are possible, we concentrate here on the case for ease of exposition and manuscript length considerations. Similar type of analysis applies when are different, however the analysis should be customized to properly address particular cases. Our aim here is to convey the spirit of what is possible in terms of identifiability results, as we cannot provide an exhaustive treatment (there are combinatorially many cases, clearly).
Obviously, if is individually identifiable for each combination of , then, , , , and are identifiable. This means that given three-dimensional marginal distributions, is generically identifiable if assuming that . This can be readily shown by invoking Lemma 1, equation (6), and the link between the naive Bayes model and tensor factorization discussed in Sec. IV. Note that is already not a bad condition, since in many cases we have approximately low-rank tensors in practice. However, since we have many factor-coupled ’s, this identifiability condition can be significantly improved. We have the following theorems.
Assume that are drawn from an absolutely continuous distribution, that , and that the joint PMF can be represented using a naive Bayes model of rank . If then, is almost surely (a.s) identifiable from the ’s if
If then, is a.s. identifiable from the ’s if
The proof is relegated to Appendix A. ∎
Assume that are drawn from an absolutely continuous distribution, that , and that the joint PMF can be represented using a naive Bayes model of rank . Let be the largest integer such that . Then, is a.s. identifiable from the ’s if
which is implied by
The proof is relegated to Appendix A. ∎
The rank bounds in Theorems 1-2 are nontrivial, albeit far from the maximal attainable rank for the cases considered. Recalling that higher-order tensors are identifiable for higher ranks, a natural question is whether knowledge of four- or higher-dimensional marginals can further enhance identifiability of the complete joint PMF. The next theorem shows that the answer is affirmative.
Assume that are drawn from an absolutely continuous distribution, that , and that the joint PMF can be represented using a naive Bayes model of rank . Further assume that can be partitioned into disjoint subsets denoted by such that the four-dimensional marginals are available. Then, the joint PMF is a.s. identifiable if