1 Introduction
Learning mixture models is a fundamental problem in statistics and machine learning with numerous applications such as density estimation and clustering. In this work, we consider the special case of mixture models whose component distributions factor into the product of the associated marginals. An example is a mixture of axis-aligned Gaussian distributions, an important class of Gaussian Mixture Models (GMMs). Consider a scenario where different diagnostic tests are applied to patients, and test results are assumed to be independent conditioned on the binary disease status of the patient, which is the latent variable. The joint Probability Density Function (PDF) of the tests can be expressed as a weighted sum of two components, and each component factors into the product of univariate marginals. Fitting a mixture model to an unlabeled dataset allows us to cluster the patients into two groups by determining the value of the latent variable using the Maximum a Posteriori (MAP) principle.
Most of the existing literature in this area has focused on the fully-parametric setting, where the mixture components are members of a parametric family, such as Gaussian distributions. The most popular algorithm for learning a parametric mixture model is Expectation Maximization (EM)
(Dempster et al., 1977). Recently, methods based on tensor decomposition, and particularly the Canonical Polyadic Decomposition (CPD), have gained popularity as an alternative to EM for learning various latent variable models (Anandkumar et al., 2014). What makes the CPD a powerful tool for data analysis is its identifiability properties, as the CPD of a tensor is unique under relatively mild rank conditions (Sidiropoulos et al., 2017).

In this work we propose a two-stage approach based on tensor decomposition for recovering the conditional densities of mixtures of smooth product distributions. We show that when the unknown conditional densities are approximately bandlimited it is possible to uniquely identify and recover them from partially observed data. The key idea is to jointly factorize histogram estimates of lower-dimensional PDFs that can be easily and reliably estimated from observed samples. The conditional densities can then be recovered using an interpolation procedure. We formulate the problem as a coupled tensor factorization and propose an alternating-optimization algorithm. We demonstrate the effectiveness of the approach on both synthetic and real data.
Notation: Bold lowercase letters, $\mathbf{x}$, and bold uppercase letters, $\mathbf{X}$, denote vectors and matrices respectively. Bold, underlined, uppercase letters, $\underline{\mathbf{X}}$, denote $N$-way ($N \geq 3$) tensors. We use the notation $\mathbf{x}(i)$, $\mathbf{X}(i,j)$, $\underline{\mathbf{X}}(i,j,k)$ to refer to specific elements of a vector, matrix, and tensor respectively. We denote the vector obtained by vertically stacking the columns of the tensor $\underline{\mathbf{X}}$ by $\mathrm{vec}(\underline{\mathbf{X}})$. Additionally, $\mathrm{Diag}(\mathbf{x})$ denotes the diagonal matrix with the elements of vector $\mathbf{x}$ on its diagonal. The set of integers $\{1,\ldots,N\}$ is denoted as $[N]$. Uppercase letters, $X$, and lowercase letters, $x$, denote scalar random variables and realizations thereof, respectively.
2 Background
2.1 Canonical Polyadic Decomposition
In this section, we briefly introduce basic concepts related to tensor decomposition. An $N$-way tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is a multidimensional array whose elements are indexed by $N$ indices. A polyadic decomposition expresses $\underline{\mathbf{X}}$ as a sum of rank-one terms
$$\underline{\mathbf{X}} = \sum_{f=1}^{F} \mathbf{a}_f^{(1)} \circ \mathbf{a}_f^{(2)} \circ \cdots \circ \mathbf{a}_f^{(N)}, \quad (1)$$
where $\mathbf{a}_f^{(n)} \in \mathbb{R}^{I_n}$, $f \in [F]$, denotes the $f$-th column of matrix $\mathbf{A}_n = [\mathbf{a}_1^{(n)}, \ldots, \mathbf{a}_F^{(n)}] \in \mathbb{R}^{I_n \times F}$, and $\circ$ denotes the outer product. If the number $F$ of rank-one terms is minimal, then Equation (1) is called the CPD of $\underline{\mathbf{X}}$ and $F$ is called the rank of $\underline{\mathbf{X}}$. Without loss of generality, we can restrict the columns of each $\mathbf{A}_n$ to have unit norm and obtain the following equivalent expression
$$\underline{\mathbf{X}} = \sum_{f=1}^{F} \boldsymbol{\lambda}(f)\, \mathbf{a}_f^{(1)} \circ \mathbf{a}_f^{(2)} \circ \cdots \circ \mathbf{a}_f^{(N)}, \quad (2)$$
where $\|\mathbf{a}_f^{(n)}\|_p = 1$ for a certain $p \geq 1$, and $\boldsymbol{\lambda} \in \mathbb{R}^{F}$ 'absorbs' the norms of the columns. For convenience, we use the shorthand notation $\underline{\mathbf{X}} = [\![\boldsymbol{\lambda}, \mathbf{A}_1, \ldots, \mathbf{A}_N]\!]$. We can express the CPD of a tensor in a matricized form. With $\odot$ denoting the Khatri-Rao (column-wise Kronecker) matrix product, it can be shown that the mode-$n$ matrix unfolding of $\underline{\mathbf{X}}$ is given by
$$\mathbf{X}^{(n)} = \mathbf{A}_{(-n)}\, \mathrm{Diag}(\boldsymbol{\lambda})\, \mathbf{A}_n^{T}, \quad (3)$$
where $\mathbf{A}_{(-n)} = \mathbf{A}_N \odot \cdots \odot \mathbf{A}_{n+1} \odot \mathbf{A}_{n-1} \odot \cdots \odot \mathbf{A}_1$. The CPD can be expressed in a vectorized form as
$$\mathrm{vec}(\underline{\mathbf{X}}) = \left( \mathbf{A}_N \odot \cdots \odot \mathbf{A}_1 \right) \boldsymbol{\lambda}. \quad (4)$$
It is clear that the rank-one terms can be arbitrarily permuted without affecting the decomposition. We say that the CPD of a tensor is unique when it is only subject to this trivial indeterminacy.
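As a sanity check on the algebra, the unfolding identity can be verified numerically for a third-order tensor. The sketch below (numpy-based, with illustrative sizes; not part of the paper) builds a rank-$F$ tensor from its factors and compares the mode-1 unfolding against the Khatri-Rao expression. Note that the ordering of the Khatri-Rao factors depends on the unfolding convention; with C-order reshaping, the mode-1 unfolding equals $\mathbf{A}_1 \mathrm{Diag}(\boldsymbol{\lambda}) (\mathbf{A}_2 \odot \mathbf{A}_3)^T$.

```python
import numpy as np

def khatri_rao(B, C):
    # Column-wise Kronecker product of B (J x F) and C (K x F) -> (J*K x F)
    J, F = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, F)

rng = np.random.default_rng(0)
I, J, K, F = 4, 5, 6, 3
A1, A2, A3 = rng.random((I, F)), rng.random((J, F)), rng.random((K, F))
lam = rng.random(F)

# Build the tensor as a sum of F rank-one terms (Equation (2) for N = 3)
X = np.zeros((I, J, K))
for f in range(F):
    X += lam[f] * np.einsum('i,j,k->ijk', A1[:, f], A2[:, f], A3[:, f])

# Mode-1 unfolding under C-order reshaping: X_(1) = A1 Diag(lam) (A2 ⊙ A3)^T
X1 = A1 @ np.diag(lam) @ khatri_rao(A2, A3).T
```

The same identity holds for every mode after permuting the roles of the factors.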
2.2 Learning Problem
Let $\mathcal{X} = \{X_1, \ldots, X_N\}$ denote a set of $N$ random variables. We say that their joint PDF is a mixture of $F$ component distributions if it can be expressed as a weighted sum of $F$ multivariate distributions
$$f_{\mathcal{X}}(x_1, \ldots, x_N) = \sum_{f=1}^{F} w_f\, f_{\mathcal{X}|H}(x_1, \ldots, x_N \mid f), \quad (5)$$
where $f_{\mathcal{X}|H}(\cdot \mid f)$ are conditional PDFs, $H$ is the latent component variable, and $w_f$ are nonnegative numbers such that $\sum_{f=1}^{F} w_f = 1$, called mixing weights. When each conditional PDF factors into the product of its marginal densities we have that
$$f_{\mathcal{X}}(x_1, \ldots, x_N) = \sum_{f=1}^{F} w_f \prod_{n=1}^{N} f_{X_n|H}(x_n \mid f), \quad (6)$$
which can be seen as a continuous extension of the CPD model of Equation (2). A sample from the mixture model is generated by first drawing a component $f$ according to the mixing weights and then independently drawing a sample of every variable $X_n$ from the conditional PDF $f_{X_n|H}(\cdot \mid f)$. The problem of learning the mixture is that of finding the conditional PDFs as well as the mixing weights given observed samples.
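The generative process just described is straightforward to simulate. A minimal sketch (Gaussian conditionals and example weights chosen purely for illustration): draw the latent component first, then each coordinate independently given that component.

```python
import numpy as np

rng = np.random.default_rng(1)
N, F = 3, 2                           # number of variables and mixture components
w = np.array([0.4, 0.6])              # mixing weights (example values)
mu = rng.normal(0.0, 3.0, size=(F, N))    # per-component, per-variable means
sigma = np.ones((F, N))                   # per-component, per-variable std devs

def sample_mixture(S):
    # Draw the latent component for each sample, then each coordinate
    # independently from its conditional PDF (product structure)
    r = rng.choice(F, size=S, p=w)
    return mu[r] + sigma[r] * rng.normal(size=(S, N)), r

X, labels = sample_mixture(10000)
```

Any family of univariate conditionals can be substituted; only the conditional independence given the component matters.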
2.3 Related Work
Mixture models have numerous applications in statistics and machine learning, including clustering and density estimation (McLachlan and Peel, 2000). A common assumption made in multivariate mixture models is a parametric form of the conditional PDFs. For example, when the conditional PDFs are assumed to be Gaussian, the goal is to recover the mean vectors and covariance matrices defining each multivariate Gaussian component and the mixing weights. Other common choices include categorical, exponential, Laplace, or Poisson distributions. The most popular algorithm for learning the parameters of the mixture is the EM algorithm
(Dempster et al., 1977), which maximizes the likelihood function with respect to the parameters. EM-based methods have also been considered for learning mixture models of nonparametric distributions (the term nonparametric describes the case in which no assumptions are made about the form of the conditional densities) by parameterizing the unknown conditional PDFs using kernel density estimators (Benaglia et al., 2009; Levine et al., 2011); these methods, however, lack theoretical guarantees.

Tensor decomposition methods can be used as an alternative to EM for learning various latent variable models (Anandkumar et al., 2014). High-order moments of several probabilistic models can be expressed using low-rank CPDs, and decomposing these moment tensors reveals the true parameters of the probabilistic models. In the absence of noise and model mismatch, algebraic algorithms can be applied to compute the CPD under certain conditions; see (Sidiropoulos et al., 2017) and references therein, and (Hsu and Kakade, 2013) for the application to GMMs. Tensor decomposition approaches have been proposed for learning mixture models but are mostly restricted to Gaussian or categorical distributions (Hsu and Kakade, 2013; Jain and Oh, 2014; Gottesman et al., 2018). In practice, mainly due to sampling noise, the result of these algorithms may not be satisfactory, and EM can be used for refinement (Zhang et al., 2014; Ruffini et al., 2017). In the case of nonparametric mixtures of product distributions, identifiability of the components has been established in (Allman et al., 2009). The authors have shown that it is possible to identify the conditional PDFs given the true joint PDF, if the conditional PDFs of each variable across different mixture components are linearly independent, i.e., the continuous factor "matrices" have linearly independent columns. However, the exact true joint PDF is never given; only samples drawn from it are available in practice, and elements may be missing from any given sample. Furthermore, (Allman et al., 2009) did not provide an estimation procedure, which limits the practical appeal of an interesting theoretical contribution.

In this work, we focus on mixtures of product distributions of continuous variables and do not specify a parametric form for the conditional density functions. We show that it is possible to recover mixtures of smooth product distributions from observed samples. The key idea is to first transform the problem into that of learning a mixture of categorical distributions by decomposing lower-dimensional and (possibly coarsely) discretized joint PDFs. Given that the conditional PDFs are (approximately) bandlimited (smooth), they can be recovered from the discretized PDFs under certain conditions.
3 Approach
Our approach consists of two stages. We express the problem as a tensor factorization problem and show that, under mild rank conditions, we can recover points of the unknown conditional CDFs. Under a smoothness condition, these points can be used to recover the true conditional PDFs using an interpolation procedure.
3.1 Problem Formulation
We assume that we are given $N$-dimensional samples that have been generated from a mixture of product distributions as in Equation (6). We discretize each random variable $X_n$ by partitioning its support into $I_n$ uniform intervals $\{\Delta_n^{i}\}_{i=1}^{I_n}$. Specifically, we consider a discretization of the PDF and define the probability tensor (histogram) $\underline{\mathbf{X}}$ given by
$$\underline{\mathbf{X}}(i_1, \ldots, i_N) = \Pr\left( X_1 \in \Delta_1^{i_1}, \ldots, X_N \in \Delta_N^{i_N} \right) = \sum_{f=1}^{F} w_f \prod_{n=1}^{N} \Pr\left( X_n \in \Delta_n^{i_n} \mid H = f \right). \quad (7)$$
Let $\mathbf{A}_n(i, f) = \Pr(X_n \in \Delta_n^{i} \mid H = f)$ and $\boldsymbol{\lambda}(f) = w_f$. Note that $\underline{\mathbf{X}}$ is an $N$-way tensor that admits a CPD with nonnegative factor matrices and rank at most $F$, i.e., $\underline{\mathbf{X}} = [\![\boldsymbol{\lambda}, \mathbf{A}_1, \ldots, \mathbf{A}_N]\!]$. From Equation (7) it is clear that the discretized conditional PDFs are identifiable and can be recovered by decomposing the true joint discretized probability tensor, if $N \geq 3$ and $F$ is small enough, by virtue of the uniqueness properties of the CPD (Sidiropoulos et al., 2017).
In practice we do not observe $\underline{\mathbf{X}}$ but have to deal with perturbed versions of it. Based on the observed samples, we can compute an approximation of the probability tensor by counting how many samples fall into each bin and normalizing the tensor by the total number of samples. The size of the probability tensor grows exponentially with the number of variables, and therefore the estimate will be highly inaccurate even when the number of discretization intervals is small. More importantly, datasets often contain missing data, in which case it is impossible to construct such a tensor at all. On the other hand, it may be possible to estimate low-dimensional discretized joint PDFs of subsets of the random variables, which correspond to low-order tensors. For example, in the clustering example given in the introduction, some patients may be tested on only a subset of the available tests. Finally, the model of Equation (7) is just an approximation of our original model, as our ultimate goal is to recover the true conditional PDFs. To address the aforementioned challenges we have to answer the following two questions:

Is it possible to learn the mixing weights and discretized conditional PDFs from missing/limited data?

Is it possible to recover nonparametric conditional PDFs from their discretized counterparts?
Regarding the first question, it has been recently shown that a joint Probability Mass Function (PMF) of a set of random variables can be recovered from lower-dimensional joint PMFs if the joint PMF has low enough rank (Kargas et al., 2018). This result allows us to recover the discretized conditional PDFs from low-dimensional histograms, but it cannot be extended to the continuous setting in general because of the loss of information induced by the discretization step. In the sequel, we provide conditions under which we can overcome these issues.
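Such low-order histogram estimates are easy to form from data. A minimal sketch of estimating a third-order discretized joint PDF by counting co-occurrences (uniform bins on $[0,1)$ and illustrative sizes; real data would first be rescaled to a common support):

```python
import numpy as np

rng = np.random.default_rng(2)
S, I = 5000, 8                    # number of samples and bins per variable
data = rng.random((S, 3))         # samples of a triple (X_j, X_k, X_l) on [0, 1)

# Discretize each variable into I uniform intervals and count co-occurrences
idx = np.minimum((data * I).astype(int), I - 1)
Z = np.zeros((I, I, I))
np.add.at(Z, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
Z /= S                            # empirical third-order histogram tensor
```

With missing data, each triple of variables is counted only over the samples where all three are observed, which is exactly why low-order statistics remain estimable when the full tensor is not.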
3.2 Identifiability using Lower-dimensional Statistics
In this section we provide insights regarding the first question. It turns out that realizations of subsets of only three random variables are sufficient to recover $\{\mathbf{A}_n\}_{n=1}^{N}$ and $\boldsymbol{\lambda}$. Under the mixture model (6), a histogram of any subset of three random variables $\{X_j, X_k, X_l\}$, with $j < k < l$, can be written as $\underline{\mathbf{X}}_{jkl} = [\![\boldsymbol{\lambda}, \mathbf{A}_j, \mathbf{A}_k, \mathbf{A}_l]\!]$, which is a third-order tensor of rank at most $F$. A fundamental result on the uniqueness of the decomposition of third-order tensors was given in (Kruskal, 1977). The result states that if $\underline{\mathbf{X}}$ admits a decomposition $\underline{\mathbf{X}} = [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$ with $k_{\mathbf{A}} + k_{\mathbf{B}} + k_{\mathbf{C}} \geq 2F + 2$, then $\mathrm{rank}(\underline{\mathbf{X}}) = F$ and the decomposition of $\underline{\mathbf{X}}$ is unique. Here, $k_{\mathbf{A}}$ denotes the Kruskal rank of the matrix $\mathbf{A}$, which is equal to the largest integer $k$ such that every subset of $k$ columns of $\mathbf{A}$ is linearly independent. When the rank is small and the decomposition is exact, the parameters of the CPD model can be computed exactly via the Generalized Eigenvalue Decomposition (GEVD) and related algebraic algorithms
(Leurgans et al., 1993; Domanov and Lathauwer, 2014; Sidiropoulos et al., 2017).

Theorem 1
(Leurgans et al., 1993) Let $\underline{\mathbf{X}}$ be a tensor that admits a polyadic decomposition $\underline{\mathbf{X}} = [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$, $\mathbf{A} \in \mathbb{R}^{I \times F}$, $\mathbf{B} \in \mathbb{R}^{J \times F}$, $\mathbf{C} \in \mathbb{R}^{K \times F}$, and suppose that $\mathbf{A}$ and $\mathbf{B}$ are full column rank and $k_{\mathbf{C}} \geq 2$. Then $\mathrm{rank}(\underline{\mathbf{X}}) = F$, the decomposition of $\underline{\mathbf{X}}$ is unique and can be found algebraically.
More relaxed uniqueness conditions from the field of algebraic geometry have been proven in recent years.
Theorem 2
(Chiantini and Ottaviani, 2012) Let $\underline{\mathbf{X}}$ be a tensor that admits a polyadic decomposition $\underline{\mathbf{X}} = [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$, where $\mathbf{A} \in \mathbb{R}^{I \times F}$, $\mathbf{B} \in \mathbb{R}^{J \times F}$, $\mathbf{C} \in \mathbb{R}^{K \times F}$, $I \leq J \leq K$. Let $\alpha$, $\beta$ be the largest integers such that $2^{\alpha} \leq I$ and $2^{\beta} \leq J$. If $F \leq 2^{\alpha + \beta - 2}$ then the decomposition of $\underline{\mathbf{X}}$ is essentially unique almost surely.
Theorem 2 is a generic uniqueness result, i.e., all non-identifiable parameters form a set of Lebesgue measure zero. To see how the above theorems can be applied in our setup, consider the joint decomposition of the probability tensors $\underline{\mathbf{X}}_{jkl}$. Let $\mathcal{S}_1$, $\mathcal{S}_2$, and $\mathcal{S}_3$ denote disjoint ordered subsets of $[N]$, with cardinality $N_1$, $N_2$, and $N_3$, respectively. Let $\underline{\mathbf{Z}}$ be the block tensor whose $(m_1, m_2, m_3)$-th block is the tensor $\underline{\mathbf{X}}_{jkl}$, $j = \mathcal{S}_1(m_1)$, $k = \mathcal{S}_2(m_2)$, $l = \mathcal{S}_3(m_3)$. It is clear that the tensor $\underline{\mathbf{Z}}$ admits a CPD $\underline{\mathbf{Z}} = [\![\boldsymbol{\lambda}, \mathbf{B}_1, \mathbf{B}_2, \mathbf{B}_3]\!]$, where $\mathbf{B}_i$ is obtained by vertically stacking the factor matrices $\mathbf{A}_n$, $n \in \mathcal{S}_i$. By considering the joint decomposition of lower-dimensional discretized PDFs, we have constructed a single virtual nonnegative CPD model, and therefore the above uniqueness properties apply. For example, by setting $\mathcal{S}_1 = \{j\}$, $\mathcal{S}_2 = \{k\}$, and $\mathcal{S}_3 = [N] \setminus \{j, k\}$, we have that $\underline{\mathbf{Z}} = [\![\boldsymbol{\lambda}, \mathbf{A}_j, \mathbf{A}_k, \mathbf{B}_3]\!]$. According to Theorem 1, this CPD can be computed exactly if $\mathbf{A}_j$ and $\mathbf{A}_k$ are full column rank and $k_{\mathbf{B}_3} \geq 2$. Similarly, it is easy to verify that by balancing the cardinalities of the three subsets, the CPD of $\underline{\mathbf{Z}}$ is generically unique according to Theorem 2 for $F$ up to a bound that is quadratic in the number of intervals and the number of variables.
Remark 1: The previous discussion suggests that finer discretization can lead to improved identifiability results. The number of hidden components $F$ may be arbitrarily large and we may still be able to identify the discretized conditional PDFs by increasing the dimensions of the sub-tensors, i.e., the number of discretization intervals of the random variables. The caveat is that one will need many more samples to reliably estimate these histograms. Ideally, one would like the minimum number of intervals that can guarantee identifiability of the conditional PDFs.
Remark 2: The factor matrices can be recovered by decomposing probability tensors of order as low as three. It is important to note that histograms of subsets of two variables correspond to Nonnegative Matrix Factorization (NMF), which is not identifiable unless additional conditions such as sparsity are assumed for the latent factors (Fu et al., 2018). Therefore, second-order distributions are not sufficient for recovering dense latent factor matrices.
3.3 Recovery of the Conditional PDFs
In the previous section we have shown that, given lower-dimensional discretized PDFs, we can uniquely identify and recover discretized versions of the conditional PDFs via joint tensor decomposition. Recovering the true conditional PDFs from their discretized counterparts can be viewed as a signal reconstruction problem. We know that this is not possible unless the signals have some smoothness properties. We will use the following result.
Proposition 1
A PDF $f$ that is (approximately) bandlimited with cutoff frequency $\nu_c$ can be recovered from uniform samples of the associated CDF taken $T \leq \frac{1}{2\nu_c}$ apart.
Proof: Assume that the PDF $f$ is bandlimited with cutoff frequency $\nu_c$, i.e., its Fourier transform $\widehat{f}$ satisfies $\widehat{f}(\nu) = 0$ for $|\nu| > \nu_c$. Let $\Phi$ denote the CDF of $f$, $\Phi(x) = \int_{-\infty}^{x} f(t)\, dt$. We can express the integration as a convolution of the PDF with a unit step function $u$, i.e., $\Phi = f * u$. The Fourier transform of a convolution is the pointwise product of the Fourier transforms. Therefore, we can express the Fourier transform of the CDF as
$$\widehat{\Phi}(\nu) = \widehat{f}(\nu) \left( \frac{1}{j 2 \pi \nu} + \frac{\delta(\nu)}{2} \right), \quad (8)$$
where $\delta(\cdot)$ is the Dirac delta. From Equation (8), it is clear that the CDF $\Phi$ obeys the same bandlimit as the PDF $f$. From Shannon's sampling theorem we have that
$$\Phi(x) = \sum_{k=-\infty}^{\infty} \Phi(kT)\, \mathrm{sinc}\!\left( \frac{x - kT}{T} \right), \quad (9)$$
where $T \leq \frac{1}{2\nu_c}$ and $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$. The PDF can then be determined by differentiation, which amounts to linear interpolation of the CDF samples using the derivative of the sinc kernel. Note that for exact reconstruction an infinite number of CDF samples is needed. In signal processing practice we always deal with finite-support signals, which are only approximately bandlimited; the point is that the bandlimited assumption is accurate enough to afford high-quality signal reconstruction. In our present context, a good example is the Gaussian distribution: even though it is of infinite extent, it is not strictly bandlimited (its Fourier transform is another Gaussian, which is nowhere zero), but its spectrum decays so fast that it is approximately bandlimited, and that is good enough for our purposes, as we will see shortly.
In Section 3.2, we saw how lower-dimensional histograms can be used to obtain estimates of the discretized conditional PDFs. Now, consider the conditional PDF of the $n$-th variable given the $f$-th component, and let $\Phi_{n,f}$ denote the corresponding conditional CDF. The corresponding column of factor matrix $\mathbf{A}_n$ satisfies $\mathbf{A}_n(i, f) = \Phi_{n,f}(iT) - \Phi_{n,f}((i-1)T)$, where $T$ is the length of the discretization intervals. Since $\Phi_{n,f}(iT) = \sum_{i' \leq i} \mathbf{A}_n(i', f)$, we can compute the CDF samples $\Phi_{n,f}(iT)$, $i \in [I_n]$, by cumulative summation. We also know that $\Phi_{n,f}(x) = 0$ for $x$ below the support and $\Phi_{n,f}(x) = 1$ for $x$ above it. Therefore, we can recover the conditional CDFs using the interpolation formula
$$\widetilde{\Phi}_{n,f}(x) = \sum_{k=-K}^{K} \Phi_{n,f}(kT)\, \mathrm{sinc}\!\left( \frac{x - kT}{T} \right), \quad (10)$$
where $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$ and $K$ is a large integer. The conditional PDF can then be recovered via differentiation.
3.4 Toy Example
An example to illustrate the idea is shown in Figure 1. Assume that the PDF of a random variable is a mixture of two Gaussian distributions. It is clear from Figure 1 that the spectrum is negligible beyond some cutoff frequency $\nu_c$, and therefore the PDF is approximately bandlimited. The CDF has the same bandlimit; thus, it can be recovered from points taken at most $1/(2\nu_c)$ apart. In this example we have used only a few discretization intervals, as they suffice to capture almost all of the mass of the data. We use the finite-sum formula of Equation (10) to recover the CDF and then recover the PDF by differentiating the CDF. The recovered PDF essentially coincides with the true PDF given a few exact values of the CDF, as shown in Figure 1.

Figure 2 shows the approximation error for different methods when we do not have exact points of the CDF but estimate them from randomly drawn samples. We know that a histogram converges to the true PDF as the number of samples grows and the bin width goes to zero at an appropriate rate. However, when the conditional PDF is smooth, the interpolation procedure using a few discretization intervals leads to a lower approximation error compared to plain histogram estimates, as illustrated in the figure.
4 Algorithm
In this section we develop an algorithm for recovering the latent factors of the CPD model given the histogram estimates of lower-dimensional PDFs (Alg. 1). We define the following optimization problem
$$\min_{\{\mathbf{A}_n\}_{n=1}^{N},\ \boldsymbol{\lambda}} \ \sum_{j=1}^{N} \sum_{k=j+1}^{N} \sum_{l=k+1}^{N} D\!\left( \underline{\mathbf{X}}_{jkl},\ [\![\boldsymbol{\lambda}, \mathbf{A}_j, \mathbf{A}_k, \mathbf{A}_l]\!] \right) \quad (11)$$
$$\text{s.t.} \quad \mathbf{A}_n \geq 0,\ \mathbf{1}^T \mathbf{A}_n = \mathbf{1}^T,\ n \in [N], \qquad \boldsymbol{\lambda} \geq 0,\ \mathbf{1}^T \boldsymbol{\lambda} = 1,$$
where $D(\cdot, \cdot)$ is a suitable metric. The Frobenius norm and the Kullback-Leibler (KL) divergence are considered in this work. For probability tensors we define
$$D_{KL}\!\left( \underline{\mathbf{X}}, \underline{\mathbf{Y}} \right) = \sum_{i_1, i_2, i_3} \underline{\mathbf{X}}(i_1, i_2, i_3) \log \frac{\underline{\mathbf{X}}(i_1, i_2, i_3)}{\underline{\mathbf{Y}}(i_1, i_2, i_3)}.$$
Optimization problem (11) is nonconvex and NP-hard in its general form. Nevertheless, sensible approximation algorithms can be derived based on well-appreciated nonconvex optimization tools. The idea is to cyclically update the variables while keeping all but one fixed. By fixing all variables except $\mathbf{A}_n$ and optimizing with respect to $\mathbf{A}_n$, we have
$$\min_{\mathbf{A}_n \geq 0,\ \mathbf{1}^T \mathbf{A}_n = \mathbf{1}^T} \ \sum_{\substack{j < k \\ j, k \neq n}} D\!\left( \mathbf{X}_{jkn}^{(3)},\ (\mathbf{A}_j \odot \mathbf{A}_k)\, \mathrm{Diag}(\boldsymbol{\lambda})\, \mathbf{A}_n^{T} \right), \quad (12)$$
where $\mathbf{X}_{jkn}^{(3)}$ denotes the mode-3 unfolding of $\underline{\mathbf{X}}_{jkn}$.
Problem (12) is convex and can be solved efficiently using Exponentiated Gradient (EG) (Kivinen and Warmuth, 1997), which is a special case of mirror descent (Beck and Teboulle, 2003). At each iteration of mirror descent we update $\mathbf{A}_n$ by solving
$$\mathbf{A}_n^{t+1} = \arg\min_{\mathbf{A}_n \geq 0,\ \mathbf{1}^T \mathbf{A}_n = \mathbf{1}^T} \left\{ \left\langle \nabla g(\mathbf{A}_n^{t}),\ \mathbf{A}_n - \mathbf{A}_n^{t} \right\rangle + \frac{1}{\eta_t} B_h\!\left( \mathbf{A}_n, \mathbf{A}_n^{t} \right) \right\},$$
where $g$ denotes the cost of (12) and $B_h$ is a Bregman divergence. Setting $h$ to be the negative entropy, $h(\mathbf{x}) = \sum_i \mathbf{x}(i) \log \mathbf{x}(i)$, the update becomes
$$\mathbf{A}_n^{t+1} = \mathbf{A}_n^{t} \circledast \exp\!\left( -\eta_t \nabla g(\mathbf{A}_n^{t}) \right), \quad (13)$$
where $\circledast$ is the Hadamard (element-wise) product, followed by normalization of the columns of $\mathbf{A}_n^{t+1}$ so that they sum to one. The optimization problem with respect to $\boldsymbol{\lambda}$ is the following
$$\min_{\boldsymbol{\lambda} \geq 0,\ \mathbf{1}^T \boldsymbol{\lambda} = 1} \ \sum_{j < k < l} D\!\left( \mathrm{vec}(\underline{\mathbf{X}}_{jkl}),\ (\mathbf{A}_j \odot \mathbf{A}_k \odot \mathbf{A}_l)\, \boldsymbol{\lambda} \right). \quad (14)$$
The update rule for $\boldsymbol{\lambda}$ is similar,
$$\boldsymbol{\lambda}^{t+1} = \boldsymbol{\lambda}^{t} \circledast \exp\!\left( -\eta_t \nabla g(\boldsymbol{\lambda}^{t}) \right), \quad (15)$$
followed by normalization. The step size $\eta_t$ can be computed by the Armijo rule (Bertsekas, 1999).
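A minimal sketch of the EG update (13) on a toy least-squares instance with column-stochastic constraints; a fixed step size replaces the Armijo rule for simplicity, and all problem sizes are illustrative rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
I, F = 6, 3
A_true = rng.random((I, F))
A_true /= A_true.sum(axis=0, keepdims=True)   # columns on the probability simplex
M = rng.random((F, 4))
Y = A_true @ M                                # toy "unfolded histogram" observations

A = np.full((I, F), 1.0 / I)                  # uniform (feasible) initialization
fit0 = np.linalg.norm(Y - A @ M) / np.linalg.norm(Y)
eta = 0.5                                     # fixed step; the paper uses the Armijo rule
for _ in range(5000):
    grad = (A @ M - Y) @ M.T                  # gradient of 0.5 * ||Y - A M||_F^2
    A = A * np.exp(-eta * grad)               # multiplicative (EG / mirror-descent) update
    A /= A.sum(axis=0, keepdims=True)         # re-normalize columns onto the simplex
fit = np.linalg.norm(Y - A @ M) / np.linalg.norm(Y)
```

Because the update is multiplicative, nonnegativity is preserved automatically and the normalization handles the sum-to-one constraint, which is what makes EG a natural fit for probability-valued factors.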
5 Experiments
5.1 Synthetic Data
In this section, we employ synthetic data simulations to showcase the effectiveness of the proposed algorithm. Experiments are conducted on synthetic datasets of varying sample sizes, generated from mixtures of product distributions; we fix the number of variables and vary the number of components $F$. We run the algorithms using several different random initializations and for each algorithm keep the model that yields the lowest cost. We evaluate the performance of the algorithms by calculating the KL divergence between the true and the learned model, which is approximated using Monte Carlo integration. Specifically, we generate $S$ test points $\{\mathbf{x}_s\}_{s=1}^{S}$ drawn from the true mixture $f$ and approximate the KL divergence between $f$ and the learned model $\hat{f}$ by
$$D_{KL}(f \,\|\, \hat{f}) \approx \frac{1}{S} \sum_{s=1}^{S} \log \frac{f(\mathbf{x}_s)}{\hat{f}(\mathbf{x}_s)}.$$
We also compute the clustering accuracy on the test points as follows. Each data point $\mathbf{x}_s$ is first assigned to the component yielding the highest posterior probability $\Pr(H = f \mid \mathbf{x}_s)$. Due to the label permutation ambiguity, the obtained components are aligned with the true components using the Hungarian algorithm (Kuhn, 1955). The clustering accuracy is then calculated as the ratio of correctly labeled data points to the total number of data points. For each scenario, we repeat several Monte Carlo simulations and report the average results. We explore the following settings for the conditional PDFs: (1) Gaussian, (2) GMM with two components, (3) Gamma, and (4) Laplace. The mixing weights are drawn from a Dirichlet distribution. We emphasize that our approach does not use any knowledge of the parametric form of the conditional PDFs; it only assumes smoothness.

Gaussian Conditional Densities: In the first experiment we assume that each conditional PDF is a Gaussian, i.e., for cluster $f$ and random variable $X_n$ we set $f_{X_n|H}(x_n \mid f) = \mathcal{N}(x_n; \mu_{n,f}, \sigma_{n,f}^2)$, with means and variances drawn from uniform distributions. We compare the performance of our algorithms to that of EM (EM GMM). Figure 3 shows the KL divergence between the true and the learned model for various dataset sizes and different numbers of components. We see that the performance of our methods converges to that of EM, despite the fact that we do not assume a particular model for the conditional densities. Interestingly, our approach performs better in terms of clustering accuracy, as shown in Figure 4. We can see that although the joint distribution learned by EM is closer to the true one in terms of the KL divergence, EM may fail to identify the true parameters of every component.
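The clustering-accuracy computation with Hungarian label alignment described above can be sketched as follows (scipy-based; the helper name and toy labels are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels, F):
    # Confusion counts: C[p, t] = number of points predicted p with true label t
    C = np.zeros((F, F), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        C[p, t] += 1
    # Hungarian algorithm finds the label permutation maximizing agreement
    row, col = linear_sum_assignment(-C)
    return C[row, col].sum() / len(true_labels)

true_labels = np.array([0, 0, 1, 1, 2, 2])
pred_labels = np.array([2, 2, 0, 0, 1, 1])    # a pure relabeling of true_labels
acc = clustering_accuracy(true_labels, pred_labels, 3)
```

Negating the confusion matrix turns the maximization into the minimization form that `linear_sum_assignment` expects.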
GMM Conditional Densities: In the second experiment we assume that each conditional PDF is itself a mixture of two univariate Gaussian distributions, with means and variances drawn from uniform distributions. Our method is able to learn the mixture model, in contrast to EM GMM, which exhibits poor performance due to the model mismatch, as shown in Figures 5 and 6.
Gamma Conditional Densities: Another example of a smooth distribution is the shifted Gamma distribution. We set each conditional PDF to be a shifted Gamma distribution with parameters drawn from uniform distributions. As the number of samples grows, our method exhibits better performance, significantly outperforming EM GMM, as shown in Figures 7 and 8.

Laplace Conditional Densities: In the last simulated experiment we assume that each conditional PDF is a Laplace distribution with location $\mu_{n,f}$ and scale $b_{n,f}$. A Laplace distribution, in contrast to the previous cases, is not smooth (at its location parameter). Parameters are drawn from uniform distributions. We compare the performance of our methods to that of EM GMM and an EM algorithm for a Laplace mixture model (EM LMM). The proposed method approaches the performance of EM LMM and exhibits better performance in terms of KL divergence and clustering accuracy compared to EM GMM for larger numbers of data samples, as shown in Figures 9 and 10.
5.2 Real Data
Finally, we conduct several real-data experiments to test the ability of the algorithms to cluster data. We selected datasets with continuous variables, suitable for classification or regression tasks, from the UCI repository. For each labeled dataset we hide the label and treat it as the latent component. For datasets that contain a continuous variable as a response, we discretized the response into uniform intervals and treated it as the latent component. For each dataset we repeated several Monte Carlo simulations by randomly splitting the dataset into three sets: a training set, a validation set, and a test set. The validation set was used to select the number of discretization intervals. We compare our methods against EM GMM with diagonal covariance, EM GMM with full covariance, and the K-means algorithm in terms of clustering accuracy. Note that although the conditional independence assumption may not actually hold in practice, almost all the algorithms give satisfactory results on the tested datasets. The proposed algorithms perform well, outperforming the baselines on most of the datasets while performing reasonably well on the remaining ones.

6 Discussion and Conclusion
We have proposed a two-stage approach based on tensor decomposition and signal processing tools for recovering the conditional densities of mixtures of smooth product distributions. Our method does not assume a parametric form for the unknown conditional PDFs. We have formulated the problem as a coupled tensor factorization and proposed an alternating-optimization algorithm. Experiments on synthetic data have shown that when the underlying conditional PDFs are indeed smooth, our method can recover them with high accuracy. Results on real data have shown satisfactory performance on data clustering tasks.
References
 Allman et al. (2009) E. S. Allman, C. Matias, J. A. Rhodes, et al. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099–3132, 2009.
 Anandkumar et al. (2014) A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, Aug. 2014.
 Beck and Teboulle (2003) A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
 Benaglia et al. (2009) T. Benaglia, D. Chauveau, and D. R. Hunter. An EM-like algorithm for semi- and nonparametric estimation in multivariate mixtures. Journal of Computational and Graphical Statistics, 18(2):505–526, 2009.
 Bertsekas (1999) D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
 Chiantini and Ottaviani (2012) L. Chiantini and G. Ottaviani. On generic identifiability of 3-tensors of small rank. SIAM Journal on Matrix Analysis and Applications, 33(3):1018–1037, 2012.
 Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
 Domanov and Lathauwer (2014) I. Domanov and L. De Lathauwer. Canonical polyadic decomposition of third-order tensors: reduction to generalized eigenvalue decomposition. SIAM Journal on Matrix Analysis and Applications, 35(2):636–660, 2014.
 Fu et al. (2018) X. Fu, K. Huang, and N. D. Sidiropoulos. On identifiability of nonnegative matrix factorization. IEEE Signal Processing Letters, 25(3):328–332, Mar. 2018.

 Gottesman et al. (2018) O. Gottesman, W. Pan, and F. Doshi-Velez. Weighted tensor decomposition for learning latent variables with partial data. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pages 1664–1672, Apr. 2018.
 Hsu and Kakade (2013) D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 11–20, 2013.
 Jain and Oh (2014) P. Jain and S. Oh. Learning mixtures of discrete product distributions using spectral decompositions. In Conference on Learning Theory, pages 824–856, 2014.
 Kargas et al. (2018) N. Kargas, N. D. Sidiropoulos, and X. Fu. Tensors, learning, and “kolmogorov extension” for finitealphabet random vectors. IEEE Transactions on Signal Processing, 66(18):4854–4868, Sept 2018.
 Kivinen and Warmuth (1997) J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
 Kruskal (1977) J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, 1977.
 Kuhn (1955) H. W. Kuhn. The Hungarian method for the assignment problem. Naval research logistics quarterly, 2(12):83–97, 1955.
 Leurgans et al. (1993) S. Leurgans, R. Ross, and R. Abel. A decomposition for threeway arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.
 Levine et al. (2011) M. Levine, D. R. Hunter, and D. Chauveau. Maximum smoothed likelihood for multivariate mixtures. Biometrika, pages 403–416, 2011.
 McLachlan and Peel (2000) G. J. McLachlan and D. Peel. Finite mixture models. Wiley Series in Probability and Statistics, 2000.
 Ruffini et al. (2017) M. Ruffini, R. Gavalda, and E. Limon. Clustering patients with tensor decomposition. In Machine Learning for Healthcare Conference, pages 126–146, 2017.
 Sidiropoulos et al. (2017) N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, July 2017.
 Zhang et al. (2014) Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, pages 1260–1268, 2014.