
# Hierarchical mixtures of Gaussians for combined dimensionality reduction and clustering

To avoid the curse of dimensionality, a common approach to clustering high-dimensional data is to first project the data into a space of reduced dimension, and then cluster the projected data. Although effective, this two-stage approach prevents joint optimization of the dimensionality-reduction and clustering models, and obscures how well the complete model describes the data. Here, we show how a family of such two-stage models can be combined into a single, hierarchical model that we call a hierarchical mixture of Gaussians (HMoG). An HMoG simultaneously captures both dimensionality-reduction and clustering, and its performance is quantified in closed-form by the likelihood function. By formulating and extending existing models with exponential family theory, we show how to maximize the likelihood of HMoGs with expectation-maximization. We apply HMoGs to synthetic data and RNA sequencing data, and demonstrate how they exceed the limitations of two-stage models. Ultimately, HMoGs are a rigorous generalization of a common statistical framework, and provide researchers with a method to improve model performance when clustering high-dimensional data.


## 1 Introduction

Clustering — the grouping of individuals or items based on their similarity — is an essential part of knowledge discovery. The growing complexity of modern datasets thus poses a challenge, because many clustering algorithms — such as mixtures of Gaussians (MoG) — suffer from the so-called “curse of dimensionality”, and exhibit limited performance when classifying high-dimensional data. This arises not only because model complexity scales with dimension, but also because similarity metrics used to evaluate model performance tend to break down in high-dimensional spaces [beyer_when_1999, assent_clustering_2012].

A common approach for addressing this challenge is to first apply dimensionality reduction techniques — such as principal component analysis (PCA) or factor analysis (FA) — to project the data into a lower dimensional space, and then cluster the projected data. Such two-stage algorithms are widely applied in problem domains such as image processing [houdard_high-dimensional_2018, hertrich_pca_2022] and time-series analysis [warren_liao_clustering_2005], and fields such as neuroscience [lewicki_review_1998, baden_functional_2016] and bioinformatics [witten_framework_2010, duo_systematic_2020].

Nevertheless, two-stage approaches struggle with two primary limitations: Firstly, the dimensionality-reduction is typically implemented as a preprocessing step, and is not further optimized based on the results of the subsequent clustering. This can lead to suboptimal projections — if we consider PCA, for example, the directions in the data that best separate the clusters might not be the directions of maximum variance that define the PCA projection [chang_using_1983, mclachlan_finite_2019].

Secondly, two-stage algorithms rely on optimizing two distinct objectives, and it is therefore non-trivial to quantify how well the complete model describes the data. As such, even if we consider algorithms that update projections based on clustering results [ding_adaptive_2002, fern_random_2003], it is difficult to be sure that these updates improve overall model performance, because we cannot evaluate the likelihood of the complete model given the data.

To address these limitations, we consider a family of two-stage models that implement dimensionality reduction with a linear Gaussian model (e.g. PCA or FA), and clustering with a mixture of Gaussians (MoG). By applying a novel theory of conjugate priors in latent variable exponential family models, we show how these two stages can be combined and generalized into a single, hierarchical probabilistic model that we call a hierarchical mixture of Gaussians (HMoG).

An HMoG captures both dimensionality reduction and clustering in a single model, and has a single objective given by the likelihood function. Critically, the tractability of HMoGs allows us to derive closed-form expressions for the HMoG density function, and a novel expectation-maximization (EM) algorithm for maximizing the HMoG likelihood. Although the derivation of HMoG theory requires a number of technical innovations, the upshot of the theory is straightforward: a closed-form expression for the HMoG log-likelihood allows us to rigorously compare various algorithms for dimensionality reduction and clustering, and the HMoG EM algorithm allows HMoGs to strictly exceed the performance of the two-stage models that they generalize.

To validate our theory we apply HMoGs to several datasets. In synthetic data, we show how HMoGs can overcome fundamental limitations of standard two-stage algorithms. Finally, in an RNA sequencing (RNA-Seq) dataset we show how our methods can significantly exceed the predictive performance of standard two-stage approaches. Our work is related to previous methods on mixtures of PCA/FA [ghahramani_em_1996, tipping_mixtures_1999, houdard_high-dimensional_2018, hertrich_pca_2022]. In contrast with these methods, however, our model has a hierarchical structure that affords a more compact parameterization, and ensures that a shared, low-dimensional feature space is extracted during training.

## 2 Theory

Our focus will be on HMoGs where the dimensionality reduction is based on either PCA or FA, and the clustering is based on MoGs. Historically, each of these techniques has fairly different theoretical foundations — PCA finds directions of maximum variance in the data, FA explains the data as a linear combination of hidden factors, and MoGs model the data distribution as a weighted sum of multivariate normal distributions. Nevertheless, each technique may be formulated as a probabilistic latent variable model, that may be fit to data with EM. This unified, probabilistic foundation is essential for combining PCA/FA and MoGs into an HMoG, and ultimately deriving a single EM algorithm that jointly optimizes dimensionality-reduction and clustering.

### 2.1 Linear Gaussian models and mixture models are probabilistic latent variable models

Suppose we make observations $x_1, \ldots, x_n$ of the random variable $X$, and we hypothesize that $X$ is influenced by some latent random variable $Y$. In the maximum likelihood framework, a statistical model $p(x; \theta)$ with parameters $\theta$ aims to capture the distribution of $X$ by maximizing the log-likelihood objective $\sum_{i=1}^n \log p(x_i; \theta)$ with respect to $\theta$. We may extend this framework to a latent variable model $p(x, y; \theta)$ that captures both the distribution of $X$ and the effect of the latent variable $Y$, by maximizing the log-likelihood of the marginal distribution $p(x; \theta) = \int p(x, y; \theta) \, dy$. As we will see, both linear Gaussian models and mixture models afford a probabilistic latent variable formulation, and linear Gaussian models include FA and PCA as a special case [roweis_em_1997, bishop_pattern_2006].

On one hand, where $X$ and $Y$ are continuous variables of dimensions $d_X$ and $d_Y$, respectively, a linear Gaussian model is simply a $(d_X + d_Y)$-dimensional multivariate normal distribution. As a consequence, the observable distribution $p(x)$ and the prior $p(y)$, as well as the likelihood $p(x \mid y)$ (unfortunately there is no naming convention for distinguishing the likelihood of non-Bayesian model parameters from the likelihood of random variables; we will refer to the former as the log-likelihood objective where necessary to help distinguish them) and the posterior $p(y \mid x)$ are also multivariate normal distributions [see bishop_pattern_2006]. Where $p(x \mid y)$ has mean $W \cdot y + \mu$ and covariance $\Sigma$, we may define probabilistic PCA as the case where the columns of $W$ are proportional to the ranked eigenvectors of the sample covariance matrix and $\Sigma = \sigma^2 I$ is isotropic. FA is then the case where the columns of $W$ are the so-called factor loadings and $\Sigma$ is a diagonal matrix. Finally, in both cases, the mean of the posterior distribution $\mathbb{E}[Y \mid x]$ probabilistically formalizes projecting a datapoint into the reduced space (“classic” PCA is recovered in the limit as $\sigma^2 \to 0$).

A mixture model, on the other hand, is a latent variable model where the latent variable represents an index $z$ that ranges from $0$ to $d_Z$. Since our goal for HMoGs is ultimately to cluster low-dimensional features $y$, let us introduce a mixture as the model $p(y, z)$ over $Y$ and an index-valued latent variable $Z$. In this formulation, the components of the mixture are given by the likelihood $p(y \mid z)$ for each index $z$, the weights of the mixture are the prior probabilities $p(z)$, and the weighted sum of the components (“the mixture distribution”) is equal to the observable distribution $p(y) = \sum_z p(y \mid z) p(z)$. MoGs, then, are simply the case where each component $p(y \mid z)$ is a multivariate normal distribution with mean $\mu_z$ and covariance $\Sigma_z$.

FA, PCA, and MoGs can all be fit to data using expectation-maximization (EM) in our probabilistic framework. The two-stage approach to clustering a high-dimensional dataset $x_1, \ldots, x_n$ is thus first to fit an appropriate linear Gaussian model (i.e. FA or PCA) to the dataset, and then to fit a MoG to the projected dataset $y_1, \ldots, y_n$, where $y_i = \mathbb{E}[Y \mid x_i]$ (Fig. 1). After training, the cluster membership of an arbitrary datapoint $x$ is determined by first computing the projection $y = \mathbb{E}[Y \mid x]$, and then computing (and perhaps taking the maximum of) the index probabilities $p(z \mid y)$.
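The two-stage pipeline can be sketched in a few dozen lines of numpy. This is our own minimal reconstruction, not the paper's implementation (which is in Haskell): classical PCA via an eigendecomposition stands in for the probabilistic PCA fit, and a standard EM loop fits the MoG to the projected data. All names and the synthetic data are illustrative.

```python
# Hedged sketch of the two-stage approach: PCA projection, then MoG EM on
# the projected data. Our own construction; not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two well-separated clusters embedded in 10 dimensions.
n, d_x, d_y, k = 500, 10, 2, 2
centers = np.array([[-3.0] * d_x, [3.0] * d_x])
labels = rng.integers(0, k, size=n)
X = centers[labels] + rng.normal(size=(n, d_x))

# Stage 1: project onto the top d_y eigenvectors of the sample covariance.
mu = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
W = eigvecs[:, np.argsort(eigvals)[::-1][:d_y]]
Y = (X - mu) @ W

# Stage 2: EM for a MoG on the projected data. Initialize the component
# means at the extremes of the first projected coordinate.
pi = np.full(k, 1.0 / k)
m = Y[[np.argmin(Y[:, 0]), np.argmax(Y[:, 0])]]
S = np.stack([np.cov(Y, rowvar=False)] * k)

def log_gauss(Y, m, S):
    """Componentwise multivariate normal log-density."""
    d = Y.shape[1]
    diff = Y - m
    sol = np.linalg.solve(S, diff.T).T
    return -0.5 * (np.einsum("ij,ij->i", diff, sol)
                   + np.linalg.slogdet(S)[1] + d * np.log(2 * np.pi))

lls = []
for _ in range(25):
    # E-step: responsibilities p(z | y_i).
    logp = np.stack([np.log(pi[j]) + log_gauss(Y, m[j], S[j])
                     for j in range(k)], axis=1)
    norm = np.logaddexp.reduce(logp, axis=1)
    lls.append(norm.mean())
    r = np.exp(logp - norm[:, None])
    # M-step: weighted weights, means, and covariances.
    nk = r.sum(axis=0)
    pi = nk / n
    m = (r.T @ Y) / nk[:, None]
    for j in range(k):
        diff = Y - m[j]
        S[j] = (r[:, j, None] * diff).T @ diff / nk[j]

# EM never decreases the (projected-data) log-likelihood.
assert all(b >= a - 1e-7 for a, b in zip(lls, lls[1:]))
```

As the paper notes, the two likelihoods here (PCA reconstruction and MoG fit) are separate objectives; the HMoG framework below replaces them with a single one.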

### 2.2 New exponential family theory enables tractable computation with latent variable models

When analyzing high-dimensional data, one must take care to avoid making complex computations in the high-dimensional space of observations, as this can quickly lead to computational intractability and numerical instability. In order to address this in our present context, we introduce the class of models known as exponential family harmoniums [smolensky_information_1986, welling_exponential_2005], which allows us to unify the mathematics of our latent variable models. We then develop a theory of conjugate priors for harmoniums, which we use to break down complex probabilistic computations with harmoniums (and ultimately HMoGs) into tractable components.

To begin, an exponential family is a statistical model defined by a sufficient statistic $\mathbf{s}(x)$ and base measure $\nu(x)$, and has the form $p(x; \theta) = e^{\mathbf{s}(x) \cdot \theta - \psi(\theta)} \nu(x)$, where $\theta$ are the natural parameters, and $\psi(\theta)$ is the normalizer known as the log-partition function. There are two exponential families that are particularly relevant for our purposes. Firstly, the family of $d$-dimensional multivariate normal distributions is the exponential family with constant base measure and sufficient statistic $\mathbf{s}(x) = (x, x \otimes x)$, where $\otimes$ is the outer product operator. Secondly, the family of categorical distributions over indices $z \in \{0, 1, \ldots, d_Z\}$ is the exponential family with constant base measure and sufficient statistic given by a so-called “one-hot” vector, so that $\mathbf{s}(0) = \mathbf{0}$, and for $z > 0$, $\mathbf{s}(z)$ is a length-$d_Z$ vector with all zero elements except for a 1 at element $z$.
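The one-hot categorical construction can be verified numerically. The following is an illustrative sketch (notation and names are ours), assuming the standard categorical log-partition function $\psi(\theta) = \log(1 + \sum_k e^{\theta_k})$, with the zeroth index carrying no parameters.

```python
# Illustrative check of the categorical exponential family over indices
# z in {0, ..., d_z} with one-hot sufficient statistic. Our own sketch.
import numpy as np

d_z = 3
theta = np.array([0.5, -1.0, 2.0])      # natural parameters

def s(z):
    """One-hot sufficient statistic; index 0 maps to the zero vector."""
    v = np.zeros(d_z)
    if z > 0:
        v[z - 1] = 1.0
    return v

psi = np.log(1.0 + np.exp(theta).sum())  # log-partition function
p = np.array([np.exp(s(z) @ theta - psi) for z in range(d_z + 1)])

assert abs(p.sum() - 1.0) < 1e-12        # densities normalize to 1
assert abs(p[0] - np.exp(-psi)) < 1e-12  # the base index has density e^{-psi}
```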

An exponential family harmonium is then defined as a kind of product exponential family, which includes various models as special cases [welling_exponential_2005]. Given two exponential families defined by $\mathbf{s}_X$ and $\nu_X$, and $\mathbf{s}_Y$ and $\nu_Y$, respectively, a harmonium is the model

 $p(x, y; \theta_X, \theta_Y, \Theta_{XY}) = e^{\mathbf{s}_X(x) \cdot \theta_X + \mathbf{s}_Y(y) \cdot \theta_Y + \mathbf{s}_X(x) \cdot \Theta_{XY} \cdot \mathbf{s}_Y(y) - \psi_{XY}(\theta_X, \theta_Y, \Theta_{XY})} \nu_X(x) \nu_Y(y)$, (1)

which is an exponential family defined by the base measure $\nu_X \cdot \nu_Y$ and the sufficient statistic $(\mathbf{s}_X(x), \mathbf{s}_Y(y), \mathbf{s}_X(x) \otimes \mathbf{s}_Y(y))$, with natural parameters $(\theta_X, \theta_Y, \Theta_{XY})$ and log-partition function $\psi_{XY}$.

To return to our dimensionality-reduction and clustering problem, let us define $\mathbf{s}_X$ and $\nu_X$, $\mathbf{s}_Y$ and $\nu_Y$, and $\mathbf{s}_Z$ and $\nu_Z$ as the sufficient statistics and base measures of the $d_X$-dimensional multivariate normal family, the $d_Y$-dimensional multivariate normal family, and the categorical family over $d_Z + 1$ indices, respectively. Based on our definitions, we may immediately define a MoG as the harmonium $p(y, z) \propto e^{\mathbf{s}_Y(y) \cdot \theta_Y + \mathbf{s}_Z(z) \cdot \theta_Z + \mathbf{s}_Y(y) \cdot \Theta_{YZ} \cdot \mathbf{s}_Z(z)}$ (where we absorb the constant base measure into the proportionality relation). We leave the detailed equations of the natural parameters to Appendix A, but suffice it to say that the natural parameters $\theta_Y$ and $\Theta_{YZ}$ correspond to the means $\mu_z$ and covariances $\Sigma_z$ of the mixture components of the MoG, and the parameters $\theta_Z$ correspond to the mixture weights of the MoG.
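The correspondence between the familiar (weights, means, variances) parameterization and harmonium natural parameters can be checked numerically. The following is an independent sketch for one-dimensional components (the paper's exact parameter maps are in its appendix; the parameter names and construction below are ours): the MoG joint log-density and the harmonium exponent should agree up to a single additive constant, namely the log-partition function.

```python
# Our own numeric sketch: a 1-D MoG with components indexed z = 0, ..., K
# as a harmonium with s_Y(y) = (y, y^2) and one-hot s_Z(z).
import numpy as np

w = np.array([0.5, 0.3, 0.2])    # mixture weights
mu = np.array([-2.0, 0.0, 3.0])  # component means
var = np.array([0.5, 1.0, 2.0])  # component variances
K = len(w) - 1

def s_y(y):
    return np.array([y, y * y])

def s_z(z):
    v = np.zeros(K)
    if z > 0:
        v[z - 1] = 1.0
    return v

# Natural parameters of the z = 0 component ...
theta_y = np.array([mu[0] / var[0], -0.5 / var[0]])
# ... interactions holding the *differences* of the other components ...
Theta_yz = np.stack([np.array([mu[z] / var[z], -0.5 / var[z]]) - theta_y
                     for z in range(1, K + 1)], axis=1)
# ... and categorical parameters absorbing weights and normalizers.
log_c = np.log(w) - 0.5 * np.log(2 * np.pi * var) - 0.5 * mu ** 2 / var
theta_z = log_c[1:] - log_c[0]

def log_joint(y, z):
    """log w_z + log N(y; mu_z, var_z), the MoG joint log-density."""
    return (np.log(w[z]) - 0.5 * np.log(2 * np.pi * var[z])
            - 0.5 * (y - mu[z]) ** 2 / var[z])

def harmonium_exponent(y, z):
    return s_y(y) @ theta_y + s_z(z) @ theta_z + s_y(y) @ Theta_yz @ s_z(z)

# The two expressions agree up to a constant (the log-partition function).
consts = [log_joint(y, z) - harmonium_exponent(y, z)
          for y in (-1.5, 0.2, 2.7) for z in range(K + 1)]
assert np.allclose(consts, consts[0])
```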

On the other hand, we may express a linear Gaussian model in the form $p(x, y) \propto e^{\mathbf{s}_X(x) \cdot \theta_X + \mathbf{s}_Y(y) \cdot \theta_Y + x \cdot \Theta_{XY} \cdot y}$. This is a restricted version of the harmonium defined by $\mathbf{s}_X$ and $\nu_X$, and $\mathbf{s}_Y$ and $\nu_Y$, where the interaction term $x \cdot \Theta_{XY} \cdot y$ in the exponent ensures that the variables $X$ and $Y$ only interact through their first order statistics, rather than the second order statistics that are part of $\mathbf{s}_X$ and $\mathbf{s}_Y$. We again leave out the details (see Appendix A), but the parameters $\theta_X$ and $\theta_Y$ correspond to the mean and covariance parameters of the normal distributions over $X$ and $Y$, and the parameters $\Theta_{XY}$ correspond to the projection matrix of the PCA and FA models.

To continue, a prior distribution is said to be conjugate to the posterior when both have the same exponential family form, and conjugate priors can greatly simplify statistical inference in exponential family models. Conjugate priors usually arise in the context of inferring the parameters of well known models such as normal or Poisson distributions, but here we wish to use the concept to infer distributions over our features $Y$ and clusters $Z$. Due to the log-linear form of a harmonium, the harmonium posterior is given by

 $p(y \mid x) = e^{\mathbf{s}_Y(y) \cdot (\theta_Y + \mathbf{s}_X(x) \cdot \Theta_{XY}) - \psi_Y(\theta_Y + \mathbf{s}_X(x) \cdot \Theta_{XY})} \nu_Y(y)$, (2)

and is therefore always in the exponential family defined by $\mathbf{s}_Y$ and $\nu_Y$. As such we can reduce the question of whether a harmonium prior $p(y)$ is conjugate to its posterior to whether the prior is also in the exponential family defined by $\mathbf{s}_Y$ and $\nu_Y$.

###### Lemma 1.

Suppose that $p(x, y; \theta_X, \theta_Y, \Theta_{XY})$ is a harmonium. Then the prior $p(y)$ satisfies $p(y) = p(y; \theta^*_Y)$ for some $\theta^*_Y$ if and only if there exists a vector $\rho_Y$ and a constant $\rho_0$ such that

 $\psi_X(\theta_X + \Theta_{XY} \cdot \mathbf{s}_Y(y)) = \mathbf{s}_Y(y) \cdot \rho_Y + \rho_0$, (3)

and

 $\theta^*_Y = \theta_Y + \rho_Y$. (4)

When Eq. 3 is satisfied, we say that the harmonium is conjugated, and we refer to the parameters $\rho_Y$ and $\rho_0$ as the conjugation parameters. In general, Eq. 3 is a strong constraint, and is not satisfied by most models. Nevertheless, both linear Gaussian models and mixture models are indeed conjugated harmoniums with closed-form expressions for their conjugation parameters. We again leave out the details of their evaluation (see Appendix A), but suffice it to say that, for PCA and FA, the computational complexity of evaluating the conjugation parameters scales only linearly with the dimension $d_X$ of the observations.
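The conjugation equation (Eq. 3) can be made concrete with a univariate example of our own construction (not taken from the paper): for a Gaussian observable with natural parameters $(\theta_1, \theta_2)$, $\theta_2 < 0$, shifting $\theta_1$ by $a \cdot y$ makes the log-partition function a quadratic in $y$, and hence exactly linear in the sufficient statistic $(y, y^2)$. Completing the square gives the conjugation parameters in closed form.

```python
# Numeric check of Eq. 3 for a univariate linear Gaussian harmonium.
# Our own illustrative construction; parameter names are ours.
import numpy as np

theta1, theta2 = 0.7, -0.4   # natural parameters of X (theta2 < 0)
a = 0.3                      # interaction: shifts theta1 by a * y

def psi_x(t1, t2):
    """Log-partition of p(x) ∝ exp(t1*x + t2*x^2), Lebesgue base measure."""
    return -t1 ** 2 / (4 * t2) + 0.5 * np.log(np.pi / -t2)

# Conjugation parameters implied by completing the square.
rho = np.array([-theta1 * a / (2 * theta2), -a ** 2 / (4 * theta2)])
rho0 = psi_x(theta1, theta2)

# psi_X at the shifted parameters is exactly linear in s_Y(y) = (y, y^2).
for y in (-2.0, 0.1, 1.7, 5.0):
    lhs = psi_x(theta1 + a * y, theta2)
    rhs = np.array([y, y * y]) @ rho + rho0
    assert abs(lhs - rhs) < 1e-12
```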

By applying the theory of conjugation, we can overcome one of our major challenges, namely computing the harmonium observable density $p(x)$. The observable density is given by

 $p(x) = e^{\mathbf{s}_X(x) \cdot \theta_X + \psi_Y(\theta_Y + \mathbf{s}_X(x) \cdot \Theta_{XY}) - \psi_{XY}(\theta_X, \theta_Y, \Theta_{XY})} \nu_X(x)$, (5)

which is not in general an exponential family distribution. Typically, the log-partition function $\psi_Y$ will be tractable, which reduces the difficulty of evaluating this density to evaluating the harmonium log-partition function $\psi_{XY}$. Although $\psi_{XY}$ is typically intractable, for conjugated harmoniums we can also reduce its evaluation to the evaluation of $\psi_Y$.

###### Corollary 2.

Suppose that $p(x, y; \theta_X, \theta_Y, \Theta_{XY})$ is a conjugated harmonium, with conjugation parameters $\rho_Y$ and $\rho_0$. Then

 $\psi_{XY}(\theta_X, \theta_Y, \Theta_{XY}) = \psi_Y(\theta_Y + \rho_Y) + \rho_0$. (6)

In the context of our dimensionality reduction problem, Eq. 6 also allows us to evaluate the log-partition function of the linear Gaussian model in terms of the log-partition function on the feature space, and thereby avoid computations in the high-dimensional observable space.
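Corollary 2 can be verified by brute force in a toy setting of our own devising: when both variables are categorical with one-hot sufficient statistics, the left-hand side of Eq. 3 is a function on a finite set and is therefore always exactly linear in the one-hot statistic, so such harmoniums are always conjugated, and the joint log-partition function can be compared directly against an exhaustive sum over all states.

```python
# Brute-force check of Corollary 2 for a fully discrete harmonium.
# Our own toy construction; not the paper's code.
import numpy as np

rng = np.random.default_rng(1)
d_x, d_y = 3, 2
theta_x = rng.normal(size=d_x)
theta_y = rng.normal(size=d_y)
Theta = rng.normal(size=(d_x, d_y))

def one_hot(i, d):
    v = np.zeros(d)
    if i > 0:
        v[i - 1] = 1.0
    return v

def psi(theta):
    """Categorical log-partition: log(1 + sum_k exp(theta_k))."""
    return np.logaddexp(0.0, np.logaddexp.reduce(theta))

# Conjugation parameters: read off psi_X at each shifted parameter value.
rho0 = psi(theta_x + Theta @ one_hot(0, d_y))
rho = np.array([psi(theta_x + Theta @ one_hot(y, d_y)) - rho0
                for y in range(1, d_y + 1)])

# Corollary 2: psi_XY equals psi_Y at the conjugated prior plus rho0 ...
psi_xy_conj = psi(theta_y + rho) + rho0

# ... which we compare against the brute-force sum over all joint states.
exponents = [one_hot(x, d_x) @ theta_x + one_hot(y, d_y) @ theta_y
             + one_hot(x, d_x) @ Theta @ one_hot(y, d_y)
             for x in range(d_x + 1) for y in range(d_y + 1)]
psi_xy_brute = np.logaddexp.reduce(exponents)

assert abs(psi_xy_conj - psi_xy_brute) < 1e-12
```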

Our other major challenge is fitting our models to data, and thankfully, the exponential family structure of harmoniums also affords a general description of the EM algorithm. This follows from the fundamental property of exponential families that the gradient of the log-partition function is equal to the expected value of the sufficient statistics [wainwright_graphical_2008]. Suppose we wish to fit the harmonium $p(x, y; \theta_X, \theta_Y, \Theta_{XY})$ to the sample $x_1, \ldots, x_n$, and that $\tau_Y = \nabla \psi_Y$ and $\tau_{XY} = \nabla \psi_{XY}$ are the gradients of the log-partition functions $\psi_Y$ and $\psi_{XY}$, respectively. Then an iteration of EM may be formulated as

for all $i \in \{1, \ldots, n\}$ do
 Compute latent expectations $\hat{\eta}_i = \tau_Y(\theta_Y + \mathbf{s}_X(x_i) \cdot \Theta_{XY})$
end for
Compute updated average sufficient statistics $\hat{\eta} = \frac{1}{n} \sum_{i=1}^n (\mathbf{s}_X(x_i), \hat{\eta}_i, \mathbf{s}_X(x_i) \otimes \hat{\eta}_i)$
Compute updated natural parameters $(\theta_X, \theta_Y, \Theta_{XY}) \leftarrow \tau_{XY}^{-1}(\hat{\eta})$

The tractability of training a harmonium thus reduces to the tractability of the log-partition gradient $\tau_Y$, and the inverse of the gradient $\tau_{XY}^{-1}$. In the case of linear Gaussian models and MoGs, both functions are available in closed-form (see Appendix A), and there is thus a closed-form EM for training them.
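The forward and backward mappings between natural and mean parameters can be illustrated with the univariate Gaussian family; this is our own example (names and setup are ours, not the paper's), where both the gradient of the log-partition function and its inverse are available in closed form.

```python
# Closed-form forward (tau = grad psi) and backward (tau inverse) mappings
# for the univariate Gaussian family. Illustrative sketch.
import numpy as np

def tau(theta):
    """Natural to mean parameters: (E[x], E[x^2])."""
    t1, t2 = theta
    var = -0.5 / t2          # variance from the quadratic natural parameter
    mean = t1 * var          # mean = -t1 / (2 * t2)
    return np.array([mean, mean ** 2 + var])

def tau_inv(eta):
    """Mean to natural parameters, inverting tau in closed form."""
    e1, e2 = eta
    var = e2 - e1 ** 2       # variance = E[x^2] - E[x]^2
    return np.array([e1 / var, -0.5 / var])

theta = np.array([1.2, -0.3])
assert np.allclose(tau_inv(tau(theta)), theta)  # round-trip recovers theta
```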

### 2.3 Hierarchical mixtures of Gaussians can be tractably evaluated and fit to data

We exploit the probabilistic formulation of the linear Gaussian model and MoG to define an HMoG as the model $p(x, y, z)$ over observations $X$, features $Y$, and clusters $Z$. Intuitively, we define an HMoG by taking a PCA or FA model, and swapping out the standard normal prior for a MoG. This definition is more than a notational trick, as it ensures that fitting the two-stage model (i.e. PCA/FA and MoG separately) maximizes a lower-bound on the marginal log-likelihood of the HMoG (see Appendix A).

Putting the expressions for the linear Gaussian model and the MoG together allows us to write the HMoG density as

 $\log p(x, y, z; \theta_X, \theta_Y, \theta_Z, \Theta_{XY}, \Theta_{YZ}) = \theta_X \cdot \mathbf{s}_X(x) + \theta_Y \cdot \mathbf{s}_Y(y) + \theta_Z \cdot \mathbf{s}_Z(z) + x \cdot \Theta_{XY} \cdot y + \mathbf{s}_Y(y) \cdot \Theta_{YZ} \cdot \mathbf{s}_Z(z) + \log \nu_{XYZ}(x, y, z) - \psi_{XYZ}(\theta_X, \theta_Y, \theta_Z, \Theta_{XY}, \Theta_{YZ})$, (7)

where $\nu_{XYZ} = \nu_X \cdot \nu_Y \cdot \nu_Z$ is the HMoG base measure, and $\psi_{XYZ}$ is the log-partition function. This construction ensures that $X$ and $Z$ are conditionally independent given $Y$, which allows us to depict an HMoG as a hierarchical graphical model (Fig. 1).

By reorganizing terms, the density of an HMoG may be expressed as a form of harmonium density, so that we may apply the theory of conjugation to HMoGs. In particular, by applying Corollary 2, we may express the observable density (Eq. 5) of an HMoG as

 $p(x) = \nu_X(x) \, e^{\mathbf{s}_X(x) \cdot \theta_X + \psi_Z(\theta_Z + \rho'_Z) - \psi_Z(\theta_Z + \rho^*_Z) + \rho'_0 - \rho^*_0 - \rho_0}$, (8)

where $\rho_Y$ and $\rho_0$ are the conjugation parameters of the linear Gaussian model; where $\rho^*_Z$ and $\rho^*_0$ are the conjugation parameters of the mixture model for the conjugated prior parameters $\theta_Y + \rho_Y$; and where $\rho'_Z$ and $\rho'_0$ are the conjugation parameters of the mixture model for the posterior parameters $\theta_Y + x \cdot \Theta_{XY}$ (see Appendix A for a detailed derivation and the conjugation parameters).

With regards to fitting, computing the expectation step for HMoGs reduces to computing the log-partition gradient of a mixture model, which is available in closed-form. The maximization step, on the other hand, does not afford a closed-form expression. Nevertheless, we can solve the maximization step with gradient ascent as long as we can evaluate the gradient of the HMoG log-partition function $\psi_{XYZ}$, which is indeed possible for an HMoG through judicious use of Eq. 4 (see Appendix A). Our strategy for fitting HMoGs is thus to evaluate the expectation-step in closed-form, and then use the Adam gradient optimizer to approximately solve the maximization step [kingma_adam_2014].
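The idea of a gradient-based maximization step can be sketched in the univariate Gaussian family (our own toy example; the paper uses Adam on the full HMoG, whereas here plain gradient ascent suffices): for target mean parameters $\eta$, the M-step objective $\theta \cdot \eta - \psi(\theta)$ is concave with gradient $\eta - \tau(\theta)$, so gradient ascent converges to $\tau^{-1}(\eta)$ without ever inverting the gradient map in closed form.

```python
# Gradient-ascent M-step sketch: ascend theta . eta - psi(theta), whose
# gradient is eta - tau(theta). Our own illustrative construction.
import numpy as np

def tau(theta):
    """Mean parameters (E[x], E[x^2]) of the univariate Gaussian family."""
    t1, t2 = theta
    var = -0.5 / t2
    mean = t1 * var
    return np.array([mean, mean ** 2 + var])

eta = np.array([1.0, 2.0])     # target: mean 1, variance 1
theta = np.array([0.0, -1.0])  # initial natural parameters

for _ in range(5000):
    theta = theta + 0.05 * (eta - tau(theta))  # plain gradient ascent

# The ascent recovers tau^{-1}(eta): mean 1, variance 1.
var = -0.5 / theta[1]
mean = theta[0] * var
assert abs(mean - 1.0) < 1e-4 and abs(var - 1.0) < 1e-4
```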

Finally, once we have trained our HMoG model we may project and classify new observations. The probabilistic structure of HMoGs suggests that the projection of a data point $x$ is given by $\mathbb{E}[Y \mid x]$, which mathematically involves marginalizing out the cluster index $z$. Similarly, the cluster membership of a data point is given by $p(z \mid x)$, which involves marginalizing out the features $y$. In practice these probabilistic definitions of projection and classification perform very similarly to their two-stage versions, and for consistency we use the two-stage projection and classification algorithms with two-stage models, and the probabilistic versions with unified models. For a more detailed discussion see Appendix A.

### Notes on training parameters and implementation details

When fitting two-stage algorithms, we always run 100 iterations of EM each for the dimensionality-reduction models and the MoG, as training times were negligible. We initialized the PCA/FA parameters by setting the mean equal to the empirical mean, the isotropic/diagonal variance to the corresponding empirical value, and initializing the projection matrix randomly with values between -0.01 and 0.01. After fitting PCA/FA, we initialized the MoG parameters by fitting a multivariate normal to the projected data, and setting the mean and covariance of each cluster to a sample point from the fit normal, and the covariance of the fit normal, respectively. We set the weight distribution to be uniform.

Fitting the HMoG PCA/FA algorithms on the Iris and synthetic datasets was possible using the two-stage initialization schemes and a range of learning parameters. For the RNA-Seq data, we initialized an HMoG by first fitting its parameters with a two-stage algorithm, and then running 800 iterations of unified EM. We used relatively conservative learning parameters: a learning rate of for the Adam optimizer, with 2000 steps per EM iteration. To avoid local maxima during optimization, we always ran 10 simulations in parallel and selected the model that maximized training data performance.

All algorithms were implemented in Haskell and targeted the CPU, and simulations were run on an AMD Ryzen 9 5950X processor. By distributing parallel simulations over cores, a single fit of an HMoG FA model on the RNA-Seq data took about 10 minutes.

## 3 Applications

We next show how HMoG theory is useful for dimensionality-reduction and clustering in three application scenarios: (i) a simple validation of the HMoG theory on the Iris flower dataset [noauthor_iris_2022]; (ii) a demonstration on synthetic data of the limitations of standard two-stage algorithms, and how HMoGs can overcome them; and (iii) an application to RNA-Seq data [duo_systematic_2020, lause_analytic_2021] to show how HMoGs enhance performance over two-stage approaches on real data.

On each dataset, we compare four methods for dimensionality-reduction and clustering: we fit (i) probabilistic PCA or (ii) FA to the given dataset with EM, followed by separately fitting a MoG to the projected data with EM; or we combine either (iii) probabilistic PCA or (iv) FA with a MoG into an HMoG, and fit the model with the unified EM algorithm. We refer to these methods as two-stage PCA, two-stage FA, HMoG PCA, and HMoG FA, respectively (Fig. 1).

### 3.1 HMoGs afford rigorous comparison of high-dimensional clustering methods

As we have shown, two-stage algorithms maximize a lower bound on the likelihood of an equivalent HMoG, even when they optimize dimensionality-reduction and clustering separately. We demonstrate this by comparing the log-likelihood trajectories of all four methods given the Iris flower dataset [noauthor_iris_2022] (Fig. 2A). We observed that the log-likelihood increased every iteration for all four training algorithms. For this dataset, we saw little performance difference between the four methods for the fully trained model. Qualitatively, the projections and clusters learned by the two-stage PCA method were indeed sufficient for capturing and clustering features of the data (Fig. 2B), and the HMoG FA model did not perform noticeably better (Fig. 2C). In conclusion, HMoG theory supports quantitative model comparison through evaluation of the HMoG log-likelihood (Eq. 8).

### 3.2 HMoG EM overcomes limitations of two-stage algorithms

We next compared our methods on a synthetic dataset designed to reveal the limitations of the two-stage approaches (Fig. 3). Data was generated from a ground-truth, HMoG FA model with two clusters, a 1-dimensional feature space, and a 2-dimensional observable space. We designed the ground-truth HMoG so that the dimension along which cluster membership changes in observation space is perpendicular to the direction of maximum variance (Fig. 3A-C).

Unsurprisingly, two-stage PCA learned a projection that only followed the direction of maximum variance, and failed to effectively model either the observable density (Fig. 3A) or feature density (Fig. 3D). Although we do not present the results here, HMoG PCA also failed to effectively model the data in the same way as two-stage PCA, indicating that the limitation is due to the simpler structure of the PCA model, with its single isotropic noise parameter (Fig. 1), rather than the training algorithm used.

On the other hand, two-stage FA performed better than PCA, and learned a projection that captures the feature direction along which cluster membership changes (Fig. 3B). This was reflected in the feature distribution learned by the two-stage MoG, which captured the clusters in the feature space of the ground-truth HMoG (Fig. 3E). Nevertheless, two-stage FA failed to capture the fine-structure of the observable densities, and in repeated simulations we found that it often learned suboptimal clusterings. In contrast, HMoG FA nearly perfectly modeled the observable density (Fig. 3C), and reliably separated the two clusters in feature space (Fig. 3F). Overall, we found that FA-based models of dimensionality reduction have important advantages over PCA-based models, which are only fully exploited by training them with the unified, HMoG EM algorithm.

### 3.3 HMoGs exceed predictive performance of two-stage models on RNA-Seq data

Finally, we applied our methods to a single-cell RNA-Seq dataset from peripheral blood mononuclear cells (PBMCs), a subset of the human immune system. The data contained cells with measurements of genes [duo_systematic_2020]. The data was preprocessed by computing analytic Pearson residuals [lause_analytic_2021] using scanpy 1.9 default settings, and the twenty genes with the highest residual variance were selected. The dataset was grouped into eight immune cell subtypes identified by [zheng_massively_2017] using previously known genetic markers, allowing us to compare model classification performance to ground-truth clusters.

We evaluated the predictive performance of all four methods on the RNA-Seq data through 5-fold cross-validation of the log-likelihood, as we varied the number of dimensions and number of clusters in the feature space (Fig. 4A-C). In contrast to the Iris flower data, we found that HMoG FA significantly outperformed all the alternative approaches. For example, for four clusters and four features, two-stage PCA, two-stage FA, and HMoG FA achieved predictive log-likelihoods of , , and , respectively. Moreover, HMoG FA could achieve equivalent performance (predictive log-likelihood of ) to the two-stage methods with merely three clusters and three features. We again found that HMoG PCA performance did not significantly exceed that of two-stage PCA, and so do not present the results here.

To better understand the feature space learned by each model, we refit the models on the full datasets. For each method, we set the number of clusters and feature dimensions to four, as we found that predictive performance saturates around 4 clusters, and chose 4 features for simplicity. We then focused on a representative plane in the 4-dimensional feature space. Overall we found that each model successfully separated out three of the cell subtypes, and used a fourth cluster to represent “everything else” (Fig. 4D-F).

Since the feature spaces of the methods were qualitatively similar, we sought to better understand why the log-likelihood of HMoG FA was so much higher than that of the other techniques. Due to its hierarchical structure, we can marginalize out the features in the HMoG density, and then factor the joint distribution over observations and cluster indices into $p(x, z) = p(z \mid x) \, p(x)$. Given labelled data, performance may then be quantified as the combination of classification performance — i.e. how much the labels under the model match the true labels — and how well the model captures the data distribution. Quantitatively, we found no significant difference in classification performance between our four methods when using 4 clusters and 4 features — when we associated each cluster with its most strongly correlated label on the training data, we found that all models achieved held-out classification performance of . When we allowed one cluster to represent multiple labels (to permit one cluster to model the “everything else”), we found each technique achieved held-out classification performance of . Overall, this suggests that each model was able to saturate classification performance given 4 clusters, and that the gains achieved by HMoG FA were entirely due to how well the clusters it learned in the observable space were able to represent the relevant data.

## Discussion

In this paper we have presented a unified model of dimensionality reduction and clustering we call a hierarchical mixture of Gaussians. We showed how the theory of HMoGs generalized existing two-stage algorithms for dimensionality reduction and clustering, allowing us to derive an exact expression for the log-likelihood of these methods, and to enhance their performance with an EM algorithm that trains both stages in parallel. We applied our theory of HMoGs to three datasets, and showed how HMoGs can help existing methods exceed their limitations.

A notable finding in our applications was the performance advantage achieved by FA-based models over PCA-based models. It is well-known that PCA-based dimensionality reduction can fail to capture the dimensions relevant for clustering [chang_using_1983], and it has been suggested that the rescaling properties of FA might help address this [mclachlan_finite_2019]. Our applications confirmed the potential advantages of FA over PCA for dimensionality-reduction, both in standard two-stage models and HMoGs. Given the relative dominance of PCA over FA in dimensionality-reduction applications [see duo_systematic_2020], our results call for additional investigation of the advantages offered by FA-based methods.

An advantage of our framework is its theoretical simplicity and flexibility, as ultimately we fit HMoGs by maximizing the likelihood using EM, where the maximization step is implemented with a gradient ascent procedure. As such, the basic fitting algorithm we provided could be further enhanced with regularization or sparsity techniques [witten_framework_2010, yi_regularized_2015] or specialized EM algorithms for large datasets [chen_stochastic_2018].

Finally, our exponential family-based theory of HMoGs opens two directions for further research. On one hand, the theory of conjugation we developed provides a toolbox for designing tractable hierarchical models out of arbitrary exponential families. As such, it should prove useful to researchers who work with data that call for more specialized distributions, such as Poisson distributions for count data. On the other hand, the log-linear structure of our exponential family models makes them highly amenable to GPU-based computation, and should allow optimized implementations of our algorithms to scale up to the massive sizes of modern datasets.

## Appendix A Appendix

Towards developing and explaining the theory of HMoGs, this appendix is organized into the following sections:

1. A brief introduction to exponential families, largely in order to fix notation.

2. An overview of the theory of exponential family graphical models known as harmoniums.

3. Derivation of the theory of conjugated harmoniums.

4. Formulations of mixture models and linear Gaussian models as conjugated harmoniums.

5. Combining these various components to formalize an HMoG.

The culmination of these theoretical developments is a series of algorithms for computing the log-likelihood of HMoGs and for maximizing it with expectation-maximization.

### a.1 Exponential families

For a thorough development of exponential family theory see amari_methods_2007 and wainwright_graphical_2008.

Consider a random variable $X$ on the sample space $\mathcal{X}$, with an unknown distribution $P$, and suppose $x^{(1)}, \ldots, x^{(n)}$ is an independent and identically distributed sample from $P$, such that $X \sim P$. We may model $P$ based on the sample by defining a statistic $s \colon \mathcal{X} \to \mathbb{R}^d$, where $\mathbb{R}^d$ is a space with dimension $d$, and looking for a probability distribution $Q$ that satisfies $\mathbb{E}_Q[s(X)] = \frac{1}{n}\sum_{i=1}^n s(x^{(i)})$, where $\mathbb{E}_Q[s(X)]$ is the expected value of $s(X)$ under $Q$. On its own this is an under-constrained problem, but if we further assume that $Q$ must have maximum entropy, then we may define a family (or manifold) of distributions $\mathcal{M}$ that is uniquely described by the possible values of $\frac{1}{n}\sum_{i=1}^n s(x^{(i)})$.

A $d$-dimensional exponential family $\mathcal{M}$ is defined by a sufficient statistic $s \colon \mathcal{X} \to \mathbb{R}^d$, as well as a base measure $\nu$ which helps define integrals and expectations within the family. An exponential family is parameterized by a set of natural parameters $\Theta$, such that each element of the family may be identified with some parameters $\theta \in \Theta$. The density of the distribution $Q$ is given by $q(x) = e^{s(x) \cdot \theta - \psi(\theta)}$, where $\psi$ is the log-partition function. Expectations of any $f \colon \mathcal{X} \to \mathbb{R}$ are then given by $\mathbb{E}_Q[f(X)] = \int_{\mathcal{X}} f(x) q(x) \, d\nu(x)$. Because each $Q \in \mathcal{M}$ is uniquely defined by $\mathbb{E}_Q[s(X)]$, the means of the sufficient statistic also parameterize $\mathcal{M}$. The space of all mean parameters is $\mathrm{H} = \{\mathbb{E}_Q[s(X)] : Q \in \mathcal{M}\}$, and we denote them by $\eta \in \mathrm{H}$. Finally, a sufficient statistic is minimal when its component functions are non-constant and linearly independent. If the sufficient statistic of a given family is minimal, then $\Theta$ and $\mathrm{H}$ are isomorphic. Moreover, the transition functions between them $\tau \colon \Theta \to \mathrm{H}$ and $\tau^{-1} \colon \mathrm{H} \to \Theta$ are given by $\tau(\theta) = \nabla\psi(\theta)$, and $\tau^{-1}(\eta) = \nabla\phi(\eta)$, where $\phi$ is the negative entropy of $Q$ as a function of $\eta$. The transition functions $\tau$ and $\tau^{-1}$ are also referred to as the forward and backward mappings, respectively.
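As a concrete illustration (a sketch of our own, not from the paper's codebase), the following Python snippet instantiates these definitions for the one-dimensional Poisson family, where $s(x) = x$, the base measure is $1/x!$, and $\psi(\theta) = e^\theta$, and checks numerically that the forward mapping is the gradient of the log-partition function and that the backward mapping inverts it:

```python
import numpy as np

# The Poisson exponential family: sufficient statistic s(x) = x,
# natural parameter theta, log-partition psi(theta) = exp(theta).

def psi(theta):
    """Log-partition function of the Poisson family."""
    return np.exp(theta)

def tau(theta):
    """Forward mapping theta -> eta, i.e. the gradient of psi.

    For the Poisson family this is exp(theta), the mean rate."""
    return np.exp(theta)

def tau_inv(eta):
    """Backward mapping eta -> theta, the gradient of the negative entropy."""
    return np.log(eta)

theta = 0.7
# Check tau(theta) = psi'(theta) with a central finite difference.
h = 1e-6
fd = (psi(theta + h) - psi(theta - h)) / (2 * h)
assert abs(tau(theta) - fd) < 1e-5
# On a minimal family, the forward and backward mappings invert each other.
assert abs(tau_inv(tau(theta)) - theta) < 1e-12
```

Richer families (e.g. Gaussians) follow the same pattern, only with vector-valued statistics and parameters.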

### a.2 Exponential Family Harmoniums

An exponential family harmonium $\mathcal{M}_{XY}$ is a kind of product exponential family which includes restricted Boltzmann machines as a special case [smolensky_information_1986, welling_exponential_2005]. We may construct an exponential family harmonium out of the exponential families $\mathcal{M}_X$ and $\mathcal{M}_Y$ by defining the base measure of $\mathcal{M}_{XY}$ as the product measure $\nu_X \times \nu_Y$, and by defining the sufficient statistic of $\mathcal{M}_{XY}$ as the vector which contains the concatenation of all the elements in $s_X(x)$, $s_Y(y)$, and the outer product $s_X(x) \otimes s_Y(y)$. More concretely, $\mathcal{M}_{XY}$ is the manifold that contains all the distributions with densities of the form $q(x, y) \propto e^{s_X(x) \cdot \theta_X + s_Y(y) \cdot \theta_Y + s_X(x) \cdot \Theta_{XY} \cdot s_Y(y)}$, where $\theta_X$, $\theta_Y$, and $\Theta_{XY}$ are the natural parameters of $Q$.
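This construction can be made concrete with a toy example. The following Python sketch (our own illustration, with arbitrary parameter values) builds the smallest possible harmonium, a one-visible/one-hidden restricted Boltzmann machine where both $\mathcal{M}_X$ and $\mathcal{M}_Y$ are Bernoulli families with $s(x) = x$, and verifies that its density normalizes:

```python
import numpy as np
from itertools import product

# Natural parameters of the toy harmonium (arbitrary illustrative values).
theta_x, theta_y, theta_xy = 0.3, -0.5, 1.2

def unnorm(x, y):
    """Unnormalized density exp(s_X(x)*theta_x + s_Y(y)*theta_y
    + s_X(x)*theta_xy*s_Y(y)) with Bernoulli statistics s(x) = x."""
    return np.exp(x * theta_x + y * theta_y + x * theta_xy * y)

# The partition function is a sum over the four joint states;
# log(Z) is the log-partition function psi_XY of this harmonium.
Z = sum(unnorm(x, y) for x, y in product([0, 1], repeat=2))

def q(x, y):
    return unnorm(x, y) / Z

# Sanity check: the density sums to one over the joint sample space.
assert abs(sum(q(x, y) for x, y in product([0, 1], repeat=2)) - 1.0) < 1e-12
```

With continuous families the sum over states becomes an integral, but the bilinear form of the exponent is identical.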

The linear structure of harmoniums affords simple expressions for their training algorithms. On one hand, given a sample $x^{(1)}, \ldots, x^{(n)}$ and a harmonium $Q$ with parameters $\theta_X$, $\theta_Y$, and $\Theta_{XY}$, an iteration of the expectation-maximization algorithm (EM) may be formulated as:

Expectation Step:

compute the unobserved means $\eta_Y^{(i)} = \tau_Y(\theta_Y + s_X(x^{(i)}) \cdot \Theta_{XY})$ for every $x^{(i)}$,

Maximization Step:

evaluate $(\theta_X, \theta_Y, \Theta_{XY}) = \tau_{XY}^{-1}\big(\tfrac{1}{n}\sum_{i=1}^n s_X(x^{(i)}),\ \tfrac{1}{n}\sum_{i=1}^n \eta_Y^{(i)},\ \tfrac{1}{n}\sum_{i=1}^n s_X(x^{(i)}) \otimes \eta_Y^{(i)}\big)$.

On the other hand, the stochastic log-likelihood gradients of the parameters of $Q$ are

 $\partial_{\theta_X} \log q(x^{(i)}) = s_X(x^{(i)}) - \eta_X,$
 $\partial_{\theta_Y} \log q(x^{(i)}) = \tau_Y(\theta_Y + s_X(x^{(i)}) \cdot \Theta_{XY}) - \eta_Y,$
 $\partial_{\Theta_{XY}} \log q(x^{(i)}) = s_X(x^{(i)}) \otimes \tau_Y(\theta_Y + s_X(x^{(i)}) \cdot \Theta_{XY}) - \mathrm{H}_{XY},$

where $\eta_X$, $\eta_Y$, and $\mathrm{H}_{XY}$ are the mean parameters of $Q$. It is often the case that we have a closed-form expression for the forward mapping $\tau_{XY}$, but not for the backward mapping $\tau_{XY}^{-1}$, and therefore cannot compute the maximization step in closed-form. Although we could simply train a harmonium with gradient ascent on the log-likelihood, for certain models this can lead to numerical instability, and an effective strategy is rather to evaluate the expectation step, and then approximate the maximization step with gradient ascent.
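These gradient expressions can be sanity-checked numerically. The following Python sketch (our own toy construction with arbitrary parameter values) compares them against finite differences of the marginal log-likelihood on a Bernoulli-Bernoulli harmonium, where every expectation can be computed by enumerating the four joint states:

```python
import numpy as np
from itertools import product

def log_qx(params, x):
    """Marginal log-likelihood log q(x) of a Bernoulli-Bernoulli harmonium."""
    tx, ty, txy = params
    joint = lambda a, b: np.exp(a * tx + b * ty + a * txy * b)
    Z = sum(joint(a, b) for a, b in product([0, 1], repeat=2))
    return np.log(joint(x, 0) + joint(x, 1)) - np.log(Z)

def analytic_grads(params, x):
    """The stochastic log-likelihood gradients stated in the text."""
    tx, ty, txy = params
    joint = lambda a, b: np.exp(a * tx + b * ty + a * txy * b)
    Z = sum(joint(a, b) for a, b in product([0, 1], repeat=2))
    # Mean parameters of the joint: E[X], E[Y], and E[X*Y].
    eta_x = sum(a * joint(a, b) for a, b in product([0, 1], repeat=2)) / Z
    eta_y = sum(b * joint(a, b) for a, b in product([0, 1], repeat=2)) / Z
    h_xy = sum(a * b * joint(a, b) for a, b in product([0, 1], repeat=2)) / Z
    # tau_Y is the Bernoulli forward mapping, i.e. the logistic function.
    post = 1.0 / (1.0 + np.exp(-(ty + x * txy)))
    return np.array([x - eta_x, post - eta_y, x * post - h_xy])

params = np.array([0.3, -0.5, 1.2])  # arbitrary illustrative parameters
x = 1
grads = analytic_grads(params, x)
for i in range(3):
    e = np.zeros(3); e[i] = 1e-6
    fd = (log_qx(params + e, x) - log_qx(params - e, x)) / 2e-6
    assert abs(grads[i] - fd) < 1e-5
```

In realistic models the mean parameters are computed from the model's forward mapping rather than by enumeration, but the gradient structure is the same.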

### a.3 Harmoniums and Conjugate Priors

The conditional distributions $q(x \mid y)$ and $q(y \mid x)$ of a harmonium have a simple linear structure, and are themselves always exponential family distributions; in particular, $Q_{X \mid Y=y} \in \mathcal{M}_X$ and $Q_{Y \mid X=x} \in \mathcal{M}_Y$ for any $y$ or $x$, respectively. In general, however, the marginal distributions $Q_X$ and $Q_Y$ are not members of $\mathcal{M}_X$ and $\mathcal{M}_Y$, respectively. This is both a blessing and a curse: on one hand, the marginal distribution $Q_X$ can model much more complex datasets than the simpler elements of $\mathcal{M}_X$. On the other hand, because the prior $Q_Y$ may not be computationally tractable, various computations with harmoniums, such as sampling and inference, may also prove intractable.

Ideally, the prior $Q_Y$ would be in $\mathcal{M}_Y$ to facilitate computability, while the modelled observable distribution $Q_X$ would be more complex than the elements of $\mathcal{M}_X$, and some classes of harmoniums do indeed have this structure. In general, a prior $q(y)$ is said to be conjugate to a posterior $q(y \mid x)$ if both the prior and posterior have the same form for any observation $x$. In the context of harmoniums, since the posterior $Q_{Y \mid X=x} \in \mathcal{M}_Y$ for any $x$, the prior is conjugate to the posterior iff $Q_Y \in \mathcal{M}_Y$. We refer to harmoniums with conjugate priors as conjugated harmoniums.

###### Lemma 1.

Suppose that $\mathcal{M}_{XY}$ is a harmonium family defined by the exponential families $\mathcal{M}_X$ and $\mathcal{M}_Y$, and that $Q \in \mathcal{M}_{XY}$ has parameters $\theta_X$, $\theta_Y$, and $\Theta_{XY}$. Then $Q$ is conjugated if and only if there exists a vector $\rho$ and a constant $\rho_0$ such that

 $\psi_X(\theta_X + \Theta_{XY} \cdot s_Y(y)) = s_Y(y) \cdot \rho + \rho_0,$ (9)

for any $y$, and where the prior $Q_Y \in \mathcal{M}_Y$ has the natural parameters

 $\theta_Y^* = \theta_Y + \rho.$ (10)
###### Proof.

In general, the density of the harmonium prior distribution is given by

 $q(y) = e^{s_Y(y) \cdot \theta_Y + \psi_X(\theta_X + \Theta_{XY} \cdot s_Y(y)) - \psi_{XY}(\theta_X, \theta_Y, \Theta_{XY})}.$ (11)

On one hand, if we assume that $Q$ is conjugated, then it also holds that $Q_Y \in \mathcal{M}_Y$ with some natural parameters $\theta_Y^*$, and therefore

 $q(y) \propto e^{\theta_Y \cdot s_Y(y) + \psi_X(\theta_X + \Theta_{XY} \cdot s_Y(y))} \propto e^{\theta_Y^* \cdot s_Y(y)}$
 $\implies \theta_Y \cdot s_Y(y) + \psi_X(\theta_X + \Theta_{XY} \cdot s_Y(y)) = \theta_Y^* \cdot s_Y(y) + \rho_0$
 $\implies \psi_X(\theta_X + \Theta_{XY} \cdot s_Y(y)) = s_Y(y) \cdot \rho + \rho_0,$

for some constant $\rho_0$, and $\rho = \theta_Y^* - \theta_Y$.

On the other hand, if we first assume that Eq. 9 holds, then $q(y)$ is given by

 $q(y) \propto e^{\theta_Y \cdot s_Y(y) + \psi_X(\theta_X + \Theta_{XY} \cdot s_Y(y))} \propto e^{(\theta_Y + \rho) \cdot s_Y(y)},$ (12)

which implies that $Q_Y \in \mathcal{M}_Y$ with parameters $\theta_Y^* = \theta_Y + \rho$. ∎

We refer to $\rho$ and $\rho_0$ as the conjugation parameters. The value of this lemma is that it allows us to reduce various computations to evaluating the conjugation parameters, and shows how these computations are fundamentally the same even for apparently unrelated models. To wit, the following corollary describes how to compute the log-partition function of a conjugated harmonium.
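Lemma 1 can be verified numerically on a small example. In the following Python sketch (our own toy construction, with arbitrary parameter values), the latent family is Bernoulli, so $s_Y(y) = y$ takes only the values 0 and 1 and any function of it is affine, which makes the harmonium automatically conjugated; we check Eq. 9 directly and compare the prior parameters of Eq. 10 against a brute-force marginal:

```python
import numpy as np
from itertools import product

# Natural parameters of a Bernoulli-Bernoulli harmonium (arbitrary values).
theta_x, theta_y, theta_xy = 0.3, -0.5, 1.2

def psi_x(t):
    """Log-partition function of a Bernoulli family with s(x) = x."""
    return np.log(1.0 + np.exp(t))

# Since s_Y(y) = y is 0 or 1, psi_X(theta_x + theta_xy * y) is affine in y:
rho_0 = psi_x(theta_x)
rho = psi_x(theta_x + theta_xy) - psi_x(theta_x)

# Eq. 9: psi_X(theta_x + theta_xy * y) == y * rho + rho_0 for y in {0, 1}.
for y in (0, 1):
    assert abs(psi_x(theta_x + theta_xy * y) - (y * rho + rho_0)) < 1e-12

# Eq. 10: the prior is Bernoulli with natural parameter theta_y + rho.
# Compare against the brute-force marginal q(y = 1).
joint = lambda x, y: np.exp(x * theta_x + y * theta_y + x * theta_xy * y)
Z = sum(joint(x, y) for x, y in product([0, 1], repeat=2))
prior_brute = (joint(0, 1) + joint(1, 1)) / Z
prior_lemma = 1.0 / (1.0 + np.exp(-(theta_y + rho)))
assert abs(prior_brute - prior_lemma) < 1e-12
```

For latent families with larger sample spaces, conjugation is a genuine restriction rather than automatic, which is what the lemma characterizes.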

###### Corollary 2.

Suppose that $\mathcal{M}_{XY}$ is a harmonium family defined by the exponential families $\mathcal{M}_X$ and $\mathcal{M}_Y$, and that the harmonium $Q$ with parameters $\theta_X$, $\theta_Y$, and $\Theta_{XY}$ is conjugated, with prior parameters $\theta_Y^* = \theta_Y + \rho$ and conjugation parameters $\rho$ and $\rho_0$. Then $\psi_{XY}$ satisfies

 $\psi_{XY}(\theta_X, \theta_Y, \Theta_{XY}) = \psi_Y(\theta_Y^*) + \rho_0.$
###### Proof.

Lemma 1 implies that the prior of the conjugated harmonium satisfies

 $q(y) = e^{\theta_Y^* \cdot s_Y(y) - \psi_Y(\theta_Y^*)} = e^{\theta_Y \cdot s_Y(y) + \rho \cdot s_Y(y) + \rho_0 - \psi_{XY}(\theta_X, \theta_Y, \Theta_{XY})}$
 $\iff \psi_{XY}(\theta_X, \theta_Y, \Theta_{XY}) = (\theta_Y + \rho - \theta_Y^*) \cdot s_Y(y) + \psi_Y(\theta_Y^*) + \rho_0 = \psi_Y(\theta_Y^*) + \rho_0,$

where the last equality follows because $\theta_Y^* = \theta_Y + \rho$. ∎

Typically, evaluating the log-partition function $\psi_Y$ of $\mathcal{M}_Y$ will be tractable, which means that various computations, including evaluating the density of $Q_X$, are also tractable. In particular, given the conjugated harmonium $Q$ with parameters $\theta_X$, $\theta_Y$, and $\Theta_{XY}$,

 $\log q(x) = s_X(x) \cdot \theta_X + \psi_Y(\theta_Y + s_X(x) \cdot \Theta_{XY}) - \psi_{XY}(\theta_X, \theta_Y, \Theta_{XY})$
 $= s_X(x) \cdot \theta_X + \psi_Y(\theta_Y + s_X(x) \cdot \Theta_{XY}) - \psi_Y(\theta_Y^*) - \rho_0,$

where $\theta_Y^* = \theta_Y + \rho$.

### a.4 Mixture Models and Linear Gaussian Models

Constructing a hierarchical mixture of Gaussians ultimately requires putting together a mixture model and a linear Gaussian model in the “right” way. To do this, we use the theory of conjugated harmoniums, by first defining the exponential families of categorical and normal distributions, and then deriving the conjugation parameters of mixture models and linear Gaussian models.

The $d$-dimensional categorical exponential family $\mathcal{M}_C$ contains all the distributions over the integer values $0, \ldots, d$. The base measure of $\mathcal{M}_C$ is the counting measure, and for $1 \le i \le d$, the $i$th element of its sufficient statistic $s_C$ at $k$ is 1 when $k = i$, and 0 for any other elements. The sufficient statistic is thus a vector of all zeroes when $k = 0$, and all zeroes except for the $k$th element otherwise. Finally, the log-partition function is $\psi_C(\theta) = \log(1 + \sum_{i=1}^d e^{\theta_i})$, and the forward mapping is given by $\tau_C(\theta)_i = e^{\theta_i} / (1 + \sum_{j=1}^d e^{\theta_j})$.
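These expressions are straightforward to check numerically. The following Python sketch (illustrative parameter values of our own choosing) implements the categorical log-partition function and forward mapping with class 0 as the baseline, and verifies that the forward mapping is the gradient of the log-partition function:

```python
import numpy as np

def psi_cat(theta):
    """Log-partition function of the d-dimensional categorical family."""
    return np.log(1.0 + np.sum(np.exp(theta)))

def tau_cat(theta):
    """Forward mapping: probabilities of classes 1..d (class 0 is baseline)."""
    return np.exp(theta) / (1.0 + np.sum(np.exp(theta)))

theta = np.array([0.5, -1.0, 2.0])  # arbitrary natural parameters, d = 3
eta = tau_cat(theta)

# The mean parameters are the probabilities of classes 1..d; together with
# the probability of the baseline class 0 they sum to one.
p0 = 1.0 / (1.0 + np.sum(np.exp(theta)))
assert abs(eta.sum() + p0 - 1.0) < 1e-12

# The forward mapping is the gradient of the log-partition function.
for i in range(3):
    e = np.zeros(3); e[i] = 1e-6
    fd = (psi_cat(theta + e) - psi_cat(theta - e)) / 2e-6
    assert abs(eta[i] - fd) < 1e-6
```

Note that this parameterization has no redundant coordinate: the baseline class 0 carries no natural parameter, which is what makes the sufficient statistic minimal.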

Now, a mixture distribution is simply a harmonium defined by the exponential families $\mathcal{M}_X$ and $\mathcal{M}_C$, where $\mathcal{M}_C$ is the family of categorical distributions. For a mixture model $Q$, the observable distribution $Q_X$ can typically model distributions (e.g. multimodal distributions) outside of $\mathcal{M}_X$, yet the prior $Q_C$ is indeed in $\mathcal{M}_C$. We show how to compute the conjugation parameters of a mixture model in Algorithm 1, and how to compute the forward mapping in Algorithm 2.
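To make this concrete, the following Python sketch (our own toy example, using Poisson rather than Gaussian components and arbitrary parameter values) writes a two-component-plus-baseline mixture of Poisson distributions as a conjugated harmonium, computes its conjugation parameters as in the lemma above, and checks that the closed-form observable density matches the brute-force mixture density:

```python
import numpy as np
from math import factorial

# Harmonium parameters: Poisson observable family (s_X(x) = x,
# psi_X(t) = exp(t)) and a 2-dimensional categorical latent family.
theta_x = np.log(2.0)             # baseline component has rate 2
theta_xy = np.array([0.8, -0.4])  # interaction parameters for classes 1, 2
theta_y = np.array([0.2, -0.3])   # categorical natural parameters

psi_x = np.exp                                      # Poisson log-partition
psi_cat = lambda t: np.log(1.0 + np.sum(np.exp(t))) # categorical log-partition

# Conjugation parameters: psi_X(theta_x + theta_xy . s_C(k)) is affine
# in the one-hot statistic s_C(k).
rho_0 = psi_x(theta_x)
rho = psi_x(theta_x + theta_xy) - rho_0

# Component rates and prior mixture weights implied by the harmonium.
rates = np.exp(np.concatenate(([theta_x], theta_x + theta_xy)))
w = np.exp(np.concatenate(([0.0], theta_y + rho)))
w /= w.sum()

def mixture_density(x):
    """Brute-force mixture of Poisson densities."""
    return sum(wk * rk**x * np.exp(-rk) / factorial(x)
               for wk, rk in zip(w, rates))

def harmonium_density(x):
    """Closed-form observable density via the conjugation parameters."""
    log_q = (x * theta_x + psi_cat(theta_y + x * theta_xy)
             - psi_cat(theta_y + rho) - rho_0 - np.log(factorial(x)))
    return np.exp(log_q)

for x in range(6):
    assert abs(mixture_density(x) - harmonium_density(x)) < 1e-12
```

Swapping the Poisson family for a multivariate normal family gives a mixture of Gaussians; only the sufficient statistics and log-partition functions change.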