1 Introduction
Clustering, the grouping of individuals or items based on their similarity, is an essential part of knowledge discovery. The growing complexity of modern datasets thus poses a challenge, because many clustering algorithms, such as mixtures of Gaussians (MoGs), suffer from the so-called "curse of dimensionality", and exhibit limited performance when classifying high-dimensional data. This arises not only because model complexity scales with dimension, but also because similarity metrics used to evaluate model performance tend to break down in high-dimensional spaces [beyer_when_1999, assent_clustering_2012].

A common approach for addressing this challenge is to first apply dimensionality reduction techniques, such as principal component analysis (PCA) or factor analysis (FA), to project the data into a lower-dimensional space, and then cluster the projected data. Such two-stage algorithms are widely applied in problem domains such as image processing [houdard_highdimensional_2018, hertrich_pca_2022] and time-series analysis [warren_liao_clustering_2005], and fields such as neuroscience [lewicki_review_1998, baden_functional_2016] and bioinformatics [witten_framework_2010, duo_systematic_2020].
Nevertheless, two-stage approaches struggle with two primary limitations. Firstly, the dimensionality reduction is typically implemented as a preprocessing step, and is not further optimized based on the results of the subsequent clustering. This can lead to suboptimal projections: if we consider PCA, for example, the directions in the data that best separate the clusters might not be the directions of maximum variance that define the PCA projection [chang_using_1983, mclachlan_finite_2019]. Secondly, two-stage algorithms rely on optimizing two distinct objectives, and it is therefore nontrivial to quantify how well the complete model describes the data. As such, even if we consider algorithms that update projections based on clustering results [ding_adaptive_2002, fern_random_2003], it is difficult to be sure that these updates improve overall model performance, because we cannot evaluate the likelihood of the complete model given the data.
To address these limitations, we consider a family of two-stage models that implement dimensionality reduction with a linear Gaussian model (e.g. PCA or FA), and clustering with a mixture of Gaussians (MoG). By applying a novel theory of conjugate priors in latent variable exponential family models, we show how these two stages can be combined and generalized into a single, hierarchical probabilistic model that we call a hierarchical mixture of Gaussians (HMoG).
An HMoG captures both dimensionality reduction and clustering in a single model, and has a single objective given by the likelihood function. Critically, the tractability of HMoGs allows us to derive closed-form expressions for the HMoG density function, and a novel expectation-maximization (EM) algorithm for maximizing the HMoG likelihood. Although the derivation of HMoG theory requires a number of technical innovations, the upshot of the theory is straightforward: a closed-form expression for the HMoG log-likelihood allows us to rigorously compare various algorithms for dimensionality reduction and clustering, and the HMoG EM algorithm allows HMoGs to strictly exceed the performance of the two-stage models that they generalize.
To validate our theory we apply HMoGs to several datasets. In synthetic data, we show how HMoGs can overcome fundamental limitations of standard two-stage algorithms. Finally, in an RNA sequencing (RNA-Seq) dataset we show how our methods can significantly exceed the predictive performance of standard two-stage approaches. Our work is related to previous methods on mixtures of PCA/FA [ghahramani_em_1996, tipping_mixtures_1999, houdard_highdimensional_2018, hertrich_pca_2022]. In contrast with these methods, however, our model has a hierarchical structure that affords a more compact parameterization, and ensures that a shared, low-dimensional feature space is extracted during training.
2 Theory
Our focus will be on HMoGs where the dimensionality reduction is based on either PCA or FA, and the clustering is based on MoGs. Historically, each of these techniques has fairly different theoretical foundations: PCA finds directions of maximum variance in the data, FA explains the data as a linear combination of hidden factors, and MoGs model the data distribution as a weighted sum of multivariate normal distributions. Nevertheless, each technique may be formulated as a probabilistic latent variable model that may be fit to data with EM. This unified, probabilistic foundation is essential for combining PCA/FA and MoGs into an HMoG, and ultimately deriving a single EM algorithm that jointly optimizes dimensionality reduction and clustering.
2.1 Linear Gaussian models and mixture models are probabilistic latent variable models
Suppose we make observations of a random variable, and we hypothesize that it is influenced by some latent random variable. In the maximum likelihood framework, a statistical model aims to capture the distribution of the observations by maximizing the log-likelihood objective with respect to the model parameters. We may extend this framework to a latent variable model that captures both the observable distribution and the effect of the latent variable, by maximizing the log-likelihood of the marginal distribution of the observations. As we will see, both linear Gaussian models and mixture models afford a probabilistic latent variable formulation, and linear Gaussian models include FA and PCA as special cases [roweis_em_1997, bishop_pattern_2006].

On one hand, where the observable and latent variables are continuous, a linear Gaussian model is simply a multivariate normal distribution over their concatenation. As a consequence, the observable distribution and the prior, as well as the likelihood¹ and the posterior, are also multivariate normal distributions [see bishop_pattern_2006]. We may define probabilistic PCA as the case where the columns of the projection matrix are proportional to the ranked eigenvectors of the sample covariance matrix and the noise covariance is isotropic. FA is then the case where the columns of the projection matrix are the so-called factor loadings and the noise covariance is a diagonal matrix. Finally, in both cases, the mean of the posterior distribution probabilistically formalizes projecting a datapoint into the reduced space ("classic" PCA is recovered in the limit as the noise variance tends to zero).

¹Unfortunately there is no naming convention for distinguishing the likelihood of non-Bayesian model parameters from the likelihood of random variables. We will refer to the former as the log-likelihood objective where necessary to help distinguish them.

A mixture model, on the other hand, is a latent variable model where the latent variable represents an index over a finite set of components. Since our goal for HMoGs is ultimately to cluster low-dimensional features, let us introduce a mixture as a model over the features and an index-valued latent variable. In this formulation, the components of the mixture are given by the likelihood at each index, the weights of the mixture are the prior probabilities of the indices, and the weighted sum of the components ("the mixture distribution") is equal to the observable distribution. MoGs, then, are simply the case where each component is a multivariate normal distribution with its own mean and covariance.

FA, PCA, and MoGs can all be fit to data using expectation-maximization (EM) in our probabilistic framework. The two-stage approach to clustering a high-dimensional dataset is thus to first fit an appropriate linear Gaussian model (i.e. FA or PCA) to the dataset, and then fit a MoG to the projected dataset (Fig. 1). After training, the cluster membership of an arbitrary datapoint is determined by first computing its projection, and then computing (and perhaps taking the maximum of) the index probabilities under the MoG.

2.2 New exponential family theory enables tractable computation with latent variable models
When analyzing high-dimensional data, one must take care to avoid performing complex computations in the high-dimensional space of observations, as this can quickly lead to computational intractability and numerical instability. To address this in our present context, we introduce the class of models known as exponential family harmoniums [smolensky_information_1986, welling_exponential_2005], which allows us to unify the mathematics of our latent variable models. We then develop a theory of conjugate priors for harmoniums, which we use to break down complex probabilistic computations with harmoniums (and ultimately HMoGs) into tractable components.
To begin, an exponential family is a statistical model defined by a sufficient statistic $s_X$ and a base measure $\mu_X$, with densities of the form $p(x) \propto \mu_X(x) e^{\theta_X \cdot s_X(x)}$, where $\theta_X$ are the natural parameters, and the normalizer is known as the log-partition function $\psi_X(\theta_X)$. There are two exponential families that are particularly relevant for our purposes. Firstly, the family of multivariate normal distributions is the exponential family with the Lebesgue base measure and sufficient statistic $s_X(x) = (x, x \otimes x)$, where $\otimes$ is the outer product operator. Secondly, the family of categorical distributions over a finite set of indices is the exponential family with the counting base measure and a sufficient statistic given by a so-called "one-hot" vector: all elements of $s_X(k)$ are zero except for a 1 at element $k$, with the zeroth index mapped to the all-zero vector.

An exponential family harmonium is then defined as a kind of product exponential family, which includes various models as special cases [welling_exponential_2005]. Given two exponential families defined by sufficient statistics $s_X$ and $s_Z$ and base measures $\mu_X$ and $\mu_Z$, respectively, a harmonium is the model

$p(x, z) \propto \mu_X(x)\, \mu_Z(z)\, e^{\theta_X \cdot s_X(x) + \theta_Z \cdot s_Z(z) + s_X(x) \cdot \Theta_{XZ} \cdot s_Z(z)},$ (1)

which is an exponential family defined by the base measure $\mu_X \otimes \mu_Z$ and the sufficient statistic $(s_X, s_Z, s_X \otimes s_Z)$, with natural parameters $(\theta_X, \theta_Z, \Theta_{XZ})$ and log-partition function $\psi_{XZ}$.
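To make the categorical family concrete, the following Haskell sketch (a hypothetical illustration of our own, not the paper's implementation) encodes the one-hot sufficient statistic and the resulting exponential family density over indices:

```haskell
-- One-hot sufficient statistic of the categorical family over indices
-- 0..n: index 0 maps to the all-zero vector, and index k > 0 maps to the
-- length-n vector with a single 1 at position k.
oneHot :: Int -> Int -> [Double]
oneHot n k = [if j == k then 1 else 0 | j <- [1 .. n]]

-- Log-partition function: the "1 +" accounts for index 0, whose natural
-- parameter is implicitly zero.
catLogPartition :: [Double] -> Double
catLogPartition theta = log (1 + sum (map exp theta))

-- Exponential family density: p(k) = exp(theta . s(k) - psi(theta)).
catDensity :: [Double] -> Int -> Double
catDensity theta k =
  exp (sum (zipWith (*) theta (oneHot (length theta) k)) - catLogPartition theta)
```

For natural parameters [log 2, log 3], the densities of indices 0, 1, and 2 are 1/6, 2/6, and 3/6, so the family reproduces an arbitrary categorical distribution, with the zeroth probability determined by normalization.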
To return to our dimensionality-reduction and clustering problem, consider the sufficient statistics and base measures of three families: the multivariate normal family over the observations, the multivariate normal family over the low-dimensional features, and the categorical family over the cluster indices. Based on our definitions, we may immediately define a MoG as a harmonium over the features and indices (where we absorb the constant base measure into the proportionality relation). We leave the detailed equations of the natural parameters to Appendix A, but suffice it to say that the normal natural parameters correspond to the means and covariances of the mixture components of the MoG, and the categorical natural parameters correspond to the mixture weights of the MoG.
On the other hand, we may express a linear Gaussian model as a restricted version of the harmonium defined by the two multivariate normal families, in which the interaction term in the exponent ensures that the observable and latent variables only interact through their first-order statistics, rather than the second-order statistics that are also part of their sufficient statistics. We again leave out the details (see Appendix A), but the natural parameters correspond to the mean, noise covariance, and projection parameters of the PCA and FA models.
To continue, a prior distribution is said to be conjugate to the posterior when both have the same exponential family form, and conjugate priors can greatly simplify statistical inference in exponential family models. Conjugate priors usually arise in the context of inferring the parameters of well-known models such as normal or Poisson distributions, but here we wish to use the concept to infer distributions over our features and clusters. Due to the log-linear form of a harmonium, the harmonium posterior is given by

$p(z \mid x) \propto \mu_Z(z)\, e^{(\theta_Z + s_X(x) \cdot \Theta_{XZ}) \cdot s_Z(z)},$ (2)

and is therefore always in the exponential family defined by the latent sufficient statistic and base measure. As such, we can reduce the question of whether a harmonium prior is conjugate to its posterior to whether the prior is also in this latent exponential family.
Lemma 1.
Suppose that $p(x, z)$ is a harmonium. Then the prior satisfies $p(z) \propto \mu_Z(z) e^{\tilde\theta_Z \cdot s_Z(z)}$ for some $\tilde\theta_Z$ if and only if there exists a vector $\rho$ and a constant $\rho_0$ such that

$\psi_X(\theta_X + \Theta_{XZ} \cdot s_Z(z)) = \rho \cdot s_Z(z) + \rho_0$ (3)

and

$\tilde\theta_Z = \theta_Z + \rho.$ (4)
When Eq. 3 is satisfied, we say that the harmonium is conjugated, and we refer to the corresponding vector and constant as the conjugation parameters. In general, Eq. 3 is a strong constraint, and is not satisfied by most models. Nevertheless, both linear Gaussian models and mixture models are indeed conjugated harmoniums with closed-form expressions for their conjugation parameters. We again leave out the details of their evaluation (see Appendix A), but suffice it to say that, for PCA and FA, the computational complexity of evaluating the conjugation parameters scales only linearly with the dimension of the observations.
By applying the theory of conjugation, we can overcome one of our major challenges, namely computing the harmonium observable density. The observable density is given by

$p(x) = \mu_X(x)\, e^{\theta_X \cdot s_X(x) + \psi_Z(\theta_Z + s_X(x) \cdot \Theta_{XZ}) - \psi_{XZ}},$ (5)

which is not in general an exponential family distribution. Typically, the latent log-partition function $\psi_Z$ will be tractable, which reduces the difficulty of evaluating this density to evaluating the harmonium log-partition function $\psi_{XZ}$. Although $\psi_{XZ}$ is typically intractable, for conjugated harmoniums we can also reduce its evaluation to the evaluation of $\psi_Z$.
Corollary 2.
Suppose that $p(x, z)$ is a conjugated harmonium, with conjugation parameters $\rho$ and $\rho_0$. Then

$\psi_{XZ} = \psi_Z(\theta_Z + \rho) + \rho_0.$ (6)
In the context of our dimensionality reduction problem, Eq. 6 also allows us to evaluate the log-partition function of the linear Gaussian model in terms of the log-partition function on the feature space, and thereby avoid computations in the high-dimensional observable space.
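As a concrete illustration of how observable densities reduce to feature-space computations, the following Haskell sketch (a toy of our own, not the paper's code) evaluates the observable log-density of a univariate MoG with a numerically stable log-sum-exp, the one-dimensional analogue of the reduction above:

```haskell
-- Numerically stable log-sum-exp: shift by the maximum before exponentiating.
logSumExp :: [Double] -> Double
logSumExp xs = m + log (sum [exp (x - m) | x <- xs])
  where m = maximum xs

-- Log-density of a univariate normal distribution.
logNormal :: Double -> Double -> Double -> Double
logNormal mu var x = -0.5 * (log (2 * pi * var) + (x - mu) ** 2 / var)

-- Observable log-density of a MoG: log sum_k w_k N(x | mu_k, var_k),
-- with components given as (weight, mean, variance) triples.
mogLogDensity :: [(Double, Double, Double)] -> Double -> Double
mogLogDensity comps x =
  logSumExp [log w + logNormal mu var x | (w, mu, var) <- comps]
```

With a single component of weight one, the mixture log-density reduces exactly to the Gaussian log-density, as expected.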
Our other major challenge is fitting our models to data, and thankfully, the exponential family structure of harmoniums also affords a general description of the EM algorithm. This follows from the fundamental property of exponential families that the gradient of the log-partition function is equal to the expected value of the sufficient statistics [wainwright_graphical_2008]. Suppose we wish to fit a harmonium to a sample of observations, and consider the gradients of the observable and latent log-partition functions. An iteration of EM may then be formulated in terms of these gradients: the expectation step uses the latent log-partition gradient to compute the expected sufficient statistics under the current posteriors, and the maximization step inverts the gradient mapping to recover the natural parameters that match those statistics. The tractability of training a harmonium thus reduces to the tractability of the latent log-partition gradient, and of the inverse of the observable gradient mapping. In the case of linear Gaussian models and MoGs, both functions are available in closed form (see Appendix A), and there is thus a closed-form EM algorithm for training them.
2.3 Hierarchical mixtures of Gaussians can be tractably evaluated and fit to data
We exploit the probabilistic formulation of the linear Gaussian model and MoG to define an HMoG as a model over observations, features, and clusters. Intuitively, we define an HMoG by taking a PCA or FA model, and swapping out the standard normal prior for a MoG. This definition is more than a notational trick, as it ensures that fitting the two-stage model (i.e. PCA/FA and MoG separately) maximizes a lower bound on the marginal log-likelihood of the HMoG (see Appendix A).
Putting the expressions for the linear Gaussian model and the MoG together allows us to write the HMoG density as

$p(x, y, z) \propto \mu(x, y, z)\, e^{\theta_X \cdot s_X(x) + x \cdot \Theta_{XY} \cdot y + \theta_Y \cdot s_Y(y) + \theta_Z \cdot s_Z(z) + s_Y(y) \cdot \Theta_{YZ} \cdot s_Z(z)},$ (7)

where $\mu$ is the HMoG base measure, and the normalizer is the HMoG log-partition function. This construction ensures that the observations and cluster indices are conditionally independent given the features, which allows us to depict an HMoG as a hierarchical graphical model (Fig. 1).
By reorganizing terms, the density of an HMoG may be expressed as a form of harmonium density, so that we may apply the theory of conjugation to HMoGs. In particular, by applying Corollary 2, we may express the observable density (Eq. 5) of an HMoG as

(8)

where one pair of conjugation parameters belongs to the linear Gaussian model, and the remaining pairs belong to the mixture model over the features and cluster indices (see Appendix A for a detailed derivation and the conjugation parameters).
With regards to fitting, computing the expectation step for HMoGs reduces to computing the log-partition gradient of a mixture model, which is available in closed form. The maximization step, on the other hand, does not afford a closed-form expression. Nevertheless, we can solve the maximization step with gradient ascent as long as we can evaluate the gradient of the HMoG log-partition function, which is indeed possible through judicious use of Eq. 4 (see Appendix A). Our strategy for fitting HMoGs is thus to evaluate the expectation step in closed form, and then use the Adam gradient optimizer to approximately solve the maximization step [kingma_adam_2014].
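The approximate maximization step can be pictured with a schematic Haskell sketch: a plain fixed-step gradient ascent loop standing in for the Adam optimizer used in the paper (the objective and step size below are purely illustrative assumptions):

```haskell
-- Fixed-step gradient ascent on a one-parameter objective, given its
-- gradient function. Each EM iteration would run a loop like this to
-- approximately solve the maximization step.
gradientAscent :: Double -> Int -> (Double -> Double) -> Double -> Double
gradientAscent rate steps grad = go steps
  where
    go 0 theta = theta
    go k theta = go (k - 1) (theta + rate * grad theta)
```

For the concave objective f(θ) = -(θ - 3)², with gradient -2(θ - 3), two hundred steps from θ = 0 at rate 0.1 converge to the maximizer θ = 3.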
Finally, once we have trained our HMoG model we may project and classify new observations. The probabilistic structure of HMoGs suggests that the projection of a data point is given by the posterior over the features, which mathematically involves marginalizing out the cluster index. Similarly, the cluster membership of a data point is given by the posterior over the indices, which involves marginalizing out the features. In practice these probabilistic definitions of projection and classification perform very similarly to their two-stage versions, and for consistency we use the two-stage projection and classification algorithms with two-stage models, and the probabilistic versions with unified models. For a more detailed discussion see Appendix A.
Notes on training parameters and implementation details
When fitting two-stage algorithms, we always ran 100 iterations of EM each for the dimensionality-reduction models and the MoG, as training times were negligible. We initialized the PCA/FA parameters by setting the mean equal to the empirical mean, setting the isotropic/diagonal variance to the corresponding empirical value, and initializing the projection matrix randomly with values between -0.01 and 0.01. After fitting PCA/FA, we initialized the MoG parameters by fitting a multivariate normal to the projected data, and setting the mean and covariance of each cluster to a sample point from the fit normal and the covariance of the fit normal, respectively. We set the weight distribution to be uniform.
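The empirical statistics used in this initialization can be sketched as follows (a minimal hypothetical helper of our own, shown for a one-dimensional sample):

```haskell
-- Empirical mean of a sample.
empiricalMean :: [Double] -> Double
empiricalMean xs = sum xs / fromIntegral (length xs)

-- Empirical (biased) variance of a sample, of the kind used to
-- initialize isotropic or diagonal noise parameters.
empiricalVariance :: [Double] -> Double
empiricalVariance xs = sum [(x - m) ** 2 | x <- xs] / fromIntegral (length xs)
  where m = empiricalMean xs
```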
Fitting the HMoG PCA/FA algorithms on the Iris and synthetic datasets was possible using the two-stage initialization schemes and a range of learning parameters. For the RNA-Seq data, we initialized an HMoG by first fitting its parameters with a two-stage algorithm, and then running 800 iterations of unified EM. We used relatively conservative learning parameters for the Adam optimizer, with 2000 steps per EM iteration. To avoid local maxima during optimization, we always ran 10 simulations in parallel and selected the model that maximized training-data performance.
All algorithms were implemented in Haskell and targeted the CPU, and simulations were run on an AMD Ryzen 9 5950X processor. By distributing parallel simulations over cores, a single fit of an HMoG FA model on the RNA-Seq data took about 10 minutes.
3 Applications
We next show how HMoG theory is useful for dimensionality reduction and clustering in three application scenarios: (i) a simple validation of the HMoG theory on the Iris flower dataset [noauthor_iris_2022]; (ii) a demonstration on synthetic data of the limitations of standard two-stage algorithms, and how HMoGs can overcome them; and (iii) an application to RNA-Seq data [duo_systematic_2020, lause_analytic_2021] to show how HMoGs enhance performance over two-stage approaches on real data.
On each dataset, we compare four methods for dimensionality reduction and clustering: we fit (i) probabilistic PCA or (ii) FA to the given dataset with EM, followed by separately fitting a MoG to the projected data with EM; or we combine either (iii) probabilistic PCA or (iv) FA with a MoG into an HMoG, and fit the model with the unified EM algorithm. We refer to these methods as two-stage PCA, two-stage FA, HMoG PCA, and HMoG FA, respectively (Fig. 1).
3.1 HMoGs afford rigorous comparison of high-dimensional clustering methods
As we have shown, two-stage algorithms maximize a lower bound on the likelihood of an equivalent HMoG, even when they optimize dimensionality reduction and clustering separately. We demonstrate this by comparing the log-likelihood trajectories of all four methods given the Iris flower dataset [noauthor_iris_2022] (Fig. 2A). We observed that the log-likelihood increased at every iteration for all four training algorithms. For this dataset, we saw little performance difference between the four fully trained models. Qualitatively, the projections and clusters learned by the two-stage PCA method were indeed sufficient for capturing and clustering features of the data (Fig. 2B), and the HMoG FA model did not perform noticeably better (Fig. 2C). In conclusion, HMoG theory supports quantitative model comparison through evaluation of the HMoG log-likelihood (Eq. 8).
3.2 HMoG EM overcomes limitations of twostage algorithms
We next compared our methods on a synthetic dataset designed to reveal the limitations of the two-stage approaches (Fig. 3). Data was generated from a ground-truth HMoG FA model with two clusters, a one-dimensional feature space, and a two-dimensional observable space. We designed the ground-truth HMoG so that the direction along which cluster membership changes in observation space is perpendicular to the direction of maximum variance (Fig. 3A-C).
Unsurprisingly, two-stage PCA learned a projection that only followed the direction of maximum variance, and failed to effectively model either the observable density (Fig. 3A) or the feature density (Fig. 3D). Although we do not present the results here, HMoG PCA failed to effectively model the data in the same way as two-stage PCA, indicating that the limitation is due to the simpler structure of the PCA model with its single isotropic noise variance (Fig. 1), rather than the training algorithm used.
On the other hand, two-stage FA performed better than PCA, and learned a projection that captures the feature direction along which cluster membership changes (Fig. 3B). This was reflected in the feature distribution learned by the two-stage MoG, which captured the clusters in the feature space of the ground-truth HMoG (Fig. 3E). Nevertheless, two-stage FA failed to capture the fine structure of the observable densities, and in repeated simulations we found that it often learned suboptimal clusterings. In contrast, HMoG FA modeled the observable density nearly perfectly (Fig. 3C), and reliably separated the two clusters in feature space (Fig. 3F). Overall, we found that FA-based models of dimensionality reduction have important advantages over PCA-based models, which are only fully exploited by training them with the unified HMoG EM algorithm.
3.3 HMoGs exceed the predictive performance of two-stage models on RNA-Seq data
Finally, we applied our methods to a single-cell RNA-Seq dataset from peripheral blood mononuclear cells (PBMCs), a subset of the human immune system [duo_systematic_2020]. The data was preprocessed by computing analytic Pearson residuals [lause_analytic_2021] using scanpy 1.9 default settings, and the twenty genes with the highest residual variance were selected. The dataset was grouped into eight immune cell subtypes identified via previously known genetic markers [zheng_massively_2017], allowing us to compare model classification performance to ground-truth clusters.
We evaluated the predictive performance of all four methods on the RNA-Seq data through 5-fold cross-validation of the log-likelihood, as we varied the number of dimensions and clusters in the feature space (Fig. 4A-C). In contrast to the Iris flower data, we found that HMoG FA significantly outperformed all the alternative approaches. For example, for four clusters and four features, HMoG FA achieved a higher predictive log-likelihood than both two-stage PCA and two-stage FA. Moreover, HMoG FA could match the predictive performance of the two-stage methods with merely three clusters and three features. We again found that HMoG PCA performance did not significantly exceed that of two-stage PCA, and so do not present the results here.
To better understand the feature space learned by each model, we refit the models on the full dataset. For each method, we set the number of clusters and feature dimensions to four, as we found that predictive performance saturates around four clusters, and we chose four features for simplicity. We then focused on a representative plane in the four-dimensional feature space. Overall we found that each model successfully separated out three of the cell subtypes, and used a fourth cluster to represent "everything else" (Fig. 4D-F).
Since the feature spaces of the methods were qualitatively similar, we sought to better understand why the log-likelihood of HMoG FA was so much higher than that of the other techniques. Due to its hierarchical structure, we can marginalize out the features in the HMoG density, and then factor the joint distribution over observations and cluster indices. Given labelled data, performance may then be quantified as the combination of classification performance (i.e. how well the labels under the model match the true labels) and how well the model captures the data distribution. Quantitatively, we found no significant difference in classification performance between our four methods when using 4 clusters and 4 features: when we associated each cluster with its most strongly correlated label on the training data, all models achieved comparable held-out classification performance. When we allowed one cluster to represent multiple labels (to permit one cluster to model the "everything else" group), each technique again achieved comparable held-out classification performance. Overall, this suggests that each model was able to saturate classification performance given 4 clusters, and that the gains achieved by HMoG FA were entirely due to how well the clusters it learned in the observable space represented the relevant data.

4 Discussion
In this paper we have presented a unified model of dimensionality reduction and clustering that we call a hierarchical mixture of Gaussians. We showed how the theory of HMoGs generalizes existing two-stage algorithms for dimensionality reduction and clustering, allowing us to derive an exact expression for the log-likelihood of these methods, and to enhance their performance with an EM algorithm that trains both stages in parallel. We applied our theory of HMoGs to three datasets, and showed how HMoGs can help existing methods exceed their limitations.
A notable finding in our applications was the performance advantage achieved by FA-based models over PCA-based models. It is well known that PCA-based dimensionality reduction can fail to capture the dimensions relevant for clustering [chang_using_1983], and it has been suggested that the rescaling properties of FA might help address this [mclachlan_finite_2019]. Our applications confirmed the potential advantages of FA over PCA for dimensionality reduction, both in standard two-stage models and HMoGs. Given the relative dominance of PCA over FA in dimensionality-reduction applications [see duo_systematic_2020], our results call for additional investigation of the advantages offered by FA-based methods.
An advantage of our framework is its theoretical simplicity and flexibility, as ultimately we fit HMoGs by maximizing the likelihood using EM, where the maximization step is implemented with a gradient ascent procedure. As such, the basic fitting algorithm we provided could be further enhanced with regularization or sparsity techniques [witten_framework_2010, yi_regularized_2015] or specialized EM algorithms for large datasets [chen_stochastic_2018].
Finally, our exponential-family theory of HMoGs opens two directions for further research. On one hand, the theory of conjugation we developed provides a toolbox for designing tractable hierarchical models out of arbitrary exponential families. As such, it should prove useful to researchers who work with data that call for more specialized distributions, such as Poisson distributions for count data. On the other hand, the log-linear structure of our exponential family models makes them highly amenable to GPU-based computation, and should allow optimized implementations of our algorithms to scale up to the massive sizes of modern datasets.
Appendix A
Towards developing and explaining the theory of HMoGs, this appendix is organized into the following sections:

- A brief introduction to exponential families, largely in order to fix notation.
- An overview of the theory of exponential family graphical models known as harmoniums.
- Derivation of the theory of conjugated harmoniums.
- Formulations of mixture models and linear Gaussian models as conjugated harmoniums.
- Combining these various components to formalize an HMoG.
The culmination of these theoretical developments is a series of algorithms for computing the log-likelihood of HMoGs and for running expectation-maximization on them.
A.1 Exponential families
For a thorough development of exponential family theory see [amari_methods_2007] and [wainwright_graphical_2008].
Consider a random variable on a sample space with an unknown distribution, and suppose we draw an independent and identically distributed sample from it. We may model the unknown distribution based on the sample by defining a statistic on the sample space, and looking for a probability distribution under which the expected value of the statistic matches its empirical average over the sample. On its own this is an underconstrained problem, but if we further assume that the distribution must have maximum entropy, then we may define a family (or manifold) of distributions that is uniquely described by the possible values of the expected statistic.

An exponential family is defined by a sufficient statistic, as well as a base measure which helps define integrals and expectations within the family. An exponential family is parameterized by a set of natural parameters, such that each element of the family may be identified with particular parameter values, and the density of each element is log-linear in the natural parameters, up to the normalizer known as the log-partition function. Because each element of the family is also uniquely identified by the expected value of its sufficient statistic, these means parameterize the family as well, and are known as the mean parameters. Finally, a sufficient statistic is minimal when its component functions are non-constant and linearly independent. If the sufficient statistic of a given family is minimal, then the natural and mean parameter spaces are isomorphic. Moreover, the transition function from natural to mean parameters is given by the gradient of the log-partition function, and its inverse by the gradient of the negative entropy; these are also referred to as the forward and backward mappings, respectively.
A.2 Exponential Family Harmoniums
An exponential family harmonium is a kind of product exponential family which includes restricted Boltzmann machines as a special case [smolensky_information_1986, welling_exponential_2005]. We may construct an exponential family harmonium out of two exponential families by defining its base measure as the product of their base measures, and by defining its sufficient statistic as the concatenation of their sufficient statistics together with the outer product between them. More concretely, the harmonium family is the manifold that contains all the distributions whose densities are log-linear in these statistics, with three blocks of natural parameters: one for each component family, and one for the interaction between them.

The linear structure of harmoniums affords simple expressions for their training algorithms. On one hand, given a sample of observations and a harmonium with natural parameters $(\theta_X, \theta_Z, \Theta_{XZ})$, an iteration of the expectation-maximization (EM) algorithm may be formulated as:
Expectation Step: compute the unobserved means $\hat{z}_i = \nabla\psi_Z(\theta_Z + s_X(x_i) \cdot \Theta_{XZ})$ for every observation $x_i$.

Maximization Step: apply the backward mapping of the harmonium family to the average of the joint sufficient statistics $(s_X(x_i), \hat{z}_i, s_X(x_i) \otimes \hat{z}_i)$.
On the other hand, the stochastic log-likelihood gradients of the parameters of a harmonium are given by the differences between the average sufficient statistics of the sample and their expected values under the model. It is often the case that we have a closed-form expression for the forward mapping, but not for the backward mapping, and therefore cannot compute the maximization step in closed form. Although we could simply train a harmonium with gradient ascent on the log-likelihood, for certain models this can lead to numerical instability, and an effective strategy is rather to evaluate the expectation step, and then approximate the maximization step with gradient descent.
A.3 Harmoniums and Conjugate Priors
The conditional distributions of a harmonium have a simple linear structure, and are themselves always exponential family distributions: the posterior over the latent variables is in the latent family for any observation, and likewise for the conditional over the observations. In general, however, the marginal distributions are not members of the respective observable and latent families. This is both a blessing and a curse: on one hand, the marginal observable distribution can model much more complex datasets than the simpler elements of the observable family. On the other hand, because the prior may not be computationally tractable, various computations with harmoniums, such as sampling and inference, may also prove intractable.

Ideally, the prior would remain in the latent exponential family to facilitate computability, while the modelled observable distribution would be more complex than the elements of the observable family, and some classes of harmoniums do indeed have this structure. In general, a prior is said to be conjugate to a posterior if both have the same form for any observation. In the context of harmoniums, since the posterior is always in the latent exponential family, the prior is conjugate to the posterior iff the prior is also in the latent exponential family. We refer to harmoniums with conjugate priors as conjugated harmoniums.
Lemma 1.
Suppose that a harmonium defined by two exponential families has natural parameters $(\theta_X, \theta_Z, \Theta_{XZ})$. Then the harmonium is conjugated if and only if there exists a vector $\rho$ and a constant $\rho_0$ such that

$\psi_X(\theta_X + \Theta_{XZ} \cdot s_Z(z)) = \rho \cdot s_Z(z) + \rho_0$ (9)

for any $z$, where the conjugated prior then has natural parameters

$\tilde\theta_Z = \theta_Z + \rho.$ (10)
Proof.
In general, the density of the harmonium prior distribution is given by

$p(z) = \mu_Z(z)\, e^{\theta_Z \cdot s_Z(z) + \psi_X(\theta_X + \Theta_{XZ} \cdot s_Z(z)) - \psi_{XZ}}.$ (11)

On one hand, if we assume that the harmonium is conjugated, then it also holds that $p(z) \propto \mu_Z(z) e^{\tilde\theta_Z \cdot s_Z(z)}$ for some natural parameters $\tilde\theta_Z$, and therefore

$\psi_X(\theta_X + \Theta_{XZ} \cdot s_Z(z)) = \rho \cdot s_Z(z) + \rho_0$

for $\rho = \tilde\theta_Z - \theta_Z$ and some constant $\rho_0$.

On the other hand, if we first assume that Eq. 9 holds, then $p(z)$ is given by

$p(z) = \mu_Z(z)\, e^{(\theta_Z + \rho) \cdot s_Z(z) + \rho_0 - \psi_{XZ}},$ (12)

which implies that $p(z)$ is in the latent exponential family with parameters $\theta_Z + \rho$. ∎
We refer to the vector and constant of the lemma as the conjugation parameters. The value of this lemma is that it allows us to reduce various computations to evaluating the conjugation parameters, and it shows how these computations are fundamentally the same even for apparently unrelated models. To wit, the following corollary describes how to compute the log-partition function of a conjugated harmonium.
Corollary 2.
Suppose that a harmonium with natural parameters $(\theta_X, \theta_Z, \Theta_{XZ})$ is conjugated, with conjugation parameters $\rho$ and $\rho_0$. Then the harmonium log-partition function satisfies

$\psi_{XZ} = \psi_Z(\theta_Z + \rho) + \rho_0.$

Proof.

Integrating both sides of Eq. 12 over the latent space shows that normalization of the prior requires exactly this identity. ∎

Typically, evaluating the log-partition function of the latent family will be tractable, which means that various computations with a conjugated harmonium, including evaluating its observable density, are also tractable.
A.4 Mixture Models and Linear Gaussian Models
Constructing a hierarchical mixture of Gaussians ultimately requires putting together a mixture model and a linear Gaussian model in the "right" way. To this end we use the theory of conjugated harmoniums, by first defining the exponential families of categorical and normal distributions, and then deriving the conjugation parameters of mixture models and linear Gaussian models.
The categorical exponential family over $n + 1$ states contains all the distributions over the integer values $0, \ldots, n$. Its base measure is the counting measure, and its sufficient statistic at index $k$ is the length-$n$ one-hot vector: all zeroes when $k = 0$, and all zeroes except for a 1 at the $k$th element when $k > 0$. Finally, the log-partition function is $\psi(\theta) = \log(1 + \sum_{j=1}^{n} e^{\theta_j})$, and the forward mapping is given by $\mu_k = e^{\theta_k} / (1 + \sum_{j=1}^{n} e^{\theta_j})$.
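The categorical forward mapping can be sketched in a few lines of Haskell (an illustrative helper of our own, not the paper's library code): it maps natural parameters to the mean parameters, i.e. the probabilities of indices 1 through n, with the probability of index 0 recovered by normalization:

```haskell
-- Forward mapping of the categorical family: mu_k = exp theta_k / z,
-- where z = 1 + sum_j exp theta_j accounts for the implicit zero natural
-- parameter of index 0.
forwardCategorical :: [Double] -> [Double]
forwardCategorical theta = [exp t / z | t <- theta]
  where z = 1 + sum (map exp theta)
```

For natural parameters [log 2, log 3] the mean parameters are [1/3, 1/2], and the implied probability of index 0 is 1 - 1/3 - 1/2 = 1/6.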
Now, a mixture distribution is simply a harmonium in which the latent family is the family of categorical distributions. For such a mixture model, the observable distribution can typically model distributions (e.g. multimodal distributions) outside of the component family, yet the prior is indeed categorical. We show how to compute the conjugation parameters of a mixture model in Algorithm 1, and how to compute the forward mapping in Algorithm 2.