1 Introduction
Many important problems in modern machine learning and artificial intelligence can be gracefully captured by probabilistic graphical models with latent variables. However, learning such general latent-variable probabilistic graphical models has remained very difficult, largely due to the intractability introduced by the latent variables and the possibly complicated loopy structures of the graphical models.
Previously, most learning algorithms for latent-variable graphical models resorted to local search heuristics such as the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), which turns the parameter learning problem into a maximization problem over a nonconvex objective function. However, such methods based on iterative local search can get stuck in bad local optima and can be very slow to converge. To overcome these shortcomings, researchers have recently proposed and studied a new family of nonparametric learning algorithms, called spectral learning, based on the idea of the method of moments (Hsu et al., 2009; Anandkumar et al., 2014). These spectral learning algorithms seek alternative parameterizations of latent-variable graphical models that are based purely on the observable variables and can be efficiently recovered from their low-order observable moments through tensor decompositions. These algorithms therefore enjoy the advantages of being local-minimum-free and provably consistent, and are often orders of magnitude faster than search-based methods such as the EM algorithm.
Although spectral learning algorithms have recently been applied successfully to more and more types of latent-variable graphical models, several severe limitations remain. First, most of the spectral learning algorithms proposed so far are restricted to one or a few specific types of latent-variable graphical models (the majority of them tree-structured), and are hard to generalize to other model structures (possibly beyond trees). Second, most current spectral learning algorithms can only handle discrete-valued random variables with moderate cardinalities due to their parameterization assumptions, and thus cannot easily be applied to settings where the variables are continuous-valued or have large cardinalities, as is often the case in important real-world applications. Third, current spectral algorithms are generally idiosyncratic to the specific graphical model structures they are designed to learn, and therefore do not provide a flexible learning framework or template into which users can easily incorporate different prior knowledge and probabilistic assumptions for different learning problems.
To overcome these limitations, in this paper we develop a new algorithmic framework for learning general latent-variable probabilistic graphical models based on the ideas of predictive belief propagation and reproducing-kernel Hilbert space embeddings of distributions. Our framework is built on the latent junction tree representation of latent-variable probabilistic graphical models, and reduces the hard parameter learning problem to a pipeline of supervised learning problems called two-stage regression, into which we can flexibly insert any supervised learning algorithm as we see fit. When dealing with continuous variables, we embed their distributions as elements of reproducing-kernel Hilbert spaces, and then use the kernel trick to propagate and integrate these embeddings over the latent junction tree via tensor operations in a nonparametric fashion, seamlessly extending our algorithm from the discrete domain to the continuous domain under the same general framework. As a result, our learning framework is general enough to apply to all structures of latent-variable graphical models (both trees and loopy graphs), can handle both discrete and continuous variables, and allows users to easily incorporate prior information and probabilistic assumptions during learning.
2 Predictive State Representations and Predictive Belief Propagation
In its original formulation, a predictive state representation (PSR) (Singh et al., 2004) is a compact way to model the state of a controlled dynamical system using a vector of predictions over a set of observable future experiments (or tests) that can be performed on the system. Under this representation, the state of the dynamical system is compactly summarized as a finite-dimensional vector of predictions over a finite set of tests, called the "core set of tests", and the success probabilities of all other future tests are determined by linear functions of this state vector.
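To make the PSR picture concrete, here is a hypothetical toy sketch of conditioning a predictive state on a new observation. The state is a vector of predicted success probabilities for a core set of tests, and each action-observation pair has an associated linear update operator; all matrices below are made-up illustrative values, not learned from any system.

```python
import numpy as np

def psr_update(state, M_ao, m_inf):
    """Condition the predictive state on one action-observation pair.

    state : current vector of predictions over the core tests
    M_ao  : linear update operator for the observed pair (toy values here)
    m_inf : normalization vector (predictions of the trivial "null" test)
    """
    unnormalized = M_ao @ state
    # renormalize so the null test again has probability one
    return unnormalized / (m_inf @ unnormalized)

# toy PSR with two core tests
state = np.array([0.6, 0.4])
M_ao = np.array([[0.7, 0.1],
                 [0.2, 0.5]])
m_inf = np.ones(2)
new_state = psr_update(state, M_ao, m_inf)
print(new_state)
```

The key point is that the entire update is linear up to normalization, which is what makes the predictions of all other future tests linear functions of the state vector.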
A key insight is that the idea of predictive state representations can actually be carried over to characterize the information flows of message passing and belief propagation in probabilistic graphical models. In essence, the state of a system is a time bottleneck that compactly summarizes everything we need to know about the past in order to predict the future, as illustrated in Figure 1(a) below.
When we perform belief propagation during inference, at each separator set the classical approach is to propagate a message containing the partial sum-product results from one part of the junction tree to the other across the separator set, in order to compute the posterior distribution of downstream variables using the evidence from the upstream variables. Therefore, when the belief propagation agent visits each separator set, the message it sends can essentially be viewed as a representation of the current "state" of the junction tree inference system at that particular separator set. This important analogy is illustrated in Figure 1(b) below.
Once this analogy of state is built for junction trees, we can generalize the idea of PSRs to junction trees. Instead of using the partial sum-product results over the "past" part (i.e., the outside tree) of the junction tree as the state, and as the message to send across the current separator set (as is done by the classical Shafer-Shenoy algorithm (Shenoy & Shafer, 1990) and the Hugin algorithm (Lauritzen & Spiegelhalter, 1988; Anderson et al., 1989)), we can use the predicted success probabilities of a core set of future tests consisting of observations in the "future" part (i.e., the inside tree¹) of the junction tree to represent the state of the junction tree at the current separator set, and send this prediction vector as the message into the inside tree during belief propagation (see Figure 1(b)). More concretely, at a separator set in a junction tree, among all the observable variables in the inside tree there always exists a minimal subset of observable variables whose posterior joint distribution, given the evidence passed in from the outside tree, completely determines the posterior joint distribution of all the observable variables in the inside tree. This minimal subset of observable variables in the inside tree is designated as the core set of variables for this particular separator set, and all possible joint value realizations of these variables can be defined as the core set of tests for a PSR over the latent junction tree. The predictive state at a separator set can therefore be defined as the posterior joint probability distribution of its core set of variables (or, equivalently, a sufficient statistic for this joint distribution).

¹The inside tree of a separator set is defined to be the subtree rooted at that separator set (inclusive) in the junction tree; the outside tree is defined to be the rest of the junction tree, excluding the inside tree.
Now, during inference, instead of passing partial sum-product results around as messages, we can pass the predictive state vectors around as a new form of predictive messages to compute the desired posterior. And during learning, we can use two-stage instrumental regression to learn the operators at each separator set that process incoming predictive messages and emit outgoing predictive messages, purely from observed quantities in the training data. We name this novel message passing framework predictive belief propagation.
3 Two-Stage Regression
After determining the form of the messages passed around during inference, we now need to figure out how different messages relate to one another and how to learn these inter-message relationships from training data. Without loss of generality, consider a non-leaf separator set in a latent junction tree that is connected with several "children" separator sets below it (see Figure 2), and recall that each separator set has an associated core set of observable variables. According to the definition of a PSR on latent junction trees, knowing the posterior joint probability distribution over the core set completely determines the posterior joint distribution over all the observable variables in the inside tree below the separator set. Let the predictive state at a separator set be a sufficient-statistic feature vector for this posterior joint distribution, given the evidence observed in the outside tree. Then the outer product of the children's predictive states is completely determined by a linear function of the parent's predictive state, because the children's core sets are all contained in the inside tree below the parent separator set. That is to say, there exists a linear operator² such that:

for any outside-tree evidence.

²Throughout, mode-specific tensor multiplication denotes multiplying a tensor by a matrix along a given mode.
Therefore, the goal of our new learning algorithm under this PSR representation is mainly to learn these linear operators at each of the non-leaf separator sets in the latent junction tree from our training data. A first naive attempt would be to perform a linear regression directly on empirical samples of the relevant feature vectors, since the predictive states themselves cannot be directly observed in the training data. However, the estimate of the linear operator obtained from such a direct regression may be biased, because the children's core variables can overlap with the parent's core variables, so the noise terms on the input and output features can be correlated with each other. A powerful method for overcoming correlated noise in modern statistics and econometrics is instrumental variable regression (Stock & Watson, 2011; Hefny et al., 2015). In our setting, a valid instrumental variable must be correlated with the input features but uncorrelated with the noise on the output features. An ideal choice satisfying these criteria is a set of features of observable variables contained in the outside tree, because these variables are strongly correlated with the inside-tree variables but do not overlap with them. Using such outside-tree features as our instrumental variable, we can efficiently obtain an unbiased estimate of the linear operator
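A minimal numerical sketch of this instrumental-variable fix may help. Everything below is a toy stand-in, not the paper's notation: z plays the role of the outside-tree instrument features, x the input features, and y the output features; the noise on x and y is deliberately shared, so a direct regression of y on x would be biased, while the two-stage estimate is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

z = rng.normal(size=(n, 3))            # instrument: outside-tree features
A = rng.normal(size=(3, 2))
W_true = rng.normal(size=(2, 2))       # the linear operator we want to learn

clean = z @ A                          # noise-free input signal
e = rng.normal(size=(n, 2))            # shared noise term
x = clean + e                          # observed input features
y = clean @ W_true + e                 # observed output features (same noise)

def ols(a, b):
    """Least-squares solve for W in b ≈ a @ W."""
    return np.linalg.lstsq(a, b, rcond=None)[0]

x_hat = z @ ols(z, x)                  # stage 1: predict x from the instrument
y_hat = z @ ols(z, y)                  # stage 1: predict y from the instrument
W_est = ols(x_hat, y_hat)              # stage 2: recover the linear operator

print(np.abs(W_est - W_true).max())    # small estimation error
```

Because the instrument z is uncorrelated with the shared noise e, regressing through the stage-1 predictions washes the correlated noise out before the final linear operator is estimated.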
in three straightforward steps: (1) regress the input features on the instrument; (2) regress the output features on the instrument; (3) run a final linear regression from the predictions obtained in step (1) to the predictions obtained in step (2) to recover an unbiased estimate of the linear operator. These three supervised learning steps constitute the two-stage regression procedure in our new learning algorithm.

4 Main Algorithm
In this section, we present our main algorithm³ for learning general latent-variable graphical models in two consecutive parts: the learning algorithm followed by the inference algorithm. We describe the basic version of the algorithm for the case where all variables are discrete-valued; the extension to models with continuous-valued variables is given in Section 5. We give the full proof of the consistency of our algorithm in the appendix.

³Notation: for a set, |·| denotes its number of elements; for a discrete random variable, the number of its possible values; for a vector, its length. We also write the index of a particular value among all possible realizations of a discrete variable, and the set of all separator sets connected to a clique node in a junction tree.

At a high level, the desired input-output behavior of our algorithm can be formulated as:
Input: the topology of the latent-variable probabilistic graphical model, a training set of samples, a set of observed evidence (together with the set of indices of the observable variables that are observed as evidence), and the index of the query node.
Output: the estimated posterior probability distribution of the query node conditioned on the observed values of the evidence nodes.

4.1 The Learning Algorithm:
Model Construction: Convert the latent-variable graphical model into an appropriate corresponding latent-variable junction tree such that each observable variable can be associated with one leaf clique. (See Figures 3(a) and 3(b) for an example.)
Model Specification: For each separator set in the junction tree, among all the observable variables associated with its inside tree, determine a minimal subset of observable variables⁴ whose posterior joint distribution, given the evidence passed in from the outside tree, completely determines the posterior joint distribution of all observable variables in the inside tree. This is the core set of variables for the separator set. Then, among all the observable variables associated with the outside tree, select a subset of variables to serve as the instrument. Finally, pick feature functions for the inside tree and for the outside tree, where the inside-tree features must be a sufficient statistic for the distribution of the core set. (See Figure 4 for an example.)

⁴For each separator set directly above a leaf clique node, we require the core set to be the set of observable variables associated with that leaf clique.
Stage 1A Regression (S1A): At each non-leaf separator set in the latent junction tree, learn a (possibly nonlinear) regression model to estimate the expected inside-tree features given the outside-tree instrument features, using all training samples as training data.
Stage 1B Regression (S1B): At each non-leaf separator set in the latent junction tree that is connected with several "children" separator sets below it (see Figure 3), learn a (possibly nonlinear) regression model to estimate the expected outer product of the children's inside-tree features given the outside-tree instrument features, using all training samples as training data.
Stage 2 Regression (S2): At each non-leaf separator set in the latent junction tree, use the feature expectations estimated in S1A and S1B to train a linear regression model that predicts the S1B estimates from the S1A estimates; the weights of this model form the linear operator associated with the separator set. The training data for this model are the estimates obtained from the S1A and S1B regressions across all the training samples.
Root Covariance Estimation: At the root clique node of the latent junction tree, estimate the expectation of the outer product of the inside-tree feature vectors of all adjacent separator sets connected to the root by averaging across all the training samples:
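The root covariance step is simply an empirical average of outer products. A toy sketch (feature dimensions and data are made up, and only two adjacent separator sets are shown):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
phi_a = rng.normal(size=(n, 3))   # inside features of one adjacent separator set
phi_b = rng.normal(size=(n, 4))   # inside features of another adjacent separator set

# empirical mean of the outer products phi_a[i] (x) phi_b[i] across samples
root_cov = np.einsum('ia,ib->ab', phi_a, phi_b) / n
print(root_cov.shape)
```

With more than two adjacent separator sets, the same einsum pattern extends to a higher-order tensor (e.g. `'ia,ib,ic->abc'`).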
4.2 The Inference Algorithm:
Design Tensor Construction: For each leaf separator set of the latent junction tree, construct an inside-feature design tensor with one mode per variable in the leaf's core set plus one feature mode. Each fiber of the tensor along the feature mode is the realization vector of the inside feature function at the corresponding joint values of the core variables, i.e.:
Initial Leaf Message Generation: At each leaf clique node of the latent junction tree, consider the set of observable variables contained in that clique and the separator set directly above it. We define a function of each observable variable that evaluates to an all-ones vector if the variable is not observed in the evidence, and to a one-hot value-indicator vector if the variable is observed at a particular value in the evidence. That is to say:
Then the upward message that we send from to its parent clique node can be calculated as:
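The evidence indicator just described is straightforward to write down; a minimal sketch (the function name `evidence_vector` is ours, not the paper's):

```python
import numpy as np

def evidence_vector(k, observed_index=None):
    """Indicator vector for a discrete variable with k possible values.

    Returns an all-ones vector when the variable is unobserved, and a
    one-hot indicator when it is observed at position `observed_index`.
    """
    if observed_index is None:        # variable not in the evidence
        return np.ones(k)
    v = np.zeros(k)                   # variable observed at a value
    v[observed_index] = 1.0
    return v

print(evidence_vector(4))        # [1. 1. 1. 1.]
print(evidence_vector(4, 2))     # [0. 0. 1. 0.]
```

Multiplying by the all-ones vector leaves a probability table's marginalization unchanged, which is exactly the behavior wanted for unobserved variables.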
Message Passing:
From leaf to root (the upward phase):
For each non-root parent clique node that is separated by a separator set from its parent node, once it has received the messages from all of its children nodes (each separated from it by its own separator set, respectively), compute the upward message:
and then send this message to its parent node .
At the Root:
At the root clique node, which is surrounded by all of its children nodes (each separated from it by its own separator set), for each child, once the root has received the upward messages from all of its other children cliques except that child, compute the downward message:
and then send this message to that child node.
From root to leaf (the downward phase):
For each non-root parent clique node that is separated from its parent node by a separator set and from its children nodes by their respective separator sets, once it has received the downward message from its parent node, compute, for each child, the downward message:
and then send this message to that child node.
Computing Query Result: For the query node, locate the leaf clique node associated with it, together with its parent node and the separator set between them. We first use the design tensor and its Moore-Penrose pseudoinverse to transform the downward incoming message and the upward outgoing message, respectively, and then compute the Hadamard product of these transformed messages to obtain an estimate of the unnormalized conditional probability of the leaf clique's variables given all the evidence:
We can then marginalize out the remaining variables and renormalize to obtain the final query result: the estimate of the conditional probability distribution of the query variable given all the evidence.

[Additional Note]: Another important type of query that is commonly encountered in practice, and that we can also easily compute here, is the joint probability of all the observed evidence. It turns out that in the previous step, before marginalization and renormalization, the Hadamard product is actually equal to:
(See the appendix for the proof.) If we now marginalize out all the variables from this probability table, we can easily obtain the joint probability of the evidence.
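A toy sketch of this final query step may help fix the shapes involved. The design "tensor" is shown as a square matrix for a single-variable leaf clique, and all numbers are made up, so we clip at zero before normalizing; in the actual algorithm the messages are consistent estimates and no clipping is needed.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 5                                   # cardinality of the query variable
T = rng.normal(size=(k, k))             # toy design tensor (one mode shown)
T_pinv = np.linalg.pinv(T)              # Moore-Penrose pseudoinverse

msg_down = rng.random(k)                # downward incoming message
msg_up = rng.random(k)                  # upward outgoing message

# transform the two messages, then take their Hadamard product
unnormalized = (T @ msg_down) * (T_pinv @ msg_up)
unnormalized = np.clip(unnormalized, 1e-12, None)   # toy values may go negative
posterior = unnormalized / unnormalized.sum()       # renormalize
print(posterior)
```

Summing `unnormalized` before renormalizing corresponds to the joint evidence probability mentioned in the note above (up to the toy clipping).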
4.3 Proof of Consistency
The learning and inference algorithm presented above can be proved to be a consistent estimator of the queried ground-truth conditional probability distribution. Here we list the four key results in the derivation of the proof; the full proof is presented in the appendix.
Theorem 1. At each non-leaf separator set, the linear operator obtained from the Stage 2 regression converges in probability to:
Theorem 2. At the root clique of the latent junction tree, the tensor that is obtained from the root covariance estimation step converges in probability to:
The next two theorems then follow by mathematical induction:
Theorem 3. During the upward phase of the inference procedure, all the upward messages that each non-leaf, non-root clique node sends to its parent node (separated from it by a separator set) converge in probability to⁵:

⁵Here the conditioning set is the set of all observable variables associated with the leaf clique nodes of the inside tree that are actually observed in the evidence.
Theorem 4. During the downward phase of the inference procedure, all the downward messages that each non-leaf clique node sends to any of its children nodes (separated from it by a separator set) converge in probability to⁶:

⁶Here the conditioning set is the set of all observable variables associated with the leaf clique nodes of the outside tree that are actually observed in the evidence.
5 Extending to the Continuous Domain through Hilbert Space Embeddings of Distributions
One of the biggest advantages of our new learning framework is that it can be seamlessly extended from the discrete domain to the continuous domain in a very general nonparametric fashion. Previously, all existing algorithms for learning graphical models with continuous variables had to make certain parametric assumptions (such as multivariate Gaussianity) about the functional forms used to model or approximate the continuous distributions of the variables. However, in many real-world datasets the underlying continuous distributions are highly complex and irregular, and such a parametric approach severely limits the modeling expressiveness of the graphical models and can deviate substantially from the true underlying distributions. In contrast, our algorithm can simply use reproducing-kernel Hilbert space (RKHS) embeddings of distributions as sufficient-statistic features, express all the (alternative) parameter learning and belief propagation operations as tensor algebra in the infinite-dimensional Hilbert space, and then employ the kernel trick to transform these operations back into tractable finite-dimensional linear algebra over Gram matrices. We now explain how to formulate our learning algorithm over the continuous domain with RKHS embeddings⁷.

⁷Here we follow the notation of Boots et al. (2013) in our derivation.
5.1 RKHS Embeddings of Continuous Distributions
First, for each observable variable in the latent graphical model, specify a characteristic kernel:
Then, for each non-leaf separator set in the latent junction tree, specify characteristic kernels for the inside-tree and outside-tree variables, and pick the inside-tree and outside-tree features to be the corresponding kernel feature maps. For each separator set directly above a leaf clique node, we require the inside-tree features to be the feature map of the leaf's observable variables.
Now given the i.i.d. tuples in our training dataset, we define the following linear operators:
Letting the adjoint (conjugate transpose) of each operator be denoted accordingly, we also compute the corresponding Gram matrices:
5.2 TwoStage Regression in the Hilbert Space
After representing the continuous distributions of variables as points embedded in the corresponding RKHS, we can easily perform the two-stage regression through linear algebra operations over the Hilbert space. For example, if we use kernel ridge regression for the Stage 1A and Stage 1B regressions, we have:
and very similarly for Stage 1B; the scalar appearing in the ridge term is the regularization parameter.
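A minimal kernel ridge regression sketch for this Stage 1 step: with a Gram matrix G over n training inputs and regularization parameter lam, the prediction at a new input is k_new @ (G + lam*n*I)^{-1} y, where k_new holds kernel evaluations between the new input and the training inputs. The Gaussian RBF kernel, the toy sine target, and all parameter values below are illustrative choices, not the paper's.

```python
import numpy as np

def rbf(a, b, bw=1.0):
    """Gaussian RBF Gram matrix between row-wise point sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(3)
n = 200
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(X[:, 0])                     # noiseless toy target

lam = 1e-3
G = rbf(X, X)
alpha = np.linalg.solve(G + lam * n * np.eye(n), y)   # ridge solve

x_new = np.array([[0.5]])
pred = float(rbf(x_new, X) @ alpha)     # prediction at the new input
print(pred)                             # should be near sin(0.5)
```

In the paper's setting the regression targets are themselves feature vectors rather than scalars, but the Gram-matrix computation is the same column by column, which is what keeps everything finite-dimensional.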
We now collect all the estimated features for the inside tree and the extended inside trees that we calculated from the above S1A and S1B regressions into two operators:
And then for the stage 2 regression, we obtain the linear operator as:
where the last equality follows from the matrix inversion lemma.
5.3 Message Passing Inference in Hilbert Space
Now we first compute the estimated root covariance tensor:
,
and then the initial leaf messages that we send from each leaf clique node to its parent clique node would be:
where the function is defined as follows⁸:

⁸Let [u,v] denote the range of values that the variable can take.
In the actual calculations, we can approximate this quantity by its empirical average over a finite number of samples.
The remaining steps of the message-passing inference procedure follow exactly the same form as in the discrete case, and all the computation involved is eventually carried out over finite-dimensional Gram matrices. To our knowledge, this is the first algorithm able to learn continuous-valued latent-variable graphical models in a completely nonparametric fashion, without any assumptions about the functional forms of the variables' distributions.
6 Experiments
We designed three sets of experiments to evaluate the performance of our proposed algorithm, covering both synthetic and real data, in both the discrete and continuous domains.
6.1 Synthetic Dataset
In this experiment, we test the performance of our algorithm on the task of learning and running inference in the discrete-valued latent-variable graphical model depicted in Figure 5, using artificially generated synthetic data, and compare it with the EM algorithm.
We randomly set all the local conditional probability tables in this directed graphical model as the ground-truth model parameters, and then sample a dataset of 30,000 sets of joint observations of the observable variables. Next, we apply our new algorithm to learn this model, and evaluate its performance on the task of inferring the posterior distribution of a query variable given the observed values of three evidence variables. We compute the Kullback-Leibler divergence between our algorithm's inferred posterior and the ground-truth posterior calculated with the exact Shafer-Shenoy algorithm, averaged across all possible joint realizations of the evidence variables. We report the results in Figure 6 below. From the plot, we see that the average KL divergence between our algorithm's results and the ground-truth posterior quickly decreases and approaches 0 as the size of the training data increases, demonstrating that our algorithm quickly learns to perform accurate inference over latent graphical models. We also run the EM algorithm to learn the same model from the same synthetic training dataset, and compare its performance and training time with our algorithm's. The results are plotted in Figure 7: our algorithm achieves comparable learning performance but is much faster to train than EM.
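For completeness, the evaluation metric used above is the standard KL divergence between two discrete distributions; a small sketch (the smoothing constant `eps` is our own numerical safeguard):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p_true = np.array([0.7, 0.2, 0.1])    # toy ground-truth posterior
p_est  = np.array([0.6, 0.3, 0.1])    # toy estimated posterior

print(kl_divergence(p_true, p_est))   # small positive number
print(kl_divergence(p_true, p_true))  # exactly 0
```

Averaging this quantity across all joint realizations of the evidence variables gives the curve reported in Figure 6.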
6.2 DNA Splice Dataset
Here we consider the computational biology task of classifying splice junctions in DNA sequences, and test our learning algorithm on the DNA splice dataset from the UCI Machine Learning Repository (Asuncion & Newman, 2007). This dataset contains a total of 3190 length-60 DNA sequences. Each sequence is labeled with one of three categories: Intron/Exon site, Exon/Intron site, or Neither. Our goal is to learn to classify unseen DNA instances. We adopt a generative approach and use second-order non-homogeneous hidden Markov models (HMMs) to model the DNA splice junction sequences. For each of the three categories, we use our algorithm to learn a separate second-order non-homogeneous HMM. At test time, we compute the probability that a test instance was generated by each of the three second-order HMMs, and choose the one with the highest probability as the predicted category. We achieve an overall classification accuracy of 87.97%; detailed results are reported in Figure 8.
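The generative classification rule used here is simple to state in code: fit one model per class, score a test sequence under each, and predict the class with the highest (log-)likelihood. The per-class scorers below are placeholder functions for illustration, not actual second-order HMMs.

```python
import numpy as np

def classify(sequence, models):
    """models: dict mapping class label -> log-likelihood function."""
    scores = {label: loglik(sequence) for label, loglik in models.items()}
    return max(scores, key=scores.get)

# toy stand-in scorers: each "class" prefers sequences near a different level
models = {
    'EI': lambda s: -np.sum((s - 0.0) ** 2),
    'IE': lambda s: -np.sum((s - 1.0) ** 2),
    'N':  lambda s: -np.sum((s - 2.0) ** 2),
}
print(classify(np.array([0.9, 1.1]), models))  # 'IE'
```

In the experiment, each `loglik` would be the evidence probability computed by the learned model for that category, via the joint-evidence query described in Section 4.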
6.3 Human Action Recognition in Videos
In this experiment, we consider the computer vision problem of recognizing human actions in videos, using the classic KTH human action dataset (Schuldt et al., 2004). The KTH dataset contains a total of 2391 video sequences covering 6 categories of actions: boxing, handclapping, handwaving, jogging, running, and walking. We choose a second-order non-homogeneous state-space model for the generative process of human action videos, as depicted in Figure 9. For each video episode in the dataset, we take key frames evenly spaced across the whole time span, and extract a 2800-dimensional histogram of oriented gradients (HOG) feature vector and a 180-dimensional histogram of optical flow (HOF) feature vector for each key frame. During training, we concatenate the HOG and HOF feature vectors into a 2980-dimensional vector summarizing each key frame, and then use dimensionality reduction to reduce it to 5 dimensions. These 5-dimensional feature vectors are the observable variables of our latent graphical model. For each of the 6 action categories, we use our algorithm to learn a separate second-order non-homogeneous state-space model. At test time, we compute the probability that a test video sequence was generated by each of the 6 second-order state-space models, and choose the one with the highest probability as the predicted action category. This formulation poses a difficult problem of learning continuous-valued latent-variable graphical models with loopy structures, since the feature descriptors are continuous-valued with complex distributions. But as discussed in Section 5, our new algorithm handles such models smoothly through RKHS embeddings. In our experiment, using Gaussian radial basis function kernels (with appropriately chosen bandwidth, regularization parameter, and model length), we achieve an overall recognition accuracy of 75.69%. Figure 10 plots the recognition accuracies and the normalized confusion matrix of our results.
7 Conclusion
We have developed a new algorithm for learning general latent-variable graphical models using predictive belief propagation, two-stage instrumental regression, and RKHS embeddings of distributions. We proved that the algorithm gives a consistent estimator of the inference results. We evaluated the algorithm's learning performance on both synthetic and real datasets, showing that it learns different types of latent graphical models efficiently and achieves good inference results in both discrete and continuous domains. We believe that our algorithm provides a powerful and flexible new learning framework for latent graphical models.
References
 Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S., and Telgarsky, M. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
 Anderson et al. (1989) Anderson, S., Olesen, K., Jensen, F., and Jensen, F. HUGIN: a shell for building Bayesian belief universes for expert systems. In Proceedings of the 11th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1080–1085, 1989.
 Asuncion & Newman (2007) Asuncion, A. and Newman, D. J. UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Molecular+Biology+(Splicejunction+Gene+Sequences), 2007.
 Boots et al. (2011) Boots, B., Siddiqi, S., and Gordon, G. Closing the learningplanning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
 Boots et al. (2013) Boots, B., Gretton, A., and Gordon, G. Hilbert space embeddings of predictive state representations. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), pp. 92–101, 2013.
 Dempster et al. (1977) Dempster, A., Laird, N., and Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
 Hefny et al. (2015) Hefny, A., Downey, C., and Gordon, G. Supervised learning for dynamical system learning. In Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015.

 Hsu et al. (2009) Hsu, D., Kakade, S., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009), 2009.
 Lauritzen & Spiegelhalter (1988) Lauritzen, S. and Spiegelhalter, D. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157–224, 1988.
 Littman et al. (2001) Littman, M., Sutton, R., and Singh, S. Predictive representations of state. In Advances in Neural Information Processing Systems 14 (NIPS 2001), 2001.
 Parikh et al. (2012) Parikh, A., Song, L., Ishteva, M., Teodoru, G., and Xing, E. A spectral algorithm for latent junction trees. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI 2012), pp. 675–684, 2012.

 Schuldt et al. (2004) Schuldt, C., Laptev, I., and Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), 2004.
 Shenoy & Shafer (1990) Shenoy, P. and Shafer, G. Axioms for probability and belief-function propagation. In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence (UAI 1990), pp. 169–198, 1990.
 Singh et al. (2004) Singh, S., James, M., and Rudary, M. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), pp. 512–519, 2004.
 Stock & Watson (2011) Stock, J. H. and Watson, M. W. Introduction to Econometrics. Addison-Wesley, 3rd edition, 2011.