1 Introduction
Estimating the parameters and structure of Bayesian networks from incomplete data is a fundamental task in many of the social and natural sciences. We consider the setting of latent variable models, where certain variables of interest are never directly observed, but their effects are discernible in the interactions of other directly observable variables. For example, in the clinical setting, diseases themselves are rarely observable, but their presence and absence are inferred from the combined evidence of patient narratives, lab tests, and other measurements. The relationships between diseases, how they interact with each other, and how they present in individuals are still only partially understood. Another example is from the social sciences, where we may wish to know how opinions and beliefs interact or change over time. Surveys, textual analysis, and actions (e.g., voting registration, campaign donations) provide a noisy and incomplete view of people’s true beliefs, and we would like to build models to describe and analyze the underlying worldviews.
Estimating models involving latent variables is challenging for two reasons. First, without some constraints on the model, the model may be nonidentifiable, meaning that there may exist multiple parameter settings that cannot be distinguished based on the observed data. This nonidentifiability severely diminishes the interpretability of the learned models, and has been the subject of many critiques of factor analysis and similar techniques. Second, even in the identifiable setting, finding the model that best describes the data is often computationally intractable, even for seemingly simple models. For example, without making assumptions on the data generation process, finding the maximum likelihood latent Dirichlet allocation model is NP-hard (Arora et al., 2012).
We present a new efficient algorithm for discrete factor analysis, learning models involving binary latent variables and nonlinear relationships to binary observed variables (Martin & VanLehn, 1995; Šingliar & Hauskrecht, 2006; Wood et al., 2006; Jernite et al., 2013a). Our algorithm is semi-supervised, requiring that a domain expert identify anchors, observed variables that are constrained to have only a single parent among the latent variables. Given these anchors, we show how to learn factor analysis models with arbitrary structure in the latent variables. The anchor conditions we present are closely related to the exclusive views conditions used by Chaganty & Liang (2014), which enable learning model parameters provided that each clique has a set of observations that satisfy certain independence criteria. However, identifying exclusive views requires that the structure of the latent variable model be known in advance. This is a significant shortcoming in real-world settings, where the network structure is almost never known and its recovery is the main goal of data analysis.
Our learning algorithm is based on the method-of-moments, which over the last several years has led to polynomial-time algorithms for provably learning a wide range of latent variable models (e.g., Anandkumar et al., 2011, 2012; Arora et al., 2013). The algorithm, summarized in Algorithm 1 and illustrated in Figure 1, proceeds in steps, first learning the structure over the latent variables and then learning the edges from the latent to the observed variables.
Contributions: The contributions of this paper are as follows:

We define anchored factor analysis, a subclass of factor analysis models, where the structure of the latent variables can be recovered from data.

We derive a method-of-moments algorithm for recovering both the latent structure and the factor loadings, including an efficient method for the case of tree-structured latent variables.

We show that moment recovery can be improved by using constrained optimization with increasingly tight outer bounds on the marginal polytope, suggesting a method of increasing robustness of method-of-moments approaches in the presence of model misspecification.

We present experimental results, learning interpretable factor analysis models on two real-world multi-label text corpora, outperforming discriminative baselines.
Road map: The paper is organized as follows:

Section 2 formally defines anchor variables.

Section 3 reviews how anchor variables can be used to recover moments of the latent variables.

Section 4 describes a family of latent variable models that can be learned using the recovered latent moments, and outlines the learning procedure, which consists of two parts: learning a structure for the latent variables and learning connections between the latent variables and observed variables.

Section 5 compares the models learned using the method described here to other standard baselines on real-world prediction tasks.
2 The anchored factor analysis task
Notation:
We use uppercase letters to refer to random variables (e.g., $X_i$) and corresponding lowercase letters to refer to their values. For compactness, we use superscript parenthetical values to denote a variable taking a particular value (e.g., $x^{(1)}$ is equivalent to $X = 1$).

Anchored factor analysis: In this paper, we will discuss learning binary latent variable models. We divide the variables of our model into two classes: observed variables $X = (X_1, \dots, X_n)$ and unobserved or latent variables $Y = (Y_1, \dots, Y_m)$, where all variables have binary states, 0 or 1.
As in other factor analysis models, we assume that all dependencies between the observations are explained by the latent variables. This implies that the observed distribution has the following factorized form: $P(x) = \sum_{y} P(y) \prod_{j=1}^{n} P(x_j \mid y)$.
We focus on a particular class of latent variable models that we term anchored models. A latent variable model is anchored if for every latent variable $Y_i$, there is a corresponding observed variable that fits the following definition:
Definition 1.
Anchor variables: An observed variable $X_j$ is called an anchor for a latent variable $Y_i$ if $X_j \not\perp Y_i$ and $X_j \perp Y_k \mid Y_i$ for all other latent variables $Y_k$, $k \neq i$.
Using the language of directed probabilistic graphical models, we say that $X_j$ is an anchor of $Y_i$ if $Y_i$ is the sole parent of $X_j$. We use the notation $A(Y_i)$ to refer to the anchor of $Y_i$, or $A(S)$ to refer to the set of anchors of a set of latent variables $S$.
In this work, we assume that anchors are identified as an input to the learning algorithm. Specifying anchors can be a way for experts to inject domain knowledge into a learning task. In Section 5 we show that even anchors obtained from simple rules can perform well in modeling.
3 Recovering moments of latent variables
3.1 Previous work – exclusive views
For the purpose of exposition, we assume that every anchor has a conditional distribution (i.e., $P(A(Y_i) \mid Y_i)$) which can be estimated consistently, and leave the discussion of estimating these conditional distributions from data to a later section (Section 3.3).
For a set of latent variables $S \subseteq Y$, Chaganty & Liang (2014) show how to recover the marginal probabilities $P(S)$ using only the observed anchors $A(S)$. Eq. 1 relates the observed distribution to the unobserved moments $P(S)$:

$P(A(S)) = \sum_{s} P(A(S) \mid S = s)\, P(S = s) = \sum_{s} P(S = s) \prod_{Y_i \in S} P(A(Y_i) \mid Y_i = s_i) \qquad (1)$
The first equality simply comes from marginalizing the latent variables $S$. The second uses the assumption that an anchor is independent of all other variables conditioned on its parent. Since we assume that each $P(A(Y_i) \mid Y_i)$ can be estimated, Eq. 1 gives $2^{|S|}$ linear equations for the $2^{|S|}$ unknowns $P(S = s)$. Since the linear equations can be shown to be independent (see supplementary materials), the distribution of the latent variables can be estimated from the observed distribution $P(A(S))$. This can be trivially extended to a case where $S$ includes both latent and observed variables by noticing that every observed variable can act as its own anchor.
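To make the inversion concrete, consider a single binary latent variable with one noisy anchor: Eq. 1 reduces to one linear equation, which can be solved directly for the latent prior. A minimal sketch (all probabilities hypothetical):

```python
# Recover P(Y=1) from the observed anchor marginal P(A=1), given the anchor's
# conditional distribution (assumed already estimated, e.g. as in Section 3.3).
def recover_prior(p_a, p_a_given_y1, p_a_given_y0):
    """Solve P(A=1) = P(A=1|Y=1) P(Y=1) + P(A=1|Y=0) (1 - P(Y=1)) for P(Y=1)."""
    return (p_a - p_a_given_y0) / (p_a_given_y1 - p_a_given_y0)

# Hypothetical anchor: fires 80% of the time when Y=1, 5% of the time otherwise.
p_y = recover_prior(p_a=0.20, p_a_given_y1=0.80, p_a_given_y0=0.05)
# p_y == 0.2, since 0.20 = 0.80 * 0.2 + 0.05 * 0.8
```

With $|S| > 1$ latent variables, the same idea is applied to the $2^{|S|}$-dimensional linear system built from the Kronecker product of the anchor conditionals.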
Let $R = \bigotimes_{Y_i \in S} P(A(Y_i) \mid Y_i)$ be the Kronecker product of the conditional anchor distributions, and let $\vec{P}(S)$ be a vector of dimension $2^{|S|}$ of probabilities for $P(S = s)$. To recover $P(S)$, we seek to find a distribution that minimizes the divergence between the expected marginal vector of the anchors, $R \vec{P}(S)$, and the observed marginal vector of the anchors, $\vec{P}(A(S))$. We solve this as a constrained optimization problem, where $D$ is any Bregman divergence and $\Delta$ is the probability simplex:

$\hat{P}(S) = \operatorname*{argmin}_{\vec{p} \in \Delta} D\big(\vec{P}(A(S)) \,\big\|\, R \vec{p}\big) \qquad (2)$
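The optimization in Eq. 2 can be sketched with squared L2 distance as the Bregman divergence. The solver choice (SLSQP via SciPy) and all numbers below are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def recover_marginals(R, q):
    """Find the distribution p on the simplex minimizing ||R p - q||^2, where
    R is the Kronecker product of anchor conditionals and q is the observed
    anchor marginal vector (Eq. 2 with squared L2 divergence)."""
    k = R.shape[1]
    res = minimize(lambda p: np.sum((R @ p - q) ** 2),
                   x0=np.full(k, 1.0 / k),
                   bounds=[(0.0, 1.0)] * k,
                   constraints=({'type': 'eq', 'fun': lambda p: p.sum() - 1.0},))
    return res.x

# Hypothetical example: two latent variables, each with a known noisy anchor.
R = np.kron(np.array([[0.9, 0.2], [0.1, 0.8]]),   # P(A_1 | Y_1), rows = anchor value
            np.array([[0.7, 0.3], [0.3, 0.7]]))   # P(A_2 | Y_2)
p_true = np.array([0.4, 0.1, 0.2, 0.3])           # P(Y_1, Y_2), flattened
p_hat = recover_marginals(R, R @ p_true)          # should be close to p_true
```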
Chaganty & Liang (2014) show the consistency of this estimator using both L2 distance and KL divergence for $D$. The consistency of the estimator means that the recovered marginals will converge to the true values, $P(S)$, assuming that the anchor assumption is correct.
3.2 Robust moment recovery – connection to variational inference
Our algorithm’s theoretical guarantees, as with other method-of-moments approaches, assume that the data is drawn from a distribution in the model family. In our setting, this means that the anchors must perfectly satisfy Definition 1. We introduce a new approach to improve the robustness of method-of-moments algorithms to model misspecification. Our approach can be used as a drop-in replacement for parameter recovery in the exclusive views framework of Chaganty & Liang (2014). For simplicity, we describe it here in the context of anchors.
Consider, for example, the case of learning a tree-structured distribution on the latent variables. Before running the Chow-Liu algorithm, we would need to estimate the edge marginals $P(Y_i, Y_j)$ for every pair of latent variables. Chaganty & Liang (2014)’s approach is to solve independent optimization problems of the form given in Eq. 2, resulting in a set of estimates $\hat{P}(Y_i, Y_j)$. Our key insight is that since the true edge marginals all derive from the same joint distribution $P(Y)$, there are additional constraints that they must satisfy. For example, the true edge marginals must satisfy the local consistency constraints:
$\sum_{y_j} P(Y_i = y_i, Y_j = y_j) = P(Y_i = y_i) \quad \text{and} \quad \sum_{y_i} P(Y_i = y_i, Y_j = y_j) = P(Y_j = y_j) \qquad \forall\, i, j \qquad (3)$
More generally, the full vector of marginals $\vec{\mu}$ must lie in the marginal polytope, $\mathcal{M}$, consisting of the space of all possible marginal distribution vectors that can arise from any joint distribution (Wainwright & Jordan, 2008). Note that there exist vectors which satisfy the local consistency constraints but are not in the marginal polytope. The marginal polytope has been widely studied in the context of probabilistic inference in graphical models. MAP inference corresponds to optimizing a linear objective over the marginal polytope, and computing the log-partition function can be shown to be equivalent to optimizing a nonlinear objective over the marginal polytope (Wainwright & Jordan, 2008). Optimizing over the marginal polytope is NP-hard, and so approximate inference algorithms such as belief propagation instead optimize over the outer bound given by the local consistency constraints. There has been a substantial amount of work on characterizing tighter relaxations of the marginal polytope, all of which immediately applies to our setting (e.g., Sontag & Jaakkola, 2007).
Putting these together, we obtain the following optimization problem for robust recovery of the true moments of the latent variables from noisy observations of the anchors:
$\hat{\mu} = \operatorname*{argmin}_{\vec{\mu} \in \hat{\mathcal{M}}} \sum_{S : |S| \leq k} D\big(\vec{P}(A(S)) \,\big\|\, R_S \vec{\mu}_S\big) \qquad (4)$

where $k$ is the size of the moments needed within the structure learning algorithm (e.g., $k = 2$ for a tree-structured distribution) and $\hat{\mathcal{M}}$ denotes a relaxation of the marginal polytope.
We use the conditional gradient, or Frank-Wolfe, method to minimize (4) (Frank & Wolfe, 1956). Frank-Wolfe solves this convex optimization problem by repeatedly solving linear programs over $\hat{\mathcal{M}}$. When $\hat{\mathcal{M}}$ corresponds to the local consistency constraints or the cycle relaxation (Sontag & Jaakkola, 2007), these linear programs can be solved easily using off-the-shelf LP solvers. Alternatively, when there are sufficiently few variables, one can optimize over the marginal polytope itself. For this, we use the observation of Belanger et al. (2013) that optimizing a linear function over the marginal polytope can be performed by solving an integer linear program with local consistency constraints. In the experimental section we show that constrained optimization improves the robustness of the moment-recovery step compared to unconstrained optimization, and that using increasingly tight approximations of the marginal polytope within the conditional gradient procedure yields increasingly improved results.

Constrained optimization has been used previously to improve the robustness of method-of-moments results (Shaban et al., 2015). Our work differs in that the constrained space naturally coincides with the marginal polytope, which allows us to leverage the Frank-Wolfe algorithm and relaxations of the marginal polytope that have been studied in the context of variational inference.
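The conditional gradient loop itself is only a few lines. The sketch below uses a toy linear minimization oracle (LMO) over the probability simplex in place of a real LP over the local or cycle relaxation; the objective and all numbers are hypothetical:

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, iters=300):
    """Conditional gradient: repeatedly minimize the linearized objective
    over the feasible set (via the LMO) and step toward that vertex."""
    x = x0.astype(float)
    for t in range(iters):
        s = lmo(grad(x))                  # argmin over feasible set of <grad, s>
        x += (2.0 / (t + 2.0)) * (s - x)  # standard diminishing step size
    return x

# Toy stand-in for Eq. 4: project a noisy, infeasible "moment vector" q onto
# the probability simplex, whose LMO simply returns the best vertex. For the
# local or cycle relaxations, the LMO would call an off-the-shelf LP solver.
q = np.array([1.2, -0.1, -0.1])           # noisy estimate; not a distribution
lmo_simplex = lambda g: np.eye(len(g))[np.argmin(g)]
x = frank_wolfe(lambda x: 2.0 * (x - q), lmo_simplex, np.full(3, 1.0 / 3.0))
# x approaches [1, 0, 0], the Euclidean projection of q onto the simplex
```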
3.3 Estimating anchor noise rates
Throughout, we have assumed that the conditional probabilities of the anchors can be estimated from the observed data. This task of estimating label noise is the subject of a rich literature. In this section we describe several settings where that is a reasonable assumption.

Two or more anchors for $Y_i$ are provided: If two anchors are provided, then the conditional distributions for both can be estimated using a multi-view tensor decomposition method (Berge, 1991; Anandkumar et al., 2012), where the third view is obtained using any other observed variable that is correlated with the anchors (see supplementary materials).
Anchors are positive-only: In the setting where anchors are positive-only (i.e., $P(A(Y_i) = 1 \mid Y_i = 0) = 0$), we have the setting known as Positive and Unlabeled (PU) learning. In this setting, the noise rates of these noisy labels can be estimated using the predictions of a classifier trained to separate the positive from unlabeled cases (Elkan & Noto, 2008).
Some (partially) labeled data is available: In the medical diagnosis task that we consider in the experiments, we asked a physician a small number of questions about each patient as part of their regular workflow. This gave us singly-labeled data, where each data point has an observed label $y_i$ for a single latent variable $Y_i$. Using a Chernoff bound, one can show that a small number of such observations suffices to accurately estimate an anchor’s conditional distribution.
4 Model learning
4.1 Learning the latent variable structure
Structure learning background: Approaches for Bayesian network structure learning typically follow two basic strategies: they either search over structures that maximize the likelihood of the observed data (score-based methods), or they test for conditional independencies and use these to constrain the space of possible structures. A popular scoring function is the BIC score (Lam & Bacchus, 1994; Heckerman et al., 1995):
$\text{BIC}(G) = M \sum_{i} \Big[ \hat{I}\big(Y_i;\, Y_{\text{Pa}_G(i)}\big) - \hat{H}(Y_i) \Big] - \frac{\log M}{2}\, \dim(G) \qquad (5)$

where $M$ is the number of samples and $\hat{I}$ and $\hat{H}$ are the empirical mutual information and entropy, respectively. $\text{Pa}_G(i)$ denotes the parents of node $i$ in graph $G$, and $\dim(G)$ is the number of free parameters. The last term is a complexity penalty that biases toward structures with fewer parameters. Once the optimal graph structure is determined, the conditional probabilities, $P(Y_i \mid Y_{\text{Pa}_G(i)})$, that parametrize the Bayesian network are estimated from the empirical counts for a maximum likelihood estimate.
BIC is known to be consistent, meaning that if the data is drawn from a distribution which has precisely the same conditional independencies as a graph $G^*$, then given sufficient data it will recover a structure Markov equivalent to $G^*$. In general, finding a maximum-scoring Bayesian network structure is NP-hard (Chickering, 1996). However, approaches such as integer linear programming (Jaakkola et al., 2010; Cussens & Bartlett, 2013), branch-and-bound (Fan et al., 2014), and greedy hill-climbing (Teyssier & Koller, 2005) can be remarkably effective at finding globally optimal or near-optimal solutions. Restricting the search to tree structures, we can find the highest-scoring network efficiently using a maximum spanning tree algorithm (Chow & Liu, 1968).
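As a concrete instance of the tree-structured case, a Chow-Liu-style maximum spanning tree over pairwise mutual information can be sketched as follows (Prim's algorithm here; the pairwise tables stand in for recovered latent moments, and all numbers are hypothetical):

```python
import numpy as np

def mutual_info(p_xy):
    """Mutual information of two binary variables from their 2x2 joint table."""
    px, py = p_xy.sum(axis=1), p_xy.sum(axis=0)
    return sum(p_xy[a, b] * np.log(p_xy[a, b] / (px[a] * py[b]))
               for a in range(2) for b in range(2) if p_xy[a, b] > 0)

def chow_liu(pairwise):
    """Maximum spanning tree over mutual-information edge weights (Prim's
    algorithm). `pairwise[(i, j)]` is the 2x2 joint of variables i and j."""
    nodes = sorted({v for edge in pairwise for v in edge})
    in_tree, edges = {nodes[0]}, []
    while len(in_tree) < len(nodes):
        # Among edges crossing the cut, pick the one with largest MI.
        best = max((e for e in pairwise if (e[0] in in_tree) != (e[1] in in_tree)),
                   key=lambda e: mutual_info(pairwise[e]))
        edges.append(best)
        in_tree |= set(best)
    return edges

# Hypothetical pairwise marginals: Y0-Y1 and Y1-Y2 strongly coupled, Y0-Y2 independent.
strong = np.array([[0.45, 0.05], [0.05, 0.45]])
indep = np.full((2, 2), 0.25)
tree = chow_liu({(0, 1): strong, (1, 2): strong, (0, 2): indep})
# tree == [(0, 1), (1, 2)]
```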
A different approach is to use conditional independence tests. Under the assumption that every variable has at most $d$ neighbors, these can be used to give polynomial-time algorithms for structure learning (Pearl & Verma, 1991; Spirtes et al., 2001), which are also provably consistent.
We utilize the fact that both score-based and conditional-independence-based structure learning algorithms can be run using derived statistics rather than the raw data. Suppose $G$ is a Bayesian network with maximum in-degree $d$. Then, to estimate the mutual information and entropy needed to evaluate the BIC score for all possible such graphs, we only need to estimate all moments of size at most $d + 1$. For example, to search over tree-structured Bayesian networks, we would need to estimate $P(Y_i, Y_j)$ for every pair $i, j$. Alternatively, we could use the polynomial-time algorithms based on conditional independence tests. These need to perform independence tests using conditioning sets with up to $d$ variables, requiring us to estimate all moments of order $d + 2$.
We showed in Section 3 how to recover these moments of the latent variables from moments of the anchor variables. Even though these latent variables are completely unobserved, we are able to use the recovered moments in structure learning algorithms designed for fully observed data. Together, this gives us a polynomial-time algorithm for learning the Bayesian network structure for the latent variables (Step 2 of Algorithm 1), as stated in Theorem 1.
Theorem 1.
Let $G^*$ be the graph structure of a Bayesian network describing the probability distribution, $P(Y)$, of the latent variables in an anchored discrete factor analysis model. Using the low-order moments recovered in Equation 2 in a structure learning algorithm which is consistent for fully-observed data yields a consistent structural estimator. That is, as the number of samples $M \to \infty$, the recovered structure, $\hat{G}$, is Markov equivalent to $G^*$.

While conditional independence tests give the theoretical result of a polynomial-time recovery algorithm, in our experimental results we use the score-based approach, optimizing the BIC score, which is efficient for trees using the Chow-Liu algorithm (Chow & Liu, 1968) and for higher-order graphs using a cutting-plane algorithm (Cussens & Bartlett, 2013).
4.2 Learning factor loadings, $P(X \mid Y)$
In this section we describe a moments-based approach to recovering the factor loadings in the anchored discrete factor analysis model (Step 3 of Algorithm 1). We show how to make use of the recovered low-order moments and the structure learned in Section 4.1 to estimate the conditional distributions of the observations conditioned on the latent variables.
In this work, we use a noisy-or parametrization, a widely-used parametrization for Bayesian networks in medical diagnosis (e.g., Miller et al., 1986; Wood et al., 2006) and binary factor analysis (Šingliar & Hauskrecht, 2006). In particular, we assume that the distribution of any observation conditioned on the latent variables has the form: $P(X_j = 0 \mid Y) = (1 - l_j) \prod_i f_{ij}^{Y_i}$. The parameters $f_{ij}$ are known as failure probabilities and the parameters $l_j$ are known as leak probabilities. Intuitively, the generative model can be described as follows: for every observation, $X_j$, each parent, $Y_i$, that is “on” has a chance to turn it on, and fails with probability $f_{ij}$. The leak probability represents the probability of $X_j$ being turned on without any of its parents succeeding.
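The generative story above is easy to simulate. A minimal sketch of sampling observations from a noisy-or conditional (array names and shapes are our own convention):

```python
import numpy as np

def sample_noisy_or(y, fail, leak, rng):
    """Sample observed vector X given latent binary vector y under noisy-or:
    P(X_j = 0 | y) = (1 - leak[j]) * prod_i fail[i, j] ** y[i]."""
    # For each observation j, multiply failure probs of the parents that are on.
    p_off = (1.0 - leak) * np.prod(np.where(y[:, None] == 1, fail, 1.0), axis=0)
    return (rng.random(p_off.shape) >= p_off).astype(int)

# Deterministic corner cases (hypothetical parameters): with Y_0 on, a failure
# probability of 0 forces X_0 on, and a failure probability of 1 with no leak
# keeps X_1 off.
rng = np.random.default_rng(0)
x = sample_noisy_or(np.array([1, 0]),
                    np.array([[0.0, 1.0],    # fail[i, j]: rows index latents
                              [0.5, 0.5]]),
                    np.array([0.0, 0.0]),    # leak per observation
                    rng)
# x == [1, 0] regardless of the random draw
```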
We use $G$ to denote the Bayesian network structure over the latent variables that describes $P(Y)$. This graph structure is learned in Section 4.1.
To properly estimate a failure probability $f_{ij}$ from low-order moments, we need to separate the direct effect of $Y_i$ in turning on $X_j$ from all other indirect effects (e.g., conditioning on $Y_i$ changes the likelihood of another latent variable being on, which in turn affects $X_j$).
Estimation using Markov blanket conditioning: The first method of separating direct and indirect effects uses the entire Markov blanket of each latent variable. Denote the Markov blanket of $Y_i$ in $G$ as $MB(Y_i)$. For any setting $\bar{y}$ of $MB(Y_i)$, the following is a consistent estimator of $f_{ij}$:

$\hat{f}_{ij} = \frac{P(X_j = 0 \mid Y_i = 1,\, MB(Y_i) = \bar{y})}{P(X_j = 0 \mid Y_i = 0,\, MB(Y_i) = \bar{y})} \qquad (6)$

as shown in the supplementary materials. The simplest application of this estimator is in models where the latent variables are assumed to be independent (e.g., Miller et al., 1986; Jernite et al., 2013a), where the Markov blanket is empty and the estimator is simply

$\hat{f}_{ij} = \frac{P(X_j = 0 \mid Y_i = 1)}{P(X_j = 0 \mid Y_i = 0)} \qquad (7)$
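The ratio form of the independent-latents estimator can be sanity-checked by brute-force enumeration of a small noisy-or model: the leak and the other latent variables' contributions cancel in the ratio of the two conditionals, leaving exactly the failure probability. All numbers are hypothetical:

```python
import numpy as np
from itertools import product

def p_x_off_given(y_val, i, prior, fail, leak):
    """P(X = 0 | Y_i = y_val) for a single noisy-or child of independent
    binary latents, computed by brute-force enumeration."""
    num = den = 0.0
    for y in product([0, 1], repeat=len(prior)):
        if y[i] != y_val:
            continue
        # Probability of this configuration of the *other* latents.
        w = np.prod([prior[k] if y[k] else 1 - prior[k]
                     for k in range(len(prior)) if k != i])
        num += w * (1 - leak) * np.prod([fail[k] for k in range(len(y)) if y[k]])
        den += w
    return num / den

prior, fail, leak = np.array([0.3, 0.6]), np.array([0.25, 0.7]), 0.05
f_hat = (p_x_off_given(1, 0, prior, fail, leak)
         / p_x_off_given(0, 0, prior, fail, leak))
# f_hat == 0.25, the true failure probability of latent 0
```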
Unfortunately, for general graphs, estimating these quantities requires forming moments of the same order as the Markov blanket of each latent variable, which can potentially be quite large even in simple tree models, giving poor computational and statistical efficiency.
Improved estimation for tree-structured models: When the latent graph $G$ is a tree, it is possible to correct for indirect effects serially, only requiring conditioning on two latent variables at a time. Each of the correction factors essentially cancels out the effect of an entire subtree in determining the likelihood that $X_j = 0$. Let $N(Y_i)$ denote the neighbors of $Y_i$ in $G$. When calculating $\hat{f}_{ij}$, we introduce correction factors to cancel the indirect effects that flow through each neighbor (Figure 2).
By correcting for each neighbor in serial we get the following estimator:
(8) 
where each correction factor has the form:
(9) 
and can be estimated from the low-order moments recovered in Equation 2. We note that for this method we only need to make assumptions about the graph of the latent variables, requiring no assumptions on the in-degree of the observations. For the correction factors to be defined, we do require that it is possible to condition on pairs of latent variables, so we require that the joint distribution of each pair of latent variables is not deterministic. A full derivation is provided in the supplementary materials.
4.3 Estimating leak parameters
Leak parameters can be estimated by comparing the empirical estimate of $P(X_j = 1)$ to the one predicted by the model without leak, denoted as $\tilde{P}(X_j = 1)$:

$\hat{l}_j = 1 - \frac{1 - \hat{P}(X_j = 1)}{1 - \tilde{P}(X_j = 1)} \qquad (10)$
If the latent variables are independent, the Quickscore algorithm (Heckerman, 1990) gives an efficient method of computing marginal probabilities:

$\tilde{P}(X_j = 1) = 1 - \prod_i \big(1 - P(Y_i = 1)\,(1 - f_{ij})\big) \qquad (11)$
For trees, this value can be computed efficiently using belief propagation and in more complex models it can be estimated by forward sampling.
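For the independent case, both the no-leak marginal and the resulting leak estimate are closed-form, and the two steps can be checked for self-consistency. A small sketch (our own function names; all numbers hypothetical):

```python
import numpy as np

def marginal_on_noleak(prior, fail_j):
    """P(X_j = 1) with independent latents and no leak (Quickscore-style):
    P(X_j = 0) = prod_i (1 - prior[i] * (1 - fail_j[i]))."""
    return 1.0 - np.prod(1.0 - prior * (1.0 - fail_j))

def estimate_leak(p_obs_on, prior, fail_j):
    """Match the observed marginal to the no-leak prediction (Eq. 10 style):
    1 - p_obs = (1 - leak) * (1 - no-leak marginal)."""
    return 1.0 - (1.0 - p_obs_on) / (1.0 - marginal_on_noleak(prior, fail_j))

# Generate an observed marginal from a known leak of 0.1, then recover it.
prior, fail_j = np.array([0.5]), np.array([0.2])
p_obs = 1.0 - (1.0 - 0.1) * (1.0 - marginal_on_noleak(prior, fail_j))
# estimate_leak(p_obs, prior, fail_j) == 0.1
```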
4.4 Putting it all together
The full anchored factor analysis model is a product of the latent distribution $P(Y)$, which is described by an arbitrary Bayesian network, and the factor loadings, which are described by noisy-or link functions. The computational complexity of learning the model depends on the choice of constraints for the Bayesian network $G$. Table 1 outlines the different classes of models that can be learned and the associated computational complexities. We note that for some Bayesian network families, learning may be NP-hard in a worst-case analysis, but in practice cutting-plane approaches that use integer linear programs can be successful in learning these models exactly and quickly (Cussens & Bartlett, 2013).
Structure of $P(Y)$ | Complexity of learning | Factor loadings
independent | None | (Eq. 7)
tree | Chow-Liu | (Eq. 8)
degree-K | Independence tests | (Eq. 6)
in-degree-K | Score-based: NP-hard worst case | (Eq. 6)
5 Experiments
We perform an empirical study of our learning algorithm using two real-world multi-tag datasets, testing its ability to recover the structure and parameters of the distribution of tags and observed variables.
5.1 Datasets
Medical records: We use a corpus of medical records, collected over 5 years in the emergency department of a Level I trauma center and tertiary teaching hospital in a major city. A physician collaborator manually specified 23 medical conditions that present in the emergency department and provided textual features that can be used as anchors, as well as administrative billing codes that can be used for our purposes as ground truth. Features consist of binary indicators from processed medical text and medication records. Details of the processing and a selection of the physician-specified anchors are provided in the supplementary materials. Patients are filtered to exclude those with fewer than two of the specified conditions, leaving 16,268 patients.
Stack Overflow: We also evaluate our methods on a large publicly available dataset of questions from Stack Overflow (http://blog.stackoverflow.com/category/cc-wiki-dump/; this data was also used in a Kaggle competition: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction). We treat user-provided tags as latent variables. The observed vocabulary consists of the 1000 most common tokens in the questions, using a different vocabulary for the question header and body. Each question is described by a binary bag-of-words feature vector, denoting whether or not each word in the vocabulary occurs in the question text. We use a simple rule to define anchor variables: for each tag, the text of the tag appearing in the header of the question is used as an anchor. For example, for the tag unix, the anchor is a binary feature indicating whether the header of the text contains the word “unix”. We use the 50 most popular tags in our models.
5.2 Methods
In all of our experiments, we provide the algorithms and baselines with the empirical conditional probabilities of the anchors. Other methods of estimating these values exist and are discussed in Section 3.3, but here we use the ground truth values for these noise rates in order to focus on the errors that arise from modeling error and finite data samples. Details on the optimization procedures and parameters used can be found in the supplementary materials.
5.3 Latent variable representation
In this section we evaluate the quality of the learned representations of the latent variables, $P(Y)$. This task is interesting in its own right in settings where we care about understanding the interactions of the latent variables themselves for the purposes of knowledge discovery. Figure 3 shows a tree-structured graphical model learned to represent the distribution of the 23 latent variables in the Emergency dataset, as well as small subgraphs from a more complex model.
In the treestructured model, highly correlated conditions are linked by an edge. For example, asthma and allergic reactions, or alcohol and suicidal ideation. This is significant, since the model learning procedure learns only using anchors and without access to the underlying conditions.
The insert shows subgraphs from a model learned with two parents per variable, which allows for more complex structures. For example, being HIV positive makes the patient more at risk for developing infections, such as cellulitis and pneumonia, and either one of these is capable of causing septic shock. The v-structure between cellulitis, septic shock, and pneumonia expresses the explaining-away relationship: knowing that a patient has septic shock raises the likelihood of cellulitis or pneumonia. Once one of those possible parents is discovered (e.g., the patient is known to have cellulitis), the septic shock is explained away and the probability of having pneumonia is reduced. In the second example relationship, both asthma and urinary tract infections are associated with allergic reactions (asthma and allergic reactions are closely related, and many allergic reactions in hospital occur in response to antibiotics administered for infections), but asthma and urinary tract infections are negatively correlated with each other, since either one is sufficient to explain away the patient’s visit to the emergency department.
5.3.1 Robustness to model error
In practice, the anchors that we specify are never perfect anchors, i.e., they do not fully satisfy the conditional independence conditions of Definition 1. We find that enforcing marginal or local polytope constraints (see Section 3.2) during the moment recovery process provides moment estimates that are more robust to imperfect anchors, improving the overall quality of the learned models.
The weakest constraint we consider is using independent simplex constraints, as in Chaganty & Liang (2014). In addition, we evaluate models learned with moments recovered using the local consistency constraints described in Equation 3 and the marginal polytope constraints described in Section 3.2. Figure 4 shows the held-out likelihood of the tags according to tree models learned from moments recovered with increasingly tight approximations to the marginal polytope on the Stack Overflow corpus. The tighter constraints yield higher-quality models. This effect persists even in the large-sample regime, suggesting that the residual gaps between the methods are due to sensitivity to model error. On synthetically generated corpora, this effect disappears and all three constrained optimization approaches perform equally well.
Solving with tighter constraints does increase running time. For example, using our current solving procedures, learning a model with 50 latent variables requires 30 seconds using the simplex constraints, 519 seconds for the local polytope, and 6,073 seconds for the marginal polytope. Unlike EM-based procedures for learning models with latent variables, which iterate over training samples in an inner loop, the running time of these method-of-moments approaches depends on the number of samples only when reading the data for the first time.
5.4 Factor analysis – relating observations to latent variables
Table 2 shows a selection of factor loadings on the Stack Overflow dataset, learned using the tree-structured estimation procedure described in Section 4.2. The rest of the learned factors for both the Stack Overflow and Emergency datasets can be found in the supplementary materials.
5.5 Comparison to related methods
We evaluate the full learned factor analysis model with a simulated last-tag prediction task, where the model is presented with all but one of the positive tags (i.e., the latent variables) for a single document or patient and the goal is to predict the final tag. Our learned models use a tree-structured Bayesian network to represent the distribution $P(Y)$ and noisy-or gates to represent the conditional distributions $P(X_j \mid Y)$.
Noise-tolerant discriminative training: As a baseline, we compare to the performance of a noise-tolerant learning procedure (Natarajan et al., 2013) using independent binary classifiers, trained using logistic regression with reweighted samples as described in their paper. To ensure a fair comparison with our learning algorithm, which is provided with the anchor noise rates, we also provide the baselines with the exact values for the noise rates. Since the learning approach of Natarajan et al. (2013) is not designed to use the noisy labels (anchors) at prediction time, we consider two variants: one ignores the anchors at test time; the other predicts only according to the noise rates of the anchors (ignoring all other observations) if the anchor is present. The baseline that we compare against is the best of the baselines described above.

Oracle comparisons: We compare to two different oracle implementations to deconvolve different sources of error. The first, Oracle MoM, uses a method-of-moments approach to learning the joint model (as described in Sections 4.1 and 4.2), but uses oracle access to the marginal distributions that include latent variables (i.e., it does not incur any error when recovering these distributions using the method described in Section 3). The second oracle method, Oracle ML, uses a maximum likelihood approach to learning the joint models with full access to the latent variables for every instance (i.e., learning as though all data is fully observed). Comparing the gap between Oracle ML and Oracle MoM shows the loss in accuracy that comes from choosing a method-of-moments approach over a maximum likelihood approach.
Stack Overflow  

Tag  Top weights 
osx  osx, i’m, running, i’ve, install, installed, os, code 
image  image, code, size, upload, html, save, picture, width 
mysql  mysql, query, rows, row, id, 1, tables, join 
webservices  web, service, webservices, client, java, services 
linux  linux, running, command, machine, system, server 
query  query, table, a, result, results, queries, tables, return 
regex  regex, match, expression, regular, pattern, i’m 
Emergency  

Tag  Top weights 
abdominal pain  pain, Ondansetron, nausea, neg:fevers 
alcohol  sex:m, sober, admits, found, drink 
asthmacopd  albuterol sulfate, sob, Methylprednisolone 
stroke  age:8090, admit, patient, head, ekg: 
hematuria  sex:m, urine, urology, blood, foley 
HIV+  sex:m, Truvada, cd4, age:4050, Ritonavir 
motor vehicle collision  car, neg:loc, age:2030, hit, neck 
Imputation: We compare to an imputation-based learning method, which learns a maximum likelihood model (tree-structured $P(Y)$ and noisy-or gates for the conditional distributions) from fully imputed data, using a single sample from the independent binary baseline classifiers (Natarajan et al., 2013) to impute the values of the latent variables.

Models with independent latent variables: Previous works (e.g., Jernite et al., 2013a) consider similar models where latent variables are assumed to be independent. In our work, we explicitly model dependencies between the latent variables. We test the advantage of having a structured representation of the latent variables compared to a model with independent latent variables.
Maximizing data likelihood: We could learn a model without explicitly using the anchors, attempting to maximize the likelihood of the observations. This maximization is a nonconvex problem due to the latent variables, and marginalizing over the latent variables to compute the likelihood is computationally difficult. However, an EM procedure (Dempster et al., 1977) can be applied here (more detail in the supplementary materials). We initialize the parameters using the ADFA-tree method and run EM to determine whether additional steps of likelihood optimization can improve upon the solutions found with ADFA.
5.6 Results
Figure 5 shows the improvement over baseline on the held-out tag prediction task for the two real-world datasets. We observe that learning the factor analysis model does indeed help with prediction in both tasks; we believe this is because the independent classifiers in the baseline cannot take advantage of the correlation structure between latent variables. In the Stack Overflow corpus, there is a clear advantage to learning a structured representation of the latent variables: the tree-structured discrete factor analysis model outperforms the oracle bounds of a model with independent variables. The Emergency corpus is difficult to improve upon because the anchors themselves are so informative. In both datasets, the oracle results show that using method-of-moments parameter learning is not a large source of disadvantage compared to maximum likelihood estimation.
Figure 6 shows the effect of running likelihood-based optimization. Contrary to expectation, improving the likelihood of the data does not improve performance on the tag-prediction task. One reason is that latent variables can drift from their originally intended meaning as a result of the likelihood optimization. One example in Figure 6 shows how the “headache” latent variable comes to model long negation scopes in patient records rather than modeling headaches.
Although the tighter constraints on the marginal polytope help with density estimation and could be useful for knowledge discovery, for this particular task the improved density estimation does not result in significant changes in accuracy. Thus, we show only the results of the model learned with simplex constraints here.
6 Conclusion
Learning interpretable models with latent variables is a difficult task. In this paper we present a fast and expressive method that allows us to learn models with complex interactions between the latent variables. The learned models are also interpretable, owing to the effect of the user-specified anchor observations. On real-world datasets, we show that modeling the correlations between latent variables is useful and that our method outperforms competitive baseline procedures. We find that enforcing marginal polytope constraints improves robustness to model error, a technique we believe can be more widely applied. Anchors have the interesting property that, for method-of-moments algorithms requiring only low-order moments, they make structure and parameter estimation with latent variables as easy as learning with fully observed data. In contrast, likelihood-based learning remains, to the best of our knowledge, equally hard even with anchors.
7 Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. 1350965 (CAREER award). Research is also partially supported by an NSERC postgraduate scholarship.
References

 Anandkumar et al. (2012) Anandkumar, A., Hsu, D., & Kakade, S. 2012. A method of moments for mixture models and hidden Markov models. In: COLT 2012.
 Anandkumar et al. (2011) Anandkumar, Anima, Chaudhuri, Kamalika, Hsu, Daniel, Kakade, Sham, Song, Le, & Zhang, Tong. 2011. Spectral Methods for Learning Multivariate Latent Tree Structure. Proceedings of NIPS 24, 2025–2033.
 Arora et al. (2012) Arora, S., Ge, R., & Moitra, A. 2012. Learning Topic Models – Going Beyond SVD. In: FOCS.
 Arora et al. (2013) Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., & Zhu, M. 2013. A Practical Algorithm for Topic Modeling with Provable Guarantees. In: ICML, vol. 28 (2). JMLR: W&CP.
 Belanger et al. (2013) Belanger, D., Sheldon, D., & McCallum, A. 2013. Marginal Inference in MRFs using Frank-Wolfe. NIPS Workshop on Greedy Optimization, Frank-Wolfe and Friends.
 Berge (1991) Berge, Jos M.F. Ten. 1991. Kruskal’s polynomial for 2×2×2 arrays and a generalization to 2×n×n arrays. Psychometrika, 56, 631–636.

 Chaganty & Liang (2014) Chaganty, A., & Liang, P. 2014. Estimating Latent-Variable Graphical Models using Moments and Likelihoods. In: International Conference on Machine Learning (ICML).
 Chapman et al. (2001) Chapman, Wendy W., Bridewell, Will, Hanbury, Paul, Cooper, Gregory F., & Buchanan, Bruce G. 2001. A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries. Journal of Biomedical Informatics, 34(5), 301–310.

 Chickering (1996) Chickering, D. 1996. Learning Bayesian Networks is NP-Complete. Pages 121–130 of: Fisher, D., & Lenz, H.J. (eds), Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag.
 Chow & Liu (1968) Chow, C., & Liu, C. 1968. Approximating discrete probability distributions with dependence trees. Information Theory, IEEE Transactions on, 14(3), 462–467.
 Cussens & Bartlett (2013) Cussens, J., & Bartlett, M. 2013. Advances in Bayesian Network Learning using Integer Programming. UAI.
 Dempster et al. (1977) Dempster, A. P., Laird, N. M., & Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM Algorithm. J. Roy. Statist. Soc. Ser. B, 1–38.
 Elkan & Noto (2008) Elkan, C., & Noto, K. 2008. Learning classifiers from only positive and unlabeled data. In: KDD.
 Fan et al. (2014) Fan, X., Yuan, C., & Malone, B. 2014. Tightening Bounds for Bayesian Network Structure Learning. In: AAAI.
 Frank & Wolfe (1956) Frank, Marguerite, & Wolfe, Philip. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2), 95–110.
 Gurobi Optimization (2014) Gurobi Optimization, Inc. 2014. Gurobi Optimizer Reference Manual.
 Halpern & Sontag (2013) Halpern, Yoni, & Sontag, David. 2013. Unsupervised Learning of Noisy-Or Bayesian Networks. In: Conference on Uncertainty in Artificial Intelligence (UAI-13).
 Heckerman et al. (1995) Heckerman, D., Geiger, D., & Chickering, D. M. 1995. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20(3), 197–243. Technical Report MSR-TR-94-09.
 Heckerman (1990) Heckerman, David E. 1990. A tractable inference algorithm for diagnosing multiple diseases. Knowledge Systems Laboratory, Stanford University.
 Jaakkola et al. (2010) Jaakkola, T., Sontag, D., Globerson, A., & Meila, M. 2010. Learning Bayesian network structure using LP relaxations. Pages 358–365 of: International Conference on Artificial Intelligence and Statistics.
 Jaggi (2013) Jaggi, M. 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In: ICML.
 Jernite et al. (2013a) Jernite, Yacine, Halpern, Yoni, & Sontag, David. 2013a. Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests. In: Advances in Neural Information Processing Systems 26. MIT Press.
 Jernite et al. (2013b) Jernite, Yacine, Halpern, Yoni, Horng, Steven, & Sontag, David. 2013b. Predicting Chief Complaints at Triage Time in the Emergency Department. NIPS Workshop on Machine Learning for Clinical Data Analysis and Healthcare.
 Kivinen & Warmuth (1995) Kivinen, Jyrki, & Warmuth, Manfred K. 1995. Exponentiated Gradient Versus Gradient Descent for Linear Predictors. Inform. and Comput., 132.
 Lam & Bacchus (1994) Lam, Wai, & Bacchus, Fahiem. 1994. Learning Bayesian Belief Networks: An Approach Based on the MDL Principle. Computational Intelligence, 10, 269–294.
 Martin & VanLehn (1995) Martin, J, & VanLehn, Kurt. 1995. Discrete factor analysis: Learning hidden variables in Bayesian networks. Tech. rept. Department of Computer Science, University of Pittsburgh.
 Menon et al. (2015) Menon, Aditya, Rooyen, Brendan V, Ong, Cheng S, & Williamson, Bob. 2015. Learning from Corrupted Binary Labels via Class-Probability Estimation. Pages 125–134 of: Proceedings of the 32nd International Conference on Machine Learning (ICML-15).
 Miller et al. (1986) Miller, Randolph A., McNeil, Melissa A., Challinor, Sue M., Fred E. Masarie, Jr., & Myers, Jack D. 1986. The INTERNIST-1/QUICK MEDICAL REFERENCE project – Status report. West J Med, 145(Dec), 816–822.
 Natarajan et al. (2013) Natarajan, N., Dhillon, I., Ravikumar, P., & Tewari, A. 2013. Learning with Noisy Labels. In: NIPS.
 Pearl & Verma (1991) Pearl, Judea, & Verma, Thomas. 1991. A Theory of Inferred Causation. Pages 441–452 of: KR.
 Sanchez et al. (2008) Sanchez, M., Bouveret, S., Givry, S. De, Heras, F., Jégou, P., Larrosa, J., Ndiaye, S., Rollon, E., Schiex, T., Terrioux, C., Verfaillie, G., & Zytnicki, M. 2008. MaxCSP competition 2008: toulbar2 solver description.
 Scott et al. (2013) Scott, Clayton, Blanchard, Gilles, & Handy, Gregory. 2013. Classification with Asymmetric Label Noise: Consistency and Maximal Denoising. Pages 489–511 of: Shalev-Shwartz, Shai, & Steinwart, Ingo (eds), COLT. JMLR Proceedings, vol. 30. JMLR.org.
 Shaban et al. (2015) Shaban, Amirreza, Farajtabar, Mehrdad, Xie, Bo, Song, Le, & Boots, Byron. 2015. Learning Latent Variable Models by Improving Spectral Solutions with Exterior Point Methods. In: Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI-2015).
 Šingliar & Hauskrecht (2006) Šingliar, Tomáš, & Hauskrecht, Miloš. 2006. Noisy-or component analysis and its application to link analysis. The Journal of Machine Learning Research, 7, 2189–2213.
 Sontag & Jaakkola (2007) Sontag, D., & Jaakkola, T. 2007. New Outer Bounds on the Marginal Polytope. In: NIPS 20. MIT Press.
 Spirtes et al. (2001) Spirtes, P., Glymour, C., & Scheines, R. 2001. Causation, Prediction, and Search, 2nd Edition. The MIT Press.
 Teyssier & Koller (2005) Teyssier, M., & Koller, D. 2005. Ordering-based Search: A Simple and Effective Algorithm for Learning Bayesian Networks. Pages 584–590 of: Proc. of the 21st Conference on Uncertainty in Artificial Intelligence.
 Wainwright & Jordan (2008) Wainwright, M., & Jordan, M. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1–305.
 Wang et al. (2014) Wang, Xiang, Sontag, David, & Wang, Fei. 2014. Unsupervised Learning of Disease Progression Models. Pages 85–94 of: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’14. New York, NY, USA: ACM.
 Wood et al. (2006) Wood, F., Griffiths, T., & Ghahramani, Z. 2006. A Non-Parametric Bayesian Method for Inferring Hidden Causes. In: UAI ’06, Proceedings of the 22nd Conference in Uncertainty in Artificial Intelligence.
Appendix A R matrix is full rank
In this section we show that the matrix R, defined in Section 3.1, is invertible. R is block-diagonal, so its determinant is equal to the product of the determinants of the blocks, and it is nonzero as long as all of the blocks have nonzero determinants.
Each block is a Kronecker product of conditional distribution matrices. The determinant of each of these matrices is nonzero as long as the conditional distributions are nondegenerate, which is assumed in the definition of anchored latent variable models. Thus, the determinant of the Kronecker product is also nonzero, and the matrix R is full rank.
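The determinant argument can be checked numerically. In this sketch, A and B are illustrative 2×2 stand-ins for conditional distribution matrices (they are not quantities from the paper); the identity det(kron(A, B)) = det(A)² · det(B)² for 2×2 factors shows that the block is invertible whenever both factors are.

```python
import numpy as np

# Illustrative 2x2 stand-ins for conditional-distribution matrices;
# each column is a hypothetical P(anchor | parent) distribution.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
B = np.array([[0.7, 0.3],
              [0.3, 0.7]])
block = np.kron(A, B)
# For 2x2 factors, det(kron(A, B)) = det(A)^2 * det(B)^2,
# so the block determinant is nonzero when both factor determinants are.
```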
Appendix B Estimating conditional probabilities with two anchors per latent variable
In Section 3.3, we note that if a latent variable has two anchors, their conditional distributions can be estimated. The derivation is as follows. Given the two observed anchors of the latent variable, we can choose any other observation that has positive mutual information with both of them. In that case, we have a singly-coupled triplet setting, as described in Halpern & Sontag (2013). Marginalizing over all latent variables but the one of interest, we have that the observed distribution over the triplet is a tensor with the following decomposition into a sum of rank-1 tensors:
(12) 
This decomposition can be computed efficiently (Berge, 1991), and the conditional probabilities can be recovered.
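The structure of this decomposition can be illustrated numerically. Below, the prior and conditional distributions are randomly generated stand-ins (not the paper's quantities) for a binary latent variable, its two anchors, and the coupled third observation; the observed triplet distribution is a sum of one rank-1 tensor per latent state.

```python
import numpy as np

rng = np.random.default_rng(0)
p_y = np.array([0.6, 0.4])  # prior over the binary latent state (illustrative)
# P(V = v | Y = y) for the two anchors and the coupled observation; row y, column v
cond = [rng.dirichlet(np.ones(2), size=2) for _ in range(3)]

# Observed triplet distribution: a sum of rank-1 tensors, one per latent state
T = np.zeros((2, 2, 2))
for y in range(2):
    T += p_y[y] * np.einsum('i,j,k->ijk', cond[0][y], cond[1][y], cond[2][y])
```

Recovering the conditionals from T is the harder inverse direction, which is what the Berge (1991) procedure addresses; this sketch only verifies the forward structure.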
Appendix C Estimating failures with Markov blanket conditioning
The Markov blanket estimator from Section 4.2 follows from the noisy-or parametrization of the model. Here we show that the estimator from Equation 6 is indeed a consistent estimator. We start with the estimator:
We treat the numerator and the denominator of the right-hand side separately, splitting the variables into the conditioned set (the Markov blanket) and the remaining unconditioned variables. For the numerator, we have:
(13) 
Similarly, for the denominator:
(14) 
The second lines follow from the Markov blanket property. Thus, the ratio equals the desired conditional probability. As the number of samples approaches infinity, the empirical estimates approach the true probabilities, and thus the estimator is consistent, provided that the probability of the conditioning event is positive.
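As a sanity check on the general principle (not the paper's exact estimator): a ratio of empirical frequencies consistently estimates a conditional probability whenever the conditioning event has positive probability. The Bernoulli parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
b = rng.random(n) < 0.3                                    # conditioning event, P(B) = 0.3
a = np.where(b, rng.random(n) < 0.8, rng.random(n) < 0.1)  # chosen so that P(A | B) = 0.8
est = (a & b).mean() / b.mean()                            # empirical P(A, B) / P(B)
```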
Appendix D Estimating correction coefficients serially in trees
In this section we derive the correction terms for estimating the failure probabilities in the tree models described in Section 4.2 (Equation 8). While the directionality of the tree is not important, it is notationally easier to consider a rooted tree, and without loss of generality, when estimating a failure term, we can consider the tree as rooted at the corresponding latent variable.
We introduce notation for the likelihood of an event in the graphical model in which all non-descendants of a given node are removed (i.e., all that remains is the subtree rooted at that node).
Conditioning on the variable taking a given value, the likelihood is written as:
(15) 
Correction terms are defined as
(16) 
Using Equation D, it is easy to see that:
(17) 
It remains to be shown that can be estimated from low order moments as in Equation 9.
(18) 
We can expand as follows:
(19) 
where the second equality follows from the conditional independence properties induced by the tree structure when conditioning on the relevant node.
We now substitute Eq. D into Eq. 18 and treat the numerator (num) and denominator (denom) separately:
num  
(20) 
The product over the children does not depend on the summation variable and can be pulled out of the sum.
num  
(21) 
Expanding the denominator using Equation D:
(22) 
Canceling the leak term and the product over the remaining children from both the numerator and denominator, we are left with:
as desired.
Appendix E Dataset preparation
In this section we provide additional details about the preparation of the two realworld datasets used in the experimental results.
Emergency: The Emergency dataset consists of the following fields from patients’ medical records: current medications (medication reconciliation) and administered medications (Pyxis records), mapped to GSN (generic sequence number) codes; a free-text concatenation of chief complaint, triage assessment, and MD comments; age binned by decade; sex; and administrative ICD-9 billing codes (used to establish ground truth but not visible during learning or testing). We apply negation detection to the free-text section using “negex” rules (Chapman et al., 2001) with some manual adaptations appropriate for Emergency department notes (Jernite et al., 2013b), and replace common bigrams with a single token (e.g., “chest pain” is replaced by “chest_pain”). We reduce the dataset from 273,174 patients to 16,268 by filtering out all patients that have fewer than 2 of the manually specified conditions. We filter out words that appear in more than a fixed fraction of the dataset and take the 1000 most common words after that filtering. Table LABEL:tab:medical_concepts lists the concepts that are used and a selection of their anchors, specified by a physician research collaborator. In the feature vector, anchors are replaced by a single feature representing the union of the anchors (i.e., whether any of the anchors appear in the patient record).
Stack Overflow: Questions were initially filtered to remove all questions that do not have at least two of the 200 most popular tags. We filter out words that appear in more than a fixed fraction of the dataset and take the 1000 most common words after that filtering. Tag names that contain multiple words are treated as n-grams and replaced by a single token in the text.
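The bigram-merging step used in both datasets can be sketched as follows; the token list and the bigram vocabulary here are illustrative, not the paper's.

```python
# Minimal sketch of bigram merging: replace each known adjacent pair of tokens
# with a single underscore-joined token (vocabulary below is illustrative).
def merge_bigrams(tokens, bigrams):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # consume both tokens of the matched bigram
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = merge_bigrams("complains of chest pain since morning".split(), {("chest", "pain")})
```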
Appendix F Detailed methods
F.1 Regularization
We found it useful to introduce an additional regularization parameter to Equation 4 that encourages the recovered marginals to be close to independent unless there is strong evidence otherwise.
(23) 
where the regularization target is a marginal vector constructed from the single-variable marginals of an independent distribution.
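A minimal sketch of this regularized recovery objective, assuming a squared-error data-fitting term; the names R, p, mu_indep, and lam are ours, and the paper's exact loss may differ.

```python
import numpy as np

def objective(mu, R, p, mu_indep, lam):
    """Moment-matching loss plus a pull toward the independent-distribution marginals."""
    fit = np.sum((R @ mu - p) ** 2)            # match the observed moments (assumed form)
    reg = lam * np.sum((mu - mu_indep) ** 2)   # stay close to the independent marginals
    return fit + reg

# Tiny illustrative evaluation: at the independent marginals with perfectly
# matching moments, both terms vanish.
mu0 = np.full(4, 0.25)
val = objective(mu0, np.eye(4), np.eye(4) @ mu0, mu0, 0.5)
```

With lam = 0 this reduces to the unregularized recovery problem; larger lam pulls the solution toward independence unless the data term strongly disagrees.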
For recovery of marginals under local consistency and marginal polytope constraints, we use the conditional gradient algorithm, as discussed in Section 3.2, using a tolerance on the duality gap (as described in Jaggi (2013)) as a stopping criterion, with separate tolerance settings for Stack Overflow and for the medical records. When using simplex constraints, we use the more appropriate exponentiated gradient algorithm (Kivinen & Warmuth, 1995) for each marginal independently. We use separate regularization parameter settings for the two quantities being learned; the moments required for the second are recovered using only simplex constraints.
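A minimal sketch of the exponentiated gradient update used for the simplex-constrained case; the quadratic loss, step size, and iteration count are illustrative choices, not the paper's settings.

```python
import numpy as np

def eg_minimize(grad, x0, eta=0.5, steps=500):
    """Exponentiated gradient descent; every iterate stays on the probability simplex."""
    x = x0.copy()
    for _ in range(steps):
        x = x * np.exp(-eta * grad(x))  # multiplicative update
        x /= x.sum()                    # renormalize onto the simplex
    return x

# Example: minimize ||x - q||^2 over the simplex; the minimizer is q itself.
q = np.array([0.7, 0.2, 0.1])
x = eg_minimize(lambda x: 2.0 * (x - q), np.full(3, 1.0 / 3.0))
```

Because the update is multiplicative followed by renormalization, iterates remain valid probability vectors throughout, which is what makes EG natural for per-marginal simplex constraints.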
Linear programs in the conditional gradient algorithm are solved using Gurobi (Gurobi Optimization, 2014) and integer linear programs are solved using Toulbar2 (Sanchez et al. , 2008). For structure learning we use the gobnilp package (Cussens & Bartlett, 2013) with a BIC score objective, though we note that any exact or approximate structure learner that takes a decomposable score as input could be used equally well.
F.2 Conditional gradient algorithm for moment recovery
Algorithm 2 describes the conditional gradient (Frank & Wolfe, 1956) algorithm that was used for moment recovery. Line 3 minimizes a linear objective over a compact convex set; in our setting, this is the marginal polytope or one of its relaxations. To minimize a linear objective over a compact convex set, it suffices to search over the vertices of the set. For the marginal polytope, these correspond to the integral vertices of the local consistency polytope, so this step can be solved as an integer linear program with local consistency constraints.
We use a “fully corrective” variant of the conditional gradient procedure. If the minimization of line 3 returns a vertex that has previously been used, we perform an additional step, moving to the point that minimizes the objective over convex combinations of all previously seen vertices.
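The conditional gradient loop can be sketched as follows. Here the feasible set is the probability simplex, whose vertex search is a simple argmin, standing in for the paper's ILP over marginal-polytope vertices; the quadratic objective is illustrative, and the fully corrective refinement is omitted for brevity.

```python
import numpy as np

def frank_wolfe(grad, dim, steps=2000):
    """Conditional gradient over the probability simplex."""
    x = np.full(dim, 1.0 / dim)
    for t in range(steps):
        g = grad(x)
        s = np.zeros(dim)
        s[np.argmin(g)] = 1.0        # line 3: best vertex of the feasible set
        gamma = 2.0 / (t + 2.0)      # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s
    return x

# Example: recover a target distribution q by minimizing ||x - q||^2.
q = np.array([0.5, 0.3, 0.2])
x = frank_wolfe(lambda x: 2.0 * (x - q), 3)
```

Every iterate is a convex combination of vertices, so feasibility is maintained without any projection step, which is the appeal of conditional gradient over complicated polytopes.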
F.3 Monte Carlo EM
When running EM, we optimize a variational lower bound on the data likelihood.
The EM algorithm optimizes this bound by coordinate ascent, alternating between E-steps, which improve the variational distribution, and M-steps, which improve the model parameters. Usually both the E-step and the M-step are full maximization steps, but incomplete M-steps that merely improve the bound also lead to monotonic improvement with every step. In this section, we describe a variant of Monte Carlo EM that is useful for this model, using Gibbs sampling to approximate the E-step and a custom M-step that is guaranteed to improve the variational lower bound at every step. We hold the distribution over the latent variables fixed and only optimize the failure and leak probabilities. For these purposes, the leak probabilities can be treated simply as failure probabilities of an extra latent variable whose value is always 1.
F.3.1 Outer E-step
For the E-step, we use the Gibbs sampling procedure described in Section 4.3 of Wang et al. (2014).
F.3.2 Outer M-step
The M-step consists of a coordinate step guaranteed to improve the bound for a fixed variational distribution. At this point the model is fully observed, but the optimization has no closed form. Instead, we introduce auxiliary variables and adopt the generative model described in Algorithm 3, which equivalently describes the fully observed noisy-or model. In this new expanded model, we can perform a single EM step with respect to the auxiliary latent variables in closed form.
Inner E-step:
(24) 
Inner M-step:
(25) 
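These inner steps admit the standard closed-form auxiliary-variable updates for a fully observed noisy-or gate. The following is a hedged reconstruction from that standard derivation, not a verbatim copy of Equations 24–25: the notation (failure probabilities $f_j$, leak $f_0$, auxiliary activation indicators $z_{ij}$ for instance $i$ and parent $j$) is ours and may differ from the paper's.

```latex
\text{inner E-step:}\qquad
q_{ij} \;=\; \mathbb{E}\!\left[z_{ij}\mid x_i, y_i\right] \;=\;
\begin{cases}
0 & x_i = 0,\\[4pt]
\dfrac{y_{ij}\,(1 - f_j)}{1 - f_0 \prod_{k} f_k^{\,y_{ik}}} & x_i = 1,
\end{cases}
\qquad\quad
\text{inner M-step:}\qquad
f_j \;\leftarrow\; 1 - \frac{\sum_i q_{ij}}{\sum_i y_{ij}}.
```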