# Predictive Learning on Sign-Valued Hidden Markov Trees

We provide high-probability sample complexity guarantees for exact structure recovery and accurate Predictive Learning using noise-corrupted samples from an acyclic (tree-shaped) graphical model. The hidden variables follow a tree-structured Ising model distribution, whereas the observable variables are generated by a binary symmetric channel taking the hidden variables as its input. This model arises naturally in a variety of applications, such as in physics, biology, computer science, and finance. The noiseless structure learning problem has been studied earlier by Bresler and Karzand (2018); this paper quantifies how noise in the hidden model impacts the sample complexity of structure learning and predictive distributional inference by proving upper and lower bounds on the sample complexity. Quite remarkably, for any tree with p vertices and probability of incorrect recovery δ>0, the order of the necessary number of samples remains logarithmic as in the noiseless case, i.e., O(log(p/δ)), for both aforementioned tasks. We also present a new equivalent of Isserlis' Theorem for sign-valued tree-structured distributions, yielding a new low-complexity algorithm for higher order moment estimation.

## Authors

Konstantinos E. Nikolakakis, et al. ∙ 12/11/2018

## 1 Introduction

Graphical models are a useful tool for modeling high-dimensional structured data. Indeed, the graph captures structural dependencies: its edge set corresponds to (often physical) interactions between variables. There is a long and deep literature on graphical models (see Koller and Friedman (2009) for a comprehensive introduction), and they have found wide applications in areas such as image processing and vision (Liu et al., 2017; Schwing and Urtasun, 2015; Lin et al., 2016a; Morningstar and Melko, 2017; Wu et al., 2017; Li and Wand, 2016; Wang et al., 2017; Wainwright et al., 2003), signal processing (Wisdom et al., 2016; Kim and Smaragdis, 2013), and gene regulatory networks (Zuo et al., 2017; Banf and Rhee, 2017), to name a few.

An undirected graphical model, or Markov random field (MRF) in particular, is defined in terms of a hypergraph G = (V, E), which models the Markov properties of a joint distribution on the variables X_{1:p} = (X_1, …, X_p). A tree-structured graphical model is one in which G is a tree. We denote the tree-structured model as T = (V, E). In this paper, we consider binary models on the variables (X_{1:p}, Y_{1:p}), where the joint distribution of X_{1:p} is a tree-structured Ising model distribution on {−1,+1}^p and Y_i = N_i X_i is a noisy version of X_i, where N_1, …, N_p are independent and identically distributed (i.i.d.) Rademacher noise variables. We refer to X_{1:p} as the hidden layer and to Y_{1:p} as the observed layer. Under this setting, our objective is to recover the underlying tree structure of the hidden layer (with high probability) using only the noisy observations Y_{1:p}. This is non-trivial because Y_{1:p} does not itself follow any tree structure; the situation is similar to more traditional problems in nonlinear filtering, where a Markov process of known distribution (and thus, of known structure) is observed through noisy measurements (Arulampalam et al., 2002; Jazwinski, 2007; Van Handel, 2009; Douc et al., 2011; Kalogerias and Petropulu, 2016). The sample complexity of the noiseless version of our model was recently studied by Bresler and Karzand (2018), where the well-known Chow-Liu algorithm (Chow and Liu, 1968) is employed for tree reconstruction. We also consider the Chow-Liu algorithm (more precisely, a slightly modified version of it) in this paper.
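To make the data model concrete, the following sketch simulates the hidden layer and the noisy observations. This is a minimal illustration; the function name, the specific tree, and the parameter values are our own choices, not from the paper. It draws the root uniformly, propagates signs along the tree so that each edge correlation equals tanh of the edge parameter, and then flips each coordinate independently with probability q:

```python
import math
import random

def sample_hidden_tree_ising(edges, theta, q, rng):
    """One sample (x, y): x from a tree Ising model with uniform root marginal,
    y a noisy copy of x with each sign flipped independently w.p. q."""
    p = max(max(e) for e in edges) + 1
    x = [0] * p
    x[0] = rng.choice([-1, 1])                      # uniform root marginal
    for (i, j) in edges:                            # edges listed parent -> child
        # P(X_j = X_i) = (1 + tanh(theta_ij)) / 2, so E[X_i X_j] = tanh(theta_ij)
        same = rng.random() < (1 + math.tanh(theta[(i, j)])) / 2
        x[j] = x[i] if same else -x[i]
    y = [xi if rng.random() >= q else -xi for xi in x]
    return x, y
```

Averaging the product Y_i Y_j over many such samples recovers (1 − 2q)² tanh θ_{ij}, the attenuation of pairwise correlations that appears later in (12).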

### 1.1 Statement of Contributions

We are interested in answering the following general question: how does noise affect the sample complexity of the structure learning procedure? That is, given only noisy observations, our goal is to learn the tree structure of the hidden layer, in a well-defined and meaningful sense. In turn, the estimated structure is an essential statistic for estimating the underlying distribution of the hidden layer, allowing for Predictive Learning.

Specifically, based on the structure estimate, we are also interested in appropriately approximating the tree-structured distribution under study, which can then be used for accurate predictions, in regard to the statistical structure of the underlying tree. We also consider the problem of hidden layer higher order moment estimation of sign-valued hidden Markov fields on trees and, in particular, how such estimation can be efficiently performed, on the basis of noisy observations.

Our contributions may be summarized as follows:

• An upper bound on the number of samples sufficient to recover the exact hidden structure with high probability, using the Chow-Liu algorithm. Complementarily, a lower bound on the number of samples necessary for the hidden structure learning task, valid for any estimator.

• Determination of the sufficient and necessary number of samples for accurate predictive learning. We analyze the sample complexity of learning distribution estimates that can accurately provide predictions on the hidden tree. The estimates are computed from noisy data.

• A closed-form expression and a computationally efficient estimator for higher-order moment estimation in sign-valued tree-structured Markov random fields.

• Sample complexity analysis for accurate distribution estimates with respect to the KL-divergence.

### 1.2 Structure Learning for Undirected Graphical Models and Related Work

For a detailed review of methods involving undirected and directed graphical models, see the relevant article by Drton and Maathuis (2017). In general, learning the structure of a graphical model from samples can be intractable (Karger and Srebro, 2001; Højsgaard et al., 2012). For general graphs, neighborhood selection methods (Bresler, 2015; Ray et al., 2015; Jalali et al., 2011) estimate the conditional distribution for each vertex, to learn the neighborhood of each node and therefore the full structure. These approaches may use greedy search or regularization. For Gaussian or Ising models, ℓ1-regularization (Ravikumar et al., 2010), the GLasso (Yuan and Lin, 2007; Banerjee et al., 2008), or coordinate descent approaches (Friedman et al., 2008) have been proposed, focusing on estimating the non-zero entries of the precision (or interaction) matrix. Model selection can also be performed using score matching methods (Hyvärinen, 2005, 2007; Nandy et al., 2015; Lin et al., 2016b), or Bayesian information criterion methods (Foygel and Drton, 2010; Gao et al., 2012; Barber et al., 2015). Other works address non-Gaussian models such as elliptical distributions, t-distribution models or latent Gaussian data (Vogel and Fried, 2011; Vogel and Tyler, 2014; Bilodeau, 2014; Finegold and Drton, 2011), or even mixed data (Fan et al., 2017).

For tree- or forest-structured models, exact inference and the structure learning problem are significantly simpler: the Chow-Liu algorithm provides an estimate of the tree or forest structure of the underlying graph (Bresler and Karzand, 2018; Chow and Liu, 1968; Wainwright et al., 2008; Tan et al., 2011; Liu et al., 2011; Edwards et al., 2010; Daskalakis et al., 2018). Furthermore, marginal distributions and maximum values are simpler to compute using a variety of algorithms (sum-product, max-product, message passing, variational inference) (Lauritzen, 1996; Pearl, 1988; Wainwright et al., 2008, 2003).

The noiseless counterpart of the model considered in this paper was studied recently by Bresler and Karzand (2018); here, we extend their results to the hidden case, where samples from a tree-structured Ising model are passed through a binary symmetric channel with crossover probability q ∈ [0, 1/2). Of course, in the special case of a linear graph, our model reduces to a hidden Markov model. Latent variable models are often considered in the literature when some variables of the graph are deterministically unobserved (Chandrasekaran et al., 2010; Anandkumar and Valluvan, 2013; Ma et al., 2013; Anandkumar et al., 2014). Our model is most similar to that studied by Chaganty and Liang (2014), in which a hidden model is considered with a discrete exponential distribution and Gaussian noise. They solve the parameter estimation problem by using moment matching and pseudo-likelihood methods; the structure can be recovered indirectly using the estimated parameters.

Notation.

Boldface indicates a vector or tuple, and calligraphic face is used for sets and trees. The sets of even and odd natural numbers are 2ℕ and 2ℕ+1, respectively. For an integer p, define [p] ≜ {1, 2, …, p}. The indicator function of a set A is 𝟙_A. For a graph T = (V, E), where V indexes the set of variables X_{1:p}, the correlation of any pair of vertices i, j ∈ V is μ_{i,j} ≜ E[X_i X_j], and for any edge e = (i, j) ∈ E it is μ_e ≜ E[X_i X_j]. For two nodes i, j of a tree T, the term path(i, j) denotes the set of edges in the unique path with endpoints i and j. The binary symmetric channel is a conditional distribution from {−1,+1}^p to {−1,+1}^p that acts componentwise independently on X_{1:p} to generate Y_{1:p}, where Y_i = N_i X_i and N_{1:p} is a vector of i.i.d. Rademacher variables equal to −1 with probability q. We use the symbol † to indicate the corresponding quantity for the observable (noisy) layer. For instance, p†(·) is the probability mass function of Y_{1:p}, and μ†_{i,j} corresponds to the correlation of the variables Y_i, Y_j, where Y_{1:p} generates noisy observations of X_{1:p}. Also, BSC(q, p) denotes a binary symmetric channel with crossover probability q and blocklength p. For our readers’ convenience, we summarize the notation in Table 1.

## 2 Preliminaries and Problem Statement

In this section, we introduce our model of hidden sign-valued Markov random fields on trees.

### 2.1 Undirected Graphical Models

We consider sign-valued graphical models where the joint distribution has support {−1,+1}^p. Let X_{1:p} = (X_1, …, X_p) be a collection of sign-valued (binary) random variables. We consider distributions p on {−1,+1}^p of the form

 p(\mathbf{x}) = \mathbb{E}\Big[\prod_{i=1}^{p}\mathbb{1}_{\{X_i=x_i\}}\Big] = \frac{1}{2^p}\Big[1+\sum_{k\in[p]}\ \sum_{S\subset V:\,|S|=k}\mathbb{E}\Big[\prod_{s\in S}X_s\Big]\prod_{s\in S}x_s\Big], \quad \mathbf{x}\in\{-1,+1\}^p. (1)

In this paper we assume that the marginal distributions of the X_i are uniform, that is,

 \mathbb{P}(X_i=\pm 1)=\tfrac{1}{2}, \quad \forall i\in V. (2)

Thus, E[X_i] = 0 for all i ∈ V. A distribution p is Markov with respect to a hypergraph G = (V, E) if for every node i ∈ V it is true that p(x_i | x_{V∖{i}}) = p(x_i | x_{N(i)}), where N(i) is the set of neighbors of i in G. One subclass of distributions for which the Markov property holds is the Ising model, in which each random variable is sign-valued and the hypergraph is a simple undirected graph, indicating that variables have only pairwise and unitary interactions. The joint distribution for the Ising model with zero external field is given by

 p(\mathbf{x}) = \frac{1}{Z(\theta)}\exp\Big\{\sum_{(s,t)\in E}\theta_{st}\,x_s x_t\Big\}, \quad \mathbf{x}\in\{-1,1\}^p. (3)

The θ_{st} are parameters of the model representing the interaction strength of the variables, and Z(θ) is the partition function. These interactions are expressed through potential functions which ensure that the Markov property holds with respect to the graph G. Next, we discuss the properties of distributions of the form of (1) which are Markov with respect to a tree.

### 2.2 Sign-Valued Markov Fields on Trees

From prior work by Lauritzen (1996), it is known that any distribution p which is Markov with respect to a tree (or forest) factorizes as

 p(\mathbf{x}) = \prod_{i\in V}p(x_i)\prod_{(i,j)\in E}\frac{p(x_i,x_j)}{p(x_i)\,p(x_j)}, \quad \mathbf{x}\in\{-1,+1\}^p, (4)

and we call p a tree- (forest-) structured distribution, to indicate the factorization property. If the distribution p has the form of (1) with E[X_i] = 0 for all i ∈ V, and is Markov with respect to a tree T = (V, E), then

 p(\mathbf{x}) = \frac{1}{2}\prod_{(i,j)\in E}\frac{1+x_i x_j\,\mathbb{E}[X_i X_j]}{2} (5)

and

 \mathbb{E}[X_i X_j] = \prod_{e\in\mathrm{path}(i,j)}\mu_e, \quad \text{for all } i,j\in V (6)

(see Appendix A, Lemma A). Additionally, let us state the definition of the so-called Correlation (coefficient) Decay Property (CDP), which will be of central importance in our analysis.
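Both facts can be checked by brute-force enumeration on a small tree. The sketch below uses our own toy example, a 3-leaf star with hand-picked edge correlations; it builds p from the factorization (5) and verifies that the correlation of two leaves equals the product of the edge correlations along their path, as in (6):

```python
import itertools

# Hypothetical star tree: center 0 with leaves 1, 2, 3, and chosen edge correlations.
edges = [(0, 1), (0, 2), (0, 3)]
mu = {(0, 1): 0.9, (0, 2): 0.5, (0, 3): -0.7}

def p(x):
    # Factorization (5): p(x) = (1/2) * prod over edges of (1 + x_i x_j mu_ij) / 2
    out = 0.5
    for (i, j) in edges:
        out *= (1 + x[i] * x[j] * mu[(i, j)]) / 2
    return out

states = list(itertools.product([-1, 1], repeat=4))
assert abs(sum(p(x) for x in states) - 1.0) < 1e-12          # valid pmf

# Property (6): E[X_1 X_2] equals mu_{0,1} * mu_{0,2} (the path 1-0-2).
corr_12 = sum(p(x) * x[1] * x[2] for x in states)
assert abs(corr_12 - mu[(0, 1)] * mu[(0, 2)]) < 1e-12
```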

###### Definition 1

The CDP holds if and only if |E[X_i X_j]| ≤ |E[X_k X_l]| for all tuples (i, j, k, l) such that path(k, l) ⊆ path(i, j).

The CDP is a well-known attribute of acyclic Markov fields (see, e.g., Tan et al. (2010), Bresler and Karzand (2018)). Further, it is true that the products X_i X_j for all (i, j) ∈ E are independent, and the CDP holds for every p of the form of (1) which factorizes with respect to a tree (see Lemma A, Appendix A). This is a consequence of property (6) and the inequality |μ_e| < 1, for all e ∈ E. We can interpret the CDP as a type of data processing inequality (see Cover and Thomas (2012)). The connection is clear through the relationship between the mutual information I(X_i; X_j) and the correlations E[X_i X_j], namely,

 I(X_i;X_j) = \frac{1}{2}\log_2\!\Big((1-\mathbb{E}[X_iX_j])^{\,1-\mathbb{E}[X_iX_j]}\,(1+\mathbb{E}[X_iX_j])^{\,1+\mathbb{E}[X_iX_j]}\Big), (7)

for any pair of nodes i, j ∈ V. This expression shows that the mutual information is a symmetric function of E[X_i X_j] and increasing with respect to |E[X_i X_j]| (see also Lemma A, Appendix A).

Tree-structured Ising models: Despite its simple form, the Ising model has numerous useful properties. In particular, (5) and (6) hold for any tree-structured Ising model with uniform marginal distributions, E[X_i] = 0 for all i ∈ V. Furthermore,

 \mathbb{E}[X_iX_j] = \tanh\theta_{ij}, \quad \forall (i,j)\in E_T, (8)

which implies that

 p(\mathbf{x}) = \frac{1}{2}\prod_{(i,j)\in E_T}\frac{1+x_i x_j\tanh\theta_{ij}}{2}, \quad \mathbf{x}\in\{-1,1\}^p,\ \alpha<|\theta_{ij}|\le\beta, (9)
 \mathbb{E}[X_iX_j] = \prod_{e\in\mathrm{path}(i,j)}\mu_e = \prod_{e\in\mathrm{path}(i,j)}\tanh(\theta_e), \quad \forall i,j\in V. (10)

A short argument showing (8) and (9) is included in Appendix A, Lemma A. For the rest of the paper, we assume a tree-structured Ising model for the hidden variables X_{1:p}; that is, the distribution of X_{1:p} has the form of (5). We also impose a reasonable compactness assumption on the respective interaction parameters, as follows.

###### Assumption 1

There exist constants 0 < α ≤ β < ∞ such that, for the distribution p, α < |θ_{ij}| ≤ β for all (i, j) ∈ E_T.

For a fixed tree structure T, and for future reference, we hereafter let P_T(α, β) be the class of Ising models satisfying Assumption 1.

### 2.3 Hidden Sign-Valued Tree-Structured Models

The problem considered in this paper is that of learning a tree-structured model from corrupted observations. Instead of observing samples X^{(1)}, …, X^{(n)} of X_{1:p}, we observe Y^{(1)}, …, Y^{(n)}, where Y^{(k)} is a noisy version of X^{(k)} for k ∈ [n]. To formalize this, consider a hidden Markov random field whose hidden layer X_{1:p} is an Ising model with respect to a tree, i.e., p(·) ∈ P_T(α, β), as defined in (9). The observed variables are formed by setting Y_i = N_i X_i for all i ∈ V, where the N_i are i.i.d. Rademacher random variables with P(N_i = −1) = q. Let p†(·) be the distribution of the observed variables Y_{1:p}. We can think of Y_{1:p} as the result of passing X_{1:p} through a binary symmetric channel BSC(q, p). This yields the following formulas:

 \mathbb{E}[N_r] = 1-2q, \quad \forall r\in V, \ q\in[0,\tfrac{1}{2}], (11)
 \mu^{\dagger}_{r,s} \triangleq \mathbb{E}[Y_rY_s] = \mathbb{E}[N_rX_rN_sX_s] = (1-2q)^2\,\mathbb{E}[X_rX_s], \quad \forall r,s\in V. (12)

The constant (1 − 2q)^2 will feature prominently in the analysis. The distribution p†(·) of Y_{1:p} also has support {−1,+1}^p, and so the joint distribution satisfies the general form (1). Since the marginal distribution of each Y_i is also uniform, P(Y_i = ±1) = 1/2 for all i ∈ V, (1) and (11) yield

 p^{\dagger}(\mathbf{y}) = \frac{1}{2^p}\Big[1+\sum_{k\in 2\mathbb{N}:\,k\le p}\ \sum_{S\subset V:\,|S|=k}(1-2q)^{|S|}\,\mathbb{E}\Big[\prod_{s\in S}X_s\Big]\prod_{s\in S}y_s\Big]. (13)

The moments of the hidden variables in (13) can be expressed as products of the pairwise correlations μ_{i,j}, for any i, j ∈ V (Section 3.3, Theorem 2). From (13) it is clear that the distribution of Y_{1:p} does not factorize with respect to any tree; that is, p†(·) ∉ P_T(α, β) in general (Lemma F shows the structure-preserving property for the observable layer for specific choices of the hidden layer’s tree structure). The next subsections present the contributions and the most important questions handled in this paper.

### 2.4 Hidden Structure Estimation

We are interested in characterizing the sample complexity of structure recovery: given data generated from p†(·) for an unknown tree T, what is the minimum number of samples n needed to recover the (unweighted) edge set of T with high probability? In particular, we would like to quantify how n depends on the crossover probability q. Intuitively, noise makes "weak" edges appear "weaker", and the sample complexity is expected to be an increasing function of q. Because the distribution of the observable variables does not follow the tree structure T, this problem does not follow directly from the noiseless case. In this work, we use the Chow-Liu algorithm to estimate the hidden tree structure. Specifically, we analyze the sample complexity of Algorithm 1. Our model allows us to retrieve a coherent variation of the original algorithm by Chow and Liu (1968). The consistency of Algorithm 1 is explained in depth in the next sections (see Section 3.1, Section 4.1).

### 2.5 Evaluating the Accuracy of the Estimated Distribution

In addition to recovering the graph structure, we are interested in the "goodness of fit" of the estimated distribution. We measure this through the "small set Total Variation" (or "ssTV") distance as defined by Bresler and Karzand (2018):

 L^{(k)}(P,Q) \triangleq \sup_{S\subset V:\,|S|=k} d_{\mathrm{TV}}(P_S,Q_S), (14)

where P_S and Q_S are the marginal distributions of P and Q on the set S, d_TV(·,·) is the total variation distance, and k ∈ [p]. If Q is an estimate of P, the L^{(k)} norm guarantees predictive accuracy because (Bresler and Karzand, 2018, Section 3.2)

 \mathbb{E}_{X_S}\big[\,|P(X_i=+1\,|\,X_S)-Q(X_i=+1\,|\,X_S)|\,\big] \le 2\,L^{(|S|+1)}(P,Q). (15)

We propose an estimator for the distribution of the hidden variables, p(·), given only noisy observations. We also design the estimator to factorize according to the estimated structure T_CL†, where the latter results as the output of the clipped Chow-Liu, or Chow-Liu-CC, algorithm (see Algorithm 1). The Chow-Liu-CC algorithm constitutes a slightly modified version of the original Chow-Liu algorithm: if, for some pair (i, j), the magnitude of the correlation estimate exceeds the largest value allowed by the model class, its value is clipped to that bound. This extra step reduces the estimation error and simplifies the analysis (see Section 4.4). Our main result gives a lower bound on the number of samples needed to guarantee accurate estimation (in the sense of small "ssTV"), with high probability.
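For small p, the ssTV in (14) with k = 2 can be computed exactly by enumerating pair marginals. The helpers below are a sketch; the function names and the toy chain distributions are our own, not the paper's:

```python
import itertools

def pair_marginal(pmf, i, j, states):
    """Marginal of pmf (dict: state tuple -> prob) on coordinates (i, j)."""
    m = {}
    for x in states:
        m[(x[i], x[j])] = m.get((x[i], x[j]), 0.0) + pmf[x]
    return m

def sstv2(pmf_p, pmf_q, p_nodes):
    """L^(2)(P, Q) of (14): worst total variation over all pair marginals."""
    states = list(itertools.product([-1, 1], repeat=p_nodes))
    worst = 0.0
    for i in range(p_nodes):
        for j in range(i + 1, p_nodes):
            mp = pair_marginal(pmf_p, i, j, states)
            mq = pair_marginal(pmf_q, i, j, states)
            worst = max(worst, 0.5 * sum(abs(mp[s] - mq[s]) for s in mp))
    return worst

def chain_pmf(mu01, mu12):
    """Toy 3-node chain built from the factorization (5)."""
    states = itertools.product([-1, 1], repeat=3)
    return {x: 0.5 * (1 + x[0] * x[1] * mu01) / 2 * (1 + x[1] * x[2] * mu12) / 2
            for x in states}
```

For example, two chains that differ only in the second edge correlation (0.6 versus 0.2) have L^{(2)} = 0.2, attained at the pair (1, 2).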

## 3 Main Results

The main question asked by this paper is as follows: what is the impact of noise on the sample complexity of learning a tree-structured graphical model in order to make predictions? This corresponds to sampling variables generated by the model (3) and randomly flipping each sign independently with probability q (a binary symmetric channel). The Chow-Liu algorithm estimates the hidden structure through noise-corrupted observations, which constitute the output of the BSC. Hidden (binary) model structures arise in a variety of applications. A typical example is classification where a subset of the data is misclassified. In such a case, corrupted data are observed; however, we are still able to retrieve the underlying structure by considering the appropriate number of samples.

Regarding this work, we first derive the Chow-Liu algorithm’s sufficient and necessary sample bounds for exact hidden structure recovery, assuming that only noisy observations are available. The two theorems of Section 3.1 correspond to the sufficient and the necessary number of samples, respectively.

Secondly, we use the structure statistic to derive an accurate estimate of the hidden layer’s probability distribution. The distribution estimate is computed to be accurate under the "small set Total Variation" or "ssTV" utility measure, which was introduced by

Bresler and Karzand (2018). Furthermore, the estimator of the distribution factorizes with respect to the structure estimate, while the "ssTV" metric ensures that the estimated distribution is a trustworthy predictor. The two theorems of Section 3.2 give the sufficient and the necessary sample complexity for accurate distribution estimation from noisy samples. These theorems generalize the results for the noiseless case (q = 0) by Bresler and Karzand (2018) and lead to interesting connections between structure learning on hidden models and data processing inequalities.

The third part of the results includes Theorem 2, which gives a closed-form expression for the higher-order moments of sign-valued Markov fields on trees, an equivalent of Isserlis’ theorem. Based on Theorem 2, we propose a low-complexity algorithm which estimates any higher-order moment of the hidden variables, requiring as input only the estimated tree structure and estimates of the pairwise correlations (both evaluated from noise-corrupted observations).

Finally, Theorem 3.3 gives the sufficient number of samples for distribution estimation when the symmetric KL divergence is considered as the utility measure. The latter gives rise to extensions of testing algorithms (Daskalakis et al., 2018) to a hidden model setting.

### 3.1 Tree Structure Learning from Noisy Observations

Our goal is to learn the tree structure T of an Ising model with parameters θ, when the nodes X_{1:p} are hidden variables and we observe Y_i = N_i X_i, i ∈ V, where the N_i are i.i.d. with P(N_i = −1) = q for all i ∈ V. We derive the estimated structure T_CL† by applying the Chow-Liu algorithm (Algorithm 1) (Chow and Liu, 1968).

Instead of mutual information estimates, the Chow-Liu algorithm (Algorithm 1) requires correlation estimates, which are sufficient statistics because of expression (7). Further, it can consistently recover the hidden structure through noisy observations. The latter is true because of the order-preserving property of the mutual information: the stochastic mapping from X_{1:p} to Y_{1:p} allows structure recovery of T by observing Y_{1:p}, because for any tuple (i, j, k, l) such that I(X_i; X_j) ≤ I(X_k; X_l), it is true that I(Y_i; Y_j) ≤ I(Y_k; Y_l). The proof comes directly from (7) and (12). In addition, the monotonicity of mutual information with respect to the absolute values of the correlations allows us to apply the Chow-Liu algorithm directly on the estimated correlations. Notice that the noisy correlation estimates can be used as an alternative to the hidden-layer ones because of (12). The algorithm returns the maximum spanning tree T_CL†. Further discussion of the Chow-Liu algorithm is given in Section 4.1. The following theorem provides the sufficient number of samples for exact structure recovery through noisy observations.
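The maximum-spanning-tree step can be sketched with Kruskal's algorithm on the absolute correlation estimates. This is a minimal illustration with our own function name and toy weights; the paper's Algorithm 1 additionally includes the clipping step, omitted here:

```python
def chow_liu_tree(abs_corr, p):
    """Maximum-weight spanning tree over vertices 0..p-1 via Kruskal,
    with weights |mu_hat_dagger_{i,j}| (keys are pairs (i, j), i < j)."""
    parent = list(range(p))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    tree = []
    for (i, j) in sorted(abs_corr, key=abs_corr.get, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:                        # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Because noise rescales every correlation by the same factor (1 − 2q)², by (12), the edge ordering, and hence the recovered tree, is unchanged.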

[Sufficient number of samples for structure learning] Let Y_{1:p} be the output of a BSC(q, p) with input variable X_{1:p}. Fix a number δ ∈ (0, 1). If the number of samples n† of Y_{1:p} satisfies the inequality

 n^{\dagger} \ge \frac{32\,\big[1-(1-2q)^4\tanh\beta\big]}{(1-2q)^4\,(1-\tanh\beta)^2\,\tanh^2\alpha}\,\log\frac{2p^2}{\delta}, (16)

then Algorithm 1 returns T_CL† = T with probability at least 1 − δ.
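Taking the bound at face value, one can tabulate how the requirement grows with the noise level. The sketch below reads the right-hand side of (16) as 32[1 − (1 − 2q)⁴ tanh β] / ((1 − 2q)⁴ (1 − tanh β)² tanh²α) · log(2p²/δ); the specific parameter values are arbitrary illustrative choices:

```python
import math

def n_sufficient(alpha, beta, q, p, delta):
    """Right-hand side of (16), read as
    32 [1 - (1-2q)^4 tanh(b)] / ((1-2q)^4 (1-tanh(b))^2 tanh(a)^2) * log(2 p^2 / delta)."""
    c = (1 - 2 * q) ** 4
    tb, ta = math.tanh(beta), math.tanh(alpha)
    return 32 * (1 - c * tb) / (c * (1 - tb) ** 2 * ta ** 2) * math.log(2 * p ** 2 / delta)

# Illustrative values: noise strictly inflates the sufficient sample size,
# and the requirement blows up as q approaches 1/2.
noiseless = n_sufficient(0.2, 0.4, 0.0, 100, 0.05)
mild      = n_sufficient(0.2, 0.4, 0.1, 100, 0.05)
heavy     = n_sufficient(0.2, 0.4, 0.3, 100, 0.05)
```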

Complementary to the sufficiency theorem above, our next result characterizes the necessary number of samples required for exact structure recovery. Specifically, we prove a lower bound on the sample complexity, which characterizes the necessary number of samples for any estimator ψ.

[Necessary number of samples for structure learning] Let Y_{1:p} be the output of a BSC(q, p) with input variable X_{1:p}. If the given number of samples n† of Y_{1:p} satisfies the inequality

 n^{\dagger} < \big[1-(4q(1-q))^p\big]^{-1}\,\frac{1}{16\,\alpha\tanh(\alpha)\,e^{2\beta}}\,\log(p), (17)

then for any estimator ψ, it is true that

 \inf_{\psi}\ \sup_{T\in\mathcal{T}_p,\ p(\cdot)\in\mathcal{P}_T(\alpha,\beta)} \mathbb{P}\big(\psi(Y^{1:n^{\dagger}})\neq T\big) > \frac{1}{2}. (18)

It can be shown that the right-hand side of (16) is greater than the right-hand side of (17) for any q in [0, 1/2) (and for all possible values of the remaining parameters), by simply comparing the two terms. The two theorems above reduce to the noiseless setting by setting q = 0 (Bresler and Karzand (2018)). The sample complexity is increasing with respect to q, and structure learning is always feasible as long as q ∈ [0, 1/2). That is, to have the same probability of exact recovery we always need n† ≥ n, since

 \frac{1-(1-2q)^4\tanh(\beta)}{(1-2q)^4\,(1-\tanh(\beta))} \ge 1, \quad \forall q\in[0,\tfrac{1}{2})\ \text{and}\ \beta\in\mathbb{R}, (19)

where n is the number of samples required under the noiseless setting. Furthermore,

 \frac{1}{1-(4q(1-q))^p} \ge 1, \quad \forall q\in[0,\tfrac{1}{2})\ \text{and}\ p\in\mathbb{N}, (20)

which shows that the sample complexity in a hidden model is greater than in the noiseless case (q = 0), for any measurable estimator. When q approaches 1/2, the sample complexity goes to infinity, n† → ∞, which makes structure learning impossible. Our necessary-samples result is a non-trivial extension of the corresponding result by Bresler and Karzand (2018) to our hidden model. It combines Bresler's and Karzand's method with a strong data processing inequality (SDPI) by Polyanskiy and Wu (2017, Evaluation of the BSC). Upper bounds on the symmetric KL divergence for the output distribution cannot be found in closed form. However, by using the SDPI, we manage to capture the dependence of the bound on the parameters p and q and derive a non-trivial result. When p goes to infinity, the bound becomes trivial, since 1 − (4q(1 − q))^p → 1, which gives the classical data processing inequality (contraction of KL divergence for finite alphabets; Raginsky (2016); Polyanskiy and Wu (2017)).

### 3.2 Predictive Learning from Noisy Observations

In addition to recovering the structure of the hidden Ising model, we are interested in estimating the distribution itself. If the L^{(2)} distance between the estimator and the true distribution is sufficiently small, then the estimated distribution is appropriate for predictive learning because of (15). For consistency, this distribution should factorize according to the structure estimate, and for the predictive learning part the structure estimate is considered to be the output of the Chow-Liu-CC algorithm (see Algorithm 1). We continue by defining the distribution estimator of p(·) as

 \Pi_{T^{\dagger}_{CL}}(\hat{P}^{\dagger})(\mathbf{x}) \triangleq \frac{1}{2}\prod_{(i,j)\in E_{T^{\dagger}_{CL}}}\frac{1+x_i x_j\,\hat{\mu}^{\dagger}_{i,j}/(1-2q)^2}{2}. (21)

The estimator (21) can be defined for any q ∈ [0, 1/2). For q = 0 it reduces to that in the noiseless case, since μ†_{i,j} = μ_{i,j}, T_CL† = T_CL, and thus Π_{T_CL†}(\hat P†) = Π_{T_CL}(\hat P). It is also closely related to the reverse information projection onto the tree-structured Ising models (Bresler and Karzand, 2018), in the sense that

 \Pi_{T}(P) = \operatorname*{arg\,min}_{Q\in\mathcal{P}_T(\alpha,\beta)} D_{\mathrm{KL}}(P\,\|\,Q), \quad P\in\mathcal{P}_T(\alpha,\beta). (22)

To compute Π_{T_CL†}(\hat P†), two sufficient statistics are required: the structure T_CL† and the set of second-order moments \hat μ†_{i,j} (Bresler and Karzand, 2018; Chow and Liu, 1968), while q is considered to be known. The next result provides a sufficient condition on the number of samples which guarantees that the L^{(2)} distance between the true and the estimated distribution is small, with probability at least 1 − δ.

[Sufficient number of samples for inference] Fix numbers δ ∈ (0, 1) and η > 0, and let

 c_1(\eta,\beta,q) \triangleq 512\left[\frac{e\tanh(\beta)\log\big(\tfrac{1}{\tanh(\beta)}\big)}{\eta\,(1-2q)^4}\right]^2, (23)
 c_2(\beta,q) \triangleq \frac{1152\,e^{2\beta}}{(1-2q)^4}\left(1+\frac{2e^{\beta}\sqrt{2(1-q)q}}{\tanh\beta}\right)^2. (24)

If the number of samples n† satisfies the inequality

 n^{\dagger} \ge \max\{c_1(\eta,\beta,q),\,c_2(\beta,q)\}\,\log\frac{2p^2}{\delta}, (25)

then for the Chow-Liu-CC algorithm it is true that

 \mathbb{P}\Big(L^{(2)}\big(p(\cdot),\,\Pi_{T^{\dagger}_{CL}}(\hat{P}^{\dagger})\big)\le\eta\Big)\ge 1-\delta. (26)

Conversely, the following result provides the necessary number of samples for a small L^{(2)} distance via a minimax bound, which characterizes any possible estimator ψ. In other words, it provides the necessary number of samples required for accurate distribution estimation appropriate for Predictive Learning (small L^{(2)}).

[Necessary number of samples for inference] Fix a number η > 0. Choose α and η such that tanh(α) + 2η < 1. If the given number of samples n† satisfies the inequality

 n^{\dagger} < \frac{1-[\tanh(\alpha)+2\eta]^2}{16\,\eta^2\,\big[1-(4q(1-q))^p\big]}\,\log p, (27)

then for any algorithm ψ, it is true that

 \inf_{\psi}\ \sup_{T\in\mathcal{T}_p,\ p(\cdot)\in\mathcal{P}_T(\alpha,\beta)} \mathbb{P}\Big(L^{(2)}\big(p(\cdot),\,\psi(Y^{1:n^{\dagger}})\big)>\eta\Big) > \frac{1}{2}.

The above theorems reduce to the noiseless setting for q = 0, which has been studied earlier by Bresler and Karzand (2018). (We discuss further properties of the results and the source of the logarithmic term of (23) in Section 4.4.) Similarly to our structure learning results presented previously (Section 3.1), when q → 1/2 we have n† → ∞, which indicates that the task becomes impossible for q = 1/2. The sufficiency theorem requires the assumption tanh(α) + 2η < 1; the excluded special case can be derived by applying the same proof technique combined with Theorem B.1 by Bresler and Karzand (2018) and the SDPI by Polyanskiy and Wu (2017). Further details and proof sketches of the above theorems are provided in Section 4.3.

### 3.3 Estimating Higher Order Moments of Sign-Valued Trees

A collection of moments can completely represent any probability mass function. For many distributions, the first- and second-order moments are sufficient statistics; this is true, for instance, for the Gaussian distribution or the Ising model with unitary and pairwise interactions. Even further, in the Gaussian case, the well-known Isserlis theorem (Isserlis (1918)) gives a closed-form expression for all moments of every order. As part of this work, we derive the corresponding moment expressions for any tree-structured Ising model. To derive the expression for higher-order moments, we first prove a key property of tree structures: for any tree structure T and any even-sized set of nodes V′ ⊆ V, we can partition V′ into pairs of nodes such that the path along any pair is disjoint from the path of any other pair (see Appendix A, Lemma A). We denote by C_T(V′) the set of such distinct pairs of nodes in V′. Let CP_T(V′) be the set of all edges in all the mutually disjoint paths with endpoints the pairs of nodes in C_T(V′), that is,

 CP_T(V') \triangleq \bigcup_{\{w,w'\}\in C_T(V')} \mathrm{path}_T(w,w'). (28)

For any tree T, the set C_T(V′) can be computed via the Matching Pairs algorithm (Algorithm 2). By using the notation above, we can now present the equivalent of Isserlis’ theorem; the closed-form expression of the moments is given by the next result. For any distribution of the form of (4) which factorizes according to a tree T and has support {−1,+1}^p, it is true that

 \mathbb{E}[X_{i_1}X_{i_2}\cdots X_{i_k}] = \begin{cases} 0, & k \text{ odd},\\[2pt] \prod_{e\in CP_T(\{i_1,i_2,\ldots,i_k\})}\mu_e, & k \text{ even}. \end{cases} (29)

Theorem 2 is an equivalent of Isserlis’ theorem for tree-structured sign-valued distributions. Equation (29) is used later to define an estimator of higher-order moments which requires two sufficient statistics: the estimated structure T_CL† and the correlation estimates \hat μ†_e for the estimated edges. The higher-order moments altogether, along with the parameter q, completely characterize the distribution of the noisy variables of the hidden model (13). The proof of Theorem 2 is provided in Appendix A.
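One convenient way to evaluate the product in (29) without constructing the pairing explicitly uses an equivalent characterization: an edge belongs to CP_T(V′) exactly when the subtree on one of its sides contains an odd number of nodes of V′. This parity view, and the sketch below, are our own illustration, not the paper's Algorithm 2:

```python
def cp_edges(adj, root, marked):
    """Edges of CP_T(V'): edge (parent, child) lies on a matched path iff the
    subtree below it contains an odd number of marked nodes."""
    used = []

    def dfs(u, par):
        cnt = 1 if u in marked else 0
        for v in adj[u]:
            if v != par:
                below = dfs(v, u)
                if below % 2 == 1:
                    used.append((u, v))
                cnt += below
        return cnt

    dfs(root, None)
    return used

def tree_moment(adj, root, mu, marked):
    """E[prod of X_s for s in V'] via (29): zero for odd |V'|, else product of mu_e."""
    if len(marked) % 2 == 1:
        return 0.0
    out = 1.0
    for (u, v) in cp_edges(adj, root, marked):
        out *= mu[(u, v)] if (u, v) in mu else mu[(v, u)]
    return out
```

On a small star tree, tree_moment for a pair of leaves returns the product of the two edge correlations on their connecting path, matching (6).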

Higher Order Moments Estimator: Consider any higher-order moment as the expected value of a product of the hidden tree-structured Ising model variables X_{i_1}, …, X_{i_k}, where {i_1, …, i_k} ⊆ V. Theorem 2 gives the closed-form solution for those moments and indicates the following estimator for higher-order moments, considering that only noisy observations are available and q is known. In particular, we have

 \hat{\mathbb{E}}[X_{i_1}X_{i_2}\cdots X_{i_k}] \equiv 0, \quad k\in 2\mathbb{N}+1, (30)
 \hat{\mathbb{E}}[X_{i_1}X_{i_2}\cdots X_{i_k}] \triangleq \prod_{e\in CP_{T^{\dagger}_{CL}}(\{i_1,\ldots,i_k\})}\frac{\hat{\mu}^{\dagger}_e}{(1-2q)^2}, \quad k\in 2\mathbb{N}. (31)

Since the estimated structure and the pairwise correlations are sufficient statistics, (31) suggests a computationally efficient estimator for higher-order moments given those statistics. Algorithm 2, with input the estimated structure T_CL†, returns the set C_{T_CL†}(V′). Thus, by estimating T_CL† and \hat μ†_e for any estimated edge e, we can in turn estimate any higher-order moment through (31). Considering the absolute estimation error, we have

 \left|\hat{\mathbb{E}}\Big[\prod_{s\in V'}X_s\Big]-\mathbb{E}\Big[\prod_{s\in V'}X_s\Big]\right| \le 2|V'|\,L^{(2)}\big(p(\cdot),\,\Pi_{T^{\dagger}_{CL}}(\hat{P}^{\dagger})\big). (32)

Theorem 3.2 guarantees small "ssTV" and, in combination with (32), gives an upper bound on the error of the higher-order moment estimate (31). In Section 4.5, we provide further details and discussion of Theorem 2, of Algorithm 2, which computes the sets C_T(V′), and of the estimation error bound (32).

So far we have studied the consistency of the estimator with respect to the . To generalize our results, we are now interested in finding sample complexity bounds for sufficiently small -divergences. This part seems to be challenging, however an interesting case is the KL-divergence, which is found in a variety of testing algorithms (see Daskalakis et al. (2018)). The next result gives a bound for the sufficient number of samples, which guarantees small symmetric KL divergence with high probability. Here, the notation is used for the symmetric KL divergence of two distributions and . For any Ising model distributions of (3), with corresponding interaction parameters we have

$\mathrm{SKL}(\theta\|\theta') \triangleq \mathrm{SKL}(P\|Q) = \sum_{(s,t)\in\mathcal{E}} (\theta_{st}-\theta'_{st})(\mu_{st}-\mu'_{st}).$ (33)
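As a quick illustration of (33): for tree models the edge correlation is $\mu_{st}=\tanh\theta_{st}$, so the symmetric KL divergence of two models on the same tree is a one-line sum. A minimal sketch (our naming, assuming both models share the same edge set); note the result is nonnegative since $(x-y)(\tanh x-\tanh y)\ge 0$ by monotonicity of $\tanh$:

```python
import math

def skl_tree_ising(theta, theta_prime):
    """Symmetric KL divergence between two tree Ising models that share the
    same edge set, via the closed form (33). For an edge (s, t), the edge
    correlation is mu_st = tanh(theta_st)."""
    return sum((theta[e] - theta_prime[e]) *
               (math.tanh(theta[e]) - math.tanh(theta_prime[e]))
               for e in theta)
```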

Theorem 3.3 [Upper Bounds for the Symmetric KL Divergence]. If the number of samples $n^{\dagger}$ satisfies

$n^{\dagger} \ge \frac{4\beta^{2}(p-1)^{2}}{(1-2q)^{4}\,\eta_{s}^{2}}\,\log\!\left(\frac{p^{2}}{\delta}\right),$ (34)

then it holds that

$\mathbb{P}\Big(\mathrm{SKL}\big(p(\cdot)\,\|\,\Pi_{T_{CL}^{\dagger}}(\hat{P}^{\dagger})\big)\le \eta_{s}\Big) \ge 1-\delta,$ (35)

where $T_{CL}^{\dagger}$ is the Chow-Liu tree defined in (36) and the estimate $\hat{P}^{\dagger}$ is given by (21).
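The bound (34) is straightforward to evaluate numerically. The sketch below (our naming, assuming the constants exactly as stated in (34)) also makes the noise penalty visible: relative to $q=0$, the sufficient sample size inflates by the factor $(1-2q)^{-4}$.

```python
import math

def n_dagger_skl(beta, p, q, eta_s, delta):
    """Sufficient sample size from (34) guaranteeing SKL <= eta_s with
    probability at least 1 - delta, for a hidden tree Ising model on p
    vertices with maximum edge weight beta and BSC crossover q."""
    return math.ceil(4 * beta**2 * (p - 1)**2
                     / ((1 - 2 * q)**4 * eta_s**2)
                     * math.log(p**2 / delta))
```

For instance, at $q=0.1$ the prefactor $(1-2q)^{-4}=0.8^{-4}\approx 2.44$ multiplies the noiseless sample requirement.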

The asymptotic behavior of the bound in (34) was recently studied by Daskalakis et al. (2018). In that work, a set of testing algorithms is proposed and analyzed under the assumption of an Ising model on trees and on arbitrary graphs. Theorem 3.3 gives rise to possible extensions of testing algorithms to the hidden model setting, which is an interesting subject for future work.

## 4 Discussion

In this section, we present sketches of the proofs, compare our results with prior work, and further elaborate on Algorithm 1, Algorithm 2, and the error of the higher-order moment estimates. First, we discuss the convergence of the structure estimate $T_{CL}^{\dagger}$ (Section 4.1). In Section 4.2, we explain the connection between the hidden and noiseless settings for the tree structure learning problem. In Section 4.3, we present the analysis and a sketch of the proof of Theorem 3.2, and in Section 4.4 we compare our results on predictive learning from noisy observations with the corresponding results in the noiseless setting (studied by Bresler and Karzand (2018)). Finally, in Section 4.5, we provide further details about Theorem 2, a discussion of the Matching Pairs algorithm (Algorithm 2), and the accuracy of the proposed higher-order moment estimator (31).

### 4.1 Estimating the Tree Structure T

In this work, the structure learning algorithm is based on the classical Chow-Liu algorithm, and is summarized in Algorithm 1. We can express its output as

$T_{CL}^{\dagger} = \arg\max_{T\in\mathcal{T}} \sum_{(i,j)\in\mathcal{E}_{T}} \big|\hat{\mu}^{\dagger}_{i,j}\big|,$ (36)

where the equality comes from a direct application of Lemma 11.2 by Bresler and Karzand (2018). The difference between Algorithm 1 and the Chow-Liu algorithm in the noiseless scheme is the use of noisy observations as input: we consider a hidden model, whereas Bresler and Karzand (2018) assume that observations directly from the tree-structured model are available. Further, (36) shows the consistency of the estimate for a sufficiently large number of samples. The tree structure estimator $T_{CL}^{\dagger}$ converges to $T$ as $n\to\infty$, since

$\lim_{n\to\infty} \hat{\mu}^{\dagger}_{i,j} \overset{\text{a.s.}}{=} (1-2q)^{2}\,\mu_{i,j}.$ (37)

From (36) and (37) we have (under an appropriate metric)

$\lim_{n\to\infty} T_{CL}^{\dagger} \overset{\text{a.s.}}{=} T.$ (38)

Asymptotically, both $T_{CL}^{\dagger}$ and $T_{CL}$ converge to $T$, where $T_{CL}$ denotes the structure estimate from noiseless data ($q=0$). For a fixed probability of exact structure recovery $1-\delta$, more samples are required in the hidden model setting than in the noiseless one. The gap in sample complexity between the noisy and noiseless settings follows from Theorem 3.1 by comparing the bound for the values $q=0$ and $q>0$.
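The behavior described in (36)–(38) can be checked empirically. Below is a minimal Monte Carlo sketch on a 4-node chain of our own choosing (not from the paper): hidden samples are drawn via the edge-flip representation of a tree Ising model, passed through a binary symmetric channel, and the Chow-Liu step (36) is emulated as a maximum-weight spanning tree (Kruskal) on the absolute empirical correlations. With enough samples the true chain is recovered, and the empirical noisy correlations shrink by the factor $(1-2q)^{2}$ as in (37).

```python
import itertools
import random

random.seed(0)

def sample_chain(mus, n):
    """n samples from a chain Ising model via the edge-flip representation:
    X_{i+1} = X_i * Z_i with E[Z_i] = mus[i]."""
    out = []
    for _ in range(n):
        x = [random.choice([-1, 1])]
        for mu in mus:
            z = 1 if random.random() < (1 + mu) / 2 else -1
            x.append(x[-1] * z)
        out.append(x)
    return out

def bsc(samples, q):
    """Pass each coordinate through a binary symmetric channel (flip w.p. q)."""
    return [[-s if random.random() < q else s for s in row] for row in samples]

def chow_liu_edges(samples, p):
    """Maximum-weight spanning tree on |empirical correlations| (Kruskal)."""
    n = len(samples)
    w = {(i, j): abs(sum(r[i] * r[j] for r in samples)) / n
         for i, j in itertools.combinations(range(p), 2)}
    parent = list(range(p))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    tree = set()
    for (i, j) in sorted(w, key=w.get, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.add((i, j))
    return tree

mus, q, n = [0.9, 0.8, 0.85], 0.1, 20000
noisy = bsc(sample_chain(mus, n), q)
tree = chow_liu_edges(noisy, 4)   # expect the chain {(0,1), (1,2), (2,3)}
```

At $q=0.1$ the noisy edge correlations are scaled by $(1-2q)^{2}=0.64$, yet the ordering of edge versus non-edge correlations is preserved, so the chain is still recovered.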

### 4.2 Hidden Structure Recovery and Comparison with Prior Results

Theorem 3.1 and Theorem 3.2 constitute extensions of the noiseless setting, considered by Bresler and Karzand (2018, Theorem 3.1, Theorem 3.2), to our hidden model; the noiseless results correspond to $q=0$. In particular, in the presence of noise, the dependence on $p$ and $\delta$ remains logarithmic, that is, $O(\log(p/\delta))$. To make the connection between the sufficient conditions more explicit, by setting $q=0$ in (16) of Theorem 3.1, we retrieve the corresponding structure learning result by Bresler and Karzand (2018, Theorem 3.2) exactly: Fix a number $\delta>0$. If the number of samples $n$ satisfies the inequality

$n \ge \frac{32}{\tanh^{2}\alpha\,(1-\tanh\beta)}\,\log\!\left(\frac{2p^{2}}{\delta}\right),$ (39)

then the Chow-Liu algorithm returns $T$ with probability at least $1-\delta$. An equivalent condition to (39) is

$\tanh\alpha \ge \frac{4\epsilon}{\sqrt{1-\tanh\beta}} \triangleq \tau(\epsilon), \quad \text{and} \quad \epsilon \triangleq \sqrt{\frac{2}{n}\log(2p^{2}/\delta)},$ (40)

which shows the inequality that the weight $\tanh\alpha$ of the weakest edge should satisfy (Bresler and Karzand (2018)). For the hidden model, the equivalent extended condition for the weakest edge is

$\tanh\alpha \ge \frac{4\epsilon^{\dagger}\sqrt{1-(1-2q)^{4}\tanh\beta}}{(1-2q)^{2}(1-\tanh\beta)} \triangleq \tau^{\dagger}(\epsilon^{\dagger}), \quad \text{and} \quad \epsilon^{\dagger} = \sqrt{\frac{2\log(2p^{2}/\delta)}{n^{\dagger}}}$ (41)

(see Appendix C, Lemma C). Condition (40) is retrieved from (41) for $q=0$. Note that, for $q=1/2$, the mutual information between the hidden and observable variables is zero, thus structure recovery is impossible.
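For concreteness, the two weakest-edge thresholds can be compared numerically. The sketch below encodes our reading of (40) and (41) (function names are ours); it confirms that the noisy threshold $\tau^{\dagger}$ matches the noiseless $\tau$ at $q=0$ and is strictly larger for $q>0$, i.e., noise demands a stronger weakest edge at the same sample budget.

```python
import math

def tau(eps, beta):
    """Noiseless weakest-edge threshold tau(eps) from (40)."""
    return 4 * eps / math.sqrt(1 - math.tanh(beta))

def tau_dagger(eps, beta, q):
    """Hidden-model weakest-edge threshold tau^dagger(eps) from (41);
    reduces to tau(eps) at q = 0 since (1 - 2q)^2 = 1 there."""
    c2 = (1 - 2 * q) ** 2
    return (4 * eps * math.sqrt(1 - c2**2 * math.tanh(beta))
            / (c2 * (1 - math.tanh(beta))))
```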

Theorem 3.1 provides a bound on the necessary number of samples for exact structure recovery given noisy observations. In fact, it generalizes Theorem 3.1 by Bresler and Karzand (2018) to the hidden setting. By fixing $q=0$, Theorem 3.1 recovers the noiseless-case bound (Bresler and Karzand (2018), Theorem 3.1): Fix $\delta>0$. If the number of samples $n$ satisfies the inequality

$n < \frac{1}{16}\, e^{2\beta}\, [\alpha\tanh(\alpha)]^{-1} \log(p),$ (42)

then for any algorithmic mapping (estimator) $\psi$, it is true that

 infψsup