# Learning Ising Models with Independent Failures

We give the first efficient algorithm for learning the structure of an Ising model that tolerates independent failures; that is, each entry of the observed sample is missing with some unknown probability p. Our algorithm matches the essentially optimal runtime and sample complexity bounds of recent work for learning Ising models due to Klivans and Meka (2017). We devise a novel unbiased estimator for the gradient of the Interaction Screening Objective (ISO) due to Vuffray et al. (2016) and apply a stochastic multiplicative gradient descent algorithm to minimize this objective. Solutions to this minimization recover the neighborhood information of the underlying Ising model on a node by node basis.


## 1 Introduction

Ising models are fundamental undirected binary graphical models that capture pairwise dependencies between the input variables. They are well studied in the literature and have applications in a large number of areas such as physics, computer vision and statistics (Jaimovich et al. (2006); Koller et al. (2009); Choi et al. (2010); Marbach et al. (2012)). One of the core problems in understanding graphical models is structure learning, that is, recovering the structure of the underlying graph given access to random samples from the distribution. Developing efficient algorithms for structure learning is a heavily studied topic, especially for the case when the underlying graph is sparse or has bounded degree (Dasgupta (1999); Lee et al. (2007); Bresler et al. (2008); Ravikumar et al. (2010); Bresler et al. (2014); Bresler (2015); Vuffray et al. (2016); Hamilton et al. (2017); Klivans and Meka (2017); Wu et al. (2018)). Klivans and Meka (2017) were the first to give an algorithm for learning the structure of Ising models (and more generally higher order MRFs) with essentially optimal runtime and sample complexity.

The focus of this paper is the setting where the samples drawn from the Ising model are corrupted by noise. More specifically, we consider the independent failure model where each entry of each sample is independently corrupted (missing or flipped) with some constant, possibly unknown, probability $p$. Such a scenario can be expected, for example, in a sensor network, where sensors occasionally fail to report data due to internal failures. In addition to being a natural question, structure learning in this model has been specifically posed as an open problem by Chen (2011).

### 1.1 Our Result

The main contribution of our paper is a simple stochastic algorithm for learning the structure of an Ising model in the presence of corruptions with near optimal sample and time complexity:

###### Theorem 1 (Informal version).

Consider an Ising model with dependency graph $G$ over $n$ vertices with minimum edge weight . Further assume that the sum of absolute values of the weights of outgoing edges from each vertex is bounded by $\lambda$. Given corrupted draws from the Ising model such that each entry is missing or flipped with known probability $p$, there exists an algorithm that recovers the underlying structure in time using at most samples where depends on . (Footnote: In the missing-data setting, scales as . For , is at most 10. We refer the reader to the proof of Theorem 3 for the exact dependence on .)

For missing data, we extend our approach to the setting where $p$ is unknown. We show that using only fresh samples to estimate these probabilities suffices for the guarantees to hold. Our results can also be easily extended to other similar noise models such as independent block failures where, instead of individual entries being corrupted, fixed subsets of entries are simultaneously missing or flipped. Our work is the first efficient algorithm for learning the structure of Ising models in the presence of independent failures; both our running time and sample complexity are essentially optimal (matching recent work due to Klivans and Meka (2017)).

### 1.2 Our Approach

Our approach has three main components that we briefly explain here.

##### Optimization Problem:

We follow the “nodewise-regression” approach of recovering the graph by solving an optimization problem to identify the neighborhood of each vertex. To do so, we minimize the Interaction Screening Objective (ISO) proposed by Vuffray et al. (2016) over the $\ell_1$-norm constrained ball. The ISO satisfies a property known as restricted strong convexity which allows us to recover the underlying edge weights from an approximate minimum solution.

##### Minimization Procedure:

Instead of using convex programming as in Vuffray et al. (2016), we minimize ISO using stochastic multiplicative gradient descent (SMG) due to Kakade et al. (2008). SMG is a stochastic, multiplicative-weight update algorithm and therefore runs on the simplex (modeling a probability distribution). As such, we need to transform our optimization problem to the simplex. The main motivation for using SMG is two-fold: 1) it can handle $\ell_1$-constraints, 2) it requires access to only an unbiased, bounded estimator of the gradient to give convergence guarantees.

##### Unbiased Gradient Construction:

Despite independent failures in our sample, we are able to construct an unbiased estimator of the gradient of the (modified) ISO on the simplex. Our construction exploits the decomposability of the ISO, which is an exponential of a linear function and therefore a product of per-coordinate terms. This allows us to use the independence property of the failures to come up with a simple unbiased estimator. The techniques used for this construction may be of independent interest.

### 1.3 Related Work

##### Ising Models.

Structure learning for Markov Random Fields has been studied since the 1960s. For example, Chow and Liu (1968) gave a greedy algorithm for undirected graphical models assuming the underlying graph is a tree. There have been many works for learning Ising models under various assumptions on the structure of the underlying graph (e.g., Lee et al. (2007); Yuan and Lin (2007); Ravikumar et al. (2010); Yang et al. (2012)). For Ising models specifically, the first assumption-free result (that is, no assumptions are made on the underlying graph other than sparsity) was given by Bresler (2015). One drawback of Bresler’s work, however, is that the sample complexity has a doubly exponential dependence on the sparsity of the underlying graph. This was improved by Vuffray et al. (2016) where they used general-purpose tools for convex programming to minimize a certain objective function, but the running time of their approach was suboptimal. Subsequently, Klivans and Meka (2017) gave a multiplicative weight update algorithm called Sparsitron that achieved near-optimal sample complexity (essentially matching known information-theoretic lower bounds) and near-optimal running time (under a computational hardness assumption for the light-bulb problem Valiant (2015)). Recently, Wu et al. (2018) gave a different proof of Klivans and Meka (2017) using $\ell_1$-regularized logistic regression.

##### Learning Ising Models with Noise.

None of the works mentioned above handle the setting where there is noise in the observed data. The problem of structure learning in the independent failure model (specifically missing data) was raised by Chen (2011). The only work we are aware of that obtains positive results in this model is due to Hamilton et al. (2017), who applied a broad generalization of the work of Bresler (2015). Their sample complexity, however, has a doubly exponential dependence on the sparsity of the underlying graph. Our result gives a singly exponential dependence on the sparsity (known to be information-theoretically optimal) and can handle non-sparse graphs as long as they have small $\ell_1$-norm.

##### Robust Learning of Ising Models.

Lindgren et al. (2018) studied structure learning in a different noise model motivated by recent results in robust learning. In their work, an adversary is allowed to arbitrarily corrupt some fraction of a training set drawn from the underlying distribution. They showed that if an adversary is allowed to corrupt even an exponentially small fraction of the samples, then no algorithm is robust. They further showed that their bounds are tight by proving robustness of Sparsitron against exponentially small adversarial corruption. The adversarial model is incomparable to the model studied here: in the independent failure model, every example will (on average) have a fraction of the entries missing (or flipped).

##### Learning with Missing Data.

For general learning problems, various methods have been proposed to handle missing data including heuristics and maximum likelihood methods. In the context of high-dimensional sparse linear regression, Loh and Wainwright (2011) proposed simple estimators to handle missing data based on solving optimization problems, and further showed that simple gradient descent recovers a close-to-optimal solution despite non-convexity. For distribution learning, Shah and Song (2018) recently gave the first positive results for learning mixtures of Gaussians with missing data with optimal sample/runtime complexity. It is not immediately clear how to use either of these techniques for our problem.

### 1.4 Notation

For vector , denotes and denotes . denotes the -norm. We denote the different distance/divergence measures for distributions as follows: denotes the Kullback-Leibler divergence between probability distributions and , and denotes the total variation distance between and . We denote the $\ell_1$ ball of radius $W$ in $k$ dimensions using $B(W,k)$ and the simplex with radius $W$ in $k$ dimensions using $\Delta(W,k)$.

## 2 Preliminaries

###### Definition 1 (Ising model).

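Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix with $A_{ii} = 0$ for all $i \in [n]$ and $\theta \in \mathbb{R}^n$ be the mean-field vector. Let $G = (V, E)$ be an undirected graph with $V = [n]$ such that $(i,j) \in E$ if and only if $A_{ij} \neq 0$. The $n$-variable Ising model with underlying dependency graph $G$ is a distribution $D$ defined on $\{-1,1\}^n$ such that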

$$\Pr[Z = z] = \frac{1}{Z}\exp\left(\sum_{(i,j)\in E} A_{ij} z_i z_j + \sum_{i \in V} \theta_i z_i\right)$$

where is the normalizing factor. We denote the minimum edge weight by and the width of the model by .

We will assume that the minimum edge weight is at least and the model is width bounded by . For ease of presentation, we will suppress notation and denote by . A useful property of bounded width Ising models is that the conditional distributions are bounded away from 0 and 1. More formally,

###### Lemma 1 (Bresler (2015)).

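For any node $v \in V$, subset $S \subseteq V \setminus \{v\}$, and any fixed configuration $z_S$ of the subset $S$,

$$\min\{\Pr[Z_v = 1 \mid Z_S = z_S],\ \Pr[Z_v = -1 \mid Z_S = z_S]\} \ge \frac{1}{2}\exp(-2\lambda).$$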

In this paper, we are interested in structure learning, that is, recovering the edges of the dependency graph of an unknown Ising model given independent draws from it. We will focus on recovery in the presence of corruptions in the observed samples. The model of corruptions we study in our paper is known as the independent failures models: corruptions of each variable are independent of all the other variables. More specifically, we will consider the following two special cases of independent failures:

1. Missing Data: In this setting, for sample $z$ we will instead observe $x$ such that for all $i$, with probability $p_i$, $x_i = ?$ (missing), otherwise $x_i = z_i$.

2. Flipped Data: In this setting, for sample $z$ we will instead observe $x$ such that for all $i$, with probability $p_i$, $x_i = -z_i$ (flipped), otherwise $x_i = z_i$.
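The two corruption models above can be simulated with a few lines of Python. This is only an illustrative sketch: the function names `corrupt_missing` and `corrupt_flipped`, the `"?"` placeholder encoding, and the example sample are assumptions made here, not part of the paper.

```python
import random

MISSING = "?"  # placeholder for a missing entry (encoding chosen for illustration)

def corrupt_missing(z, p, rng):
    """Replace entry i by MISSING independently with probability p[i]."""
    return [MISSING if rng.random() < p[i] else z[i] for i in range(len(z))]

def corrupt_flipped(z, p, rng):
    """Negate entry i independently with probability p[i]."""
    return [-z[i] if rng.random() < p[i] else z[i] for i in range(len(z))]

rng = random.Random(0)
z = [1, -1, 1, 1, -1]      # a hypothetical clean sample in {-1, 1}^n
p = [0.2] * len(z)         # per-coordinate failure probabilities
x_miss = corrupt_missing(z, p, rng)
x_flip = corrupt_flipped(z, p, rng)
```

Each coordinate is corrupted using its own independent random draw, which is the property the later estimators rely on.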

## 3 Main Approach

To recover the underlying graph, we will show how to reconstruct the neighborhood of each vertex by solving a convex constrained minimization problem. We will first describe the optimization problem and then show how to minimize the same using a stochastic first-order method assuming access to an unbiased estimator of the gradient. Subsequently we will show how to recover the neighborhood from the optimization solution. Lastly, we will detail the construction of the unbiased estimators for the two given noise models.

Note: For ease of presentation, we will consider Ising models with zero mean-field ($\theta = 0$). For details on the non-zero mean-field case, we refer the reader to Appendix B.

### 3.1 Optimization Problem

Consider the neighborhood of a fixed vertex (WLOG we choose vertex $n$; the same analysis applies to any other vertex). Vuffray et al. (2016) proposed to minimize the Interaction Screening Objective (ISO) as follows:

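$$\text{Optimization Problem 1:}\quad \min_{v \in \mathbb{R}^{n-1}} S(v) := \mathbb{E}_{Z \sim D}\left[\exp\left(-\sum_{j=1}^{n-1} v_j Z_n Z_j\right)\right] \quad \text{subject to } \|v\|_1 \le \lambda.$$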

Vuffray et al. (2016) studied the empirical version of Optimization Problem 1: the objective is computed using samples drawn from the Ising model as . They proved various useful properties of that directly extend to . Firstly, they showed that the optimal solution to Optimization Problem 1 captures the edge weights of the neighborhood of in . More formally,

###### Lemma 2 (Vuffray et al. (2016)).

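Let $v^* \in \mathbb{R}^{n-1}$ be such that for $i \in [n-1]$, $v^*_i = A_{ni}$ if $(n,i) \in E$ else 0. Then we have $\nabla S(v^*) = 0$, and $v^*$ is a global minimum of Optimization Problem 1.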
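Lemma 2 can be sanity-checked by brute force on a tiny model. The sketch below assumes a zero mean-field Ising model on three vertices with hypothetical weights, enumerates the exact distribution, and evaluates the gradient of $S$ at $v^*$; the helper names are illustrative only.

```python
import itertools
import math

def ising_distribution(A):
    """Exact zero mean-field Ising distribution on {-1,1}^n (each edge counted once)."""
    n = len(A)
    states = list(itertools.product([-1, 1], repeat=n))
    weights = [
        math.exp(sum(A[i][j] * z[i] * z[j] for i in range(n) for j in range(i + 1, n)))
        for z in states
    ]
    total = sum(weights)
    return [(z, wt / total) for z, wt in zip(states, weights)]

def iso_gradient(v, dist):
    """Gradient of S(v) = E[exp(-sum_j v_j Z_n Z_j)]; the last vertex plays the role of n."""
    grad = [0.0] * len(v)
    for z, prob in dist:
        zn = z[-1]
        e = math.exp(-sum(v[j] * zn * z[j] for j in range(len(v))))
        for i in range(len(v)):
            grad[i] -= prob * e * zn * z[i]
    return grad

# Hypothetical weights: edges {0,1} and {0,2}; vertex 2 is the fixed vertex "n".
A = [[0.0, 0.3, -0.5],
     [0.3, 0.0, 0.0],
     [-0.5, 0.0, 0.0]]
dist = ising_distribution(A)
v_star = A[-1][:-1]                      # v*_i = A_{ni}
grad_at_v_star = iso_gradient(v_star, dist)
grad_at_zero = iso_gradient([0.0, 0.0], dist)
```

At $v^*$ the gradient vanishes up to floating-point error, while at other points (for example $v = 0$) it does not.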

Further they proved that satisfies a property known as restricted strong convexity (RSC) which enables us to recover a vector close to by approximately minimizing the objective. Here we give a stronger version of their result which holds more generally.

###### Lemma 3.

[RSC for ] For all ,

$$S(v) - S(v^*) \ge \nabla S(v^*)^T (v - v^*) + \frac{\exp(-3\lambda)}{1+\lambda}\,\|v - v^*\|_\infty^2$$

Note: Our proof technique improves on their result, as we work directly with the norm, avoiding the use of the norm in their proof (c.f. Lemma 7 and 8 Vuffray et al. (2016)). Using our analysis in their proof improves their sample complexity bounds for structure learning from to where is the maximum degree of each node in . This improvement is highlighted when (where the improvement is by a factor of ).

With the above lemma in hand, we can show that small loss implies closeness to and hence recovery of the edge weights of the neighborhood of . More formally, if then .

### 3.2 Minimizing ISO with l1 Constraint

To solve Optimization Problem 1, we will use an algorithm due to Kakade et al. (2008) called Stochastic Multiplicative Gradient Descent (SMG) (see Algorithm 1). SMG is a multiplicative-weight update algorithm that optimizes over the simplex instead of the $\ell_1$-constraint ball. As such, we will need to reduce our problem from the ball to the simplex to obtain the appropriate guarantees. Since the SMG analysis only appears in a set of lecture notes (and there are some minor errors in the writeup), for completeness we include the proof of the following theorem in the appendix:

###### Theorem 2 (Kakade et al. (2008)).

Let be an optimal solution of for convex function on . For SMG run on , as long as where denotes history of previous iterations and for all on the simplex and for all iterations , then with probability for suitably chosen

$$c(\bar{u}) \le c(u^*) + 4BW\left(\sqrt{\frac{\log k}{T}} + \sqrt{\frac{2}{T}\log\frac{1}{\delta}}\right).$$
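Algorithm 1 is not reproduced in this text, so the following is only a generic exponentiated-gradient sketch of the kind of update SMG performs, under stated assumptions: the iterate lives on the scaled simplex $\{u \ge 0, \sum_i u_i = W\}$, each step multiplies coordinates by $\exp(-\eta g_i)$ for a stochastic gradient $g$ and renormalizes, and the average iterate is returned. The step size `eta`, the function name `smg`, and the toy linear objective are illustrative choices, not the paper's.

```python
import math
import random

def smg(grad_estimate, k, W, T, eta, rng):
    """Multiplicative-weight updates on {u >= 0, sum(u) = W}; returns the average iterate."""
    u = [W / k] * k
    avg = [0.0] * k
    for _ in range(T):
        avg = [a + ui / T for a, ui in zip(avg, u)]
        g = grad_estimate(u, rng)                       # unbiased, bounded gradient estimate
        scaled = [ui * math.exp(-eta * gi) for ui, gi in zip(u, g)]
        total = sum(scaled)
        u = [W * s / total for s in scaled]             # renormalize back onto the simplex
    return avg

# Toy problem: minimize c(u) = sum_i a_i u_i, observing gradients corrupted by zero-mean noise.
a = [0.9, 0.1, 0.5]

def noisy_linear_gradient(u, rng):
    return [ai + rng.uniform(-0.5, 0.5) for ai in a]

u_bar = smg(noisy_linear_gradient, k=3, W=1.0, T=2000, eta=0.05, rng=random.Random(0))
cost = sum(ai * ui for ai, ui in zip(a, u_bar))
```

On this toy objective the optimum places all mass on the coordinate with the smallest $a_i$, and the averaged iterate ends up close to that cost despite only seeing noisy gradients.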

To apply SMG, we convert our optimization problem’s constraint from the ball to the simplex. To do so, we define the following mappings:

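$$\Pi^k_{B\to\Delta}: B(W,k) \to \Delta(W,2k+1), \qquad \Pi^k_{B\to\Delta}(w)_i = \begin{cases} W - \|w\|_1 & \text{if } i = 2k+1 \\ w_i & \text{if } i \in \{1,\dots,k\},\ w_i \ge 0 \\ -w_{i-k} & \text{if } i \in \{k+1,\dots,2k\},\ w_{i-k} < 0 \\ 0 & \text{otherwise.} \end{cases}$$

$$\Pi^k_{\Delta\to B}: \Delta(W,2k+1) \to B(W,k), \qquad \Pi^k_{\Delta\to B}(u)_i = u_i - u_{i+k} \ \text{ for } i \in \{1,\dots,k\}$$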
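The two mappings translate directly into code. The sketch below uses 0-based indexing, so slot `2k` holds the slack coordinate $W - \|w\|_1$; the function names are illustrative.

```python
def ball_to_simplex(w, W):
    """Map w in the l1 ball of radius W (dimension k) to the simplex of radius W (dimension 2k+1)."""
    k = len(w)
    u = [0.0] * (2 * k + 1)
    for i, wi in enumerate(w):
        if wi >= 0:
            u[i] = wi            # positive part goes to coordinate i
        else:
            u[k + i] = -wi       # negative part goes to coordinate k + i
    u[2 * k] = W - sum(abs(wi) for wi in w)   # slack so that the coordinates sum to W
    return u

def simplex_to_ball(u):
    """Inverse direction: v_i = u_i - u_{i+k}."""
    k = (len(u) - 1) // 2
    return [u[i] - u[k + i] for i in range(k)]

w = [0.5, -0.25, 0.0]
u = ball_to_simplex(w, W=1.0)
w_back = simplex_to_ball(u)
```

Mapping to the simplex and back returns the original vector, and the image is nonnegative with coordinates summing to $W$.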

It is not hard to verify that these are valid mappings from ball to simplex and vice-versa. Using the given transformation, the optimization problem on the simplex is as follows:

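$$\text{Optimization Problem 2:}\quad \min_{w \in \mathbb{R}^{2n-1}} \tilde{S}(w) := S\left(\Pi^{n-1}_{\Delta\to B}(w)\right) = \mathbb{E}_{Z \sim D}\left[\exp\left(-\sum_{j=1}^{n-1} (w_j - w_{n-1+j}) Z_n Z_j\right)\right] \quad \text{subject to } \|w\|_1 = \lambda,\ w \ge 0.$$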

Note that the above loss is also convex and similar to Lemma 2, we can show the following,

###### Lemma 4.

Let , then and is a global minimum of Optimization Problem 2.

Using SMG we can thus solve Optimization Problem 2 as long as we have access to an unbiased estimator of the gradient.

### 3.3 Unbiased Estimator of Gradient

Let be a candidate solution of Optimization Problem 2. Since we have missing/flipped data, it is not clear how to compute the loss or the gradient at the point , and the gradient is required to execute a step of the SMG algorithm. We will show, however, that it is possible to construct an unbiased estimator of the gradient with bounded norm for known s. We then show that our estimates are sufficiently strong to plug-in to the SMG analysis.

#### 3.3.1 Missing Data

In the missing-data model, instead of getting samples from $D$, we get samples from a corrupted distribution $D_{miss}$ as follows:

• Let $Z \sim D$. Let $C_i \sim \mathrm{Ber}(1-p_i)$ for all $i \in [n]$.

• Output $X$ such that for all $i$, $X_i = C_i Z_i$ (replacing ? by 0).

Here $\mathrm{Ber}(q)$ denotes the distribution of a Bernoulli random variable with probability $q$.

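The main intuition of our estimator can be made clear with the following example. Suppose we wish to construct an unbiased estimator for a function $f(Z_i)$; then consider $\frac{f(X_i) - p_i f(0)}{1 - p_i}$. It is not hard to see that

$$\mathbb{E}_{X_i}\left[\frac{f(X_i) - p_i f(0)}{1-p_i} \,\middle|\, Z_i\right] = \mathbb{E}_{C_i \sim \mathrm{Ber}(1-p_i)}\left[\frac{f(C_i Z_i) - p_i f(0)}{1-p_i} \,\middle|\, Z_i\right] = (1-p_i) \times \frac{f(Z_i) - p_i f(0)}{1-p_i} + p_i \times f(0) = f(Z_i).$$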
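The single-coordinate identity above can be verified exactly by averaging over the two outcomes of the Bernoulli mask. This is a minimal sketch; the test function `f` and the helper names are illustrative.

```python
import math

def missing_estimate(f, x, p):
    """Estimate of f(Z_i) computed from the observed value x (0 encodes a missing entry)."""
    return (f(x) - p * f(0)) / (1 - p)

def expected_missing_estimate(f, z, p):
    """Exact expectation over the mask: X = z with probability 1 - p, X = 0 with probability p."""
    return (1 - p) * missing_estimate(f, z, p) + p * missing_estimate(f, 0, p)

f = lambda t: math.exp(-0.7 * t)   # an exponential of a linear function, as in the ISO
gaps = [abs(expected_missing_estimate(f, z, p) - f(z))
        for z in (-1, 1) for p in (0.1, 0.5, 0.9)]
```

The correction term $-p f(0)$ cancels the contribution of the missing case, and the division by $1-p$ rescales the observed case.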

Since ISO is a product function in , we can apply the above estimator to each term of the product. For we use the above idea on the so formed product of estimators. This allows us to construct the following estimators, ,

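$$g_i^{miss}(w;X) = -\frac{X_n}{1-p_n} \times \frac{\exp\left(-(w_i - w_{n-1+i}) X_n X_i\right) X_i}{1-p_i} \times \prod_{j \neq i,n} \frac{\exp\left(-(w_j - w_{n-1+j}) X_n X_j\right) - p_j}{1-p_j}.$$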

The following lemma shows how to use to construct an unbiased estimator of the gradient of .

###### Lemma 5.

Consider estimator where is the indicator vector for coordinate . Then for all fixed ,

$$\mathbb{E}_{X \sim D_{miss}}\left[G_{miss}(w;X)\right] = \nabla \tilde{S}(w).$$

Also, for all and ,

$$\|G_{miss}(w;X)\|_\infty \le \frac{1}{(1-p_{\max})^2}\exp\left(\frac{\lambda}{1-p_{\max}}\right)$$

where $p_{\max} = \max_i p_i$.
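The unbiasedness claim of Lemma 5 can be checked numerically on a small instance by enumerating every state $Z$ and every missingness mask $C$. The sketch below works directly with $v_j = w_j - w_{n-1+j}$ and compares $\mathbb{E}[g_i^{miss}]$ with the $i$-th coordinate of the ISO gradient; the model weights, the probabilities `p`, and all function names are hypothetical choices for illustration.

```python
import itertools
import math

def ising_distribution(A):
    """Exact zero mean-field Ising distribution on {-1,1}^n (each edge counted once)."""
    n = len(A)
    states = list(itertools.product([-1, 1], repeat=n))
    weights = [
        math.exp(sum(A[i][j] * z[i] * z[j] for i in range(n) for j in range(i + 1, n)))
        for z in states
    ]
    total = sum(weights)
    return [(z, wt / total) for z, wt in zip(states, weights)]

def g_miss(i, v, x, p):
    """The estimator g_i^miss evaluated on an observed sample x (missing entries set to 0)."""
    xn = x[-1]
    value = -xn / (1 - p[-1]) * math.exp(-v[i] * xn * x[i]) * x[i] / (1 - p[i])
    for j in range(len(v)):
        if j != i:
            value *= (math.exp(-v[j] * xn * x[j]) - p[j]) / (1 - p[j])
    return value

def expected_g_miss(i, v, dist, p):
    """Exact expectation over Z ~ D and independent masks C_k ~ Ber(1 - p_k)."""
    n = len(p)
    total = 0.0
    for z, prob_z in dist:
        for c in itertools.product([0, 1], repeat=n):
            prob_c = math.prod(1 - p[k] if c[k] else p[k] for k in range(n))
            x = [c[k] * z[k] for k in range(n)]
            total += prob_z * prob_c * g_miss(i, v, x, p)
    return total

def iso_gradient_coordinate(i, v, dist):
    """-E[exp(-sum_j v_j Z_n Z_j) Z_n Z_i] computed from clean samples."""
    return -sum(
        prob_z * math.exp(-sum(v[j] * z[-1] * z[j] for j in range(len(v)))) * z[-1] * z[i]
        for z, prob_z in dist
    )

A = [[0.0, 0.3, -0.5], [0.3, 0.0, 0.2], [-0.5, 0.2, 0.0]]
p = [0.1, 0.3, 0.2]
v = [0.4, -0.2]
dist = ising_distribution(A)
errors = [abs(expected_g_miss(i, v, dist, p) - iso_gradient_coordinate(i, v, dist))
          for i in range(len(v))]
```

The two quantities agree up to floating-point error for every coordinate, even though the estimator never sees the clean sample.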

###### Proof.

Observe that the gradient of with respect to can be computed as follows. We have since does not depend on , and

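$$\forall i \in [n-1],\quad \nabla \tilde{S}(w)_i = -\nabla \tilde{S}(w)_{i+n-1} = -\mathbb{E}_{Z \sim D}\left[\exp\left(-\sum_{j=1}^{n-1} (w_j - w_{n-1+j}) Z_n Z_j\right) Z_n Z_i\right].$$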

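For ease of presentation, let $v = \Pi^{n-1}_{\Delta\to B}(w)$, that is, $v_i = w_i - w_{n-1+i}$ for $i \in [n-1]$. Taking expectation of $g_i^{miss}(w;X)$ over $X \sim D_{miss}$, we have

$$\begin{aligned}
\mathbb{E}_{X \sim D_{miss}}\left[g_i^{miss}(w;X)\right]
&= -\mathbb{E}_{X \sim D_{miss}}\left[\frac{X_n}{1-p_n} \times \frac{\exp(-v_i X_n X_i) X_i}{1-p_i} \times \prod_{j \neq i,n} \frac{\exp(-v_j X_n X_j) - p_j}{1-p_j}\right] \\
&= -\mathbb{E}_{Z \sim D,\, C_n \sim \mathrm{Ber}(1-p_n)}\left[\frac{C_n Z_n}{1-p_n} \times \mathbb{E}_{C_i \sim \mathrm{Ber}(1-p_i)}\left[\frac{\exp(-v_i C_n Z_n C_i Z_i) C_i Z_i}{1-p_i}\right] \times \prod_{j \neq i,n} \mathbb{E}_{C_j \sim \mathrm{Ber}(1-p_j)}\left[\frac{\exp(-v_j C_n Z_n C_j Z_j) - p_j}{1-p_j}\right]\right] \\
&= -\mathbb{E}_{Z \sim D,\, C_n \sim \mathrm{Ber}(1-p_n)}\left[\frac{C_n Z_n}{1-p_n} \times \exp(-v_i C_n Z_n Z_i) Z_i \times \prod_{j \neq i,n} \exp(-v_j C_n Z_n Z_j)\right] \\
&= -\mathbb{E}_{Z \sim D}\left[\exp\left(-\sum_{j \neq n} v_j Z_n Z_j\right) Z_n Z_i\right] = \nabla \tilde{S}(w)_i.
\end{aligned}$$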

The above follows from observing the following facts: 1) as long as is independent of , 2) as long as are independent of .

The last thing we need is that the above estimator has bounded norm for all (). Observe that

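$$\left|\frac{\exp(a) - p}{1-p}\right| = \frac{\left|1 - p + \sum_{i=1}^{\infty} \frac{a^i}{i!}\right|}{1-p} \le 1 + \sum_{i=1}^{\infty} \frac{|a|^i}{(1-p)\, i!} \le 1 + \sum_{i=1}^{\infty} \frac{1}{i!}\left(\frac{|a|}{1-p}\right)^i = \exp\left(\frac{|a|}{1-p}\right).$$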

Using the above property,

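$$|g_i^{miss}(w;X)| \le \frac{\exp(|v_i|)}{(1-p_{\max})(1-p_{\max})} \prod_{j \neq i,n} \exp\left(\frac{|v_j|}{1-p_{\max}}\right) \le \frac{1}{(1-p_{\max})^2}\exp\left(\frac{\lambda}{1-p_{\max}}\right).$$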

Here the inequality follows from observing that . Thus for all . ∎

#### 3.3.2 Random-Flipped Data

In the random-flipped data model, instead of getting samples from $D$, we get samples from a corrupted distribution $D_{flip}$ as follows:

• Let $Z \sim D$. Let $C_i \sim \mathrm{Rad}(1-p_i)$ for all $i \in [n]$.

• Output $X$ such that for all $i$, $X_i = C_i Z_i$.

Here $\mathrm{Rad}(q)$ denotes the distribution of Rademacher variables with probability $q$, that is, the distribution over $\{-1,1\}$ where the probability of drawing 1 is $q$.

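Similar to the missing data case, we motivate our estimator with the following example. Suppose we wish to construct an unbiased estimator for any function $f(Z_i)$; then consider $\frac{(1-p_i) f(X_i) - p_i f(-X_i)}{1-2p_i}$. It is not hard to see that

$$\begin{aligned}
\mathbb{E}_{X_i}\left[\frac{(1-p_i) f(X_i) - p_i f(-X_i)}{1-2p_i} \,\middle|\, Z_i\right]
&= \mathbb{E}_{C_i \sim \mathrm{Rad}(1-p_i)}\left[\frac{(1-p_i) f(C_i Z_i) - p_i f(-C_i Z_i)}{1-2p_i} \,\middle|\, Z_i\right] \\
&= (1-p_i) \times \frac{(1-p_i) f(Z_i) - p_i f(-Z_i)}{1-2p_i} + p_i \times \frac{(1-p_i) f(-Z_i) - p_i f(Z_i)}{1-2p_i} = f(Z_i).
\end{aligned}$$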
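As in the missing-data case, the flipped-data identity can be verified exactly by averaging over the two outcomes of the Rademacher variable. A minimal sketch, assuming $p \neq 1/2$ so that the denominator is non-zero; the names are illustrative.

```python
import math

def flip_estimate(f, x, p):
    """Estimate of f(Z_i) from the possibly flipped observation x; requires p != 1/2."""
    return ((1 - p) * f(x) - p * f(-x)) / (1 - 2 * p)

def expected_flip_estimate(f, z, p):
    """Exact expectation: X = z with probability 1 - p and X = -z with probability p."""
    return (1 - p) * flip_estimate(f, z, p) + p * flip_estimate(f, -z, p)

f = lambda t: math.exp(-0.7 * t)
gaps = [abs(expected_flip_estimate(f, z, p) - f(z))
        for z in (-1, 1) for p in (0.1, 0.3, 0.45)]
```

The estimator becomes increasingly large in magnitude as $p$ approaches $1/2$, which is where flipped observations carry no information about the original entry.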

Based on the above example, define . Consider the following estimators, for all :

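$$g_i^{flip}(w;X) = -\frac{(1-p_n)\, h_i(w;X) + p_n\, h_i(-w;X)}{1-2p_n} \quad \text{where} \quad h_i(w;X) = X_n X_i \times \prod_{j \neq n} \sigma\left(p_j,\ -(w_j - w_{n-1+j}) X_n,\ X_j\right).$$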

The following lemma shows that is indeed an unbiased estimator of .

###### Lemma 6.

Using the above, we construct estimator where is the indicator vector for coordinate . Then for all fixed ,

$$\mathbb{E}_{X \sim D_{flip}}\left[G_{flip}(w;X)\right] = \nabla \tilde{S}(w).$$

Also, for all and ,

$$\|G_{flip}(w;X)\|_\infty \le \frac{1}{(1-p_{\max})^2}\exp\left(\frac{\lambda}{1-p_{\max}}\right)$$

where $p_{\max} = \max_i p_i$.

We defer the proof to the appendix as it follows roughly the same ideas as the proof for the missing-data model.

## 4 Main Result

Combining the techniques presented in the previous sections, our main algorithm (Algorithm 2) gives us the following guarantees.

###### Theorem 3.

For known constants , given samples with missing data from an unknown -variable Ising model , for with large enough constant , Algorithm 2 returns such that:

$$\forall\, i \in [n-1],\quad \left|\Pi^{n-1}_{\Delta\to B}(\bar{w})_i - A_{ni}\right| \le \epsilon.$$
###### Proof.

Observe that is a new draw and is independent of and therefore , thus is an unbiased estimator of (using Lemma 5 and 6) conditioned on . Applying Theorem 2 gives us that the output of Algorithm 2, , satisfies

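$$\tilde{S}(\bar{w}) - \tilde{S}(w^*) \le \frac{4\lambda}{(1-p_{\max})^2}\exp\left(\frac{\lambda}{1-p_{\max}}\right)\left(\sqrt{\frac{\log n}{T}} + \sqrt{\frac{2}{T}\log\frac{1}{\delta}}\right).$$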

Recall that where satisfies for such that else 0. By definition of the mappings, we can see that and . Using Lemma 2, we have , thus choosing and applying the RSC property of (see Lemma 3),

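$$\begin{aligned}
\|\bar{v} - v^*\|_\infty^2 &\le (1+\lambda)\exp(3\lambda)\left(S(\bar{v}) - S(v^*)\right) \\
&= (1+\lambda)\exp(3\lambda)\left(\tilde{S}(\bar{w}) - \tilde{S}(w^*)\right) \\
&\le \frac{4(1+\lambda)\lambda}{(1-p_{\max})^2}\exp\left(\lambda\left(\frac{1}{1-p_{\max}} + 3\right)\right)\left(\sqrt{\frac{\log n}{T}} + \sqrt{\frac{2}{T}\log\frac{1}{\delta}}\right).
\end{aligned}$$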

Thus for given we have

$$\forall\, i \in [n-1],\quad |\bar{v}_i - v^*_i| = |\bar{v}_i - A_{ni}| \le \epsilon.$$

Assuming as a constant gives us the required result.

Similarly for flipped data, we get:

###### Theorem 4.

For known constants , given samples with flipped data from an unknown -variable Ising model , for with large enough constant , Algorithm 2 returns such that:

$$\forall\, i \in [n-1],\quad \left|\Pi^{n-1}_{\Delta\to B}(\bar{w})_i - A_{ni}\right| \le \epsilon.$$

Setting in the above theorems where is the smallest edge weight will recover the neighborhood of vertex exactly. To recover the entire graph, we can run Algorithm 2 for each vertex using the same samples. Thus with probability , using samples we will recover the entire graph. The runtime for Algorithm 2 is as the unbiased estimator requires time to compute. Thus, the overall runtime to recover the entire graph is .

## 5 Extension to unknown p

In this section, we will show how to extend the analysis for the missing data model to the case when for all and is unknown. The main approach is to use fresh samples to estimate with such that using samples where corresponds to the number of iterations needed for the SMG algorithm. We can compute an empirical estimate of since we observe when an entry is missing (it is unclear how to do this for the flipped data model). Also note that there is no dependence on since each sample gives estimates for , one for each coordinate. Subsequently we will show that the distribution of samples using and are within total variation distance for constant . It follows that we can use in our SMG algorithm and obtain the same guarantees while losing a factor of in the failure probability.

### 5.1 Estimating p

Prior to running SMG, we will draw samples. Let be the fraction of observed in the samples. Since each is i.i.d., with probability , we have by Chernoff,

$$\Pr\left[|p - \hat{p}| \ge \epsilon\right] \le \exp\left(-\frac{mn\epsilon^2}{2p(1-p)}\right)$$

For , we have with probability , using samples.
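The empirical estimate described in this subsection is simply the fraction of missing entries pooled over all coordinates of all samples. A minimal sketch, assuming missing entries are encoded as `"?"` and using placeholder $\pm 1$ values rather than actual Ising samples (the estimate only depends on which entries are missing):

```python
import random

def estimate_missing_probability(samples, missing="?"):
    """Fraction of missing entries pooled over every coordinate of every sample."""
    total = sum(len(x) for x in samples)
    missing_count = sum(1 for x in samples for entry in x if entry == missing)
    return missing_count / total

rng = random.Random(0)
p_true, n, m = 0.3, 10, 2000   # hypothetical values for illustration
samples = [
    ["?" if rng.random() < p_true else rng.choice([-1, 1]) for _ in range(n)]
    for _ in range(m)
]
p_hat = estimate_missing_probability(samples)
```

Because each of the $m$ samples contributes $n$ independent indicators, the estimate concentrates at a rate governed by $mn$ rather than $m$ alone.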

### 5.2 Distribution closeness using ˆp

Here we will show that the distribution over samples with missing probability is close to the distribution over samples with missing probability . In order to do so, we will first show that is equivalent to the following distribution :

1. Concatenate all samples into a -dimensional vector and permute all coordinates.

2. Select first coordinates and replace with .

3. Unpermute and split back to samples.

Here $\mathrm{Bin}(N, q)$ denotes the binomial distribution over $N$ runs with success probability $q$. It is not hard to see that $D_{perm}$ is equivalent to $D$. Similarly the distribution $\hat{D}_{perm}$, analogous to $D_{perm}$ with $p$ replaced by $\hat{p}$, is equivalent to $\hat{D}$. Thus, we have

$$d_{TV}(D, \hat{D}) = d_{TV}(D_{perm}, \hat{D}_{perm}) = d_{TV}\left(\mathrm{Bin}(nT, p), \mathrm{Bin}(nT, \hat{p})\right).$$

The above follows as $D_{perm}$ and $\hat{D}_{perm}$ differ only in Step 2, which depends only on the closeness of the two binomial distributions.

We will use the following lemma to bound the closeness of binomial distributions.

###### Lemma 7 (Roos (2001)).

For and ,

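$$d_{TV}\left(\mathrm{Bin}(n,p), \mathrm{Bin}(n,p+\delta)\right) \le \frac{\sqrt{e}}{2} \cdot \frac{\theta(\delta)}{(1-\theta(\delta))^2} \quad \text{where} \quad \theta(\delta) := \delta\sqrt{\frac{n+2}{2p(1-p)}}.$$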

The above gives us,

$$d_{TV}(D, \hat{D}) = d_{TV}\left(\mathrm{Bin}(nT, p), \mathrm{Bin}(nT, \hat{p})\right) \le O(\delta).$$

This implies that with probability , the draw from will be identical to that of . Thus, Algorithm 2 would give the desired result with estimate .
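The binomial closeness bound of Lemma 7 can be compared against the exact total variation distance for small parameters. The sketch below computes the distance by direct summation of the two probability mass functions and evaluates the bound as stated, which is only meaningful when $\theta(\delta) < 1$; the parameter values and function names are hypothetical.

```python
import math

def binom_pmf(n, p, k):
    """Probability that Bin(n, p) equals k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def tv_binomial(n, p, q):
    """Exact total variation distance between Bin(n, p) and Bin(n, q)."""
    return 0.5 * sum(abs(binom_pmf(n, p, k) - binom_pmf(n, q, k)) for k in range(n + 1))

def roos_bound(n, p, delta):
    """Right-hand side of Lemma 7; assumes theta(delta) < 1."""
    theta = delta * math.sqrt((n + 2) / (2 * p * (1 - p)))
    return math.sqrt(math.e) / 2 * theta / (1 - theta) ** 2

n, p, delta = 100, 0.3, 0.01
tv = tv_binomial(n, p, p + delta)
bound = roos_bound(n, p, delta)
```

For these values the exact distance sits below the bound, and both shrink linearly as $\delta$ decreases, which is the behavior the argument in this section uses.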

## 6 Conclusions and Open Problems

Our result highlights the importance of choosing the right surrogate loss to obtain noise-tolerant algorithms for learning the structure of Ising models. It would be interesting to know if other regression-based algorithms (e.g., the Sparsitron due to Klivans and Meka (2017)) can learn with independent failures. Other open problems include learning with missing data for distinct and unknown error rates (we can only handle the unknown error-rate case when all are equal) and improving the dependence on in the sample complexity bound.

## References

• Bresler (2015) Guy Bresler. Efficiently learning ising models on arbitrary graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 771–782. ACM, 2015.
• Bresler et al. (2008) Guy Bresler, Elchanan Mossel, and Allan Sly. Reconstruction of markov random fields from samples: Some observations and algorithms. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 343–356. Springer, 2008.
• Bresler et al. (2014) Guy Bresler, David Gamarnik, and Devavrat Shah. Hardness of parameter estimation in graphical models. In Advances in Neural Information Processing Systems, pages 1062–1070, 2014.
• Chen (2011) Yuxin Chen. Learning sparse ising models with missing data. 2011.
• Choi et al. (2010) Myung Jin Choi, Joseph J Lim, Antonio Torralba, and Alan S Willsky. Exploiting hierarchical context on a large database of object categories. 2010.
• Chow and Liu (1968) C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, 14(3):462–467, 1968.
• Dasgupta (1999) Sanjoy Dasgupta. Learning polytrees. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 134–141. Morgan Kaufmann Publishers Inc., 1999.
• Hamilton et al. (2017) Linus Hamilton, Frederic Koehler, and Ankur Moitra. Information theoretic properties of markov random fields, and their algorithmic applications. In Advances in Neural Information Processing Systems, pages 2463–2472, 2017.
• Jaimovich et al. (2006) Ariel Jaimovich, Gal Elidan, Hanah Margalit, and Nir Friedman. Towards an integrated protein–protein interaction network: A relational markov network approach. Journal of Computational Biology, 13(2):145–164, 2006.
• Kakade et al. (2008) Sham Kakade, Dean Foster, and Eyal Even-Dar. (Exponentiated) stochastic gradient descent for l1 constrained problems, 2008.
• Klivans and Meka (2017) Adam Klivans and Raghu Meka. Learning graphical models using multiplicative weights. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 343–354. IEEE, 2017.
• Koller et al. (2009) Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
• Lee et al. (2007) Su-In Lee, Varun Ganapathi, and Daphne Koller. Efficient structure learning of markov networks using L1-regularization. In Advances in neural Information processing systems, pages 817–824, 2007.
• Lindgren et al. (2018) Erik M Lindgren, Vatsal Shah, Yanyao Shen, Alexandros G. Dimakis, and Adam Klivans. On robust learning of ising models. 2018.
• Loh and Wainwright (2011) Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.
• Marbach et al. (2012) Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Andrej Aderhold, Richard Bonneau, Yukun Chen, et al. Wisdom of crowds for robust gene network inference. Nature methods, 9(8):796, 2012.
• Ravikumar et al. (2010) Pradeep Ravikumar, Martin J Wainwright, John D Lafferty, et al. High-dimensional ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
• Roos (2001) Bero Roos. Binomial approximation to the poisson binomial distribution: The krawtchouk expansion. Theory of Probability & Its Applications, 45(2):258–272, 2001.
• Shah and Song (2018) Devavrat Shah and Dogyoon Song. Learning mixture model with missing values and its application to rankings. arXiv preprint arXiv:1812.11917, 2018.
• Valiant (2015) Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem. Journal of the ACM (JACM), 62(2):13, 2015.
• Vuffray et al. (2016) Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: Efficient and sample-optimal learning of ising models. In Advances in Neural Information Processing Systems, pages 2595–2603, 2016.
• Wu et al. (2018) Shanshan Wu, Sujay Sanghavi, and Alexandros G Dimakis. Sparse logistic regression learns all discrete pairwise graphical models. arXiv preprint arXiv:1810.11905, 2018.
• Yang et al. (2012) Eunho Yang, Genevera Allen, Zhandong Liu, and Pradeep K Ravikumar. Graphical models via generalized linear models. In Advances in Neural Information Processing Systems, pages 1358–1366, 2012.
• Yuan and Lin (2007) Ming Yuan and Yi Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.

## Appendix A Omitted Proofs

### A.1 Proof of Lemma 2

Computing the gradient, we have

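$$\begin{aligned}
\nabla S(v^*)_i &= -\mathbb{E}_{Z \sim D}\left[\exp\left(-\sum_{j \mid (n,j) \in E} A_{nj} Z_n Z_j\right) Z_n Z_i\right] \\
&= -\frac{1}{Z}\sum_{z \in \{-1,1\}^n} \exp\left(-\sum_{j \mid (n,j) \in E} A_{nj} z_n z_j\right)\exp\left(\sum_{(i,j) \in E} A_{ij} z_i z_j\right) z_n z_i \\
&= -\frac{1}{Z}\left(\sum_{z_{-n} \in \{-1,1\}^{n-1}} \exp\left(\sum_{i,j \neq n,\ (i,j) \in E} A_{ij} z_i z_j\right) z_i\right)\left(\sum_{z_n \in \{-1,1\}} z_n\right) = 0
\end{aligned}$$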

Since $S$ is convex, the gradient is 0 at $v^*$, and $v^*$ lies in the $\ell_1$-ball of radius $\lambda$ by assumption, $v^*$ is the global minimum.

### A.2 Proof of Lemma 3

To prove the above property, we use Lemma 5 from Vuffray et al. (2016) to get for all ,

$$S(v) - S(v^*) \ge \nabla S(v^*)^T (v - v^*) + \frac{\exp(-\lambda)}{1 + \|v - v^*\|_1}\,(v - v^*)^T H (v - v^*)$$

where satisfies for . The above property in their paper was proved for the empirical loss, however the same proof goes through for the expected loss with being the true covariance matrix instead of the empirical one.

Let