The Ising model is a popular mathematical model inspired by ferromagnetism in statistical mechanics. The model consists of discrete random variables representing magnetic dipole moments of atomic spins. The spins are arranged in a graph—originally a lattice, but other graphs have also been considered—allowing each spin to interact with its graph neighbors. Sometimes, the spins are also subject to an external magnetic field.
The Ising model is one of many possible mean field models for spin glasses. Its probabilistic properties have caught the attention of many researchers—see, e.g., the monographs of Talagrand [talagrand-2003, talagrand-2010, talagrand-2011]. The analysis of social networks has brought computer scientists into the fray, as precisely the same model appears there in the context of community detection [berthet].
In this work we view an Ising model as a probability distribution on, and consider the following statistical inference and learning problem, known as density estimation or distribution learning: given i.i.d. samples from an unknown Ising model on a known graph , can we create a probability distribution on that is close to in total variation distance? If we have samples, then how small can we make the expected value of this distance? We prove that if has edges, the answer to this question is bounded from above and below by constant factors of . In the case when there is no external magnetic field, the answer is instead .
Our techniques carry over to the continuous case and allow us to prove a similar minimax rate for learning the class of -dimensional normal undirected graphical models on . It is surprising that the minimax rate for this class was not known, even when is the complete graph, corresponding to the class of all -dimensional normal distributions.
1.1. Main results
We start by stating our result for normal distributions. For precise definitions of all terms mentioned below, see Section 2.
Theorem 1.1 (Main result for learning normals).
Let be a given undirected graph with vertex set and edges. Let be the class of -dimensional multivariate normal undirected graphical models with respect to . Then, the minimax rate for learning in total variation distance is bounded from above and below by constant factors of .
The upper bound follows from standard techniques (see Section 3.1) and a lower bound of is known (see Section 1.2); our main technical contribution is to show a lower bound of , from which Theorem 1.1 follows. This theorem immediately implies a tight result on the minimax rate for learning the class of all -dimensional normals, if we take the graph to be complete. In this specific case, the upper bound is already known, so our contribution is the matching lower bound.
The minimax rate for learning the class of all -dimensional multivariate normal distributions in total variation distance is bounded from above and below by constant factors of .
We remark that for the class of mean-zero normal undirected graphical models, we prove a lower bound of , while the best known upper bound is . In practice, the underlying graph is typically connected, which means that , so these bounds match.
We prove similar rates as in Theorem 1.1 for the class of Ising models, which resemble discrete versions of multivariate normal distributions. An Ising model in dimension is supported on and comes with an undirected graph with vertex set , edge set , interactions for each , and external magnetic field for such that appears with probability proportional to
Note that our definition has no temperature parameter; we have absorbed it into the weights.
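As a concrete illustration, the following minimal sketch enumerates an Ising distribution on a small graph by brute force. It assumes the standard form in which a configuration x in {-1, 1}^d has probability proportional to exp(sum over edges of w_ij x_i x_j + sum over vertices of h_i x_i); the function name `ising_pmf` and its arguments are hypothetical and not taken from the paper.

```python
from itertools import product
from math import exp

def ising_pmf(d, weights, field=None):
    """Return {x: P(x)} for x in {-1, 1}^d by brute-force enumeration.

    weights: dict mapping an edge (i, j) to its interaction w_ij (assumed form).
    field:   optional list of d external-field values h_i (defaults to zero).
    """
    if field is None:
        field = [0.0] * d
    unnormalized = {}
    for x in product((-1, 1), repeat=d):
        energy = sum(w * x[i] * x[j] for (i, j), w in weights.items())
        energy += sum(h * xi for h, xi in zip(field, x))
        unnormalized[x] = exp(energy)
    z = sum(unnormalized.values())  # partition function
    return {x: p / z for x, p in unnormalized.items()}

# A path graph on 3 vertices with no external field.
pmf = ising_pmf(3, {(0, 1): 0.5, (1, 2): -0.3})
```

The enumeration takes time exponential in d, so it is only meant for checking small cases. Without an external field, flipping every spin leaves the probability unchanged.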
Theorem 1.3 (Main result for learning Ising models).
Let be a given undirected graph with vertex set and edges. Let be the class of -dimensional Ising models with underlying graph .
The minimax rate for learning in total variation distance is bounded from above and below by constant factors of .
Let be the subclass of Ising models with no external magnetic field. The minimax rate for learning in total variation distance is bounded from above and below by constant factors of .
In all of the above cases, the full structure and labeling of the underlying graph is known in advance. We next consider the case in which it is only known that the underlying graph has vertices and edges.
Let and be the class of all normal and Ising undirected graphical models with respect to some unknown graph with vertices and edges. The minimax learning rates for and are both bounded from above by a constant factor of , and bounded from below by a constant factor of .
The lower bound in this theorem follows immediately from our lower bounds for the case in which the graph is known.
1.2. Related work
Density estimation is a central problem in statistics and has a long history [devroye-course, devroye-gyorfi, ibragimov, tsybakov]. It has also been studied in the learning theory community under the name distribution learning, starting from [Kearns], whose focus is on the computational complexity of the learning problem. Recently, it has gained a lot of attention in the machine learning community, as one of the important tasks in unsupervised learning is to understand the distribution of the data, which is known to significantly improve the efficiency of learning algorithms (e.g., [deeplearningbook, page 100]). See [Diakonikolas2016] for a recent survey from this perspective.
An upper bound on the order of for estimating -dimensional normals can be obtained via empirical mean and covariance estimation (e.g., [2017-abbas, Theorem B.1]) or via Yatracos’ techniques based on VC-dimension (e.g., [abbas-mixtures, Theorem 13]). Regarding lower bounds, Acharya, Jafarpour, Orlitsky, and Suresh [acharya-lower-bound, Theorem 2] proved a lower bound on the order of for spherical normals (i.e., normals with identity covariance matrix), which implies the same lower bound for general normals. The lower bound for general normals was recently improved to a constant factor of by Ashtiani, Ben-David, Harvey, Liaw, Mehrabian, and Plan [2017-abbas]. In comparison, our result shaves off the logarithmic factor. Moreover, their result is nonconstructive and relies on the probabilistic method, while our argument is fully deterministic.
For the Ising model, the main focus in the literature has been on learning the structure of the underlying graph rather than learning the distribution itself, i.e., how many samples are needed to reconstruct the underlying graph with high probability? See [Santhanam-lower, Shanmugam-lower] for some lower bounds and [Hamilton-upper, Klivans-upper] for some upper bounds. Otherwise, the Ising model itself has been studied by physicists in other settings for nearly a century. See the books of Talagrand for a comprehensive look at the mathematics of the Ising model [talagrand-2003, talagrand-2010, talagrand-2011].
Daskalakis, Dikkala, and Kamath [costis-2018] were the first to study Ising models from a statistical point of view. However, their goal is to test whether an Ising model has certain properties, rather than estimating the model, which is our goal. Moreover, their focus is on designing efficient testing algorithms. They prove polynomial sample complexities and running times for testing various properties of the model.
An alternative goal would be to estimate the parameters of the underlying model (e.g., [KMV]) rather than coming up with a model that is statistically close, which is our focus. We remark that these two goals are quantitatively different, although similar techniques may be used for both. In general, estimating the parameters of a model to within some accuracy does not necessarily result in a distribution that is close to the original distribution in a statistical sense. For instance, define
and observe that and are entrywise very close. However, is non-singular and is singular, and thus two mean-zero normal distributions with covariance matrices and are at total variation distance $1$ from one another. Conversely, if two distributions are close in total variation distance, their parameters are not necessarily close to within the same accuracy.
The goal of density estimation is to design an estimator for an unknown function taken from a known class of functions . In the continuous case, is a class of probability density functions with sample space for some ; in the discrete case, is a class of probability mass functions with a countable sample space . In either case, in order to create the estimator , we have access to samples . Our measure of closeness is the total variation (TV) distance: for functions , their TV-distance is defined as
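\[ \mathrm{TV}(f, g) = \tfrac{1}{2} \|f - g\|_1, \]
where for any function $f$, the $L^1$-norm of $f$ is defined as
\[ \|f\|_1 = \int |f(x)| \,\mathrm{d}x \]
in the continuous case, and
\[ \|f\|_1 = \sum_{x} |f(x)| \]
in the discrete case.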
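Further along, we will also need the Kullback-Leibler (KL) divergence or relative entropy [kullback-book], which is another measure of closeness of distributions defined by
\[ \mathrm{KL}(f \,\|\, g) = \int f(x) \log \frac{f(x)}{g(x)} \,\mathrm{d}x \]
in the continuous case, and
\[ \mathrm{KL}(f \,\|\, g) = \sum_{x} f(x) \log \frac{f(x)}{g(x)} \]
in the discrete case.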
Formally, in the continuous case, we can write for a probability measure and the Lebesgue measure on , and in the discrete case for a probability measure and the counting measure on countable . In view of this unified framework, we say that is a class of densities and that is a density estimate, in both the continuous and the discrete settings. The total variation distance has a natural probabilistic interpretation as , where and are probability measures corresponding to and , respectively. So, the TV-distance lies in $[0, 1]$. Also, it is well known that the KL-divergence is nonnegative, and is zero if and only if the two densities are equal almost everywhere. However, it is not symmetric in general, and can become $+\infty$.
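The discrete versions of these two quantities are easy to compute directly. The sketch below is only an illustration: it represents a probability mass function as a dictionary, uses the natural logarithm (an assumption; the base is not fixed in the text above), and the names `tv_distance` and `kl_divergence` are hypothetical.

```python
from math import log, inf

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between two pmfs."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def kl_divergence(p, q):
    """KL(p || q) with natural logarithm; infinite if p puts mass where q has none."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return inf
        total += px * log(px / qx)
    return total

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
r = {"a": 1.0}
```

Here `kl_divergence(p, r)` is infinite while `kl_divergence(r, p)` is finite, which illustrates both the asymmetry and the possibility of an infinite value, whereas `tv_distance` is symmetric and never exceeds 1.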
For density estimation there are various possible measures of distance between distributions. Here we focus on the TV-distance since it has several appealing properties, such as being a metric and having a natural probabilistic interpretation. For a detailed discussion on why TV is a natural choice, see [devroye-lugosi, Chapter 5]. If is a density estimate, we define the risk of the estimator with respect to the class as
where the expectation is over the i.i.d. samples from , and possible randomization of the estimator. The minimax risk or minimax rate for is the smallest risk over all possible estimators,
For a class of functions defined on the same domain , its Yatracos class is the class of sets defined by
The following powerful result relates the minimax risk of a class of densities to an old well-studied combinatorial quantity called the Vapnik-Chervonenkis (VC) dimension [vapnik-cherv]. Indeed, let be a family of subsets of . The VC-dimension of , denoted by , is the size of the largest set such that for each there exists such that . See, e.g., [devroye-lugosi, Chapter 4] for examples and applications.
Theorem 2.1 ([devroye-lugosi, Section 8.1]).
There is a universal constant such that for any class of densities with Yatracos class ,
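For intuition about the VC-dimension appearing in Theorem 2.1, the following sketch computes it by brute force for a finite family of subsets of a finite domain, directly from the shattering definition given above. The function names are hypothetical, and the search is exponential, so it is only meant for toy examples.

```python
from itertools import combinations

def is_shattered(points, family):
    """True if every subset of `points` equals points ∩ B for some B in family."""
    traces = {frozenset(points) & b for b in family}
    return len(traces) == 2 ** len(points)

def vc_dimension(domain, family):
    """Largest size of a shattered subset of a finite domain (brute force)."""
    family = [frozenset(b) for b in family]
    best = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(s, family) for s in combinations(domain, k)):
            best = k
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best

domain = range(5)
# All intervals {a, ..., b-1} in {0, ..., 4}, including the empty set.
intervals = [set(range(a, b)) for a in range(6) for b in range(a, 6)]
# All initial segments {0, ..., a-1}.
thresholds = [set(range(a)) for a in range(6)]
```

Intervals shatter any pair of points but no triple, since the two outer points cannot be selected without the middle one; initial segments shatter only single points.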
On the other hand, there are several methods for obtaining lower bounds on minimax risk; we emphasize, in particular, the methods of Assouad [assouad], Le Cam [lecam-1, lecam-2], and Fano [has-fano]. Each of these involves picking a finite subclass , and using the fact that , developing a lower bound on the minimax risk of . See [devroye-course, devroye-lugosi, yu-survey] for more details. We will use the following result, known as (generalized) Fano’s lemma, originally due to Khas’minskii [has-fano].
Lemma 2.2 (Fano’s Lemma [yu-survey, Lemma 3]).
Let be a finite class of densities such that
In light of this lemma, to prove a minimax risk lower bound on a class of densities , we shall carefully pick a finite subclass of densities in , such that any two densities in this subclass are far apart in -distance but close in KL-divergence.
Throughout this paper, we will be estimating densities from classes with a given graphical dependence structure, known as undirected graphical models [graphical-models]. The underlying graph will always be undirected and without parallel edges or self-loops, so we will omit these qualifiers henceforth. Indeed, let be a given graph with vertex set and edge set . A set of random variables with everywhere strictly positive densities forms a graphical model or Markov random field (MRF) with respect to if for every , the variables and are conditionally independent given .
Often, the problem of density estimation is framed slightly differently than we have presented it: given , we can be interested in finding the smallest number of i.i.d. samples for which there exists a density estimate based on these samples satisfying . Or, given , we might want to find the minimum number of samples for which there is a density estimate satisfying with probability at least . The quantities and are known as sample complexities of the class . Note that and are related through the equation
so that determining one also determines the other. Moreover, is often fixed to be some small constant like when studying , since it can be shown that all other values of for smaller are within a factor of . Then, there are versions of Theorem 2.1 and Lemma 2.2 for , which introduce some extraneous factors. In order to avoid such extraneous logarithmic factors, we focus on —equivalently, —rather than or .
We now recall some basic matrix analysis formulae which will be used throughout (see Horn and Johnson [matrix_analysis] for the proofs). For a matrix , the spectral norm of is defined as , where is the unit -sphere. Recall also the Frobenius norm of , sometimes also called the Hilbert-Schmidt norm, . When has all real eigenvalues, we write for the -th largest eigenvalue of . In general, we write for the -th largest singular value of . Then, , and for any , . Furthermore, , and , so . For any matrix , we have . Finally, when is invertible, for every .
Throughout this paper, we let denote positive universal constants. We liberally reuse these symbols, i.e., every may differ between proofs and statements of different results. From now on, we denote the set by .
3. Learning Normal Graphical Models
Let be a positive integer, be the set of positive definite matrices over , and denote the multivariate normal distribution with mean , covariance matrix , and corresponding probability density function , where for ,
Let be a given graph with edges. Let be the following subset of all positive definite matrices,
The main result of this section is a characterization of the minimax risk of
It is known that is precisely the class of -dimensional multivariate normal graphical models with respect to [graphical-models, Proposition 5.2].
3.1. Proof of the upper bound in Theorem 1.1
We can already prove the upper bound in Theorem 1.1 without lifting a finger. The proof is similar to that of [abbas-mixtures, Theorem 13], which is for an upper bound on the minimax risk of all multivariate normals, corresponding to the case in which is complete. Let be the Yatracos class of ,
which, after simplification, is easily seen to be contained in the larger class
and thus . It remains to upper-bound .
In general, let be a vector space of real-valued functions, and . Dudley [dudley_vectorspace, Theorem 7.2] proved that . (See [devroye-lugosi, Lemma 4.2] for a historical discussion.) In our case, the vector space has a basis of monomials
so . By Theorem 2.1, there is a universal constant such that
while the upper bound follows simply because the TV-distance is bounded by $1$. ∎
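The dimension count in this argument can be made concrete. The sketch below assumes that the relevant monomials are the constant, the linear terms x_i, the squares x_i^2, and the products x_i x_j over the edges of the graph, which is what the logarithm of a ratio of two normal densities expands into when both inverse covariance matrices vanish off the edges. This basis and the helper name `monomial_basis` are assumptions made for illustration, not a transcription of the paper's display.

```python
def monomial_basis(d, edges):
    """List labels of the monomials spanning the assumed function space."""
    basis = ["1"]
    basis += [f"x{i}" for i in range(d)]
    basis += [f"x{i}^2" for i in range(d)]
    basis += [f"x{i}*x{j}" for (i, j) in edges]
    return basis

edges = [(0, 1), (1, 2), (2, 3)]  # a path on d = 4 vertices, m = 3 edges
basis = monomial_basis(4, edges)
```

Under this assumption the space has dimension 1 + 2d + m, which is linear in m + d; that is the only feature of the count that matters for the rate.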
3.2. Proof of the lower bound in Theorem 1.1
Since a lower bound on the order of for spherical normals was proved in [acharya-lower-bound, Theorem 2], the lower bound in Theorem 1.1 follows from subadditivity of the square root after the following proposition.
There exist such that for any graph with edges, where ,
Note that if , then , which implies the lower bound in Theorem 1.1 in this regime for . We prove Proposition 3.1 via Lemma 2.2. This involves choosing a finite subset of . Our normal densities will be mean-zero, but the covariance matrices will be chosen carefully. To make this choice, we use the next result which follows from an old theorem of Gilbert [gilbert] and independently Varshamov [varshamov] from coding theory.
There is a subset of size at least such that for any distinct , we have .
We give an iterative algorithm to build : choose a vertex from the hypercube, put it in , remove the hypercube points in the corresponding -ball of radius , and repeat. Since the intersection of this ball and the hypercube has size at most
the size of the final set will be at least . ∎
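The iterative construction in this proof can be sketched as follows. The code works with Hamming distance on {-1, 1}^m (for such vectors the l1 distance is exactly twice the Hamming distance) and treats the radius as a free, hypothetical parameter `radius`, since the specific value used in the proof is not reproduced here.

```python
from itertools import product
from math import comb

def hamming(x, y):
    """Number of coordinates where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def greedy_packing(m, radius):
    """Greedily pick points of {-1, 1}^m with pairwise Hamming distance > radius."""
    remaining = list(product((-1, 1), repeat=m))
    chosen = []
    while remaining:
        center = remaining[0]
        chosen.append(center)
        # Discard every point inside the Hamming ball around the new center.
        remaining = [y for y in remaining if hamming(center, y) > radius]
    return chosen

def ball_size(m, radius):
    """Number of hypercube points within Hamming distance `radius` of a fixed point."""
    return sum(comb(m, i) for i in range(radius + 1))

m, radius = 8, 2
code = greedy_packing(m, radius)
```

Each step removes at most `ball_size(m, radius)` points, so the final set has at least `2**m / ball_size(m, radius)` elements, which mirrors the counting step of the proof.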
Let be as in Theorem 3.2, so that and for any distinct , . Let be a real number to be specified later. Enumerate the edges of from to , and for , set to be the matrix with entries
In other words, is symmetric with all ones on its diagonal, everywhere along the nonzero entries of the adjacency matrix of according to the signs in , and elsewhere.
Suppose that . Then, for any , the matrix is positive definite.
Since is symmetric and real, all its eigenvalues are real. Write , so that . Observe that
Then, for every , and so is positive definite. ∎
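A small numerical check of this construction is sketched below. It builds the matrix with ones on the diagonal and plus or minus delta on the entries corresponding to edges, then tests positive definiteness with a hand-written Cholesky factorization. The threshold delta < 1/sqrt(2m) used here is an assumption: it follows from bounding the spectral norm of the signed adjacency matrix by its Frobenius norm sqrt(2m), and it may differ from the condition stated in Lemma 3.3. The function names are hypothetical.

```python
from itertools import product
from math import sqrt

def covariance(d, edges, signs, delta):
    """Identity plus delta * (signed adjacency matrix), as a list of rows."""
    sigma = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for (i, j), s in zip(edges, signs):
        sigma[i][j] = sigma[j][i] = delta * s
    return sigma

def is_positive_definite(a):
    """Attempt a Cholesky factorization; it succeeds iff `a` is positive definite."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    return False
                low[i][i] = sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return True

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]      # a 4-cycle, m = 4
delta = 0.9 / sqrt(2 * len(edges))            # assumed: any delta < 1/sqrt(2m) works
all_pd = all(
    is_positive_definite(covariance(4, edges, signs, delta))
    for signs in product((-1, 1), repeat=len(edges))
)
```

For a much larger delta the conclusion can fail: on the 4-cycle with all signs positive and delta = 0.6, the smallest eigenvalue is 1 - 2(0.6) < 0.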
We will assume from now on that . In light of Lemma 3.3, is positive definite, so it is invertible, and we let denote its inverse. Since we will always take the mean to be , we will write for from now on. We define the set of covariance matrices, and let
There exist such that for any satisfying ,
We consider a symmetrized KL-divergence, often called the Jeffreys divergence [kullback-book],
which clearly serves as an upper bound on the quantity of interest. It is well known that
e.g., by [kullback-book, Section 9.1]. By the Cauchy-Schwarz inequality for the inner product ,
Notice now that , so
Write just as in the proof of Lemma 3.3. Then,
and the same bound holds for , whence . ∎
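For reference, a standard closed form for the symmetrized divergence between two mean-zero normals is recorded below. It assumes the Jeffreys divergence is normalized as the plain sum of the two KL-divergences, with $f_i$ the density of $N(0, \Sigma_i)$ in dimension $d$; the paper's display may be normalized or arranged differently.

```latex
\begin{align*}
  \mathrm{KL}(f_1 \,\|\, f_2)
    &= \tfrac12 \Bigl( \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) - d
       + \log \frac{\det \Sigma_2}{\det \Sigma_1} \Bigr), \\
  \mathrm{KL}(f_1 \,\|\, f_2) + \mathrm{KL}(f_2 \,\|\, f_1)
    &= \tfrac12 \bigl( \operatorname{tr}(\Sigma_2^{-1} \Sigma_1)
       + \operatorname{tr}(\Sigma_1^{-1} \Sigma_2) - 2d \bigr) \\
    &= \tfrac12 \operatorname{tr}\bigl( (\Sigma_1 - \Sigma_2)
       (\Sigma_2^{-1} - \Sigma_1^{-1}) \bigr) \\
    &\le \tfrac12 \, \|\Sigma_1 - \Sigma_2\|_F \,
       \|\Sigma_2^{-1} - \Sigma_1^{-1}\|_F .
\end{align*}
```

The log-determinant terms cancel in the sum, and the last line is the Cauchy-Schwarz inequality for the trace inner product $\langle A, B \rangle = \operatorname{tr}(A^{\mathsf T} B)$ applied to symmetric matrices.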
Unfortunately, the -distance between multivariate normals does not have such a nice expression as the Jeffreys divergence does. To control some of the quantities involved in the computation of the -distance, we recall some properties of sub-gaussian random variables.
The sub-gaussian norm of a random variable is defined to be
A random variable is called sub-gaussian if . Observe in particular that and any bounded random variable are sub-gaussian. Recall now the following well-known large deviation inequality for quadratic forms of sub-gaussian random vectors.
Theorem 3.5 (Hanson-Wright inequality [vershynin-book, Theorem 6.2.1], see also [gabor-concentration, Example 2.12]).
Let be a random vector with independent mean-zero components satisfying , and let . Then, for every ,
for some universal constant .
A square matrix is called zero-diagonal if all its diagonal entries are zero.
Let be a random vector with i.i.d. components where , , and . Let be symmetric and zero-diagonal. Then,
There exists such that for any integer we have
There exist such that for any , if , then
Observation 1 follows simply by writing out the quadratic form,
To prove 2, we expand the square, and notice that only the monomials of the form or are nonzero after taking expectations, so
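The moment computation in this proof can be checked exactly in a small case. The sketch below takes the components to be independent uniform signs, which have mean zero, unit variance, and are bounded (hence sub-gaussian); this choice of distribution and the helper names are assumptions made for illustration. For a symmetric zero-diagonal matrix, the quadratic form then has mean zero and second moment equal to twice the squared Frobenius norm.

```python
from itertools import product

def quadratic_form(a, x):
    """Compute x^T a x for a square matrix given as a list of rows."""
    n = len(x)
    return sum(a[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def rademacher_moments(a):
    """Exact first and second moments of x^T a x for x uniform on {-1, 1}^n."""
    n = len(a)
    values = [quadratic_form(a, x) for x in product((-1, 1), repeat=n)]
    mean = sum(values) / len(values)
    second = sum(v * v for v in values) / len(values)
    return mean, second

# A symmetric, zero-diagonal matrix with integer entries so the sums are exact.
a = [[0, 2, -1],
     [2, 0, 3],
     [-1, 3, 0]]
frobenius_sq = sum(v * v for row in a for v in row)
mean, second = rademacher_moments(a)
```

Only the products in which each coordinate appears an even number of times survive the expectation, which is why the second moment collapses to a sum of squared entries.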
There exist such that for any with and , we have
Write . Then,
and the lower bound follows. Furthermore, observe that
If is sufficiently small, then
Then, since and by symmetry of ,
and again for sufficiently small ,
There are such that for any such that and are zero-diagonal and ,
By Lemma 3.7 and the triangle inequality,
where the expectation is with respect to , a -dimensional standard normal vector. Observe now the following chain of elementary inequalities,
which holds for all . By the triangle inequality again,