Suppose are unknown parameters, and is an symmetric matrix with non-negative entries with on the diagonal. For , define a p.m.f. by setting
This is the Ising model with coupling matrix , and inverse temperature parameter and magnetization parameter
. The study of Ising models is a growing area that has received significant attention in Statistics and Machine Learning in recent years. Theoretical investigation of the properties of Ising models can be broadly classified into two categories. One branch assumes that the matrix is the unknown parameter of interest, and focuses on estimating it under the assumption that i.i.d. copies are available from the model described in (1.1) (cf. [1, 7, 21, 22] and references therein). Another branch works under the assumption that only one observation is available from the model in (1.1) (cf. [5, 8, 11, 15, 17] and references therein). In this setting, estimation of the whole matrix (which has
entries) is impossible from a vector of size . As such, the standard assumption is that the matrix is completely specified, and the focus is on estimating the parameters . In this direction, the behavior of the MLE for the Curie-Weiss model (when is the scaled adjacency matrix of the complete graph) was studied in , where the authors showed that in the regime , the MLE of is consistent for if is known, and vice versa. They also showed that if both
are unknown, then the joint MLE for the model does not exist with probability 1. This raises the natural question of whether there are other estimators which work in this case. Focusing on the case when is known,  gave general sufficient conditions under which the pseudo-likelihood estimate for is consistent. Building on this approach,  studies the rate of consistency of the pseudo-likelihood estimator at all values of
, demonstrating interesting phase transition properties in the rate of the pseudo-likelihood estimator. The question of joint estimation of for a general matrix was raised in . To the best of our knowledge, this question has not been addressed in the literature. It is the focus of this paper.
1.1 Main results
Throughout this paper we will assume that are unknown parameters of interest, and that the coupling matrix has non-negative entries and is completely known. We will also assume the following two conditions:
Here is a finite constant free of . Note that (1.2) implies that satisfies [2, Eqn (1.10)], as well as ([8, Cond (a),Thm 1.1]), where is the operator norm of . The most common examples of the coupling matrix are scaled adjacency matrices of labelled simple graphs, defined as follows:
For a graph with vertices labelled by , define the coupling matrix by setting
where is the number of edges in the graph .
The scaling of the adjacency matrix by ensures that the resulting Ising model has non-trivial phase transition properties (see e.g. ), which are of much interest in Statistical Physics and Applied Probability. The influence of phase transitions on inference has received recent attention (cf. [5, 19]). Under this scaling, (1.3) holds trivially, as . Condition (1.2) demands that the maximum degree of is of the same order as the average degree. Below we give examples of some graphs for which (1.2) holds. We will also use these as running examples for all our future assumptions.
For a simple graph on vertices, let denote the labelled degrees of .
is a regular graph for some . In this case we have for all , and so (1.2) holds with .
is an Erdös-Renyi graph with parameter , where is fixed. In this case we have
and so (1.2) holds with probability tending to for .
is a bi-regular bipartite graph with parameters defined as follows:
has bipartition sets and , with sizes and respectively, and each vertex in has degree , and each vertex in has degree . Finally assume that
Note that the parameters are related, as we have , and .
In this case we have and , and so (1.2) holds for all large with .
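The degree condition in these examples can be checked numerically. The following Python sketch is an illustration only. It assumes that the scaling in Definition 1.1 multiplies the 0/1 adjacency matrix by n/(2|E|), so that each row sum of the coupling matrix equals the vertex degree divided by the average degree. The function names (`scaled_adjacency`, `max_row_sum`, `cycle_graph`, `erdos_renyi`) are hypothetical and do not come from the paper.

```python
import numpy as np


def scaled_adjacency(adj):
    """Scale a symmetric 0/1 adjacency matrix by n / (2|E|).

    Assumption: this is the scaling intended in Definition 1.1.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    num_edges = adj.sum() / 2.0  # each edge is counted twice in a symmetric matrix
    if num_edges == 0:
        raise ValueError("graph has no edges")
    return adj * n / (2.0 * num_edges)


def max_row_sum(coupling):
    """Largest row sum; under the assumed scaling this is max degree / average degree."""
    return float(coupling.sum(axis=1).max())


def cycle_graph(n):
    """Adjacency matrix of the cycle on n >= 3 vertices (a 2-regular graph)."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj


def erdos_renyi(n, p, rng):
    """Adjacency matrix of an Erdos-Renyi graph: each pair is an edge with probability p."""
    upper = np.triu(rng.random((n, n)) < p, k=1)  # strict upper triangle only
    return (upper | upper.T).astype(float)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for name, adj in [("cycle", cycle_graph(200)),
                      ("Erdos-Renyi", erdos_renyi(200, 0.3, rng))]:
        coupling = scaled_adjacency(adj)
        print(name, "total sum:", coupling.sum(), "max row sum:", max_row_sum(coupling))
```

For a regular graph every row sum equals 1 under this scaling, while for a dense Erdös-Renyi graph the largest row sum stays close to 1, in line with the examples above.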
We will now introduce the bivariate pseudo-likelihood estimator.
For any we have
On taking and differentiating this with respect to we get the vector , where
The bivariate equation
will be referred to as the pseudo-likelihood equation in this paper. If the pseudo-likelihood equation has a unique root in , denote it by . This is the pseudo-likelihood estimator for the parameter vector .
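As a rough illustration of how the pseudo-likelihood equation might be solved in practice, the Python sketch below assumes the standard form of the model, in which the conditional probability of a spin given the rest is exp(x_i t_i) / (2 cosh t_i) with t_i = beta * m_i + B and m_i the i-th entry of the coupling matrix times the spin vector. Under that assumption the two components of the gradient are the sums of m_i (x_i - tanh t_i) and of (x_i - tanh t_i). The names `beta`, `B`, `score`, `neg_hessian` and `solve_pseudo_likelihood` are illustrative, and the damped Newton iteration is one possible solver, not necessarily the one used by the authors.

```python
import numpy as np


def log_pseudo_likelihood(theta, A, x):
    """Sum over i of log P(X_i = x_i | rest) under the assumed model."""
    beta, B = theta
    t = beta * (A @ x) + B
    # log(2 cosh t) evaluated stably as logaddexp(t, -t)
    return float(np.sum(x * t - np.logaddexp(t, -t)))


def score(theta, A, x):
    """Gradient of the log pseudo-likelihood with respect to (beta, B)."""
    beta, B = theta
    m = A @ x
    resid = x - np.tanh(beta * m + B)
    return np.array([m @ resid, resid.sum()])


def neg_hessian(theta, A, x):
    """Negative Hessian of the log pseudo-likelihood (positive semi-definite)."""
    beta, B = theta
    m = A @ x
    w = 1.0 - np.tanh(beta * m + B) ** 2  # sech^2 weights
    return np.array([[np.sum(m * m * w), np.sum(m * w)],
                     [np.sum(m * w), np.sum(w)]])


def solve_pseudo_likelihood(A, x, start=(0.0, 0.0), tol=1e-9, max_iter=100):
    """Newton iteration with step halving; raises if no root is found.

    np.linalg.solve raises LinAlgError when the Hessian is singular,
    for example when all entries of A @ x coincide.
    """
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float)
    theta = np.array(start, dtype=float)
    for _ in range(max_iter):
        g = score(theta, A, x)
        if np.max(np.abs(g)) < tol:
            return theta
        step = np.linalg.solve(neg_hessian(theta, A, x), g)
        base = log_pseudo_likelihood(theta, A, x)
        slack = 1e-12 * (1.0 + abs(base))  # tolerate rounding error
        t = 1.0
        # halve the step until the objective does not decrease
        while log_pseudo_likelihood(theta + t * step, A, x) < base - slack:
            t *= 0.5
        theta = theta + t * step
    raise RuntimeError("no convergence; the root may not exist for this sample")


if __name__ == "__main__":
    n = 10
    idx = np.arange(n)
    adj = np.zeros((n, n))
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    A = adj * n / adj.sum()  # assumed scaling n / (2|E|)
    x = np.array([1, 1, 1, 1, 1, -1, -1, -1, 1, -1], dtype=float)
    print(solve_pseudo_likelihood(A, x))
```

The solver returns whatever root it finds in the plane; checking that the root lies in the parameter set of interest is left to the caller.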
Let denote the set of all parameters such that .
Our first result gives a general upper bound on the error of the pseudo-likelihood estimator.
The following is an immediate corollary of Theorem 1.7.
In the setting of Theorem 1.7, if we further have
then under , i.e. the pseudo-likelihood estimator is jointly consistent.
Corollary 1.8 shows that (1.4) is a sufficient condition for consistency of the pseudo-likelihood estimate. Note that (1.4) is an implicit condition, and it is not clear when it holds. We will now give an exact characterization of (1.4) in terms of the matrix for “mean field” matrices, introduced in the following definition.
We say that a sequence of matrices satisfies the mean field condition, if we have
Condition (1.5) was first introduced in  to study the limiting behavior of the normalizing constant of Ising and Potts models. In particular, if , where is the adjacency matrix of a graph, then (1.5) holds iff . Indeed, this is because
which is iff . Thus (1.5) holds in the following examples:
is a regular graph with . In this case we have .
is an Erdös-Renyi graph with parameter . In this case we have .
is a convergent sequence of dense graphs converging to the graphon which is not identically . In this case we have .
is a bi-regular bipartite graph with parameters as in Definition 1.5, such that . In this case we have , and so .
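The arithmetic behind these examples can be sketched numerically. The snippet below is a hedged illustration: it assumes that (1.5) requires the sum of squared entries of the coupling matrix to be of smaller order than n, and that the coupling matrix is the adjacency matrix scaled by n/(2|E|). Neither formula is displayed in the text above, so both are assumptions, and the function name `squared_entry_ratio` is hypothetical. Under these assumptions the sum of squared entries divided by n equals n/(2|E|), the reciprocal of the average degree.

```python
import numpy as np


def squared_entry_ratio(adj):
    """Return (1/n) * sum of squared entries of the scaled adjacency matrix.

    Assumption: the coupling matrix is adj * n / (2|E|), so the returned
    value equals n / (2|E|), i.e. one over the average degree.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    num_edges = adj.sum() / 2.0
    coupling = adj * n / (2.0 * num_edges)
    return float(np.sum(coupling ** 2) / n)


def complete_graph(n):
    """Adjacency matrix of the complete graph on n vertices."""
    return np.ones((n, n)) - np.eye(n)


def cycle_graph(n):
    """Adjacency matrix of the cycle on n >= 3 vertices."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj


if __name__ == "__main__":
    for n in (50, 200, 800):
        # The ratio shrinks with n for the complete graph (growing degree)
        # but stays at 1/2 for the cycle (bounded degree).
        print(n, squared_entry_ratio(complete_graph(n)), squared_entry_ratio(cycle_graph(n)))
```

Under the stated assumptions, the ratio tends to zero exactly when the average degree grows without bound, which matches the degree requirement in the examples listed above.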
Given the coupling matrix , let denote the row sum of .
Our next result now gives a simple sufficient condition for joint consistency of the pseudo-likelihood estimator.
Note that (1.6) and (1.2) together imply (1.3). Thus if is the scaled adjacency matrix of a graph with , the pseudo-likelihood estimator is consistent whenever the graph is slightly irregular. In particular, the pseudo-likelihood estimator is consistent in the following examples:
is a convergent sequence of dense graphs converging to the graphon such that the function is not constant almost everywhere with respect to Lebesgue measure. In this case we have
is a bi-regular bipartite graph with parameters as defined above, such that , and .
This raises the natural question as to what happens for regular graphs. The following theorem addresses this question by showing that whenever the coupling matrix is mean field and asymptotically regular, the random variable is .
Theorem 1.12, together with the upper bound of Theorem 1.7, suggests that consistency may not be attained by the pseudo-likelihood estimator for asymptotically regular graphs with degree going to . The following theorem confirms this conjecture for the special case when is an Erdös-Renyi graph with parameter , free of .
Suppose is an observation from the Ising model (1.1), where the coupling matrix is , where is a random graph from , the Erdös-Renyi graph with parameter , free of . Let be fixed, and let
Let denote the joint law of and on . Then, setting to be product measure on under which , we have that is contiguous to for every . Consequently, under there does not exist any sequence of estimates (functions of ) which is consistent for in (and hence in ).
It was pointed out in  that the MLE for does not exist for the Curie-Weiss model. The above theorem extends this by showing that consistent estimates do not exist when the underlying graph is Erdös-Renyi. Note that if we set in the Erdös-Renyi model we get a complete graph on vertices, which corresponds to the Curie-Weiss model. We conjecture that there are no consistent estimates for both parameters whenever the graph sequence is regular with degree going to .
If the average degree of a graph sequence does not go to , joint estimation of both parameters at rate is always possible, as shown in the following theorem.
Note that (1.2) and (1.8) together imply (1.3). To see how (1.8) captures sparse graphs, recall that for any graph with adjacency matrix and we have
Thus if (1.8) holds, then we must have , which says that the graph sequence is sparse. In particular Theorem 1.15 shows consistency when the underlying graph has a uniformly bounded degree sequence, irrespective of whether is regular or not.
To complete the picture, we show that if one of the two parameters is known, then the pseudo-likelihood estimator for the other parameter is consistent, for all . Thus joint estimation is indeed a much harder problem than estimation of the individual parameters. The proof of this proposition appears in the appendix.
1.2 Interpretation of results for graphs
Even though all our results apply to general matrices with non-negative entries, the most interesting examples for our theorems are the cases when is the scaled adjacency matrix of a simple graph as in Definition 1.1. In this case the conditions also take a simpler form. This subsection describes all our results in this special case. Recall that are the degrees of , and let denote the average degree. Also assume that , as has been the case throughout the paper. Finally note that (1.2) is equivalent to , which we will assume throughout this subsection.
For any graph , the pseudo-likelihood estimate of is consistent if is known, and vice versa (Proposition 1.16).
If , and the graph is somewhat irregular as captured by the condition
then the pseudo-likelihood estimator for is jointly consistent (Theorem 1.11).
If and the graph is somewhat regular as captured by the condition
then we believe that the pseudo-likelihood estimator for is not jointly consistent (Theorem 1.12). The only reason this statement is suggestive and not rigorous is that Theorem 1.7 only provides an upper bound and not a matching lower bound.
For the particular case when is an Erdös-Renyi random graph with parameter free of , there are no estimators which are jointly consistent for (Theorem 1.13). Thus indeed the estimation problem is harder on asymptotically regular graphs with large degree.
If is a graph with bounded, then the pseudo-likelihood estimator for is jointly consistent (Theorem 1.15), irrespective of whether is regular or not.
Figure 1 summarizes the above discussion as a tree.
Our results demonstrate a dichotomy in the joint -consistency of based on whether the coupling matrix is approximately regular or not. In what follows, we illustrate this dichotomy using simulation. First, we fix different values of the pair on the line for . Next, we draw two random -regular graphs and with and , with nodes. For each value of , we generate a sample from the Ising model with scaled adjacency matrices for the graphs and . On each of these samples, we estimate by solving the bivariate pseudo-likelihood equation. We repeat the same experiment with number of nodes , and random -regular graphs (). In Figure 2, we plot the corresponding pseudo-likelihood estimates of for and respectively. In both figures, the estimates for the case (resp. ) are colored green (resp. red). Notice that the fit is visibly better in the case when than in the case when .
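The sampling step of such an experiment could be sketched as follows. The text does not say which sampler or graph generator was used, so everything here is an assumption: a single-site Gibbs sampler stands in for the sampling method, a deterministic circulant graph stands in for a random regular graph, and the names `circulant_regular_graph`, `gibbs_sample`, `beta` and `B` are hypothetical. The conditional probability used is the standard one for a model with pairwise interaction strength `beta` and external field `B`.

```python
import numpy as np


def circulant_regular_graph(n, d):
    """Adjacency matrix of a d-regular circulant graph (d even, 0 < d < n).

    Vertex i is joined to i +/- 1, ..., i +/- d/2 modulo n. This is a
    deterministic stand-in for a random d-regular graph.
    """
    if d % 2 != 0 or not 0 < d < n:
        raise ValueError("need an even degree d with 0 < d < n")
    adj = np.zeros((n, n))
    idx = np.arange(n)
    for k in range(1, d // 2 + 1):
        adj[idx, (idx + k) % n] = 1.0
        adj[(idx + k) % n, idx] = 1.0
    return adj


def gibbs_sample(A, beta, B, n_sweeps, rng):
    """Run n_sweeps systematic-scan Gibbs sweeps and return the final spin vector.

    Assumed conditional law: P(X_i = +1 | rest) = 1 / (1 + exp(-2 (beta m_i + B))),
    where m_i is the i-th entry of A @ x and A has zero diagonal.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    x = rng.choice([-1.0, 1.0], size=n)  # random initial configuration
    for _ in range(n_sweeps):
        for i in range(n):
            m_i = A[i] @ x  # the zero diagonal means x[i] does not contribute
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * (beta * m_i + B)))
            x[i] = 1.0 if rng.random() < p_plus else -1.0
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 200, 6
    A = circulant_regular_graph(n, d) / d  # n / (2|E|) equals 1/d for a d-regular graph
    x = gibbs_sample(A, beta=0.5, B=0.2, n_sweeps=200, rng=rng)
    print("average spin:", x.mean())
```

The number of sweeps needed for the chain to mix is not addressed here and would have to be chosen with care, especially at low temperature.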
The rest of the paper is organized as follows. Section 2 details the proof of Theorem 1.7. Section 3 proves Theorem 1.11 and Theorem 1.12 with the help of Theorem 3.2, the proof of which is deferred to the appendix. Finally, Section 4 gives the proofs of Theorem 1.13 and Theorem 1.15. The proof of Proposition 1.16 is also deferred to the appendix.
We gratefully acknowledge helpful discussions with Bhaswar B. Bhattacharya and Sourav Chatterjee at various stages of this work.
2 Proof of Theorem 1.7
The following Lemma is a collection of estimates to be used throughout the rest of this paper.
the following hold:
Proof of Lemma 2.1.
Various versions of these estimates exist already in the literature. In particular, (2.1) follows on invoking [9, Lemma 3.1] or [2, Lemma 3.2] along with the assumption that satisfies (1.5), and (2.2) follows on invoking [9, Lemma 3.2] along with the assumption that satisfies (1.2). Finally, (2.3) follows as an easy consequence of [19, Lemma 1].
2.1 Proof of Theorem 1.7
Note that . Differentiating the function twice, we get the negative Hessian matrix given by
where . The determinant of the Hessian is given by
Since on we have it follows that the Hessian is negative definite, and so the function is strictly concave. To show that there exists a global maximizer , it thus suffices to show that
To see this, note that implies there exists such that , and . Since we have
on letting gives A similar argument shows that if , then Finally, it is immediate that for any . Thus there exists a unique global maximum for the function , and so is the unique root of .
We will now show that if for some , then the pseudo-likelihood estimator is not defined.
On this set we have which implies for all . This implies that , and so the equation is equivalent to
Since the function is concave, it follows that any satisfying this equation is a global maximizer. Hence in this case the set of maximizers is a line in the two-dimensional plane, so the maximizer is not unique. Thus the pseudo-likelihood estimator is not defined.
On this set we have
so the equation has no roots in , and hence the pseudo-likelihood estimator is not defined.