Markov random fields (MRFs) are a popular way to model the dependence between a set of random variables. Consider a class of MRFs, , on . Recall that such MRFs have distributions such that, for any is conditionally independent of for such that given the random variables . Thus, with every one can associate an undirected graph capturing the dependencies, called the Markov network structure of the distribution, , where . The Ising model and the Gaussian MRF are instances of MRFs for binary and real-valued random variables, respectively, and are employed in a variety of applications.
An important problem in statistics is the inverse problem of determining the dependencies between a set of random variables given a set of samples drawn from their joint distribution. In the context of MRFs, let be a class of MRFs as before, be parametrised by and the random vector be distributed according to . The inverse problem can be stated in two closely related ways: given i.i.d. samples , one may wish to either learn , which is the model selection problem, or to learn , which is the structure learning problem. These problems are typically studied in the “high-dimensional setting,” where the number of samples is much smaller than the number of variables involved. A large body of work has focussed on constructing structure learning schemes for the Gaussian and Ising models, including algorithms and regularised estimators, with a particular focus on the technical conditions and the number of samples required for consistent learning (see [anandkumar2012high, meinshausen2006, ravikumar2010high] and references therein for the Gaussian MRF, and [montanari2009graphical, Bresler:2015:ELI:2746539.2746631, ravikumar2010high, anandkumar2012high_ising] and references therein for the Ising model). A relatively smaller set of papers has studied the hardness of structure learning in terms of simple graph properties, providing necessary and sufficient information-theoretic conditions on the number of samples required for reliable structure learning (see [WanWaiRam] and [SanWai, TandomShanmugamRavikumarDimakis] for the Gaussian and Ising MRFs, respectively).
Recently, the allied problem of change detection in MRFs has received attention. Let be some parametrised members of respectively. The basic question is whether, given samples , and samples one can estimate (structural change detection) or (general change detection) well. In the following, we concentrate on structural change detection, and largely suppress the adjective structural, since all results hold for the general problem as well (and also since including it makes for clunky prose). Structural change detection has received both practical interest, for instance in the study of controlled experiments [zhang2012learning], in genetics [zhao2014direct, xia2015testing], and in neuroscientific contexts [belilovsky2016testing], and theoretical interest, focused on change detection algorithms along with the study of their sample complexities and consistency conditions [zhao2014direct, FazBan, liu2017, liu2017learning]. A related line of work is pursued in [daskalakis2016testing], which studies, amongst other things, the sample complexity of goodness-of-fit testing for Ising models under statistical measures of difference, i.e., given and where are Ising models, it studies the problem of estimating where is the Kullback-Leibler (KL) divergence. We note that our proof technique also relies on analysing a goodness-of-fit test, but against structural distance measures.
In this paper, we take an information-theoretic approach to the change detection problem for the Ising and the Gaussian MRFs. Following the structure learning literature, we study the problem for the simple case of distributions that have Markov network structures with maximum degree bounded by , denoted , and study the same against a minimax risk of misdeclaring a change, or the lack thereof, which is denoted . We allow Ising models where the non-zero are lower bounded by some and upper bounded by some , and Gaussian MRFs where the ratios of the non-zero off-diagonal entries to the corresponding diagonal entries are lower bounded by some . Our results provide lower bounds on the sample complexity in terms of and or , for any detection method that makes small, and use simple ensembles of possible changes to do so. Our proof technique is symmetric in the two datasets, and hence the results are lower bounds on . Interestingly, under mild conditions on the parameters, these bounds improve upon all the known lower bounds on the sample complexity of structure learning, and are at most separated from the sample complexity of the maximum likelihood structure learner. This suggests (but does not prove) that, at least for the Ising and Gaussian MRFs on , change detection is as hard, in terms of its data cost, as structure learning.
1.1 Comparison to Prior Work
Note that one can naively perform change detection by estimating the network structures of the distributions underlying the two sets of samples and comparing them. This method is generally considered profligate, especially in the practically important case where the underlying models might be dense, but the differences between them are small (all the applied work mentioned above falls into this category). One possible justification for this comes from the compressed sensing literature, which has demonstrated that in certain model classes learning sparse changes between two models can be significantly more data efficient than learning either of the models. Thus, a widely believed “folk theorem” asserts the existence of schemes for change detection in the Ising and Gaussian MRFs that can handle wide ranges of model classes such that their sample requirements scale only with the sparsity of the changes to be detected, rather than the complexity of the underlying models. The previous work mentioned, [zhao2014direct, FazBan, liu2017, liu2017learning], all develop algorithms that estimate certain functionals of the ratio of the probability densities of the underlying models via a regularised optimisation scheme, and detect change on the basis of this estimate with sample complexities that scale as , where . However, these results only hold under strong technical conditions, e.g. structural assumptions on the Fisher information matrices of the models considered, and as previously noted in [montanari2009graphical, anandkumar2012high], such conditions do not easily provide a clear description of which classes of graphs and models satisfy them in direct terms, that is, do not provide a non-trivial class such that these results hold for every . Further, as [Hein2016] points out for the case of Gaussian structure learning, such conditions can fail to hold in subtle ways for application-relevant data.
The nature of our results is at odds with the conventional wisdom (e.g., the claims in [FazBan, liu2017learning]) that sparse changes can be easy to detect irrespective of the ranges of the underlying parameters and the graph classes they live in. We believe that this disconnect is because of the strictness of the conditions required for these results to hold, and the regimes they implicitly push the underlying models into. In particular, as previously noted, our lower bounds on sample complexity closely match all known lower bounds on structure learning in under mild conditions on the parameters. Crucially, the ensembles we construct to demonstrate these bounds have the sparsest possible changes - We note that this dependence on the parameters and is nontrivial - for instance, for the Ising model, if , then the sample complexity lower bound scales as , which is exponential in , and even if one is allowed to cherry pick to minimise the lower bound, one needs at least samples to detect change, irrespective of any incoherence or dependency conditions that may be imposed. For the Gaussian, our bound matches the scaling of the corresponding structure learning bound. Since this is known to be tight up to constant factors in certain regimes, this shows that the naive scheme for change detection is minimax optimal in at least some contexts.
Lastly, our lower bounds on Ising model change detection sample complexity for structured models are actually stronger than those in [SanWai] in certain parameter regimes. Since change detection is reducible to structure learning, these are also new bounds for the latter. We further note that our proof technique, which is currently un-optimised, relies upon lower bounding the -distance of certain distributions, and thus differs from the Fano bounds of the previous literature. Since the technique is largely elementary and provides improvements on the previous results, it may be of independent interest and merit refinement.
Organisation: §2 defines the models considered, and precisely formulates the problem considered, while §3 states the results and compares them with previous work. §4 begins by laying out the common technical development involved in the proof, and follows this with the actual proofs.
Notation: For a natural , we use as shorthand for the set . denote random vectors, usually of dimension or , with -valued entries. For a natural is the component of . Similarly, for a set of naturals, is the vector . We use to denote an -length sequence of i.i.d. random vectors, frequently referred to as a ‘dataset’. For a distribution , and given , is the distribution of . Further, given the sample in that dataset is denoted . Lastly, we use and diacritical modifications of the same for partition functions whenever they are relevant. The two element sets are interchangeably denoted when referring to edges in an undirected graph, and as just when they appear in a subscript. Vectors in are indexed by cardinality-two subsets of . For instance, a vector is represented as . The identity matrix of size is denoted as .
For functions if there exists a positive constant such that and if . Similarly, if and if .
2 Definitions and Problem Statement
We begin the paper with a bit of background on Markov random fields (MRFs) and a precise problem statement, which also serve to establish the notation followed in the rest of the paper. Note that while we define the problem over a general MRF, our results are restricted to the Ising and the Gaussian MRFs. We define the problem in this generality since we develop a single technique for establishing lower bounds on the sample complexity and apply it in parallel to both of the models considered.
2.1 Markov Random Fields
Recall that an undirected finite simple graph consists of a finite vertex set , here identified with , and edge set which is a set of subsets of of cardinality . For , we let be the set of neighbours of , and the cardinality of is said to be the degree of . Lastly, for naturals we let be the set of graphs on vertices such that every node’s degree is no bigger than .
For a set an -valued Markov random field on a graph is a random vector such that for every , is conditionally independent of given . The graph is said to be the Markov network structure of the Markov random field. We define a class of MRFs as a set of probability distributions, each of which is an MRF. For a class of MRFs and a family of graphs , is said to be a class of MRFs on if every MRF in has a Markov network structure contained in .
Given a graph and a vector such that , a -external field Ising model on with parameter is a -valued Markov random field with the distribution
where is a normalising constant commonly known as the partition function. We let be the set of Ising models on graphs in such that for every either , or holds. Note that every determines a network structure, which we refer to as .
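For concreteness, a zero-external-field Ising model is commonly written in the following form. The symbols $p$ (number of nodes), $\theta$ (edge parameters indexed by pairs), $Z(\theta)$ (partition function) and the $\pm 1$ alphabet are assumptions chosen to match standard usage; they are offered as a sketch rather than as the exact notation of this paper.

```latex
P_\theta(X = x) \;=\; \frac{1}{Z(\theta)} \exp\Big( \sum_{1 \le i < j \le p} \theta_{ij}\, x_i x_j \Big),
\qquad x \in \{-1,+1\}^p,
\qquad
Z(\theta) \;=\; \sum_{x \in \{-1,+1\}^p} \exp\Big( \sum_{1 \le i < j \le p} \theta_{ij}\, x_i x_j \Big).
```

Under this form, a pair of nodes is an edge of the Markov network structure exactly when the corresponding entry of $\theta$ is non-zero.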
Similarly, given a graph a symmetric, positive-definite matrix such that only if , the -mean Gaussian Markov Random Field on with parameter is the -valued Markov random field with the distribution
The matrix is known as the precision matrix of . Note that the Markov network structure of the distribution is determined by the non-zero entries of . For let be the set of Gaussian MRFs on graphs in with mean and precision matrix such that all diagonal entries are non-zero, and for every
We note here that the value of actually modulates the graphical structures allowed within . For instance, if we allow graphs such that even a single node may have neighbours in the Markov network structure of a distribution as above, then the condition of positivity of the precision matrices enforces , and if we allow the entirety of unrestrictedly, the condition is required.
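The interaction between edge strength and admissible graph structure can be illustrated numerically. The sketch below is a hypothetical illustration, not part of the paper's argument: it builds two unit-diagonal precision matrices (a star with `d` leaves, and a complete graph on `d + 1` nodes with negative off-diagonal entries) and tests positive definiteness by attempting a Cholesky factorisation. For these two specific matrices, positive definiteness holds exactly when the off-diagonal magnitude is below $1/\sqrt{d}$ and $1/d$, respectively. The function names, the unit diagonal and the sign choice are all assumptions made for the example.

```python
import math


def is_positive_definite(m):
    """Attempt a Cholesky factorisation; it succeeds iff m is positive definite."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 1e-12:  # non-positive pivot: not positive definite
                    return False
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


def star_precision(d, lam):
    """Unit-diagonal matrix for a star: node 0 joined to d leaves with weight lam."""
    n = d + 1
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for leaf in range(1, n):
        m[0][leaf] = m[leaf][0] = lam
    return m


def complete_precision(d, lam):
    """Unit-diagonal matrix for the complete graph on d + 1 nodes, off-diagonals -lam."""
    n = d + 1
    return [[1.0 if i == j else -lam for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    d = 4
    for lam in (0.24, 0.26, 0.49, 0.51):
        print(lam,
              is_positive_definite(star_precision(d, lam)),
              is_positive_definite(complete_precision(d, lam)))
```

With `d = 4` the star stays positive definite up to an edge strength of 0.5, while the complete graph already fails just above 0.25, which is the qualitative effect the paragraph above describes.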
Lastly, for the above classes of MRFs, we frequently describe distributions in terms of their Markov network structures. In particular, we say that a distribution has the edge with weight if for the Ising model, and if for the Gaussian MRF. If all the edges of a distribution have the same weight, we say that the distribution has uniform edge weights.
2.2 The Change Detection Problem
Let be a class of -valued Markov random fields with parameters , and let have parameters and respectively. Note that the distributions/parameters are unknown to us, and are potentially equal. For , let be its Markov network structure. We consider the structural change detection problem, which may be informally stated as follows: Given samples drawn according to independently and identically, and samples identically drawn according to independently of each other and of the other set of samples, can one determine if or not with high probability?
Formally, for -valued Markov random fields let and be finite sets of samples, also referred to as datasets, drawn independently and identically from and , respectively. An -change detector for is a map Let be the set of all -change detectors. Let the risk of a detector , be
and the minimax change detection risk with samples over the class be
Note that the above risk expressions are just the adaptation of the standard binary hypothesis testing risks to our situation.
We say that an -change detector is -reliable over the class if and say that the change detection problem can be solved over -reliably with samples if there exists an -change detector over that is -reliable or, equivalently, if The parameter is occasionally referred to as the reliability level.
The main aim of this paper is to study the trade-off between the reliability level of change detection over a given class of MRFs and the sample size In particular, we provide necessary conditions on for -reliable change detection with samples over the classes and in terms of their parameters.
3 Theorem Statements and Nature of Results
As previously noted, our results are necessary conditions on the number of samples required for to be smaller than a given . The results are stated in separate subsections for the Ising and the Gaussian MRFs, respectively. A discussion comparing the results to parallel results in structure learning follows the theorem statements, while proofs are relegated to later sections. See §2 for precise definitions of the models and graph classes considered.
3.1 Ising Model
Let For every a necessary condition for -reliable change detection with samples over is
Let and For every a necessary condition for -reliable change detection with samples over is
Proof sketch: While the exact proof is relegated to §4, we loosely detail the strategy here. Let , and let be a set of distributions such that . Suppose we are given the information that the larger set of samples is drawn according to , while the second set is either drawn from some or drawn from . We have thus reduced the change detection problem to running a hypothesis test on the smaller dataset, with the simple null hypothesis that it is drawn from , and the composite alternative that it is drawn according to some . Suppose further that we are supplied with a prior, , for the selection on . Clearly, the average risk of the hypothesis testing problem under the prior is a lower bound for the minimax change detection risk. Moreover, since the uniformly most powerful tests for such problems are known by the classical results of Neyman & Pearson, we can compute lower bounds on these average risks. In particular, we do this by applying a variation of Le Cam’s method, as outlined in [arias-castro2012], with a uniform prior on .
We call the pair a change detection ensemble. The ensembles used to derive the above bounds are as follows.
Theorem 1a: is the Ising model with no edges on nodes, and is the collection of Ising models with exactly one edge with edge weight
Theorem 1b: is the Ising model with uniform edge weight on separate cliques of size each, while is the collection of Ising models on the same graph as but with one known edge from exactly one of the cliques deleted, again with uniform weight
We note that the above proof technique directly allows us to state our results as structure learning bounds in the context of the recovery criterion defined in [SanWai] as well, as will be discussed in §4.
Remark: The ensemble used for the proof of Theorem 1b most likely does not satisfy the conditions required in work such as [FazBan] or [liu2017]. This is because these papers all require an ‘incoherence condition’, and, as pointed out in [montanari2009graphical], these conditions essentially hold only if for some , while we need for the results to follow.
3.1.1 Remarks, and comparison with structure learning bounds
In the high-dimensional setting, one considers the behaviour of these bounds for large . The bounds above can be interpreted in three different contexts depending on which of the model parameters are allowed to change. In the following we set in order to discuss conditions necessary to beat a random coin, and is some arbitrary quantity that depends only on the non-changing parameters.
If the parameters are given constants, then change detection requires at least samples from each dataset to detect changes.
If we hold as constants, and allow to grow with , then for sufficiently large , one needs samples from each dataset to detect changes.
If we allow and as well as to change with , then unless decays with we are forced into the exponential growth in regime. Suppose we impose the requirement that the bounds grow at most polynomially in . This can be done essentially by limiting for any constant , which limits the second bound, although since must hold, this affects the first bound as well. Optimising for the which gives the lowest net growth, we get that in any scenario of parameter growth, we need at least samples from each data set to detect changes.
We compare the bounds of Theorem 3.1 with two results due to Santhanam & Wainwright. Note that the statements have been modified to fit our notation.
First we consider the necessary condition:
Theorem [SanWai, Thm. 1].
Consider for . If the structure learning probability of error is smaller than , then the following condition on the number of samples, , is necessary:
Theorem 1a is the direct analogue of the first bound, and Theorem 1b is the direct analogue of the second bound. The third bound is only active when all parameters are held constant, and even then is inactive for , and thus left largely unconsidered by us. However, the other two terms are weaker than Theorem 3.1 in the regime in which the latter hold. In particular, if and , Theorem 1b has an advantage of essentially which is unbounded in for Note that while the requirement for our second bound to hold may seem more stringent than the condition in [SanWai], in the regime , the bound dominates the exponential bound in both cases, and thus this distinction is rendered moot. Lastly, observe that if we force and to decay in a manner that allows at most polynomial growth with , we see that structure learning requires samples as in our case.
In the same paper, Santhanam & Wainwright also show ([SanWai, Theorem 3a]) that if the edge weights are given, it is possible to learn the structure with samples for
Ignoring the terms, the result above is separated from our lower bound by a factor of . Now if and are at most polynomial in , then this factor is at most polynomial in , which is negligible compared to the exponential in scaling forced under . Since our bounds are computed with a change ensemble where we are aware of the edge weights, the closeness of these bounds implies that our technique cannot yield significantly stronger lower bounds in this regime. Lastly, recall that our change detection bounds can also be stated as structure learning bounds. Thus, our results close the exponential gap in the structure learning lower bounds of [SanWai], which has persisted through all subsequent work on structure learning.
3.2 Gaussian MRFs
Let . For every a necessary condition for -reliable change detection with samples over is
Remark: As mentioned before, controls the richness of the Markov network structures within that are allowed, essentially since the two together determine the positivity of certain precision matrices in the class. In particular, if we allow even a single node to have neighbours, as we rightly should when considering models in , then is forced, and if regular graphs are allowed, then is imposed. In light of this, the condition on in the theorem statement is fairly benign. For instance, enforcing for already gives us
As in the Ising case, the bound is proved by considering an explicitly stated restricted class of changes, and bounding the risk for them. Curiously, while we obtain the above bound by considering a simple ensemble of changes of the form independent versus one-edge, essentially the same result can be obtained up to constant factors (and subject to the conditions that allow these ensembles to exist) in more richly connected classes of ensembles - for instance, by using repetitions as in §4.4.2 of the star graph versus the star graph with one edge moved, or the complete graph versus the same with a known edge missing. This suggests that there might be some uniformity to the hardness of structure learning/change detection in Gaussian MRFs.
Comparing the above bound to the corresponding structure learning bound, we note the following result of Wang, Wainwright, and Ramchandran, stated in our notation.
Theorem [WanWaiRam, Thm. 1].
Consider the class with . A necessary condition for asymptotically (as grows large) reliable structure learning over this class is
Note that the first term in the expression above dominates the second when which, by the previous argument, is the range of in which at least one node can be connected to other nodes in the Markov network structure of the graph. Thus, our lower bound matches the structure learning lower bound in the parameter region relevant for . We note that this bound is near tight - for instance [anandkumar2012high] achieves, under technical conditions, structure learning for Gaussian MRFs with sample complexity .
We begin by detailing the general proof technique that we use, followed by the proofs of Theorems 1a and 2, which are of the same flavour, and are rather simple. The proof of Theorem 1b is relatively more involved and follows these sections.
4.1 Lower Bounding Technique
We first note that any lower bound on sample complexities to achieve a given risk level must hold for both and , since merely switching the labels of the two sets of samples should not affect anything. We hence set , and derive lower bounds on by considering simpler hypothesis testing problems. In the following, denotes a random sample from the dataset with the smaller number of samples.
Recall the proof strategy described after the statement of Theorem 3.1. Continuing in the same vein, we consider the average risk for the hypothesis testing problem
under the uniform prior on . Recall that the average risk under this prior must be smaller than .
By the results of Neyman & Pearson [lehmann2006testing, Ch. 3], we know that the most powerful tests for the above hypothesis test are of the form
where is a positive real number, and
is the likelihood ratio of the distribution versus a distribution uniformly at random from set . We refer to as the likelihood ratio of versus .
Note that for discrete distributions such as the Ising model, and for distributions which admit a density with respect to a Euclidean space, such as the Gaussian, we have , where and denote the respective densities.
Our main technical tool is captured by the following lemma, which allows us to compute lower bounds on required for to be small by computing upper bounds on the variance of for well-chosen and .
Let be a class of MRFs. For every such that , and , if is the -sample likelihood ratio of versus then
For notational brevity, we let
Let be the optimal average risk for the above test with samples when is picked uniformly at random from . By the previous discussion,
Since the proof concludes with a simple manipulation of the above inequality. ∎
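The chain of inequalities behind bounds of this type is standard and can be sketched as follows. The notation is assumed for this sketch only: $P_0^{\otimes n}$ is the $n$-sample null distribution, $\bar{P}_n$ is the uniform mixture of the $n$-sample distributions in the alternative set, $L_n = \mathrm{d}\bar{P}_n / \mathrm{d}P_0^{\otimes n}$ is the likelihood ratio, and $\psi$ ranges over tests. The constants in the lemma may differ from those below depending on how the risk is normalised.

```latex
\inf_{\psi}\Big\{ P_0^{\otimes n}(\psi = 1) + \bar{P}_n(\psi = 0) \Big\}
  \;=\; 1 - d_{\mathrm{TV}}\big(P_0^{\otimes n}, \bar{P}_n\big)
  \;=\; 1 - \tfrac{1}{2}\,\mathbb{E}_{P_0^{\otimes n}}\big|L_n - 1\big|
  \;\ge\; 1 - \tfrac{1}{2}\sqrt{\mathbb{E}_{P_0^{\otimes n}}\big[(L_n - 1)^2\big]}
  \;=\; 1 - \tfrac{1}{2}\sqrt{\operatorname{Var}_{P_0^{\otimes n}}(L_n)}.
```

The inequality is Cauchy-Schwarz, and the last equality uses $\mathbb{E}_{P_0^{\otimes n}}[L_n] = 1$. A small variance of the likelihood ratio therefore forces the sum of the two error probabilities to stay close to one.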
Remark: The above proof technique can also be applied to obtain structure learning bounds. In particular, suppose we have access to a structure learning algorithm that uniformly over all distributions in identifies the correct structure with probability greater than . Then, for such that for all , and under the distribution the above hypothesis test can be solved with probability of error smaller than by learning the structure of the distributions. However, this must exceed . Thus, all our change detection bounds, which rely on bounding , can be converted to structure learning bounds by doubling the reliability level.
4.2 Proof of Theorem 1a
Let be the Ising model with no edges on nodes, and let be the Ising model with weight on the edge and no other edges. We let the class , and consider the change detection ensemble . Lastly, we set Note that the n-sample likelihood ratio is
In order to apply Lemma 3, we need to compute the quantity . Let . We note that for , , since is just independent Bernoulli distributions together, we have
where the equality is since the samples are independent and each has the distribution , and comes from feeding in the moments computed before. Plugging equation into the condition imposed by Lemma 3, we get that if the risk is smaller than then we must have
We conclude by noting that and that we may set .
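The second-moment computation for this ensemble can be checked by brute force on tiny instances. The sketch below assumes the ensemble described above (the edgeless Ising model versus a uniform mixture over all single-edge models with a common weight); the names `p`, `n` and `alpha` for the number of nodes, the number of samples and the edge weight are hypothetical labels introduced here. Under these assumptions, distinct edges contribute cross terms equal to one, identical edges contribute $(1 + \tanh^2\alpha)^n$, and the closed form follows by counting.

```python
import itertools
import math


def closed_form_second_moment(p, n, alpha):
    """Second moment of the n-sample likelihood ratio under the edgeless model."""
    m = math.comb(p, 2)  # number of candidate single-edge models
    return 1.0 + ((1.0 + math.tanh(alpha) ** 2) ** n - 1.0) / m


def brute_force_second_moment(p, n, alpha):
    """Same quantity, enumerating every dataset of n samples in {-1,+1}^p."""
    edges = list(itertools.combinations(range(p), 2))
    configs = list(itertools.product((-1, 1), repeat=p))
    total = 0.0
    for dataset in itertools.product(configs, repeat=n):
        # Likelihood ratio of the uniform mixture against the edgeless model.
        ratio = 0.0
        for i, j in edges:
            term = 1.0
            for x in dataset:
                term *= math.exp(alpha * x[i] * x[j]) / math.cosh(alpha)
            ratio += term / len(edges)
        total += ratio ** 2
    # Every dataset has probability 2^(-p*n) under the edgeless model.
    return total / (2 ** (p * n))


if __name__ == "__main__":
    print(closed_form_second_moment(3, 2, 0.7))
    print(brute_force_second_moment(3, 2, 0.7))
```

Subtracting one from either value gives the variance of the likelihood ratio. The enumeration is exponential in `p * n`, so it is only meant as a sanity check for very small values.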
4.3 Proof of Theorem 2
This result essentially follows the same arguments as in the proof of Theorem 1a. We identify the distributions of the Gaussian MRFs with their precision matrices. Let be the Gaussian MRF on nodes with no edges and unit variance, be the unit variance Gaussian MRF on nodes with the single edge with edge weight s.t. , and For convenience, we let be the matrix with the and entries equal to , and all other entries . Note that the precision matrices are and .
Again, consider the likelihood ratio. We have
We first require a few computations. For convenience, let and let and be naturals in
- simply observe that the matrix is similar to the block diagonal matrix and thus the determinant is the product of and .
since this is similar to
If , then since the two s each contribute separate blocks of .
If , then , since this matrix is similar to where
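Determinant identities of this kind are easy to verify numerically. The sketch below assumes that the matrices in question are the identity plus symmetric single-edge perturbations, which is how the precision matrices are described above; the helper names, the dimension 5 and the weight 0.3 are illustrative choices, and the four cases shown (one edge, one edge with doubled weight, two disjoint edges, two edges sharing a node) are an assumption about which cases the list covers.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    work = [row[:] for row in matrix]
    n = len(work)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(work[r][col]))
        if abs(work[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            work[col], work[pivot] = work[pivot], work[col]
            result = -result  # a row swap flips the sign
        result *= work[col][col]
        for r in range(col + 1, n):
            factor = work[r][col] / work[col][col]
            for c in range(col, n):
                work[r][c] -= factor * work[col][c]
    return result


def identity_plus(p, weighted_edges):
    """Identity of size p with w added at (i, j) and (j, i) for each (i, j, w)."""
    m = [[1.0 if r == c else 0.0 for c in range(p)] for r in range(p)]
    for i, j, w in weighted_edges:
        m[i][j] += w
        m[j][i] += w
    return m


weight = 0.3
size = 5
single = det(identity_plus(size, [(0, 1, weight)]))                    # 1 - w^2
doubled = det(identity_plus(size, [(0, 1, 2 * weight)]))               # 1 - 4 w^2
disjoint = det(identity_plus(size, [(0, 1, weight), (2, 3, weight)]))  # (1 - w^2)^2
shared = det(identity_plus(size, [(0, 1, weight), (0, 2, weight)]))    # 1 - 2 w^2
```

The disjoint case factorises because the two perturbations occupy separate diagonal blocks, whereas the shared-node case does not, which is why overlapping edge pairs need separate treatment in the second-moment bound.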
We are now in a position to bound . The first few steps are parallel to the Ising model case, and are omitted.
where is due to the Gaussian integral, and the final equality is by simple counting. Note first that . Since we are looking for an upper bound, we will set . We thus have
Taking logarithms, we directly have that for every
The stated result follows on noting that for (this inequality can be shown for by a standard positivity-of-derivative argument, and can be readily verified for by any numerical equation solver), and since if we may set above. Note that the quadratic denominator in is retained as long as is bounded away from , although with a graceful weakening of the constants involved.
4.4 Proof of Theorem 1b
To begin with, we will prove a likelihood ratio upper bound for a specific change detection ensemble on the graph class . We will next show a technique that allows us to obtain a closely related bound on the graph class , and follow this by a small section concluding the proof.
4.4.1 A likelihood ratio bound
Let be the Ising model with uniform edge weight on the graph , the complete graph on nodes. Note that each node in has degree . Similarly, let be the Ising model with uniform edge weight on the graph i.e., the same graph as but with one edge deleted. We let be the partition function for , and be the partition function for .
Arithmetical manipulations show that for
We thus have
We first compute a bound on the sample likelihood ratio for versus , denoted . Observe that
and we thus need a bound on . This is the subject of the following lemma, the proof of which is relegated to the appendix.
If and then
4.4.2 Lifting results on nodes to nodes
Let be an MRF on variables such that , and be the MRF obtained by eliminating the dependence in between and , i.e., clipping one edge from with no change in the rest of the graph. Suppose that we have an estimate for the likelihood ratio variance of the form
where is some positive function.
Let and consider the graph on nodes consisting of singletons, and non-trivial completely connected components, each of which has exactly nodes. Observe that , where for