On Model Selection Consistency of Lasso for High-Dimensional Ising Models on Tree-like Graphs

10/16/2021
by   Xiangming Meng, et al.
The University of Tokyo

We consider the problem of high-dimensional Ising model selection using the neighborhood-based least absolute shrinkage and selection operator (Lasso). It is rigorously proved that under some mild coherence conditions on the population covariance matrix of the Ising model, consistent model selection can be achieved with sample sizes n = Ω(d³ log p) for any tree-like graph in the paramagnetic phase, where p is the number of variables and d is the maximum node degree. When the same conditions are imposed directly on the sample covariance matrices, a reduced sample size n = Ω(d² log p) suffices. The obtained sufficient conditions for consistent model selection with Lasso match the scaling of the sample complexity of ℓ_1-regularized logistic regression. Given the popularity and efficiency of Lasso, our rigorous analysis provides a theoretical backing for its practical use in Ising model selection.


1 Introduction

Undirected graphical models, also widely known as Markov random fields (MRFs), are a powerful tool for modeling high-dimensional probability distributions. The conditional independence relationships between random variables can be well captured by the associated graphs [wainwright2008graphical, koller2009probabilistic]. It is thus of significant importance to reconstruct the underlying graph structure of MRFs from independent, identically distributed (i.i.d.) random samples. With the recent advent of massive data, the problem of learning the graph structure of MRFs, also termed graphical model selection, has attracted a surge of interest in various scientific disciplines such as image analysis [roth2005fields], social networking [mcauley2012learning, perc2017statistical], gene network analysis [marbach2012wisdom, krishnan2020modified], and protein interactions [morcos2011direct, liebl2021accurate], just to name a few.

Among various kinds of MRFs, the Ising model [ising1925beitrag] is the most renowned binary graphical model with pairwise interaction potentials [wainwright2008graphical, mezard2009information]. The problem of Ising model selection, i.e., reconstructing the graph structure of an Ising model from i.i.d. samples, has been studied extensively and various methods have been proposed. As the Ising model was originally proposed in statistical physics [ising1925beitrag], some early attempts at Ising model selection were heuristic mean-field methods from the physics community [tanaka1998mean, kappen1998efficient, ricci2012bethe]. However, methods utilizing mean-field approximations perform poorly for Ising models with long-range correlations. To address this problem and further improve the performance, a variety of methods have been proposed in the past years [wainwright2007high, hofling2009estimation, ravikumar2010high, decelle2014pseudolikelihood, vuffray2016interaction, lokhov2018optimal]. Notably, under the framework of the pseudo-likelihood (PL) method [besag1975statistical], the statistical community has provided a strong theoretical backing for ℓ_1-regularized logistic regression (ℓ_1-LogR) [ravikumar2010high].

In addition, the regularized interaction screening estimator (RISE) was proposed in vuffray2016interaction using a local cost function called the interaction screening objective (ISO) (a logarithmic version is proposed in lokhov2018optimal). It has been theoretically demonstrated that RISE requires a number of samples that is near-optimal with respect to (w.r.t.) the previously established information-theoretic lower bound [santhanam2012information]. Interestingly, both the ℓ_1-LogR and RISE estimators belong to the common framework of ℓ_1-regularized M-estimators [negahban2012unified], only with different loss functions. The use of the logistic loss in ℓ_1-LogR [ravikumar2010high] stems from its consistency with the underlying conditional likelihood of the Ising model, while the ISO loss in RISE is specially designed according to the physical “interaction screening” property [vuffray2016interaction, lokhov2018optimal].

In practice, however, the underlying model generating the observed samples is usually unknown a priori, i.e., model mismatch or misspecification is inevitable, and thus in general it is difficult to obtain a loss function consistent with the true conditional likelihood as in ℓ_1-LogR [ravikumar2010high], nor is it easy to design a specific ISO satisfying the “interaction screening” property as in RISE [vuffray2016interaction, lokhov2018optimal]. Consequently, one natural question is: is it still possible to recover the structure of the Ising model using a misspecified loss function such as the quadratic loss? To the best of our knowledge, this problem has not been rigorously studied, though there are several recent non-rigorous works in the statistical physics literature [Abbara2019c, meng2020structure, meng2021ising]. In particular, it is shown in meng2021ising that even in the case of a misspecified quadratic loss, the neighborhood-based least absolute shrinkage and selection operator (Lasso) [tibshirani1996regression] (termed ℓ_1-regularized linear regression (ℓ_1-LinR) in meng2021ising) has the same order of sample complexity as ℓ_1-LogR for random regular (RR) graphs in the whole paramagnetic phase [mezard2009information]. Furthermore, meng2021ising provides an accurate estimate of the typical sample complexity as well as a precise prediction of the non-asymptotic learning performance. However, there are several limitations in meng2021ising. First, since the replica method [mezard1987spin, opper2001advanced, nishimori2001statistical, mezard2009information] they use is a non-rigorous method from statistical mechanics, a rigorous mathematical proof is lacking. Second, the results in meng2021ising are restricted to the ensemble of RR graphs, since the obtained equations of state (EOS) explicitly depend on the eigenvalue distribution (EVD) of the covariance matrix, which is intractable for general graphs. In addition, since their analysis relies on the self-averaging property [nishimori2001statistical, mezard2009information], the results in meng2021ising are meaningful in terms of the “typical case” rather than the worst case. Another closely related work is lokhov2018optimal, which states that at high temperatures, when the magnitudes of the couplings approach zero, both the logistic and ISO losses can be approximated by a quadratic loss using a second-order Taylor expansion, so that both the ℓ_1-LogR and RISE estimators behave similarly to the Lasso estimator at high temperatures. However, a rigorous analysis of the Lasso estimator itself is still lacking. Moreover, although the Taylor expansion is used in lokhov2018optimal to illustrate the similarity between Lasso and ℓ_1-LogR/RISE, there is no quantitative analysis of the exact regime in which Lasso succeeds, especially when no post-thresholding is performed.

In this paper, we provide a rigorous proof of the model selection consistency of Lasso for high-dimensional Ising models on any tree-like graph in the whole paramagnetic phase [mezard2009information]. Our proof builds on the framework of the primal-dual witness method used in ravikumar2010high for ℓ_1-LogR, but with several key modifications specially tailored to the Lasso estimator. Specifically, there are two crucial differences between the quadratic loss of Lasso and the logistic/ISO losses of ℓ_1-LogR/RISE [ravikumar2010high, vuffray2016interaction]. First, the solution to the zero-gradient condition of the expected quadratic loss is no longer the true parameter vector but a rescaled one. Second, the Hessian matrix of the expected quadratic loss reduces to the covariance matrix. Interestingly, these differences simplify some parts of the analysis while complicating others; remarkably, the overall result is the same scaling of the sample complexity as the ℓ_1-LogR estimator [ravikumar2010high] for consistent Ising model selection. Specifically, our main results show that under mild assumptions on the population covariance matrix, consistent Ising model selection can be achieved with Lasso using n = Ω(d³ log p) samples for any paramagnetic tree-like graph, where p is the number of variables of the Ising model and d is the maximum node degree. When the same assumptions are directly imposed on the sample covariance matrices, only n = Ω(d² log p) samples suffice. Compared to previous works [lokhov2018optimal, meng2021ising], we provide a rigorous analysis of Lasso for Ising model selection and explicitly prove its consistency in the whole paramagnetic phase. Given the wide popularity and efficiency of Lasso, such a rigorous analysis provides a theoretical backing for its practical use in Ising model selection, which can be viewed as a complement to that for Gaussian graphical models [meinshausen2006high, zhao2006model].

The remainder of this paper is organized as follows. In Section 2, we first provide a background on the Ising model and the problem setup. Section 3 states the main results. Section 4 gives the proof of the main results. In Section 5, numerical simulations are conducted to verify the theoretical analysis. Finally, Section 6 concludes the paper with some discussions.

2 Background and Problem Setup

We begin with some background on the Ising model, the definition of Ising model selection, and an introduction to the neighborhood-based Lasso estimator. We follow the notation of ravikumar2010high very closely for ease of comparison.

2.1 Ising Model

The Ising model is a classical graphical model originating from statistical physics [ising1925beitrag]; it is a special class of MRF with pairwise potentials in which each variable takes binary values [opper2001advanced, mezard2009information]. The joint probability distribution of an Ising model with p variables (spins) x = (x_1, …, x_p)^⊤ ∈ {−1, +1}^p has the following form

P(x; θ*) = (1/Z(θ*)) exp( ∑_{(r,t)∈E} θ*_{rt} x_r x_t ),    (1)

where Z(θ*) is the partition function and θ* = (θ*_{rt}) is the coupling vector. In general, there are also external fields, but here they are assumed to be zero for simplicity. The structure of an Ising model can be described by an undirected graph G = (V, E), where V = {1, 2, …, p} is a collection of the indices of the vertices of the graph, and E ⊆ V × V is a collection of undirected edges. For each vertex r ∈ V, its neighborhood set is defined as the subset N(r) ≜ {t ∈ V : (r, t) ∈ E}, and the corresponding node degree is given by d_r ≜ |N(r)|. The maximum node degree of the graph is denoted as d ≜ max_{r∈V} d_r. We use G_{p,d} to denote the ensemble of graphs with p vertices and maximum (not necessarily bounded) node degree d. The minimum and maximum magnitudes of the couplings for (r, t) ∈ E are respectively defined as

θ_min ≜ min_{(r,t)∈E} |θ*_{rt}|,   θ_max ≜ max_{(r,t)∈E} |θ*_{rt}|.    (2)

Regarding the specific structure of the graph G, various types exist; in this paper we specially focus on the ensemble of tree-like graphs, which serves as one of the assumptions for proving the consistency of Lasso, as stated in assumption (A3) in Sec. 3.1.
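To make the setup concrete, the following is a minimal sketch (illustrative, not from the paper) of drawing exact i.i.d. samples from the zero-field Ising model (1) on a tree: the root spin is uniform by symmetry, and each child agrees with its parent with probability (1 + tanh(θ*_{rt}))/2, so forward sampling along the edges is exact. The chain graph, coupling value, and function name are assumptions for illustration.

```python
# Exact i.i.d. sampling from a zero-field Ising model (1) on a tree.
import numpy as np

def sample_tree_ising(edges, theta, p, n, rng):
    """Draw n i.i.d. samples x in {-1,+1}^p; `edges` lists (parent, child)
    pairs in topological order from the root, vertex 0."""
    X = np.empty((n, p), dtype=int)
    X[:, 0] = rng.choice([-1, 1], size=n)          # root spin: uniform by symmetry
    for (r, t), th in zip(edges, theta):
        agree = rng.random(n) < (1.0 + np.tanh(th)) / 2.0
        X[:, t] = np.where(agree, X[:, r], -X[:, r])
    return X

rng = np.random.default_rng(0)
p, K0 = 10, 0.4                                    # chain graph, uniform coupling (assumed values)
edges = [(t, t + 1) for t in range(p - 1)]
X = sample_tree_ising(edges, [K0] * len(edges), p, n=5000, rng=rng)
print(X[:, 0] @ X[:, 1] / 5000, np.tanh(K0))       # adjacent correlation ~ tanh(K0)
```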

2.2 Ising Model Selection

The problem of Ising model selection refers to recovering the underlying graph structure (edge set E) of the graph G from a collection of n i.i.d. samples x^{1:n} ≜ {x^{(1)}, …, x^{(n)}} of the Ising model, where x^{(i)} represents the i-th sample. As in ravikumar2010high, we consider the slightly stronger criterion of signed edge recovery. Specifically, given one Ising model with couplings θ* and edge set E, the edge sign vector E* is defined as follows [ravikumar2010high]

E* ≜ sign(θ*),    (3)

where sign(·) is an element-wise operation that maps every positive entry to +1, every negative entry to −1, and every zero entry to 0. Suppose that Ê_n is an estimator of the edge sign vector given x^{1:n}; then correct signed edge recovery is achieved if and only if Ê_n = E*.

In this paper, we are interested in Ising model selection in the high-dimensional (n ≪ p) regime, where both the number of vertices p and the maximum node degree d may also scale as a function of the sample size n. Our goal is to investigate sufficient conditions on the scaling of (n, p, d) so that the considered neighborhood-based Lasso estimator (as described in Sec. 2.3) is model selection consistent in the sense that

lim_{n→∞} P[Ê_n = E*] = 1,    (4)

which is also known as the sparsistency property [ravikumar2010high].

2.3 Neighborhood-based Lasso

The problem of graph structure recovery can be divided into p parallel sub-problems, one for each vertex r ∈ V. This is justified by the fact that for any graph G, recovering the edge sign vector (3) is equivalent to recovering, for each vertex r ∈ V, the associated neighborhood set N(r) along with the correct sign sign(θ*_{rt}) for every t ∈ N(r) [ravikumar2010high]. For each vertex r, the signed neighborhood set is defined as

N_±(r) ≜ { sign(θ*_{rt}) t : t ∈ N(r) }.    (5)

Denote θ*_{∖r} ≜ (θ*_{rt})_{t∈V∖{r}} as the sub-vector of couplings associated with the vertex r. Suppose that θ̂ is an estimator of θ*_{∖r}; then one can estimate the signed neighborhood N̂_±(r) from θ̂. Consequently, the event {Ê_n = E*} is equivalent to the event {N̂_±(r) = N_±(r), ∀r ∈ V}, i.e., every signed neighborhood set is recovered correctly. Note that, given the estimator θ̂, there are various methods to estimate the signed neighborhood N̂_±(r). In this paper, similar to ravikumar2010high, we consider estimating it as follows

N̂_±(r) ≜ { sign(θ̂_t) t : t ∈ V∖{r}, θ̂_t ≠ 0 }.    (6)

Alternatively, one can also introduce a threshold and then perform post-thresholding as in vuffray2016interaction, lokhov2018optimal. Here, we only focus on the case without post-thresholding.

Then, the main goal is to obtain the estimator θ̂. There are several popular nonlinear estimators such as ℓ_1-LogR [ravikumar2010high] and RISE [vuffray2016interaction]. In this paper we consider one simple linear estimator, i.e., the neighborhood-based Lasso estimator. Specifically, for each vertex r ∈ V, the estimator θ̂ is obtained as follows

θ̂ ≜ argmin_{θ∈ℝ^{p−1}} { ℓ_n(θ) + λ_n ‖θ‖_1 },    (7)

where ℓ_n(θ) denotes the quadratic loss function

ℓ_n(θ) ≜ (1/2n) ∑_{i=1}^n ( x_r^{(i)} − ∑_{t∈V∖{r}} θ_t x_t^{(i)} )²,    (8)

and λ_n > 0 is the regularization parameter, which might depend on the sample size n. For notational simplicity, θ̂ will be used hereafter without making its dependence on (r, λ_n) explicit. Our main concern is the scaling condition on (n, p, d) which ensures that the estimated signed neighborhood in (6) agrees with the true one, i.e., N̂_±(r) = N_±(r), with high probability.
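As an illustration of (7)-(8), the following sketch runs the neighborhood-based Lasso on samples from a chain Ising model using scikit-learn, whose Lasso objective (1/(2n))‖y − Xw‖²_2 + α‖w‖_1 coincides with (8) when α = λ_n. The graph, sample size, and the constant in λ_n are illustrative assumptions, not values from the paper.

```python
# Neighborhood-based Lasso (7)-(8): one regression per vertex, read off (6).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
p, K0, n = 12, 0.4, 4000
X = np.empty((n, p))
X[:, 0] = rng.choice([-1.0, 1.0], size=n)          # exact forward sampling on a chain
for t in range(p - 1):
    agree = rng.random(n) < (1.0 + np.tanh(K0)) / 2.0
    X[:, t + 1] = np.where(agree, X[:, t], -X[:, t])

lam = 4.0 * np.sqrt(np.log(p) / n)                 # lambda_n ~ sqrt(log p / n); constant is a guess
N_hat = {}
for r in range(p):                                 # one Lasso sub-problem per vertex
    rest = [t for t in range(p) if t != r]
    coef = Lasso(alpha=lam, fit_intercept=False).fit(X[:, rest], X[:, r]).coef_
    N_hat[r] = {(rest[j], int(np.sign(coef[j]))) for j in np.flatnonzero(coef)}

print(N_hat[5])                                    # ideally {(4, 1), (6, 1)} for the chain
```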

3 Main results

3.1 Assumptions

Similar to ravikumar2010high, the success of the neighborhood-based Lasso estimator (7) relies on several assumptions. First, as in ravikumar2010high, we introduce the dependency condition and incoherence condition on the Hessian of the expected quadratic loss E[ℓ_n(θ)], where E denotes expectation w.r.t. the joint distribution P(x; θ*) in (1). However, different from the case of logistic loss in ravikumar2010high, the Hessian of E[ℓ_n(θ)] with the quadratic loss (8) corresponds to the covariance matrix of x_{∖r} ≜ (x_t)_{t∈V∖{r}}, which is independent of the value of θ, and hence the additional variance function term in ravikumar2010high does not exist. Specifically, for each fixed vertex r ∈ V, the Hessian of E[ℓ_n(θ)] is a (p−1) × (p−1) matrix of the form

Q*_r ≜ E[ x_{∖r} x_{∖r}^⊤ ].    (9)

For notational simplicity, Q*_r will be written as Q* hereafter. Denote S ≜ {t ∈ V∖{r} : (r, t) ∈ E} as the subset of indices associated with the edges of r, and S^c as its complement. The d_r × d_r sub-matrix of Q* indexed by S is denoted as Q*_{SS}, where d_r is the node degree of node r. The dependency condition and incoherence condition are described as follows.

Assumption 1 (A1): dependency condition. The sub-matrix Q*_{SS} has bounded eigenvalues, i.e., there exists a finite constant C_min > 0 such that

Λ_min(Q*_{SS}) ≥ C_min.    (10)

Assumption 2 (A2): incoherence condition. There exists an α ∈ (0, 1] such that

|||Q*_{S^cS}(Q*_{SS})^{−1}|||_∞ ≤ 1 − α,    (11)

where |||A|||_∞ ≜ max_j ∑_k |A_{jk}| is the ℓ_∞ matrix norm of a matrix A.
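Since (A1) and (A2) involve only a covariance matrix, they can be checked numerically. Below is a small sketch doing so for the exact covariance of a 4-spin chain (a tree), for which C_{ij} = tanh(K_0)^{|i−j|} because correlations factorize along tree paths; the model and coupling value are illustrative assumptions.

```python
# Numerical check of the dependency (A1) and incoherence (A2) conditions.
import numpy as np

def check_conditions(Q, S):
    Sc = [j for j in range(Q.shape[0]) if j not in S]
    Q_SS, Q_ScS = Q[np.ix_(S, S)], Q[np.ix_(Sc, S)]
    C_min = np.linalg.eigvalsh(Q_SS).min()               # (A1): smallest eigenvalue
    B = Q_ScS @ np.linalg.inv(Q_SS)
    norm_inf = np.abs(B).sum(axis=1).max()               # (A2): |||.|||_inf norm
    return C_min, norm_inf

K0, p, r = 0.4, 4, 0                                     # chain 0-1-2-3, focus on vertex r = 0
C = np.tanh(K0) ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Q = np.delete(np.delete(C, r, axis=0), r, axis=1)        # covariance of x_{\r}
S = [0]                                                  # N(r) = {vertex 1}, index 0 within x_{\r}
C_min, norm_inf = check_conditions(Q, S)
print(C_min, norm_inf)                                   # need C_min > 0 and norm_inf <= 1 - alpha
```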

Compared to the ℓ_1-LogR estimator [ravikumar2010high], the assumptions (A1) and (A2) for the neighborhood-based Lasso (7) are easier to check since they are only imposed on the covariance matrix. Besides, it is assumed that the graph is tree-like and in the paramagnetic phase.

Assumption 3 (A3): Paramagnetic tree-like graph condition. The underlying graph of the Ising model is assumed to be a sparse tree-like graph in the paramagnetic phase [mezard2009information].

This sparse tree-like assumption implies that loops in the graph can be neglected and the corresponding free energy can be described by the so-called Bethe free energy [mezard2009information]. The paramagnetic phase assumption indicates that the Ising model is above the critical temperature, so that the spontaneous magnetization is zero [nishimori2001statistical, mezard2009information]. Note that for RR graphs, under assumption (A3), assumptions (A1) and (A2) naturally hold, as shown in Proposition 1.

Proposition 1

For an RR graph with uniform coupling strength K_0 and constant node degree d, the dependency condition in (A1) and the incoherence condition in (A2) naturally hold under assumption (A3).

Proof  See Appendix B.  

It is conjectured that assumption (A3) is not essential for consistent model selection, though it is necessary for the proof below. In fact, as can be seen from the numerical experiments in Section 5, even for graphs with many loops, such as the square lattice, the reconstruction probability of the Lasso estimator behaves essentially similarly to that for tree-like graphs. We leave relaxing assumption (A3) as future work.

3.2 Statement of main results

First consider the case of a “fixed design” of the sample covariance matrices

Q^n ≜ (1/n) ∑_{i=1}^n x_{∖r}^{(i)} (x_{∖r}^{(i)})^⊤,    (12)

which satisfy the assumptions (A1) and (A2). Similar to ravikumar2010high, we define the “good event” as follows

M(x^{1:n}) ≜ { x^{1:n} ∈ {−1, +1}^{n×p} : Q^n satisfies (A1) and (A2) }.    (13)

Then, when the original Ising model satisfies (A3), we have Theorem 1.

Theorem 1

(fixed design) Consider an Ising model defined on a graph G = (V, E) with parameter vector θ* and associated signed edge set E* such that assumption (A3) is satisfied. Suppose that the event M(x^{1:n}) holds and the regularization parameter is selected to satisfy λ_n ≥ c_1 √(log p / n) for some constant c_1 > 0. Under these conditions, if

n > K d² log p    (14)

for some constant K > 0, then with probability at least 1 − 2 exp(−c λ_n² n) → 1 as n → ∞ (for some constant c > 0), the following properties hold:

(a) For each node r ∈ V, the Lasso estimator (7) has a unique solution, and thus uniquely specifies a signed neighborhood N̂_±(r).

(b) For each node r ∈ V, the estimated signed neighborhood N̂_±(r) correctly excludes all edges not in the true neighborhood. Moreover, it correctly includes all edges if θ̄_min ≥ c_2 √d λ_n for some constant c_2 > 0 depending only on C_min, where θ̄_min is the minimum magnitude of the rescaled parameter θ̄ and is defined later in (34).

Theorem 1 indicates that, remarkably, under the mild assumptions (A1) and (A2) on the sample covariance matrices, in the high-dimensional setting (n ≪ p), the Lasso estimator is model selection consistent with n = Ω(d² log p) samples for tree-like Ising models in the paramagnetic phase. Note that the minimum magnitude θ̄_min of the rescaled parameter might be difficult to evaluate for general graphs. Instead, a relaxed condition can be obtained by using a lower bound of θ̄_min (see Appendix A). In the case of RR graphs with coupling strength K_0 and node degree d, it can be easily obtained as θ̄_min = tanh(K_0)/(1 + (d − 1) tanh²(K_0)).
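As a quick numerical illustration with assumed values of (d, K_0) (not from the paper's experiments), the RR-graph formula above, derived later as (21), shows how strongly the quadratic loss rescales the couplings that condition (b) of Theorem 1 must still detect:

```python
# theta_bar = tanh(K0) / (1 + (d - 1) * tanh(K0)^2), cf. (21) and (34).
import numpy as np

for d, K0 in [(3, 0.2), (3, 0.4), (5, 0.2)]:
    t = np.tanh(K0)
    print(f"d={d}, K0={K0}: theta* = {K0:.3f} -> theta_bar = {t / (1 + (d - 1) * t**2):.3f}")
```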

In Theorem 1, the assumptions (A1) and (A2) are directly imposed on the sample covariance matrices Q^n. When (A1) and (A2) are instead imposed on the population covariance matrix Q*, we obtain the following Theorem 2.

Theorem 2

Consider an Ising model defined on a graph G = (V, E) with parameter vector θ* and associated signed edge set E* such that assumption (A3) is satisfied. Moreover, assumptions (A1) and (A2) are satisfied by the population covariance matrix Q*. Suppose that the regularization parameter is selected to satisfy λ_n ≥ c_1 √(log p / n) for some constant c_1 > 0. Then there exists a constant K independent of (n, p, d) such that if

n > K d³ log p,    (15)

then with probability at least 1 − 2 exp(−c λ_n² n) → 1 as n → ∞ (for some constant c > 0), the following properties hold:

(a) For each node r ∈ V, the Lasso estimator (7) has a unique solution, and thus uniquely specifies a signed neighborhood N̂_±(r).

(b) For each node r ∈ V, the estimated signed neighborhood N̂_±(r) correctly excludes all edges not in the true neighborhood. Moreover, it correctly includes all edges if the minimum magnitude θ̄_min of the rescaled parameter satisfies θ̄_min ≥ c_2 √d λ_n for some constant c_2 > 0 depending only on C_min.

Theorem 2 indicates that, if the same assumptions (A1) and (A2) are imposed on the population covariance matrix Q*, then in the high-dimensional setting (n ≪ p), the Lasso estimator is model selection consistent with n = Ω(d³ log p) samples for any tree-like Ising model in the paramagnetic phase.

4 Proof of the Main Results

In this section, we present the detailed proofs of Theorem 1 and Theorem 2.

4.1 Sketch of the proof

Our proofs of Theorem 1 and Theorem 2 closely follow that of ravikumar2010high. First, we prove Theorem 1 for a fixed design of the sample covariance matrices Q^n. Then, it is demonstrated that imposing the dependency condition (A1) and incoherence condition (A2) on the population covariance matrix Q* guarantees that the associated conditions hold for the sample covariance matrices Q^n with high probability. Given Theorem 1, the proof of Theorem 2 is the same as that in ravikumar2010high, using some large-deviation analysis. Thus our main focus is the proof of Theorem 1, which, due to the misspecified quadratic loss, requires several key modifications of the analysis in ravikumar2010high.

As in ravikumar2010high, we use the primal-dual witness proof framework. The main idea of the primal-dual witness method is to explicitly construct an optimal primal-dual pair which satisfies the sub-gradient optimality conditions associated with the ℓ_1-regularized estimator (in our case, the Lasso estimator (7)). Subsequently, we prove that under the stated assumptions on (n, p, d), the optimal primal-dual pair can be constructed such that it acts as a witness, i.e., a certificate that guarantees that the neighborhood-based Lasso estimator correctly recovers the signed edge set of the graph G.

Specifically, for each vertex r ∈ V, an optimal primal-dual pair (θ̂, ẑ) is constructed, where θ̂ ∈ ℝ^{p−1} is a primal solution and ẑ ∈ ℝ^{p−1} is the associated sub-gradient vector. They satisfy the zero sub-gradient optimality condition [rockafellar1970convex] associated with the Lasso estimator (7):

∇ℓ_n(θ̂) + λ_n ẑ = 0,    (16)

where the sub-gradient vector ẑ ∈ ∂‖θ̂‖_1 satisfies

ẑ_t = sign(θ̂_t) if θ̂_t ≠ 0, and |ẑ_t| ≤ 1 otherwise.    (17)

Then, the pair (θ̂, ẑ) is a primal-dual optimal solution to (7) and its dual. Further, to ensure that such an optimal primal-dual pair correctly specifies the signed neighborhood of node r, the sufficient and necessary conditions are as follows:

(a) sign(ẑ_S) = sign(θ*_S),   (b) θ̂_{S^c} = 0.    (18)

Note that while the regression in (7) corresponds to a convex problem, in the high-dimensional regime (n < p) it is not necessarily strictly convex, so that there might be multiple optimal solutions. Fortunately, the following lemma from ravikumar2010high provides sufficient conditions for shared sparsity among optimal solutions, as well as uniqueness of the optimal solution.

Lemma 1

(Lemma 1 in ravikumar2010high). Suppose that there exists an optimal primal solution θ̂ with associated optimal dual vector ẑ such that ‖ẑ_{S^c}‖_∞ < 1. Then any optimal primal solution θ̃ must have θ̃_{S^c} = 0. Moreover, if the Hessian sub-matrix [∇²ℓ_n(θ̂)]_{SS} is strictly positive definite, then θ̂ is the unique optimal solution.

According to Lemma 1, following ravikumar2010high, we can construct a primal-dual witness for the Lasso estimator (7) as follows:

(a) First, set θ̂_S as the minimizer of the partial penalized likelihood

θ̂_S ≜ argmin_{θ_S ∈ ℝ^{d_r}} { ℓ_n((θ_S, 0)) + λ_n ‖θ_S‖_1 },    (19)

and then set ẑ_S ∈ ∂‖θ̂_S‖_1 as the associated sub-gradient vector.

(b) Second, set θ̂_{S^c} = 0 so that condition (18)(b) holds.

(c) Third, obtain ẑ_{S^c} from (16) by substituting the values of θ̂ and ẑ_S.

(d) Finally, we need to show that the stated scalings of (n, p, d) imply that, with high probability, the remaining conditions (18)(a) and (17) are satisfied.

Similar to ravikumar2010high, our analysis in step (d) demonstrates that ‖ẑ_{S^c}‖_∞ < 1 with high probability. Furthermore, under the conditions of Theorem 1, the sub-matrix Q^n_{SS} of the sample covariance matrix is strictly positive definite with high probability, so that, by Lemma 1, it is guaranteed that the primal solution θ̂ is unique.
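The witness construction can be mimicked numerically: solve the restricted Lasso on S, set θ̂ to zero on S^c, recover ẑ_{S^c} from (16), and check strict dual feasibility. The sketch below does this for a chain Ising model; the graph, vertex, and regularization constant are illustrative assumptions.

```python
# Numerical primal-dual witness for the Lasso (7), steps (a)-(d).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
p, K0, n, r = 8, 0.4, 4000, 3                      # vertex r = 3 has true neighbors {2, 4}
X = np.empty((n, p))
X[:, 0] = rng.choice([-1.0, 1.0], size=n)
for t in range(p - 1):                             # exact forward sampling on a chain
    agree = rng.random(n) < (1.0 + np.tanh(K0)) / 2.0
    X[:, t + 1] = np.where(agree, X[:, t], -X[:, t])

rest = [t for t in range(p) if t != r]
y, A = X[:, r], X[:, rest]
S = [rest.index(2), rest.index(4)]                 # indices of N(r) within x_{\r}
Sc = [j for j in range(p - 1) if j not in S]
lam = 4.0 * np.sqrt(np.log(p) / n)

theta = np.zeros(p - 1)                            # steps (a)-(b): restricted fit, zeros on S^c
theta[S] = Lasso(alpha=lam, fit_intercept=False).fit(A[:, S], y).coef_
grad = A.T @ (A @ theta - y) / n                   # gradient (22) of the quadratic loss
z_Sc = -grad[Sc] / lam                             # step (c): solve (16) on S^c
print(np.sign(theta[S]), np.abs(z_Sc).max())       # witness succeeds if the max is < 1
```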

4.2 Some key results

Some key technical results central to the proof are presented in this subsection. First, we obtain the solution to the zero-gradient condition ∇E[ℓ_n(θ)] = 0 for the quadratic loss (8).

Lemma 2

For a tree-like graph in the paramagnetic phase (assumption (A3)), the solution to ∇E[ℓ_n(θ)] = 0, denoted as θ̄, can be obtained as

θ̄_t = ( tanh(θ*_{rt}) / (1 − tanh²(θ*_{rt})) ) / ( 1 + ∑_{t′∈N(r)} tanh²(θ*_{rt′}) / (1 − tanh²(θ*_{rt′})) ) for t ∈ N(r), and θ̄_t = 0 otherwise.    (20)

In particular, for RR graphs with constant coupling strength K_0 and node degree d, there is

θ̄_t = tanh(K_0) / ( 1 + (d − 1) tanh²(K_0) ) for t ∈ N(r), and θ̄_t = 0 otherwise.    (21)

Proof  The gradient of the quadratic loss ℓ_n(θ) in (8) w.r.t. θ reads

∇ℓ_n(θ) = −(1/n) ∑_{i=1}^n x_{∖r}^{(i)} ( x_r^{(i)} − ∑_{t∈V∖{r}} θ_t x_t^{(i)} ).    (22)

After taking the expectation of the gradient over the distribution P(x; θ*) and setting it to zero, we obtain in matrix form:

C_{∖r} θ = c,    (23)

where C_{∖r} ≜ E[x_{∖r} x_{∖r}^⊤] is the covariance matrix of x_{∖r} and c ≜ E[x_r x_{∖r}]. The solution to (23), denoted as θ̄, can be analytically obtained as θ̄ = C_{∖r}^{−1} c. Next, we construct the full covariance matrix C of all p spins as follows

C ≜ E[x x^⊤] = [ 1, c^⊤ ; c, C_{∖r} ],    (24)

where x_r is indexed as the first variable in x without loss of generality. From the block matrix inversion lemma, the inverse covariance matrix can be computed as

C^{−1} = [ γ, −γ θ̄^⊤ ; −γ θ̄, C_{∖r}^{−1} + γ θ̄ θ̄^⊤ ],    (25)

where

γ ≜ ( 1 − c^⊤ C_{∖r}^{−1} c )^{−1},    (26)
θ̄ = C_{∖r}^{−1} c.    (27)

On the other hand, for a sparse tree-like graph in the paramagnetic phase, the inverse covariance matrix can be computed from the Hessian of the Gibbs free energy [ricci2012bethe, nguyen2012bethe, Abbara2019c]. Specifically, each element of the covariance matrix can be expressed as

C_{ij} = ∂m_i / ∂h_j,    (28)

where m_i ≜ ⟨x_i⟩ with h = (h_1, …, h_p)^⊤ being auxiliary external fields, and the assessment is carried out at h = 0. In addition, for technical convenience we introduce the Gibbs free energy as

G(m) ≜ max_h { ∑_i h_i m_i − F(h) },    (29)

where F(h) ≜ log Z(θ*, h) is the free energy of the Ising model with external fields h. The definition (29) indicates that the following two relations hold:

∂G(m)/∂m_i = h_i,    (30)
( ∂²G(m)/∂m ∂m^⊤ )^{−1} = C,    (31)

where the evaluations are performed at h = 0 and m = 0 (m = 0 under the paramagnetic assumption). Consequently, the inverse covariance matrix of a tree-like graph can be computed as [ricci2012bethe, nguyen2012bethe, Abbara2019c]

(C^{−1})_{ij} = ( 1 + ∑_{k∈N(i)} tanh²(θ*_{ik}) / (1 − tanh²(θ*_{ik})) ) δ_{ij} − ( tanh(θ*_{ij}) / (1 − tanh²(θ*_{ij})) ) I[(i, j) ∈ E].    (32)

The two representations of C^{−1} in (25) and (32) are equivalent, so the corresponding elements should equal each other. Thus, the following identities hold:

γ = 1 + ∑_{t∈N(r)} tanh²(θ*_{rt}) / (1 − tanh²(θ*_{rt})),   γ θ̄ = tanh(θ*_{∖r}) / (1 − tanh²(θ*_{∖r})),    (33)

where tanh(·) is applied element-wise. From (33), we obtain (20), which is a rescaled version of the true couplings. In particular, for RR graphs with constant coupling K_0 and node degree d, substituting θ*_{rt} = K_0 into (20) yields (21), which completes the proof.
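Lemma 2 is easy to verify by brute force on a small tree (a sketch with assumed parameters, for verification only): enumerate all 2^p configurations of (1), form the exact covariance, solve (23), and compare with the closed form (21).

```python
# Brute-force check of Lemma 2 on a small star graph (a tree).
import itertools
import numpy as np

p, K0 = 5, 0.3                                     # star: center r = 0, leaves 1..4, so d = 4
states = np.array(list(itertools.product([-1, 1], repeat=p)), dtype=float)
energy = sum(K0 * states[:, 0] * states[:, j] for j in range(1, p))
probs = np.exp(energy)
probs /= probs.sum()                               # Boltzmann weights of eq. (1)

C = (states * probs[:, None]).T @ states           # E[x x^T]; means vanish at zero field
theta_bar = np.linalg.solve(C[1:, 1:], C[1:, 0])   # solution of (23) for vertex r = 0

d, t = p - 1, np.tanh(K0)
print(theta_bar)                                   # each entry ~ tanh(K0)/(1+(d-1)tanh^2(K0))
print(t / (1 + (d - 1) * t**2))
```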

The result of Lemma 2 indicates that, under assumption (A3), the solution θ̄ to the zero-gradient condition with the quadratic loss shares the same sign structure as the true parameter vector θ*_{∖r}, i.e., sign(θ̄) = sign(θ*_{∖r}). The minimum magnitude of the rescaled values for t ∈ N(r) is denoted as

θ̄_min ≜ min_{t∈N(r)} |θ̄_t|,    (34)

where θ̄_t is defined in (20).

Subsequently, for any fixed vertex r ∈ V, given x^{1:n}, we re-write the zero sub-gradient condition (16) as follows

∇ℓ_n(θ̂) − ∇ℓ_n(θ̄) = W^n − λ_n ẑ,    (35)

where W^n ≜ −∇ℓ_n(θ̄) and the gradient differences are evaluated at the rescaled parameter θ̄ in (20), as opposed to the true parameter θ*_{∖r} as in ravikumar2010high. Applying the mean-value theorem element-wise to (35), we obtain

Q^n [θ̂ − θ̄] = W^n − λ_n ẑ.    (36)

In contrast to ravikumar2010high, the remainder term disappears due to the quadratic loss (8). The t-th element of W^n, denoted as W^n_t, can be written as follows

W^n_t = (1/n) ∑_{i=1}^n w_t^{(i)},    (37)
w_t^{(i)} ≜ x_t^{(i)} ( x_r^{(i)} − ∑_{u∈V∖{r}} θ̄_u x_u^{(i)} ).    (38)

It can be seen that W^n_t has a quite different form from its counterpart in ravikumar2010high. The properties of the random variable w_t (a generic copy of w_t^{(i)}) are shown in Lemma 3.

Lemma 3

The random variable w_t defined in (38) has zero mean and bounded variance, i.e., E[w_t] = 0 and Var(w_t) ≤ 1. Furthermore, w_t itself is bounded by |w_t| ≤ 1 + √d.

Proof  First, E[w_t] = 0 can be readily obtained from Lemma 2, since w_t is the t-th element of the negative gradient at θ̄ for a single sample and the expected gradient vanishes at θ̄. Thus, to prove Var(w_t) ≤ 1, it suffices to prove E[w_t²] ≤ 1. To this end, we introduce an auxiliary function

f(θ) ≜ E[ ( x_r − ∑_{u∈V∖{r}} θ_u x_u )² ].    (39)

Thus we have E[w_t²] = f(θ̄) since x_t² = 1. The gradient vector can be computed as ∇f(θ) = 2(C_{∖r}θ − c). Since C_{∖r}θ̄ = c as shown in Lemma 2, we have ∇f(θ̄) = 0. Moreover, since ∇²f(θ) = 2C_{∖r} ⪰ 0, we can conclude that f(θ) reaches its minimum at θ̄. As a result, we have

E[w_t²] = f(θ̄) ≤ f(0) = E[x_r²] = 1,

where in the last line the fact that x_r² = 1 is used. Therefore, we obtain Var(w_t) ≤ E[w_t²] ≤ 1.

Finally, recalling the result (20), we have

|w_t| ≤ 1 + ∑_{u∈N(r)} |θ̄_u| = 1 + ( ∑_{u∈N(r)} |tanh(θ*_{ru})| / (1 − tanh²(θ*_{ru})) ) / ( 1 + ∑_{u∈N(r)} tanh²(θ*_{ru}) / (1 − tanh²(θ*_{ru})) ).    (40)

To proceed, consider an auxiliary function g(y) ≜ y²/(1 − y²). Then it can be proved, using |tanh(θ*_{ru})|/(1 − tanh²(θ*_{ru})) = √(g(1 + g)) together with the Cauchy–Schwarz inequality, that ∑_{u∈N(r)} |tanh(θ*_{ru})|/(1 − tanh²(θ*_{ru})) ≤ √(A(d_r + A)) with A ≜ ∑_{u∈N(r)} g(tanh(θ*_{ru})), so that from (40), we have

|w_t| ≤ 1 + √(A(d_r + A)) / (1 + A).    (41)

It can be easily checked that √(A(d_r + A)) ≤ √(A(d + A)) since d_r ≤ d. We introduce another auxiliary function

h(A) ≜ √(A(d + A)) / (1 + A).    (42)

The first-order derivative of h(A) can be easily computed as

h′(A) = ( d + (2 − d)A ) / ( 2(1 + A)² √(A(d + A)) ).    (43)

As a result, for d > 2, h′(A) > 0 when A < d/(d − 2) and h′(A) < 0 when A > d/(d − 2). Consequently,

h(A) ≤ h( d/(d − 2) ) = d / (2√(d − 1)).    (44)

Finally, combining the above results together yields

|w_t| ≤ 1 + d / (2√(d − 1)) ≤ 1 + √d.    (45)

By definition, there is |W^n_t| ≤ (1/n) ∑_{i=1}^n |w_t^{(i)}| so that each W^n_t obeys the same bound, which completes the proof.
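The claims of Lemma 3 can likewise be sanity-checked by exact enumeration on a small tree (assumed parameters; not part of the formal proof):

```python
# Exact check of Lemma 3: w_t = x_t (x_r - theta_bar^T x_{\r}) from (38)
# should have mean 0, variance <= 1, and magnitude <= 1 + sqrt(d).
import itertools
import numpy as np

p, K0 = 5, 0.3                                     # star graph centered at r = 0
states = np.array(list(itertools.product([-1, 1], repeat=p)), dtype=float)
energy = sum(K0 * states[:, 0] * states[:, j] for j in range(1, p))
probs = np.exp(energy)
probs /= probs.sum()

C = (states * probs[:, None]).T @ states
theta_bar = np.linalg.solve(C[1:, 1:], C[1:, 0])   # rescaled parameter from (23)

resid = states[:, 0] - states[:, 1:] @ theta_bar   # x_r - theta_bar^T x_{\r}
for t in range(1, p):
    w_t = states[:, t] * resid                     # the variable w_t of (38)
    mean = probs @ w_t
    var = probs @ w_t**2 - mean**2
    print(f"t={t}: E[w_t]={mean:+.1e}, Var(w_t)={var:.3f}, max|w_t|={np.abs(w_t).max():.3f}")
```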
Using the results in Lemma 3, we can obtain the behavior of ‖W^n‖_∞, as shown in the following lemma.

Lemma 4

For the specified mutual incoherence parameter α ∈ (0, 1], if λ_n ≥ c √(log p / n) for some constant c > 0 and n > K d log p for some constant K > 0, then

P[ (2 − α)/λ_n ‖W^n‖_∞ ≥ α ] ≤ 2 exp( −c′ λ_n² n + log p ),    (46)

which converges to zero at rate exp(−c′ λ_n² n) as n → ∞ (for some constant c′ > 0 depending on α).

Proof  According to Lemma 3, applying Bernstein's inequality [vershynin2018high], we have

P( |W^n_t| ≥ ε ) ≤ 2 exp( − (n ε²/2) / (σ² + M ε/3) ),    (47)

where σ² ≤ 1 and M ≤ 1 + √d denote the variance and magnitude bounds in Lemma 3. Similar to vuffray2016interaction, inverting the following relation

δ = 2 exp( − (n ε²/2) / (σ² + M ε/3) )    (48)

and substituting the result in (47) yields

P( |W^n_t| ≥ √(2σ² log(2/δ)/n) + (2M log(2/δ))/(3n) ) ≤ δ,    (49)

where δ ∈ (0, 1). Suppose that n > 2M² log(2/δ)/(9σ²); then (2M log(2/δ))/(3n) ≤ √(2σ² log(2/δ)/n) while σ² ≤ 1. Consequently, we have

P( |W^n_t| ≥ 2√(2 log(2/δ)/n) ) ≤ δ,

where a relaxed result is obtained. Subsequently, applying the union bound over the p − 1 elements of W^n, we obtain an expression which is independent of d:

P( ‖W^n‖_∞ ≥ 2√(2 log(2(p − 1)/δ)/n) ) ≤ δ.    (50)

Setting δ = 2(p − 1) exp(−c′ λ_n² n) with c′ = α²/(8(2 − α)²), then if λ_n ≥ c √(log p / n) for a sufficiently large constant c > 0 and n > K d log p, the above condition on n is satisfied and (50) yields (46), which completes the proof.