Simultaneous Inference for Pairwise Graphical Models with Generalized Score Matching

05/15/2019 · by Ming Yu, et al. · The University of Chicago Booth School of Business

Probabilistic graphical models provide a flexible yet parsimonious framework for modeling dependencies among nodes in networks. There is a vast literature on parameter estimation and consistent model selection for graphical models. However, in many applications, scientists are also interested in quantifying the uncertainty associated with the estimated parameters and selected models, which the current literature has not addressed thoroughly. In this paper, we propose a novel estimator for statistical inference on edge parameters in pairwise graphical models based on the generalized Hyvärinen scoring rule. The Hyvärinen scoring rule is especially useful in cases where the normalizing constant cannot be obtained efficiently in closed form, a common problem for graphical models, including Ising models and truncated Gaussian graphical models. Our estimator allows us to perform statistical inference for general graphical models, whereas existing works mostly focus on statistical inference for Gaussian graphical models, where finding the normalizing constant is computationally tractable. Under mild conditions that are typically assumed in the literature for consistent estimation, we prove that our proposed estimator is $\sqrt{n}$-consistent and asymptotically normal, which allows us to construct confidence intervals and build hypothesis tests for edge parameters. Moreover, we show how our proposed method can be applied to test hypotheses that involve a large number of model parameters simultaneously. We illustrate the validity of our estimator through extensive simulation studies on a diverse collection of data-generating processes.


1 Introduction

Undirected probabilistic graphical models are widely used to explore and represent dependencies between random variables (Lauritzen, 1996). They have been used in areas ranging from computational biology to neuroscience and finance. An undirected probabilistic graphical model consists of an undirected graph $G = (V, E)$, where $V$ is the vertex set and $E \subseteq V \times V$ is the edge set, and a random vector $X = (X_a)_{a \in V}$. Each coordinate of the random vector $X$ is associated with a vertex in $V$, and the graph structure encodes the conditional independence assumptions underlying the distribution of $X$. In particular, $X_a$ and $X_b$ are conditionally independent given all the other variables if and only if $(a, b) \notin E$, that is, the nodes $a$ and $b$ are not adjacent in $G$. One of the fundamental problems in statistics is that of learning the structure of $G$ from i.i.d. samples of $X$ and quantifying the uncertainty of the estimated structure. Drton and Maathuis (2017) provide a recent review of algorithms for learning the structure, while Janková and van de Geer (2019) provide an overview of statistical inference in Gaussian graphical models.

Gaussian graphical models are a special case of undirected probabilistic graphical models and have been widely studied in the machine learning literature. Suppose that $X \sim N(0, \Sigma)$. In this case, the conditional independence graph is determined by the pattern of non-zero elements of the inverse of the covariance matrix, $\Omega = \Sigma^{-1}$. In particular, $X_a$ and $X_b$ are conditionally independent given all the other variables in $X$ if and only if $\Omega_{ab}$ and $\Omega_{ba}$ are both zero. This simple relationship has been fundamental for the development of a rich literature on Gaussian graphical models and has facilitated the development of fast algorithms and inferential procedures (see, for example, Dempster, 1972; Drton and Perlman, 2004; Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Friedman et al., 2008; Rothman et al., 2008; Yuan, 2010; Sun and Zhang, 2013; Cai et al., 2011).
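To make this zero-pattern correspondence concrete, here is a minimal numpy sketch (ours, for illustration; not from the paper) for a three-node chain graph $1 - 2 - 3$, where the only missing edge is between nodes 1 and 3:

```python
import numpy as np

# Chain graph 1 -- 2 -- 3: X_1 and X_3 are conditionally independent given X_2,
# so the (1, 3) entry of the precision matrix is zero (the covariance is not).
Omega = np.array([[ 2.0, -0.8,  0.0],
                  [-0.8,  2.0, -0.8],
                  [ 0.0, -0.8,  2.0]])
Sigma = np.linalg.inv(Omega)
print(np.round(Sigma, 3))        # Sigma[0, 2] != 0: marginally dependent

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(3), Sigma, size=100_000)
Omega_hat = np.linalg.inv(X.T @ X / len(X))
print(np.round(Omega_hat, 2))    # (1, 3) entry is approximately 0: no edge
```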

In this paper, we consider a more general, but still tractable, class of pairwise interaction graphical models with densities belonging to an exponential family with natural parameter space $\Theta$:

$\log p_\theta(x) = \sum_{a \in V} \theta_a^\top t_a(x_a) + \sum_{a < b} \theta_{ab}^\top t_{ab}(x_a, x_b) - \Psi(\theta). \qquad (1)$

The functions $t_a$, $t_{ab}$ are the sufficient statistics and $\Psi(\theta)$ is the log-partition function. We assume throughout the paper that the support of the densities is either $\mathbb{R}^p$ or $[0, \infty)^p$ and is dominated by the Lebesgue measure on $\mathbb{R}^p$. To simplify the notation, for a log-density of the form given in (1) we will write

$\log p_\theta(x) = \theta^\top t(x) - \Psi(\theta),$

where $\theta$ collects the parameters $(\theta_a)_{a \in V}$ and $(\theta_{ab})_{a < b}$, and $t(x)$ is the corresponding vector of sufficient statistics. The natural parameter space has the form $\Theta = \{\theta : \Psi(\theta) < \infty\}$. Under the model in (1), there is no edge between $a$ and $b$ in the corresponding conditional independence graph if and only if $\theta_{ab} = 0$. The model in (1) encompasses a large number of graphical models studied in the literature, as we discuss in Section 1.2. Lin et al. (2016) studied estimation of parameters in model (1); however, the focus of this paper, as we discuss next, is on performing statistical inference for parameters in (1), that is, constructing honest confidence intervals and statistical tests.

The focus of the paper is on inferential analysis of the parameters in the model given in (1), as well as the Markov dependencies between observed variables. Our inference procedure does not rely on oracle support recovery properties of an estimator and is therefore uniformly valid in a high-dimensional regime and robust to the model selection mistakes that commonly occur in ultra-high-dimensional settings. Our inference procedure is based on the generalized Hyvärinen scoring rule estimate of $\theta$ in (1). The same estimator was used in Lin et al. (2016); however, rather than focusing on consistent model selection, we use the initial estimator to construct a regular linear estimator (van der Vaart, 1998). We establish a Bahadur-type representation for our final regular estimator that is robust to model selection mistakes and valid for a large class of data-generating distributions. The purpose of establishing a Bahadur representation is to approximate an estimator by a sum of independent random variables, and hence prove the asymptotic normality of the estimator for (1), allowing us to conduct statistical inference on the model parameters (see Bahadur, 1966). In particular, we show how to construct confidence intervals for a parameter in the model that have nominal coverage, and we propose a statistical test for the existence of edges in the graphical model with nominal size. These results complement the existing literature, which is focused on consistent model selection and parameter recovery, as we review in the next section. Furthermore, we develop methodology for constructing simultaneous confidence intervals for all the parameters in the model (1) and apply this methodology to testing the parameters in the differential network (following Li et al. (2007) and Danaher et al. (2014), we define the differential network as the difference between the parameters of two graphical models). The main idea here is to use the Gaussian multiplier bootstrap to approximate the distribution of the maximum coordinate of the linear part in the Bahadur representation. An appropriate quantile obtained from the bootstrap distribution is used to approximate the width of the simultaneous confidence intervals and the cutoff values for the tests on the parameters of the differential network.
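To fix ideas, the following sketch shows one way the Gaussian multiplier bootstrap step could be implemented. It is a schematic illustration under our own naming, where `psi` stands for the $n \times d$ matrix of influence-function values from the linear part of the Bahadur representation; it is not code from the paper:

```python
import numpy as np

def multiplier_bootstrap_quantile(psi, alpha=0.05, n_boot=1000, seed=0):
    """Approximate the (1 - alpha)-quantile of max_j |n^{-1/2} sum_i psi[i, j]|
    by reweighting the centered rows of psi with i.i.d. standard normals."""
    rng = np.random.default_rng(seed)
    n, _ = psi.shape
    psi_centered = psi - psi.mean(axis=0)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        xi = rng.standard_normal(n)                   # Gaussian multipliers
        stats[b] = np.abs(psi_centered.T @ xi).max() / np.sqrt(n)
    return np.quantile(stats, 1 - alpha)
```

The returned quantile plays the role of the simultaneous critical value: it sets the common half-width of the simultaneous confidence intervals and the cutoff for the differential network tests.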

1.1 Main contribution

This paper makes two major contributions to the literature on statistical inference for graphical models. First, compared to previous work on high-dimensional inference in graphical models (Ren et al., 2015; Barber and Kolar, 2018; Wang and Kolar, 2016; Janková and van de Geer, 2015), this is the first work on statistical inference in models where computing the log-partition function is intractable. Existing works mostly focus on Gaussian graphical models with a tractable normalizing constant, whereas our method can be applied to more general models, as we discuss in Section 2.1. Second, we apply our proposed method to simultaneous inference on all edges connected to a specific node. Our simultaneous inference procedure can be used to

  1. test whether a node is isolated in a graph; that is, whether it is conditionally independent of all the other nodes;

  2. estimate the support of the graph by setting an appropriate threshold on the proposed estimators; and

  3. test for differences between two graphical models: given observations from two graphical models over the same nodes, test whether the local connectivity pattern of a specific node is the same in both graphs.

1.2 Related work

Our work straddles two areas of statistical learning which have attracted significant research of late: model selection and estimation in high-dimensional graphical models, and high-dimensional inference. We briefly review the literature most relevant to our work, and refer the reader to two recent review articles for a comprehensive overview (Drton and Maathuis, 2017; Janková and van de Geer, 2019). Drton and Maathuis (2017) focuses on structure learning in graphical models, while Janková and van de Geer (2019) reviews inference in Gaussian graphical models.

We start by reviewing the literature on learning the structure of probabilistic graphical models. Much of the research effort has focused on learning the structure of Gaussian graphical models, where the edge set of the graph is encoded by the non-zero elements of the precision matrix $\Omega = \Sigma^{-1}$. The literature here roughly splits into two categories: global and local methods. Global methods typically estimate the precision matrix by maximizing a regularized Gaussian log-likelihood (Yuan and Lin, 2007; Rothman et al., 2008; Friedman et al., 2008; d'Aspremont et al., 2008; Ravikumar et al., 2011; Fan et al., 2009; Lam and Fan, 2009), while local methods estimate the graph structure by learning the neighborhood or Markov blanket of each node separately (Meinshausen and Bühlmann, 2006; Yuan, 2010; Cai et al., 2011; Liu and Wang, 2017; Zhao and Liu, 2014). Extensions to more general distributions in Gaussian and elliptical families are possible using copulas, as the graph structure within these families is again determined by the inverse of the latent correlation matrix (Liu et al., 2009, 2012a; Xue and Zou, 2012; Liu et al., 2012b; Fan et al., 2017).

Once we depart from the Gaussian distribution and related families, learning the conditional independence structure becomes more difficult, primarily owing to the computational intractability of evaluating the log-partition function. A computationally tractable alternative to regularized maximum likelihood estimation is regularized pseudo-likelihood, which was studied in the context of learning the structure of Ising models in Höfling and Tibshirani (2009), Ravikumar et al. (2010), and Xue et al. (2012). Similar methods were developed in the study of mixed exponential family graphical models, where a node's conditional distribution is a member of an exponential family, such as Bernoulli, Gaussian, Poisson, or exponential. See Guo et al. (2011a), Guo et al. (2011b), Lee and Hastie (2015), Cheng et al. (2013), Yang et al. (2012), and Yang et al. (2014) for more details.

More recently, score matching estimators have been investigated for learning the structure of graphical models in high dimensions when the normalizing constant is not available in closed form (Lin et al., 2016; Yu et al., 2018b). Score matching was first proposed in Hyvärinen (2005) and subsequently extended to binary models and models with non-negative data in Hyvärinen (2007). It offers a computational advantage when the normalizing constant is not available in closed form, making likelihood-based approaches intractable, and is particularly appealing for estimation in exponential families, as the objective function is quadratic in the parameters of interest. Sun et al. (2015) develop a method based on score matching for learning conditional independence graphs underlying structured infinite-dimensional exponential families. Forbes and Lauritzen (2015) investigated the use of score matching for inference in Gaussian linear models in the low-dimensional setting. However, despite its power, there have been no results on inference in high-dimensional models using score matching. As one of our contributions in this paper, we build on the prior work on estimation using generalized score matching and develop an approach to statistical inference for high-dimensional graphical models. In particular, we construct a novel $\sqrt{n}$-consistent estimator of parameters in (1). This is the first procedure that obtains a parametric rate of convergence for an edge parameter in a graphical model where computing the normalizing constant is intractable.

Next, we review the literature on high-dimensional inference, focusing on work related to high-dimensional undirected graphical models. Liu (2013) developed a procedure that estimates the conditional independence graph from Gaussian observations and controls the false discovery rate asymptotically. Wasserman et al. (2014) develop confidence guarantees for undirected graphs under minimal assumptions by deriving Berry-Esseen bounds on the accuracy of the normal approximation. Ren et al. (2015), Janková and van de Geer (2015), and Janková and van de Geer (2017) develop methods for constructing confidence intervals for edge parameters in Gaussian graphical models, based on the idea of debiasing a regularized estimator (Zhang and Zhang, 2013; van de Geer et al., 2014; Javanmard and Montanari, 2014). A related approach was developed for edge parameters in mixed graphical models whose node conditional distributions belong to an exponential family in Wang and Kolar (2016). Wang and Kolar (2014) develop methodology for performing statistical inference in time-varying and conditional Gaussian graphical models, while Barber and Kolar (2018) and Lu et al. (2018) develop methods for semi-parametric copula models. We contribute to the literature on high-dimensional inference by demonstrating how to construct regular estimators for probabilistic graphical models whose normalizing constant is intractable. Our estimators are robust to model selection mistakes and allow us to perform valid statistical inference for edge parameters in a large family of data-generating distributions.

Finally, we contribute to the literature on simultaneous inference in high-dimensional models. Zhang and Cheng (2017) and Dezeure et al. (2017) develop methods for performing simultaneous inference on all the coefficients in a high-dimensional linear regression. These procedures allow the dimensionality of the coefficient vector to be exponential in the sample size and rely on the bootstrap to approximate the quantiles of the test statistic. We extend these ideas to the high-dimensional graphical model setting and show how to build simultaneous hypothesis tests on the neighbors of a specific node.

A conference version of this paper was presented at the Annual Conference on Neural Information Processing Systems in 2016 (Yu et al., 2016). Compared to the conference version, in this paper we extend the results in the following ways. First, we extend the results to include the generalized score matching method (Yu et al., 2018b, a) in place of the original score matching method. This generalized form of the score matching method allows us to improve the estimation accuracy and obtain better inference results for non-negative data. Moreover, instead of focusing on a single edge as in the conference version, in this work we propose a procedure for simultaneous inference for all edges connected to a specific node. This allows us to build hypothesis tests for a broad class of applications, including testing of isolated nodes, support recovery, and testing the difference between two graphical models. Lastly, we run additional experiments to demonstrate the effectiveness of our proposed method.

1.3 Notation

We use $[p]$ to denote the set $\{1, \ldots, p\}$. For a vector $v$, we let $S(v) = \{j : v_j \neq 0\}$ be its support set (with an analogous definition for matrices), and $\|v\|_q$ the $\ell_q$-norm defined as $\|v\|_q = (\sum_j |v_j|^q)^{1/q}$, with the usual extensions for $q \in \{0, \infty\}$, that is, $\|v\|_0 = |S(v)|$ and $\|v\|_\infty = \max_j |v_j|$. For a vector $v$ and an index set $S$, $v_S$ is the sub-vector of $v$ with components corresponding to the set $S$, and $v_{-ab}$ is the sub-vector with the component corresponding to edge $(a, b)$ omitted. We define $\mathbb{E}_n[f(X)] = n^{-1} \sum_{i=1}^n f(x_i)$ as the empirical mean over $n$ samples. For two sequences of numbers $\{a_n\}$ and $\{b_n\}$, we use $a_n \lesssim b_n$, or $a_n = O(b_n)$, to denote that $a_n \leq C b_n$ for some finite positive constant $C$ and all $n$ large enough. The notation $a_n \asymp b_n$ is used to denote that $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We write $\xrightarrow{d}$ for convergence in distribution to a fixed distribution and $\xrightarrow{p}$ for convergence in probability to a constant. For any function $f$, we use $\nabla f$ to denote the gradient and $\Delta f$ to denote the Laplacian of $f$. Note that both the gradient and the Laplacian are taken with respect to $x$.

1.4 Organization of the paper

The remainder of this paper is structured as follows. We begin in Section 2 with background on exponential family pairwise graphical models, the score matching method, and a brief review of statistical inference in high-dimensional models. In Section 3 we describe the construction of our novel estimator for a single edge parameter, based on a three-step procedure. Section 4 provides theoretical results, and Section 5 extends the procedure to simultaneous inference for all edges connected to a specific node. We provide experimental results for synthetic datasets and a real dataset in Sections 6 and 7, respectively. Section 8 provides a conclusion and discussion.

2 Background

We begin by reviewing exponential family pairwise graphical models in Section 2.1, and then introduce the score matching and generalized score matching methods in Section 2.2. Finally, we provide a brief overview of statistical inference for high-dimensional models in Section 2.3.

2.1 Exponential Family Pairwise Graphical Models

Throughout the paper we focus on the case where the distribution belongs to an exponential family with log-densities given in (1), a family that frequently appears in graphical modeling. There are sufficient statistics $t_a(x_a)$ for each $a \in V$ that depend on the individual nodes, and sufficient statistics $t_{ab}(x_a, x_b)$ for each pair $a < b$ that allow for pairwise interactions of different types. The conditional independence graph underlying the distribution has no edge between vertices $a$ and $b$ if and only if $\theta_{ab} = 0$. A special case of the model given in (1) are pairwise interaction models with log-densities

$\log p_\theta(x) = \sum_{a \in V} \theta_a t_a(x_a) + \sum_{a < b} \theta_{ab} t_{ab}(x_a, x_b) - \Psi(\theta), \qquad (2)$

where the $t_{ab}(x_a, x_b)$ are scalar sufficient statistics that depend only on $x_a$ and $x_b$. In what follows, we will consider models that either have the form given in (2) or the more general form given in (1).

A number of well-studied distributions have the form discussed above. We provide some examples below, including examples where the normalizing constant cannot be obtained in closed form.

Gaussian Graphical Models.

The most studied example of a probabilistic graphical model is the Gaussian graphical model. Suppose that the random vector $X$ follows a centered multivariate Gaussian distribution with covariance matrix $\Sigma$ and precision matrix $\Omega = \Sigma^{-1}$. The log-density is given as

$\log p_\Omega(x) = \frac{1}{2} \log \det \Omega - \frac{1}{2} x^\top \Omega x - \frac{p}{2} \log(2\pi), \qquad (3)$

the support of the density is $\mathbb{R}^p$, and the sufficient statistics take the form $t_{ab}(x_a, x_b) = x_a x_b$.

Non-negative Gaussian.

Our second example of a distribution with log-density of the form in (2) is that of a non-negative Gaussian random vector. The probability density function of a non-negative Gaussian random vector $X$ is proportional to that of the corresponding Gaussian vector given in (3), but restricted to the non-negative orthant. Here the support of the density is $[0, \infty)^p$. The conditional independence graph is determined the same way as in the Gaussian graphical model case, through the non-zero pattern of the elements of the precision matrix $\Omega$. The normalizing constant in this family has no closed form, and hence maximum likelihood estimation of $\Omega$ is intractable.
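The intractability is easy to see numerically: the normalizing constant is the Gaussian orthant probability $P(Z \geq 0)$, which has no closed form for a general $\Sigma$ and is itself expensive to approximate in high dimensions. A crude Monte Carlo sketch (ours, purely illustrative):

```python
import numpy as np

# The normalizing constant of the non-negative Gaussian is the orthant
# probability P(Z >= 0) of the corresponding (untruncated) Gaussian.
rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
orthant_prob = np.mean((Z >= 0).all(axis=1))
print(orthant_prob)   # every likelihood evaluation would need this quantity
```

Score matching sidesteps this computation entirely, since the scoring rule never touches the normalizing constant.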

Normal Conditionals.

Our third example is taken from Lin et al. (2016); see also Gelman and Meng (1991) and Arnold et al. (1999). Consider the family of distributions with densities of the form

$p(x) \propto \exp\Big( \sum_{a \in V} \big( \theta_a x_a + \eta_a x_a^2 \big) + \sum_{a < b} \big( \Theta^{(1)}_{ab} x_a x_b + \Theta^{(2)}_{ab} x_a^2 x_b^2 \big) \Big),$

where the matrices $\Theta^{(1)}$ and $\Theta^{(2)}$ are symmetric interaction matrices with zero diagonals. Members of this family have normal conditionals, but the densities themselves need not be unimodal. The conditional independence graph does not contain an edge between vertices $a$ and $b$ if and only if both $\Theta^{(1)}_{ab}$ and $\Theta^{(2)}_{ab}$ are equal to zero. In contrast to Gaussian graphical models, the conditional dependence may also express itself in the variances.

Conditionally specified mixed graphical models.

In general, specifying multivariate distributions is difficult, since in a given problem it might not be clear what class of graphical models to use. On the other hand, specifying univariate distributions is an easier task. Chen et al. (2015) and Yang et al. (2015) explored ways of specifying multivariate joint distributions via univariate exponential families. Consider a conditional density of the form

$p(x_a \mid x_{-a}) \propto \exp\big( \eta_a(x_{-a})\, t_a(x_a) + b_a(x_a) \big), \qquad (4)$

where $t_a$ and $b_a$ are known functions for each $a \in V$. Suppose that for a random vector $X$, each coordinate follows a conditional density of the form in (4) with natural parameter $\eta_a(x_{-a}) = \theta_a + \sum_{b \neq a} \theta_{ab} t_b(x_b)$ and $\theta_{ab} = \theta_{ba}$ for all $a, b$. Then Chen et al. (2015) and Yang et al. (2015) showed that there exists a joint distribution of $X$ compatible with the conditional densities, and that it is of the form

$p(x) \propto \exp\Big( \sum_{a \in V} \big( \theta_a t_a(x_a) + b_a(x_a) \big) + \sum_{a < b} \theta_{ab}\, t_a(x_a)\, t_b(x_b) \Big).$

In particular, the joint density above is of the form given in (1), with pairwise interaction sufficient statistics given as $t_{ab}(x_a, x_b) = t_a(x_a) t_b(x_b)$. When the support of the distribution is $\mathbb{R}^p$ or $[0, \infty)^p$, the parameters of the distribution can be efficiently estimated using score matching. In the case of unknown functions $t_a$, Suggala et al. (2017) explored nonparametric estimation via basis expansion and fitted the parameters using pseudo-likelihood. Developing a valid statistical inference procedure for this nonparametric setting is beyond the scope of the current work.

As an example of a conditionally specified model, to which we will return later in the paper, consider exponential graphical models, where the node-conditional distributions follow an exponential distribution. For a random vector $X$ described by an exponential graphical model, the density function is given by

$p(x; \theta) \propto \exp\Big( -\sum_{a \in V} \theta_a x_a - \sum_{a < b} \theta_{ab} x_a x_b \Big), \qquad x \in [0, \infty)^p.$

Note that the variables take only non-negative values. To ensure that the distribution is valid and normalizable, the natural parameter space consists of matrices whose elements are positive. Therefore, one can only model negative dependencies via the exponential graphical model.

Exponential square-root graphical model.

As our last example, consider the exponential square-root graphical model (Inouye et al., 2016), with density function given by

$p(x; \theta) \propto \exp\Big( \sum_{a \in V} \theta_a \sqrt{x_a} + \sum_{a < b} \theta_{ab} \sqrt{x_a}\, \sqrt{x_b} \Big), \qquad x \in [0, \infty)^p.$

The square-root graphical model is a multivariate generalization of univariate exponential family distributions that can capture positive dependency among nodes. Specifically, it assumes only a mild condition on the parameter matrix, but allows for almost arbitrary negative and positive dependencies. We refer to Inouye et al. (2016) for details on parameter estimation with nodewise regressions and likelihood approximation methods.

2.2 Score Matching

In this section we briefly review the score matching method proposed in Hyvärinen (2005, 2007) and the generalized score matching method for non-negative data proposed in Yu et al. (2018b).

2.2.1 Score Matching

A scoring rule $S(x, Q)$ is a real-valued function that quantifies the accuracy of $Q$ as the distribution from which an observed realization $x$ may have been sampled. There are a large number of scoring rules that correspond to different decision problems (Parry et al., 2012). Given $n$ independent realizations $x_1, \ldots, x_n$ of $X \sim P$, one finds the optimal score estimator $\hat{Q}$ by minimizing the empirical score

$\frac{1}{n} \sum_{i=1}^n S(x_i, Q). \qquad (5)$

When the support is $\mathbb{R}^p$ and the model class consists of twice differentiable densities with respect to Lebesgue measure, the Hyvärinen scoring rule (Hyvärinen, 2005) is given as

$S(x, Q) = \frac{1}{2} \|\nabla \log q(x)\|_2^2 + \Delta \log q(x), \qquad (6)$

where $q$ is the density of $Q$ with respect to Lebesgue measure on $\mathbb{R}^p$. We would like to emphasize that this gradient and Laplacian are with respect to $x$. In this way we get rid of the normalizing constant, which does not depend on $x$. This scoring rule is convenient for learning models that are specified in an unnormalized fashion or whose normalizing constant is difficult to compute. The score matching rule is proper (Dawid, 2007), that is, $\mathbb{E}_{X \sim P}[S(X, Q)]$ is minimized over $Q$ at $Q = P$. Suppose the density $p$ of $P$ is twice continuously differentiable and satisfies

$\mathbb{E}_P \|\nabla \log p(X)\|_2^2 < \infty$

and

$\mathbb{E}_P \|\nabla \log q(X)\|_2^2 < \infty.$

Then the Fisher divergence between $P$ and $Q$,

$D_F(P, Q) = \frac{1}{2} \int p(x)\, \|\nabla \log p(x) - \nabla \log q(x)\|_2^2 \, dx,$

where $p$ is the density of $P$, is induced by the score matching rule (Hyvärinen, 2005). The gradients in the equation above can be thought of as gradients with respect to a hypothetical location parameter, evaluated at the origin (Hyvärinen, 2005).

For a parametric exponential family with densities given in (1), minimizing (5) with the scoring rule in (6) can be done in closed form (Hyvärinen, 2005; Forbes and Lauritzen, 2015). An estimator obtained in this way can be shown to be asymptotically consistent (Hyvärinen, 2005); however, in general it will not be efficient (Forbes and Lauritzen, 2015).
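As an illustration of this closed form, consider the centered Gaussian case with precision matrix $K$, where $\nabla \log p_K(x) = -Kx$ and $\Delta \log p_K(x) = -\operatorname{tr}(K)$, so the empirical score from (5)-(6) becomes the quadratic $\frac{1}{2}\operatorname{tr}(K S K) - \operatorname{tr}(K)$ with $S$ the empirical second-moment matrix, minimized at $K = S^{-1}$. A minimal sketch (ours, under these assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
K_true = np.array([[ 2.0, -0.5],
                   [-0.5,  1.5]])
X = rng.multivariate_normal(np.zeros(2), np.linalg.inv(K_true), size=50_000)

S = X.T @ X / len(X)        # empirical second moments
K_hat = np.linalg.inv(S)    # closed-form score matching estimate: argmin of
                            # 0.5 * tr(K S K) - tr(K) over symmetric K
print(np.round(K_hat, 2))   # close to K_true, no normalizing constant needed
```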

2.2.2 Generalized Score Matching for Non-Negative Data

The score matching method in Section 2.2.1 does not work for non-negative data, since the assumption that $p(x) \nabla \log q(x)$ tends to 0 at the boundary of the support breaks down. To solve this problem, Hyvärinen (2007) proposed a generalization of the score matching approach to the case of non-negative data.

When the support is $[0, \infty)^p$, the non-negative score matching loss (analogous to the Fisher divergence $D_F$) is defined as

$D_+(P, Q) = \frac{1}{2} \int p(x) \sum_{j=1}^p x_j^2 \big( \partial_j \log p(x) - \partial_j \log q(x) \big)^2 \, dx.$

The scoring rule for non-negative data that induces $D_+$ is given as

$S_+(x, Q) = \sum_{j=1}^p \Big( 2 x_j\, \partial_j \log q(x) + x_j^2\, \partial_j^2 \log q(x) + \frac{1}{2}\, x_j^2 \big( \partial_j \log q(x) \big)^2 \Big). \qquad (7)$

For exponential families, the non-negative score matching loss again can be obtained in closed form, and the estimator is consistent and asymptotically normal under suitable conditions (Hyvärinen, 2007).

Yu et al. (2018b) proposed generalized score matching for non-negative data to improve the estimation efficiency of the procedure based on the scoring rule in (7). Let $h_1, \ldots, h_p : [0, \infty) \to [0, \infty)$ be positive and differentiable functions and set $h(x) = (h_1(x_1), \ldots, h_p(x_p))^\top$. The generalized $h$-score matching loss is defined as

$D_h(P, Q) = \frac{1}{2} \int p(x) \sum_{j=1}^p h_j(x_j) \big( \partial_j \log p(x) - \partial_j \log q(x) \big)^2 \, dx,$

where $\partial_j$ denotes the partial derivative with respect to $x_j$. Suppose the following regularity conditions are satisfied:

$h_j(x_j)\, p(x)\, \partial_j \log q(x) \to 0$ as $x_j \searrow 0$ or $x_j \to \infty$, and $\mathbb{E}_P\big[ h_j(X_j) \big( \partial_j \log q(X) \big)^2 \big] < \infty$, $\mathbb{E}_P\big[ \big| h_j'(X_j)\, \partial_j \log q(X) \big| \big] < \infty$ for all $j \in [p]$. \qquad (8)

Under condition (8), the scoring rule corresponding to the generalized $h$-score matching loss is given as

$S_h(x, Q) = \sum_{j=1}^p \Big( h_j'(x_j)\, \partial_j \log q(x) + h_j(x_j)\, \partial_j^2 \log q(x) + \frac{1}{2}\, h_j(x_j) \big( \partial_j \log q(x) \big)^2 \Big).$

The regularity condition (8) is required for applying integration by parts and the Fubini-Tonelli theorem in order to show consistency of the score matching estimator.

Note that by choosing $h_j(x) = x^2$ for all $j$, one recovers the original score matching formulas for non-negative data in (7). The advantage of this generalized score matching rule is that by choosing an increasing but slowly growing $h_j$ (for example, $h_j(x) = \min(x, 1)$), one does not need to estimate high moments of the underlying distribution, which leads to better practical performance and improved theoretical guarantees. See Yu et al. (2018b) for details.
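For concreteness, here is a small sketch of two such weight functions and their (almost everywhere) derivatives, as they would enter the scoring rule $S_h$; the truncation point 1 is an illustrative choice, not prescribed by the paper:

```python
import numpy as np

def h_original(x):          # h(x) = x**2 recovers Hyvarinen's rule (7)
    return x ** 2

def h_original_prime(x):
    return 2.0 * x

def h_truncated(x):         # slowly growing choice, e.g. h(x) = min(x, 1)
    return np.minimum(x, 1.0)

def h_truncated_prime(x):   # derivative is 1 on [0, 1), 0 beyond the kink
    return (x < 1.0).astype(float)
```

Because the truncated $h$ is bounded, the terms $h_j(X_j)(\partial_j \log q(X))^2$ entering the empirical score involve lower moments of $X$ than the $x_j^2$ weighting in (7).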

2.2.3 Score matching for probabilistic graphical models

Score matching has been successfully applied in the context of probabilistic graphical models. Forbes and Lauritzen (2015) studied score matching to learn Gaussian graphical models with symmetry constraints. Lin et al. (2016) proposed a regularized score matching procedure to learn the conditional independence graph in a high-dimensional setting by minimizing

$\hat\theta \in \arg\min_\theta \; \frac{1}{n} \sum_{i=1}^n S(x_i, p_\theta) + \lambda \|\theta\|_1,$

where the scoring rule $S$ is either the one defined in (6) or the one defined in (7). For Gaussian models, $\ell_1$-norm regularized score matching is a simple yet efficient method, which coincides with the method in Liu and Luo (2015). Yu et al. (2018b) improved on the approach of Lin et al. (2016) and studied regularized generalized $h$-score matching of the form

$\hat\theta \in \arg\min_\theta \; \frac{1}{n} \sum_{i=1}^n S_h(x_i, p_\theta) + \lambda \|\theta\|_1.$

Applied to data generated from a multivariate truncated normal distribution, the conditional independence graph can be recovered with the same number of samples that are needed for recovery of the structure of a Gaussian graphical model.

Sun et al. (2015) develop a score matching estimator for learning the structure of nonparametric probabilistic graphical models, extending the work on estimation of infinite-dimensional exponential families (Sriperumbudur et al., 2017). In Section 3, we present a new estimator for components of $\theta$ in (1) that is consistent and asymptotically normal, building on Lin et al. (2016) and Yu et al. (2018b).

2.3 Statistical Inference

We briefly review how to perform statistical inference for low-dimensional parameters in a high-dimensional model. In many statistical problems, the unknown parameter $\theta \in \mathbb{R}^d$ can be partitioned as $\theta = (\theta_1, \eta^\top)^\top$, where $\theta_1$ is a scalar of interest and $\eta$ is a $(d-1)$-dimensional nuisance parameter. Let $\theta^* = (\theta_1^*, \eta^{*\top})^\top$ denote the true unknown parameter. In a high-dimensional setting, where the sample size $n$ is much smaller than the dimensionality $d$ of the parameter $\theta$, it is common to impose structural assumptions on $\theta^*$. For example, in several applications it is common to assume that the true parameter is sparse. Indeed, we will work under this assumption as well.

Let us denote the empirical negative log-likelihood by

$\ell_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(\theta; x_i),$

where $\ell(\theta; x_i)$ is the negative log-likelihood for the $i$-th observation. Let $I$ denote the information matrix, and denote the partition of $I$ corresponding to $\theta = (\theta_1, \eta^\top)^\top$ as

$I = \begin{pmatrix} I_{\theta_1 \theta_1} & I_{\theta_1 \eta} \\ I_{\eta \theta_1} & I_{\eta \eta} \end{pmatrix}. \qquad (9)$

The partial information matrix of $\theta_1$ is denoted as $I_{\theta_1 \mid \eta} = I_{\theta_1 \theta_1} - I_{\theta_1 \eta} I_{\eta\eta}^{-1} I_{\eta \theta_1}$.

Consider for the moment a low-dimensional setting. In order to perform statistical inference about $\theta_1$, one can use the profile partial score function defined as

$S(\theta_1) = \nabla_{\theta_1} \ell_n\big( \theta_1, \hat\eta(\theta_1) \big),$

where $\hat\eta(\theta_1)$ is the maximum partial likelihood estimator for $\eta$ with a fixed parameter $\theta_1$. Under the null hypothesis that $\theta_1^* = \theta_1^0$, we have that (van der Vaart, 1998)

$\sqrt{n}\, S(\theta_1^0) \xrightarrow{d} N\big(0, I_{\theta_1 \mid \eta}\big).$

Therefore, one can reject the null hypothesis for large values of $|S(\theta_1^0)|$. However, in a high-dimensional setting, the estimator $\hat\eta(\theta_1)$ is no longer $\sqrt{n}$-consistent and we have to modify the approach above. In particular, we will show how to modify the profile partial score function to allow for valid inference in a high-dimensional setting based on a sparse estimator of $\eta$.

Without loss of generality, assume that $\theta_1^0 = 0$. For any estimator $\hat\eta$, Taylor's theorem gives

$\sqrt{n}\, \nabla_{\theta_1} \ell_n(0, \hat\eta) = \sqrt{n}\, \nabla_{\theta_1} \ell_n(0, \eta^*) + \sqrt{n}\, \nabla_{\theta_1 \eta}^2 \ell_n(0, \eta^*)\, (\hat\eta - \eta^*) + \mathrm{rem}, \qquad (10)$

where rem is the remainder term. The first term in (10) converges to a normal distribution under suitable assumptions by the central limit theorem (CLT). The distribution of the second term, however, is in general intractable to obtain. This is due to the fact that the distribution of $\hat\eta - \eta^*$ depends on the selected model. Unless we are willing to assume stringent and untestable conditions under which it is possible to show that the true model can be selected, the limiting distribution of $\hat\eta$ cannot be estimated even asymptotically (Leeb and Pötscher, 2007). To overcome this issue, one needs to modify the profile partial score function so that its limiting distribution does not depend on the way the nuisance parameter is estimated.

Ning and Liu (2017) introduced the following decorrelated score function

$S(\theta_1, \eta) = \nabla_{\theta_1} \ell_n(\theta_1, \eta) - w^\top \nabla_\eta \ell_n(\theta_1, \eta),$

where $w^\top = I_{\theta_1 \eta} I_{\eta\eta}^{-1}$. The decorrelated score function is uncorrelated with the nuisance score functions and, therefore, its limiting distribution does not depend on the model selection mistakes incurred while estimating $\eta$. In particular, $S(\theta_1^0, \hat\eta)$ is indeed asymptotically normally distributed under the null hypothesis, as long as $\hat\eta$ is a good enough estimator of $\eta^*$, though not necessarily a $\sqrt{n}$-consistent one. Based on the asymptotic normality of the decorrelated score function, we can then build confidence intervals for $\theta_1$ and perform hypothesis testing.

In practice, the vector $w$ is unknown and needs to be estimated. A number of methods have been proposed for its estimation in the literature. For example, Ning and Liu (2017) use a Dantzig selector-like method, Belloni et al. (2013) propose the double selection method, while Zhang and Zhang (2013), van de Geer et al. (2014), and Javanmard and Montanari (2014) use lasso-based estimators. In this paper, we adapt the double selection procedure of Belloni et al. (2013). Details will be given in Section 3.
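To illustrate the double selection idea in its original, simpler setting, the sketch below estimates the coefficient on a single regressor of interest in a high-dimensional linear model: select controls that predict the outcome, select controls that predict the target regressor, then refit least squares on the union. This is our own illustration (using scikit-learn's Lasso), not the graphical-model procedure of Section 3:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def double_selection(y, d, Z, lam1=0.1, lam2=0.1):
    """Coefficient on d in y ~ d + Z, robust to imperfect model selection."""
    s1 = np.flatnonzero(Lasso(alpha=lam1).fit(Z, y).coef_)  # controls for y
    s2 = np.flatnonzero(Lasso(alpha=lam2).fit(Z, d).coef_)  # controls for d
    keep = np.union1d(s1, s2).astype(int)
    X = np.column_stack([d, Z[:, keep]]) if keep.size else d.reshape(-1, 1)
    return LinearRegression().fit(X, y).coef_[0]
```

The second selection step is what makes the final estimate insensitive to first-order mistakes in estimating the nuisance coefficients, mirroring the role of $w$ in the decorrelated score.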

3 Methodology

In this section, we present a new procedure that constructs a $\sqrt{n}$-consistent estimator of an element $\theta_{ab}$ of $\theta$. Our procedure involves three steps that we detail below. We start by introducing some additional notation and then describe the procedure for the case where the support is $\mathbb{R}^p$. The extension to non-negative data is given at the end of the section.

For fixed indices $(a, b)$, let $p(x_a \mid x_{-a})$ be the conditional density of $X_a$ given $X_{-a} = x_{-a}$. In particular,

$\log p(x_a \mid x_{-a}) = \theta_{a\cdot}^\top t_{a\cdot}(x) - \Psi_a(\theta_{a\cdot}; x_{-a}),$

where $\theta_{a\cdot}$, with components $\theta_a$ and $(\theta_{ab'})_{b' \neq a}$, is the part of the vector $\theta$ corresponding to node $a$, and $t_{a\cdot}(x)$ is the corresponding vector of sufficient statistics. Here $\Psi_a(\theta_{a\cdot}; x_{-a})$ is the log-partition function of the conditional distribution, which does not depend on $x_a$. Let $\nabla_a$ and $\Delta_a$ be the gradient and Laplacian operators, respectively, with respect to $x_a$, defined as

$\nabla_a f(x) = \frac{\partial f(x)}{\partial x_a}, \qquad \Delta_a f(x) = \frac{\partial^2 f(x)}{\partial x_a^2}.$

With this notation, we introduce the following scoring rule

$s_a(x; \theta_{a\cdot}) = \frac{1}{2}\, \theta_{a\cdot}^\top \Gamma(x)\, \theta_{a\cdot} + g(x)^\top \theta_{a\cdot}, \qquad (11)$

where

$\Gamma(x) = \nabla_a t_{a\cdot}(x)\, \nabla_a t_{a\cdot}(x)^\top \quad \text{and} \quad g(x) = \Delta_a t_{a\cdot}(x),$

with $\nabla_a$ and $\Delta_a$ applied coordinate-wise to the vector of sufficient statistics $t_{a\cdot}(x)$.

This scoring rule is related to the one in (6); however, rather than using the joint density in evaluating the parameter vector, we only consider the conditional density $p(x_a \mid x_{-a})$. We will use this conditional scoring rule to create an asymptotically normal estimator of an element $\theta_{ab}$. Our motivation for using this estimator comes from the fact that the parameter $\theta_{ab}$ can be identified from the conditional distribution of $X_a$ given $X_{N(a)}$, where $N(a)$ is the Markov blanket of $X_a$. Furthermore, the optimization problems arising in steps 1-3 below can be solved much more efficiently, as the scoring rule in (11) involves fewer parameters.

We are now ready to describe our procedure for estimating $\theta_{ab}$, which proceeds in three steps.

Step 1:

We find a pilot estimator of $\theta_{a\cdot}$ by solving the following program

$\hat\theta \in \arg\min_{\theta} \; \mathbb{E}_n\big[ s_a(X; \theta) \big] + \lambda \|\theta\|_1, \qquad (12)$

where $\lambda$ is a tuning parameter. Let $\hat S = \operatorname{supp}(\hat\theta)$.

Since we are after an asymptotically normal estimator of $\theta_{ab}$, one may think that it is sufficient to find $\hat\theta$ and appeal to the results of Portnoy (1988), who established asymptotic normality for M-estimators with an increasing number of parameters. Unfortunately, this is not the case. Since $\hat\theta$ is obtained via a model selection procedure, it is irregular and its asymptotic distribution cannot be estimated (Leeb and Pötscher, 2007; Pötscher, 2009). Therefore, we proceed to create a regular estimator of $\theta_{ab}$ in steps 2 and 3. The idea is to create an estimator that is insensitive to first-order perturbations of the other components of $\theta$, which we consider as nuisance components. The idea of creating an estimator that is robust to perturbations of the nuisance has recently been used in Belloni et al. (2013); however, the approach goes back to the work of Neyman (1959).

Step 2:

Let $\hat w$ be a minimizer of

$\min_w \; \frac{1}{2}\, w^\top \hat\Gamma_{-ab, -ab}\, w - \hat\Gamma_{-ab, ab}^\top w + \lambda' \|w\|_1, \qquad (13)$

where $\hat\Gamma = \mathbb{E}_n[\Gamma(X)]$ and $\lambda'$ is a tuning parameter. Let $\hat S_w = \operatorname{supp}(\hat w)$. The intuition here is that the vector $\hat w$ approximately computes a row of the inverse of the Hessian of the objective in (12).

Step 3:

Let $\hat M = \{(a, b)\} \cup \hat S \cup \hat S_w$. We obtain our estimator as a solution to the following program

$\check\theta \in \arg\min_{\theta \,:\, \operatorname{supp}(\theta) \subseteq \hat M} \; \mathbb{E}_n\big[ s_a(X; \theta) \big]. \qquad (14)$

Our estimator of $\theta_{ab}$ is the corresponding coordinate of $\check\theta$, which we denote as $\check\theta_{ab}$. The motivation for this procedure will be clear from the proof of Theorem 4 given in the next section.
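Since the empirical score $\mathbb{E}_n[s_a(X; \theta)]$ is quadratic in $\theta$, all three steps reduce to penalized or restricted quadratic programs. The sketch below is a schematic rendering of the three steps under that quadratic form, written as minimizations of $\frac{1}{2}\theta^\top \Gamma \theta - g^\top \theta$ (with the linear term's sign absorbed into `g`); solver and variable names are ours, not the paper's code:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_quadratic(Gamma, g, lam, n_iter=500):
    """Coordinate descent for 0.5 * theta'Gamma theta - g'theta + lam*||theta||_1."""
    theta = np.zeros(len(g))
    for _ in range(n_iter):
        for j in range(len(g)):
            r = g[j] - Gamma[j] @ theta + Gamma[j, j] * theta[j]
            theta[j] = soft_threshold(r, lam) / Gamma[j, j]
    return theta

def three_step_estimate(Gamma, g, j, lam1, lam2):
    """Debiased estimate of coordinate j from empirical Hessian and linear term."""
    d = len(g)
    rest = np.delete(np.arange(d), j)
    # Step 1: pilot l1-penalized score matching estimate and its support.
    theta_pilot = lasso_quadratic(Gamma, g, lam1)
    # Step 2: l1-regression of the j-th score coordinate on the rest,
    # approximating a row of the inverse Hessian.
    w = lasso_quadratic(Gamma[np.ix_(rest, rest)], Gamma[rest, j], lam2)
    # Step 3: unpenalized refit on the union of the selected supports.
    support = np.union1d(np.flatnonzero(theta_pilot), rest[np.flatnonzero(w)])
    support = np.union1d(support, [j]).astype(int)
    theta_refit = np.linalg.solve(Gamma[np.ix_(support, support)], g[support])
    return theta_refit[np.flatnonzero(support == j)[0]]
```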

Extension to non-negative data.

For non-negative data, the procedure is similar. In place of the scoring rule in (11), we use a conditional scoring rule based on the generalized $h$-score rule. We define the following scoring rule

$s_{a,h}(x; \theta_{a\cdot}) = \frac{1}{2}\, \theta_{a\cdot}^\top \Gamma_h(x)\, \theta_{a\cdot} + g_h(x)^\top \theta_{a\cdot}, \qquad (15)$

with

$\Gamma_h(x) = h(x_a)\, \nabla_a t_{a\cdot}(x)\, \nabla_a t_{a\cdot}(x)^\top$

and

$g_h(x) = h'(x_a)\, \nabla_a t_{a\cdot}(x) + h(x_a)\, \Delta_a t_{a\cdot}(x).$

Here $h$ is a positive and differentiable function as in Section 2.2.2. Now we can define

$\hat\Gamma_h = \mathbb{E}_n\big[ \Gamma_h(X) \big] \quad \text{and} \quad \hat g_h = \mathbb{E}_n\big[ g_h(X) \big]. \qquad (16)$

Then $\mathbb{E}_n[s_{a,h}(X; \theta)] = \frac{1}{2}\, \theta^\top \hat\Gamma_h\, \theta + \hat g_h^\top \theta$, which is of the same form as (11) with $\Gamma_h$ and $g_h$ replacing $\Gamma$ and $g$, respectively. Thus our three-step procedure for non-negative data can be written as follows. For notational consistency, we omit the subscript $h$ on the estimator and support.

Step 1:

We find a pilot estimator of $\theta_{a\cdot}$ by solving

$\hat\theta \in \arg\min_{\theta} \; \mathbb{E}_n\big[ s_{a,h}(X; \theta) \big] + \lambda \|\theta\|_1, \qquad (17)$

where $\lambda$ is a tuning parameter and $s_{a,h}$ is defined in (15). Let $\hat S = \operatorname{supp}(\hat\theta)$.

Step 2:

Let $\hat w$ be a minimizer of

$\min_w \; \frac{1}{2}\, w^\top \hat\Gamma_{h, -ab, -ab}\, w - \hat\Gamma_{h, -ab, ab}^\top w + \lambda' \|w\|_1, \qquad (18)$

where $\lambda'$ is a tuning parameter and $\hat\Gamma_h$ is defined in (16). Let $\hat S_w = \operatorname{supp}(\hat w)$.

Step 3:

Let $\hat M = \{(a, b)\} \cup \hat S \cup \hat S_w$. We obtain our estimator as a solution to the following program

$\check\theta \in \arg\min_{\theta \,:\, \operatorname{supp}(\theta) \subseteq \hat M} \; \mathbb{E}_n\big[ s_{a,h}(X; \theta) \big]. \qquad (19)$

Our estimator of $\theta_{ab}$ is the corresponding coordinate of $\check\theta$, which we denote as $\check\theta_{ab}$.

4 Asymptotic Normality of the Estimator

In this section, we outline the main theoretical properties of our estimator. We start by providing high-level conditions that allow us to establish the properties of each step in the procedure.

Assumption M.

We are given $n$ i.i.d. samples $\{x_i\}_{i=1}^n$ from a distribution $p_{\theta^*}$ of the form in (1). The parameter vector $\theta^*_{a\cdot}$ is sparse, with $S = \operatorname{supp}(\theta^*_{a\cdot})$ and $s = |S|$. Let

$w^* = \big( \mathbb{E}[\Gamma(X)]_{-ab, -ab} \big)^{-1}\, \mathbb{E}[\Gamma(X)]_{-ab, ab} \qquad (20)$

and

$S_w = \operatorname{supp}(w^*).$

The vector $w^*$ is sparse with $s_w = |S_w|$. Let $m = \max(s, s_w)$.

Assumption M supposes that the parameter to be estimated is sparse, which makes estimation in the high-dimensional setting feasible. An extension to approximately sparse parameters is possible but technically cumbersome, and it does not provide additional insight into the problem. One of the benefits of using the conditional score to learn the parameters of the model is that the required sample size will only depend on the sparsity of $\theta^*_{a\cdot}$ and not on the sparsity of the whole vector $\theta^*$ as in Lin et al. (2016). The second part of the assumption states that a row of the inverse of the population Hessian is approximately sparse, which is a reasonable assumption since the Markov blanket of $X_a$ is small under the sparsity assumption on $\theta^*$.

Our next condition assumes that the Hessian in (12) and (13) is well conditioned. Let

and