Score matching was first developed in Hyvärinen (2005) for continuous distributions supported on all of . Consider such a distribution , with density and support equal to . Let be a family of distributions with twice continuously differentiable densities. The score matching estimator of using as a model is the minimizer of the expected squared distance between the gradients of and a log-density from . So we minimize the loss with respect to densities from . The loss depends on , but integration by parts can be used to rewrite it in a form that can be approximated by averaging over the sample without knowing . A key feature of score matching is that normalizing constants cancel in gradients of log-densities, allowing for simple treatment of models with intractable normalizing constants. For exponential families, the loss is quadratic in the canonical parameter, making optimization straightforward.
If the considered distributions are supported on a proper subset of , then the integration by parts arguments underlying the score matching estimator may fail due to discontinuities at the boundary of the support. For data supported on the non-negative orthant , Hyvärinen (2007) addresses this problem by modifying the loss to , where denotes entrywise multiplication. In this loss, boundary effects are dampened by multiplying gradients elementwise with the identity functions .
In this paper, we propose generalized score matching methods that are based on elementwise multiplication with functions other than . As we show, this can lead to drastically improved estimation accuracy, both theoretically and empirically. To demonstrate these advantages, we consider a family of graphical models on , which does not have tractable normalizing constants and hence serves as a practical example.
specify conditional independence relations for a random vectorindexed by the nodes of a graph (Lauritzen, 1996). For undirected graphs, variables and are required to be conditionally independent given if there is no edge between and . The smallest undirected graph with this property is the conditional independence graph of . Estimation of this graph and associated interaction parameters has been a topic of continued research as reviewed by Drton and Maathuis (2017).
Largely due to their tractability, Gaussian graphical models (GGMs) have gained great popularity. The conditional independence graph of a multivariate normal vector is determined by the inverse covariance matrix , also termed concentration or precision matrix. Specifically, and are conditionally independent given all other variables if and only if the -th and the -th entries of are both zero. This simple relation underlies a rich literature including Drton and Perlman (2004), Meinshausen and Bühlmann (2006), Yuan and Lin (2007) and Friedman et al. (2008), among others.
More recent work has provided tractable procedures also for non-Gaussian graphical models. This includes Gaussian copula models (Liu et al., 2009; Dobra and Lenkoski, 2011; Liu et al., 2012), Ising models (Ravikumar et al., 2010), other exponential family models (Chen et al., 2015; Yang et al., 2015), as well as semi- or non-parametric estimation techniques (Fellinghauer et al., 2013; Voorman et al., 2014)
. In this paper, we apply our method to a class of pairwise interaction models that generalizes non-negative Gaussian random variables, as recently considered byLin et al. (2016) and Yu et al. (2016), as well as square root graphical models proposed by Inouye et al. (2016) when the sufficient statistic function is a pure power. However, our main ideas can also be applied for other classes of exponential families whose support is restricted to a rectangular set.
Our focus will be on pairwise interaction power models
with probability distributions having (Lebesgue) densities proportional to
on . Here and are known constants, and and are unknown parameters of interest. When we define and
. This class of models is motivated by the form of important univariate distributions for non-negative data, including gamma and truncated normal distributions. It provides a framework for pairwise interaction that is concrete yet rich enough to capture key differences in how densities may behave at the boundary of the non-negative orthant,. Moreover, the conditional independence graph of a random vector with distribution as in (1) is determined just as in the Gaussian case: and are conditionally independent given all other variables if and only if in the interaction matrix . Section 5.1 gives further details on these models. We will develop estimators of in (1) and the associated conditional independence graph using the proposed generalized score matching.
A special case of (1) are truncated Gaussian graphical models, with . Let , and let be a positive definite matrix. Then a non-negative random vector follows a truncated normal distribution for mean parameter and inverse covariance parameter , in symbols , if it has density proportional to
on . We refer to as the covariance parameter of the distribution, and note that the parameter in (1) is . Another special case of (1) is the exponential square root graphical models in Inouye et al. (2016), where .
Lin et al. (2016) estimate truncated GGMs based on Hyvärinen’s modification, with an penalty on the entries of added to the loss. However, the paper overlooks the fact that the loss can be unbounded from below in the high-dimensional setting even with an penalty, such that no minimizer may exist. Since the unpenalized loss is quadratic in the parameter to be estimated, we propose modifying it by adding small positive values to the diagonals of the positive semi-definite matrix that defines the quadratic part, in order to ensure that the loss is bounded and strongly convex and admits a unique minimizer. We apply this to the estimator for GGMs considered in Lin et al. (2016), which uses score-matching on , and to the generalized score matching estimator for pairwise interaction power models on proposed in this paper. In these cases, we show, both empirically and theoretically, that the consistency results still hold (or even improve) if the positive values added are smaller than a threshold that is readily computable.
The rest of the paper is organized as follows. Section 2 introduces score matching and our proposed generalized score matching. In Section 3, we apply generalized score matching to exponential families, with univariate truncated normal distributions as an example. Regularized generalized score matching for graphical models is formulated in Section 4. The estimators applied to the pairwise interaction power models are shown in Section 5, while theoretical consistency results are presented in Section 6, where we treat the probabilistically most tractable case of truncated GGMs. Simulation results are given in Section 7. Proofs for theorems in Sections 2–6 are presented in Appendices A and B. Additional experimental results are presented in Appendix C.
Constant scalars, vectors, and functions are written in lower-case (e.g., , ), random scalars and vectors in upper-case (e.g., , ). Regular font is used for scalars (e.g. , ), and boldface for vectors (e.g. , ). Matrices are in upright bold, with constant matrices in upper-case (, ) and random matrices holding observations in lower-case (, ). Subscripts refer to entries in vectors and columns in matrices. Superscripts refer to rows in matrices. So is the -th component of a random vector . For a data matrix , each row comprising one observation of variables/features, is the -th feature for the -th observation. Stacking the columns of a matrix gives its vectorization . For a matrix , denotes its diagonal, and for a vector , is the diagonal matrix with diagonals .
For , the -norm of a vector is denoted
with . A matrix has Frobenius norm
and max norm Its - operator norm is
with shorthand notation ; for instance, .
For a function , we define as the partial derivative with respect to , and . For , , we let be the vector of derivatives. Likewise is used for second derivatives. The symbol denotes the indicator function of the set , while is the vector of all ’s. For , , . A density of a distribution is always a probability density function with respect to Lebesgue measure. When it is clear from the context, denotes the expectation under a true distribution .
2 Score Matching
2.1 Original Score Matching
Let be a random vector taking values in with distribution and density . Let be a family of distributions of interest with twice continuously differentiable densities supported on . Suppose . The score matching loss for , with density , is given by
The gradients in (3) can be thought of as gradients with respect to a hypothetical location parameter, evaluated at the origin (Hyvärinen, 2005). The loss is minimized if and only if , which forms the basis for estimation of . Importantly, since the loss depends on only through its log-gradient, it suffices to know up to a normalizing constant. Under mild conditions, (3) can be rewritten as
plus a constant independent of . The integral in (4) can be approximated by a sample average; this alleviates the need for knowing the true density , and provides a way to estimate .
2.2 Generalized Score Matching for Non-Negative Data
When the true density is supported on a proper subset of , the integration by parts underlying the equivalence of (3) and (4) may fail due to discontinuity at the boundary. For distributions supported on the non-negative orthant, , Hyvärinen (2007) addressed this issue by instead minimizing the non-negative score matching loss
This loss can be motivated via gradients with respect to a hypothetical scale parameter (Hyvärinen, 2007). Under mild conditions, can again be rewritten in terms of an expectation of a function independent of , thus allowing one to form a sample loss.
In this work, we consider generalizing the non-negative score matching loss as follows.
Let be the family of distributions of interest, and assume every has a twice continuously differentiable density supported on . Suppose the -variate random vector has true distribution , and let be its twice continuously differentiable density. Let be a.s. positive functions that are absolutely continuous in every bounded sub-interval of (differentiable a.s. in ), and set . For with density , the generalized -score matching loss is
The distribution is the unique minimizer of for .
First, observe that and . For uniqueness, suppose for some . Let and be the respective densities. By assumption a.s. and a.s. for all . Therefore, we must have a.s., or equivalently, almost surely in . Since and are continuous densities supported on , it follows that for all . ∎
Choosing all recovers the loss from (5). In our generalization, we will focus on using functions
that are increasing but are bounded or grow rather slowly. This will alleviate the need to estimate higher moments, leading to better practical performance and improved theoretical guarantees. We note that our approach could also be presented in terms of transformations of data; compare to Section 11 inParry et al. (2012). In particular, log-transforming positive data into all of and then applying (3) is equivalent to (5).
We will consider the following assumptions:
where “” is a shorthand for “for all being the density of some ”, and the prime symbol denotes component-wise differentiation. While the second half of (A2) was not made explicit in Hyvärinen (2005, 2007), (A1)-(A2) were both required for integration by parts and Fubini-Tonelli to apply.
Once the forms of and are given, sufficient conditions for for Assumptions (A1)-(A2) to hold are easy to find. In particular, (A1) and (A2) are easily satisfied and verified for exponential families.
Under (A1) and (A2), the loss from (6) equals
plus a constant independent of .
Given a data matrix with rows , we define the sample version of (7) as
Subsequently, for a distribution with density , we let . Similarly, when a distribution with density is associated to a parameter vector , we write . We apply similar conventions to the sample version . We note that this type of loss is also treated in slightly different settings in Parry (2016) and Almeida and Gidas (1993).
3 Exponential Families
In this section, we study the case where is an exponential family comprising continuous distributions with support . More specifically, we consider densities that are indexed by the canonical parameter and have the form
where comprises the sufficient statistics, is a normalizing constant depending on only, and is the base measure, with and a.s. differentiable with respect to each component. Define and .
Define , , and .
(C1) is a.s. invertible, and
(C2) , , and exist and are entry-wise finite.
Then the minimizer of (10) is a.s. unique with closed-form solution . Moreover,
provides a basis for asymptotically valid tests and confidence intervals for the parameter. Note that Condition (C1) holds if and only if a.s. and has rank a.s. for some .
Below we illustrate the estimator in the case of univariate truncated normal distributions. We assume (A1)-(A2) and (C1)-(C2) throughout.
Univariate () truncated normal distributions for
mean parameter and variance parameter
and variance parameterhave density
If is known but unknown, then writing the density in canonical form as in (9) yields
Given an i.i.d. sample , the generalized -score matching estimator of is
If , and the expectations are finite (for example, when for ), then
We recall that the Cramér-Rao lower bound (i.e. the lower bound on the variance of any unbiased estimator) for estimating
(i.e. the lower bound on the variance of any unbiased estimator) for estimatingis
Given an i.i.d. sample , the generalized -score matching estimator of is
If, in addition to the assumptions in Example 1, , then
Moreover, the Cramér-Rao lower bound for estimating is
In Example 2, if , then also satisfies (A1)-(A2) and (C1)-(C2) and one recovers the sample variance , which obtains the Cramér-Rao lower bound.
In these examples, there is a benefit in using a bounded function , which can be explained as follows. When
, there is effectively no truncation to the Gaussian distribution, and our method adapts to using low moments in (6), since a bounded and increasing becomes almost constant as it reaches its asymptote for large. Hence, we effectively revert to the original score matching (recall Section 2.1). In the other cases, the truncation effect is significant and our estimator uses higher moments accordingly.
Figure 2 plots the asymptotic variance of from Example 1, with known. Efficiency as measured by the Cramér-Rao lower bound divided by the asymptotic variance is also shown. We see that two truncated versions of have asymptotic variance close to the Cramér-Rao bound. This asymptotic variance is also reflective of the variance for smaller finite samples.
Figure 2 is the analog of Figure 2 for from Example 2 with known. While the specifics are a bit different the benefits of using bounded or slowly growing are again clear. We note that when is small, the effect of truncation to the positive part of the real line is small.
In both plots we order/color the curves based on their overall efficiency, so they have different colors in one from the other, although the same functions are presented. For all functions presented here (A1)–(A2) and (C1)–(C2) are satisfied.
4 Regularized Generalized Score Matching
In high-dimensional settings, when the number of parameters to estimate may be larger than the sample size , it is hard, if not impossible, to estimate the parameters consistently without turning to some form of regularization. More specifically, for exponential families, condition (C1) in Section 3 fails when . A popular approach is then the use of regularization to exploit possible sparsity.
4.1 Ensuring Bounded -Regularized Loss
Let the data matrix comprise i.i.d. samples from distribution . Assume has density belonging to an exponential family , where . Adding an penalty to (10), we obtain the regularized generalized score matching loss
) involves a quadratic smooth part as in the familiar lasso loss for linear regression. However, although the matrixis positive semidefinite, the regularized loss in (14) is not guaranteed to be bounded unless the tuning parameter is sufficiently large—a problem that does not occur in lasso. We note that here, and throughout, we suppress the dependence on the data for , and derived quantities.
For a more detailed explanation, note that that by (11), for some . In the high-dimensional case, the rank of , or equivalently , is at most . Hence, is not invertible and does not necessarily lie in the column span of . Let be the kernel of . Then there may exist with . In this case, if
there exists with and . Evaluating at for scalar , the loss becomes , which is negative and linear in , and thus unbounded below. In this case no minimizer of (14) exists for small values of . This issue also exists for the estimators from Zhang and Zou (2014) and Liu and Luo (2015), which correspond to score matching for GGMs. We note that in the context of estimating the interaction matrix in pairwise models, ; thus, the condition reduces to , or when both and are estimated.
To circumvent the unboundedness problem, we add small values to the diagonal entries of , which become , . This is in the spirit of work such as Ledoit and Wolf (2004) and corresponds to an elastic net-type penalty (Zou and Hastie, 2005) with weighted penalty . After this modification, is positive definite, our regularized loss is strongly convex in , and a unique minimizer exists for all . For the special case of truncated GGMs, we will show that a result on consistent estimation holds if we choose for a suitably small constant , for which we propose a particular choice to avoid tuning. This choice of depends on the data through .
For , let . The regularized generalized -score matching estimator with tuning parameter and amplifier is the estimator
4.2 Computational Details
We optimize our loss functions with respect to a symmetric matrixand in the non-centered case also the vector . We use a coordinate-descent method analogous to Algorithm 2 in Lin et al. (2016), where in each step we update each element of and based on the other entries from the previous steps, while maintaining symmetry. To speed up estimation, we use warm starts using the solution from the previous , as well as lasso-type strong screening rules (Tibshirani et al., 2012) to eliminate variables that are known a priori to have zero estimates. In our simulations in Section 7 we always scale the data matrix by column norms before proceeding to estimation.
5 Score Matching for Graphical Models for Non-negative Data
5.1 A General Framework of Pairwise Interaction Models
We consider the class of pairwise interaction power models with density introduced in (1). We recall the form of the density:
where and are known constants, and the interaction matrix and the vector are parameters. When , we use the convention that and apply the logarithm element-wise. Our focus will be on the interaction matrix that determines the conditional independence graph through its support . However, unless is known or assumed to be zero, we also need to estimate as a nuisance parameter.
We first give a set of sufficient conditions for the density to be valid, i.e., the right-hand side of (16) to be integrable. The proof is given in the appendix.
is strictly co-positive, i.e., for all ;
, , and for ().
In the non-centered case, if (CC1) and one of (CC2) and (CC3) holds, then the function on the right-hand side of (16) is integrable over . In the centered case, (CC1) and are sufficient.
We emphasize that (CC1) is a weaker condition than positive definiteness. Criteria for strict co-positivity are discussed in Väliaho (1986).
5.2 Implementation for Different Models
In this section we give some implementation details for the regularized generalized -score matching estimator defined in (15) applied to the pairwise interaction models from (16). We again let . The unregularized loss is then
where the product between a vector and a matrix means an elementwise multiplication of the vector with each column of the matrix, and . Furthermore, , where and correspond to each entry of and , respectively. The -th column of is
where is the -vector with 1 at the -th position and 0 elsewhere, and the -th entry of is
In the centered case where we know , we only estimate , and is still block-diagonal, with the -th block being the submatrix in (17), while is just . Since only appears in the part of the density, the formulae only depend on in the centered case. These formulae also hold for since and only depend on the gradient of the log density, and also holds for .
We emphasize that it is indeed necessary to introduce amplifiers or a multiplier in addition to the penalty. It is clear from (18) that (or if centered). Thus, is non-invertible when (or if centered) and need not lie in its column span.
We claim that including amplifiers/multipliers for the submatrices only is sufficient for unique existence of a solution for all penalty parameters . To see this, consider any nonzero vector . Partition it as with . Let be our amplified version of the matrix from (21), so
As itself is positive semidefinite, we find that if one of the first entries of is nonzero then
If only the last entry of is nonzero then
almost surely; recall that . We conclude that (and thus the entire amplified ) is a.s. positive definite, which ensures unique existence of the loss minimizer.
Given the formulae for and , one adds the penalty on to get the regularized loss (24). Our methodology readily accommodates two different choices of the penalty parameter for and . This is also theoretically supported for truncated GGMs, since if the ratio of the respective values and is fixed, the proof of the theorems in Section 6 can be easily modified by replacing by . To avoid picking two tuning parameters, one may also choose to remove the penalty on altogether by profiling out and solve for , with the minimizer of the profiled loss