First order covariance inequalities via Stein's method

by Marie Ernst, et al.
University of Liège

We propose probabilistic representations for inverse Stein operators (i.e. solutions to Stein equations) under general conditions; in particular we deduce new simple expressions for the Stein kernel. These representations allow us to deduce uniform and non-uniform Stein factors (i.e. bounds on solutions to Stein equations) and lead to new covariance identities expressing the covariance between arbitrary functionals of an arbitrary univariate target in terms of a weighted covariance of the derivatives of the functionals. Our weights are explicit, easily computable in most cases, and expressed in terms of objects familiar within the context of Stein's method. Applications of the Cauchy-Schwarz inequality to these weighted covariance identities lead to sharp upper and lower covariance bounds and, in particular, weighted Poincaré inequalities. Many examples are given and, in particular, classical variance bounds due to Klaassen, Brascamp and Lieb, or Otto and Menz are corollaries. Connections with more recent literature are also detailed.


1 Introduction

Much attention has been given in the literature to the problem of providing sharp tractable estimates on the variance of functions of random variables. Such estimates are directly related to fundamental considerations of pure mathematics (e.g., isoperimetric, logarithmic Sobolev and Poincaré inequalities), as well as essential issues from statistics (e.g., Cramér-Rao bounds, efficiency and asymptotic relative efficiency computations, maximum correlation coefficients, and concentration inequalities).

One of the starting points of this line of research is Chernoff’s famous result from [27] which states that, if $N \sim \mathcal{N}(0,1)$, then

$$\mathbb{E}[g'(N)]^2 \le \mathrm{Var}[g(N)] \le \mathbb{E}[g'(N)^2] \tag{1.1}$$

for all sufficiently regular functions $g$. Chernoff obtained the upper bound by exploiting orthogonality properties of the family of Hermite polynomials. The upper bound in (1.1) is, in fact, already available in [56] and is also a special case of the central inequality in [11], see below. Cacoullos [12] extends Chernoff’s bound to a wide class of univariate distributions (including discrete distributions) by proving that if $X$ has a density function $p$ with respect to the Lebesgue measure then

$$\frac{\mathbb{E}[\tau_p(X) g'(X)]^2}{\mathrm{Var}[X]} \le \mathrm{Var}[g(X)] \le \mathbb{E}[\tau_p(X) g'(X)^2] \tag{1.2}$$

with $\tau_p(x) = p(x)^{-1} \int_x^{\infty} (y - \mathbb{E}[X])\, p(y)\, dy$. It is easy to see that, if $p$ is the standard normal density, then $\tau_p(x) = 1$ so that (1.2) contains (1.1). Cacoullos also obtains a bound similar to (1.2) for discrete distributions on the positive integers, where the derivative is replaced by the forward difference and the weight by its discrete analogue.
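Chernoff's inequality says that, for a standard normal N, the variance of g(N) is sandwiched between E[g'(N)]^2 and E[g'(N)^2]. The following minimal sketch checks this numerically for the illustrative choice g = sin; the test function, the helper name `gaussian_expectation` and the use of Gauss-Hermite quadrature are assumptions made for this example, not part of the original argument.

```python
import numpy as np

def gaussian_expectation(f, n_nodes=60):
    """Approximate E[f(N)], N ~ N(0,1), by Gauss-Hermite quadrature.

    hermegauss uses the weight exp(-x^2 / 2), so we divide by sqrt(2*pi)
    to turn the quadrature weights into a probability measure.
    """
    x, w = np.polynomial.hermite_e.hermegauss(n_nodes)
    return float(np.sum(w * f(x)) / np.sqrt(2.0 * np.pi))

# Illustrative test function (an assumption of this sketch) and its derivative.
g = np.sin
dg = np.cos

mean_g = gaussian_expectation(g)
var_g = gaussian_expectation(lambda x: (g(x) - mean_g) ** 2)
lower = gaussian_expectation(dg) ** 2                 # E[g'(N)]^2
upper = gaussian_expectation(lambda x: dg(x) ** 2)    # E[g'(N)^2]

print(f"lower = {lower:.6f}, variance = {var_g:.6f}, upper = {upper:.6f}")
```

For this choice of g the three quantities have closed forms, namely exp(-1), (1 - exp(-2))/2 and (1 + exp(-2))/2, which makes the check easy to confirm by hand.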

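Variance inequalities such as (1.2) are closely related to the celebrated Brascamp-Lieb inequality from [11] which, in dimension 1, states that if $X \sim p$ and $p$ is strictly log-concave then

$$\mathrm{Var}[g(X)] \le \mathbb{E}\left[\frac{g'(X)^2}{(-\log p)''(X)}\right] \tag{1.3}$$

for all sufficiently regular functions $g$. In fact, the upper bound from (1.1) is an immediate consequence of (1.3) because, if $p$ is the standard Gaussian density, then $(-\log p)''(x) = 1$. The Brascamp-Lieb inequality is proved in [55] to be a consequence of Hoeffding’s classical covariance identity from [41], which states that if $(X, Y)$ is a continuous bivariate random vector with cumulative distribution $H(x, y)$ and marginal cdfs $F(x)$, $G(y)$ then

$$\mathbb{E}[f(X) g(Y)] - \mathbb{E}[f(X)]\,\mathbb{E}[g(Y)] = \int\!\!\int f'(x)\,\bigl(H(x, y) - F(x) G(y)\bigr)\, g'(y)\, dx\, dy \tag{1.4}$$
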
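under weak assumptions on $f, g$ (see e.g. [29]). The freedom of choice in the test functions in (1.4) is exploited by [55] to prove that, if $X$ has a strictly convex absolutely continuous density $p$ then the asymmetric Brascamp-Lieb inequality holds:

$$|\mathrm{Cov}[f(X), g(X)]| \le \sup_x \left|\frac{f'(x)}{(-\log p)''(x)}\right|\, \mathbb{E}[|g'(X)|] \tag{1.5}$$
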
Identity (1.4) and inequalities (1.3) and (1.5) are extended to the multivariate setting in [19] which also gives connections with logarithmic Sobolev inequalities for spin systems and related inequalities for log-concave densities. This material is revisited and extended in [66, 64, 65], providing applications in the context of isoperimetric inequalities and weighted Poincaré inequalities. In [29] the identity (1.4) is proved in all generality and used to provide expansions for the covariance in terms of canonical correlations and variables.

Further generalizations of Chernoff’s bounds are provided in [24, 14, 15], and [44] (e.g., Karlin [44] deals with the entire class of log-concave distributions). See also [10, 16, 46, 58, 18] for the connection with probabilistic characterizations and other properties. Similar inequalities were obtained – often by exploiting properties of suitable families of orthogonal polynomials – for univariate functionals of some specific multivariate distributions, e.g., in [17, 13, 20, 48, 3, 49]. A historical overview as well as a description of the connection between such bounds, the so-called Stein identities from Stein’s method (see below) and Sturm-Liouville theory (see Section 4) can be found in [30]. To the best of our knowledge, the most general version of (1.1) and (1.2) is due to [45], where the following result is proved.

Theorem 1.1 (Klaassen bounds).

Let be some σ-finite measure. Let be a measurable function such that does not change sign for almost . Suppose that is a measurable function such that is well defined for some . Let be a real random variable with density with respect to .

  • (Klaassen upper variance bound) For all nonnegative measurable functions such that we have


    with supposed well-defined by .

  • (Cramér-Rao lower variance bound) For all measurable functions such that and we have


    where . Equality in (1.7) holds if and only if is linear in , -almost everywhere.

Klaassen’s proof of Theorem 1.1 relies on little more than the Cauchy-Schwarz inequality and Fubini’s theorem; it has a slightly magical aura as little or no heuristic or context is provided as to the best choices of test functions and kernel or even to the nature of the weights appearing in (1.6) and (1.7). To the best of our knowledge, all available first order variance bounds from the literature can be obtained from either (1.6) or (1.7) by choosing the appropriate test functions or and the appropriate kernel . For instance, the weights appearing in the upper bound (1.6) generalize the Stein kernel from Cacoullos’ bound (1.2) – both in the discrete and the continuous case. Indeed taking when the distribution is continuous we see that then and the weight becomes which is none other than . A similar argument holds as well in the discrete case. In the same way, taking leads to in (1.7) and thus the lower bound in (1.2) follows as well. The freedom of choice in the function allows for much flexibility in the quality of the weights; this fact seems somewhat underexploited in the literature. This is perhaps due to the rather obscure nature of Klaassen’s weights, whose clarification shall be one of the central learnings of this paper. Indeed we shall provide a natural theoretical home for Klaassen’s result, in the framework of Stein’s method.

Several variations on Klaassen’s theorem have already been obtained via techniques related to Stein’s method. We defer a proper introduction of these techniques to Section 2. The gist of the approach can nevertheless be understood very simply in case the underlying distribution is standard normal. Stein’s classical identity states that if $N \sim \mathcal{N}(0,1)$ then

$$\mathbb{E}[N g(N)] = \mathbb{E}[g'(N)]. \tag{1.8}$$

By the Cauchy-Schwarz inequality we immediately deduce that, for all appropriate $g$,

$$\mathbb{E}[g'(N)]^2 = \mathbb{E}\bigl[N \bigl(g(N) - \mathbb{E}[g(N)]\bigr)\bigr]^2 \le \mathbb{E}[N^2]\, \mathrm{Var}[g(N)] = \mathrm{Var}[g(N)], \tag{1.9}$$

which gives the lower bound in (1.1). For the upper bound, still by the Cauchy-Schwarz inequality,

$$\mathrm{Var}[g(N)] \le \mathbb{E}\left[\left(\int_0^N g'(x)\, dx\right)^2\right] \le \mathbb{E}\left[N \int_0^N g'(x)^2\, dx\right] = \mathbb{E}[g'(N)^2], \tag{1.10}$$

where the last identity is a direct consequence of Stein’s identity (1.8) applied to the function $x \mapsto \int_0^x g'(u)^2\, du$. This is the upper bound in (1.1). The idea behind this proof is due to Chen [23]. As is now well known (again, we refer the reader to Section 2 for references and details), Stein’s identity (1.8) for the normal distribution can be extended to basically any univariate (and even multivariate) distribution via a family of objects called “Stein operators”. This leads to a wide variety of Stein-type integration by parts identities and it is natural to wonder whether Chen’s approach can be used to obtain generalizations of Klaassen’s theorem. First steps in this direction are detailed in [51, 52]; in particular it is seen that general lower variance bounds are easy to obtain from generalized Stein identities in the same way as in (1.9). Nevertheless, the method of proof in (1.10) for the upper bound cannot be generalized to arbitrary targets and, even in cases where the method does apply, the assumptions under which the bounds hold are quite stringent. To the best of our knowledge, the first upper variance bounds obtained via properties of Stein operators are due to Saumard [64], who combines generalized Stein identities, expressed in terms of the Stein kernel, with Hoeffding’s identity (1.4). The scope of Saumard’s weighted Poincaré inequalities is, nevertheless, limited and a general result such as Klaassen’s is, to this date, not available in the literature.
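The two ingredients of Chen's argument for the standard normal can be checked numerically: Stein's identity E[N g(N)] = E[g'(N)], and the same identity applied to the primitive of g'^2, which is the last step of the upper bound. The sketch below is only an illustration under assumed choices (g = sin, whose squared derivative has the explicit primitive x/2 + sin(2x)/4, and Gauss-Hermite quadrature for the expectations).

```python
import numpy as np

# Quadrature rule for the standard normal law (weights normalised to sum to 1).
nodes, weights = np.polynomial.hermite_e.hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)

def expect(values):
    """Expectation under N(0,1) of a function tabulated at the quadrature nodes."""
    return float(np.sum(weights * values))

# Illustrative choice (an assumption of this sketch): g = sin, so g' = cos.
g = np.sin(nodes)
dg = np.cos(nodes)
# Primitive of g'^2 vanishing at 0: integral of cos(u)^2 over [0, x].
primitive = nodes / 2.0 + np.sin(2.0 * nodes) / 4.0

stein_lhs = expect(nodes * g)          # E[N g(N)]
stein_rhs = expect(dg)                 # E[g'(N)]
chen_lhs = expect(nodes * primitive)   # E[N * integral_0^N g'(u)^2 du]
chen_rhs = expect(dg ** 2)             # E[g'(N)^2]

print(f"Stein identity: {stein_lhs:.8f} vs {stein_rhs:.8f}")
print(f"Chen's last step: {chen_lhs:.8f} vs {chen_rhs:.8f}")
```

Both pairs should agree up to quadrature error; the second pair is exactly the quantity that bounds Var[g(N)] from above.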

The main contributions of this paper can be categorized in two types:

  • Covariance identities and inequalities. The first main contribution of this paper is a generalization of Klaassen’s variance bounds from Theorem 1.1 to covariance inequalities of arbitrary functionals of arbitrary univariate targets under minimal assumptions (see Theorems 3.1 and 3.5). Our results therefore contain essentially the entire literature on the topic. Moreover, the weights that appear in our bounds bear a clear and natural interpretation in terms of Stein operators, which allows for easy computation for a wide variety of targets, as illustrated in the different examples we tackle as well as in Tables 1, 2 and 3, in which we provide explicit variance bounds for univariate target distributions belonging to the classical integrated Pearson and Ord families (see Example 3.8 for a definition). In particular, Klaassen’s bounds now arise naturally in this setting.

  • Stein operators and their properties. The second main contribution of the paper lies in our method of proof, which contributes to the theory of Stein operators themselves. Specifically, we obtain several new probabilistic representations of inverse Stein operators (a.k.a. solutions to Stein equations) which open the way to a wealth of new manipulations which were hitherto unavailable. These representations also lead to new interpretations and ultimately new handles on several quantities which are crucial to the theory surrounding Stein’s method (such as Stein kernels, Stein equations, Stein factors, and Stein bounds). Finally the various objects we identify provide natural connections with other topics of interest, including the well-known connection with Sturm-Liouville theory already identified in [30].

The paper is organised as follows. Section 2 contains the theoretical foundations of the paper. In Section 2.1 we recall the theory of canonical and standardized Stein operators introduced in [53] and introduce a (new) notion of inverse Stein operator (Definition 2.4). We also identify minimal conditions under which Stein-type probabilistic integration by parts formulas hold (see Lemmas 2.3 and 2.16). In Section 2.2 we provide the representation formulas for the inverse Stein operator (Lemmas 2.18 and 2.19). In Section 2.3 we clarify the conditions on the test functions under which the different identities hold, and provide bridges with the classical assumptions in the literature. Section 2.4 contains bounds on the solutions to the Stein equations. Section 3 contains the covariance identities and inequalities. After re-interpreting Hoeffding’s identity (1.4) we obtain general and flexible lower and upper covariance bounds (Proposition 3.1 and Theorem 3.5). We then deduce Klaassen’s bounds (Corollary 3.7) and provide examples for several concrete distributions, with more examples deferred to the three tables mentioned above. Finally a discussion is provided in Section 4, wherein several examples are treated and connections with other theories are established, for instance the Brascamp-Lieb inequality (Corollary 4.1) and Menz and Otto’s asymmetric Brascamp-Lieb inequality (Corollary 4.2), as well as the link with an eigenfunction problem which can be seen as an extended Sturm-Liouville problem. The proofs from Section 2.3 are technical and postponed to Appendix A.

2 Stein differentiation

Stein’s method consists in a collection of techniques for distributional approximation that was originally developed for normal approximation in [69] and for Poisson approximation in [25]; for expositions see the books [70, 7, 8, 26, 57] and the review papers [61, 63, 21]. Outside the Gaussian and Poisson frameworks, there exist several non-equivalent general theories allowing one to set up Stein’s method for large swaths of probability distributions, of which we single out the papers [22, 31, 71] for univariate distributions under analytical assumptions, [4, 5] for infinitely divisible distributions, [6] for discrete multivariate distributions, and [54, 38, 39] as well as [34] for multivariate densities under diffusive assumptions.

The backbone of the present paper consists in the approach from [50, 53, 62]. Before introducing these results, we fix notation. Let and equip it with some σ-algebra and σ-finite measure . Let be a random variable on , with induced probability measure which is absolutely continuous with respect to ; we denote by the corresponding probability density, and its support by . As usual, is the collection of all real-valued functions such that . We sometimes call the expectation under the -mean. Although we could in principle keep the discussion to come very general, in order to make the paper more concrete and readable we shall restrict our attention to distributions satisfying the following Assumption.

Assumption A. The measure is either the counting measure on or the Lebesgue measure on . If is the counting measure then there exist such that . If is the Lebesgue measure then there exist such that and . Moreover, the measure is not point mass.

Excluding point masses greatly simplifies the presentation. Stein’s method for point mass is available in [60].

Let . In the sequel we shall restrict our attention to the following three derivative-type operators:

with the weak derivative defined Lebesgue almost everywhere, the classical forward difference and the classical backward difference. Whenever we take as the Lebesgue measure and speak of the continuous case; whenever we take as the counting measure and speak of the discrete case. There are two choices of derivatives in the discrete case, only one in the continuous case. We let denote the collection of functions such that exists and is finite -almost surely. In the case , this corresponds to all absolutely continuous functions; in the case the domain is the collection of all functions on . For ease of reference we note that, if is such that then, for all such that we have

which we summarize as




We stress the fact that the values at are understood as limits if either is infinite.

2.1 Stein operators and Stein equations

Our first definitions come from [53]. We first define as the collection of such that .

Definition 2.1 (Canonical Stein operators).

Let and consider the linear operator defined as

for all and for . The operator is called the canonical (-)Stein operator of . The cases and provide the forward and backward Stein operators, denoted by and , respectively; the case provides the differential Stein operator denoted by .

To describe the domain and the range of we introduce the following sets of functions:

We draw the reader’s attention to the fact that the second condition in the definition of can be rewritten

The next lemma, which follows immediately from the definition of and of the different sets of functions, shows why is called the canonical Stein class.

Lemma 2.2 (Canonical Stein class).

For , .

Crucially for the results in this paper, for all , such that the operators satisfy the product rule


for all . This product rule leads to an integration by parts (IBP) formula (a.k.a. Abel-type summation formula) as follows.

Lemma 2.3 (Stein IBP formula - version 1).

For all , such that (i) and (ii) we have


Proof. Under the stated assumptions, we can apply (2.3) to get


for all . Condition (i) in the statement guarantees that the left hand side (l.h.s.) of (2.5) has mean 0, while condition (ii) guarantees that we can separate the expectation of the sum on the right hand side (r.h.s.) into the sum of the individual expectations. ∎

A natural interpretation of (2.4) is that operator is, in some sense to be made precise, the skew-adjoint operator to with respect to the scalar product ; this provides a supplementary justification to the use of the terminology “canonical” for operator . We discuss a consequence of this interpretation in Section 4. The conditions under which Lemma 2.3 holds are far from transparent. We clarify these assumptions in Section 2.3. For more details on Stein classes and operators, we refer to [53] for the construction in an abstract setting, [50] for the construction in the continuous setting (i.e. ) and [33] for the construction in the discrete setting (i.e. ). Multivariate extensions are developed in [62].

The fundamental stepping stone for our theory is an inverse of the canonical operator provided in the next definition.

Definition 2.4 (Canonical pseudo inverse Stein operator).

Let and recall the notations from (2.2). The canonical pseudo-inverse Stein operator for the operator is defined, for , as


for all and for all .

Equality between the second and third expressions in (2.6) is justified because so that the integral of over the whole support cancels out. For ease of reference we detail in the three cases that interest us:

Note that but and, conversely, but . The denomination pseudo-inverse Stein operator for is justified by the following lemma whose proof is immediate.

Lemma 2.5.

For any , . Moreover, (i) for all we have at all and (ii) for all we have at all . Operator is invertible (with inverse ) on the subclass of functions in which, moreover, satisfy .
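To make the pseudo-inverse property concrete, here is a minimal numerical sketch for the standard normal density. It assumes the continuous-case forms T f = (f p)'/p for the canonical operator and L h(x) = p(x)^{-1} times the integral of (h(y) - E[h(X)]) p(y) over y ≤ x for its pseudo-inverse; these explicit formulas, the test function h(x) = x^2, the grid and the variable names are all assumptions of this illustration. Under these assumptions one expects L h(x) = -x and T(L h) = h - E[h(X)].

```python
import numpy as np

# Fine grid on which the standard normal density is numerically supported.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
phi = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

# Illustrative test function (assumption): h(x) = x^2, whose mean under N(0,1) is 1.
h = x ** 2
h_mean = np.sum(0.5 * (h[1:] * phi[1:] + h[:-1] * phi[:-1])) * dx

# Assumed pseudo-inverse: L h(x) = (1 / p(x)) * integral_{-inf}^{x} (h - E h) p,
# computed with a cumulative trapezoidal rule.
centred = (h - h_mean) * phi
cumulative = np.concatenate(([0.0], np.cumsum(0.5 * (centred[1:] + centred[:-1]) * dx)))
L_h = cumulative / phi

# Assumed canonical operator: T f = (f p)' / p, applied to f = L h.
T_L_h = np.gradient(L_h * phi, dx) / phi

# Compare on a central window, where dividing by the density is numerically safe.
window = np.abs(x) <= 3.0
err_inverse = np.max(np.abs(L_h[window] + x[window]))                 # L h(x) = -x
err_roundtrip = np.max(np.abs(T_L_h[window] - (h - h_mean)[window]))  # T L h = h - E h

print(f"max |L h + x| = {err_inverse:.2e}, max |T L h - (h - E h)| = {err_roundtrip:.2e}")
```

The comparison is restricted to |x| ≤ 3 because dividing by a very small density amplifies discretisation error in the tails.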

Starting from (2.5) we postulate the next definition.

Definition 2.6 (Standardizations of the canonical operator).

Fix and . The -standardized Stein operator is


acting on the collection of test functions such that and .

Remark 2.7.

The conditions appearing in the definition of are tailored to ensure that all identities and manipulations follow immediately. For instance, the requirement that in the definition of guarantees that the resulting functions have -mean 0 and the condition guarantees that the expectations of the individual summands on the r.h.s. of (2.7) exist. Again, our assumptions are not transparent; we discuss them in detail in Section 2.3.

The final ingredient for Stein differentiation is the Stein equation:

Definition 2.8 (Stein equation).

Fix and . For , the -Stein equation for is the functional equation , i.e.


A solution to the Stein equation is any function which satisfies (2.8) for all .

Our notations lead immediately to the next result.

Lemma 2.9 (Solution to the Stein equation).

Fix . The Stein equation (2.8) for is solved by


with the convention that for all outside of .


With ,

using Lemma 2.5 for the last step. Hence (2.8) is satisfied for all . Since, by construction, , the claim follows. ∎

When the context is clear then we drop the superscripts and the subscript in of (2.9). Before proceeding we provide two examples. The notation refers to the identity function .

Example 2.10 (Binomial distribution).

Let be the binomial density with parameters and ; assume that . Stein’s method for the binomial distribution was first developed in [32] using ; see also [68, 43].

Picking , the class consists of functions which are bounded on and . Fixing gives leading to


with corresponding class which contains all functions . The solution to the -Stein equation (see (2.8)) is

and .

Picking , the class consists of functions which are bounded on and such that . Again fixing gives leading to

acting on the same class as (2.10). The solution to the -Stein equation is

and . The function is studied in [32] where bounds on are provided (see equation (10) in that paper); see also Section 2.4 where bounds on are provided.
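As a sanity check on the binomial example, the sketch below verifies with exact rational arithmetic that a binomial Stein operator has mean zero. It assumes the classical form f ↦ θ(n - x) f(x + 1) - (1 - θ) x f(x) for a Bin(n, θ) target; the normalisation used in this example may differ, and the parameters, the test function and the helper names are hypothetical choices for illustration only.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, theta):
    """Exact probability mass function of Bin(n, theta) on {0, ..., n}."""
    return [comb(n, x) * theta ** x * (1 - theta) ** (n - x) for x in range(n + 1)]

def stein_expectation(f, n, theta):
    """E[theta (n - X) f(X + 1) - (1 - theta) X f(X)] for X ~ Bin(n, theta)."""
    pmf = binomial_pmf(n, theta)
    return sum(
        pmf[x] * (theta * (n - x) * f(x + 1) - (1 - theta) * x * f(x))
        for x in range(n + 1)
    )

# Hypothetical parameters and test function, chosen only for illustration.
n, theta = 7, Fraction(3, 10)
result = stein_expectation(lambda x: x ** 3 - 2 * x + 5, n, theta)
print(result)
```

Using `Fraction` avoids rounding error, so the mean-zero property can be checked as an exact equality rather than up to a tolerance.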

Example 2.11 (Beta distribution).

Let be the beta density with parameters and . Stein’s method for the beta distribution was developed in [37, 31] using the Stein operator . In our notations, we have and consists of functions such that and is Lebesgue integrable on . Fixing gives leading to the operator

with domain the set of differentiable functions such that and . The solution to the -Stein equation is

The operator is, up to multiplication by , the classical Stein operator for the beta density, see [37, 31] for details and bounds on solutions and their derivatives. See also Section 2.4 where bounds on are provided.

In order to propose a more general example, we recall the concept of a Stein kernel, here extended to continuous and discrete distributions alike.

Definition 2.12 (The Stein kernel).

Let have finite mean. The (-)Stein kernel of (or of ) is the function

Metonymously, we refer to the random variable as the (-)Stein kernel of

Remark 2.13.

The function is studied in detail for in [70, Lecture VI]. This function is particularly useful for Pearson (and discrete Pearson a.k.a. Ord) distributions which are characterized by the fact that their Stein kernel is a second degree polynomial, see Example 3.8. For more on this topic, we also refer to forthcoming [33] as well as [28, 36, 35] wherein important contributions to the theory of Stein kernels are provided in a multivariate setting.

The next example gives some (-)Stein kernels, exploiting the fact that if the mean of is , then

Example 2.14.

If then using , Example 2.10 gives and . If then Example 2.11 with gives .
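For a discrete target, the Stein kernel can be computed directly as a normalised tail sum of (y - mean) p(y). The sketch below does this for a binomial law with exact rational arithmetic and compares the result with the closed forms (1 - θ)x and θ(n - x), which follow from a direct computation with the binomial mass function. Whether the tail sum should include the point x depends on the choice of forward or backward difference; which convention matches which label is an assumption here, and the helper names and parameters are hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, theta):
    """Exact probability mass function of Bin(n, theta) on {0, ..., n}."""
    return [comb(n, x) * theta ** x * (1 - theta) ** (n - x) for x in range(n + 1)]

def tail_kernel(n, theta, include_x):
    """Normalised tail sums (1 / p(x)) * sum_{y >= x or y > x} (y - mean) p(y)."""
    pmf = binomial_pmf(n, theta)
    mean = n * theta
    start = 0 if include_x else 1
    return [
        sum((y - mean) * pmf[y] for y in range(x + start, n + 1)) / pmf[x]
        for x in range(n + 1)
    ]

# Hypothetical parameters chosen for illustration.
n, theta = 6, Fraction(1, 4)
kernel_incl = tail_kernel(n, theta, include_x=True)    # tail sum over y >= x
kernel_excl = tail_kernel(n, theta, include_x=False)   # tail sum over y > x

print(kernel_incl)  # compare with (1 - theta) * x
print(kernel_excl)  # compare with theta * (n - x)
```

Both kernels have expectation nθ(1 - θ), the variance of the binomial law, which is the usual normalisation check for a Stein kernel.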

Example 2.15 (A general example).

Let satisfy Assumption A and suppose that it has finite mean . Fixing , operator (2.7) becomes

with corresponding class which contains all functions such that and . Again, we stress that such conditions are clarified in Section 2.3. Using Lemma 2.9, the solution to the Stein equation is

Bounds on are provided in Section 2.4. Stein’s method based on is already available in several important subcases, e.g. in [67, 47, 31] for continuous distributions.

The construction is tailored to ensure that all operators have mean 0 over the entire classes of functions on which they are defined. We immediately deduce the following family of Stein integration by parts formulas:

Lemma 2.16 (Stein IBP formula - version 2).

Let . Then


for all and all such that and .


Proof. Identity (2.11) follows directly from the Stein product rule in [53, Theorem 3.24] or by using the fact that expectations of the operators in (2.7) are equal to 0. ∎

We stress the fact that in the formulation of Lemma 2.16 the test functions and do not play a symmetric role. If then the right hand side of (2.11) is the covariance . We shall use this heavily in our future developments. Similarly as for Lemma 2.3, the conditions under which Lemma 2.16 applies are not transparent in their present form. In Section 2.3 various explicit sets of conditions are provided under which the IBP (2.11) is applicable.
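A familiar special case of such integration by parts formulas is the Stein kernel identity E[τ(X) g'(X)] = Cov[X, g(X)]. The sketch below checks it for a Beta(2, 3) target using polynomial arithmetic, so that every integral is computed from an explicit antiderivative. The closed form τ(x) = x(1 - x)/(α + β) for the beta Stein kernel, the parameter values and the test function g(x) = x^2 are assumptions of this illustration.

```python
from numpy.polynomial import Polynomial

# The identity polynomial x on [0, 1].
x = Polynomial([0.0, 1.0])

# Beta(2, 3) density: x (1 - x)^2 / B(2, 3) with B(2, 3) = 1/12.
density = 12.0 * x * (1.0 - x) ** 2
# Assumed closed-form Stein kernel x (1 - x) / (alpha + beta) with alpha + beta = 5.
kernel = x * (1.0 - x) / 5.0

def expectation(poly):
    """E[poly(X)] for X ~ Beta(2, 3), via the antiderivative of poly * density."""
    antiderivative = (poly * density).integ()
    return float(antiderivative(1.0) - antiderivative(0.0))

# Illustrative test function (assumption): g(x) = x^2.
g = x ** 2
mean = expectation(x)
covariance = expectation((x - mean) * g)    # Cov[X, g(X)] = E[(X - mean) g(X)]
weighted = expectation(kernel * g.deriv())  # E[tau(X) g'(X)]

print(f"Cov[X, g(X)] = {covariance:.10f}, E[tau(X) g'(X)] = {weighted:.10f}")
```

Working with polynomials keeps the example free of numerical integration error; for non-polynomial test functions a quadrature rule would be needed instead.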

2.2 Representations of the inverse Stein operator

This section contains the first main results of the paper, namely probabilistic representations for the inverse Stein operator. Such representations are extremely useful for manipulations of the operators. We start with a simple rewriting of . Given , recall the notation and define


Such generalized indicator functions particularize, in the three cases that interest us, to , and . Their properties lead to some form of “calculus” which shall be useful in the sequel.

Lemma 2.17 (Chi calculation rules).

The function is non-increasing in and non-decreasing in . For all we have




Let with support satisfy Assumption A. Then for any it is easy to check from the definition (2.6) that


Next, define


for all and 0 elsewhere. This function is used in the following representation formula for the Stein inverse operator:

Lemma 2.18 (Representation formula I).

Let be independent copies of with support . Then, for all we have


Proof. The condition on suffices for the expectation on the r.h.s. of (2.17) to be finite for all . Suppose without loss of generality that . Using that are i.i.d., we obtain