
Bounds in L^1 Wasserstein distance on the normal approximation of general M-estimators

by   François Bachoc, et al.

We derive quantitative bounds on the rate of convergence in L^1 Wasserstein distance of general M-estimators, with an almost sharp (up to a logarithmic term) behavior in the number of observations. We focus on situations where the estimator does not have an explicit expression as a function of the data. The general method may be applied even in situations where the observations are not independent. Our main application is a rate of convergence for cross validation estimation of covariance parameters of Gaussian processes.





1 Introduction

Our goal here is to derive quantitative bounds for the approximate normality of parameter estimators that arise as minimizers of certain random functions. The main example to keep in mind is maximum likelihood estimation [51, Chapter 5.5], but other problems fit in the framework we shall consider, including least squares estimators [48] and cross validation [9, 56].

Consider a fixed compact parameter space and a sequence of random functions , where for , . Throughout, is the set of non-zero natural numbers. The variable should be thought of as a sample size, and

the function for which a minimizer will be the M-estimator of interest, which is a (measurable) random vector

such that


A classical family of M-estimators is given by functions of the form


where the are the independent sample data, valued in a space , and is a fixed function. We shall address this class in detail in Sections 3.1 and 3.2, but our investigation goes beyond this framework, in particular to cover covariance estimation for Gaussian processes, addressed in Section 3.3.
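As a rough illustration of this classical family, the following sketch minimizes an empirical average of a per-observation function over a compact interval. The names `loss`, `empirical_criterion` and `m_estimator`, the squared-error loss, the interval [-5, 5] and the grid search are all assumptions made for illustration; none of them comes from the paper.

```python
import numpy as np

def loss(theta, x):
    # Hypothetical per-observation function: squared error.
    return (x - theta) ** 2

def empirical_criterion(theta, data):
    # Average of the per-observation function over the sample.
    return np.mean(loss(theta, data))

def m_estimator(data, lower=-5.0, upper=5.0, grid_size=20001):
    # Brute-force minimization over a grid of the compact interval [lower, upper].
    grid = np.linspace(lower, upper, grid_size)
    values = np.array([empirical_criterion(t, data) for t in grid])
    return grid[np.argmin(values)]

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=500)
theta_hat = m_estimator(data)
```

With the squared-error loss the minimizer is simply the sample mean (when it lies in the interval), so this toy case has an explicit expression; the situations of interest in the text are those where no such expression is available.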

Our goal is to derive quantitative central limit theorems in Wasserstein (or optimal transport) distance for the fluctuations of around a deterministic parameter (that is allowed to depend on ). The simplest example is when is fixed, typically when stems from the likelihood function and there is a fixed data generating process characterized by the “true” parameter [51, Chapter 5.5]. Nevertheless, we allow for a sample-size dependent , which makes it possible to address relevant situations such as misspecified models [15, 17, 34, 54]. In particular, in [15, 17], the parameter of interest that estimates explicitly depends on the sample size.

In the context of this paper, it is typically already known that the distribution of

converges to a Gaussian distribution. General techniques for showing this convergence are available in a wealth of contributions, see for instance [20, 47, 51] and references therein. Our goal is then to go beyond the convergence between these two distributions (for which, usually, no rates are available) by providing quantitative bounds on their Wasserstein distance. From this viewpoint, the main challenge is the M-estimation setting, which often entails that no explicit expression of is available. Our main abstract result, Theorem 1, is a general statement about reducing the problem to a central limit theorem for an explicit function of the data. More precisely, the Wasserstein distance between the distribution of and a Gaussian distribution is bounded by the sum of a term of order (up to a log factor) and the distance between a Gaussian distribution and the normalized gradient of at .

Hence, Theorem 1 reduces the problem to quantifying the asymptotic normality of this normalized gradient. Since this quantity is explicit, there are many techniques in the literature that can be applied. We shall discuss this aspect of the problem in Section 2.3.

We shall illustrate the benefits of Theorem 1 with several examples of functions : averages of independent functions in Section 3.1, maximum likelihood for logistic regression in Section 3.2 and cross validation estimation of covariance parameters of Gaussian processes in Section 3.3. This last example highlights the flexibility of our techniques, since the observations are dependent and the function is not based on the likelihood. In all three cases, we eventually provide a bound of order (up to a log factor) on the Wasserstein distance between the distribution of and a Gaussian distribution.

There has been recent interest in bounding the normal approximation of M-estimators, as we do here. On connected topics, the normal approximation is quantified in [46] for the Delta method, in [8] for likelihood ratios and in [3] for gradient descent. Considering now specifically M-estimators, a series of articles successfully addressed them: [1, 2, 4, 5, 6, 7, 16, 45, 50]. These articles address not only the univariate case (for ) [1, 6, 7, 16, 45], but also the general multivariate one [2, 4, 5, 50]. In particular, some of these references exploit the characterization of the Wasserstein distance as a supremum of expectation differences over Lipschitz functions. This makes it possible to decompose the target Wasserstein distance into several terms that can be addressed independently with different approaches. This idea appears for instance in [1, (9), (10) and (20)], as well as in some of the other articles above. We also rely on it, see (A) and (A).

We shall now highlight the novelty of our results compared to the above articles. First, the references [2, 4, 6, 7, 16, 45, 50] do not address the Wasserstein distance as we do. Only [1, 5] do. In [50], the distance is the supremum probability difference over convex sets, which is of the Berry-Esseen type. Earlier and similarly, [16, 45] considered the Kolmogorov distance in the univariate case. Also, [6, 7] address Zolotarev-type distances based on suprema of expectation differences over absolutely continuous bounded test functions (and Lipschitz in [7], yielding the bounded-Wasserstein distance). Similarly, [2, 4] consider test functions that are bounded with bounded derivatives of various orders. Note that while the Wasserstein and Kolmogorov distances can be compared under regularity conditions and a priori moment bounds, using general comparison results typically worsens the quantitative estimates. Note also that bounding the Wasserstein distance is stronger than in [2, 4, 7], as it allows for a larger class of test functions. Furthermore, Berry-Esseen-type and Kolmogorov distances may be less sensitive than Wasserstein distances to, for instance, the moments of . Thus, the Wasserstein distances require specific treatment compared to them (for instance, see the proof and use of Lemma 7 here, or the terms in Theorem 2.1 in [7] involving the moments of ).

In addition, we allow for general functions , while most of the above references focus on maximum likelihood. Some arguments provided for maximum likelihood do carry over to general functions , but it is not clear that this is the case for all of them. Also, most of the above references focus on independent observations (often also identically distributed) defining the function (with the exception of [1]), while we allow for stemming from dependent observations. Again, some but not all arguments for independent observations can be extended to dependent observations. In the case of independent observations, as in [5] we shall rely on a result of Bonis [18] to bound the rate of convergence in the multivariate central limit theorem.

Furthermore, in comparison to [1, 2, 4, 5, 6, 7], our general bound in Theorem 1 only depends on and its derivatives, and does not feature . In contrast, most of the general bounds in these references contain moments of (see for instance Theorem 2.1 in [7]). Hence, our general bound seems more convenient to apply to examples, particularly when does not have an explicit expression, which is often the case. In agreement with this, in most of the examples provided by [1, 2, 4, 5, 6, 7], has an explicit expression. As an exception, [2, 7] address maximum likelihood estimation of the shape parameters of the Beta distribution. Finally, [1, 2, 4, 5, 6, 7] usually make the assumption that there is a unique satisfying (1), while Theorem 1 here holds for any satisfying (1). In many statistical models of interest, there is no guarantee that has a unique minimizer over , almost surely.

The examples we address are representative of the flexibility of Theorem 1. In particular we address general averages of independent functions in Section 3.1. We treat logistic regression in Section 3.2, with a simple proof once Theorem 1 is established, which illustrates that this theorem is efficient even when does not have an explicit expression, and is not necessarily unique. Finally, in Section 3.3 we address cross validation estimation of covariance parameters of Gaussian processes. This last example highlights our flexibility to dependent observations and to not stemming from a likelihood and even not being an average of functions of individual observations (most of the discussed references above consider these averages of functions for ). Again, has no explicit expression in this cross validation example.

The rest of the paper is organized as follows. Section 2 provides the general technical conditions and the general bound of Theorem 1, reducing the problem to the asymptotic normality of the normalized gradient. It also discusses references from the probabilistic literature that address this asymptotic normality. Section 3 addresses the three examples discussed above. Some of the proofs are postponed to the appendix.

2 General bounds

For a matrix , we write for its singular values, and for a symmetric matrix, we write for its eigenvalues.

2.1 Technical conditions

For , we write and . We write for the interior of the parameter space . The next condition means that is, so to speak, well-behaved. It can be checked that this condition holds for most common compact parameter spaces, in particular hypercubes, balls, ellipsoids and polyhedral sets.

Condition 1.

There exist two constants and such that for each , there exist and satisfying the following. For each , there exists such that and .

The next condition essentially asks for enough integrability on the derivatives of to be able to interchange expectation and differentiation, which is usually established using the dominated convergence theorem. Note that the conditions on the first two derivative orders will actually be implied by some of our later conditions, but we state them here independently for ease of exposition.

Condition 2.

Consider . For , the random variable is absolutely summable. Almost surely, the function is three times differentiable on . For and , the random variables , and are absolutely summable. Furthermore,


The next condition means that, for a fixed , and , , concentrate around their expectations at rate , with an exponential decay for deviations of order larger than . Many concentration inequalities (for instance [19, 23]) make it possible to check this condition in specific settings (see for instance those of Section 3). The rate in the exponential is sharp in general for averages of i.i.d. random variables.

Condition 3.

There are constants , and such that for and ,


For a function and for , we write the gradient column vector of at and we write the Hessian matrix of at . The next condition is a control on the deviations of the derivatives of of order 1 and 2, that is uniform over . Remark that the deviations that are controlled are of larger order than those in Condition 3. Hence, again, the condition can be checked in many settings.

Condition 4.

There are constants , and such that for and ,


We then require the derivatives of order 1, 2 and 3 of to have bounded moments of order 1, 1 and 2.

Condition 5.

There is a constant such that for ,




Above, the moments of the derivatives of order 1 and 2 are taken at fixed , whereas those of order 3 are uniform over . The proof of Theorem 1 shows that assuming uniformity only locally around (see Condition 7) would be sufficient. For instance, [4] uses a similar locally uniform moment bound on the third derivatives of the log-likelihood function (see (R.C.3) there).

The next condition requires the variances of the derivatives of order 1 and 2 of to be of order . This condition is natural and easy to check in many settings, for example for i.i.d. random variables.

Condition 6.

There is a constant such that for , ,


For and , we let be the closed Euclidean ball in with center and radius . The next condition introduces the sequence of deterministic parameters , to which is asymptotically close. In the applications of Sections 3.2 and 3.3, does not depend on the sample size and determines the fixed unknown data generating process. Nevertheless, it is beneficial to allow for a -dependent , to cover general cases of misspecified models, for instance as in [15, 17, 34, 54].

Condition 7.

There exists a sequence and a constant such that for each , . For each , . For each such that , there exist constants and such that for ,

Condition 7 is a usual one in M-estimation: cancels out the expected gradient of and is asymptotically the minimizer of , so to speak.

Remark 1.

In Condition 6, it is actually sufficient that the second inequality holds only for . We state Condition 6 as it is only for convenience of writing, and because checking the inequality uniformly over in the bounded usually brings no additional difficulty.

Then, define the covariance matrix of the normalized gradient


and the expected Hessian


The next condition requires the expected Hessian matrix of at to be asymptotically strictly positive definite. Similarly to Condition 7, this is a usual requirement for and to be close at asymptotic rate .

Condition 8.

There are constants and such that for

We finally require the covariance matrix of the normalized gradient to be asymptotically strictly positive definite, so that the Gaussian limit in the central limit theorem is non-degenerate.

Condition 9.

There are constants and such that for ,

2.2 Reduction to the normal approximation of the normalized gradient

We let be the set of -Lipschitz continuous functions from to , that is the set of functions such that, for all ,

Then, for two random vectors and in , the Wasserstein distance between the distributions of and is

Equivalently, is also the well known optimal transport cost, according to the Kantorovitch-Rubinstein duality formula:

where is the set of pairs of random vectors for which the first one is distributed as and the second one as .
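To make the optimal transport formulation concrete, here is a minimal sketch that computes the L^1 Wasserstein distance between the empirical distributions of two one-dimensional samples of equal size. It relies on the standard fact that, in dimension one, an optimal coupling of two equal-size empirical measures pairs the order statistics. The function name `wasserstein1_empirical` is hypothetical, and the sketch only handles this special case, not the general multivariate distance used in the text.

```python
import numpy as np

def wasserstein1_empirical(u, v):
    # L^1 Wasserstein distance between the empirical distributions of two
    # one-dimensional samples of equal size: the optimal coupling matches
    # order statistics, so the cost is the mean absolute gap after sorting.
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    if u.shape != v.shape:
        raise ValueError("this sketch assumes samples of equal size")
    return float(np.mean(np.abs(u - v)))

# Each sorted point of the second sample sits one unit above the first.
shifted = wasserstein1_empirical([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
```

The result is a distance between empirical measures; using it to approximate the distance between the underlying distributions introduces an additional sampling error that this sketch does not control.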

For a symmetric non-negative definite matrix , we write for its unique symmetric non-negative definite square root. When is also invertible, we write .

The next theorem is the main result of this paper. It can be checked, using standard arguments, that the conditions of Section 2.1 imply that is asymptotically normally distributed, with asymptotic covariance matrix taking the “sandwich” form . Equivalently, converges to a standard Gaussian distribution. We are interested in the Wasserstein distance between the distribution of this latter random vector and the standard Gaussian one. We show that this distance is bounded by the sum of a term of order (up to a log factor) and the distance between and the standard Gaussian distribution. The benefit of Theorem 1 is then that is usually much easier to analyze than , since it takes an explicit form and is not defined as a minimizer. In Section 2.3, we discuss many existing possibilities to quantify the asymptotic normality of .

Theorem 1.

Assume that Conditions 1 to 9 hold. Consider as in (1). There are constants , and such that for , with following the standard Gaussian distribution on ,

Remark 2.

In Theorem 1, the bound on directly provides a similar bound on , where follows the centered Gaussian distribution with covariance matrix . Indeed the matrix is bounded and we can apply the well-known Lemma 1 below. The same remark applies to Theorems 2, 3 and 4, since the matrix is also bounded in these latter contexts (as is shown in the proofs).

Lemma 1.

Let be two random vectors of and be such that for , with . Then .

2.3 Background on approximate normality for functions of many random variables

Theorem 1 reduces the problem of proving a quantitative bound on distance to the Gaussian for a general M-estimator to proving the same statement for an explicit function of the data. We shall now describe some of the broad ideas for proving such statements, some of which will be used in the applications described in Section 3. We do not aim at being exhaustive, and other techniques can also be used in this context.

The abstract setting is to consider a random variable of the form where the are random variables. The classical central limit theorem consists in taking the to be i.i.d. and to be a normalized sum.

When is a sum, which arises for M-estimators of the form (2) (see Sections 3.1 and 3.2), there is a vast literature on quantitative central limit theorems, beyond the classical i.i.d. assumptions. For independent variables, we shall use here a very general result of Bonis [18], but many other results can be used in such a situation.


When is not a sum, but is approximately affine and all variables have some influence on the value, we still expect approximate normality. This heuristic has been made rigorous by second-order Poincaré inequalities, which bound distances to the Gaussian when certain functions of the first and second derivatives are small. They were introduced in the Gaussian setting by Chatterjee [22] and extended in [41], and analogues for general independent random variables via discrete second-order derivatives were studied in [21, 26, 28]. Second-order Poincaré inequalities for non-Gaussian, non-independent random variables do not yet seem to have been addressed in the literature, and warrant further investigation.

Another method for proving approximate normality in the Gaussian setting when the function is a multivariate polynomial is via the quantitative fourth moment theorem of Nourdin and Peccati [38], which for example applies to U-statistics. When the polynomial is square-free and has low influences, it is possible to extend this phenomenon to more general i.i.d. random variables [42]. The approach extends to non-independent functions of Gaussian variables, a result known as the quantitative Breuer-Major theorem [37, 40]. We refer to the monograph [39] for a thorough discussion of this approach. We shall use a variant of it in Section 3.3.

For non-independent random variables, there have been successful implementations of variants of Stein’s method, often in situations where there is some symmetry. Classical techniques include the exchangeable pairs method and the zero-bias transform, and we refer to [49] for a survey.

3 Applications

3.1 Minimization of averages of independent functions

We now show how Theorem 1 applies to estimators provided by as in (2) with independent random vectors .

We introduce the property of sub-Gaussianity, that holds for a large class of random variables, including Gaussian random variables, bounded random variables and uniformly log-concave random variables.

Definition 1.

A real-valued random variable is said to be sub-Gaussian with constant if for any we have

The next theorem, based on Theorem 1, provides a bound of order (up to a log factor) in Wasserstein distance for the asymptotic normality of M-estimators based on (2), under uniform sub-Gaussianity for and its derivatives with respect to .

Theorem 2.

Assume that are independent. Assume moreover that there are constants and such that for any , for any , for any , for any ,


Assume moreover that Conditions 1, 2 and 7 to 9 hold. Consider , , and as in (2), (1), (5) and (6). Finally, assume that one of the following two conditions holds:

  1. There exist fixed constants and such that


    for all and .


  2. All the functions , and have a modulus of continuity bounded by some function , uniformly in and in .

Then there are constants , and such that, for , with following the standard Gaussian distribution,

Remark 3.

The sub-Gaussianity assumption (2) of Theorem 2 on the partial derivatives of with respect to can be checked based on the sub-Gaussianity of only and on regularity properties of .

Indeed, it is known that if a random vector with values in has components that are sub-Gaussian with constant , then for any -Lipschitz function , the variable is sub-Gaussian with constant at most of order . The dimensional prefactor can be eliminated, for example, when the components are independent and satisfy Talagrand’s transport-entropy inequality [32]. Consider then the case where are uniformly sub-Gaussian and for any , for any , is Lipschitz in its second variable, uniformly in , and is bounded, also uniformly in , for some reference values of , . In this case, the uniform sub-Gaussianity assumption (2) of Theorem 2 holds.

Note also that these latter assumptions are not minimal. For example, we could relax the Lipschitz assumption on the second derivatives into some quadratic growth. The assumptions on the third derivatives are much stronger than what is necessary to ensure (4) to streamline applications: one can check essentially the same conditions on all derivatives up to order three, rather than single out a weaker condition for third derivatives.

Remark 4.

The two possible conditions 1 and 2 in Theorem 2 are used to ensure that Condition 4 holds. There are other possible ways of verifying it, such as classical chaining techniques used to bound the suprema of stochastic processes when stochastic forms of continuity (in ) hold, see for example [52, Chapter 8].

Proof of Theorem 2.

First we must check that the conditions required by Theorem 1 are satisfied. By assumption, this means checking Conditions 3 to 6.

From the sub-Gaussianity and bounded expectation assumption (2), we uniformly control moments of all orders, and the first two parts of Condition 5 hold. Condition 3 is an immediate consequence of the Gaussian concentration assumption and Chernoff’s concentration bound. Condition 6 can be established using the fact that we wish to control the variances of averages of independent variables, and the uniform moment bounds.

Finally, we need to check that Condition 4 holds, assuming either 1 or 2 holds. If the first one holds, Condition 4 is just a consequence of Markov’s inequality. If the second one holds, by continuity, Condition 1 and fixing some , and some small enough, we have for any ,

for some constant , where the final bound uses the Gaussian concentration of for fixed and the uniform bound on its expectation. The same reasoning applies for the second derivatives, and therefore Condition 4 holds with the same argument as when 1 holds. One can also check (4) with the same reasoning.

Since Theorem 1 applies, we are reduced to understanding the asymptotic behavior of

Hence we are in the setting of a quantitative central limit theorem for sums of independent random vectors. From the sub-Gaussianity assumption (2), we see that the fourth moments of , , are uniformly bounded. Moreover, by Condition 9, this boundedness is preserved when multiplying these vectors by . Hence we are considering a sum of independent random vectors with covariances summing to the identity matrix , and we can conclude the proof by applying the following statement, which is a particular case of a result of Bonis [18, Theorem 11].

Proposition 1.

Let be a sequence of independent random vectors taking values in , each centered, and such that . Assume moreover that for any , , for a given . Then

where is a standard Gaussian vector on , and the constant is a numerical constant that does not depend on or on the distribution of the ’s.

3.2 Parameter estimation in logistic regression

We shall now present the simple example of logistic regression, where Theorem 2 is applied to a maximum likelihood estimator. We consider a deterministic sequence of vectors in . To match the assumptions of Theorem 2, we assume this sequence to be bounded.

Condition 10.

There is a constant such that for ,

As previously, we let be a fixed compact subset of . We let be fixed. We consider a sequence of independent random variables with, for , and


We let, for ,

Hence, we are in the classical well-specified case where the parameter characterizes the data generating process, or distribution, of . The likelihood function of is, for ,

Minus the logarithm of the likelihood of is, for ,

Hence minus the normalized log likelihood function is, for ,


Note that we do not have an explicit expression for the minimizer of . We have, for ,


We also have, for ,


Hence we see that is convex with respect to . Next, we assume that the empirical second moment matrix of the ’s is asymptotically strictly positive definite. This type of condition is common in logistic regression [15, 29, 35] and ensures asymptotic identifiability (Condition 8).
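The quantities discussed above can be sketched numerically as follows. This is only an illustration under stated assumptions: it uses the standard logistic link, with success probability 1 / (1 + exp(-x·θ)), and an unconstrained Newton iteration that ignores the compact parameter space. The function names, the simulated design, and the value `theta_0` are all hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def neg_log_lik(theta, X, y):
    # Minus the normalized log-likelihood under the assumed logistic link.
    t = X @ theta
    return np.mean(np.logaddexp(0.0, t) - y * t)

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def hessian(theta, X):
    # Weighted second moment matrix of the covariates; non-negative definite.
    p = sigmoid(X @ theta)
    return (X * (p * (1.0 - p))[:, None]).T @ X / X.shape[0]

def newton(X, y, steps=25):
    # Plain Newton iteration; no explicit formula for the minimizer exists.
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta - np.linalg.solve(hessian(theta, X), gradient(theta, X, y))
    return theta

rng = np.random.default_rng(1)
n, p = 2000, 2
X = rng.uniform(-1.0, 1.0, size=(n, p))  # bounded covariates
theta_0 = np.array([0.5, -1.0])          # hypothetical true parameter
y = (rng.uniform(size=n) < sigmoid(X @ theta_0)).astype(float)
theta_hat = newton(X, y)
```

The Hessian does not depend on the responses and is a weighted second moment matrix of the covariates, which is why a positive definiteness condition on that matrix yields strict convexity of the criterion.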

Condition 11.

There are constants and such that, for ,

We can now state the Wasserstein bound on the asymptotic normality of the maximum likelihood estimator in logistic regression. To our knowledge, this is the first established rate of convergence for asymptotic normality in logistic regression.

Theorem 3.

Assume that satisfies Condition 1. Assume that Conditions 10 and 11 hold. Consider in (9), as in (1), as defined in (8), as in (5) and as in (6). Then, there are constants , and such that for , with following the standard Gaussian distribution on ,

3.3 Covariance parameter estimation for Gaussian processes by cross validation

Our last example stems from the field of spatial statistics [9, 10, 14, 24, 25, 33, 53, 55, 56]. The goal is to illustrate the benefit of Theorem 1 in a situation where the observations are dependent and where does not correspond to a likelihood. We stress that has no explicit expression.

We consider a sequence of deterministic vectors in , that we call observation points. Then, for , the observed data consist of a vector of size whose component is , where is a centered Gaussian process.

We are interested in the parametric estimation of the correlation function of , based on a parametric set of stationary correlation functions , where for , and is a correlation function. For an introduction to usual parametric sets of stationary correlation functions in spatial statistics, we refer for instance to [11, 24, 25, 31, 53].

As an estimator for , we consider the minimization of the average of squared leave-one-out errors, letting, for ,

Above, is obtained from by deleting the component and means that the conditional expectation is computed as if the Gaussian process had correlation function . Now, for , let be the matrix with coefficient equal to , that is, the correlation matrix of under correlation function given by . Then, from for instance [9, 27, 56] (to which we refer for more background and discussions on cross validation for Gaussian processes), we have


where is obtained by setting the off-diagonal elements of a square matrix to zero.
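The following sketch checks numerically that the average of squared leave-one-out errors admits a closed form in terms of the inverse correlation matrix. It assumes the standard identity that the i-th leave-one-out error equals the i-th component of R⁻¹y divided by the i-th diagonal entry of R⁻¹; the function names, the exponential correlation family, the one-dimensional grid of observation points and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def loo_criterion_closed_form(R, y):
    # i-th leave-one-out error: (R^{-1} y)_i / (R^{-1})_{ii}.
    R_inv = np.linalg.inv(R)
    residuals = (R_inv @ y) / np.diag(R_inv)
    return np.mean(residuals ** 2)

def loo_criterion_brute_force(R, y):
    # Recompute each conditional mean after deleting one observation.
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        weights = np.linalg.solve(R[np.ix_(keep, keep)], R[keep, i])
        errors[i] = y[i] - weights @ y[keep]
    return np.mean(errors ** 2)

def exponential_correlation(points, theta):
    # Hypothetical parametric family: exp(-|x - x'| / theta) in dimension one.
    dist = np.abs(points[:, None] - points[None, :])
    return np.exp(-dist / theta)

rng = np.random.default_rng(2)
points = np.arange(30, dtype=float)  # unit minimal distance between points
R0 = exponential_correlation(points, 2.0)
y = np.linalg.cholesky(R0) @ rng.standard_normal(30)  # centered Gaussian data
R = exponential_correlation(points, 1.5)              # candidate parameter
closed = loo_criterion_closed_form(R, y)
brute = loo_criterion_brute_force(R, y)
```

The closed form needs a single matrix inversion per candidate parameter, whereas the brute-force version solves one linear system per deleted observation.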

For , we let , where is a fixed element of such that has correlation function , which also implies that has correlation matrix . This corresponds to a well-specified parametric set of correlation functions. The next condition means that we consider the increasing-domain asymptotic framework, where the sequence of observation points is unbounded, with a minimal distance between any two distinct points [10, 25, 36].

Condition 12.

There is a constant such that for , ,

The next condition is a lower bound on the smallest eigenvalues of the correlation matrices from the parametric model. Given the increasing-domain asymptotic framework (Condition 12), this lower bound indeed holds for a large class of families of stationary correlation functions [10, 13].

Condition 13.

There is a constant such that

Next, we assume a third-order smoothness with respect to as well as a decay of the correlation at large distance. As before, many families of stationary correlation functions do satisfy this.

Condition 14.

For any , is three times continuously differentiable with respect to on . There exist constants and such that for , for ,