# Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions

The problem of determining a periodic Lipschitz vector field b=(b_1, ..., b_d) from an observed trajectory of the solution (X_t: 0 < t < T) of the multi-dimensional stochastic differential equation dX_t = b(X_t)dt + dW_t, t ≥ 0, where W_t is a standard d-dimensional Brownian motion, is considered. Convergence rates of a penalised least squares estimator, which equals the maximum a posteriori (MAP) estimate corresponding to a high-dimensional Gaussian product prior, are derived. These results are deduced from corresponding contraction rates for the associated posterior distributions. The rates obtained are optimal up to log-factors in L^2-loss in any dimension, and also for supremum norm loss when d < 4. Further, when d < 3, nonparametric Bernstein-von Mises theorems are proved for the posterior distributions of b. From this we deduce functional central limit theorems for the implied estimators of the invariant measure μ_b. The limiting Gaussian process distributions have a covariance structure that is asymptotically optimal from an information-theoretic point of view.

## Authors

• 9 publications
• 10 publications
12/22/2020

### Nonparametric Bayesian inference for reversible multi-dimensional diffusions

We study nonparametric Bayesian modelling of reversible multi-dimensiona...
10/13/2021

### Spectral-norm risk rates for multi-taper estimation of Gaussian processes

We consider the estimation of the covariance of a stationary Gaussian pr...
02/11/2019

### High-dimensional central limit theorems for homogeneous sums

This paper develops a quantitative version of de Jong's central limit th...
09/27/2019

### Adaptive posterior contraction rates for empirical Bayesian drift estimation of a diffusion

Gaussian process (GP) priors are attractive for estimating the drift of ...
09/24/2018

### Convergence rates for Penalised Least Squares Estimators in PDE-constrained regression problems

We consider PDE constrained nonparametric regression problems in which t...
03/06/2021

### Statistical analysis of discretely sampled semilinear SPDEs: a power variation approach

Motivated by problems from statistical analysis for discretely sampled S...
05/06/2020

### Multiscale Bayesian Survival Analysis

We consider Bayesian nonparametric inference in the right-censoring surv...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

For a -dimensional Brownian motion and a Lipschitz vector field, consider the multi-dimensional Markov diffusion process describing solutions to the stochastic differential equation (SDE)

 dXt=b(Xt)dt+dWt,t≥0. (1)

The random process describes a Brownian motion whose trajectories are subject to spatially variable displacements enforced by the drift vector field . We are interested in recovering the parameter based on observing the process up to time . A closely related problem is that of estimating the invariant measure

of the diffusion, which describes the probabilities

 μb(A)=a.s.limT→∞1T∫T01A(Xt)dt (2)

corresponding to the average asymptotic time the ergodic process spends in a given measurable subset of the state space.

While the one-dimensional case is well studied (see, e.g., [30], [14, 15, 50, 38, 54, 1, 2, 3]), comparably little is known about the in applications highly important multi-dimensional setting, particularly when is modelled in a nonparametric or high-dimensional way. In the measurement model we consider here, convergence rates of certain multivariate nonparametric kernel-type estimators were first obtained in Dalalyan and Reiß [16], and further recent results in this direction are by Strauch [45, 46, 47], who obtained sharp (and adaptive) convergence rate results for in pointwise and -loss, and for in uniform-norm loss. The proofs of these results are based on certain spectral gap assumptions that permit the use of geometric and functional inequalities for diffusion processes [5]. Ultimately these conditions boil down to requiring that arises as a gradient vector field of some scalar potential satisfying certain curvature assumptions. This is a strong hypothesis on , which in particular implies that identifies and thus the law of the diffusion .

For observations , the likelihood function is directly available from Girsanov’s theorem, and has a convenient ‘Gaussian’ form in the parameter . This motivates the use of likelihood based inference procedures: the estimators for we study in the present paper are minimisers of a penalised likelihood (or, equivalently, least squares) criterion over a high-dimensional approximation space. In fact, since the penalties we use are squared Hilbert norms, equals a Bayesian ‘maximum a posteriori’ (MAP) estimate arising from a truncated Gaussian series prior. The Bayesian interpretation of is exploited in our proofs, and is further appealing since it comes hand in hand with uncertainty quantification methodology (‘posterior credible sets’), and posterior sampling is in principle feasible even for ‘real-world’ discrete data by simulation techniques, see [37, 8] and references therein.

Let us briefly describe our contributions: we obtain convergence rates of to the ‘true’ vector field generating equation (1), and also frequentist contraction rates about for the corresponding posterior distributions, both in - and -distances. For -loss the rates obtained are minimax optimal (up to log-factors) over Hölder classes in any dimension, and this remains true for -loss whenever dimension . When , we further prove nonparametric Bernstein-von Mises theorems that establish asymptotic normality of the re-centred and scaled posterior distributions in a (large enough) function space. From this we in turn deduce corresponding central limit theorems for the implied plug-in estimators for the invariant density . We exploit that the non-linear identification map can be shown to be ‘one-smoothing’. Since inference on is asymptotically equivalent to a nonparametric regression problem [16], this offers an analytical explanation for why the invariant density of the process can be estimated at rate in stronger norms than is the case in i.i.d. density estimation (see Section 2.5).

Instead of the functional inequalities used in [16, 46, 47]

, our proofs exploit basic martingale concentration properties and techniques from elliptic partial differential equations (PDEs). To simplify the PDE arguments in our proofs we restrict to periodic vector fields

. We avoid the assumption that is a gradient vector field altogether. The invariant measure then no longer identifies the law of the process – see after Proposition 1

below for details. Consequently, consistent Bayesian inference for

cannot be based on a prior assigned directly to the invariant measure. In contrast, first modelling by a Gaussian prior and subsequently recovering via PDE techniques leads to optimal results.

Standard methods [48] from the study of minimum contrast estimators (such as ) do not generally allow to derive optimal nonparametric convergence rates in stronger norms (such as -loss). Our proofs employ techniques from Bayesian Nonparametrics [11, 12, 9, 13] to overcome these limitations in our setting. In this regard our results are related to recent investigations of Bayesian inverse problems [29, 39, 4, 17, 28, 35], Bernstein-von Mises theorems [40, 32, 31, 33] and diffusion models [50, 38, 54, 34, 49, 26, 1].

## 2 Main results

### 2.1 Basic notation and definitions

Let denote the -dimensional torus, isomorphic to if opposite points on the cube are identified. By we denote the usual -spaces with respect to Lebesgue measure , equipped with inner product . Let be a probability measure on . If its Lebesgue density, also denoted by , exists and is bounded and bounded away from zero, then an equivalent norm on arises from the inner product . The symbol denotes the subspace of functions for which , and denotes the subspace for which .

We define the space of continuous functions on normed by the usual supremum norm . For , we denote by the usual Hölder spaces of -times continuously differentiable functions on , where is the integer part of . For , let denote the usual Sobolev space of functions from to (defined by duality when ). They form the special case in the scale of Besov spaces see Chapter 3 of [44] for definitions, where it is also shown that embeds continuously into . When no confusion may arise, we employ the same function space notation for vector fields . For instance will then mean that each and the norm on is given by We shall repeatedly use standard multiplication inequalities for Besov-Sobolev norms,

 ∥fg∥Bspq≤c(s,p,q,d)∥f∥Bspq∥g∥Bs∞∞≤c′(s,p,q,d)∥f∥Bspq∥g∥Cs,s≥0. (3)

Starting with a ‘-regular’ Daubechies periodised wavelet basis of , we denote by

 {Φl,r:r=0,…,min(0,2ld−1),l={−1,0}∪N}

a tensor product wavelet basis of

as described in Section 4.3 of [25]. We denote by the span of all wavelets up to resolution level , a space whose dimension scales as as . The decay of wavelet coefficients in this basis, or equivalently the scaling (as ) of the approximation errors from projections onto the spaces, characterises the norms of the Besov spaces and Sobolev spaces , see p.370f. in [25].

If is a probability measure on some metric space, then means that

is a random variable in that space drawn from the distribution

, also called the law of . We write , or when no confusion can arise, to denote the usual notion of weak convergence of the laws as , see, e.g., Chapter 11 in [18].

For an arbitrary normed linear space , the topological dual space is

 X∗=(X,∥⋅∥X)∗:={L:X→R linear s.t. |L(x)|≤C∥x∥X for all x∈X and some C>0},

which is a Banach space for the norm We will sometimes use the symbols to denote one- or two-sided inequalities up to multiplicative constants that may either be universal or ‘fixed’ in the context where the symbols appear. We also write to denote the non-negative part of a real number, and to denote maximum and minimum of real numbers , respectively.

### 2.2 Diffusions with periodic drift; likelihood, prior and posterior

Consider the SDE (1) where the vector field is Lipschitz continuous and one-periodic, that is for every . Then a strong pathwise solution of this SDE exists which is a -dimensional diffusion Markov process . We denote by the cylindrical probability measure describing the law of in path space when ; its restriction to the separable space describes the law of the process until time , see, e.g., Sections 24 and 39 in [6]. We suppress the dependence on the starting value as our results do not depend on it.

We seek to recover the drift function from an observed trajectory . The periodic model (which has also been used in [38, 54] when ) is convenient in our context as it effectively confines the diffusion process to a bounded state space . To be precise, while our diffusion takes values in the whole of (in particular will not be globally recurrent), the values of the process modulo contain all relevant statistical information. In particular we have (arguing as in the proof of Lemma 1 below),

 1T∫T0φ(Xt)dt→Pb∫Tdφdμb as T→∞,  ∀φ∈C(Td),

where is a uniquely defined probability measure on and where we identify with its periodic extension to on the left-hand side. The measure

has the usual probabilistic interpretation as an invariant measure appearing in the limit of ergodic averages, but for our purposes it is more convenient to define it in terms of a partial differential equation involving the generator of the diffusion Markov process. Heuristically, if

is the transition operator of a diffusion process with invariant measure and generator , then we can differentiate the invariant identity at , so that for all smooth . If is the adjoint operator for the standard -inner product, then it must satisfy for all smooth , and hence necessarily (in the weak sense), which can be used to identify via the adjoint generator .

To make this precise, in our periodic setting the generator is

 L=Lb=12Δ+b.∇=12d∑i=1∂2∂x2i+d∑i=1bi(⋅)∂∂xi, (4)

and from integration by parts we see that the adjoint operator for equals

 L∗=L∗b=12Δ−b.∇−div(b),   div(b)=d∑j=1∂bj∂xj, (5)

so that can be identified as the solution of the PDE

 L∗bμb≡12Δμb−b.∇μb−div(b)μb=0. (6)

If arises as a gradient vector field for some , one can check directly that solves (6). For general vector fields one can prove the following result (see after (66) in Section 4 below).

###### Proposition 1.

Let . A unique periodic solution to (6) satisfying exists. Moreover, is Lipschitz continuous and bounded away from zero on , with and the Lipschitz constant depending on only through a bound for .

While for gradient vector fields we can recover from via , the invariant measure does not identify or the law of for general vector fields (unless ). To see this, start with a gradient vector field and invariant measure . For any smooth divergence free vector field and (so that ) one checks by integration by parts that for all smooth , and as a consequence is also the invariant measure for . Thus any statistical approach to recover via first estimating is bound to fail in our general setting.

We instead propose likelihood-based inference methods. The log-likelihood function of our measurement model can be obtained from Girsanov’s theorem (Section IX.1 in [42] or 17.7 in [6]): for any periodic and Lipschitz ,

 eℓT(b)=dPTbdPT0(XT)=exp(−12∫T0∥b(Xt)∥2dt+∫T0b(Xt).dXt), (7)

where is the law of a standard -dimensional Brownian motion . Note that an application of Itô’s formula (as in Lemma 1 below) allows to re-write in terms of path integrals against so that computation of from an observed trajectory is possible. [In practice this may involve a further discretisation step, see [37, 8].]

Our approach to inference on amounts to computing a penalised maximum likelihood estimator over a high-dimensional wavelet approximation space. More precisely, set

 ^bT=^b(XT)=argminb∈V⊗dJ[−ℓT(b)+12∥b∥2H], (8)

where and is a Hilbert tensor norm on . The estimator has a natural Bayesian interpretation as the maximum a posteriori (MAP) estimate arising from a mean zero Gaussian prior on with reproducing kernel Hilbert space . Indeed, the posterior distribution arising from observing is of the form

 dΠ(b|XT)=eℓT(b)dΠ(b)∫eℓT(b)dΠ(b)∝eℓT(b)−12∥b∥2H,  b∈V⊗dJ. (9)

Our proofs imply that the denominator in the last expression is finite and non-zero with probability approaching one under the law of as . The map induces an inverse covariance on some linear subspace . [Since , , and our proofs imply in fact that with high probability under the law of .] Using Theorem 9.5.7 in [18] and linearity of , the distribution is thus Gaussian on and the MAP estimate (8) equals the posterior mean .

Concretely, the Gaussian process priors we will use here are constructed from high-dimensional wavelet expansions for of the form:

 bj=∑l≤J∑rσlgl,r,jΦl,r,  gl,r,j∼iidN(0,1),j=1,…,d, (10)

where the form a periodised wavelet basis of , where as in a way to be chosen below, and where the weights govern the regularisation prescribed by the penalty functional. Recall (p.75 in [25]) that the Gaussian process (10) has reproducing kernel Hilbert space (RKHS) inner product of tensor form

 ⟨g1,g2⟩H=d∑j=1∑l≤J∑rσ−2l⟨g1,j,Φl,r⟩2⟨g2,j,Φl,r⟩L2,  g1,g2∈V⊗dJ. (11)

### 2.3 Contraction rates for the posterior distribution and MAP estimate

We now give results concerning the concentration of the posterior measure around the ‘ground truth’ vector field that generated according to the diffusion equation (1). This implies convergence rates of the same order of magnitude for the MAP estimate (see Corollary 1). We denote the ‘true’ invariant measure from Proposition 1 by .

Our first theorem gives a contraction rate in the ‘natural distance’ induced by the statistical experiment, following the general theory [21]. Initially this distance is a ‘random Hellinger semimetric’, and we straightforwardly adapt ideas in [50] to the multi-dimensional setting (see Theorem 7 below). In dimension , the theory of diffusion local times can then be used to deduce from this convergence results in the standard -distances [50, 38, 54], but when such local times are not appropriately defined. Instead, we exploit concentration properties of the high-dimensional ‘design’ matrices induced by the random Hellinger semimetric on (see Lemma 10) to prove the following theorem.

###### Theorem 1.

Let . Suppose . Consider the Gaussian prior from (10) with and for and . Then for and every , as ,

 ΠT(b:∥b−b0∥μ0≥MTεT|XT)→Pb00.

In particular, if then .

Since we wish to perform the primary regularization via the truncation level

rather than the variance scaling

we have assumed that . It is possible to improve the logarithmic factors here and in Theorem 2 below under certain choices of , but we do not pursue this further in the present paper.

From the previous theorem, and imposing slightly stronger conditions on and , one can obtain perturbation approximations of the Laplace transform of

by the Laplace transform of a certain Gaussian distribution (see Proposition

2), and this makes more precise ‘semiparametric’ tools available for the analysis of the posterior distribution. In particular, adapting ideas in [9] (see also [12, 13, 10, 33]) we obtain contraction results in the -norm.

###### Theorem 2.

Let . Suppose . Consider the Gaussian prior from (10) with and for . Assume further that (if ) or (. Then for every ,

 ΠT(b:d∑j=1∥bj−b0,j∥∞≥(logT)δT−s∧[a−(d/2−2)+]2a+d∣∣XT)→Pb00  as T→∞.

In particular, if and , then the rate is .

By Gaussianity of the posterior distribution, the previous theorems translate into convergence rates of the MAP estimates from (8).

###### Corollary 1.

Let . Under the conditions of Theorem 1, for every ,

 ∥^bT−b0∥μ0=OPb0(MTT−a∧s2a+dlogT) as T→∞,

while under the conditions of Theorem 2, for every ,

 ∥^bT−b0∥∞=OPb0(T−s∧[a−(d/2−2)+]2a+d(logT)δ), as T→∞.
###### Proof.

Consider the function

 H(b′)=ΠT(b:∥b−b′∥μ0≤MTεT|XT), b′∈V⊗dJ.

The posterior is a Gaussian measure on the finite-dimensional space , centered at . Since -norm balls centred at the origin are convex symmetric sets, Anderson’s Lemma (Theorem 2.4.5 of [25]) yields that is a maximizer of . Using Exercise 8.3 in [21] with the contraction rate from Theorem 1, we deduce that as . The -rate follows similarly using the contraction rate from Theorem 2. ∎

Up to log-factors, the -rates obtained are minimax optimal for any dimension (the lower bounds follow, e.g., from the asymptotic equivalence results in [16], see also [45, 46]). The -rates are then also optimal whenever , up to log-factors. The sub-optimality of our rate for is related to the presence of common semiparametric ‘bias terms’: the approximation-theoretic Lemma 6 below, which quantitatively improves on previous bounds in [9, 33] of a similar kind to apply also when , gives the desired rate only when . We do not know if this is an artefact of our proof or whether it can be essentially improved.

### 2.4 Bernstein-von Mises theorems for b

We now adopt the framework of nonparametric Bernstein-von Mises theorems from [11, 12], see also the recent contributions [10, 41, 32, 33, 31]. The idea is to obtain a Gaussian approximation for the posterior distribution in a function space in which -convergence rates can be obtained. More precisely, we will view the re-centred and re-scaled posterior draws as (conditionally on ) random vector fields acting linearly on test functions by integration

 (ϕ↦√T∫Td(b−^b(XT)).ϕ:ϕ∈Bρ1∞∣∣XT),

and show that a Bernstein-von Mises theorem holds true uniformly in belonging to any bounded subset of the Besov space , . Equivalently, the limit theorem holds for the probability laws induced by these stochastic processes in the ‘dual’ Banach space . The limit will be the tight Gaussian probability measure on

induced by the centred Gaussian white noise process

with covariance

 EW0(ϕ)W0(ϕ′)=⟨ϕ,ϕ′⟩1/μ0=d∑j=1∫Tdϕj(x)ϕ′j(x)μ−10(x),  ϕ,ϕ′∈Bρ1∞;

its existence is established in the proof of the following theorem. The choice of Besov space parameters is maximal, see Remark 1. By embedding other spaces into one deduces various further limit theorems, e.g., in negative Sobolev spaces . For the applications to estimation of that follow, this particular choice of Besov space is, however, crucial, and restriction to the simpler scale of Sobolev spaces would be insufficient to obtain the results in Section 2.5 below.

For two probability measures in a metric space , define the bounded Lipschitz (BL) metric for weak convergence (p.157 in [19]) by

 βS(τ,τ′)=supF:S→R,∥F∥Lip≤1∣∣∣∫SFd(τ−τ′)∣∣∣,   ∥F∥Lip≡supx∈S|F(x)|+supx≠y,x,y∈S|F(x)−F(y)|e(x,y).
###### Theorem 3.

Let , , and let be such that . Suppose . Let be the Gaussian prior from (10) with and chosen such that . Let be the conditional law , where and is the posterior mean, and let denote the law in of a centred Gaussian white noise process for . Then, as ,

 β(Bρ1∞)∗(~Π(⋅|XT),Nb0)→Pb00.

Convergence of moments in the previous theorem yields the asymptotics of

.

###### Theorem 4.

Under the conditions of the previous theorem, the MAP estimate satisfies, as ,

 √T(^bT−b0)→dNb0 in (Bρ1∞)∗.

A confidence set for

can now be constructed by using the posterior quantiles to create a multiscale ball around

, which we can further intersect with smoothness information as in [11, 12] to obtain confidence bands that are valid and near-optimal also in -diameter.

###### Remark 1.

Just as in the related situations in [11, 32], one shows that the condition cannot be relaxed as otherwise the limiting process does not exist as a tight probability measure in . Moreover the choices give the maximal Besov space (on the bounded domain ) in view of standard embeddings, see [44], Section 3.5.

###### Remark 2.

As remarked at the end of Section 2.3, the presence of semi-parametric bias terms prevents our proof from giving a Bernstein-von Mises theorem when , and also necessitates the assumption in Theorem 3. Similar phenomena occur in nonparametric smoothing problems, see, e.g., Section 3.6 in [24].

### 2.5 CLTs and asymptotic inference for the invariant measure

We now turn to the problem of making inference on the invariant measure . From any vector field we can identify via the elliptic PDE (6) and hence, given and , we can (numerically) solve (6) to generate posterior samples and point estimates for . Of course other much simpler estimators of can be proposed and some discussion of the relative merits of the likelihood-based approach is given in Remark 5.

Using perturbation arguments for the PDE (6) combined with Theorem 3, we obtain the following Bernstein-von Mises theorem for . In the proof we show that the Frechet-derivative of the non-linear map is ‘one-smoothing’, a fact that follows from elliptic regularity theory for PDEs. As a consequence, the constraint from Theorem 3 can be relaxed to when the target of inference is rather than .

For the formulation of the following result, we define spaces

 Br=Br1∞(Td)∩L2(Td),r>0, d≤3,

normed by ; similarly to the previous subsection, the conditional laws induce stochastic processes in the normed dual space via actions

 g↦√T∫Td(μb−μ^bT)g, g∈Br,

and weak convergence occurs in . We note that the inverse of the generator from (4) exists as a well-defined mapping from into , see Lemma 11 in Section 4. We postpone the special case to Theorem 6 below.

###### Theorem 5.

Let and . Under the conditions of Theorem 3, if are the solutions of (6) (invariant measures) associated with a posterior draw and , respectively, then for the conditional law in we have as

 βB∗r(τ(⋅|XT),Nμ0)→Pb00,

where is the tight Borel probability measure on induced by the centred Gaussian process with covariance metric

 EM(g)M(g′)=⟨∇L−1b0[¯g],∇L−1b0[¯g′]⟩μ0,  ¯g=g−∫Tdgdμ0,  g,g′∈Br.

Moreover, as we have

This theorem has various corollaries, upon using the richness of the spaces . For instance since imbeds continuously into on the bounded domain one deduces weak convergence in -probability of the conditional laws in negative Sobolev spaces ; as ,

 βH−r(L(√T(μb−μ^bT)|XT),Nμ0)→Pb00,  r>d/2−1, d=2,3.
###### Remark 3.

Indicator functions of measurable subsets of of finite perimeter define elements of (proved, e.g., as in Lemma 8b in [23]) and we can thus make inference on invariant probabilities . More concretely, suppose is a class of Borel subsets of that have perimeter bounded by a fixed constant . This includes, in particular, all convex subsets of (e.g., Remark 5 in [23]). Then the collection of functions is bounded in , and if

 (μb(C):C∈C),  b∼ΠT(⋅|XT),

is the resulting set-indexed process of posterior invariant probabilities, we deduce from Theorem 5 and the continuous mapping theorem for weak convergence that, as ,

 βℓ∞(C)(L(√T(μb(⋅)−μ^bT(⋅))|XT),Nμ0)→Pb00, and √T(μ^bT−μb0)→dNμ0 in ℓ∞(C), (12)

where is the Banach space of bounded functions on (see Proposition 3.7.24 in [25] for a precise definition of for non-separable ). One further deduces that the estimated invariant probabilities induced by the MAP estimate obey the limit law

 √TsupC∈C|μ^bT(C)−μ0(C)|→dsupC∈C|M(1C)|<∞ a.s., T→∞.

We finally turn to the special case , where the proof of a version of Theorem 5 needs slight adaptations as then includes negative values. We obtain a central limit theorem for the invariant probability densities viewed as sequences of random functions in the space .

For the solution map from before Theorem 5 has a representation with periodic Green kernel such that for all . This follows, e.g., from deriving directly explicit expressions for the solution of the ODE , where .

###### Theorem 6.

Under the conditions of Theorem 3 with and , if are the invariant probability densities associated to , respectively, then

 βC(T)(L(√T(μb−μ^bT)|XT),¯Nb