# Sup-norm adaptive simultaneous drift estimation for ergodic diffusions

We consider the question of estimating the drift and the invariant density for a large class of scalar ergodic diffusion processes, based on continuous observations, in sup-norm loss. The unknown drift $b$ is supposed to belong to a nonparametric class of smooth functions of unknown order. We suggest an adaptive approach which allows one to construct drift estimators attaining minimax optimal sup-norm rates of convergence. In addition, we prove a Donsker theorem for the classical kernel estimator of the invariant density and establish its semiparametric efficiency. Finally, we combine both results and propose a fully data-driven bandwidth selection procedure which simultaneously yields both a rate-optimal drift estimator and an asymptotically efficient estimator of the invariant density of the diffusion. Crucial tools for our investigation are uniform exponential inequalities for empirical processes of diffusions.

## 1 Introduction

The field of nonparametric statistics for stochastic processes has become an integral part of statistics. Owing to their practical relevance as standard models in many areas of applied science, such as genetics, meteorology or financial mathematics, to name only a few, diffusion processes receive special attention in statistical analysis. The first contribution of the present paper is an investigation of adaptive sup-norm convergence rates for a nonparametric Nadaraya–Watson-type drift estimator, based on a continuous record of observations of a diffusion process on the real line. The suggested data-driven bandwidth choice relies on Lepski’s method for adaptive estimation. Characterising upper and lower bounds, we show that the proposed estimation procedure is, in the asymptotic regime, minimax rate-optimal over nonparametric Hölder classes. Remarkably, we impose only very mild conditions on the drift coefficient, not going far beyond standard assumptions that ensure the existence of ergodic solutions of the underlying SDE over the real line. In particular, we allow for unbounded drift coefficients. Secondly, we prove a Donsker-type theorem for the classical kernel estimator of the invariant density in $\ell^\infty(\mathbb{R})$ and establish its semiparametric efficiency. In view of the direct relation between the drift coefficient and the invariant density, it is clear that the corresponding estimation problems are closely connected. In a last step, we combine both tasks and suggest an adaptive bandwidth choice that simultaneously yields an asymptotically efficient, asymptotically normal (in $\ell^\infty(\mathbb{R})$) estimator of the invariant density and the corresponding minimax rate-optimal drift estimator.

So far, results analysing the sup-norm risk in the context of diffusion processes are rather scarce, even though quantifying expected maximal errors is of immense relevance, in particular for practical applications. We therefore start in the basic set-up of continuous observations of a scalar ergodic diffusion process. While the idealised framework of continuous observations may be considered far from reality, it is indisputably of substantial theoretical interest because the statistical results reflect the very nature of the diffusion process and are not influenced by any discretisation errors. Consequently, they serve as relevant benchmarks for further investigations. Moreover, our approach is attractive in the sense that it provides a reasonable starting point for extending the statistical analysis to discrete observation schemes and even multivariate diffusion processes. A second, very concrete motivation for our framework is the idea of bringing together methods from stochastic control and nonparametric statistics. Diffusion processes serve as a prototype model in stochastic optimal control problems which are solved under the long-standing assumption of continuous observations of a process driven by known dynamics. Relaxing this assumption to the framework of continuous observations of a process driven by an unknown drift coefficient, imposing merely mild regularity assumptions, raises interesting questions on how to learn the dynamics by means of nonparametric estimation procedures and to control in an optimal way at the same time. With respect to the statistical methods, these applications typically require optimal bounds on sup-norm errors. The present paper provides these tools for a large class of scalar diffusion processes.

For the evolution of the area of statistical estimation for diffusions up to the mid-2000s, we refer to Gobet et al. (2004) for a very nice summary. The monograph Kutoyants (2004) provides a comprehensive overview of inference for one-dimensional ergodic diffusion processes on the basis of continuous observations, considering pointwise and $L^2$-risk measures. Banon (1978) is commonly mentioned as the first article addressing the question of nonparametric identification of diffusion processes from continuous data. In nonparametric models, asymptotically efficient estimators typically involve the optimal choice of a tuning parameter that depends on the smoothness of the nonparametric class of targets. From a practical perspective, this is not satisfactory because the smoothness is usually not known. One thus aims at adaptive estimation procedures which are based on purely data-driven estimators adapting to the unknown smoothness.

Spokoiny (2000) and Dalalyan (2005) were the first to study adaptive drift estimation in the diffusion model based on continuous observations. Spokoiny (2000) considers pointwise estimation whereas Dalalyan (2005) investigates a weighted $L^2$-norm. Hoffmann (1999) initiated adaptive estimation in a high-frequency setting, proposing a data-driven estimator of the diffusion coefficient based on wavelet thresholding which is rate optimal wrt $L^p$-loss on a compact set. With regard to low-frequency data, we refer to the seminal paper by Gobet et al. (2004). Their objective is inference on the drift and diffusion coefficient of diffusion processes with boundary reflections. The quality of the proposed estimators is measured in the $L^2([a,b])$-distance for any $0<a<b<1$. Like restricting to estimation on arbitrary but fixed compact sets, looking at processes with boundary reflections constitutes a possibility to circumvent highly technical issues that we will face in our investigation of diffusions on the entire real line. Gobet et al. postulate that allowing diffusions on the real line would require introducing a weighting in the risk measure given by the invariant density. This phenomenon will become visible in our results as well. The same weighting function can be found in Dalalyan (2005). Intuitively, it seems natural that the estimation risk would explode without a weighting since the observations of the continuous process during a finite period of time do not contain information about the behaviour outside the compact set in which the path lives. For a more detailed heuristic account of the choice of the weight function for $L^2$-risk, we refer to Remark 4.1 in Dalalyan (2005). Sharp adaptive estimation of the drift vector for multidimensional diffusion processes from continuous observations for the $L^2$- and the pointwise risk has been addressed in Strauch (2015) and Strauch (2016), respectively.

As illustrated, the pointwise and $L^2$-risks are already well understood in different frameworks. The present paper complements these developments by an investigation of the sup-norm risk in the continuous observation scheme. In the low-frequency framework, this strong norm was studied in Söhl and Trabs (2016), who construct both an adaptive estimator of the drift and adaptive confidence bands. They prove a functional central limit theorem for wavelet estimators in a multi-scale space, i.e., considering a weaker norm that still allows one to construct adaptive confidence bands for the invariant density and the drift with optimal $\ell^\infty$-diameter. Still, many challenging questions remain open, and in view of the growing field of applications, there is a clear need for developing techniques and tools for the statistical analysis of stochastic processes under sup-norm risk. Ideally, these tools should include the probabilistic features of the processes and, at the same time, allow for an in-depth analysis of issues such as adaptive estimation in a possibly broad class of models.

A common device for the derivation of adaptive estimation procedures in sup-norm loss are uniform Talagrand-type concentration inequalities and moment bounds for empirical processes based on chaining methods. These tools are made available for a broad class of scalar ergodic diffusion processes in our recent paper Aeckerle and Strauch (2018), in the sequel abbreviated as [AS18]. The concentration inequalities derived therein will serve as the central vehicle for our analysis, and we conjecture that they allow for generalisations to discrete observation schemes, multivariate state variables and even more general Markov processes. Therefore, the approach presented in this paper provides guidance for further statistical investigations of stochastic processes in sup-norm risk.

Besides the frequentist statistical research, the Bayesian approach has attracted considerable interest more recently. In the framework of continuous observations, van der Meulen et al. (2006) consider the asymptotic behaviour of posterior distributions in a general Brownian semimartingale model which, as a special case, includes the ergodic diffusion model. Pokern et al. (2013) investigate a Bayesian approach to nonparametric estimation of the periodic drift of a scalar diffusion from continuous observations and derive bounds on the rate at which the posterior contracts around the true drift in $L^2$-norm. Improvements in terms of these convergence rate results and adaptivity are given in van Waaij and van Zanten (2016). Nonparametric Bayes procedures for estimating the drift of one-dimensional ergodic diffusion models from discrete-time low-frequency data are studied in van der Meulen and van Zanten (2013). The authors give conditions for posterior consistency and verify these conditions for concrete priors. Given discrete observations of a scalar reflected diffusion, Nickl and Söhl (2017) derive (and verify) conditions in the low-frequency sampling regime for prior distributions on the diffusion coefficient and the drift function that ensure minimax optimal contraction rates of the posterior distribution over Hölder–Sobolev smoothness classes in $L^2$-distance, for any

### Basic framework and outline of the paper

With a view to the sup-norm risk, the aim of this paper is to suggest a rate-optimal nonparametric drift estimator, based on continuous observations of an ergodic diffusion process on the real line which is given as the solution of the SDE

$$\mathrm{d}X_t = b(X_t)\,\mathrm{d}t + \sigma(X_t)\,\mathrm{d}W_t, \qquad X_0 = \xi, \quad t > 0, \tag{1.1}$$

with unknown drift function $b$, dispersion $\sigma$ and some standard Brownian motion $W$. The initial value $\xi$ is a random variable independent of $W$. We restrict to the ergodic case where the Markov process $X$ admits an invariant measure, and we denote by $\rho_b$ and $\mu_b$ the invariant density and the associated invariant measure, respectively. Furthermore, we will always consider stationary solutions of (1.1), i.e., we assume $\xi \sim \mu_b$.
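For intuition, a stationary path of (1.1) can be approximated numerically. The following is a minimal sketch using the Euler–Maruyama scheme; it is not part of the paper, which works with exact continuous observations. The Ornstein–Uhlenbeck drift $b(x) = -x$ with $\sigma \equiv 1$, the step size, the time horizon and the function name `simulate_diffusion` are illustrative assumptions.

```python
import numpy as np

def simulate_diffusion(b, sigma, x0, t_end, dt, rng):
    """Euler-Maruyama approximation of dX = b(X) dt + sigma(X) dW on [0, t_end]."""
    n = int(round(t_end / dt))
    x = np.empty(n + 1)
    x[0] = x0
    dw = rng.normal(0.0, np.sqrt(dt), size=n)  # Brownian increments
    for i in range(n):
        x[i + 1] = x[i] + b(x[i]) * dt + sigma(x[i]) * dw[i]
    return x

rng = np.random.default_rng(0)
# Assumed example: Ornstein-Uhlenbeck drift b(x) = -x with sigma = 1,
# started from its stationary law N(0, 1/2) so that the path is (approximately) stationary.
x0 = rng.normal(0.0, np.sqrt(0.5))
path = simulate_diffusion(lambda x: -x, lambda x: 1.0, x0, t_end=200.0, dt=0.01, rng=rng)
```

The discretisation introduces an error that the continuous-observation theory does not account for, so such a path should be read as a stand-in for the idealised data only.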

In the set-up of continuous observations, there is no interest in estimating the volatility since this quantity is identifiable using the quadratic variation of $X$. We thus focus on recovering the unknown drift. We develop our results in the following classical scalar diffusion model.

###### Definition 1.

Let and assume that, for some constants , , satisfies and for all . For fixed constants and , define the set as

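$$\Sigma := \Big\{ b \in \operatorname{Lip}_{\mathrm{loc}}(\mathbb{R}) : |b(x)| \le C(1+|x|),\ \ \forall\, |x| > A:\ \frac{b(x)}{\sigma^2(x)}\,\operatorname{sgn}(x) \le -\gamma \Big\}. \tag{1.2}$$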

Given any $b \in \Sigma$, there exists a unique strong solution of the SDE (1.1) with ergodic properties and invariant density

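$$\rho(x) = \rho_b(x) := \frac{1}{C_{b,\sigma}\,\sigma^2(x)}\,\exp\Big(\int_0^x \frac{2b(y)}{\sigma^2(y)}\,\mathrm{d}y\Big), \qquad x \in \mathbb{R}, \tag{1.3}$$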

with $C_{b,\sigma}$ denoting the normalising constant. Throughout the sequel and for any $b \in \Sigma$, we will denote by $\mathbb{E}_b$ the expected value with respect to the law of $X$ associated with the drift coefficient $b$. The distribution function corresponding to $\rho_b$ and the invariant measure will be denoted by $F_b$ and $\mu_b$, respectively.
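Formula (1.3) can be evaluated numerically once a drift and a dispersion are fixed. The sketch below, offered under stated assumptions rather than as the authors' procedure, uses trapezoidal quadrature on a grid and checks the result on the Ornstein–Uhlenbeck example $b(x) = -x$, $\sigma \equiv 1$, whose invariant law is $\mathcal{N}(0, 1/2)$. The function name `invariant_density` and the grid are hypothetical choices.

```python
import numpy as np

def invariant_density(b, sigma, grid):
    """Evaluate (1.3) on an increasing grid that contains 0, using trapezoidal quadrature."""
    steps = np.diff(grid)
    integrand = 2.0 * b(grid) / sigma(grid) ** 2
    # Cumulative integral from grid[0] to each grid point.
    cum = np.concatenate(([0.0], np.cumsum(0.5 * (integrand[1:] + integrand[:-1]) * steps)))
    # Shift so that the integral starts at 0, as in (1.3); the normalisation absorbs this constant anyway.
    cum -= np.interp(0.0, grid, cum)
    unnormalised = np.exp(cum) / sigma(grid) ** 2
    # Numerical stand-in for the normalising constant C_{b,sigma}.
    norm = np.sum(0.5 * (unnormalised[1:] + unnormalised[:-1]) * steps)
    return unnormalised / norm

grid = np.linspace(-6.0, 6.0, 2401)
rho = invariant_density(lambda x: -x, lambda x: np.ones_like(x), grid)
gauss = np.exp(-grid ** 2) / np.sqrt(np.pi)  # density of N(0, 1/2)
max_err = np.max(np.abs(rho - gauss))
```

The grid has to be wide enough for the tails of the density to be negligible at its endpoints; otherwise the numerical normalising constant is biased.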

Our statistical analysis relies heavily on uniform concentration inequalities for continuous-time analogues of empirical processes of the form $\int_0^t f(X_s)\,\mathrm{d}s$, $f \in \mathcal{F}$, as well as stochastic integrals $\int_0^t f(X_s)\,\mathrm{d}X_s$, $f \in \mathcal{F}$, indexed by some infinite-dimensional function class $\mathcal{F}$. These key devices are provided in our work on concentration inequalities for scalar ergodic diffusions. They are tailor-made for the investigation of sup-norm risk criteria and can be considered as continuous-time substitutes for Talagrand-type concentration inequalities and moment bounds for empirical processes in the classical i.i.d. framework. In [AS18], upper bounds on the expected sup-norm error for a kernel density estimator of the invariant density (that we will use in the present work) are derived as a first statistical application of the developed concentration inequalities. In Section 2, we will present the announced probabilistic tools and statistical results from [AS18] that will be of crucial importance in our subsequent developments. The advantage of the methods proposed in [AS18] is that the martingale approximation approach, which is at the heart of the derivations, yields elementary proofs that work under minimal assumptions on the diffusion process.

#### The estimators

Given continuous observations of a diffusion process as described in Definition 1, first basic statistical questions concern the estimation of the invariant density and the drift coefficient and the investigation of the respective convergence properties. Since (for differentiable ), the question of drift estimation is obviously closely related to estimation of the invariant density and its derivative . For some smooth kernel function with compact support, introduce the standard kernel invariant density estimator

$$\rho_{t,K}(h)(x) := \frac{1}{th}\int_0^t K\Big(\frac{x - X_u}{h}\Big)\,\mathrm{d}u, \qquad x \in \mathbb{R}. \tag{1.4}$$

A natural estimator of the drift coefficient , which relies on the analogy between the drift estimation problem and the model of regression with random design, is given by a Nadaraya–Watson-type estimator of the form

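$$b_{t,K}(h)(x) := \frac{\overline{\rho}_{t,K}(h)(x)}{\rho_{t,K}(t^{-1/2})(x) + \sqrt{\frac{\log t}{t}}\,\exp\big(\sqrt{\log t}\big)}, \tag{1.5}$$

where

$$\overline{\rho}_{t,K}(h)(x) := \frac{1}{th}\int_0^t K\Big(\frac{x - X_s}{h}\Big)\,\mathrm{d}X_s. \tag{1.6}$$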

We recognize the kernel density estimator in the denominator, and we will see that with the proposed (adaptive) bandwidth choice serves as a rate-optimal estimator of . The additive term in the denominator prevents it from becoming small too fast in the tails.
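To make the definitions (1.4) to (1.6) concrete, the following sketch evaluates discretised analogues on a simulated path: the time integral becomes a Riemann sum and the stochastic integral an Itô sum over path increments. This is an illustration under assumptions, not the paper's continuous-observation setting; the triangular kernel, the Ornstein–Uhlenbeck test path, the bandwidth $h = 0.4$ and all function names are hypothetical choices.

```python
import numpy as np

def triangular_kernel(u):
    """Nonnegative, symmetric, Lipschitz kernel supported on [-1/2, 1/2] with integral 1."""
    return np.maximum(0.0, 2.0 - 4.0 * np.abs(u))

def kernel_weights(path, h, xs, kernel):
    # Matrix with entries K((x - X_{s_i}) / h) for every x in xs and every left endpoint s_i.
    return kernel((xs[:, None] - path[None, :-1]) / h)

def kernel_density(path, dt, h, xs, kernel=triangular_kernel):
    """Riemann-sum analogue of (1.4)."""
    t = dt * (len(path) - 1)
    return kernel_weights(path, h, xs, kernel).sum(axis=1) * dt / (t * h)

def rho_bar(path, dt, h, xs, kernel=triangular_kernel):
    """Ito-sum analogue of (1.6): the integrator dX_s becomes the increments of the path."""
    t = dt * (len(path) - 1)
    return kernel_weights(path, h, xs, kernel) @ np.diff(path) / (t * h)

def drift_estimator(path, dt, h, xs, kernel=triangular_kernel):
    """Analogue of (1.5): density bandwidth t^{-1/2} plus the additive term in the denominator."""
    t = dt * (len(path) - 1)
    extra = np.sqrt(np.log(t) / t) * np.exp(np.sqrt(np.log(t)))
    density = kernel_density(path, dt, t ** -0.5, xs, kernel)
    return rho_bar(path, dt, h, xs, kernel) / (density + extra)

# Assumed test data: Euler scheme for dX = -X dt + dW (Ornstein-Uhlenbeck), t = 600.
rng = np.random.default_rng(1)
dt, n = 0.02, 30000
path = np.empty(n + 1)
path[0] = rng.normal(0.0, np.sqrt(0.5))
dw = rng.normal(0.0, np.sqrt(dt), size=n)
for i in range(n):
    path[i + 1] = path[i] - path[i] * dt + dw[i]

xs = np.array([-0.7, 0.0, 0.7])
rho_hat = kernel_density(path, dt, 0.4, xs)
b_hat = drift_estimator(path, dt, 0.4, xs)
```

For moderate $t$ the additive term in the denominator is not small (it is roughly 1.3 for $t = 600$ and decays slowly), so on such a path the sketch recovers the sign and shape of the drift rather than its scale.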

Given a record of continuous observations of a scalar diffusion process $X$ with coefficients as described in Definition 1, the local time estimator $t^{-1}L_t^\bullet(X)$, with $L_t^\bullet(X)$ denoting the local time process of $X$, is available. This is a natural density estimator since diffusion local time can be interpreted as the derivative of the empirical measure. In the past, the local time estimator was studied exhaustively for pointwise estimation and in $L^2$-risk, unlike the sup-norm case. In (Kutoyants, 1998, Sec. 7), weak convergence of the local time estimator to a Gaussian process in $\ell^\infty(\mathbb{R})$ is shown. The same is done for more general diffusion processes in van der Vaart and van Zanten (2005). Having provided the required tools from empirical process theory, upper bounds on all moments of the sup-norm error of $t^{-1}L_t^\bullet(X)$ are proven in [AS18]. Unfortunately, the local time estimator is viewed as not being very feasible in practical applications. In addition, it does not offer straightforward extensions to the case of discretely observed or multivariate diffusions, in sharp contrast to the classical kernel-based density estimator. We therefore advocate the use of the kernel density estimator introduced in (1.4), which can be viewed as a universal approach in nonparametric statistics, exhibiting optimal behaviour over a wide range of models. Furthermore, the kernel density estimator naturally appears in the denominator of our Nadaraya–Watson-type drift estimator defined according to (1.5).

#### Asymptotically efficient density estimation

In the present work, we will complement the sup-norm analysis started in [AS18] with an investigation of the asymptotic distribution of the kernel density estimator in a functional sense. We will prove a Donsker-type theorem for the kernel density estimator, thereby demonstrating that this estimator, for an appropriate choice of bandwidth, behaves asymptotically like the local time estimator. We then go one step further and establish optimality of the limiting distribution, where optimality is understood in the sense of the general convolution theorem 3.11.2 for the estimation of Banach space valued parameters presented in van der Vaart and Wellner (1996). Their theorem states that, for an asymptotically normal sequence of experiments and any regular estimator, the limiting distribution is the convolution of a specific Gaussian process and a noise factor. This Gaussian process is viewed as the optimal limit law, and we refer to it as the semiparametric lower bound. We establish this lower bound and verify that it is achieved by the kernel density estimator. The Donsker-type theorem and the verification of semiparametric efficiency of the kernel-based estimator are the main results on density estimation in the present paper. They are presented in Section 3. Donsker-type theorems can be regarded, to some extent, as frequentist versions of functional Bernstein–von Mises theorems. In particular, our methods and techniques are interesting for both the frequentist and the Bayesian community. The optimal limiting distribution in the sense of the convolution theorem is relevant in the context of Bayesian Bernstein–von Mises theorems in the following sense: If this lower bound is attained, Bayesian credible sets are optimal asymptotic frequentist confidence sets, as argued in Castillo and Nickl (2014); see also (Nickl and Söhl, 2017, p. 12) who address Bernstein–von Mises theorems in the context of compound Poisson processes. Our approach concerning the question of efficiency is based on some recent work by Nickl and Ray on a Bernstein–von Mises theorem for multidimensional diffusions. We thank Richard Nickl for the private communication that motivated the derivation of the semiparametric lower bound in this work.

#### Minimax optimal adaptive drift estimation in sup-norm

The subject of Section 4 is an adaptive scheme for the sup-norm rate-optimal estimation of the drift coefficient. This is the main contribution and initial motivation of the present paper. Our approach for estimating the drift coefficient is based on Lepski’s method for adaptive estimation and the exponential inequalities presented in Section 2. For proving upper bounds on the expected sup-norm loss, we follow closely the ideas developed in Giné and Nickl (2009) for the estimation of the density and the distribution function in the classical i.i.d. setting. We suggest a purely data-driven bandwidth choice for the estimator defined in (1.5) and derive upper bounds on the convergence rate of the expected sup-norm risk uniformly over Hölder balls in Theorem 13, imposing very mild conditions on the drift coefficient. To establish minimax optimality of the rate, we prove lower bounds presented in Theorem 14.
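The concrete selection rule of the paper is given in Section 4 and is not reproduced here. As a generic illustration of the idea behind Lepski's method, the following sketch picks the largest candidate bandwidth whose estimate stays, in sup-norm on an evaluation grid, within a threshold of the estimates at all smaller bandwidths. The function name `lepski_select`, the constant threshold and the toy input are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def lepski_select(estimates, bandwidths, threshold):
    """Generic Lepski-type rule.

    estimates[i] is the estimator at bandwidths[i], evaluated on a common grid;
    bandwidths must be sorted in decreasing order. Returns the largest bandwidth h
    such that sup |estimate(h) - estimate(g)| <= threshold(g) for every smaller g.
    """
    if len(bandwidths) == 0:
        raise ValueError("need at least one candidate bandwidth")
    for i, h in enumerate(bandwidths):
        if all(np.max(np.abs(estimates[i] - estimates[j])) <= threshold(bandwidths[j])
               for j in range(i + 1, len(bandwidths))):
            return h
    # Not reached for non-empty input: the smallest bandwidth passes vacuously.

# Toy input (assumed): estimates carrying a pure bias of size h^2 and no noise.
bandwidths = [1.0, 0.5, 0.25, 0.125]
estimates = [np.full(10, h ** 2) for h in bandwidths]
selected = lepski_select(estimates, bandwidths, threshold=lambda g: 0.2)
```

In practice the threshold would depend on the smaller bandwidth $g$ and on $t$ through a bound on the stochastic error, which is where exponential inequalities of the kind presented in Section 2 enter.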

#### Simultaneous adaptive density and drift estimation

Observing from (1.3) that the invariant density is a transformation of the integrated drift coefficient, it is not surprising that we can carry over the aforementioned approach of Giné and Nickl (2009) (which aims at simultaneous estimation of the distribution function and density in the i.i.d. framework) to the problems of invariant density and drift estimation. We suggest a simultaneous bandwidth selection procedure that allows us to derive a result in the spirit of their Theorem 2. Adjusting the procedure from Section 4 for choosing the bandwidth in a data-driven way, we can find a bandwidth such that the kernel density estimator is an asymptotically efficient estimator in $\ell^\infty(\mathbb{R})$ for the invariant density and, at the same time, the drift estimator attains the minimax optimal rate of convergence wrt sup-norm risk. We formulate this result in Theorem 15.

## 2 Preliminaries

We will investigate the question of adapting to unknown Hölder smoothness. For ease of presentation, we will suppose in the sequel that $\sigma \equiv 1$. The subsequent results, however, can be extended to the case of a general diffusion coefficient fulfilling standard regularity and boundedness assumptions. Recall the definition of the class of drift functions in (1.2).

###### Definition 2.

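Given $\beta, L > 0$, denote by $\mathcal{H}_{\mathbb{R}}(\beta, L)$ the Hölder class (on $\mathbb{R}$) as the set of all functions $f\colon \mathbb{R} \to \mathbb{R}$ which are $l$-times differentiable, $l := \lfloor\beta\rfloor$, and for which

$$\|f^{(k)}\|_\infty \le L \quad \forall\, k = 0, 1, \dots, l, \qquad \|f^{(l)}(\cdot + s) - f^{(l)}(\cdot)\|_\infty \le L|s|^{\beta - l} \quad \forall\, s \in \mathbb{R}.$$

Set

$$\Sigma(\beta, L) = \Sigma(\beta, L, C, A, \gamma) := \big\{ b \in \Sigma(C, A, \gamma, 1) : \rho_b \in \mathcal{H}_{\mathbb{R}}(\beta + 1, L) \big\}. \tag{2.7}$$

Here, $\lfloor\beta\rfloor$ denotes the greatest integer strictly smaller than $\beta$.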

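Considering the class of drift coefficients $\Sigma(\beta, L)$, we use kernel functions satisfying the following assumptions,

$$\begin{aligned} &\bullet\ K\colon \mathbb{R} \to \mathbb{R}_+ \text{ is Lipschitz continuous and symmetric;} \\ &\bullet\ \operatorname{supp}(K) \subseteq [-1/2, 1/2]; \\ &\bullet\ \text{for some } \alpha \ge \beta + 1,\ K \text{ is of order } \lfloor\alpha\rfloor. \end{aligned} \tag{2.8}$$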

The subsequent deep results from [AS18] are fundamental for the investigation of the sup-norm risk. They rely on diffusion-specific properties, in particular the existence of local time, on the one hand, and classical empirical process methods like the generic chaining device on the other hand. In the classical setting of statistical inference based on i.i.d. observations , the analysis of sup-norm risks typically requires investigating empirical processes of the form , indexed by a possibly infinite-dimensional class of functions which, in many cases, are assumed to be uniformly bounded. Analogously, in the current continuous, non-i.i.d. setting, our analysis raises questions about empirical processes of the form

Clearly, the finite variation part of the stochastic integral entails the need to look at unbounded function classes since we do not want to restrict to bounded drift coefficients. Answers are given in [AS18] where we provide exponential tail inequalities both for

imposing merely standard entropy conditions on . As can be seen from the construction of the estimators, we have to exploit these results in order to deal with both empirical diffusion processes induced by the kernel density estimator (see (1.4)) and with stochastic integrals like the estimator of the derivative of the invariant density (see (1.6)). One first crucial auxiliary result for proving the convergence properties of the estimation schemes proposed in Sections 4 and 5 is stated in the following

###### Proposition 3 (Concentration of the estimator $\overline{\rho}_{t,K}(h)$ of $\rho_b'/2$).

Given a continuous record of observations of a diffusion with as introduced in Definition 1 and a kernel satisfying (2.8), define the estimator according to (1.6). Then, there exist constants such that, for any , , ,

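$$\sup_{b \in \Sigma}\Big(\mathbb{E}_b\big[\big\|\overline{\rho}_{t,K}(h) - \mathbb{E}_b[\overline{\rho}_{t,K}(h)]\big\|_\infty^p\big]\Big)^{1/p} \le \phi_{t,h}(p), \qquad \sup_{b \in \Sigma}\mathbb{P}_b\Big(\sup_{x \in \mathbb{R}}\big|\overline{\rho}_{t,K}(h)(x) - \mathbb{E}_b[\overline{\rho}_{t,K}(h)(x)]\big| > e\,\phi_{t,h}(u)\Big) \le e^{-u}, \tag{2.9}$$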

for

 (2.10)
###### Proof.

We apply Theorem 18 in [AS18] to the class

$$\mathcal{F} := \Big\{ K\Big(\frac{x - \cdot}{h}\Big) : x \in \mathbb{Q} \Big\}. \tag{2.11}$$

For doing so, note that , and, for denoting the Lebesgue measure,

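$$\Big\|K\Big(\frac{x - \cdot}{h}\Big)\Big\|^2_{L^2(\lambda)} = \int K^2\Big(\frac{x - y}{h}\Big)\,\mathrm{d}y = h\int K^2(z)\,\mathrm{d}z \le h\,\|K\|^2_{L^2(\lambda)}$$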

and . Due to the Lipschitz continuity of $K$, Lemma 23 in [AS18] yields constants , (only depending on $K$) such that, for any probability measure on and any , . Here and throughout the sequel, given some semi-metric , , , denotes the covering number of wrt , i.e., the smallest number of balls of radius in needed to cover . Since the assumption on the covering numbers of $\mathcal{F}$ in Theorem 18 in [AS18] is fulfilled, Theorem 18 can be applied to $\mathcal{F}$ with and . In particular, there exist positive constants , , and such that

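$$\begin{aligned} \sup_{b \in \Sigma}\Big(\mathbb{E}_b\big[\big\|\overline{\rho}_{t,K}(h) - \mathbb{E}_b[\overline{\rho}_{t,K}(h)]\big\|_\infty^p\big]\Big)^{1/p} &\le \tilde{L}\bigg\{\frac{1}{\sqrt{t}}\bigg\{\Big(\log\Big(\sqrt{\tfrac{h + p\Lambda t}{h}}\Big)\Big)^{3/2} + \Big(\log\Big(\sqrt{\tfrac{h + p\Lambda t}{h}}\Big)\Big)^{1/2} + p^{3/2}\bigg\} \\ &\qquad + \frac{p}{th} + \frac{1}{h}\exp\big(-\tilde{L}_0 t\big) + \frac{1}{\sqrt{th}}\Big(\log\Big(\sqrt{\tfrac{h + p\Lambda t}{h}}\Big)\Big)^{1/2}\bigg\} \le \phi_{t,h}(p), \end{aligned}$$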

and (2.9) immediately follows. ∎

The uniform concentration results for stochastic integrals from [AS18] further allow us to prove the following result on the sup-norm distance between the local time and the kernel density estimator. The exponential inequality for this distance will be the key to transferring the Donsker theorem for the local time to the kernel density estimator. It can also be interpreted as a result on the uniform approximation error of the scaled local time by its smoothed version, noting that $\rho_{t,K}(h)$ can be seen as a convolution of a mollifier and a scaled version of diffusion local time. The next result parallels Theorem 1 in Giné and Nickl (2009), which states a subgaussian inequality for the distribution function in the classical i.i.d. set-up. It serves as an important tool for the analysis of the proposed adaptive scheme for simultaneous estimation of the distribution function and the associated density in Giné and Nickl (2009). The subsequent proposition plays an analogous role for the adaptive scheme for simultaneous estimation of the invariant density and the drift coefficient presented in Section 5.

###### Proposition 4 (Theorem 15 in [AS18]).

Given a diffusion with , for some , consider some kernel function fulfilling (2.8) and such that . Then, there exist positive constants and such that, for all , where

 λ0(h) + √thβ+1L2⌊β+1⌋!∫|K(v)vβ+1|dv],

and any ,

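$$\sup_{b \in \Sigma(\beta, L)}\mathbb{P}_b\Big(\sqrt{t}\,\Big\|\rho_{t,K}(h) - \frac{L_t^\bullet(X)}{t}\Big\|_\infty > \lambda\Big) \le \exp\Big(-\frac{\Lambda_1\lambda}{\sqrt{h}}\Big).$$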

The very first step of our approach to sup-norm adaptive drift estimation consists in estimating the invariant density in sup-norm loss. Corresponding upper bounds on the sup-norm risk have been investigated in [AS18]. We next cite these bounds for the local time estimator and the kernel density estimator. Our estimation procedure does not involve the local time density estimator. For the sake of presenting a complete statistical sup-norm analysis of ergodic scalar diffusions based on continuous observations, we still include it here.

###### Lemma 5 (Moment bound on the supremum of centred diffusion local time, Corollary 16 of [AS18]).

Let be as in Definition 1. Then, there are positive constants such that, for any ,

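$$\sup_{b \in \Sigma(C, A, \gamma, 1)}\Big(\mathbb{E}_b\Big[\Big\|\frac{L_t^\bullet(X)}{t} - \rho_b\Big\|_\infty^p\Big]\Big)^{1/p} \le \zeta\Big(\frac{p}{t} + \frac{1}{\sqrt{t}}\big(1 + \sqrt{p} + \sqrt{\log t}\big) + t\,e^{-\zeta_1 t}\Big).$$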

In [AS18], we have also shown the analogous fundamental result for the sup-norm risk of the kernel density estimator. The following upper bounds will be essential for deriving convergence rates of the Nadaraya–Watson-type drift estimator (see (1.5)).

###### Proposition 6 (Concentration of the kernel invariant density estimator, Corollary 14 of [AS18]).

Let be a diffusion with , for some , and let be a kernel function fulfilling (2.8). Given some positive bandwidth , define the estimator according to (1.4). Then, there exist positive constants such that, for any , ,

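$$\sup_{b \in \Sigma(\beta, L)}\Big(\mathbb{E}_b\big[\|\rho_{t,K}(h) - \rho_b\|_\infty^p\big]\Big)^{1/p} \le \psi_{t,h}(p), \tag{2.12}$$

$$\sup_{b \in \Sigma(\beta, L)}\mathbb{P}_b\big(\|\rho_{t,K}(h) - \rho_b\|_\infty \ge e\,\psi_{t,h}(u)\big) \le e^{-u},$$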

for

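$$\psi_{t,h}(u) := \frac{\nu_1}{\sqrt{t}}\bigg\{1 + \sqrt{\log\Big(\frac{1}{\sqrt{h}}\Big)} + \sqrt{\log(ut)} + \sqrt{u}\bigg\} + \frac{\nu_2 u}{t} + \frac{1}{h}e^{-\nu_3 t} + \frac{L h^{\beta+1}}{\lfloor\beta + 1\rfloor!}\int|v^{\beta+1}K(v)|\,\mathrm{d}v. \tag{2.13}$$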

Specifying to , an immediate consequence of (2.12) is the convergence rate for the risk of the kernel density estimator . Note that we obtain the parametric convergence rate for the bandwidth choice which in particular does not depend on the (typically unknown) order of smoothness of the drift coefficient. Thus, there is no extra effort needed for adaptive estimation of the invariant density. This phenomenon appears only in the scalar setting.
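As a worked check of why no adaptation is needed for the density, one can plug the bandwidth $h = t^{-1/2}$ and $u = 1$ into (2.13). Both the choice $h = t^{-1/2}$ (suggested by the denominator of (1.5)) and the reading of (2.13) as $\psi_{t,h}(u) = \nu_1 t^{-1/2}\{1 + \sqrt{\log(1/\sqrt{h})} + \sqrt{\log(ut)} + \sqrt{u}\} + \nu_2 u/t + h^{-1}e^{-\nu_3 t} + L h^{\beta+1}(\lfloor\beta+1\rfloor!)^{-1}\int|v^{\beta+1}K(v)|\,\mathrm{d}v$ are assumptions of this sketch; $t > 1$ is assumed so that the logarithms are positive.

```latex
\begin{align*}
\psi_{t,t^{-1/2}}(1)
 &= \frac{\nu_1}{\sqrt{t}}\Big\{1+\sqrt{\log\big(t^{1/4}\big)}+\sqrt{\log t}+1\Big\}
    +\frac{\nu_2}{t}+\sqrt{t}\,e^{-\nu_3 t}
    +\frac{L\,t^{-(\beta+1)/2}}{\lfloor\beta+1\rfloor!}\int|v^{\beta+1}K(v)|\,\mathrm{d}v \\
 &= \frac{\nu_1}{\sqrt{t}}\Big\{2+\tfrac{3}{2}\sqrt{\log t}\Big\}
    +\frac{\nu_2}{t}+\sqrt{t}\,e^{-\nu_3 t}
    +t^{-1/2}\cdot t^{-\beta/2}\,\frac{L}{\lfloor\beta+1\rfloor!}\int|v^{\beta+1}K(v)|\,\mathrm{d}v .
\end{align*}
```

For $\beta > 0$, the bias term is $o(t^{-1/2})$ and the exponential term is negligible, so under this reading the bound is of order $\sqrt{\log t / t}$ whatever the value of $\beta$.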

## 3 Donsker-type theorems and asymptotic efficiency of kernel invariant density estimators

This section is devoted to the study of weak convergence properties of the kernel density estimator $\rho_{t,K}(h)$. Using the exponential inequality for the sup-norm distance between the kernel density estimator and the local time estimator (Proposition 4 from Section 2), we derive a uniform CLT for the kernel invariant density estimator. In particular, the result holds for the ‘universal’ bandwidth choice $h = t^{-1/2}$. Furthermore, we use the general theory developed in van der Vaart and Wellner (1996) for establishing asymptotic semiparametric efficiency of the estimator in $\ell^\infty(\mathbb{R})$.

### 3.1 Donsker-type theorems

The exponential inequality for the sup-norm difference of the kernel and the local time density estimator stated in Proposition 4 allows us to transfer to the kernel density estimator an existing Donsker theorem for the local time density estimator presented in van der Vaart and van Zanten (2005).

###### Proposition 7.

Given a diffusion with , consider some kernel function fulfilling (2.8). Define the estimator according to (1.4) with bandwidth satisfying , as Then,

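$$\sqrt{t}\,\big(\rho_{t,K}(h) - \rho_b\big) \overset{\mathbb{P}_b}{\Longrightarrow} H, \qquad \text{as } t \to \infty,$$

in $\ell^\infty(\mathbb{R})$, where $H$ is a centred Gaussian random map with covariance structure

$$\mathbb{E}[H(x)H(y)] = 4\,m(\mathbb{R})\,\rho_b(x)\rho_b(y)\int_{\mathbb{R}}\big(\mathbf{1}\{[x, \infty)\} - F_b\big)\big(\mathbf{1}\{[y, \infty)\} - F_b\big)\,\mathrm{d}s, \tag{3.14}$$

with $m$ and $s$ denoting the speed measure and the scale function of $X$, respectively.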

###### Proof.

We apply Proposition 4 to show that

$$\sqrt{t}\,\Big\|\rho_{t,K}(h) - \frac{L_t^\bullet(X)}{t}\Big\|_\infty = o_{\mathbb{P}_b}(1). \tag{3.15}$$

There exists a constant such that fulfills the assumption , for any bandwidth satisfying the above conditions. Since , for any and sufficiently large,

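$$\begin{aligned} \mathbb{P}_b\Big(\sqrt{t}\,\big\|\rho_{t,K}(h) - t^{-1}L_t^\bullet(X)\big\|_\infty > \epsilon\Big) &\le \mathbb{P}_b\Big(\sqrt{t}\,\big\|\rho_{t,K}(h) - t^{-1}L_t^\bullet(X)\big\|_\infty > \lambda_t\Big) \le \exp\Big(-\frac{\Lambda_1\lambda_t}{\sqrt{h}}\Big) \\ &= \exp\Big(-\Lambda_1 C\big((1 + \log t) + \sqrt{t}\,h^{\beta + \frac12}\big)\Big) \longrightarrow 0, \qquad \text{as } t \to \infty. \end{aligned}$$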

Consequently, (3.15) holds, and Lemma 17 from Section A gives the assertion. ∎

###### Remark 8.

Donsker-type results turn out to be useful far beyond the question of the behaviour of the density estimator wrt the sup-norm as a specific loss function. In particular, they provide immediate access to solutions of statistical problems concerned with functionals of the invariant density $\rho_b$. Clearly, this includes the estimation of bounded, linear functionals of $\rho_b$ such as integral functionals, to name just one common class. As an instance, Kutoyants and Yoshida (2007) study the estimation of moments $\mu_b(G) := \int G\,\mathrm{d}\mu_b$ for known functions $G$. The target is estimated by the empirical moment estimator $t^{-1}\int_0^t G(X_s)\,\mathrm{d}s$, and it is shown that this estimator is asymptotically efficient in the sense of local asymptotic minimaxity (LAM) for polynomial loss functions. Parallel results can directly be deduced from the Donsker theorem. Defining the linear functional $\Phi_G\colon f \mapsto \int G(x)f(x)\,\mathrm{d}x$, the target can be written as $\Phi_G(\rho_b)$, and the empirical moment estimator equals the linear functional applied to the local time estimator, that is,

$$\frac{1}{t}\int_0^t G(X_s)\,\mathrm{d}s = \Phi_G\big(t^{-1}L_t^\bullet(X)\big).$$

Thus, if is bounded, it follows from the results of van der Vaart and van Zanten (2005) (see Lemma 17 in Section A) and from Proposition 7, respectively, that

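$$\sqrt{t}\,\Big(\Phi_G\big(t^{-1}L_t^\bullet(X)\big) - \Phi_G(\rho_b)\Big) \quad \text{as well as} \quad \sqrt{t}\,\Big(\Phi_G\big(\rho_{t,K}(t^{-1/2})\big) - \Phi_G(\rho_b)\Big)$$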

are asymptotically normal with the limiting distribution . Optimality of in the sense of the convolution theorem 3.11.2 in van der Vaart and Wellner (1996) will be shown in the next section.

Not only linear, but also nonlinear functions that allow for suitable linearisations can be analysed, once the required CLTs and optimal rates of convergence are given. This is related to the so-called plug-in property introduced in Bickel and Ritov (2003). The suggested connection is explained in more detail in Giné and Nickl (2009).

### 3.2 Semiparametric lower bounds for estimation of the invariant density

We now want to analyse semiparametric optimality aspects of the limiting distribution in Proposition 7 as treated in Chapter 3.11 in van der Vaart and Wellner (1996) or Chapter 25 of van der Vaart (1998). To this end, we first look at lower bounds.

Denote by $\mathbb{P}_{t,h}$ the law of a diffusion process $Y$ with perturbed drift coefficient $b + h/\sqrt{t}$, given as a solution of the SDE

$$dY_s = \Big(b(Y_s) + \frac{h(Y_s)}{\sqrt{t}}\Big)\,ds + dW_s, \quad Y_0 = X_0,$$

and denote by the associated invariant density. Set , and define the set of experiments

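$$\big\{C(0,t),\ \mathcal{B}(C(0,t)),\ \mathbb{P}_{t,h} : h \in G\big\}, \quad t > 0, \tag{3.16}$$

with $G$ viewed as a linear subspace of $L^2(\mu_b)$. By construction and Girsanov's Theorem (cf. (Liptser and Shiryaev, 2001, Theorem 7.18)), the log-likelihood is given as

$$\log\Big(\frac{d\mathbb{P}_{t,h}}{d\mathbb{P}_b}\Big)(X^t) = \frac{1}{\sqrt{t}}\int_0^t h(X_s)\,dW_s - \frac{1}{2t}\int_0^t h^2(X_s)\,ds = \Delta_{t,h} - \frac12\|h\|^2_{L^2(\mu_b)} + o_{\mathbb{P}_b}(1),$$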

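where $\Delta_{t,h} := t^{-1/2}\int_0^t h(X_s)\,dW_s$. Here, the last equality follows from the law of large numbers for ergodic diffusions, and the CLT immediately gives $\Delta_{t,h} \Longrightarrow N\big(0, \|h\|^2_{L^2(\mu_b)}\big)$ under $\mathbb{P}_b$. Thus, (3.16) is an asymptotically normal model. Lemma 18 from Section A now implies that the sequence $\Psi(\mathbb{P}_{t,h})$, $h \in G$, is regular (or differentiable). In fact, it holds that

$$\sqrt{t}\,\big(\Psi(\mathbb{P}_{t,h}) - \Psi(\mathbb{P}_b)\big) \underset{t\to\infty}{\longrightarrow} A'h \quad \text{in } \ell^\infty(\mathbb{R}), \text{ for any } h \in G, \tag{3.17}$$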

for the continuous, linear operator

$$A' \colon \big(G, L^2(\mu_b)\big) \to \big(\ell^\infty(\mathbb{R}), \|\cdot\|_\infty\big), \quad h \mapsto 2\rho_b\big(H - \mu_b(H)\big),$$

with $H(x) := \int_0^x h(v)\,dv$. We want to determine the optimal limiting distribution for estimating the invariant density in $\ell^\infty(\mathbb{R})$ in the sense of the convolution theorem 3.11.2 in van der Vaart and Wellner (1996). Since the distribution of a Gaussian process in $\ell^\infty(\mathbb{R})$ is determined by its covariance structure, we need to find the Riesz representer for pointwise evaluations at any point of $\mathbb{R}$. Stated differently, we need to find the Cramér–Rao lower bound for pointwise estimation of $\rho_b(y)$, $y \in \mathbb{R}$. Speaking about one-dimensional targets in $\mathbb{R}$ such as point evaluations or linear functionals of the invariant density, we refer to the variance of the optimal limiting distribution from the convolution theorem as the semiparametric Cramér–Rao lower bound. This quantity is a lower bound for the variance of any limiting distribution of a regular estimator.
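The differentiability statement (3.17) can be checked numerically in a simple case. The sketch below rests on assumptions that are not taken from the text: the drift is $b(x) = -x$, the perturbation is $h(x) = x/(1+x^2)$, $H$ is taken to be the primitive of $h$ vanishing at zero, and the invariant density is computed from the standard formula $\rho(x) \propto \exp(2\int_0^x \text{drift})$ for a scalar diffusion with unit diffusion coefficient. It compares $\sqrt{t}(\rho_{t,h} - \rho_b)$ with $2\rho_b(H - \mu_b(H))$ on a grid.

```python
import numpy as np

def trapezoid(y, x):
    """Plain trapezoidal rule on the grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def invariant_density(drift_primitive, x):
    """Density proportional to exp(2 * primitive of the drift), normalised on the grid."""
    w = np.exp(2.0 * drift_primitive)
    return w / trapezoid(w, x)

x = np.linspace(-6.0, 6.0, 4001)
B = -0.5 * x**2                  # primitive of the assumed drift b(x) = -x
H = 0.5 * np.log1p(x**2)         # primitive of the assumed direction h(x) = x / (1 + x^2)

rho_b = invariant_density(B, x)
mu_b_H = trapezoid(H * rho_b, x)
A_prime_h = 2.0 * rho_b * (H - mu_b_H)   # candidate limit from the display above

def scaled_difference(t):
    """sqrt(t) * (rho_{t,h} - rho_b) for the drift b + h / sqrt(t)."""
    rho_th = invariant_density(B + H / np.sqrt(t), x)
    return np.sqrt(t) * (rho_th - rho_b)

errors = {t: float(np.max(np.abs(scaled_difference(t) - A_prime_h)))
          for t in (10**2, 10**4, 10**6)}
print(errors)
```

The sup-norm distance to the candidate limit shrinks as $t$ grows, in line with a first-order expansion of the perturbed invariant density. Since adding a constant to $H$ does not change $H - \mu_b(H)$, the choice of the lower integration limit in $H$ is immaterial for this check.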

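Our first step towards this goal is to look at integral functionals which we will use to approximate the pointwise evaluations. For any continuous, linear functional $b^* \colon \ell^\infty(\mathbb{R}) \to \mathbb{R}$, we can infer from (3.17) that

$$\sqrt{t}\,\big(b^*(\Psi(\mathbb{P}_{t,h})) - b^*(\Psi(\mathbb{P}_b))\big) \underset{t\to\infty}{\longrightarrow} b^*(A'h) \quad \text{in } \mathbb{R}, \text{ for all } h \in G.$$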

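Considering $b^*(f) = \int g(x) f(x)\,dx$ for a function $g$, and letting $\Phi_g := b^* \circ \Psi$, this becomes

$$\sqrt{t}\,\big(\Phi_g(\mathbb{P}_{t,h}) - \Phi_g(\mathbb{P}_b)\big) \underset{t\to\infty}{\longrightarrow} \int g(x)\,(A'h)(x)\,dx.$$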

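The limit defines a continuous, linear map $\kappa$ with representation

$$\kappa(h) = \int g(x)\,(A'h)(x)\,dx = \int 2g(x)\big(H(x) - \mu_b(H)\big)\rho_b(x)\,dx = 2\langle g_c, H_c\rangle_{\mu_b} = 2\langle L_b L_b^{-1} g_c, H_c\rangle_{\mu_b} = -\langle \partial L_b^{-1} g_c, h\rangle_{\mu_b}. \tag{3.18}$$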

Here and throughout the sequel, $L_b$ denotes the generator of the diffusion process with drift coefficient $b$, i.e., $L_b f = \tfrac12 f'' + b f'$ for any $f$ in its domain, and $f_c := f - \mu_b(f)$ denotes the centered version of $f$, for any function $f$. Note that $L_b^{-1} g_c$ is well-defined due to the following lemma whose proof is deferred to Section A.

###### Lemma 9.

Let , and set

Then, $g_c$ is contained in the image of the generator $L_b$, and

$$L_b^{-1}(g_c) = T(z) := \int_0^z \int 2g(x)\rho_b(x)h(u,x)\,dx\,du.$$

In particular,

$$\iint g(x)H(x,y)g(y)\,dy\,dx = \big\|\partial L_b^{-1}(g_c)\big\|^2_{L^2(\mu_b)},$$

where , for the Gaussian process fulfilling (3.14).
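For orientation, the final equality in (3.18) is an integration by parts against the invariant density. The following short derivation is a sketch under assumptions that this extract does not state in full: the diffusion coefficient is one, so that $\rho_b' = 2b\rho_b$; $H$ is a primitive of $h$; and $f$ is smooth enough for the boundary terms to vanish.

```latex
% Sketch: unit diffusion coefficient, H' = h, vanishing boundary terms (all assumed).
\begin{align*}
2\rho_b\, L_b f &= \rho_b f'' + 2 b \rho_b f' = (\rho_b f')', \\
2\langle L_b f, H_c\rangle_{\mu_b}
  &= \int (\rho_b f')'(x)\, H_c(x)\, dx
   = -\int f'(x)\, h(x)\, \rho_b(x)\, dx
   = -\langle \partial f, h\rangle_{\mu_b}.
\end{align*}
% Taking f = L_b^{-1} g_c gives
% 2 <g_c, H_c>_{mu_b} = - <\partial L_b^{-1} g_c, h>_{mu_b}.
```

Read this way, $-\partial L_b^{-1} g_c$ plays the role of the Riesz representer of $\kappa$ in $L^2(\mu_b)$, which is the object entering the convolution theorem.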

We conclude by means of Theorem 3.11.2 in van der Vaart and Wellner (1996) that the Cramér–Rao lower bound for estimation of $\Phi_g(\mathbb{P}_b)$ is given by $\|\partial L_b^{-1} g_c\|^2_{L^2(\mu_b)}$. Using an approximation procedure, it can then be shown that the Cramér–Rao lower bound for pointwise estimation of $\rho_b(y)$ is defined via

$$\mathrm{CR}(y) := \big\|2\rho_b(y)h(\cdot,y)\big\|^2_{L^2(\mu_b)}, \quad \text{for any } y \in \mathbb{R}. \tag{3.19}$$

For details, see Proposition 19 in Section A. The same arguments apply to estimation of linear combinations , and the corresponding Cramér–Rao bound reads . It follows that the covariance of the optimal Gaussian process in the convolution theorem is given as

$$\mathrm{CR}(x,y) := 4\rho_b(x)\rho_b(y)\int h(z,x)h(z,y)\rho_b(z)\,dz, \quad x, y \in \mathbb{R}. \tag{3.20}$$

### 3.3 Semiparametric efficiency of the kernel density estimator

Having characterised the optimal limit distribution in the previous section, it is natural to ask in a next step for an efficient estimator of linear functionals of the invariant density such as pointwise estimation, functionals of the form $\int g(x)\rho_b(x)\,dx$ or, even more, for estimation of $\rho_b$ in $\ell^\infty(\mathbb{R})$.

###### Definition 10.

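An estimator $\hat\rho_t$ of the invariant density is called asymptotically efficient in $\ell^\infty(\mathbb{R})$ if the estimator is regular, i.e.,

$$\sqrt{t}\,\big(\hat\rho_t - \Psi(\mathbb{P}_{t,h})\big) \overset{\mathbb{P}_{t,h}}{\Longrightarrow} L, \quad \text{as } t \to \infty, \text{ for any } h \in G,$$

for a fixed, tight Borel probability measure $L$ in $\ell^\infty(\mathbb{R})$, and if $L$ is the law of a centered Gaussian process with covariance structure as specified in (3.20), i.e., $\hat\rho_t$ achieves the optimal limiting distribution.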

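Given an asymptotically efficient estimator $\hat\rho_t$ in $\ell^\infty(\mathbb{R})$ and any bounded, linear functional $b^* \colon \ell^\infty(\mathbb{R}) \to \mathbb{R}$, efficiency of the estimator $b^*(\hat\rho_t)$ for estimation of $b^*(\rho_b)$ then immediately follows. Our next result shows that estimation via the kernel density estimator $\rho_{t,K}(h)$ for the universal bandwidth choice $h = t^{-1/2}$ is suitable for the job. Its proof is given in Section A.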

###### Theorem 11.

The invariant density estimator $\rho_{t,K}(t^{-1/2})$ defined according to (1.4) is an asymptotically efficient estimator in $\ell^\infty(\mathbb{R})$.
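To make the universal bandwidth choice concrete, the following minimal sketch evaluates a kernel invariant density estimator with bandwidth $t^{-1/2}$ on a simulated path. Everything specific here is an assumption for illustration: definition (1.4) is not reproduced in this section, so the estimator is taken in the usual form $(th)^{-1}\int_0^t K((x - X_s)/h)\,ds$ with a Gaussian kernel, the integral is replaced by a Riemann sum, and the diffusion is an Ornstein-Uhlenbeck process whose invariant density $e^{-x^2}/\sqrt{\pi}$ is known.

```python
import numpy as np

def simulate_ou(t_end, dt, rng):
    """Exact-in-law discretisation of dX = -X dt + dW, started in stationarity."""
    n = int(round(t_end / dt))
    a = np.exp(-dt)
    s = np.sqrt((1.0 - a**2) / 2.0)
    x = np.empty(n)
    x[0] = rng.normal(0.0, np.sqrt(0.5))
    z = rng.normal(size=n - 1)
    for k in range(n - 1):
        x[k + 1] = a * x[k] + s * z[k]
    return x

def kernel_density_estimator(path, dt, h, grid):
    """Riemann sum for (1/(t*h)) * int_0^t K((x - X_s)/h) ds with Gaussian K."""
    t = dt * len(path)
    est = np.empty(len(grid))
    for i, x in enumerate(grid):
        u = (x - path) / h
        est[i] = np.sum(np.exp(-0.5 * u**2)) * dt / (np.sqrt(2.0 * np.pi) * t * h)
    return est

rng = np.random.default_rng(1)
t_end, dt = 2000.0, 0.02
path = simulate_ou(t_end, dt, rng)
grid = np.linspace(-2.0, 2.0, 41)

h_universal = t_end ** (-0.5)              # bandwidth choice t^{-1/2}
rho_hat = kernel_density_estimator(path, dt, h_universal, grid)
rho_true = np.exp(-grid**2) / np.sqrt(np.pi)
sup_err = float(np.max(np.abs(rho_hat - rho_true)))
print(sup_err)
```

The bandwidth does not depend on the smoothness of the drift, which is the point of the universal choice. The sketch only illustrates the construction on discretised data and says nothing about efficiency.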

###### Remark 12.
1. From the proof of Theorem 11, it can be inferred that the local time estimator $t^{-1}L_t^{\bullet}(X)$ is an asymptotically efficient estimator as well.

2. In terms of earlier research on efficient estimation of the density as a function in $\ell^\infty(\mathbb{R})$, we shall mention Kutoyants (1998) and Negri (2001). These works deal with the efficiency of the local time estimator in the LAM sense for certain classes of loss functions. Subject of (Kutoyants, 1998, Section 8) are risks defined via some finite measure on $\mathbb{R}$, for estimators of $\rho_b$, whereas Negri complements this work for risks defined via a class of bounded, positive functions. Of course, the distribution appearing in the lower bound corresponds to the optimal distribution in the sense of the convolution theorem that we derived. The derivation of the lower bound in Kutoyants (1998) is based on the van Trees inequality as established in Gill and Levit (1995) as an alternative to the classical approach relying on Hájek–Le Cam theory. On the other hand, Negri's method follows Millar (1983) and makes use of the idea of convergence of experiments originally provided by Le Cam. Neither work establishes the optimal distribution in the sense of a convolution theorem, nor any asymptotic efficiency result in $\ell^\infty(\mathbb{R})$ for the kernel density estimator.

## 4 Minimax optimal adaptive drift estimation wrt sup-norm risk

We now turn to the original question of estimating the drift coefficient in a completely data-driven way. The aim of this section is to suggest a scheme for rate-optimal choice of the bandwidth $h$, based on a continuous record of observations of a diffusion as introduced in Definition 1, with optimality considered in terms of the sup-norm risk. Since we stick to the continuous framework, our previous concentration results are directly applicable, allowing, e.g., for the straightforward derivation of upper bounds on the variance of the estimator $\bar\rho_{t,K}(h)$ of the order

$$\bar\sigma^2(h,t) := \frac{(\log(t/h))^3}{t} + \frac{\log(t/h)}{th}. \tag{4.21}$$

Standard arguments provide, for any $b \in \Sigma(\beta, L)$, bounds $B(h)$ on the associated bias. In case of known smoothness $\beta$, one can then easily derive the optimal bandwidth choice by balancing the components of the bias-variance decomposition

$$\sup_{b \in \Sigma(\beta, L)} \mathbb{E}_b\bigg[\Big\|\bar\rho_{t,K}(h) - \frac{\rho_b'}{2}\Big\|_\infty\bigg] \le B(h) + K\bar\sigma(h,t),$$

which determines the rate-optimal bandwidth for known $\beta$. In order to remove the (typically unknown) order of smoothness $\beta$ from the bandwidth choice, we need to find a data-driven substitute for the upper bound on the bias in the balancing process. Heuristically, this is the idea behind the Lepski-type selection procedure suggested in (4.24) below.
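The balancing step for known smoothness can be sketched numerically. The code below is an illustration under assumptions that are not taken from the text: the bias bound is modelled as $B(h) = L h^{\beta}$, the constants $L$ and $K$ are set to one, and the minimisation runs over a geometric grid $\eta^{-k}$ restricted to $h > (\log t)^2/t$. The function names are hypothetical.

```python
import numpy as np

def sigma_bar(h, t):
    """Square root of the variance proxy (4.21)."""
    return np.sqrt(np.log(t / h) ** 3 / t + np.log(t / h) / (t * h))

def oracle_bandwidth(t, beta, L=1.0, K=1.0, eta=1.1):
    """Minimise the assumed bound L*h**beta + K*sigma_bar(h, t) over a geometric grid."""
    threshold = np.log(t) ** 2 / t
    hs = []
    k = 1
    while eta ** (-k) > threshold:     # keep only bandwidths above the lower cut-off
        hs.append(eta ** (-k))
        k += 1
    if not hs:
        raise ValueError("t is too small for the candidate grid to be non-empty")
    hs = np.array(hs)
    bound = L * hs ** beta + K * sigma_bar(hs, t)
    return float(hs[np.argmin(bound)])

h_small_t = oracle_bandwidth(1e3, beta=1.0)
h_large_t = oracle_bandwidth(1e6, beta=1.0)
h_smoother = oracle_bandwidth(1e6, beta=2.0)
print(h_small_t, h_large_t, h_smoother)
```

Under these assumptions the balancing bandwidth shrinks as $t$ grows and increases with $\beta$. This oracle depends on $\beta$ through the bias term, which is exactly the dependence a data-driven selection rule has to avoid.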

#### 1.)

Specify the discrete grid of candidate bandwidths

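$$\mathcal{H} \equiv \mathcal{H}_t := \Big\{h_k = \eta^{-k} : k \in \mathbb{N},\ \eta^{-k} > \frac{(\log t)^2}{t}\Big\}, \quad \eta > 1 \text{ arbitrary}, \tag{4.22}$$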

and define , and

 ˜