# Schwarz type model selection for ergodic stochastic differential equation models

We lay a theoretical foundation for model comparison of ergodic stochastic differential equation (SDE) models and extend the applicable scope of the conventional Bayesian information criterion. In contrast to previous studies, we allow the candidate models to be misspecified, and we consider SDEs driven by either a Wiener process or a pure-jump Lévy process. Based on the asymptotic behavior of the marginal quasi-log likelihood, we propose Schwarz type statistics and a stepwise model selection procedure, and we prove the model selection consistency of the proposed statistics with respect to an optimal model. We conduct some numerical experiments, and they support our theoretical findings.


## 1. Introduction

We suppose that the data-generating process X is defined on a stochastic basis (Ω, F, (F_t)_{t≥0}, P) and that it is the solution of the one-dimensional stochastic differential equation written as:

 (1.1) dXt=A(Xt)dt+C(Xt−)dZt,

where:

• The coefficients A and C are Lipschitz continuous.

• The driving noise Z is a standard Wiener process or a pure-jump Lévy process satisfying that for any q > 0,

 (1.2) E[Z_1] = 0, E[Z_1^2] = 1, E[|Z_1|^q] < ∞.
• The initial variable X_0 is independent of Z, and

 F_t = σ(X_0) ∨ σ(Z_s | s ≤ t).

As the observations from X, we consider the discrete but high-frequency sample (X_{t_j^n})_{j=0}^n with

 t_j^n := j h_n, T_n := n h_n → ∞, n h_n^2 → 0.
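For readers who wish to reproduce this sampling design numerically, the following is a minimal Python sketch (not part of the paper) that generates such high-frequency data by an Euler-type scheme; the concrete coefficients A(x) = −x/2 and C(x) ≡ 1 are hypothetical stand-ins, and the driving noise is taken to be a Wiener process.

```python
import numpy as np

def sample_euler(n, h, A, C, x0=0.0, seed=None):
    """Euler-type high-frequency sample (X_{t_j})_{j=0..n} with t_j = j*h,
    for dX_t = A(X_t) dt + C(X_t-) dZ_t, driven here by a Wiener process."""
    rng = np.random.default_rng(seed)
    x = np.empty(n + 1)
    x[0] = x0
    dW = rng.normal(0.0, np.sqrt(h), size=n)   # increments of Z = W
    for j in range(n):
        x[j + 1] = x[j] + A(x[j]) * h + C(x[j]) * dW[j]
    return x

# Sampling design T_n = n*h_n -> infinity with n*h_n^2 -> 0, e.g. h_n = n^{-2/3}:
n = 5000
h = n ** (-2.0 / 3.0)
X = sample_euler(n, h, A=lambda x: -0.5 * x, C=lambda x: 1.0, seed=1)
```

The choice h_n = n^{-2/3} satisfies both asymptotic requirements simultaneously, which is why it is a convenient default in simulations of this kind.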

For m_1 ∈ {1, …, M_1} and m_2 ∈ {1, …, M_2}, candidate models are supposed to be given. Here, for each m_1 and m_2, the candidate model is expressed as:

 dX_t = a_{m_2}(X_t, α_{m_2}) dt + c_{m_1}(X_{t−}, γ_{m_1}) dZ_t,

and the functional forms of a_{m_2} and c_{m_1} are known except for the p_{α_{m_2}}- and p_{γ_{m_1}}-dimensional unknown parameters α_{m_2} and γ_{m_1}, which are elements of the bounded convex domains Θ_{α_{m_2}} and Θ_{γ_{m_1}}. The main objective of this paper is to give a model selection procedure for extracting an "optimal" model among the candidates which reflects the features of X well.

For selecting an appropriate model from the data in hand quantitatively, information criteria are among the most convenient and powerful tools, and they have been widely used in many fields. Their origin dates back to the Akaike information criterion (AIC) introduced in [1], which puts importance on prediction; since then, various kinds of criteria have been produced, and for a comprehensive overview we refer to [3], [4], and [11]. Among them, this paper sheds light on the Bayesian information criterion (BIC) introduced by [13]. It is based on an approximation of the log-marginal likelihood up to the O_p(1)-term, and its original form is as follows:

 (1.3) BICn=−2ln(^θMLEn)+plogn,

where ℓ_n, θ̂_n^{MLE}, and p stand for the log-likelihood function, the maximum likelihood estimator, and the dimension of the parameter included in the subject model, respectively. However, since the closed form of the transition density of X is unavailable in general, we cannot rely on the genuine likelihood for feasible statistical analysis; this implies that the conventional likelihood-based (Bayesian) information criteria are impractical in our setting. Such a problem often occurs when discrete-time observations are obtained from a continuous-time process. To avoid it, the replacement of the genuine likelihood by some quasi-likelihood is effective not only for estimating the parameters included in a subject model but also for constructing (quasi-)information criteria; for instance, see [14], [9], [6] (ergodic diffusion models), [17] (stochastic regression models), and [7] (CARMA processes). In particular, [6] used the Gaussian quasi-likelihood in place of the genuine likelihood and derived the quasi-Bayesian information criterion (QBIC) under the conditions that the driving noise is a standard Wiener process and that, for each candidate model, there exist α and γ satisfying a(·, α) = A(·) and c(·, γ) = C(·), respectively (correct specification). Moreover, by using the difference of the small-time activity of the drift and diffusion terms, the paper also gave a two-step QBIC which selects each term separately and reduces the computational load. In that paper, the model selection consistency of the QBIC is shown for the nested case.
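To fix ideas, the classical BIC (1.3) can be computed directly in any model whose maximized likelihood is available in closed form. The following sketch (illustrative only, not from the paper) uses i.i.d. Gaussian toy data and two hypothetical nested candidates:

```python
import numpy as np

def bic(loglik, p, n):
    """Classical BIC (1.3): BIC_n = -2 l_n(theta_hat) + p log n (smaller is better)."""
    return -2.0 * loglik + p * np.log(n)

# Toy data: i.i.d. N(1, 1); both candidate models have closed-form Gaussian MLEs.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=500)
n = x.size

# Model 1: N(mu, sigma^2), p = 2.  Maximized log-likelihood at the MLE:
var1 = x.var()
ll1 = -0.5 * n * (np.log(2.0 * np.pi * var1) + 1.0)
# Model 2: N(0, sigma^2), p = 1 (mean fixed at 0, misspecified here).
var2 = np.mean(x ** 2)
ll2 = -0.5 * n * (np.log(2.0 * np.pi * var2) + 1.0)

bic1, bic2 = bic(ll1, 2, n), bic(ll2, 1, n)   # model 1 should win for these data
```

The point of the SDE setting discussed above is precisely that ℓ_n is not available in closed form, so the log-likelihood in this computation must be replaced by a quasi-likelihood.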

However, regarding the estimation of the parameters α and γ, the Gaussian quasi-maximum likelihood estimator (GQMLE) works well in a much broader situation: the driving noise is a standard Wiener process or a pure-jump Lévy process with (1.2), and either or both of the drift and scale coefficients may be misspecified. For technical accounts of the GQMLE for ergodic SDE models, see [19], [10], [15], [16], [12], and [18]. These results naturally provide the insight that the aforementioned QBIC is theoretically valid in this broader situation as well, and has the model selection consistency even if non-nested models are contained in the candidates. In this paper, we show that this insight is correct. More specifically, we derive the QBIC building on the stochastic expansion of the log-marginal Gaussian quasi-likelihood. Although the convergence rate of the GQMLE differs in the Lévy-driven and misspecified cases, the form of the QBIC is the same as in the correctly specified diffusion case. We also show the model selection consistency of the QBIC.

The rest of this paper is organized as follows: Section 2 provides the notations and assumptions used throughout this paper. In Section 3, the main results of this paper are given. Section 4 exhibits some numerical experiments. The technical proofs of the main results are summarized in the Appendix.

## 2. Notations and Assumptions

For notational convenience, we first introduce some symbols used in the rest of this paper.

• ∂_x denotes the differential operator with respect to a variable x.

• For positive sequences (x_n) and (y_n), x_n ≲ y_n implies that there exists a positive constant C, independent of n, satisfying x_n ≤ C y_n for all large enough n.

• For any set S, ¯S denotes its closure.

• We write Y_j = Y_{t_j^n} and Δ_j Y = Y_{t_j^n} − Y_{t_{j−1}^n} for any stochastic process Y.

In the next section, we will first give the stochastic expansion of the log-marginal Gaussian quasi-likelihood for the following model:

 (2.1) dXt=a(Xt,α)dt+c(Xt−,γ)dZt.

Below, we list the assumptions for our main results.

###### Assumption 2.1.

Z is a standard Wiener process, or a pure-jump Lévy process satisfying that: E[Z_1] = 0, E[Z_1^2] = 1, and E[|Z_1|^q] < ∞ for all q > 0.

###### Assumption 2.2.
1. The coefficients A and C are Lipschitz continuous and twice differentiable, and their first and second derivatives are of at most polynomial growth.

2. The drift coefficient a and scale coefficient c are Lipschitz continuous, and c(x, γ) > 0 for every (x, γ) ∈ R × ¯Θγ.

3. For each i ∈ {0, 1} and k ∈ {0, 1, 2, 3}, the following conditions hold:

• The coefficients a and c admit extensions in C(R × ¯Θα) and C(R × ¯Θγ), respectively, and their partial derivatives ∂_x^i ∂_α^k a and ∂_x^i ∂_γ^k c possess extensions in the same spaces.

• There exists a nonnegative constant C(i, k) satisfying

 (2.2) sup_{(x,α,γ)∈R×¯Θα×¯Θγ} (1/(1 + |x|^{C(i,k)})) { |∂_x^i ∂_α^k a(x, α)| + |∂_x^i ∂_γ^k c(x, γ)| + |c^{−1}(x, γ)| } < ∞.
###### Assumption 2.3.
1. There exists a probability measure π_0 such that for every q > 0, we can find constants a > 0 and C_q > 0 for which

 (2.3) sup_{t∈R_+} exp(at) ||P_t(x, ·) − π_0(·)||_{h_q} ≤ C_q h_q(x),

for any x ∈ R, where h_q(x) := 1 + |x|^q.

2. For any q > 0, we have

 (2.4) sup_{t∈R_+} E[|X_t|^q] < ∞.

Let π_1 and π_2 be the prior densities for γ and α, respectively.

###### Assumption 2.4.

The prior densities π_1 and π_2 are continuous, and fulfill that

 sup_{γ∈Θγ} π_1(γ) ∨ sup_{α∈Θα} π_2(α) < ∞.

We define the optimal values of γ and α in the following manner:

 γ⋆ ∈ argmax_{γ∈¯Θγ} G_1(γ), α⋆ ∈ argmax_{α∈¯Θα} G_2(α),

for the R-valued functions G_1 (resp. G_2) on ¯Θγ (resp. ¯Θα) defined by

 (2.5) G_1(γ) := −∫_R ( log c^2(x, γ) + C^2(x)/c^2(x, γ) ) π_0(dx),
 (2.6) G_2(α) := −∫_R c(x, γ⋆)^{−2} (A(x) − a(x, α))^2 π_0(dx).

Recall that each of Θγ and Θα is supposed to be a bounded convex domain. Then, we assume that:

###### Assumption 2.5.
• The pair (γ⋆, α⋆) is unique and belongs to Θγ × Θα.

• There exist positive constants χγ and χα such that for all (γ, α) ∈ ¯Θγ × ¯Θα,

 (2.7) G_1(γ) − G_1(γ⋆) ≤ −χγ |γ − γ⋆|^2,
 (2.8) G_2(α) − G_2(α⋆) ≤ −χα |α − α⋆|^2.

Define the p_γ × p_γ matrix I_γ and the p_α × p_α matrix I_α by:

 I_γ = 4 ∫_R ((∂_γ c(x, γ⋆))^{⊗2} / c^4(x, γ⋆)) C^2(x) π_0(dx) − 2 ∫_R ((∂_γ^{⊗2} c(x, γ⋆) c(x, γ⋆) − (∂_γ c(x, γ⋆))^{⊗2}) / c^4(x, γ⋆)) (C^2(x) − c^2(x, γ⋆)) π_0(dx),
 I_α = 2 ∫_R ((∂_α a(x, α⋆))^{⊗2} / c^2(x, γ⋆)) π_0(dx) − 2 ∫_R (∂_α^{⊗2} a(x, α⋆) / c^2(x, γ⋆)) (A(x) − a(x, α⋆)) π_0(dx).
###### Assumption 2.6.

I_γ and I_α are positive definite.

## 3. Main results

In this paper, we consider the stepwise Gaussian quasi-likelihood functions G_{1,n} and G_{2,n} on Θγ and Θα, respectively. They are defined in the following manner:

 (3.1) G_{1,n}(γ) = −(1/h_n) ∑_{j=1}^n { h_n log c_{j−1}^2(γ) + (Δ_j X)^2 / c_{j−1}^2(γ) },
 (3.2) G_{2,n}(α) = −∑_{j=1}^n (Δ_j X − h_n a_{j−1}(α))^2 / (h_n c_{j−1}^2(γ̂_n)),

where c_{j−1}(γ) := c(X_{t_{j−1}^n}, γ) and a_{j−1}(α) := a(X_{t_{j−1}^n}, α). For these functions, we consider the maximum-likelihood-type estimators γ̂_n and α̂_n, that is,

 γ̂_n ∈ argmax_{γ∈¯Θγ} G_{1,n}(γ), α̂_n ∈ argmax_{α∈¯Θα} G_{2,n}(α).

Building on the stepwise Gaussian quasi-likelihoods G_{1,n} and G_{2,n}, the next theorem gives the stochastic expansion of the log-marginal quasi-likelihood:

###### Theorem 3.1.
 log(∫_{Θγ} exp(G_{1,n}(γ)) π_1(γ) dγ) = G_{1,n}(γ̂_n) − (1/2) p_γ log n + log π_1(γ⋆) + (p_γ/2) log 2π − (1/2) log det I_γ + o_p(1),
 log(∫_{Θα} exp(G_{2,n}(α)) π_2(α) dα) = G_{2,n}(α̂_n) − (1/2) p_α log T_n + log π_2(α⋆) + (p_α/2) log 2π − (1/2) log det I_α + o_p(1).

By ignoring the terms of O_p(1) in each expansion, we define the two-step quasi-Bayesian information criteria (QBIC) by

 QBIC_{1,n} = G_{1,n}(γ̂_n) − (1/2) p_γ log n, QBIC_{2,n} = G_{2,n}(α̂_n) − (1/2) p_α log T_n.
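As a minimal numerical sketch of this two-step construction (not part of the paper), consider the hypothetical one-parameter model dX_t = −α X_t dt + γ dw_t, for which both stepwise maximizers of (3.1)–(3.2) are available in closed form; all numerical choices below are illustrative.

```python
import numpy as np

def two_step_qbic(X, h):
    """Two-step GQMLE and QBIC for the illustrative model
    dX_t = -alpha * X_t dt + gamma * dW_t  (p_gamma = p_alpha = 1).
    Here gamma_hat^2 = sum((Delta_j X)^2)/(n h) and alpha_hat is least squares."""
    n = len(X) - 1
    dX, Xl = np.diff(X), np.asarray(X[:-1])
    # Step 1 (scale): G_{1,n}(gamma) = -n log gamma^2 - sum(dX^2)/(h gamma^2).
    v_hat = np.sum(dX ** 2) / (n * h)                  # gamma_hat^2
    G1 = -n * np.log(v_hat) - n                        # G_{1,n}(gamma_hat)
    qbic1 = G1 - 0.5 * 1 * np.log(n)                   # p_gamma = 1
    # Step 2 (drift): plug gamma_hat into G_{2,n} and maximize over alpha.
    a_hat = -np.sum(dX * Xl) / (h * np.sum(Xl ** 2))   # alpha_hat
    G2 = -np.sum((dX + h * a_hat * Xl) ** 2) / (h * v_hat)
    qbic2 = G2 - 0.5 * 1 * np.log(n * h)               # T_n = n h_n
    return v_hat, a_hat, qbic1, qbic2

# Usage on an Euler-simulated Ornstein-Uhlenbeck path with alpha = 2, gamma = 1:
rng = np.random.default_rng(3)
n, h = 5000, 0.01
X = np.empty(n + 1); X[0] = 0.0
dW = rng.normal(0.0, np.sqrt(h), n)
for j in range(n):
    X[j + 1] = X[j] - 2.0 * X[j] * h + dW[j]
v_hat, a_hat, qbic1, qbic2 = two_step_qbic(X, h)
```

Note the different penalty rates: log n for the scale part and log T_n for the drift part, mirroring the two convergence rates of the GQMLE.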

Next, we consider the model selection consistency of the proposed information criteria. Suppose that candidates for the scale and drift coefficients are given as

 (3.3) c1(x,γ1),…,cM1(x,γM1), (3.4) a1(x,α1),…,aM2(x,αM2),

where γ_{m_1} ∈ Θ_{γ_{m_1}} ⊂ R^{p_{γ_{m_1}}} for any m_1 ∈ {1, …, M_1} and α_{m_2} ∈ Θ_{α_{m_2}} ⊂ R^{p_{α_{m_2}}} for any m_2 ∈ {1, …, M_2}. Then, each candidate model is given by

 dXt=am2(Xt,αm2)dt+cm1(Xt−,γm1)dZt.

In each candidate model (m_1, m_2), the functions corresponding to (3.1) and (2.5) are denoted by G^{(m_1)}_{1,n} and G^{(m_1)}_1, respectively. The functions G^{(m_2|m_1)}_{2,n} and G^{(m_2|m_1)}_2 correspond to (3.2) and (2.6) with the scale coefficient c_{m_1} plugged in. Using the QBIC, we propose the following stepwise model selection.

• First, we select the best scale coefficient c_{m̂_{1,n}} among (3.3), where the index m̂_{1,n} satisfies QBIC^{(m̂_{1,n})}_{1,n} = max_{1≤m_1≤M_1} QBIC^{(m_1)}_{1,n} with

 QBIC^{(m_1)}_{1,n} = G^{(m_1)}_{1,n}(γ̂_{m_1,n}) − (1/2) p_{γ_{m_1}} log n,

for m_1 ∈ {1, …, M_1}.

• Next, under the selected scale coefficient c_{m̂_{1,n}} and its estimator γ̂_{m̂_{1,n},n}, we select the best drift coefficient a_{m̂_{2,n}} with index m̂_{2,n} such that QBIC^{(m̂_{2,n}|m̂_{1,n})}_{2,n} = max_{1≤m_2≤M_2} QBIC^{(m_2|m̂_{1,n})}_{2,n}, where

 QBIC^{(m_2|m̂_{1,n})}_{2,n} = G^{(m_2|m̂_{1,n})}_{2,n}(α̂_{m_2,n}) − (1/2) p_{α_{m_2}} log T_n,

for m_2 ∈ {1, …, M_2}.

Through this procedure, we obtain the model (m̂_{1,n}, m̂_{2,n}) as the final best model among the candidates described by (3.3) and (3.4).
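The stepwise procedure above can be sketched as a generic selection loop. In the following Python sketch (illustrative, not from the paper), every candidate family is one-parameter, the parameter is fitted by a crude grid search, and both the candidate families and the simulated data are hypothetical choices.

```python
import numpy as np

def G1n(X, h, c_vals):
    """Stepwise quasi-likelihood (3.1), with c evaluated at X_{t_{j-1}}."""
    dX = np.diff(X)
    c2 = c_vals ** 2
    return -np.sum(h * np.log(c2) + dX ** 2 / c2) / h

def G2n(X, h, a_vals, c_vals):
    """Stepwise quasi-likelihood (3.2), with the step-1 scale plugged in."""
    dX = np.diff(X)
    return -np.sum((dX - h * a_vals) ** 2 / (h * c_vals ** 2))

def fit_grid(obj, grid):
    """Crude one-dimensional grid maximization (sufficient for this sketch)."""
    vals = [obj(t) for t in grid]
    i = int(np.argmax(vals))
    return vals[i], grid[i]

def stepwise_select(X, h, scale_cands, drift_cands, grid):
    """Pick m1_hat maximizing QBIC_{1,n}^{(m1)}, then m2_hat maximizing
    QBIC_{2,n}^{(m2|m1_hat)}.  Each candidate is a pair (p, family), where
    family(theta, x) evaluates the coefficient at the lagged observations x."""
    n, Xl = len(X) - 1, X[:-1]
    qbic1, fitted_scale = [], []
    for p, fam in scale_cands:
        g, th = fit_grid(lambda t: G1n(X, h, fam(t, Xl)), grid)
        qbic1.append(g - 0.5 * p * np.log(n))
        fitted_scale.append(fam(th, Xl))
    m1 = int(np.argmax(qbic1))
    qbic2 = []
    for p, fam in drift_cands:
        g, _ = fit_grid(lambda t: G2n(X, h, fam(t, Xl), fitted_scale[m1]), grid)
        qbic2.append(g - 0.5 * p * np.log(n * h))      # T_n = n h_n
    return m1, int(np.argmax(qbic2))

# Usage: Euler-simulated path of dX_t = -2 X_t dt + dW_t; the first candidate
# in each list matches the true structure (constant scale, linear drift).
rng = np.random.default_rng(5)
n, h = 10000, 0.01
X = np.empty(n + 1); X[0] = 0.0
dW = rng.normal(0.0, np.sqrt(h), n)
for j in range(n):
    X[j + 1] = X[j] - 2.0 * X[j] * h + dW[j]
scale_cands = [(1, lambda t, x: t * np.ones_like(x)),
               (1, lambda t, x: t * np.sqrt(1.0 + x ** 2))]
drift_cands = [(1, lambda t, x: -t * x),
               (1, lambda t, x: -t * np.ones_like(x))]
m1_hat, m2_hat = stepwise_select(X, h, scale_cands, drift_cands,
                                 np.linspace(0.05, 2.0, 79))
```

Because the scale is selected first and then frozen, the drift step costs only M_2 additional one-dimensional fits rather than M_1 × M_2 joint fits, which is the computational saving the stepwise procedure is designed for.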

The optimal values of γ_{m_1} and α_{m_2} are defined in a similar manner as in the previous section. We assume that the model indexes m⋆_1 and m⋆_2 are uniquely given as follows:

 {m⋆_1} = argmin_{m_1∈M⋆_1} dim(Θ_{γ_{m_1}}), {m⋆_2} = argmin_{m_2∈M⋆_2} dim(Θ_{α_{m_2}}),

where M⋆_1 := argmax_{m_1∈{1,…,M_1}} G^{(m_1)}_1(γ⋆_{m_1}) and M⋆_2 := argmax_{m_2∈{1,…,M_2}} G^{(m_2|m⋆_1)}_2(α⋆_{m_2}). Then, we say that (m⋆_1, m⋆_2) is the optimal model. That is, the optimal model consists of the elements of the optimal model sets M⋆_1 and M⋆_2 which have the smallest dimension. The following theorem means that the proposed criteria and model selection method have the model selection consistency.

###### Theorem 3.2.

Suppose that Assumptions 2.1–2.6 hold for all the candidate models and that the optimal model (m⋆_1, m⋆_2) uniquely exists. Let m_1 ≠ m⋆_1 and m_2 ≠ m⋆_2. Then the model selection consistency of the proposed QBIC holds in the following senses:

 lim_{n→∞} P(QBIC^{(m⋆_1)}_{1,n} − QBIC^{(m_1)}_{1,n} > 0) = 1,
 lim_{n→∞} P(QBIC^{(m⋆_2|m̂_{1,n})}_{2,n} − QBIC^{(m_2|m̂_{1,n})}_{2,n} > 0) = 1.

## 4. Numerical experiments

In this section, we present simulation results to observe the finite-sample performance of the proposed QBIC. We use the R package yuima (see [2]) for generating data. In the examples below, all the Monte Carlo trials are based on 1000 independent sample paths, and the simulations are done for several values of n and T_n. We simulate the model selection frequencies by using the proposed QBIC and compute the model weight ([3, Section 6.4.5]) defined by

 (4.1) w_{m_1,m_2} = exp{−(1/2)(QBIC^{(m̂_{1,n})}_{1,n} − QBIC^{(m_1)}_{1,n})} / ∑_{k=1}^{M_1} exp{−(1/2)(QBIC^{(m̂_{1,n})}_{1,n} − QBIC^{(k)}_{1,n})}
 (4.2) × exp{−(1/2)(QBIC^{(m̂_{2,n}|m_1)}_{2,n} − QBIC^{(m_2|m_1)}_{2,n})} / ∑_{ℓ=1}^{M_2} exp{−(1/2)(QBIC^{(m̂_{2,n}|m_1)}_{2,n} − QBIC^{(ℓ|m_1)}_{2,n})} × 100.

The model weight can be used to empirically quantify the relative frequency (percentage) of the model selection from a single data set. The model which has the highest value of w_{m_1,m_2} is the most probable model. Because of the normalization in (4.1)–(4.2), the weights satisfy ∑_{m_1,m_2} w_{m_1,m_2} = 100.
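A sketch of this weight computation follows (illustrative, not from the paper). It is written for the larger-is-better QBICs of Section 3, so each model's weight decays with its QBIC gap to the selected model; the input values are hypothetical, and the drift factor is simplified to use the QBICs computed under the selected scale only.

```python
import numpy as np

def model_weights(qbic1, qbic2):
    """Model weights in the spirit of (4.1)-(4.2).  qbic1: values over the
    scale candidates m1; qbic2: values over the drift candidates m2 (under
    the selected scale).  Weights sum to 100 over all pairs (m1, m2)."""
    qbic1, qbic2 = np.asarray(qbic1), np.asarray(qbic2)
    w1 = np.exp(0.5 * (qbic1 - qbic1.max()))   # gap to the selected scale model
    w2 = np.exp(0.5 * (qbic2 - qbic2.max()))   # gap to the selected drift model
    return 100.0 * np.outer(w1 / w1.sum(), w2 / w2.sum())

# Hypothetical QBIC values for 3 scale and 2 drift candidates:
w = model_weights([-10.0, -12.5, -30.0], [-5.0, -9.0])
```

Under the smaller-is-better convention for information criteria, the sign in the exponent would flip accordingly.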

Suppose that we have a sample (X_{t_j})_{j=0}^n with t_j = j h_n from the true model

 dX_t = −(1/2) X_t dt + dw_t, t ∈ [0, T_n], X_0 = 0,

where w is a one-dimensional standard Wiener process. We consider the following diffusion (Diff) and drift (Drif) coefficients:

 Diff1: c_1(x, γ_1) = exp{(γ_{1,1} + γ_{1,2}x + x^2)/(1 + x^2)}; Diff2: c_2(x, γ_2) = exp{(γ_{2,1} + x + γ_{2,3}x^2)/(1 + x^2)};
 Diff3: c_3(x, γ_3) = exp{(1 + γ_{3,2}x + γ_{3,3}x^2)/(1 + x^2)}; Diff4: c_4(x, γ_4) = exp{(1 + γ_{4,2}x)/(1 + x^2)};
 Diff5: c_5(x, γ_5) = exp{(1 + γ_{5,3}x^2)/(1 + x^2)}; Diff6: c_6(x, γ_6) = exp{(γ_{6,2}x + x^2)/(1 + x^2)};
 Diff7: c_7(x, γ_7) = exp{(x + γ_{7,3}x^2)/(1 + x^2)},

and

 Drif1: a_1(x, α_1) = −α_1(x − 1); Drif2: a_2(x, α_2) = −α_2 x − 1; Drif3: a_3(x, α_3) = −α_3.

Each candidate model is given by a combination of the diffusion and drift coefficients; for example, in the case of Diff 1 and Drif 1, we consider the statistical model

 dX_t = −α_1(X_t − 1) dt + exp{(γ_{1,1} + γ_{1,2}X_t + X_t^2)/(1 + X_t^2)} dw_t.

In this example, although the candidate models do not include the true model, the optimal parameters and the optimal model indexes m⋆_1 and m⋆_2 can be obtained through the functions

 G^{(m_1)}_1(γ_{m_1}) = −∫_R { log c_{m_1}(x, γ_{m_1})^2 + 1/c_{m_1}(x, γ_{m_1})^2 } π_0(dx),
 G^{(m_2|m⋆_1)}_2(α_{m_2}) = −∫_R c_{m⋆_1}(x, γ⋆_{m⋆_1})^{−2} { −x/2 − a_{m_2}(x, α_{m_2}) }^2 π_0(dx),

where π_0 = N(0, 1) is the invariant distribution of the true model. The definition of the optimal model together with Tables 1 and 2 shows that the optimal model consists of Diff 1 and Drif 1.
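Since π_0 is explicit here, limit criteria of the form (2.5) can be evaluated numerically by Gauss–Hermite quadrature. The following sketch (illustrative only) uses a hypothetical constant-scale candidate c(x, γ) = γ, which is not one of Diff 1–7 but is convenient because, with the true scale identically 1, its maximizer is known to be γ = 1.

```python
import numpy as np

def expect_std_normal(f, deg=80):
    """E[f(X)] for X ~ N(0,1) via Gauss-Hermite quadrature:
    (1/sqrt(pi)) * sum_i w_i f(sqrt(2) u_i)."""
    u, w = np.polynomial.hermite.hermgauss(deg)
    return float(np.sum(w * f(np.sqrt(2.0) * u)) / np.sqrt(np.pi))

def G1_limit(c):
    """Limit criterion (2.5) with pi0 = N(0,1) and true scale C == 1:
    G1 = -E[ log c(X)^2 + 1/c(X)^2 ]."""
    return -expect_std_normal(lambda x: np.log(c(x) ** 2) + 1.0 / c(x) ** 2)

# Hypothetical constant-scale candidate c(x, gamma) = gamma; the criterion
# -(log gamma^2 + 1/gamma^2) is maximized at gamma = 1 with value -1.
grid = np.linspace(0.5, 2.0, 151)
vals = [G1_limit(lambda x, g=g: g + 0.0 * x) for g in grid]
gamma_star = float(grid[int(np.argmax(vals))])
```

The same quadrature routine applies verbatim to the Diff 1–7 families above; only the function passed to `G1_limit` changes.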

Table 3 summarizes the comparison results of the model selection frequencies and the mean of w_{m_1,m_2}. The optimal model is the one defined by Diff 1 and Drif 1. For all cases, the optimal model is selected with high frequency, and its value of w_{m_1,m_2} is the highest. We also observe that the frequency with which the optimal model is selected and the value of w_{m_1,m_2} become higher as T_n increases.

## 5. Appendix

Proof of Theorem 3.1. In the following, we consider the zero-extended versions of π_1 and π_2 just for simplicity of the discussion. Applying the change of variables t = √n(γ − γ̂_n), we have

 log(∫_{Θγ} exp(G_{1,n}(γ)) π_1(γ) dγ) = G_{1,n}(γ̂_n) − (p_γ/2) log n + log(∫_{R^{p_γ}} exp{G_{1,n}(γ̂_n + t/√n) − G_{1,n}(γ̂_n)} π_1(γ̂_n + t/√n) dt).

Below we show that

 log(∫_{R^{p_γ}} exp{G_{1,n}(γ̂_n + t/√n) − G_{1,n}(γ̂_n)} π_1(γ̂_n + t/√n) dt) = log π_1(γ⋆) + (p_γ/2) log 2π − (1/2) log det I_γ + o_p(1).

For a fixed positive constant δ, we divide R^{p_γ} into

 D_{1,n} := {t ∈ R^{p_γ} : |t| < δ n^{1/2}}, D_{2,n} := {t ∈ R^{p_γ} : |t| ≥ δ n^{1/2}}.

First, we look at the integral on D_{1,n}. Taylor expansion of G_{1,n} around γ̂_n gives

Here, for any t ∈ D_{1,n}, the second term on the right-hand side is bounded by

It follows from [8, Theorem 2(d)] that for any subsequence {n_k} of {n}, we can pick a further subsequence {n_{k_j}} fulfilling that, for any t,

 T^ε_{n_{k_j}}(γ̂_{n_{k_j}} − γ⋆) →^{a.s.} 0,
 (1/n_{k_j}) ∂_γ^2 G_{1,n_{k_j}}(γ̂_{n_{k_j}}) →^{a.s.} −I_γ,
 (1/n_{k_j}) ∫_0^1 ∂_γ^3 G_{1,n_{k_j}}(γ̂_{n_{k_j}} + tu/√n_{k_j}) du →^{a.s.} ∃ G̃ < ∞,
 sup_t |π_1(γ̂_{n_{k_j}} + t/√n_{k_j}) − π_1(γ⋆)| →^{a.s.} 0,
 sup_{γ∈Θγ} |(1/n_{k_j}) G_{1,n_{k_j}}(γ) − G_1(γ)| →^{a.s.} 0,
 |G_1(γ̂_{n_{k_j}} + t/√n_{k_j}) − G_1(γ⋆ + t/√n_{k_j})| + |G_1(γ̂_{n_{k_j}}) − G_1(γ⋆)| →^{a.s.} 0.

We hereafter write Ω_0 for the set on which these convergences hold. For simplicity, we write

 R_{n_{k_j}} = |(1/(2n_{k_j})) ∂_γ^2 G_{1,n_{k_j}}(γ̂_{n_{k_j}}) + (1/2) I_γ| + δ |(1/(6n_{k_j})) ∫_0^1 ∂_γ^3 G_{1,n_{k_j}}(γ̂_{n_{k_j}} + u(γ⋆ − γ̂_{n_{k_j}})) du|.

For any ε > 0 and any ω in the almost-sure event introduced above, we can pick δ > 0 and j_0 ∈ N, small and large enough respectively, satisfying that R_{n_{k_j}}(ω) ≤ ε for all j ≥ j_0. For any set A, we define the indicator function 1_A by:

 1_A(t) = 1 if t ∈ A, and 1_A(t) = 0 otherwise.

Then, for all ε > 0 and all ω in that event, we can choose J ∈ N such that for all j ≥ J,

 ∣∣ ∣∣exp{G1,nkj<