1. Introduction
We suppose that the data-generating process is defined on a stochastic basis and is the solution of the one-dimensional stochastic differential equation written as:
(1.1) $dX_t = A(X_t)\,dt + C(X_{t-})\,dZ_t,$
where:

The coefficients $A$ and $C$ are Lipschitz continuous.

The driving noise $Z$ is a standard Wiener process or a pure-jump Lévy process satisfying that for any $q>0$,
(1.2) $E[Z_1]=0, \quad E[Z_1^2]=1, \quad E[|Z_1|^q]<\infty.$
The initial variable $X_0$ is independent of $Z$, and $E[|X_0|^q]<\infty$ for every $q>0$.
As the observations from $X$, we consider the discrete but high-frequency samples $(X_{t_0}, X_{t_1}, \dots, X_{t_n})$ with $t_j = jh_n$, where $h_n \to 0$ and $nh_n \to \infty$ as $n \to \infty$.
For , candidate models are supposed to be given. Here, for each and , the candidate model is expressed as:
and its functional form is known except for the $p_\alpha$- and $p_\gamma$-dimensional unknown parameters $\alpha$ and $\gamma$, which are elements of the bounded convex domains $\Theta_\alpha$ and $\Theta_\gamma$, respectively. The main objective of this paper is to give a model selection procedure for extracting an “optimal” model among the candidates which reflects the features of $X$ well.
For selecting an appropriate model from the data at hand quantitatively, information criteria are among the most convenient and powerful tools, and they have been widely used in many fields. Their origin dates back to the Akaike information criterion (AIC) introduced in [1], which puts importance on prediction; since then, various kinds of criteria have been developed, and for a comprehensive overview, see [3], [4], and [11]. Among them, this paper especially sheds light on the Bayesian information criterion (BIC) introduced by [13]. It is based on an approximation, up to an $O_p(1)$ term, of the log-marginal likelihood, and its original form is as follows:
(1.3) $\mathrm{BIC} = -2\,\ell_n(\hat{\theta}_n) + p \log n,$
where $\ell_n$, $\hat{\theta}_n$, and $p$ stand for the log-likelihood function, the maximum likelihood estimator, and the dimension of the parameter included in the subject model, respectively. However, since the closed form of the transition density of $X$ is unavailable in general, we cannot rely on the genuine likelihood to conduct feasible statistical analysis; this implies that the conventional likelihood-based (Bayesian) information criteria are impractical in our setting. Such a problem often occurs when discrete-time observations are obtained from a continuous-time process, and to avoid it, replacing the genuine likelihood by a suitable quasi-likelihood is effective not only for estimating the parameters included in a subject model but also for constructing (quasi-)information criteria; for instance, see [14], [9], [6] (ergodic diffusion model), [17] (stochastic regression model), and [7] (CARMA process). In particular, [6] used the Gaussian quasi-likelihood in place of the genuine likelihood and derived the quasi-Bayesian information criterion (QBIC) under the following conditions: the driving noise is a standard Wiener process, and each candidate model is correctly specified, that is, there exist parameter values at which the candidate drift and scale coefficients coincide with the true ones. Moreover, by exploiting the difference in the small-time activity of the drift and diffusion terms, the paper also gave a two-step QBIC which selects each term separately and reduces the computational load. In that paper, the model selection consistency of the QBIC is shown for the nested case. However, regarding the estimation of the parameters $\alpha$ and $\gamma$, the Gaussian quasi-maximum likelihood estimator (GQMLE) works well in a much broader situation: the driving noise is a standard Wiener process or a pure-jump Lévy process with (1.2), and either or both of the drift and scale coefficients may be misspecified. For technical accounts of the GQMLE for ergodic SDE models, see [19], [10], [15], [16], [12], and [18]. These results naturally provide the insight that the aforementioned QBIC is also theoretically valid in this broader situation and enjoys the model selection consistency even if non-nested models are contained in the candidates. In this paper, we show that this insight is true.
More specifically, we give the QBIC building on the stochastic expansion of the log-marginal Gaussian quasi-likelihood. Although the convergence rate of the GQMLE differs in the Lévy-driven and misspecified cases, the form of the criterion is the same as in the correctly specified diffusion case. We also show the model selection consistency of the QBIC.
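To fix ideas, the Gaussian quasi-likelihood can be sketched in code. The following Python fragment is a minimal illustration, assuming an Euler-scheme local-Gaussian approximation; the Ornstein–Uhlenbeck-type data-generating model, function names, and coefficient parametrizations are hypothetical and not taken from the paper.

```python
import numpy as np

def gaussian_quasi_loglik(x, h, drift, scale, alpha, gamma):
    """Treat each increment X_{t_j} - X_{t_{j-1}} as Gaussian with mean
    h * a(X_{t_{j-1}}, alpha) and variance h * c(X_{t_{j-1}}, gamma)^2,
    regardless of the actual driving noise (Euler-type approximation)."""
    x0, dx = x[:-1], np.diff(x)
    mean = h * drift(x0, alpha)
    var = h * scale(x0, gamma) ** 2
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (dx - mean) ** 2 / var)

# Hypothetical Ornstein-Uhlenbeck-type data: dX_t = -X_t dt + 0.5 dw_t,
# generated by an Euler scheme on a high-frequency grid.
rng = np.random.default_rng(1)
h, n = 0.01, 5000
x = np.zeros(n + 1)
for j in range(n):
    x[j + 1] = x[j] - x[j] * h + 0.5 * np.sqrt(h) * rng.standard_normal()

drift_fn = lambda y, a: -a * y
scale_fn = lambda y, g: g * np.ones_like(y)
ll_good = gaussian_quasi_loglik(x, h, drift_fn, scale_fn, 1.0, 0.5)  # near-true scale
ll_bad = gaussian_quasi_loglik(x, h, drift_fn, scale_fn, 1.0, 2.0)   # misspecified scale
```

Even at a misspecified parameter value, the quasi-likelihood remains well defined; it is simply smaller there, which is what the GQMLE theory exploits.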
2. Notations and Assumptions
For notational convenience, we first introduce some symbols used in the rest of this paper.

$\partial_x$ denotes the differential operator with respect to a variable $x$.

$a_n \lesssim b_n$ implies that there exists a positive constant $C$ being independent of $n$ satisfying $a_n \le C b_n$ for all large enough $n$.

For any set $A$, $\bar{A}$ denotes its closure.

We write $Y_j = Y_{t_j}$ and $\Delta_j Y = Y_j - Y_{j-1}$ for any stochastic process $Y$.
In the next section, we will first give the stochastic expansion of the log-marginal Gaussian quasi-likelihood for the following model:
(2.1) 
Below, we list the assumptions for our main results.
Assumption 2.1.
$Z$ is a standard Wiener process, or a pure-jump Lévy process satisfying: $E[Z_1]=0$, $E[Z_1^2]=1$, and $E[|Z_1|^q]<\infty$ for all $q>0$.
Assumption 2.2.

The coefficients $A$ and $C$ are Lipschitz continuous and twice differentiable, and their first and second derivatives are of at most polynomial growth.

The drift coefficient $a(x,\alpha)$ and scale coefficient $c(x,\gamma)$ are Lipschitz continuous, and $c(x,\gamma) \neq 0$ for every $(x,\gamma)$.

For each and , the following conditions hold:

The coefficients and admit extension in and have the partial derivatives possessing extension in .

There exists a nonnegative constant satisfying
(2.2)

Assumption 2.3.

There exists a probability measure
such that for every , we can find constants and for which (2.3) holds for any , where .

For any , we have
(2.4)
Let and be the prior densities for and , respectively.
Assumption 2.4.
The prior densities and are continuous and fulfil that
We define an optimal value of in the following manner:
for valued functions (resp. ) on (resp. ) defined by
(2.5)  
(2.6) 
Recall that is supposed to be a bounded convex domain. Then, we assume that:
Assumption 2.5.

is unique and is in .

There exist positive constants and such that for all ,
(2.7) (2.8)
Define the matrix and matrix by:
Assumption 2.6.
and are positive definite.
3. Main results
In this paper, we consider the stepwise Gaussian quasi-likelihood functions and on and , respectively. They are defined in the following manner:
(3.1)  
(3.2) 
For such functions, we consider the maximum likelihood-type estimators $\hat{\gamma}_n$ and $\hat{\alpha}_n$, that is, the stepwise maximizers of (3.1) and (3.2).
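Numerically, the stepwise estimators can be obtained by maximizing (3.1) and (3.2) in turn. The sketch below uses a crude grid search on a simulated Ornstein–Uhlenbeck-type path purely for illustration; the specific quasi-likelihood forms, the data-generating model, and all names are our assumptions based on the standard two-step Gaussian quasi-likelihood, not the paper's definitions.

```python
import numpy as np

def h1(x, h, scale, gamma):
    # Step 1: quasi-log-likelihood in the scale parameter only; the drift
    # contribution is asymptotically negligible at this step's rate (assumption).
    x0, dx = x[:-1], np.diff(x)
    var = h * scale(x0, gamma) ** 2
    return -0.5 * np.sum(np.log(var) + dx ** 2 / var)

def h2(x, h, drift, scale, gamma_hat, alpha):
    # Step 2: plug in the step-1 estimate and maximize in the drift parameter.
    x0, dx = x[:-1], np.diff(x)
    var = h * scale(x0, gamma_hat) ** 2
    return -0.5 * np.sum((dx - h * drift(x0, alpha)) ** 2 / var)

# Hypothetical data: dX_t = -X_t dt + 0.5 dw_t via an Euler scheme.
rng = np.random.default_rng(7)
h, n = 0.01, 10000
x = np.zeros(n + 1)
for j in range(n):
    x[j + 1] = x[j] - x[j] * h + 0.5 * np.sqrt(h) * rng.standard_normal()

scale_fn = lambda y, g: g * np.ones_like(y)
drift_fn = lambda y, a: -a * y

gammas = np.linspace(0.1, 2.0, 191)
gamma_hat = gammas[np.argmax([h1(x, h, scale_fn, g) for g in gammas])]
alphas = np.linspace(0.1, 3.0, 291)
alpha_hat = alphas[np.argmax([h2(x, h, drift_fn, scale_fn, gamma_hat, a) for a in alphas])]
```

The stepwise structure mirrors the different convergence rates: the scale parameter is recovered much more precisely (rate $\sqrt{n}$) than the drift parameter (rate $\sqrt{nh}$).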
Building on the stepwise Gaussian quasi-likelihoods, the next theorem gives the stochastic expansion of the log-marginal quasi-likelihood:
Theorem 3.1.
By ignoring the asymptotically negligible terms in each expansion, we define the two-step quasi-Bayesian information criteria (QBIC) by
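The displayed formulas are not fully legible here; as a hedged sketch, a commonly used two-step form penalizes the scale part at rate $\log n$ and the drift part at rate $\log(nh)$, matching the two estimators' different convergence rates. The helper below is hypothetical and only illustrates this penalty structure.

```python
import numpy as np

def two_step_qbic(h1_max, p_gamma, h2_max, p_alpha, n, h):
    """Hypothetical helper: penalize the scale step at rate log n and the
    drift step at rate log(n*h), reflecting convergence rates sqrt(n)
    and sqrt(n*h), respectively (assumption, not the paper's display)."""
    qbic1 = -2.0 * h1_max + p_gamma * np.log(n)      # scale/diffusion part
    qbic2 = -2.0 * h2_max + p_alpha * np.log(n * h)  # drift part
    return qbic1, qbic2
```

Each candidate coefficient is then ranked by its own criterion, so the two terms can be selected separately.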
Next, we consider the model selection consistency of the proposed information criteria. Suppose that the candidates for the drift and scale coefficients are given as
(3.3)  
(3.4) 
where for any and for any . Then, each candidate model is given by
In each candidate model , the functions (3.1) and (2.5) are denoted by and , respectively. The functions and correspond to (3.2) and (2.6) with . Using the QBIC, we propose the stepwise model selection as follows.

Under and , we select the best drift coefficient with index such that , where .
Through this procedure, we can obtain the model as the final best model among the candidates described by (3.3) and (3.4).
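The stepwise procedure above can be sketched as follows, with purely hypothetical precomputed criterion values; the dictionary keys follow the Diff/Drif labels used in Section 4.

```python
# Hypothetical, precomputed criterion values for illustration only.
qbic1 = {"Diff 1": -193.1, "Diff 2": -180.4, "Diff 3": -150.2}
best_scale = min(qbic1, key=qbic1.get)  # step 1: pick the scale coefficient

# Step 2: drift criteria computed with the selected scale held fixed.
qbic2 = {"Drif 1": -97.7, "Drif 2": -90.2, "Drif 3": -95.5}
best_drift = min(qbic2, key=qbic2.get)

selected_model = (best_scale, best_drift)  # ("Diff 1", "Drif 1")
```

Because the scale coefficient is fixed before the drift step, only (number of scale candidates) + (number of drift candidates) criteria are evaluated, rather than their product.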
The optimal value of is defined in a similar manner to that of the previous section. We assume that the model indices and are uniquely given as follows:
where and . Then, we say that is the optimal model. That is, the optimal model consists of the elements of the optimal model sets and which have the smallest dimensions. The following theorem shows that the proposed criteria and model selection method enjoy the model selection consistency.
4. Numerical experiments
In this section, we present simulation results to observe the finite-sample performance of the proposed QBIC. We use the R package yuima (see [2]) for generating data. In the examples below, all the Monte Carlo trials are based on 1000 independent sample paths, and the simulations are done for , and . We compute the model selection frequencies by using the proposed QBIC together with the model weight ([3, Section 6.4.5]) defined by
(4.1)  
(4.2) 
The model weight can be used to empirically quantify the relative frequency (percentage) of model selection from a single data set. The model with the highest weight is the most probable one. By (4.2), the weights sum to one.
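A numerically stable way to evaluate such weights is to shift by the minimum criterion value before exponentiating; the helper below is a sketch under the usual convention that the weights are proportional to $\exp(-\mathrm{QBIC}_m/2)$ (cf. (4.1)–(4.2)); the function name is ours.

```python
import numpy as np

def qbic_weights(qbic):
    """Weights proportional to exp(-QBIC_m / 2), normalized to sum to one;
    subtracting the minimum avoids numerical overflow without changing
    the normalized weights."""
    q = np.asarray(qbic, dtype=float)
    w = np.exp(-(q - q.min()) / 2.0)
    return w / w.sum()
```

Smaller criterion values thus receive larger weights, and the best model's weight can be read off directly as a percentage.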
Suppose that we have a sample with from the true model
where , and is a one-dimensional standard Wiener process. We consider the following diffusion (Diff) and drift (Drif) coefficients:
and
Each candidate model is given by a combination of diffusion and drift coefficients; for example, in the case of Diff 1 and Drif 1, we consider the statistical model
In this example, although the candidate models do not include the true model, the optimal parameter and the optimal model indices and can be obtained by the functions
where . The definition of the optimal model together with Tables 1 and 2 shows that the optimal model consists of Diff 1 and Drif 1.
Table 1:
Diff 1   Diff 2   Diff 3   Diff 4   Diff 5   Diff 6   Diff 7
1.2089   1.2822   1.4833   1.6225   1.4833   1.2602   3.2860

Table 2:
Drif 1   Drif 2   Drif 3
0.0624   0.8193   0.0979
Table 3 summarizes the comparison results of the model selection frequencies and the mean of . The indicator of the optimal model defined by Diff 1 and Drif 1 is given by . For all cases, the optimal model is selected with high frequency, and its value of is the highest. We also observe that the frequency with which the optimal model is selected and the value of become higher as increases.
Table 3:
                                 Diff 1  Diff 2  Diff 3  Diff 4  Diff 5  Diff 6  Diff 7
10   0.01    Drif 1  frequency   409     72      5       1       5       95      70
                     weight      30.27   7.26    0.41    0.04    0.41    7.57    5.38
             Drif 2  frequency   60      84      2       0       0       31      22
                     weight      5.94    6.67    0.13    0.01    0.02    2.64    1.98
             Drif 3  frequency   125     5       0       0       0       3       11
                     weight      22.91   2.50    0.15    0.04    0.10    3.06    2.51
10   0.005   Drif 1  frequency   449     86      6       0       4       73      45
                     weight      33.19   8.07    0.53    0.02    0.30    5.61    3.51
             Drif 2  frequency   64      96      3       0       0       26      8
                     weight      6.61    7.65    0.19    0.00    0.01    1.95    0.89
             Drif 3  frequency   129     4       1       0       0       2       4
                     weight      24.63   2.94    0.26    0.02    0.07    2.07    1.48
50   0.01    Drif 1  frequency   832     58      2       0       1       1       12
                     weight      62.59   5.19    0.19    0.00    0.10    0.08    0.99
             Drif 2  frequency   2       13      0       0       0       0       0
                     weight      0.29    1.12    0.00    0.00    0.00    0.00    0.04
             Drif 3  frequency   79      0       0       0       0       0       0
                     weight      28.43   0.74    0.01    0.00    0.00    0.01    0.21
50   0.005   Drif 1  frequency   841     59      3       0       2       0       7
                     weight      62.80   5.30    0.30    0.00    0.19    0.00    0.59
             Drif 2  frequency   3       13      0       0       0       0       0
                     weight      0.31    1.15    0.00    0.00    0.00    0.00    0.00
             Drif 3  frequency   72      0       0       0       0       0       0
                     weight      28.46   0.76    0.01    0.00    0.00    0.00    0.12
5. Appendix
Proof of Theorem 3.1. In the following, we consider the zero-extended versions of and just for simplicity of the discussion. Applying the change of variables, we have
Below we show that
For a fixed positive constant , we divide into
First we look at the integral over . Taylor expansion around gives
Here, for any , the second term on the right-hand side is bounded by
It follows from [8, Theorem 2(d)] that for any subsequence , we can pick a further subsequence fulfilling that for any
We hereafter denote by the set on which these convergences hold. For simplicity, we write
For any and , we can pick and small enough satisfying that for all and , . For any set , we define the indicator function by:
Then, for all and , we can choose such that for all ,