State-space models (also known as continuous-state hidden Markov models) are a powerful and versatile tool for the statistical modeling of complex time-series data and stochastic dynamic systems. Such a model can be viewed as a discrete-time Markov process that is observed only through noisy measurements of its states. In this context, one of the most important problems is the optimal estimation of the current state given the noisy measurements of the current and previous states. In the statistics and engineering literature, this problem is known as optimal filtering, while the corresponding estimator is called the optimal filter. Due to its practical and theoretical importance, optimal filtering has been studied in a number of papers and books (see e.g., ,  and references cited therein). However, to the best of our knowledge, the existing results do not address the higher-order derivatives of the optimal filter or their stability properties at all. The purpose of the results presented here is to fill this gap in the literature on optimal filtering.
In many applications, a state-space model depends on a parameter whose value needs to be inferred from data. When the number of data points is large, it is desirable, for the sake of computational efficiency, to infer the parameter recursively (i.e., online). In the maximum likelihood approach, recursive parameter estimation can be performed using stochastic gradient search and the optimal filter (first-order) derivative (see , , ; see also ,  and references cited therein). In , a link between the asymptotic properties of recursive maximum likelihood estimation (convergence and convergence rate) and the analytical properties of the underlying log-likelihood (higher-order differentiability and analyticity) has been established in the context of finite-state hidden Markov models. In view of the recent results on stochastic gradient search , a similar link is likely to hold for state-space models. However, to apply the results of 
to recursive maximum likelihood estimation in state-space models, it is necessary to establish results on the higher-order differentiability of the log-likelihood for such models. Since the log-likelihood for any state-space model is a functional of the optimal filter, the analytical properties of such log-likelihood (including the higher order differentiability) are tightly connected to the existence and stability of the optimal filter higher-order derivatives. Hence, one of the first steps in the asymptotic analysis of recursive maximum likelihood estimation in state-space models would be establishing results on the existence and stability properties of these derivatives. Our results presented here are meant to provide a basis for this step.
To achieve a faster convergence rate of recursive maximum likelihood estimation in state-space models, it is desirable to maximize the underlying log-likelihood using the (stochastic) Newton method instead of stochastic gradient search. As the Newton method relies on the information matrix (i.e., on the Hessian of the log-likelihood), the second-order derivative of the optimal filter is needed to estimate this matrix (for details see , ). Hence, to gain any theoretical insight into the asymptotic behavior of the approach based on the Newton method, it is necessary to establish results on the existence and stability of the optimal filter second-order derivative. Such results are covered as a particular case by the analysis carried out here.
In this paper, the optimal filter higher-order derivatives and their existence and stability properties are studied. Under (relatively) mild stability and regularity conditions, we show that these derivatives exist and forget their initial conditions exponentially fast. We also show that the optimal filter higher-order derivatives are geometrically ergodic. The obtained results cover a (relatively) large class of state-space models met in practice. They are also relevant for several (theoretically and practically) important problems arising in statistical inference, system identification and information theory. For example, the results presented here are one of the first stepping stones toward analyzing the asymptotic behavior of recursive maximum likelihood estimation in non-linear state-space models (see ).
The paper is organized as follows. In Section 2, the existence and stability of the optimal filter higher-order derivatives are studied and the main results are presented. In Section 3, the main results are used to study the analytical properties of log-likelihood for state-space models. An example illustrating the main results is provided in Section 4. In Sections 5 – 8, the main results and their corollaries are proved.
2 Main Results
2.1 State-Space Models and Optimal Filter
To specify state-space models and to formulate the problem of optimal filtering, we use the following notation. and are integers. and are Borel sets. is a transition kernel on .
is a conditional probability measure on given . is a probability space. is an -valued stochastic process which is defined on and satisfies
almost surely for each and any Borel set . In the statistics and engineering literature, stochastic process is called a state-space model. are the (unobservable) model states, while are the state-observations. can be viewed as a noisy measurement of state . States
form a Markov chain, while is their transition kernel. Conditionally on , state-observations are mutually independent, while is the conditional distribution of given .
In the context of state-space models, one of the most important problems is the estimation of the current state given the state-observations . This problem is known as filtering. In the Bayesian approach, the optimal estimation of given is based on the (optimal) filtering distribution . In practice, and are rarely available, and therefore, the filtering distribution is computed using some approximate (i.e., misspecified) models.
In this paper, we assume that the model can accurately be approximated by a parametric family of state-space models. To define such a family, we rely on the following notation. is an integer. is an open set. is the set of probability measures on . and are measures on and (respectively). and are functions which map , , to and satisfy
for all , . With this notation, the approximate hidden Markov models can be specified as a family of -valued stochastic processes which are defined on , parameterized by , and satisfy
almost surely for each and any Borel set . (Footnote 1: To evaluate the values of for which provides the best approximation to , we usually rely on the maximum likelihood principle. For further details on maximum likelihood estimation in state-space and hidden Markov models, see ,  and references cited therein.)
To explain how the filtering distribution is computed using approximate model , we need the following notation. is the collection of Borel-sets in . is the function defined by
for , , . are the functions recursively defined by
for and a sequence in (, , have the same meaning as in (1)). and are the function and the probability measure (respectively) defined by
for each , , , and any sequence in . In this context, can be interpreted as the initial condition of the filtering distribution .
2.2 Optimal Filter Higher-Order Derivatives
Let . Throughout the paper, we assume that and are -times differentiable in for each , , .
To define the higher-order derivatives of the optimal filter, we use the following notation. is the set of non-negative integers. For , , notation and stand for
For , , relation is taken component-wise, i.e., if and only if for each . For , satisfying , denotes the multinomial coefficient
is the element of whose all components are zero. is the integer defined by
(notice that is the number of partial derivatives of order up to ). is the set of finite signed measures on (i.e., for each , ). is the set of
-dimensional finite vector measures on. The components of an element of are indexed by multi-indices in and ordered lexicographically. More specifically, an element of is denoted by
where . (Footnote 2: can also be defined as a -additive function mapping to . Thus, for each , is a -dimensional vector and is its component.) is referred to as the component of . The components of are lexicographically ordered. (Footnote 3: In (4), the component precedes the component if and only if , for some and each satisfying , , where , .) is the set of -dimensional finite vector measures whose component is a probability measure. (Footnote 4: specified in (4) belongs to if and only if , for , .) For , notation stands for the total variation norm of . For , notation stands for the total variation norm of induced by the vector norm, i.e.,
for specified in (4).
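The multi-index bookkeeping above can be checked numerically. In the sketch below the dimension d and the order n are hypothetical values (the paper's symbols are not reproduced in this extraction), the ordering is the standard lexicographic order on tuples, and the identity N = C(n+d, d) is the standard count of multi-indices of order at most n.

```python
from itertools import product
from math import comb

d, n = 3, 2                     # hypothetical dimension and derivative order

# All multi-indices alpha in N_0^d with |alpha| <= n, in lexicographic order
# (Python sorts tuples lexicographically).
multis = sorted(a for a in product(range(n + 1), repeat=d) if sum(a) <= n)

# N, the number of partial derivatives of order up to n, is C(n+d, d).
N = len(multis)
assert N == comb(n + d, d)
assert multis[0] == (0,) * d    # the zero multi-index comes first
```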
Besides the previously introduced notation, we rely here on the following notation, too. and are the functions defined by
for , , , , , , . are the functions recursively defined by
(, , , , have the same meaning as in (5)). (Footnote 5: Equation (6) is a recursion in . In this recursion, is the initial condition. At iteration of (6) (), function is computed for multi-indices , using the results obtained at the previous iterations.) , and are the elements of defined by
for (, , , , have the same meaning as in (5)), while , , are a ‘short-hand’ notation for , , (respectively). , and are the functions defined by
(, , , , have the same meaning as in (5)). is the element of defined by
(, , have the same meaning as in (5)). (Footnote 6: is the component of . since .) are the elements of recursively defined by
for and a sequence in (, have the same meaning as in (5)), while is the component of . is the function defined by
for , , , ( is the component of ).
As demonstrated in Theorem 2.1, is the vector of the optimal filter derivatives of order up to . More specifically, is the optimal filter derivative of order , i.e.,
for each , , , , , and any sequence in .
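A first-order instance of such a derivative recursion can be sketched by differentiating the filter recursion with respect to a scalar parameter and validating the result against a finite difference. Everything below (the grid, the parameter-free transition matrix, the Gaussian observation density with mean theta*x) is a hypothetical stand-in, not the construction defined above; the point is only that the filter and its derivative propagate jointly through one recursion.

```python
import numpy as np

m = 4
grid = np.linspace(-1.0, 1.0, m)
P = np.full((m, m), 1.0 / m)            # parameter-free transition matrix (assumption)

def g(theta, y):                        # hypothetical observation density values
    return np.exp(-0.5 * (y - theta * grid) ** 2)

def dg(theta, y):                       # its derivative in theta
    return g(theta, y) * (y - theta * grid) * grid

def step(p, dp, theta, y):
    """Jointly propagate the filter p and its theta-derivative dp."""
    pred, dpred = P.T @ p, P.T @ dp
    u = g(theta, y) * pred
    du = dg(theta, y) * pred + g(theta, y) * dpred
    s, ds = u.sum(), du.sum()
    return u / s, du / s - u * ds / s ** 2   # quotient rule for the normalization

theta, ys = 0.5, [0.2, -0.4, 0.7]
p, dp = np.full(m, 1.0 / m), np.zeros(m)     # the initial condition has zero derivative
for y in ys:
    p, dp = step(p, dp, theta, y)

def run(th):                            # plain filter, for a finite-difference check
    q = np.full(m, 1.0 / m)
    for y in ys:
        u = g(th, y) * (P.T @ q)
        q = u / u.sum()
    return q

eps = 1e-6
fd = (run(theta + eps) - run(theta - eps)) / (2 * eps)
assert np.allclose(dp, fd, atol=1e-5)
assert abs(dp.sum()) < 1e-12            # the derivative of a probability vector sums to 0
```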
2.3 Existence and Stability Results
We analyze here the existence and stability of the optimal filter higher-order derivatives. The analysis is carried out under the following assumptions.
There exists a real number and for each , , there exists a measure on such that and
for all , .
There exists a function such that
for all , , and any multi-index , .
There exists a function such that
for all , , .
Assumptions 2.1 – 2.3 correspond to , and their (higher-order) derivatives. Assumption 2.1 ensures that the filtering distribution forgets its initial condition exponentially fast (see Proposition 5.2). Assumption 2.2 provides for the higher-order score functions to be well-defined and uniformly bounded in , . Together with Assumption 2.2, Assumption 2.3 ensures the higher-order differentiability of the filtering distribution (see Theorem 2.1, Proposition 7.1 and their proofs). In this or a similar form, Assumptions 2.1 – 2.3 have been a standard ingredient of many results on the asymptotic properties of the optimal filter and its particle approximations (see e.g., , ; see also , ,  and references cited therein). These assumptions have also routinely been used in a number of results on the asymptotic properties of maximum likelihood estimation in state-space and hidden Markov models (see , , , , ; see also , ,  and references cited therein).
Assumptions 2.1 – 2.3 hold if is a compact set and is a mixture (in ) of Gaussian, Gamma, logistic, Pareto and/or Weibull densities. (Footnote 7: If is compact and is a mixture of Gaussian, Gamma, logistic, Pareto and/or Weibull densities, then it is reasonable to assume the following:)
Our results on the existence and stability of the optimal filter higher-order derivatives are presented in the next two theorems.
Theorem 2.1 (Higher-Order Differentiability).
Theorem 2.2 (Forgetting).
Theorems 2.1 and 2.2 are proved in Sections 7 and 5 (respectively). Theorem 2.1 claims that the filtering density and the filtering distribution are times differentiable in . It also shows how the filtering density and distribution can be computed recursively using mappings , . On the other hand, according to Theorem 2.2, the filtering distribution and its higher-order derivatives forget their initial conditions exponentially fast.
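The forgetting property asserted by Theorem 2.2 can be illustrated numerically: run the filter from two different initial conditions on the same observation record and watch the total variation distance between the resulting filtering distributions shrink. The strongly mixing transition matrix below is a hypothetical example in the spirit of the minorization in Assumption 2.1, not a model from the paper, and the observation likelihood is a toy choice.

```python
import numpy as np

m = 6
grid = np.linspace(-1.0, 1.0, m)
P = 0.2 * np.eye(m) + 0.8 / m           # strongly mixing transition matrix (assumption)
g = lambda y: np.exp(-0.5 * (y - grid) ** 2)   # toy observation likelihoods

def step(p, y):
    """One filter step: predict, correct, normalize."""
    u = g(y) * (P.T @ p)
    return u / u.sum()

rng = np.random.default_rng(1)
p = np.eye(m)[0]                        # initial condition: a point mass
q = np.full(m, 1.0 / m)                 # initial condition: uniform
tv = []
for y in rng.normal(size=30):           # one shared observation record
    p, q = step(p, y), step(q, y)
    tv.append(0.5 * np.abs(p - q).sum())

# the filters started from different initial conditions merge exponentially fast
assert tv[-1] < 1e-5 < tv[0]
```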
In the rest of the section, we study the ergodicity properties of the optimal filter higher-order derivatives. To do so, we use the following notation. is the set defined by . is the collection of Borel-sets in . is a function which maps , , , to . is another notation for , i.e., for , , , and . and are stochastic processes defined by
for , , , where denotes stochastic process (i.e., ). and are the kernels on defined by
for , , , , and . Then, it is easy to show that and are homogeneous Markov processes whose transition kernels are and (respectively).
To analyze the ergodicity properties of and , we rely on the following assumptions.
There exist a probability measure on and real numbers , such that
for all , , .
There exist a function and a real number such that
for all , , , . There also exists a real number such that
for all , where .
Assumption 2.4 corresponds to state-space model and its stability. According to this assumption, Markov processes and are geometrically ergodic (for more details on geometric ergodicity, see ). Assumption 2.5 is related to function and its analytical properties. It requires to be locally Lipschitz continuous in and to grow in at most polynomially. Assumption 2.5 also requires the conditional mean of given to be uniformly bounded in . In this or a similar form, Assumptions 2.4 and 2.5 are involved in many results on the stability of the optimal filter and the asymptotic properties of maximum likelihood estimation in state-space and hidden Markov models (see e.g. , , , , , ; see also ,  and references cited therein).
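The kind of geometric ergodicity granted by Assumption 2.4 can be illustrated on a toy chain: a Doeblin-type minorization P(x, .) >= eps * mu(.) forces the n-step distributions to approach the invariant distribution at rate (1 - eps)^n in total variation. The matrix below is hypothetical and unrelated to the paper's model; it only demonstrates the geometric rate.

```python
import numpy as np

m = 5
rng = np.random.default_rng(2)
Q = rng.random((m, m))
Q /= Q.sum(axis=1, keepdims=True)
eps = 0.3
P = eps / m + (1 - eps) * Q              # P(x, .) >= (eps / m) * uniform

# invariant distribution: left eigenvector of P for eigenvalue 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

p = np.eye(m)[0]                         # start from a point mass
tv = []
for _ in range(25):
    p = P.T @ p
    tv.append(0.5 * np.abs(p - pi).sum())

# geometric decay: TV distance after n steps is at most (1 - eps)^n
assert all(tv[n] <= (1 - eps) ** (n + 1) + 1e-12 for n in range(25))
```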
Our results on the ergodicity of and are presented in the next theorem.