Stability of Optimal Filter Higher-Order Derivatives

June 25, 2018 · Vladislav Z. B. Tadić et al. · University of Bristol and University of Oxford

In many scenarios, a state-space model depends on a parameter which needs to be inferred from data. Using stochastic gradient search and the optimal filter (first-order) derivative, the parameter can be estimated online. To analyze the asymptotic behavior of online methods for parameter estimation in non-linear state-space models, it is necessary to establish results on the existence and stability of the optimal filter higher-order derivatives. The existence and stability properties of these derivatives are studied here. We show that the optimal filter higher-order derivatives exist and forget initial conditions exponentially fast. We also show that the optimal filter higher-order derivatives are geometrically ergodic. The obtained results hold under (relatively) mild conditions and apply to state-space models met in practice.


1 Introduction

State-space models (also known as continuous-state hidden Markov models) are a powerful and versatile tool for statistical modeling of complex time-series data and stochastic dynamic systems. These models can be viewed as a discrete-time Markov process which can be observed only through noisy measurements of its states. In this context, one of the most important problems is the optimal estimation of the current state given the noisy measurements of the current and previous states. In the statistics and engineering literature, this problem is known as optimal filtering, while the corresponding estimator is called the optimal filter. Due to its (practical and theoretical) importance, optimal filtering has been studied in a number of papers and books (see e.g. [3], [4], [9] and references cited therein). However, to the best of our knowledge, the existing results do not address at all the optimal filter higher-order derivatives and their stability properties. The purpose of the results presented here is to fill this gap in the literature on optimal filtering.

In many applications, a state-space model depends on a parameter whose value needs to be inferred from data. When the number of data points is large, it is desirable, for the sake of computational efficiency, to infer the parameter recursively (i.e., online). In the maximum likelihood approach, recursive parameter estimation can be performed using stochastic gradient search and the optimal filter (first-order) derivative (see [10], [15], [17]; see also [3], [9] and references cited therein). In [17], a link between the asymptotic properties of recursive maximum likelihood estimation (convergence and convergence rate) and the analytical properties of the underlying log-likelihood (higher-order differentiability and analyticity) has been established in the context of finite-state hidden Markov models. In view of the recent results on stochastic gradient search [19], a similar link is likely to hold for state-space models. However, to apply the results of [19] to recursive maximum likelihood estimation in state-space models, it is necessary to establish results on the higher-order differentiability of the log-likelihood for such models. Since the log-likelihood for any state-space model is a functional of the optimal filter, the analytical properties of the log-likelihood (including its higher-order differentiability) are tightly connected to the existence and stability of the optimal filter higher-order derivatives. Hence, one of the first steps in the asymptotic analysis of recursive maximum likelihood estimation in state-space models is establishing results on the existence and stability properties of these derivatives. The results presented here are meant to provide a basis for this step.

In order to achieve a faster convergence rate of recursive maximum likelihood estimation in state-space models, it is desirable to maximize the underlying log-likelihood using the (stochastic) Newton method instead of stochastic gradient search. As the Newton method relies on the information matrix (i.e., on the Hessian of the log-likelihood), the second-order derivative of the optimal filter is needed to estimate this matrix (for details see [10], [15]). Hence, to gain any theoretical insight into the asymptotic behavior of the approach based on the Newton method, it is necessary to establish results on the existence and stability of the optimal filter second-order derivative. Such results are included as a particular case of the analysis carried out here.
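To make the role of the log-likelihood Hessian concrete, here is a minimal sketch of a Newton-type update under the assumption that a callable log_lik(theta) evaluating the log-likelihood is available; the gradient and Hessian are approximated by central finite differences. This is our illustration only, not the recursive method analyzed in the paper.

```python
import numpy as np

def newton_step(log_lik, theta, h=1e-4):
    """One Newton ascent step on a generic log-likelihood, with gradient and
    Hessian (information matrix up to sign) from central finite differences."""
    theta = np.asarray(theta, dtype=float)
    d = theta.size
    grad = np.zeros(d)
    hess = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d); e[i] = h
        grad[i] = (log_lik(theta + e) - log_lik(theta - e)) / (2 * h)
        for j in range(d):
            f = np.zeros(d); f[j] = h
            hess[i, j] = (log_lik(theta + e + f) - log_lik(theta + e - f)
                          - log_lik(theta - e + f) + log_lik(theta - e - f)) / (4 * h * h)
    # near a maximum the Hessian is negative definite, so this step ascends
    return theta - np.linalg.solve(hess, grad)
```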

In this paper, the optimal filter higher-order derivatives and their existence and stability properties are studied. Under (relatively) mild stability and regularity conditions, we show that these derivatives exist and forget initial conditions exponentially fast. We also show that the optimal filter higher-order derivatives are geometrically ergodic. The obtained results cover a (relatively) large class of state-space models met in practice. They are also relevant for several (theoretically and practically) important problems arising in statistical inference, system identification and information theory. E.g., the results presented here are one of the first stepping stones to analyzing the asymptotic behavior of recursive maximum likelihood estimation in non-linear state-space models (see [20]).

The paper is organized as follows. In Section 2, the existence and stability of the optimal filter higher-order derivatives are studied and the main results are presented. In Section 3, the main results are used to study the analytical properties of the log-likelihood for state-space models. An example illustrating the main results is provided in Section 4. In Sections 5–8, the main results and their corollaries are proved.

2 Main Results

2.1 State-Space Models and Optimal Filter

To specify state-space models and to formulate the problem of optimal filtering, we use the following notation. d_x ≥ 1 and d_y ≥ 1 are integers. X ⊆ ℝ^{d_x} and Y ⊆ ℝ^{d_y} are Borel sets. P(x, dx′) is a transition kernel on X. Q(x, dy) is a conditional probability measure on Y given x ∈ X. (Ω, F, P) is a probability space. {(X_n, Y_n)}_{n≥0} is an X × Y-valued stochastic process which is defined on (Ω, F, P) and satisfies

P((X_{n+1}, Y_{n+1}) ∈ B × C | X_{0:n}, Y_{0:n}) = ∫_B Q(x′, C) P(X_n, dx′)

almost surely for each n ≥ 0 and any Borel sets B ⊆ X, C ⊆ Y. In the statistics and engineering literature, the stochastic process {(X_n, Y_n)}_{n≥0} is called a state-space model. {X_n}_{n≥0} are the (unobservable) model states, while {Y_n}_{n≥0} are the state-observations. Y_n can be viewed as a noisy measurement of state X_n. The states {X_n}_{n≥0} form a Markov chain, while P(x, dx′) is their transition kernel. Conditionally on {X_n}_{n≥0}, the state-observations {Y_n}_{n≥0} are mutually independent, while Q(X_n, dy) is the conditional distribution of Y_n given X_n.
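To make this setup concrete, the following minimal sketch simulates such a model; the linear-Gaussian transition and observation densities (and all names and parameters) are illustrative assumptions of ours, not part of the paper.

```python
import numpy as np

def simulate_ssm(n, phi=0.9, sigma=0.5, rng=None):
    """Simulate an illustrative state-space model:
    states:       X_{t+1} = phi * X_t + sigma * V_t,  V_t ~ N(0, 1)  (kernel P)
    observations: Y_t     = X_t + W_t,                W_t ~ N(0, 1)  (kernel Q)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.zeros(n)
    y = np.zeros(n)
    x[0] = rng.normal()
    y[0] = x[0] + rng.normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + sigma * rng.normal()  # Markov state transition
        y[t] = x[t] + rng.normal()                    # noisy measurement of X_t
    return x, y
```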

In the context of state-space models, one of the most important problems is the estimation of the current state X_n given the state-observations Y_1, …, Y_n. This problem is known as filtering. In the Bayesian approach, the optimal estimation of X_n given Y_1, …, Y_n is based on the (optimal) filtering distribution, i.e., the conditional distribution of X_n given Y_1, …, Y_n. In practice, the kernels P and Q are rarely available, and therefore, the filtering distribution is computed using an approximate (i.e., misspecified) model.

In this paper, we assume that the model {(X_n, Y_n)}_{n≥0} can accurately be approximated by a parametric family of state-space models. To define such a family, we rely on the following notation. d ≥ 1 is an integer. Θ ⊆ ℝ^d is an open set. P(X) is the set of probability measures on X. μ and ν are measures on X and Y (respectively). p_θ(x′|x) and q_θ(y|x) are functions which map θ ∈ Θ, x, x′ ∈ X, y ∈ Y to [0, ∞) and satisfy

∫ p_θ(x′|x) μ(dx′) = ∫ q_θ(y|x) ν(dy) = 1

for all θ ∈ Θ, x ∈ X. With this notation, the approximate hidden Markov models can be specified as a family of X × Y-valued stochastic processes {(X_n^θ, Y_n^θ)}_{n≥0} which are defined on (Ω, F, P), parameterized by θ ∈ Θ, and satisfy

P((X_{n+1}^θ, Y_{n+1}^θ) ∈ B × C | X_{0:n}^θ, Y_{0:n}^θ) = ∫_B ∫_C q_θ(y|x′) p_θ(x′|X_n^θ) ν(dy) μ(dx′)

almost surely for each n ≥ 0 and any Borel sets B ⊆ X, C ⊆ Y. (To evaluate the values of θ for which {(X_n^θ, Y_n^θ)}_{n≥0} provides the best approximation to {(X_n, Y_n)}_{n≥0}, we usually rely on the maximum likelihood principle. For further details on maximum likelihood estimation in state-space and hidden Markov models, see [3], [9] and references cited therein.)

To explain how the filtering distribution is computed using the approximate model {(X_n^θ, Y_n^θ)}_{n≥0}, we need the following notation. B(X) is the collection of Borel sets in X. r_θ(y, x′|x) is the function defined by

r_θ(y, x′|x) = q_θ(y|x′) p_θ(x′|x)   (1)

for θ ∈ Θ, x, x′ ∈ X, y ∈ Y. {r_θ^n(y_{1:n}, x′|x)}_{n≥1} are the functions recursively defined by r_θ^1 = r_θ and

r_θ^{n+1}(y_{1:n+1}, x′|x) = ∫ r_θ(y_{n+1}, x′|x″) r_θ^n(y_{1:n}, x″|x) μ(dx″)   (2)

for n ≥ 1 and a sequence y = {y_n}_{n≥1} in Y (θ, x, x′ have the same meaning as in (1)). p_θ^n(x|y, λ) and π_θ^n(dx|y, λ) are the function and the probability measure (respectively) defined by

p_θ^n(x|y, λ) = ∫ r_θ^n(y_{1:n}, x|x′) λ(dx′) / ∫∫ r_θ^n(y_{1:n}, x″|x′) λ(dx′) μ(dx″),   π_θ^n(dx|y, λ) = p_θ^n(x|y, λ) μ(dx)   (3)

for n ≥ 1, x ∈ X, λ ∈ P(X) (θ, y have the same meaning as in (1), (2)), while π_θ^0(dx|y, λ) is a ‘short-hand’ notation for λ(dx). Then, it can easily be shown that π_θ^n(dx|y, λ) is the filtering distribution (based on the approximate model {(X_n^θ, Y_n^θ)}_{n≥0}), i.e.,

π_θ^n(B|y, λ) = P(X_n^θ ∈ B | Y_{1:n}^θ = y_{1:n}, X_0^θ ∼ λ)

for each θ ∈ Θ, λ ∈ P(X), B ∈ B(X), n ≥ 1, and any sequence y = {y_n}_{n≥1} in Y. In this context, λ can be interpreted as the initial condition of the filtering distribution π_θ^n(dx|y, λ).
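The recursions (1)–(3) admit a direct numerical counterpart on a discretized state space. The sketch below is our illustration (the grid, the linear-Gaussian densities p_θ, q_θ and all names are assumptions): one call performs the prediction step (integration against p_θ) followed by the Bayes correction (multiplication by q_θ(y|·) and renormalization), mirroring the numerator/denominator structure of (3).

```python
import numpy as np

def gauss_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def filter_step(pi, grid, y, theta):
    """One step of the optimal filter on a grid: given the current filtering
    density pi (values on grid), return the filtering density after observing y."""
    phi, sigma = theta
    dx = grid[1] - grid[0]
    # prediction: integrate the transition density p_theta(x'|x) against pi
    trans = gauss_pdf(grid[None, :], phi * grid[:, None], sigma)  # trans[i, j] = p(x'_j | x_i)
    predicted = trans.T @ pi * dx
    # correction: multiply by the observation density q_theta(y|x') and renormalize
    unnormalized = predicted * gauss_pdf(y, grid, 1.0)
    return unnormalized / (unnormalized.sum() * dx)
```

Folding filter_step over y_1, …, y_n, starting from a density representing λ, yields a grid approximation of π_θ^n(·|y, λ).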

2.2 Optimal Filter Higher-Order Derivatives

Let p ≥ 1 be an integer. Throughout the paper, we assume that p_θ(x′|x) and q_θ(y|x) are p-times differentiable in θ for each θ ∈ Θ, x, x′ ∈ X, y ∈ Y.

To define the higher-order derivatives of the optimal filter, we use the following notation. ℕ_0 is the set of non-negative integers. For θ = (θ_1, …, θ_d) ∈ Θ and a multi-index α = (α_1, …, α_d) ∈ ℕ_0^d, the notation |α| and ∂^α stand for

|α| = α_1 + ⋯ + α_d,   ∂^α = ∂^{|α|} / (∂θ_1^{α_1} ⋯ ∂θ_d^{α_d}).

For α, β ∈ ℕ_0^d, the relation β ≤ α is taken component-wise, i.e., β ≤ α if and only if β_i ≤ α_i for each 1 ≤ i ≤ d. For α, β ∈ ℕ_0^d satisfying β ≤ α, the multinomial coefficient is denoted by

(α over β) = (α_1 over β_1) ⋯ (α_d over β_d).

0 is the element of ℕ_0^d whose components are all zero. N_p is the integer defined by N_p = |{α ∈ ℕ_0^d : |α| ≤ p}| (notice that N_p is the number of partial derivatives ∂^α of order up to p). M_s(X) is the set of finite signed measures on X. M_p(X) is the set of N_p-dimensional finite vector measures on X. The components of an element of M_p(X) are indexed by the multi-indices α ∈ ℕ_0^d, |α| ≤ p, and ordered lexicographically. More specifically, an element Λ of M_p(X) is denoted by

Λ = [λ_α]_{|α| ≤ p}   (4)

where λ_α ∈ M_s(X). (Λ can also be viewed as a σ-additive function mapping B(X) to ℝ^{N_p}.) Thus, for each B ∈ B(X), Λ(B) is an N_p-dimensional vector and λ_α(B) is its α-component. λ_α is referred to as the α-component of Λ. The components of Λ are lexicographically ordered: in (4), λ_α precedes λ_β if and only if α_i < β_i for some 1 ≤ i ≤ d and α_j = β_j for each 1 ≤ j < i. P_p(X) is the set of N_p-dimensional finite vector measures whose 0-component is a probability measure (i.e., Λ specified in (4) belongs to P_p(X) if and only if λ_0 ∈ P(X)). For λ ∈ M_s(X), the notation ‖λ‖ stands for the total variation norm of λ. For Λ ∈ M_p(X), the notation ‖Λ‖ stands for the total variation norm of Λ induced by the ℓ_1 vector norm, i.e.,

‖Λ‖ = Σ_{α ∈ ℕ_0^d, |α| ≤ p} ‖λ_α‖

for Λ specified in (4).
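To illustrate the bookkeeping behind (4), the following sketch (ours; the function name is hypothetical) enumerates the multi-indices α ∈ ℕ_0^d with |α| ≤ p in lexicographic order and confirms that their number, i.e., the dimension N_p, equals the binomial coefficient C(d + p, p).

```python
from itertools import product
from math import comb

def multi_indices(d, p):
    """All multi-indices alpha in N_0^d with |alpha| <= p, lexicographically ordered."""
    return sorted(a for a in product(range(p + 1), repeat=d) if sum(a) <= p)

d, p = 2, 2
idx = multi_indices(d, p)
print(idx)                       # (0, 0) indexes the filter itself; the rest index derivatives
print(len(idx), comb(d + p, p))  # N_p = C(d + p, p) = 6
```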

Besides the previously introduced notation, we rely here on the following notation, too. and are the functions defined by

(5)

for , , , , , , . are the functions recursively defined by

(6)

(, , , , have the same meaning as in (5)). (Equation (6) is a recursion in : the initial condition is provided by (5), and at each iteration of (6) the functions are computed for the multi-indices , using the results obtained at the previous iterations.) , and are the elements of defined by

(7)

for (, , , , have the same meaning as in (5)), while , , are a ‘short-hand’ notation for , , (respectively). , and are the functions defined by

(8)

(, , , , have the same meaning as in (5)). is the element of defined by

(9)

(, , have the same meaning as in (5)). ( is the component of ; since .) are the elements of recursively defined by

(10)

for and a sequence in (, have the same meaning as in (5)), while is the component of . is the function defined by

(11)

for (, , , , have the same meaning as in (5), (10)). is the element of defined by

(12)

for , , , ( is the component of ).

Remark.

As demonstrated in Theorem 2.1, the vector measure defined in (12) is the vector of the optimal filter derivatives of order up to p. More specifically, its α-component is the optimal filter derivative of order α, i.e., it coincides with ∂^α π_θ^n(dx|y, λ) for each θ ∈ Θ, n ≥ 1, λ ∈ P(X), any multi-index α ∈ ℕ_0^d with |α| ≤ p, and any sequence y = {y_n}_{n≥1} in Y.
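Numerically, the objects described in this remark can be approximated by differentiating a discretized filter recursion with respect to θ. The sketch below is our illustration (it reuses the hypothetical filter_step from the earlier sketch and approximates only the first-order derivatives by central finite differences); each returned row is a grid analogue of a signed measure, i.e., of a first-order component of the vector of filter derivatives.

```python
import numpy as np

def filter_derivative_fd(y_seq, grid, lam, theta, h=1e-5):
    """Central finite-difference approximation of the first-order derivative
    of the filtering density with respect to each component of theta."""
    def run(th):
        pi = lam.copy()
        for y in y_seq:
            pi = filter_step(pi, grid, y, th)  # filter_step: grid filter sketched above
        return pi
    theta = np.asarray(theta, dtype=float)
    grads = []
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        # each row is a signed density: it integrates to (numerically) zero,
        # consistent with the derivative of a probability measure
        grads.append((run(theta + e) - run(theta - e)) / (2 * h))
    return np.stack(grads)
```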

2.3 Existence and Stability Results

We analyze here the existence and stability of the optimal filter higher-order derivatives. The analysis is carried out under the following assumptions.

Assumption 2.1.

There exists a real number ε ∈ (0, 1) and, for each θ ∈ Θ, y ∈ Y, there exists a measure μ_{θ,y} on X such that 0 < μ_{θ,y}(X) < ∞ and

ε μ_{θ,y}(B) ≤ ∫_B r_θ(y, x′|x) μ(dx′) ≤ (1/ε) μ_{θ,y}(B)

for all x ∈ X, B ∈ B(X).

Assumption 2.2.

There exists a function φ : Y → [1, ∞) such that

|∂^α r_θ(y, x′|x)| ≤ φ(y) r_θ(y, x′|x)   (13)

for all θ ∈ Θ, x, x′ ∈ X, y ∈ Y and any multi-index α ∈ ℕ_0^d, 1 ≤ |α| ≤ p.

Assumption 2.3.

There exists a function such that

for all , , .

Assumptions 2.1–2.3 correspond to the kernels p_θ, q_θ and their (higher-order) derivatives. Assumption 2.1 ensures that the filtering distribution forgets its initial condition exponentially fast (see Proposition 5.2). Assumption 2.2 provides for the higher-order score functions to be well-defined and uniformly bounded in θ, y. Together with Assumption 2.2, Assumption 2.3 ensures the higher-order differentiability of the filtering distribution (see Theorem 2.1, Proposition 7.1 and their proofs). In this or a similar form, Assumptions 2.1–2.3 have been a standard ingredient of many results on the asymptotic properties of the optimal filter and its particle approximations (see e.g. [1], [5], [6], [11], [12]; see also [3], [4], [9] and references cited therein). These assumptions have also routinely been used in a number of results on the asymptotic properties of maximum likelihood estimation in state-space and hidden Markov models (see [2], [8], [10], [16], [17]; see also [3], [4], [9] and references cited therein). Assumptions 2.1–2.3 hold if X is a compact set and the observation density is a mixture of Gaussian, Gamma, logistic, Pareto and/or Weibull densities. (In this case, it is reasonable to assume that the derivatives of the model densities are bounded by a constant times a polynomial function of y; combining this with the Leibniz rule, the reasoning directly leads to Assumptions 2.2 and 2.3.) From the theoretical point of view, Assumption 2.1 is restrictive, as it (implicitly) requires the state space X to be bounded. However, as shown in Section 4, this assumption covers a (relatively) broad class of state-space models met in practice.
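In simple cases, Assumption 2.1 can be checked numerically. The sketch below (our illustration; the Gaussian densities, the grid and the choice of μ_{θ,y} are assumptions) restricts a linear-Gaussian model to a compact grid, takes μ_{θ,y} to be the x-average of r_θ(y, ·|x), and estimates the largest ε for which ε μ_{θ,y} ≤ r_θ(y, ·|x) ≤ (1/ε) μ_{θ,y} holds on the grid.

```python
import numpy as np

def check_mixing(grid, theta, y, obs_sd=1.0):
    """Numerically estimate the constant epsilon in Assumption 2.1 for a
    Gaussian kernel truncated to a compact grid (illustrative check only)."""
    phi, sigma = theta
    def gauss(x, mean, sd):
        return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    # r_theta(y, x'|x) = q_theta(y|x') p_theta(x'|x) evaluated on the grid
    r = gauss(y, grid[None, :], obs_sd) * gauss(grid[None, :], phi * grid[:, None], sigma)
    mu = r.mean(axis=0)      # candidate measure mu_{theta,y}: average over x
    ratio = r / mu[None, :]  # bounded away from 0 and infinity on a compact set
    return min(ratio.min(), 1.0 / ratio.max())

eps = check_mixing(np.linspace(-3, 3, 101), theta=(0.9, 0.5), y=0.3)
print(f"estimated epsilon: {eps:.3e}")  # small but strictly positive on the compact grid
```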

Our results on the existence and stability of the optimal filter higher-order derivatives are presented in the next two theorems.

Theorem 2.1 (Higher-Order Differentiability).

Let Assumptions 2.1–2.3 hold. Then, p_θ^n(x|y, λ) and π_θ^n(B|y, λ) are p-times differentiable in θ for each θ ∈ Θ, n ≥ 1, x ∈ X, B ∈ B(X), λ ∈ P(X), and any sequence y = {y_n}_{n≥1} in Y. Moreover, we have

(14)

for all θ ∈ Θ, n ≥ 1, B ∈ B(X), λ ∈ P(X), any multi-index α ∈ ℕ_0^d, |α| ≤ p, and any sequence y = {y_n}_{n≥1} in Y. (Here, ∂^α π_θ^n(dx|y, λ) is the measure with density ∂^α p_θ^n(x|y, λ) with respect to μ.)

Theorem 2.2 (Forgetting).

Let Assumptions 2.1 and 2.2 hold. Then, there exist real numbers γ ∈ (0, 1) and C ∈ [1, ∞) (depending only on ε and p) such that

(15)
(16)

for all θ ∈ Θ, n ≥ 1, and any sequence y = {y_n}_{n≥1} in Y.

Theorems 2.1 and 2.2 are proved in Sections 7 and 5 (respectively). Theorem 2.1 claims that the filtering density and the filtering distribution are p-times differentiable in θ. It also shows how the filtering density and distribution can be computed recursively using the mappings defined in Section 2.2. On the other hand, according to Theorem 2.2, the filtering distribution and its higher-order derivatives forget their initial conditions exponentially fast.
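The exponential forgetting asserted by (15) and (16) is easy to observe numerically: running the hypothetical grid filter from the earlier sketches with two different initial conditions and tracking the total-variation distance between the resulting filtering densities exhibits geometric decay.

```python
import numpy as np

# two different initial conditions lambda and lambda' on the grid
grid = np.linspace(-3, 3, 201)
dx = grid[1] - grid[0]
lam1 = np.exp(-0.5 * (grid - 2.0) ** 2); lam1 /= lam1.sum() * dx
lam2 = np.exp(-0.5 * (grid + 2.0) ** 2); lam2 /= lam2.sum() * dx

theta = (0.9, 0.5)
_, y_obs = simulate_ssm(30)  # simulate_ssm and filter_step: sketches above
pi1, pi2 = lam1, lam2
for n, y in enumerate(y_obs, start=1):
    pi1 = filter_step(pi1, grid, y, theta)
    pi2 = filter_step(pi2, grid, y, theta)
    tv = 0.5 * np.abs(pi1 - pi2).sum() * dx  # total variation distance
    print(n, f"{tv:.2e}")                    # decays geometrically in n
```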

In the rest of the section, we study the ergodicity properties of the optimal filter higher-order derivatives. To do so, we use the following notation. is the set defined by . is the collection of Borel sets in . is a function which maps , , , to . is another notation for , i.e., for , , , and . and are stochastic processes defined by

for , , , where denotes stochastic process (i.e., ). and are the kernels on defined by

for , , , , and . Then, it is easy to show that and are homogeneous Markov processes whose transition kernels are and (respectively).

To analyze the ergodicity properties of and , we rely on the following assumptions.

Assumption 2.4.

There exist a probability measure on and real numbers , such that

for all , , .

Assumption 2.5.

There exist a function and a real number such that

for all , , , . There also exists a real number such that

(17)

for all , where .

Assumption 2.4 corresponds to the state-space model and its stability. According to this assumption, the Markov processes and are geometrically ergodic (for more details on geometric ergodicity, see [14]). Assumption 2.5 is related to the function and its analytical properties. It requires this function to be locally Lipschitz continuous in and to grow in at most polynomially. Assumption 2.5 also requires the conditional mean of given to be uniformly bounded in . In this or a similar form, Assumptions 2.4 and 2.5 are involved in many results on the stability of the optimal filter and the asymptotic properties of maximum likelihood estimation in state-space and hidden Markov models (see e.g. [1], [5], [11], [12], [17], [20]; see also [3], [4] and references cited therein).

Our results on the ergodicity of and are presented in the next theorem.

Theorem 2.3 (Ergodicity).

Let Assumptions 2.1–2.5 hold. Moreover, let . Then, there exist functions , mapping to such that

for all , . ( and are the functions defined by

for , , .) There also exist real numbers , (depending only on , , , , , ) such that