# Analyticity of Entropy Rates of Continuous-State Hidden Markov Models

The analyticity of the entropy and relative entropy rates of continuous-state hidden Markov models is studied here. Using the analytic continuation principle and the stability properties of the optimal filter, the analyticity of these rates is shown for analytically parameterized models. The obtained results hold under relatively mild conditions and cover several classes of hidden Markov models met in practice. These results are relevant for several (theoretically and practically) important problems arising in statistical inference, system identification and information theory.

## 1 Introduction

Hidden Markov models are a powerful and versatile tool for statistical modeling of complex time-series data and stochastic dynamic systems. They can be described as a discrete-time Markov chain observed through noisy measurements of its states. In this context, the entropy rate can be interpreted as a measure of the information revealed by a model (through noisy measurements of the states), while the relative entropy rate can be viewed as a distance between two models.

The entropy rates of hidden Markov models and their analytical properties have recently gained significant attention from the information theory community. These properties and their links with statistical inference, system identification, stochastic optimization and information theory have been studied extensively in several papers [8]–[11], [12], [17], [18], [20], [21]. However, to the best of our knowledge, the existing results on the analytical properties of the entropy rates of hidden Markov models apply exclusively to models with finite state-spaces and do not address continuous-state models at all. The results presented here are meant to fill this gap in the literature on hidden Markov models and their entropy rates.

In [21], recursive maximum likelihood estimation in finite-state hidden Markov models has been analyzed and a link between its asymptotic properties (convergence and convergence rate) and the analyticity of the underlying average log-likelihood (i.e., of the underlying relative entropy rate) has been established. In view of the recent results on stochastic gradient search [23], a similar link is expected to hold for continuous-state hidden Markov models (including non-linear state-space models). However, to apply the results of [23] to recursive maximum likelihood estimation in continuous-state hidden Markov models, it is necessary to establish results on the analyticity of the average log-likelihood for these models. Hence, one of the first (and most important) steps in the asymptotic analysis of recursive maximum likelihood estimation in continuous-state hidden Markov models would be establishing the analyticity of the entropy rates of such models. The results presented here should provide a theoretical basis for this step.

In this paper, the analyticity of the entropy and relative entropy rates of continuous-state hidden Markov models is studied. Using the analytic continuation principle and the stability properties of the optimal filter, we show the analyticity of these rates for analytically parameterized models. The obtained results hold under (relatively) mild conditions and cover a (relatively) broad class of state-space and continuous-state hidden Markov models met in practice. Moreover, these results generalize the existing results on the analyticity of entropy rates of finite-state hidden Markov models. Further to this, the results presented here are relevant for several (theoretically and practically) important problems related to statistical inference, system identification and information theory.

The paper is organized as follows. In Section 2, the entropy rates of hidden Markov models are specified. In the same section, the main results are presented. Examples illustrating the main results are provided in Sections 3 and 4. In Sections 5–7, the main results are proved.

## 2 Main Results

To define hidden Markov models and their entropy rates, we use the following notation.

is a probability space.

and are integers, while and are Borel sets. is a transition kernel on , while is a conditional probability measure on given . Then, a hidden Markov model can be defined as the -valued stochastic process (i.e., , ) which is defined on and satisfies

$$P\big((X_{n+1},Y_{n+1})\in B \,\big|\, X_{0:n},Y_{0:n}\big)=\int\!\!\int I_B(x,y)\,Q(x,dy)\,P(X_n,dx)$$

almost surely for and any Borel set . are the (unobservable) model states, while are the state-observations. can be interpreted as a noisy measurement of state . States form a Markov chain, while is their transition kernel. Conditionally on , state-observations are mutually independent, while is the conditional distribution of given . For more details on hidden Markov models, see [1], [5] and references cited therein.
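The definition above can be illustrated with a short simulation. The following is a minimal sketch, assuming a hypothetical scalar model that is not taken from the paper: the state follows a Gaussian autoregression (playing the role of the transition kernel) and each observation is the current state plus independent Gaussian noise (playing the role of the observation kernel). The function name `simulate_hmm` and the parameters `a`, `sigma_x`, `sigma_y` are illustrative assumptions, and this toy model is not claimed to satisfy the assumptions introduced later.

```python
import math
import random


def simulate_hmm(n, seed=0, a=0.8, sigma_x=1.0, sigma_y=0.5):
    """Simulate states X_0..X_n and observations Y_0..Y_n of a toy HMM.

    Hypothetical kernels (illustration only):
      state:       X_{k+1} = a * X_k + sigma_x * V_k,  V_k ~ N(0, 1)
      observation: Y_k     = X_k + sigma_y * W_k,      W_k ~ N(0, 1)
    """
    rng = random.Random(seed)
    # Draw X_0 from the stationary law of the autoregression (requires |a| < 1).
    x = [rng.gauss(0.0, sigma_x / math.sqrt(1.0 - a * a))]
    y = [x[0] + sigma_y * rng.gauss(0.0, 1.0)]
    for _ in range(n):
        x_next = a * x[-1] + sigma_x * rng.gauss(0.0, 1.0)  # Markov transition
        x.append(x_next)
        y.append(x_next + sigma_y * rng.gauss(0.0, 1.0))    # noisy measurement
    return x, y


states, observations = simulate_hmm(1000)
```

Only `observations` would be available in practice; `states` is returned here to make the hidden layer visible.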

Besides the model , we also consider a parameterized family of hidden Markov models. To define such a family, we rely on the following notation. is an integer, while is an open set. is the set of probability measures on . and are measures on and (respectively). and are functions which map , , to and satisfy

$$\int_{\mathcal{X}} p_\theta(x'|x)\,\mu(dx')=\int_{\mathcal{Y}} q_\theta(y|x)\,\nu(dy)=1$$

for all , . Then, a parameterized family of hidden Markov models can be defined as a collection of -valued stochastic processes (i.e., , ) which are defined on , parameterized by , and satisfy

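$$
\begin{aligned}
P\big((X_0^{\theta,\lambda},Y_0^{\theta,\lambda})\in B\big) &= \int\!\!\int I_B(x,y)\,q_\theta(y|x)\,\lambda(dx)\,\nu(dy), \\
P\big((X_{n+1}^{\theta,\lambda},Y_{n+1}^{\theta,\lambda})\in B \,\big|\, X_{0:n}^{\theta,\lambda},Y_{0:n}^{\theta,\lambda}\big) &= \int\!\!\int I_B(x,y)\,q_\theta(y|x)\,p_\theta(x|X_n^{\theta,\lambda})\,\mu(dx)\,\nu(dy)
\end{aligned}
$$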

almost surely for and any Borel set . [Footnote 1: In the context of system identification, is interpreted as the true system (i.e., the system being identified), while is viewed as a candidate model for .]

To define the entropy rates, we need the following notation. is the function defined by

$$r_\theta(y,x'|x)=q_\theta(y|x')\,p_\theta(x'|x)$$

for , , . For , , , , let

$$q_\theta^n(y_{1:n}|\lambda)=\int\cdots\int\!\!\int\left(\prod_{k=1}^{n} r_\theta(y_k,x_k|x_{k-1})\right)\mu(dx_n)\cdots\mu(dx_1)\,\lambda(dx_0).$$

Moreover, for , , , let

 (1)

Then, the entropy rate of model (i.e., the entropy rate of stochastic process ) is defined as . Similarly, the relative entropy rate between models and (i.e., the relative entropy rate between stochastic processes and ) is defined as . For more details on the entropy rate, see [6], [7] and references cited therein. [Footnote 2: As pointed out in the introduction, the rates and are closely related to several important problems arising in engineering and statistics, in particular, to system identification. E.g., in the recursive maximum likelihood approach to system identification, the true system is identified through the maximization of (in this context, is referred to as the average log-likelihood).]
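The multiple integral defining the joint observation density can be evaluated numerically by a forward recursion once the state space is discretized. The following is a minimal sketch under stated assumptions, none of which come from the paper: the state space is the unit interval with Lebesgue reference measure approximated by a midpoint grid, the transition density is a Gaussian bump centred at `theta * x` and renormalized on the grid, the observation density is Gaussian and centred at the state, and the initial distribution is uniform. The function returns the logarithm of the joint observation density divided by the number of observations; all names and numerical values are hypothetical.

```python
import math


def log_likelihood_rate(y, theta, m=50, s=0.3, sigma_y=0.5):
    """Grid approximation of (1/n) * log q_theta^n(y_{1:n} | lambda) on X = [0, 1]."""
    grid = [(i + 0.5) / m for i in range(m)]  # midpoint grid on [0, 1]
    w = 1.0 / m                               # quadrature weight of each grid cell

    # Hypothetical transition density p_theta(x'|x), normalised so that
    # sum_j p[i][j] * w == 1 for every starting point grid[i].
    p = []
    for x in grid:
        row = [math.exp(-0.5 * ((xp - theta * x) / s) ** 2) for xp in grid]
        z = sum(row) * w
        p.append([v / z for v in row])

    def q(yk, x):
        # Hypothetical observation density q_theta(y|x): Gaussian centred at x.
        return math.exp(-0.5 * ((yk - x) / sigma_y) ** 2) / (sigma_y * math.sqrt(2.0 * math.pi))

    alpha = [1.0 / m] * m  # initial distribution lambda: uniform over grid cells
    log_q = 0.0
    for yk in y:
        # alpha_k(x_k) = sum over x_{k-1} of alpha_{k-1}(x_{k-1}) * r_theta(y_k, x_k | x_{k-1})
        new = [q(yk, grid[j]) * w * sum(alpha[i] * p[i][j] for i in range(m))
               for j in range(m)]
        c = sum(new)                 # predictive density of y_k given y_{1:k-1}
        log_q += math.log(c)         # the product of the c's equals q_theta^n
        alpha = [v / c for v in new] # renormalise to avoid numerical underflow
    return log_q / len(y)


rate = log_likelihood_rate([0.2, 0.4, 0.5, 0.3, 0.6], theta=0.9)
```

The renormalization step is a standard numerical device: it keeps the recursion stable for long observation sequences without changing the value of the accumulated logarithm.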

We study here the rates , and their analytical properties. To formulate the assumptions under which these rates are analyzed, we rely on the following notation. For , denotes the Euclidean norm of . For , is the open -vicinity of in , i.e.,

$$V_\gamma(\Theta)=\{\eta\in\mathbb{C}^d:\ \exists\,\theta\in\Theta,\ \|\eta-\theta\|<\gamma\}.$$
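As a small illustration of this definition, the sketch below tests membership in the complex vicinity in the special case where the parameter set is an open box, which is an assumption made here only for concreteness. For an open set, a complex vector lies in the vicinity exactly when its Euclidean distance to the closure of the set is smaller than the radius; for a box, that distance combines the imaginary parts with the coordinate-wise gaps of the real parts. The function name and arguments are hypothetical.

```python
import math


def in_complex_vicinity(eta, lower, upper, gamma):
    """Check whether the complex vector eta lies in V_gamma(Theta), where
    Theta is the open box (lower[0], upper[0]) x ... x (lower[d-1], upper[d-1])."""
    squared_distance = 0.0
    for z, lo, hi in zip(eta, lower, upper):
        # Gap between the real part and the interval [lo, hi] (zero if inside).
        real_gap = max(lo - z.real, 0.0, z.real - hi)
        squared_distance += real_gap ** 2 + z.imag ** 2
    return math.sqrt(squared_distance) < gamma


inside = in_complex_vicinity([0.5 + 0.05j], [0.0], [1.0], gamma=0.1)
outside = in_complex_vicinity([0.5 + 0.2j], [0.0], [1.0], gamma=0.1)
```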

Our analysis carried out here is based on the following assumptions.

###### Assumption 2.1.

There exists a real number and for each , , there exists a measure on such that

$$\varepsilon\,\lambda_\theta(B|y)\leq\int_B r_\theta(y,x'|x)\,\mu(dx')\leq\frac{\lambda_\theta(B|y)}{\varepsilon}$$

for all and any Borel set .

###### Assumption 2.2.

is real-analytic in for each , , . Moreover, has a complex-valued continuation with the following properties:

(i) maps , , to .

(ii) for all , , .

(iii) There exists a real number such that is analytic in for each , , .

(iv) There exists a function such that and

$$\big|\hat r_\eta(y,x'|x)\big|\leq\varphi(y)$$

for all , , .

###### Assumption 2.3.

There exists a real number such that

$$\int r_\theta(y,x'|x)\,\mu(dx')\geq\gamma\,\varphi(y)$$

for all , , .

.

###### Assumption 2.5.

There exists a real number such that

$$\int|\log\varphi(y)|\,Q(x,dy)\leq K$$

for all . Moreover, there exists a probability measure on and a real number such that

$$|P^n(x,B)-\pi(B)|\leq K\rho^n$$

for all , and any Borel-set .

Assumption 2.1 is related to the stability of the hidden Markov model and its optimal filter, while Assumptions 2.2–2.4 correspond to the parameterization of the model . Assumption 2.1 (together with Assumptions 2.2, 2.4) ensures that the transition kernel of and its analytical continuation are geometrically ergodic (see Lemma 5.4). Assumption 2.1 (together with Assumptions 2.2–2.4) also ensures that the optimal filter and its analytical continuation forget initial conditions at an exponential rate (see Lemmas 6.2, 6.6). Assumptions 2.1–2.4 cover several important classes of hidden Markov models met in practice (for further details, see Sections 3, 4). In this or similar form, Assumption 2.1 is an ingredient of a number of asymptotic results on optimal filtering and maximum likelihood estimation in hidden Markov models (see [3], [4], [14], [15]; see also [1], [2], [5] and references cited therein).

Assumption 2.5 corresponds to the stability of the hidden Markov model . According to this assumption, stochastic processes and are geometrically ergodic (for further details on geometric ergodicity, see e.g., [16]).
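The geometric bound in Assumption 2.5 can be visualized on a finite-state stand-in, which is used here purely for illustration since the paper concerns continuous state spaces. The sketch below takes a hypothetical three-state transition matrix, approximates its invariant distribution by long iteration, and records the total variation distance between the n-step distribution started at a fixed state and the invariant distribution. For probability vectors, the supremum over sets of the absolute difference equals half of the L1 distance, which is what `total_variation` computes.

```python
def step(dist, P):
    """Propagate a distribution one step: row vector times transition matrix."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def total_variation(p, q):
    """sup_B |p(B) - q(B)| for two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical transition matrix (rows sum to one); not taken from the paper.
P = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.1, 0.3, 0.6]]

# Approximate the invariant distribution pi by iterating the chain for a long time.
pi = [1.0 / 3.0] * 3
for _ in range(2000):
    pi = step(pi, P)

# Distances between P^n(x, .) and pi for the initial state x = 0 and n = 1..20.
dist = [1.0, 0.0, 0.0]
distances = []
for _ in range(20):
    dist = step(dist, P)
    distances.append(total_variation(dist, pi))
```

Plotting `distances` on a logarithmic scale should give an approximately straight line, whose slope corresponds to the logarithm of the rate constant in the assumption.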

The main results of the paper are contained in the next two theorems.

###### Theorem 2.1.

Let Assumptions 2.1–2.3 and 2.5 hold. Then, there exists a function such that is real-analytic for each and for all , .

###### Theorem 2.2.

Let Assumptions 2.1–2.4 hold. Then, there exists a function such that is real-analytic for each and for all , .

###### Remark.

As can be represented as a union of open balls, it is sufficient to show Theorems 2.1 and 2.2 for the case where is convex. Therefore, throughout the analysis carried out in Sections 5–8, we assume that is an open convex set.

Theorems 2.1 and 2.2 are proved in Section 7. According to these theorems, for all , , rates and are well-defined. According to the same theorems, for each , rates and are independent of and real-analytic in .

Recently, the analytical properties of the entropy rates of hidden Markov models have been studied extensively in several papers [8]–[12], [17], [18], [20], [21]. However, the results presented therein apply only to models with finite state-spaces and do not address continuous-state models at all. To the best of our knowledge, Theorems 2.1 and 2.2 are the first results on the analyticity of the entropy rates of continuous-state hidden Markov models. These theorems also generalize the existing results on the analyticity of the entropy rates of finite-state hidden Markov models. [Footnote 3: The results presented in [10] can be considered as the strongest result on the analytical properties of the entropy rate of finite-state hidden Markov models. When has a finite number of elements, Theorem 2.2 includes all results of [10] as a particular case. Theorem 2.2 also simplifies (considerably) the conditions which the results of [10] are based on. Moreover, combining the techniques used in the proof of Theorem 2.2 (in particular, Lemma 5.4, Section 5) with the results of [21], the results presented in [10] can further be generalized (for details, see Appendix 1).] Further to this, Theorems 2.1 and 2.2 are relevant for several (theoretically and practically) important problems arising in statistical inference and system identification. E.g., in [24], we crucially rely on these theorems to analyze recursive maximum likelihood estimation in non-linear state-space models. The same theorems can also be used to study the higher-order statistical asymptotics for maximum likelihood estimation in time-series models (for details on such asymptotics, see [25]).

## 3 Example: Mixture of Densities

In this section, the main results are applied to the case when and are mixtures of densities. More specifically, we consider the case where

$$p_\theta(x'|x)=\sum_{i=1}^{N_x} a_\theta^i(x)\,v_i(x'),\qquad q_\theta(y|x)=\sum_{j=1}^{N_y} b_\theta^j(x)\,w_j(y) \tag{2}$$

for , , (, , have the same meaning as in the previous section). Here, and are integers. and are functions which map , to and satisfy

$$\int v_i(x)\,\mu(dx)=\int w_j(y)\,\nu(dy)=1$$

for each , (, have the same meaning as in the previous section). [Footnote 4: and are probability densities on and (respectively). They can be considered as components of the mixtures (2).] and are functions which map , to and satisfy

$$\sum_{i=1}^{N_x} a_\theta^i(x)=\sum_{j=1}^{N_y} b_\theta^j(x)=1$$

for each , . [Footnote 5: and are probability masses in and (respectively). They can be viewed as weights in the mixtures (2).]
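The mixture structure in (2) can be made concrete with a small numerical sketch. Everything specific below is an assumption introduced for illustration: the state space is the unit interval with Lebesgue reference measure, the three component densities are simple positive linear functions integrating to one, and the weights are a softmax of `theta[i] * x`, a choice that makes them positive, summing to one and real-analytic in the parameter. The check at the end confirms that the resulting transition density integrates to one in its first argument; the observation density in (2) has the same form and is omitted.

```python
import math

# Hypothetical component densities v_i on [0, 1]; each integrates to one.
components = [
    lambda x: 1.0,
    lambda x: 0.5 + x,
    lambda x: 1.5 - x,
]


def weights(theta, x):
    """Hypothetical mixture weights a_theta^i(x): softmax of theta[i] * x."""
    scores = [t * x for t in theta]
    top = max(scores)                       # subtract the maximum for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def p_theta(theta, x_next, x):
    """Mixture transition density p_theta(x'|x) as in (2)."""
    return sum(a * v(x_next) for a, v in zip(weights(theta, x), components))


def integrate_unit_interval(f, m=1000):
    """Midpoint rule on [0, 1]."""
    return sum(f((k + 0.5) / m) for k in range(m)) / m


theta = [0.3, -1.2, 2.0]
mass = integrate_unit_interval(lambda x_next: p_theta(theta, x_next, 0.4))
```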

The entropy rates of hidden Markov model specified in (2) are studied under the following assumptions.

###### Assumption 3.1.

is a compact set.

###### Assumption 3.2.

and for all , , , . Moreover, and are real-analytic in for each , , , .

###### Assumption 3.3.

There exists a real number such that for all , .

###### Assumption 3.4.

for each .

Assumptions 3.1–3.4 cover several classes of hidden Markov models met in practice. E.g., these assumptions hold if is a mixture of Gamma, Gaussian, Pareto and logistic distributions, and if is a mixture of the same distributions truncated to a compact domain. For other models satisfying (2) and Assumptions 3.1–3.4, see e.g., [1], [2], [5] and references cited therein.

Using Theorems 2.1 and 2.2, we get the following results.

###### Corollary 3.1.

Let Assumptions 2.5 and 3.1–3.3 hold. Then, all conclusions of Theorem 2.1 are true.

###### Corollary 3.2.

Let Assumptions 3.1–3.4 hold. Then, all conclusions of Theorem 2.2 are true.

Corollaries 3.1 and 3.2 are proved in Section 8.

## 4 Example: Non-Linear State-Space Models

In this section, the main results are used to study the entropy rates of non-linear state-space models. We consider the following (parameterized) state-space model:

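$$X_{n+1}^{\theta,\lambda}=A_\theta(X_n^{\theta,\lambda})+B_\theta(X_n^{\theta,\lambda})V_n,\qquad Y_n^{\theta,\lambda}=C_\theta(X_n^{\theta,\lambda})+D_\theta(X_n^{\theta,\lambda})W_n,\qquad n\geq 0. \tag{3}$$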

Here, , are the parameters indexing the state-space model (3) (, have the same meaning as in Section 2). and are functions which map , (respectively) to and ( has the same meaning as in Section 2). and are functions which map , (respectively) to and ( has the same meaning as in Section 2). is an -valued random variable defined on a probability space and distributed according to . are -valued i.i.d. random variables which are defined on and have (marginal) probability density with respect to the Lebesgue measure. are -valued i.i.d. random variables which are defined on and have (marginal) probability density with respect to the Lebesgue measure. We also assume that , and are (jointly) independent.

In this section, we rely on the following notation. and are the functions defined by

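$$\tilde p_\theta(x'|x)=\frac{v\big(B_\theta^{-1}(x)(x'-A_\theta(x))\big)}{|\det B_\theta(x)|},\qquad \tilde q_\theta(y|x)=\frac{w\big(D_\theta^{-1}(x)(y-C_\theta(x))\big)}{|\det D_\theta(x)|}$$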

for , , (provided that , are invertible). and are the functions defined by

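$$p_\theta(x'|x)=\frac{v\big(B_\theta^{-1}(x)(x'-A_\theta(x))\big)\,1_{\mathcal{X}}(x')}{\int_{\mathcal{X}} v\big(B_\theta^{-1}(x)(x''-A_\theta(x))\big)\,dx''},\qquad q_\theta(y|x)=\frac{w\big(D_\theta^{-1}(x)(y-C_\theta(x))\big)\,1_{\mathcal{Y}}(y)}{\int_{\mathcal{Y}} w\big(D_\theta^{-1}(x)(y'-C_\theta(x))\big)\,dy'} \tag{4}$$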

for , , (, have the same meaning as in Section 2). It is straightforward to show that and are the conditional densities of and (respectively) given . On the other hand, and can be interpreted as truncations of and to domains and (i.e., the hidden Markov model specified in (4) can be viewed as a truncated version of the original model (3)). It is easy to see that and accurately approximate and when domains and are sufficiently large (i.e., when , contain balls of sufficiently large radius). This kind of approximation is involved (implicitly or explicitly) in any numerical implementation of the optimal filter for state-space model (3) (for details, see e.g., [1], [2], [5] and references cited therein).
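The truncation in (4) can be sketched in the scalar case. The choices below are assumptions made for illustration and are not taken from the paper: the state domain is the interval from `LOW` to `HIGH`, the noise density is standard normal, the drift is `theta[0] * x`, and the noise scale is the positive constant `theta[1]`. With these choices, the normalizing integral in (4) has a closed form in terms of the standard normal distribution function, and the final check confirms numerically that the truncated density integrates to one over the domain.

```python
import math

LOW, HIGH = -1.0, 1.0  # hypothetical compact state domain


def std_normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def truncated_p(theta, x_next, x):
    """Truncated transition density p_theta(x'|x) from (4), scalar case."""
    a = theta[0] * x   # hypothetical drift A_theta(x)
    b = theta[1]       # hypothetical noise scale B_theta(x); must be positive
    if not (LOW <= x_next <= HIGH):
        return 0.0     # indicator of the domain
    numerator = std_normal_pdf((x_next - a) / b)
    # Integral of v((x'' - a) / b) over [LOW, HIGH], by the substitution u = (x'' - a) / b.
    denominator = b * (std_normal_cdf((HIGH - a) / b) - std_normal_cdf((LOW - a) / b))
    return numerator / denominator


def integrate(f, lo, hi, m=2000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / m
    return h * sum(f(lo + (k + 0.5) * h) for k in range(m))


theta = [0.7, 0.5]
mass = integrate(lambda x_next: truncated_p(theta, x_next, 0.3), LOW, HIGH)
```

Enlarging the interval makes the denominator approach `b`, so that the truncated density approaches the untruncated one, in line with the approximation remark above.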

The entropy rates of the hidden Markov model (4) are studied under the following assumptions.

###### Assumption 4.1.

and are compact sets with non-empty interiors.

###### Assumption 4.2.

and for all , . Moreover, and are real-analytic for each , .

###### Assumption 4.3.

and are invertible for all , . Moreover, , , and are real-analytic in for each , .

Assumptions 4.1–4.3 are relevant for several practically important classes of non-linear state-space models. E.g., these assumptions cover stochastic volatility and dynamic probit models and their truncated versions. For other models satisfying (3) and Assumptions 4.1–4.3, see [1], [2], [5] and references cited therein.

Using Theorems 2.1 and 2.2, we get the following results.

###### Corollary 4.1.

Let Assumptions 2.5 and 4.1–4.3 hold. Then, all conclusions of Theorem 2.1 are true.

###### Corollary 4.2.

Let Assumptions 4.1–4.3 hold. Then, all conclusions of Theorem 2.2 are true.

Corollaries 4.1 and 4.2 are proved in Section 8.

## 5 Results Related to Kernels of $\{(X_n,Y_n)\}_{n\geq 0}$ and $\{(X_n^{\theta,\lambda},Y_n^{\theta,\lambda})\}_{n\geq 0}$

In this section, an analytical (complex-valued) continuation of the transition kernel of is constructed, and its asymptotic properties (geometric ergodicity) are studied. The same properties of the transition kernel of are studied, too.

Throughout this section and the whole paper, the following notation is used. is the set defined by , while is the collection of Borel sets in . is the collection of probability measures on , while is the set of positive measures on . is the collection of complex measures on , while is the set defined by

$$\mathcal{P}_c(\mathcal{Z})=\{\zeta\in\mathcal{M}_c(\mathcal{Z}):\ \zeta(\mathcal{Z})=1\}.$$

For , denotes the total variation norm of , while is the total variation of . is the function defined by

$$\psi(z)=1+|\log\varphi(y)| \tag{5}$$

for , and . is the function defined by

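$$\tilde r_\eta(y,x'|x)=\begin{cases}\dfrac{\hat r_\eta(y,x'|x)}{\int\!\!\int\hat r_\eta(y',x''|x)\,\nu(dy')\,\mu(dx'')}, & \text{if } \int\!\!\int\hat r_\eta(y',x''|x)\,\nu(dy')\,\mu(dx'')\neq 0,\\[2ex] 0, & \text{otherwise}\end{cases} \tag{6}$$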

for , , . is the function defined by

$$u_\eta^n(x_{0:n},y_{1:n})=\prod_{k=1}^{n}\tilde r_\eta(y_k,x_k|x_{k-1}) \tag{7}$$

for , , ( has the same meaning as in (6)). is the Dirac measure on centered at (i.e., for ). is the measure defined by

$$\sigma(B)=\int\!\!\int I_B(y,x)\,Q(x,dy)\,\pi(dx)$$

for . , are the kernels defined by

$$S(z,B)=\int\!\!\int I_B(y',x')\,Q(x',dy')\,P(x,dx'),\qquad S_\eta(z,B)=\int\!\!\int I_B(y',x')\,\tilde r_\eta(y',x'|x)\,\nu(dy')\,\mu(dx') \tag{8}$$

for , , and ( has the same meaning as in (6)). , are the kernels recursively defined by and

$$S^{n+1}(z,B)=\int S^n(z',B)\,S(z,dz'),\qquad S_\eta^{n+1}(z,B)=\int S_\eta^n(z',B)\,S_\eta(z,dz')$$

for , , ( has the same meaning as in (6)). are the measures defined by

$$(S_\eta^n\zeta)(B)=\int S_\eta^n(z,B)\,\zeta(dz)$$

for , , ( has the same meaning as in (6)).

Under the above notation, we have the following: , are (respectively) the transition kernels of , , where , . We also have

$$(S_\eta^n\zeta)(B)=\int\cdots\int\!\!\int I_B(y_n,x_n)\,u_\eta^n(x_{0:n},y_{1:n})\,(\nu\times\mu)(dy_n,dx_n)\cdots(\nu\times\mu)(dy_1,dx_1)\,\zeta(dy_0,dx_0) \tag{9}$$

for , , , .

###### Lemma 5.1.

Let Assumption 2.5 hold. Then, there exists a real number such that

$$\int\psi(z')\,S(z,dz')\leq C_1,\qquad |S^n-\sigma|(z,B)\leq C_1\rho^n$$

for all , , (here, denotes the total variation of , while is specified in Assumption 2.5).

###### Proof.

Let ( is specified in Assumption 2.5). Moreover, let , be any elements of , (respectively), while (notice that is any element of ). Then, we have

$$\int\psi(z')\,S(z,dz')=\int\!\!\int\big(1+|\log\varphi(y')|\big)\,Q(x',dy')\,P(x,dx')\leq 1+K\leq C_1.$$

We also have

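$$
\begin{aligned}
|S^n(z,B)-\sigma(B)| &= \left|\int\!\!\int I_B(y',x')\,Q(x',dy')\,(P^n-\pi)(x,dx')\right| \\
&\leq \int\!\!\int I_B(y',x')\,Q(x',dy')\,|P^n-\pi|(x,dx') \\
&\leq 2K\rho^n \leq C_1\rho^n
\end{aligned}
$$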

for , . ∎

###### Lemma 5.2.

Let Assumption 2.2 hold. Then, the following is true:

(i) for all , , .

(ii) There exists a real number such that is analytic in and satisfies

$$\big|\tilde r_\eta(y,x'|x)\big|\leq 2\varphi(y),\qquad \int\!\!\int\tilde r_\eta(y',x''|x)\,\nu(dy')\,\mu(dx'')=1$$

for all , , ( is specified in Assumption 2.2).

###### Remark.

As a direct consequence of Lemma 5.2 (Part (ii)), we have (i.e., ) for , , .

###### Proof.

Due to Assumption 2.2, we have

$$\int\!\!\int\varphi(y)\,\nu(dy)\,\mu(dx)=\|\mu\|\int\varphi(y)\,\nu(dy)<\infty.$$

Then, using Assumption 2.2 and Lemma A1.1 (see Appendix 8), we conclude that integral

$$\int\!\!\int\hat r_\eta(y,x'|x)\,\nu(dy)\,\mu(dx') \tag{10}$$

is analytic in for each , . Relying on the same arguments, we deduce

$$\big|\hat r_{\eta'}(y,x'|x)-\hat r_{\eta''}(y,x'|x)\big|\leq\frac{d\,\varphi(y)\,\|\eta'-\eta''\|}{\delta} \tag{11}$$

for , , (here, $d$ denotes the dimension of vectors in , ).

Throughout the rest of the proof, the following notation is used. , are the real numbers defined by

$$\tilde C=\frac{d\,\|\mu\|}{\delta}\int\varphi(y)\,\nu(dy),\qquad \delta_1=\min\left\{\delta,\frac{1}{2\tilde C}\right\}.$$

is any element of , while , , are any elements in . , are any elements of , while is any element in