1 Introduction
Let be a sample following some unknown distribution . The maximum likelihood estimator can be formalized as follows: let , the model, be a family of possible distributions; pick a distribution of the model which maximizes the likelihood of the observed sample.
In many situations, the true distribution may not belong to the model at hand: this is the so-called misspecified setting. One would like the estimator to give sensible results even in this setting. This can be done by showing that the estimated distribution converges to the best approximation of the true distribution within the model. The goal of this paper is to establish a finite sample bound on the error of the maximum likelihood estimator for a large class of true distributions and a large class of nonparametric hidden Markov models.
In this paper, we consider maximum likelihood estimators (shortened MLE) based on model selection among finite state space hidden Markov models (shortened HMM). A finite state space hidden Markov model is a stochastic process where only the observations are observed, such that the process is a Markov chain taking values in a finite space and such that the are independent conditionally to with a distribution depending only on the corresponding . The parameters of a HMM are the initial distribution and the transition matrix of and the distributions of conditionally to .
HMMs have been widely used in practice, for instance in climatology (Lambert et al., 2003), ecology (Boyd et al., 2014), voice activity detection and speech recognition (Couvreur and Couvreur, 2000; Lefèvre, 2003) and biology (Yau et al., 2011; Volant et al., 2014). One of their advantages is their ability to account for complex dependencies between the observations: despite the seemingly simple structure of these models, the fact that the process is hidden makes the process non-Markovian.
Up to now, most theoretical work in the literature has focused on well-specified and parametric HMMs, where a smooth parametrization by a subset of is available, see for instance Baum and Petrie (1966) for discrete state and observation spaces, Leroux (1992) for general observation spaces and Douc and Matias (2001) and Douc et al. (2011) for general state and observation spaces. Asymptotic properties for misspecified models have been studied recently by Mevel and Finesso (2004) for consistency and asymptotic normality in finite state space HMMs and Douc and Moulines (2012) for consistency in HMMs with general state space. Let us also mention Pouzo et al. (2016), who studied a generalization of hidden Markov models in a semi-misspecified setting. All these results focus on parametric models.
Few results are available on nonparametric HMMs, and all of them focus on the well-specified setting. Alexandrovich et al. (2016) prove consistency of a nonparametric maximum likelihood estimator based on finite state space hidden Markov models with nonparametric mixtures of parametric densities. Vernet (2015a, b) study the posterior consistency and concentration rates of a Bayesian nonparametric maximum likelihood estimator. Other methods have also been considered, such as spectral estimators in Anandkumar et al. (2012); Hsu et al. (2012); De Castro et al. (2017); Bonhomme et al. (2016); Lehéricy (2017) and least squares estimators in de Castro et al. (2016); Lehéricy (2017). Besides Vernet (2015b), to the best of our knowledge, there has been no result on convergence rates or finite sample error of the nonparametric maximum likelihood estimator, even in the well-specified setting.
The main result of this paper is an oracle inequality that holds as soon as the models have controlled tails. This bound is optimal when the true distribution is a HMM taking values in . Let us give some details about this result.
Let us start with an overview of the assumptions on the true distribution . The first assumption is that the observed process is strongly mixing. Strong mixing assumptions can be seen as a strengthened version of ergodicity. They have been widely used to extend results on independent observations to dependent processes, see for instance Bradley (2005) and Dedecker et al. (2007) for a survey on strong mixing and weak dependence conditions. The second assumption is that the process forgets its past exponentially fast. For hidden Markov models, this forgetting property is closely related to the exponential stability of the optimal filter, see for instance Le Gland and Mevel (2000); Gerencsér et al. (2007); Douc et al. (2004, 2009). The last assumption is that the likelihood of the true process has subpolynomial tails. None of these assumptions is specific to HMMs, thus making our result applicable to the misspecified setting.
To approximate a large class of true distributions, we consider nonparametric HMMs, where the parameters are not described by a finite dimensional space. For instance, one may consider HMMs with arbitrary number of states and arbitrary emission distributions. Computing a maximizer of the likelihood directly in a nonparametric model may be hard or result in overfitting. The model selection approach offers a way to circumvent this problem. It consists in considering a countable family of parametric sets (the models) and selecting one of them. The larger the union of all models, the more distributions are approximated. Several criteria can be used to select the model, such as bootstrap, cross validation (see for instance Arlot and Celisse (2010)) or penalization (see for instance Massart (2007)). We use a penalized criterion, which consists in maximizing the function
where is the density of under the parameter and the penalty pen only depends on the model and the number of observations .
Assume that the emission distributions of the HMMs, that is, the distributions of the observations conditionally to the hidden states, are absolutely continuous with respect to some known probability measure, and call emission densities their densities with respect to this measure. The tail assumption ensures that the emission densities have subpolynomial tails: where the supremum is taken over all emission densities in the models for a function . For instance, this assumption holds when all densities are upper bounded by . A key remark at this point is the dependency of on : we allow the models to depend on the sample size. Typically, taking a larger sample makes it possible to consider larger models. A good choice is to take proportional to .
To stabilize the log-likelihood, we modify the models in the following way. First, only keep HMMs whose transition matrix is lower bounded by a positive function . We show that taking this lower bound as is a safe choice. Then, replace the emission densities by a convex combination of the original emission densities and of the dominating measure with a weight that decreases polynomially with the sample size. In other words, replace by for some . Taking ensures that the component is asymptotically negligible. Any works, but the constants of the oracle inequality depend on it.
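As a rough illustration of this stabilization step, the following Python sketch mixes an emission density with the constant density 1 of the dominating probability measure, and checks a lower bound on a transition matrix. The function names, the weight `n ** (-a)` and the threshold `sigma` are illustrative assumptions, not the notation of the paper.

```python
def stabilize_density(density, n, a):
    """Return the density y -> (1 - w) * density(y) + w, with w = n ** (-a).

    Assumption of this sketch: densities are taken with respect to a
    dominating probability measure, whose own density is constant equal to 1,
    and the mixing weight decreases polynomially as n ** (-a).
    """
    if n < 1 or a <= 0:
        raise ValueError("n must be >= 1 and a must be positive")
    weight = float(n) ** (-a)

    def stabilized(y):
        # Convex combination of the original density and the constant 1.
        return (1.0 - weight) * density(y) + weight

    return stabilized


def is_lower_bounded(Q, sigma):
    """Check that every coefficient of the transition matrix Q is >= sigma."""
    return all(entry >= sigma for row in Q for entry in row)
```

By construction, the stabilized density is never smaller than the weight `n ** (-a)`, which is what keeps the logarithm of the density under control.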
A simplified version of our main result (Theorem 3.1) is the following oracle inequality: for all , there exist constants and such that if the penalty is large enough, the penalized maximum likelihood estimator satisfies for all , and , with probability larger than :
where can be seen as a Kullback-Leibler divergence between the distributions and . In other words, the estimator recovers the best approximation of the true distribution within the model, up to the penalty and the residual term.
In the case where the true distribution is a HMM, it is possible to quantify the approximation error . Using the results of Kruijer et al. (2010), we show that the above oracle inequality is optimal in the minimax sense, up to logarithmic factors, for real-valued HMMs, see Corollary 3.2. This is done by taking HMMs whose emission densities are mixtures of exponential power distributions, which include Gaussian mixtures as a special case.
The paper is organized as follows. We detail the framework of the article in Section 2. In particular, Section 2.3 describes the assumptions on the true distribution, Section 2.4 presents the assumptions on the model and Section 2.5 introduces the Kullback-Leibler criterion used in the oracle inequality. Our main results are stated in Section 3. Section 3.1 contains the oracle inequality and Section 3.2 shows how it can be used to show minimax adaptivity for real-valued HMMs. Section 4 lists some perspectives for this work.
One may wish to relax our assumptions depending on the setting. For instance, one could want to change the dependency of the functions and on , change the tail conditions or the rate of forgetting. We give an overview of the key steps of the proof of our oracle inequality in Section 5 to make it easier to adapt our result.
2 Notations and assumptions
We will use the following notations:


is the maximum of and , the minimum;

For , we write ;

is the set of positive integers;

For , we write ;

is the set of measurable and square integrable functions defined on the measured space . We write when the sigma-field is not ambiguous;

is the inverse function of the exponential function .
2.1 Hidden Markov models
Finite state space hidden Markov models (HMM in short) are stochastic processes with the following properties. The hidden state process is a Markov chain taking values in a finite set (the state space). We denote by the cardinality of , and and the initial distribution and transition matrix of respectively. The observation process takes values in a Polish space (the observation space) endowed with a Borel probability measure . The observations are independent conditionally to with a distribution depending only on . In the following, we assume that the distribution of conditionally to is absolutely continuous with respect to with density . We call the emission densities.
Therefore, the parameters of a HMM are its number of hidden states , its initial distribution (the distribution of ), its transition matrix and its emission densities . When appropriate, we write the density of the process with respect to the dominating measure under the parameters . For a sequence of observations , we denote by the associated log-likelihood under the parameters , defined by
We denote by the true (and unknown) distribution of the process , the expectation under , the density of under the dominating measure and the log-likelihood of the observations under . Let us stress that this distribution may not be generated by a finite state space HMM.
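In practice, the log-likelihood of a finite state space HMM is usually evaluated with the forward recursion rather than by summing over all hidden paths. The sketch below is a generic pure-Python version under assumed names (`pi`, `Q`, `emission_densities`); it illustrates the standard recursion and is not taken from the paper.

```python
import math


def hmm_log_likelihood(pi, Q, emission_densities, observations):
    """Log-likelihood of `observations` under a finite state space HMM.

    pi: initial distribution, a list of K probabilities.
    Q: transition matrix, a K x K list of lists whose rows sum to 1.
    emission_densities: list of K callables, each mapping y to a density value.
    """
    K = len(pi)
    log_lik = 0.0
    # predictive[x] = probability of hidden state x given past observations.
    predictive = list(pi)
    for y in observations:
        joint = [predictive[x] * emission_densities[x](y) for x in range(K)]
        # Density of the current observation given the previous ones.
        norm = sum(joint)
        if norm <= 0.0:
            return -math.inf
        log_lik += math.log(norm)
        # Filtering step (condition on y), then prediction step (apply Q).
        filt = [v / norm for v in joint]
        predictive = [
            sum(filt[x] * Q[x][x_next] for x in range(K)) for x_next in range(K)
        ]
    return log_lik
```

Normalizing at each step keeps the recursion numerically stable for long sequences, and the sum of the logarithms of the normalizing constants is exactly the log-likelihood.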
2.2 The model selection estimator
Let be a family of parametric models such that for all and , the parameters correspond to HMMs with hidden states. Note that the models may depend on the number of observations . Let us see two ways to construct such models.
 Mixture densities.

Let be a parametric family of probability densities indexed by . Let . We choose to be the set of parameters such that and are uniformly lower bounded by and for all , is a convex combination of elements of .
 densities.

Let be a family of finite dimensional subspaces of . We choose to be the set of parameters such that and are uniformly lower bounded by and for all , is a probability density such that for a function such that .
In both cases, we took a lower bound on the coefficients of the transition matrix that tends to zero when the number of observations grows. This makes it possible to estimate parameters for which some coefficients of the transition matrix are small or zero. We prove the choice to be a good choice in general in Theorem 3.1.
For all and , we define the maximum likelihood estimator on :
Since the true distribution does not necessarily correspond to a parameter of , taking a larger model will reduce the bias of the estimator . However, larger models will make the estimation more difficult, resulting in a larger variance. This means one has to perform a bias-variance tradeoff to select a model with a reasonable size. To do so, we select a number of states among a set of integers and a model index among a set of indices such that the penalized log-likelihood is maximal: for some penalty to be chosen.
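The selection step can be sketched as follows in Python. This is a minimal illustration, assuming the maximized log-likelihood of each model has already been computed; the function names and the BIC-like penalty are hypothetical choices made for the example, whereas the penalty actually required by the theory is the one specified in Theorem 3.1.

```python
import math


def select_model(max_log_likelihoods, penalty, n):
    """Return the pair (K, M) maximizing the penalized log-likelihood.

    max_log_likelihoods: dict mapping (K, M) to the log-likelihood of the
        maximum likelihood estimator on the corresponding model.
    penalty: callable (n, K, M) -> nonnegative float, on the same scale as
        the log-likelihood (an assumption of this sketch).
    """
    return max(
        max_log_likelihoods,
        key=lambda km: max_log_likelihoods[km] - penalty(n, km[0], km[1]),
    )


def bic_like_penalty(n, K, M):
    """Hypothetical penalty: half the parameter count times log(n)."""
    # (K - 1) for the initial distribution, K * (K - 1) for the transition
    # matrix, and M parameters for each of the K emission densities.
    dimension = (K - 1) + K * (K - 1) + K * M
    return 0.5 * dimension * math.log(n)
```

For instance, with `n = 1000` and maximized log-likelihoods of -1500, -1400 and -1398 for one, two and three states (with `M = 1`), this criterion retains two states: the small gain of the third state does not compensate for its larger penalty.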
In what follows, we use the following notations.

is the set of all parameters involved with the construction of the maximum likelihood estimator;

is the set of density vectors from the model . is defined in the same way.
2.3 Assumptions on the true distribution
In this section, we introduce the assumptions on the true distribution of the process . We assume that is stationary, so that one can extend it into a process .
2.3.1 Forgetting and mixing
Let us state the two assumptions on the dependency of the process .
 [Aforgetting]

There exist two constants and such that for all , for all and for all ,
For the mixing assumption, let us recall the definition of the mixing coefficient. Let be a measured space and and be two sigma-fields. Let
The mixing coefficient of is defined by
 [Amixing]

There exist two constants and such that
[Aforgetting] ensures that the process forgets its initial distribution exponentially fast. This assumption is especially useful for truncating the dependencies in the likelihood. [Amixing] is a usual mixing assumption and is used to obtain Bernstein-like concentration inequalities. Note that [Amixing] implies that the process is ergodic.
Even though [Aforgetting] is analogous to a mixing condition (see Bradley (2005) for a survey on mixing conditions) and is proved using the same tool as [Amixing] in hidden Markov models, namely the geometric ergodicity of the hidden state process, these two assumptions are different in general. For instance, a Markov chain always satisfies [Aforgetting] but not necessarily [Amixing]. Conversely, there exist processes satisfying [Amixing] but not [Aforgetting].
Assume that is generated by a HMM with a compact metric state space (not necessarily finite) endowed with a Borel probability measure . Write its transition kernel and assume that admits a density with respect to that is uniformly lower bounded and upper bounded by positive and finite constants and . Write its emission densities and assume that they satisfy for all .
Then [Aforgetting] and [Amixing] hold by taking , , and .
This lemma follows from the geometric ergodicity of the HMM.
For [Aforgetting], see for instance Douc et al. (2004), proof of Lemma 2.
For [Amixing], the Doeblin condition implies that for all distributions and on ,
Let and such that . Taking the stationary distribution of and the distribution of conditionally to in the above equation implies
Therefore, the process is mixing with , so that it is mixing with (see e.g. Bradley (2005) for the definition of the mixing coefficient and its relation to the mixing coefficient). One can check that the choice of and makes it possible to obtain [Amixing] from this inequality.
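The contraction behind the Doeblin argument can be illustrated numerically on a finite state space. The sketch below is an illustration only: it assumes a uniform reference measure, defines `eps` as the largest constant such that every coefficient of the transition matrix is at least `eps / K`, and uses the classical fact that one transition step then contracts the total variation distance by a factor at most `1 - eps`. The names and the constant are assumptions of the example and do not reproduce the constants of the lemma.

```python
def total_variation(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def step(p, Q):
    """Distribution of the chain after one transition, starting from p."""
    K = len(Q)
    return [sum(p[x] * Q[x][y] for x in range(K)) for y in range(K)]


def doeblin_constant(Q):
    """Largest eps such that Q[x][y] >= eps / K for all x and y."""
    K = len(Q)
    return K * min(min(row) for row in Q)


def contraction_holds(Q, p, q, steps, tol=1e-12):
    """Check TV(p Q^t, q Q^t) <= (1 - eps)^t TV(p, q) for t = 1..steps."""
    eps = doeblin_constant(Q)
    initial = total_variation(p, q)
    for t in range(1, steps + 1):
        p, q = step(p, Q), step(q, Q)
        if total_variation(p, q) > (1.0 - eps) ** t * initial + tol:
            return False
    return True
```

The geometric decay of the distance between the two propagated distributions is the mechanism that yields an exponentially decreasing mixing coefficient.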
2.3.2 Extreme values of the true density
We need to control the probability that the true density takes extreme values.
 [Atail]

There exist two constants and such that
In practice, only two values of are of interest. The case occurs when the densities are lower and upper bounded by positive and finite constants. If the densities are not bounded, then works in most cases and corresponds to subpolynomial tails. Indeed, the lower bound on is always true when taking and by definition of the density , resulting in the following equivalent assumption:
 [Atail’]

There exists a constant such that
This can be obtained from Markov’s inequality under a moment assumption, as shown in the following lemma.
Assume that there exists such that
Then [Atail] holds for and .
2.4 Model assumptions
We now state the assumptions on the models. Let us recall that the distribution of the observed process is not assumed to belong to one of these models.
Consider a family of models such that for each , and , the elements of are of the form where is a probability density on , is a transition matrix on and is a vector of probability densities on with respect to .
2.4.1 Transition kernel
We need the following assumption on the transition matrices and initial distributions of .
 [Aergodic]

There exists such that for all ,
[Aergodic] is standard in maximum likelihood estimation. It ensures that the process forgets the past exponentially fast, which implies that the difference between the log-likelihood and its limit converges to zero with rate in supremum norm.
2.4.2 Tail of the emission densities
When , [Aergodic] implies that under the parameters , for all , the probability to jump to state at time is at least , whatever the past may be. This implies that the density is lower bounded by . Furthermore, it is upper bounded by . Thus, it is enough to bound this quantity to control without having to handle the time dependency.
For all and , let
We need to control the tails of like we did for in order to get nonasymptotic bounds. This is the purpose of the following assumption.
 [Atail]

There exist two constants and such that
This assumption is often easy to check in practice, as shown in the following lemma. Assume that one of the following two assumptions holds:

(subpolynomial tails) there exists such that

(bounded densities) there exists such that
Consider a new model where all are replaced by for a fixed constant . Then [Atail] holds for this new model with (resp. with the second assumption) and .
Changing the densities as in the lemma amounts to adding a mixture component (with weight and distribution ) to the emission densities to make sure that they are uniformly lower bounded. We shall see in the following that if , then this additional component does not change the approximation properties of the models, see the proof of Corollary 3.2. This is in agreement with the fact that this component is asymptotically never observed as soon as .
2.4.3 Complexity of the approximation spaces
The following assumption means that as far as the bracketing entropy is concerned, the set of emission densities of the model (without taking the hidden state into account) behaves like a parametric model with dimension .
 [Aentropy]

There exists a function and a sequence such that for all , , and ,
(1) where is the distance associated with the supremum norm and is the smallest number of brackets of size for the distance needed to cover . Let us recall that the bracket is the set of functions such that , and that the size of the bracket is .
Note that we allow the models to depend on the sample size , which can make grow to infinity with . To control the growth of the models, we use the following assumption.
 [Agrowth]

There exists and such that for all ,
A typical way to check [Aentropy] is to use a parametrization of the emission densities, for instance a Lipschitz map . This reduces the construction of a bracket covering on to the construction of a bracket covering of the unit ball of . In this case, depends on the Lipschitz constant of the parametrization. An example of this approach is given in Section 3.2 for mixtures of exponential power distributions.
2.5 Limit and properties of the log-likelihood
In this section, we focus on the convergence of the log-likelihood. First, we recall results from Barron (1985) and Leroux (1992) that show the existence of its limit in a general setting. Then, we show how to control the difference between the log-likelihood and its limit using the assumptions from the previous sections.
2.5.1 Convergence of the log-likelihood
The first result comes from Barron (1985) and shows that the true log-likelihood converges almost surely with no assumption other than the ergodicity of the process .
The second result follows from Theorem 2 of Leroux (1992). A careful reading of his proof shows that one can relax his assumptions to get the following lemma. Note that the definition of extends naturally to the case where is not a vector of probability densities, or even a vector of integrable functions with respect to , through the formula
[Leroux (1992)] Let be a positive integer, a vector of nonnegative and measurable functions, a transition matrix of size and a probability measure on .
Assume that the process is ergodic and that for all . Then:

There exists a quantity which does not depend on such that
and such that if , then

Assume . Then the almost sure convergence also holds in .

Assume for all . Then .
When appropriate, we define by
Note that when is a vector of probability densities, since it is the limit of a sequence of Kullback-Leibler divergences: under the assumptions of Lemma 2.5.1, if ,
2.5.2 Approximation of the limit
The following lemma controls the difference between the log-likelihood and its limit. When [Aforgetting] (resp. [Aergodic]) holds, the log-density of conditionally to the previous observations converges exponentially fast to what can be seen as the density of conditionally to the whole past, that is (resp. ). Strictly speaking, we define the limit of the log-density and , which can be seen respectively as and .
For all , , let
where the process is extended into a process by stationarity. Likewise, for all , ,
and for all probability distributions on , let where is the density of a stationary HMM with parameters . When is the stationary distribution of the Markov chain under the parameter , we write .

(Douc et al. (2004)). Assume [Aergodic] holds. Let . Then for all , , , and ,
and there exists a process such that for all and , in supremum norm (when seen as a function of ) and for all , and ,

Assume [Aforgetting] holds, then for all , and , and there exists a process such that for all , and for all and ,

Assume [Aforgetting] and [Aergodic] hold. Under , the processes and are stationary for all . Moreover, if is ergodic (for instance if [Amixing] holds), they are ergodic and:

if [Atail] holds, then for all , exists, is finite and

if [Atail] holds, then exists and is finite and

The second point follows directly from [Aforgetting].
The third point follows from the ergodicity of under [Amixing], from the integrability of and under [Atail] and [Atail] and from Lemmas 2.5.1 and 2.5.1.
Note that under the assumptions of point 3 of Lemma 2.5.2, one has for all (recall that is a vector of probability densities in this case), or, with some abuse of notation:
Thus, can be seen as a Kullback-Leibler divergence that measures the difference between the distribution of conditionally to the whole past under the parameter and under the true distribution. It can be seen as the prediction error under the parameter .
In the particular case where the true distribution of is a finite state space hidden Markov model, characterizes the true parameters, up to permutation of the hidden states, provided the emission densities are all distinct and the transition matrix is invertible, as shown in the following result.
[Alexandrovich et al. (2016), Theorem 5] Assume is generated by a finite state space HMM with parameters . Assume is invertible and ergodic, that the emission densities are all distinct and that for all (so that ).
Then for all , for every transition matrix of size and for every tuple of probability densities , one has .
In addition, if , then if and only if up to permutation of the hidden states.
3 Main results
3.1 Oracle inequality for the prediction error
The following theorem states an oracle inequality on the prediction error of our estimator. It shows that with high probability, our estimator performs as well as the best model of the class in terms of Kullback-Leibler divergence, up to a multiplicative constant and up to an additive term decreasing as , provided the penalty is large enough.
Assume [Aforgetting], [Amixing], [Atail], [Aergodic], [Atail], [Aentropy] and [Agrowth] hold.
Let be a nonnegative sequence such that . Assume and for some constants and (where is defined in [Agrowth]). Let . For all and , let
and let
be the nonparametric maximum likelihood estimator.
Then there exist constants and depending only on , , , and and a constant depending only on , and such that for all