 # On Large Lag Smoothing for Hidden Markov Models

In this article we consider the smoothing problem for hidden Markov models (HMM). Given a hidden Markov chain {X_n}_n≥ 0 and observations {Y_n}_n≥ 0, our objective is to compute E[φ(X_0,...,X_k)|y_0,...,y_n] for some real-valued, integrable functional φ and k fixed, k ≪ n and for some realisation (y_0,...,y_n) of (Y_0,...,Y_n). We introduce a novel application of the multilevel Monte Carlo (MLMC) method with a coupling based on the Knothe-Rosenblatt rearrangement. We prove that this method can approximate the afore-mentioned quantity with a mean square error (MSE) of O(ϵ^2), for arbitrary ϵ>0 with a cost of O(ϵ^-2). This is in contrast to the same direct Monte Carlo method, which requires a cost of O(nϵ^-2) for the same MSE. The approach we suggest is, in general, not possible to implement, so the optimal transport methodology of span is used, which directly approximates our strategy. We show that our theoretical improvements are achieved, even under approximation, in several numerical examples.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Given a hidden Markov chain , and observations , , we consider a probabilistic model such that for Borel , , for every ,

 P(Xn∈A|x0:n−1)=∫Af(xn−1,x)dx (1)

with Lebesgue measure and for Borel and all ,

 P(Yn∈B|y0:n−1,x0:n)=∫Bg(xn,y)dy, (2)

where we have used the compact notation for any and any sequence

with the convention that the resulting vector of objects is null if

. The model defined by (1) and (2) is termed a hidden Markov model. In this article, given , our objective is to compute for some real-valued, integrable functional and fixed, , which we refer to as large-lag smoothing. Hidden Markov models and the smoothing problem are found in many real applications, such as finance, genetics and engineering; see e.g.  and the references therein.

The smoothing problem is notoriously challenging. Firstly, is seldom available analytically and hence numerical methods are required. Secondly, if one wants to compute for several values of , i.e. potentially recursively, then several of the well-known methods for approximation of can fail. For instance the particle filter (e.g.  and the references therein) suffers from the well-known path degeneracy problem (see e.g. ). Despite this, several methods are available for the approximation of , such as particle Markov chain Monte Carlo  or the PaRIS algorithm , which might be considered the current state-of-the-art. The latter algorithm relies on approximating for some and is then justified on the basis of using forgetting properties of the smoother (see e.g. [2, 4]). We will extend this notion as will be explained below.

The main approach that is followed in this paper, is to utilize the multilevel Monte Carlo method (e.g. [7, 8]). Traditional applications of this method are associated to discretizations of continuum problems, but we adopt the framework in a slightly non-standard way. To describe the basic idea, suppose one is interested in for a probability, real-valued and bounded, but, one can only hope to approximate with a probability (assumed on the same space as ), and in some loose sense one has approaches as grows. Now, given

a sequence of increasingly more ‘precise’ probability distributions on the same space, one trivially has

 EπL[φ(X)]=Eπ0[φ(X)]+L∑l=1{Eπl[φ(X)]−Eπl−1[φ(X)]}.

The approach is now to sample dependent couplings of independently for and approximate the difference using Monte Carlo. The term is also approximated using Monte Carlo with i.i.d. sampling from . Then, given a ‘good enough’ coupling and a characterization of the bias, for many practical problems the cost to achieve a pre-specified MSE against i.i.d. sampling from and Monte Carlo, is significantly reduced.

We leverage the idea of MLMC where the ‘level’ corresponds to the time parameter and is some chosen , so as to achieve a given level of bias. The main issue is then how to sample from couplings which are good enough. We show that when (the dimension of the hidden state) that using the optimal coupling, in terms of squared Wasserstein distance, can yield significant improvements over the case where one directly approximates with Monte Carlo and i.i.d sampling from the smoother. That is, for given, to achieve a mean square error of , the cost is , whereas for the ordinary Monte Carlo method the cost is . The same conclusion with can be achieved using the Knothe-Rosenblatt rearrangement. The main issue with our approach is that it cannot be implemented for most problems of practical interest. However, using the methodology in , it can be approximated. We show that in numerical examples our predicted theory is verified, even under this approximation. We also compare our method directly with PaRIS, showing substantial improvement in terms of cost for a given level of MSE.

This article is structured as follows. In Section 2 we detail our approach and theoretical results. In Section 3 we demonstrate how our approach can be implemented in practice. In Section 4 we give our numerical examples. Section 5 summarizes the article. The appendix includes the assumptions, technical results and proofs of our main results.

### 1.1 Notations

Let be a measurable space. For we write and as the collection of bounded measurable and Lipschitz functions respectively. For , we write the supremum norm . For , and we write for the set of functions on such that . For , we write the Lipschitz constant . denotes the collection of probability measures on . For a measure on and a , the notation is used. Let be a Markov kernel and be a measure then we use the notations and for , For a sequence of Markov kernels we write

 K1:n(x0,dxn)=∫Xn−1n∏p=1Kp(xp−1,dxp).

For , the total variation distance is written . For the indicator is written .

denotes the uniform distribution on the set

.

is the one-dimensional Gaussian distribution of mean

and variance

.

## 2 Model and Approach

We are given a HMM and we seek to compute

 Eπn,0[φ(X0)|y0:n]=∫Xn+1φ(x0)∏np=0g(xp,yp)f(xp−1,xp)dx0:n∫Xn+1∏np=0g(xp,yp)f(xp−1,xp)dx0:n

where and for ease of simplicity we suppose that and is a compact subspace of the real line. is the probability density (we also use the same symbol for probability measure) of the smoother given observations at the co-ordinate at time 0. That is

 πn,0(x0|y0:n)∝∫Xnn∏p=0g(xp,yp)f(xp−1,xp)dx1:n.

Let be fixed, then we propose to consider

 Eπn∗,0[φ(X0)|y0:n∗]=Eπ0,0[φ(X0)|y0]+n∗∑p=1{Eπp,0[φ(X0)|y0:p]−Eπp−1,0[φ(X0)|y0:p−1]}.

### 2.1 Case X⊂R

Let us denote the CDF of as . An approximation of is

 1NpNp∑i=1[φ(Π−1p,0(Ui))−φ(Π−1p−1,0(Ui))]

where for , and is the (generalized) inverse CDF of . If we do this independently for each

and use an independent estimator

for one can estimate . The utility of the coupling is that it is optimal in terms of 2-Wasserstein distance. We have the following result, where the assumption and proof are in the appendix.

###### Theorem 2.1.

Assume (AA). Then there exists , such that for any , , , we have

 V\emph{ar}[1NpNp∑i=1[φ(Π−1p,0(Ui))−φ(Π−1p−1,0(Ui))]]≤Cρp−1∥φ∥2\emph{Lip}Np.

The main implication of the result is the following. In the approach to be considered later in this paper the cost of computing (an approximation of) is per time step. So the cost of this method is . Thus the MSE and cost associated to this algorithm are (at most in the first case)

 C(∥φ∥2∨∥φ∥2Lip)(1N0+n∗∑p=1ρp−1Np+ρ2n)

and

 C(n∗+n∗∑p=0Np). (3)

Let be given. To achieve an MSE of we can choose (here we of course mean , but this is omitted for simplicity) and for any yields that the associated cost is . If one just approximates using

 1NN∑i=1φ(Π−1n,0(Ui))

then, to achieve an MSE of the cost would be which is considerably larger if is large. That is, the cost of the ML approach is essentially w.r.t. . If one stops at and uses the estimate

 1NN∑i=1φ(Π−1n∗,0(Ui))

to achieve an MSE of , the cost is . A similar approach can show that these results are even true when smoothing for for fixed (and hence ). The strategy of choosing and detailed above, is the one used throughout the paper. Note that in practice, we do not know , so we choose a value such as which should lead to an which is large enough. This is also the reason for setting and not say.

It is remarked that the compactness of could be removed by using Kellerer’s extension of the Kantorovich-Rubenstein theorem (see  for a summary) and then, given that the latter theory is applicable, to show that there exists a , such that for any

 supφ∈Lip1(X)′|Eπp,0[φ(X0)|y0:p]−Eπp−1,0[φ(X0)|y0:p−1]|≤Cρp−1

where is the collection of functions such that for every , . This can be achieved using the techniques in . Such an extension is mainly of a technical nature and is not required in the continuing exposition. We now establish that the construction here can be extended to the case .

### 2.2 Case X⊂Rd

We consider the Knothe-Rosenblatt rearrangement, which is assumed to exist (see e.g. ). For simplicity of notation, we set for some compact . Denote by the conditional CDF of with . Note that here we are dealing with the dimensional co-ordinate at time zero and we considering conditioning on the first of these dimensions. Then to approximate , sample , where for , . Then we have the estimate for

 1NpNp∑i=1[φ(ξip,d)−φ(ξip−1,d)]

where for ease of notation, we have set , (resp. ) and , , (resp.  , ). We have the following result, whose proof and assumptions are in the appendix.

###### Theorem 2.2.

Assume (AA-A). Then there exists , such that for any , , , we have

 V\emph{ar}[1NpNp∑i=1[φ(ξip,d)−φ(ξip−1,d)]]≤Cρp−1∥φ∥2\emph{Lip}Np.

We end this section with some remarks. Firstly, the MLMC strategy could be debiased w.r.t. the time parameter using the trick in , which is a straightforward extension. One minor issue with this methodology, is that the variance can blow up in some scenarios. Secondly, the idea of using the approach in , when approximating has been adopted in . The authors use a conditional version of the coupled particle filter (e.g. [3, 11]) to couple smoothers, versus the optimal Wasserstein coupling. The goal in 

is unbiased estimation which is complementary to ideas in this article, where we focus upon reducing the cost of large lag smoothing.

## 3 Transport methodology

### 3.1 Standard Approach

The basic principle of the transport methodology introduced in  is to determine a mapping relating a base distribution

, e.g. the normal distribution, to a potentially sophisiticated target distribution

related to the problem of interest. The distribution should be easy to sample from so that, given the map , we can obtain samples from by simply mapping samples from via . More precisely, the considered mapping is characterised by

 T#η(x)=η(T−1(x))|det∇T−1(x)|=~π(x),

that is, the push-forward distribution of by is . Such a mapping can be approximated using deterministic or stochastic optimisation methods. However, the underlying optimisation problem is only amenable when the space on which is defined is of a low dimension, e.g. up to . This is not the case in general for the smoothing distributions introduced in the previous sections, especially as the number of observations increases. This is addressed in 

by identifying the dependence structure between the random variables of interest. In particular, for a hidden Markov model on

, it is possible to decompose the problem into transport maps of dimension , which does not depend on the number of observations that define the smoother. The problem at time can be solved by introducing a mapping of the form

 Tp(xp,xp+1)=[T0p(xp,xp+1)T1p(xp+1)]

which will transform the -dimensional base distribution into a target distribution related to the considered hidden Markov model, as detailed below. This target distribution can be expressed as

 ~πp(xp,xp+1)∝ηd(xp)f(T1p−1(xp),xp+1)g(xp+1,yp+1),

for any , which can be seen to be the 1-lag smoother. When , we simply define . The base distribution (resp. ) is the standard normal distribution of dimension (resp. ). The mapping can be embedded into the -dimensional identity mapping as

 ¯Tp(x0,…,xn)=(x0,…,xp−1,T0p(xp,xp+1),T1p(xp+1),xp+2,…,xn)t,

with denoting the matrix transposition. It follows that

 Tn=¯T0∘⋯∘¯Tn

is the map such that the pushforward

is equal to the probability density function of the smoother at time

. Obtaining samples from the smoothing distribution is then straightforward: it suffices to sample from and to map the obtained sample via .

Even in low dimension, the optimisation problem underlying the computation of the transport maps of interest is not trivial. One first has to consider an appropriate parametrisation of these maps, e.g. via polynomial representations. The parameters of the considered representation then have to be determined using the following optimisation problem

 T∗p=argminT−E[log~πp(T(X))+log(det∇T(X))−logη2d(X)], (4)

where the minimum is taken over the set of monotone increasing lower-triangular maps. This minimisation problem can be solved numerically by considering a parametrised family of maps and deterministic or stochastic optimisation methods. Let be any acceptable map in the minimisation (4) and denote by the th component of , which only depends on the th first variables, , then the considered parametrisation can be expressed as

 T(i)(x1,…,xi)=ai(x1,…,xi−1)+∫xi0bi(x1,…,xi−1,t)2dt

for some real-valued functions and on and respectively. It is assumed that the functions and are Hermite Probabilists’ functions extended with constant and linear components for any , and the function is also a Hermite Probabilists’ function which is only extended with a constant component. In particular, these functions take the form

 ai(x1,…,xi−1) =2d(omap+1)∑k=1ckΦk(x1,…,xi−1) bi(x1,…,xi−1,t) =2domap∑k=1c′kΨk(x1,…,xi−1,t)

with the map order, with and some collections of real coefficients and with and basis functions based on the above mentioned Hermite Probabilists’ functions. The expectation in (4) is then approximated using a Gauss quadrature of order in each dimension and the minimisation is solved via the Newton algorithm using the conjugate-gradient method for each step.

The desired function can be recovered through the relation

 Tp((xp,1,…,xp,d),(xp+1,1,…,xp+1,d))=(Sσ∘T∗p∘Sσ)(xp,1,…,xp,d,xp+1,1,…,xp+1,d),

where and is the linear map corresponding to the permutation matrix of , which verifies .

### 3.2 Fixed-Point Smoothing with Transport Maps

The approach described in Section 3.1 allows for obtaining samples from the distribution of given by simply retaining the first components of samples from after mapping them through . However, the computational cost associated with the mapping of samples by increases with , making the complexity of the method of the order .

This can however be addressed by considering as a parameter and by only propagating the transport map corresponding to the posterior distribution of . This approach has been suggested in [17, section 7.4]. We assume in the remainder of this section that observations start at time step instead of . When considering as a parameter, the elementary transport maps take the form

 Tp(x0,xp,xp+1)=⎡⎢ ⎢⎣TX0p(x0)T0p(x0,xp,xp+1)T1p(x0,xp+1)⎤⎥ ⎥⎦.

and the corresponding target distributions become

 ~π1(x0,x1,x2)∝p0(x0)f(x0,x1)f(x1,x2)g(x1,y1)g(x2,y2),

and

 ~πp(x0,xp,xp+1)∝η2d(x0,xp)f(T1p−1(x0,xp),xp+1)g(xp+1,yp+1),

for any . The transport map associated with the posterior distribution of is

 ^Tn(x0,xn)=[TX01∘⋯∘TX0n−1(x0)T1n−1(x0,xn)].

By recursively approximating the composition by a single map, the computation of samples from the posterior distribution of becomes linear in time. The pseudo-code for this approach is given in Algorithm 1.

## 4 Case Studies

### 4.1 Linear Gaussian

#### 4.1.1 Theoretical Result

The results in Section 2 do not apply to the linear Gaussian case. We extend our results to this scenario. We assume that the dynamical and observations models are one-dimensional as well as linear and Gaussian such that the state and observation random variables at time can be defined as

 Xn|xn−1 ∼ N(αxn−1,β2),n≥1 Yn|xn ∼ N(xn,τ2),n≥0

and , for some and some . We have the following result, whose proof is in the appendix.

###### Theorem 4.1.

Assuming that for all large enough, it holds that

 V\emph{ar}[1NpNp∑i=1[Π−1p,0(Ui)−Π−1p−1,0(Ui)]]=O(1Np(α+β2αγ2)−2p).

Theorem 4.1 shows that, under assumptions on the parameters of the model, the variance of the approximated multilevel term at level tends to exponentially fast in and with an order of for the number of samples. This theorem also indicates that the behaviour depends an all the parameters in the model, although implicitly in . For instance, if then one can consider in the above expression. The assumption about the variance of the filter can be justified in terms of reachability and observability of the system .

This rate can get extremely beneficial for the proposed approach when is large and is small, however it can also make it of little use in the opposite case. This does not come as a surprise since a large means that the initial condition is quickly forgotten so that obtaining a high number of samples from the smoother for large would be inefficient, whereas small values of incur a much higher dependency between the initial state and the observations at different time steps.

#### 4.1.2 Numerical Results

The performance of the proposed method is first assessed in the linear-Gaussian case where an analytical solution of the fixed-point smoothing problem is available, this solution being known as the Rauch-Tung-Striebel smoother . More specifically, we consider the following model:

 Xn|xn−1 ∼ N(αxn−1,β2),n≥1 Yn|xn ∼ N(xn,τ2),n≥0

with , where and . The transport maps of interest are approximated to the order while the expectation is approximated to the order and the minimisation is performed with a tolerance of . The number of samples at each time step as well as the time horizon is computed according to the method proposed in Section 2.1 with different values for the parameter . The performance of the proposed method is compared against the PaRIS algorithm introduced in  using the observations with a varying number of samples and with terms for the propagation of the estimate of . In the simulations, it always holds that to ensure the fairness of the comparison. The criteria for performance assessment is the MSE at the final time step, defined as

 1MM∑i=1(^xi−x∗)2

where is the number of Monte Carlo simulations, is the estimate of (with for the PaRIS algorithm) and where is the corresponding estimate given by the Rauch-Tung-Striebel smoother.

The values of the MSE at the final time obtained in simulations are shown in Figure 1 where the proposed approach displays smaller errors than the PaRIS algorithm for different values of and . The advantage when representing the probability distributions of interest with transport maps is that the computational effort required to obtain a sample is extremely limited once the maps have been determined. For instance, the highest and lowest considered values of in Figure 1 correspond to and samples respectively, which induces a comparatively small increase in computational time. Figure 1: Performance of the proposed method against the PaRIS algorithm with the linear-Gaussian model, averaged over 100 Monte Carlo simulations. The reference for the computation of the MSE is the Rauch-Tung-Striebel smoother. The displayed cost for the multilevel approach includes the computation of the transport maps.

### 4.2 Stochastic Volatility Model

In order to further demonstrate the performance of the proposed approach, the assessment conducted in the previous section is applied to a non-linear case. A stochastic volatility model is considered with

 Xn = μ+ϕ(Xn−1−μ)+Vn,n≥1,X0∼N(μ,11−ϕ2) Yn = Wnexp(12Xn),n≥0

with and , where , and . In the absence of an analytical solution, the reference is determined by the PaRIS algorithm with samples. Since the observation process of this model is generally less informative than the one of the Gaussian model, the PaRIS algorithm is given the observations up to the time step and, similarly, it is ensured that for the proposed approach. The other parameters are the same as in the linear-Gaussian case.

The MSE at the final time obtained for the two considered methods is shown in Figure 2. Once again, the error for the proposed approach is lower than for the PaRIS algorithm although the difference is less significant. In particular, the gain in accuracy between the lowest and the second lowest value of seem to indicate that simply increasing the number of samples would not allow for reducing the error much further. However, increasing the order of the transport maps or decreasing the tolerance in the optimisation could further reduce the error, although with a significantly higher computational cost. Figure 2: Performance of the proposed method against the PaRIS algorithm with the stochastic volatility model, averaged over 100 Monte Carlo simulations. The reference for the computation of the MSE is the PaRIS algorithm with 213 samples. The displayed cost for the multilevel approach includes the computation of the transport maps.

The computational costs obtained for the two models considered in simulations are shown in Figure 3 for different values of . These results confirm the order that was predicted in Section 2. Figure 3: Computational cost as a function of ϵ, averaged over 100 Monte Carlo simulations. The fitted curves are based on a function of the form ϵ↦−aϵ−2−blog(ϵ), with a and b some parameters, which is justified by the form of the cost (3).

## 5 Summary

In this article we have considered large lag smoothing for HMMs, using the MLMC method. We showed that under an optimal coupling when the hidden state is in dimension 1 or higher, but on a compact space that, essentially, the cost can be decoupled from the time parameter of the smoother. As this optimal method is not possible in practice, we showed how it could be approximated and established numerically that our theory still holds in this approximated case. Several extensions to the work are possible. Firstly, to extend our theoretical results to the case of the approximated coupling. Secondly, to investigate whether the coupling used in  can also yield, theoretically, the same improvements that have been seen in the work in this article.

### Acknowledgements

All authors were supported by Singapore Ministry of Education AcRF tier 1 grant R-155-000-182-114. AJ is affiliated with the Risk Management Institute, OR and analytics cluster and the Center for Quantitative Finance at NUS.

## Appendix A Variance Proofs

We write the density (or probability measure) of the smoother, at time , on the co-ordinate at time zero as and the associated CDF as (with generalized inverse ). Recall that throughout is a compact subspace of . Throughout the observations are fixed and often omitted from the notations. The appendix gives our main assumptions, followed by a technical Lemma (Lemma A.1) which features some technical results used in the proofs. Then the proof of Theorem 2.1 is given. The appendix is concluded by a second technical Lemma (Lemma A.2) followed by the proof of Theorem 2.2.

• There exists such that

 infx∈Xg(x,y0)f(x)∧infp≥1inf(x,x′)∈X2g(x′,yp)f(x,x′) ≥ C–– supx∈Xg(x,y0)f(x)∨supp≥1sup(x,x′)∈X2g(x′,yp)f(x,x′) ≤ ¯¯¯¯C.
• There exists such that for every

 |g(x,y0)−g(x′,y0)| ≤ C|x−x′| supz∈X|f(x,z)−f(x′,z)| ≤ C|x−x′| |f(x)−f(x′)| ≤ C|x−x′|.

Below