1.1 Background and motivation
Consider a process where is a Markov chain with state space whose initial distribution and transition kernel admit densities and with respect to Lebesgue measure, and , each valued in a measurable space , are conditionally independent given and such that for any , the conditional probability of given can be written in the form , where and is a measure on .
Models of this form, going by the names of hidden Markov or state-space models, provide a flexible and interpretable framework for describing temporal dependence along data streams in terms of latent processes. They are applied in a wide variety of fields including econometrics, engineering, ecology, machine learning and neuroscience.
With a distinguished observed data sequence considered fixed throughout this paper, define:
where for any sequence , we shall use the shorthand . The posterior density at the path given , say is then proportional to . The maximum a-posteriori path estimation problem given is to find:
In addition to serving as a point estimate of the hidden trajectory, the solution of (2), or in practice some numerical approximation to it, has several uses: it is of interest when calculating the Bayesian information criterion with non-uniform priors over the hidden trajectory; it can be used to initialize Markov chain Monte Carlo algorithms which sample from ; and for log-concave posterior densities it is automatically accompanied by universal bounds on highest posterior density credible regions, thanks to concentration of measure inequalities [14, 1].
The Viterbi process is a sequence such that for any ,
Its existence was first studied in the information theory literature, [4, 3], for models in which the state-space of is a finite set and the convergence in (3) is with respect to the discrete metric. The “Viterbi process” name appeared later, in , inspired by the famous Viterbi decoding algorithm . We focus on the case of state-space . The only other work known to the author which considers the Viterbi process in the case of state-space is , discussed below, which considers convergence in (3) with respect to Euclidean distance.
In these studies the Viterbi process appears to be primarily of theoretical interest. Here we consider also a practical motivation in a similar spirit to distributed optimization methods, e.g., [12, 15, 16]: the existence of the limit in (3) suggests that (2) can be solved approximately using a collection of optimization algorithms which process data segments in parallel. To sketch the idea, with and integers, consider the index sets:
where is an “overlap” parameter. Suppose the optimization problems:
are solved in parallel, and that in a post-processing step the components of the solutions of (4) indexed by the intersections between the ’s are discarded and what remains is concatenated to give an approximation to the solution of (2). If it takes time to solve (2), the speed-up from parallelization could be as much as a factor of . The main problem addressed in this paper is to study the rate of convergence to the Viterbi process in (3), and as a corollary we shall quantify the approximation error which trades off against the speed-up from parallelization as a function of , , , the ingredients of the statistical model and properties of the observation sequence.
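The segment-wise scheme sketched above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the construction analysed in this paper: the toy scalar linear-Gaussian model, the helper names `map_path` and `segmented_map`, the parameters `a`, `q`, `r`, `p0`, and the particular way the overlapping index sets are built (equal blocks extended by `overlap` indices on each side, with no prior term on interior segments) are all hypothetical choices made for illustration.

```python
import numpy as np

def map_path(y, a=0.5, q=1.0, r=1.0, p0=1.0):
    """MAP path of a toy scalar linear-Gaussian model (assumed example).

    Minimises  x_0^2/(2*p0) + sum_k (x_k - a*x_{k-1})^2/(2*q)
                            + sum_k (y_k - x_k)^2/(2*r)
    by solving the tridiagonal normal equations H x = y / r.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    i = np.arange(n)
    H = np.zeros((n, n))
    H[i, i] = 1.0 / r                  # observation terms
    H[0, 0] += 1.0 / p0                # initial-density term (zero if p0 = inf)
    H[i[1:], i[1:]] += 1.0 / q         # transition into x_k
    H[i[:-1], i[:-1]] += a**2 / q      # transition out of x_k
    H[i[1:], i[:-1]] = -a / q          # sub-diagonal
    H[i[:-1], i[1:]] = -a / q          # super-diagonal
    return np.linalg.solve(H, y / r)

def segmented_map(y, block, overlap):
    """Solve overlapping segment problems, drop the overlaps, concatenate."""
    n = len(y)
    pieces = []
    for start in range(0, n, block):   # iterations are mutually independent
        stop = min(start + block, n)
        lo, hi = max(start - overlap, 0), min(stop + overlap, n)
        p0 = 1.0 if lo == 0 else np.inf    # prior term only on the first segment
        seg = map_path(y[lo:hi], p0=p0)
        pieces.append(seg[start - lo:stop - lo])  # discard overlap components
    return np.concatenate(pieces)

rng = np.random.default_rng(0)
y = rng.normal(size=120)
full = map_path(y)                     # solution on the whole horizon
# Euclidean error of the concatenated approximation, per overlap size.
errors = {d: np.linalg.norm(segmented_map(y, block=30, overlap=d) - full)
          for d in (0, 2, 5, 10)}
```

In this toy setting the entries of `errors` shrink quickly as the overlap grows, which is the trade-off between approximation error and parallel speed-up described above. The loop is written sequentially for simplicity; since the segment problems do not share state, they could be dispatched to separate workers.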
1.2 Summary of the approach and relation to existing works
We shall approach the solutions of (2) indexed by and their tendency to the Viterbi process in an infinite dimensional Hilbert space, , where is a parameter depending on the model ingredients and which we shall relate to the rate of convergence to the Viterbi process. This approach is new and has three main benefits:
Via a new “decay-convexity” property of which may be of independent interest, our approach provides a characterization of the Viterbi process as the fixed point of an infinite dimensional ordinary differential equation which arises in the limit .
In turn this allows natural connections to be made to gradient descent algorithms, and estimates of their rates of convergence in easily obtained.
In totality, the collection of assumptions we make is neither stronger nor weaker than the collection of assumptions of [5, Thm 3.1]. Comparisons between some individual assumptions are discussed in section A.5. One commonality is that both our assumptions (see the decay-convexity condition in Theorem 1 combined with Lemma 1) and the assumptions of [5, Thm 3.1] imply that is strongly log-concave, in the sense of .
From a statistical modelling perspective, this strong log-concavity might seem quite restrictive. However, the merit of assuming strong log-concavity must also take into account its attractive mathematical and computational consequences: strong convexity of objective functions and strong log-concavity of target probability densities endow gradient-descent algorithms and certain families of diffusion Markov chain Monte Carlo algorithms with dimension-free convergence rates [2, 6] and play a role in dimension-free contraction rates for the filtering equations of hidden Markov models . The notion of decay-convexity introduced here extends these dimension-free phenomena; in particular, under our assumptions we shall illustrate that the parameter controlling the rate of convergence to the Viterbi process does not necessarily depend on .
The proof techniques of [5, Thm 3.1] are quite different to ours. There the existence of the limit (3) is established using a converging series argument to bound terms in a dynamic programming recursion. A quantitative bound on the Euclidean distance between and is given in [5, eqs. (3.13) and (3.15)]; we address a stronger notion of convergence on the Hilbert space . The proof of [5, Thm 3.1] is given only in the case , but the same approach may be applicable more generally.
Earlier works concerning discrete-state hidden Markov models [4, 11] establish the existence of the limit in (3) by identifying stopping times which divide the optimal path into unrelated segments. [5, Example 2.1] illustrates that this approach to existence of the limit can also be made to work when the state-space is , but it seems not to easily yield quantitative bounds.
In the broader literature on convex optimization, a theory of sensitivity of optimal points with respect to constraints in convex network optimization problems has been introduced by [15, 16]. The notions of scale-free optimization developed there are similar in spirit to the objectives of the present paper, but the results are not directly comparable to ours since they concern a constrained optimization problem. In the context of unconstrained convex optimization problems with separable objective functions which allow for the structure of (2),  addressed the convergence of a min-sum message passing algorithm. Again some aspects of their analysis are similar in spirit to ours, but their aims and results are quite different.
Amongst our main assumptions will be continuous differentiability of the terms on the right of (1). Considering (3) as a regularized maximum likelihood problem, where the regularization comes from and , it would be particularly interesting to relax the differentiability assumption in order to accommodate sparsity inducing Lasso-type regularizers , but this is beyond the scope of the present paper.
Lastly, a further comment about generality: whilst we restrict our attention to the objective functions in (1) associated with hidden Markov models where represents time, the techniques we develop could easily be generalized to objective functions which are additive functionals across tuples (rather than just pairs) of variables, and to situations where the arguments of the objective function are indexed over some set with a spatio-temporal (rather than just temporal) interpretation. Indeed many of the techniques presented here are not specific to hidden Markov models at all and may be of wider interest.
2 Main results
2.1 Definitions and assumptions
considered fixed, we shall associate with a generic vector the vectors , each in , such that . With and the usual Euclidean inner product and norm on , define the inner product and norm on associated with a given ,
Let be the Hilbert space consisting of the set equipped with the inner-product and the usual element-wise addition and scalar multiplication of vectors over field . For each , denotes the subspace consisting of those such that for , with the convention that . Note that does not actually depend on , but this notation seems natural since we shall often encounter projections from to .
and let and be the vectors in whose th entries are the partial derivatives of and with respect to the th entry of (the existence of such derivatives is part of Condition 1 below).
Then for each , define the vector field:
With these definitions, the first elements of the vector are the partial derivatives of with respect to the elements of , whilst the other elements of the vector are zero. This zero-padding of to make an infinitely long vector is a mathematical convenience which will allow us to treat as a sequence of vector fields on .
a) , , and , , are everywhere strictly positive and continuously differentiable.
b) there exist constants such that , and for all
2.2 Viterbi process as the limit of a Cauchy sequence in
Assume that Condition 1 holds, and with as therein, let be any value in such that:
Then with any such that:
and any ,
Amongst all the vectors in , there is a unique vector such that , and
Since , the first elements of the vector solve the estimation problem (A.1), and since , the remaining elements of are zero.
When Condition 1 holds, there always exists satisfying (10) and satisfying (11) because and Condition 1 requires . The case is of interest because if the right hand side of (13) converges to zero as , then is a Cauchy sequence in , yielding the existence of the Viterbi process, as per the following corollary.
If in addition to the assumptions of Theorem 1, and , then there exists in such that .
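The convergence asserted in this corollary can be observed numerically in a simple case. The sketch below is illustrative only and rests on assumptions not taken from this paper: a toy scalar linear-Gaussian model with hypothetical parameters `a`, `q`, `r`, `p0`, a hypothetical helper `map_path`, and a plain Euclidean distance on the first ten coordinates, which is a cruder measure than the Hilbert space norm used here. A long horizon of 200 observations serves as a proxy for the limit.

```python
import numpy as np

def map_path(y, a=0.5, q=1.0, r=1.0, p0=1.0):
    """MAP path of a toy scalar linear-Gaussian model (assumed example).

    Minimises  x_0^2/(2*p0) + sum_k (x_k - a*x_{k-1})^2/(2*q)
                            + sum_k (y_k - x_k)^2/(2*r)
    by solving the tridiagonal normal equations H x = y / r.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    i = np.arange(n)
    H = np.zeros((n, n))
    H[i, i] = 1.0 / r                  # observation terms
    H[0, 0] += 1.0 / p0                # initial-density term
    H[i[1:], i[1:]] += 1.0 / q         # transition into x_k
    H[i[:-1], i[:-1]] += a**2 / q      # transition out of x_k
    H[i[1:], i[:-1]] = -a / q          # sub-diagonal
    H[i[:-1], i[1:]] = -a / q          # super-diagonal
    return np.linalg.solve(H, y / r)

rng = np.random.default_rng(1)
y = rng.normal(size=200)

# First ten coordinates of the long-horizon solution, used as a proxy
# for the corresponding coordinates of the limiting path.
reference = map_path(y)[:10]

# Distance between the first ten coordinates of the horizon-n solution
# and the proxy, for increasing horizons n.
gaps = {n: np.linalg.norm(map_path(y[:n])[:10] - reference)
        for n in (12, 20, 40)}
```

In this example the entries of `gaps` fall rapidly as the horizon grows, consistent with the early coordinates of the finite-horizon solutions forming a Cauchy sequence.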
2.3 Interpretation of the decay-convexity condition
To explain the “decay” part of decay-convexity, let us now address the case . It is well known that strong convexity of a continuously differentiable function is closely connected to exponential contraction properties of the associated gradient-flow ODE. This connection underlies convergence analysis of gradient-descent algorithms, see for example [13, chapter 2]. The inequality (12) can be interpreted similarly for any : using standard arguments for finite-dimensional ODE’s (a more general Hilbert space setting is given a full treatment in Proposition 1 in section A.2), it can be shown that when Condition 1a) and (12) hold, there is a unique, globally-defined flow which solves:
Here is a vector in , and the derivative with respect to time is element-wise. Noting the zero-padding of in (7), the first elements of together constitute the gradient flow associated with , whilst each of the remaining elements is the identity mapping on . Thus for , can be written as a sum of finitely many terms and by simple differentiation,
Since implies , it follows from (12) that
To see the significance of the case , suppose that the initial conditions are such that for all . Then writing for the first elements of the vector , it follows from (15) that
Thus when , (12) ensures that as , the influence of on decays as with rate given by .
Turning to the inequalities in (10), observe that if is fixed, these inequalities are satisfied if is large enough. Further observe that if Condition 1b) is satisfied in the extreme case , then each (respectively ) is (respectively )-strongly concave in and then it is immediate that (12) holds. Discussion of how Condition 1b) relates to the model ingredients , , is given in section 3.
Before moving on to an ODE perspective on the Viterbi process, the following lemma addresses the relationship between the cases and , hence explaining the conjunction of “decay” and “convexity” in the name we give to (12).
If is twice continuously differentiable and (12) holds for some and , then it also holds with that same and .
2.4 Viterbi process as the fixed point of an ODE on
Now define the vector field:
An important note here about notation and interpretation: element-wise, the vector is the limit as of the vector . Indeed it can be read off from (7) that each element of the vector is constant in for all large enough. However, may not be interpreted as the gradient of the limit of the sequence of functions , because the pointwise limit
is in general not well-defined. This reflects the fact that on an infinite time horizon, the prior and posterior probability measures over the entire state sequence are typically singular, so that a density “” does not exist. Hence there is no sense in characterizing the Viterbi process as: “”; the correct characterization is: , as Theorem 2 shows via a counterpart of (14)-(15) in the case .
In addition to the assumptions of Theorem 1 and with as therein, assume a)-c):
a) there exists a finite constant such that for all and ,
c) is continuous in .
The assumptions a)-b) in Theorem 2 ensure that maps to itself. Combined with the continuity in assumption c) and (17), this allows an existence and uniqueness result of  for dissipative ordinary differential equations on Banach spaces to be applied in the proof of Theorem 2. It is from here that the Fréchet derivative (18) arises. Background information about Fréchet derivatives is given in section A.1.
3.1 Bound on the segment-wise error in the parallelized scheme
If the assumptions of Theorem 2 hold,
Recall that here is the “overlap” between segments. To see the impact of the observations , recall from (5), (6) and (8) that depends on only through where is the gradient with respect to . Therefore whether or not the right hand side of (20) converges to zero as depends on the behaviour of as . There are a wide variety of assumptions on which would suffice for convergence to zero. In the particular case of stationary observations, the rate of convergence is exponential:
be any -valued, stationary stochastic process such that
In particular, need not be distributed according to the hidden Markov model described in section 1.1. Then for any , there exists a stationary process such that
almost surely, and such that if in (1)-(2) the sequence is replaced by the random variables , then for any ,
The proof is given in section A.3 and the interested reader can deduce an explicit expression for from the details there.
Bounds on the errors associated with the other segments in the parallelization scheme can be obtained by very similar arguments to those in the proof of Theorem 2. Presenting all the details would involve substantial repetition. However, as a rough approximation, due to additivity of the squared Euclidean distance, the overall error can be expected to scale like (which is the number of segments) times the error in (20).
3.2 Verifying Condition 1 and dimension independence of and
The following lemma provides an example of a model class satisfying Condition 1, allowing us to illustrate that the constants and appearing in Theorems 1 and 2 do not necessarily have any dependence on the dimension of the state-space,
. Here the smallest and largest eigenvalues of a real, symmetric matrix, say, are denoted , .
Assume a) and b):
a) The unobserved process satisfies
where for , is independent of other random variables, , and are positive definite, is a matrix and is a length- vector.
b) For each , is strictly positive, continuously differentiable and there exists such that for all and ,
Then, if the inequality is satisfied by:
then Condition 1 holds.
The proof is in section A.4.
The condition (22) is called semi-log-concavity of , generalizing log-concavity by allowing rather than only .
The fact that , and in (23)-(25) depend only on eigenvalues of , and and the semi-concavity parameter means they, and consequently and , do not necessarily depend on dimension. As a simple example consider the case: , and , with and . In this situation holds, the inequalities in (10) are satisfied with , and satisfies (10).
The condition can be interpreted as balancing the magnitude of temporal correlation in (21) against the size of the fluctuations of and the degree to which the likelihood is informative about . As the mapping becomes more strongly log-concave, and by inspection of (23)-(25) the condition can always be achieved if takes a large enough negative value, with other quantities on the right of the equations (23)-(25) held constant. On the other hand, if , which implies for any value of , the condition can be achieved if is small enough.
3.3 Gradient descent algorithms
In finite dimensions, it is well known that with suitably small step size, gradient algorithms associated with strongly convex and gradient-Lipschitz objective functions converge exponentially fast [13, Ch. 2]. Analogous conclusions hold on for the vector field of which the Viterbi process is the fixed point in Theorem 2: starting from some , define:
where indexes algorithm time and is a step size. If assumption c) of Proposition 1 holds then for all . If also with some constant the vector field is Lipschitz continuous:
and satisfies the dissipative assumption b) of Proposition 1, then using one may estimate:
So is sufficient for convergence.
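The finite-dimensional analogue of this iteration is easy to demonstrate. The following is a minimal sketch under assumptions that are not part of this paper: a toy scalar linear-Gaussian model whose log-posterior is a concave quadratic with Hessian `-H`, a hypothetical helper `hessian_and_rhs`, and a step size `alpha = 1/L` with `L` the largest eigenvalue of `H`. That step size is a standard choice for gradient-Lipschitz objectives and is not claimed to be the step-size condition derived above.

```python
import numpy as np

def hessian_and_rhs(y, a=0.5, q=1.0, r=1.0, p0=1.0):
    """Return (H, b) with log-posterior gradient b - H x (toy model, assumed)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    i = np.arange(n)
    H = np.zeros((n, n))
    H[i, i] = 1.0 / r                  # observation terms
    H[0, 0] += 1.0 / p0                # initial-density term
    H[i[1:], i[1:]] += 1.0 / q         # transition into x_k
    H[i[:-1], i[:-1]] += a**2 / q      # transition out of x_k
    H[i[1:], i[:-1]] = -a / q          # sub-diagonal
    H[i[:-1], i[1:]] = -a / q          # super-diagonal
    return H, y / r

rng = np.random.default_rng(2)
y = rng.normal(size=50)
H, b = hessian_and_rhs(y)
x_star = np.linalg.solve(H, b)         # exact maximiser, for comparison

L = np.linalg.eigvalsh(H).max()        # Lipschitz constant of the gradient
alpha = 1.0 / L                        # assumed step size

x = np.zeros_like(y)
errs = []
for _ in range(60):
    errs.append(np.linalg.norm(x - x_star))
    x = x + alpha * (b - H @ x)        # gradient step on the log-posterior
errs.append(np.linalg.norm(x - x_star))
```

Here `errs` decays geometrically, at a rate governed by the ratio of the smallest to the largest eigenvalue of `H`, mirroring the roles of the dissipativity and Lipschitz constants in the estimate above.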
Appendix A
A.1 Fréchet derivatives
The following definitions can be found in [9, App. A]. For Banach spaces over , with respective norms , , a function has a directional derivative at in direction if there exists such that
The function is Gâteaux differentiable at if exists for all and is a bounded linear operator from to , in which case is called the Gâteaux derivative at . The function is additionally Fréchet differentiable at if
in which case the operator is called the Fréchet derivative at .
A.2 ODE’s on the Hilbert space
In the following proposition the operator of orthogonal projection from to is written .
For a given triple consisting of a constant , a mapping and , assume that a)-c) hold:
a) is continuous with respect to the norm on ,
b) there exists such that for all ,
c) for all , and the image of by is .
Then there exists a unique and globally defined flow solving the Fréchet ODE,
This flow has a unique in fixed point, , and for all and .
The proof is postponed.
The term in Proposition 1 is an application of the Fréchet derivative of with respect to , that is in (26), is equipped with the Euclidean norm, is the Hilbert space , and is the map , where in the latter the argument is regarded as fixed. Similarly with fixed, and denoting the Fréchet derivative of at by , the quantity is precisely . Thus in particular,
which, in general, is a stronger condition than the element-wise convergence of to .
The following Lemma will be used in the proof of Proposition 1.
If a triple satisfies the assumptions of Proposition 1, then with as therein and any ,
In the case , assumption c) of Proposition 1 implies that only the first elements of the vector depend on , and in that case the lemma can be proved by the chain rule of elementary differential calculus. The following proof is valid for any and uses the chain rule of Fréchet differentiation.
Pick any , write them as , with each . The first step is to prove that the mapping is Fréchet differentiable everywhere in , with Fréchet derivative .
Consider the existence of directional derivatives. For let denote the vector in