The Viterbi process, decay-convexity and parallelized maximum a-posteriori estimation

10/08/2018 ∙ by Nick Whiteley, et al.

The Viterbi process is the limiting maximum a-posteriori estimate of the unobserved path in a hidden Markov model as the length of the time horizon grows. The existence of such a process suggests that approximate estimation using optimization algorithms which process data segments in parallel may be accurate. For models on state-space R^d satisfying a new "decay-convexity" condition, we approach the existence of the Viterbi process through fixed points of ordinary differential equations in a certain infinite dimensional Hilbert space. Quantitative bounds on the distance to the Viterbi process show that approximate estimation via parallelization can indeed be accurate and scalable to high-dimensional problems, because the rate of convergence to the Viterbi process does not necessarily depend on d.


1 Introduction

1.1 Background and motivation

Consider a process in which the unobserved component is a Markov chain with state space R^d, whose initial distribution and transition kernel admit densities with respect to Lebesgue measure, and in which the observations, each valued in a measurable space, are conditionally independent given the unobserved chain, with the conditional probability of each observation given the corresponding state admitting a density with respect to a common reference measure.

Models of this form, going by the names of hidden Markov or state-space models, provide a flexible and interpretable framework for describing temporal dependence along data streams in terms of latent processes. They are applied in a wide variety of fields including econometrics, engineering, ecology, machine learning and neuroscience.

With a distinguished observed data sequence considered fixed throughout this paper, define:

(1)

where for any sequence we shall use the usual shorthand for its subsequences. The posterior density of the path given the observations is then determined, up to proportionality, by (1). The maximum a-posteriori path estimation problem is to find:

(2)

In addition to serving as a point estimate of the hidden trajectory, the solution of (2), or in practice some numerical approximation to it, is of interest when calculating the Bayesian information criterion [18] with non-uniform priors over the hidden trajectory; it can be used to initialize Markov chain Monte Carlo algorithms targeting the posterior; and for log-concave posterior densities it is automatically accompanied by universal bounds on highest posterior density credible regions, thanks to concentration of measure inequalities [14, 1].
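To make the finite-horizon problem concrete, the following is a minimal sketch of solving an instance of (2) by gradient-based optimization. The model here (a scalar AR(1) prior with Gaussian observation noise), the parameter values and the use of L-BFGS are illustrative assumptions of this sketch, not prescriptions of the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Toy model (illustrative): d = 1, X_0 ~ N(0,1), X_k = a X_{k-1} + N(0, sigma^2),
# Y_k | X_k ~ N(X_k, tau^2). The objective below is the negative log-posterior,
# i.e. minus the additive quantity in (1), up to an additive constant.
a, sigma, tau = 0.9, 1.0, 0.5
rng = np.random.default_rng(0)
n = 200
x_true = np.zeros(n + 1)
for k in range(1, n + 1):
    x_true[k] = a * x_true[k - 1] + sigma * rng.normal()
y = x_true + tau * rng.normal(size=n + 1)

def neg_log_posterior(x):
    prior = 0.5 * x[0] ** 2
    trans = 0.5 * np.sum((x[1:] - a * x[:-1]) ** 2) / sigma ** 2
    lik = 0.5 * np.sum((y - x) ** 2) / tau ** 2
    return prior + trans + lik

def grad(x):
    g = np.zeros_like(x)
    g[0] += x[0]
    diff = (x[1:] - a * x[:-1]) / sigma ** 2
    g[1:] += diff
    g[:-1] -= a * diff
    g += (x - y) / tau ** 2
    return g

# Solve the finite-horizon MAP problem (2) for this toy model.
res = minimize(neg_log_posterior, np.zeros(n + 1), jac=grad, method="L-BFGS-B")
x_map = res.x
print("converged:", res.success, "objective:", res.fun)
```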

The Viterbi process is a sequence such that for any ,

(3)

Its existence was first studied in the information theory literature, [4, 3], for models in which the state-space of the hidden chain is a finite set and the convergence in (3) is with respect to the discrete metric. The “Viterbi process” name appeared later, in [10], inspired by the famous Viterbi decoding algorithm [20]. We focus on the case of state-space R^d. The only other work known to the author which considers the Viterbi process in the case of state-space R^d is [5], discussed below. They considered convergence in (3) with respect to Euclidean distance.

In these studies the Viterbi process appears to be primarily of theoretical interest. Here we also consider a practical motivation, in a similar spirit to distributed optimization methods, e.g., [12, 15, 16]: the existence of the limit in (3) suggests that (2) can be solved approximately using a collection of optimization algorithms which process data segments in parallel. To sketch the idea, with and integers, consider the index sets:

where is an “overlap” parameter. Suppose the optimization problems:

(4)

are solved in parallel, then in a post-processing step the components of the solutions of (4) indexed by the intersections between the ’s are discarded, and what remains is concatenated to give an approximation to the solution of (2). If it takes time to solve (2), the speed-up from parallelization could be as much as a factor of . The main problem addressed in this paper is to study the rate of convergence to the Viterbi process in (3), and as a corollary we shall quantify the approximation error which trades off against the speed-up from parallelization, as a function of , , , the ingredients of the statistical model and properties of the observation sequence.
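The following is a minimal sketch of this parallelized scheme. The segment layout, the overlap handling and the requirement that the number of observations be a multiple of the number of segments are simplifying assumptions of this sketch; any solver for the per-segment problems (4), such as the gradient-based solve sketched earlier, can be passed in as solve_segment.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def parallel_map_estimate(y, solve_segment, num_segments, overlap):
    """Approximate the MAP path (2) by solving overlapping segment problems of the
    form (4) in parallel, discarding the overlapping components and concatenating
    the remaining 'core' blocks. Assumes len(y) is a multiple of num_segments."""
    n = len(y)
    core = n // num_segments
    jobs = []
    with ProcessPoolExecutor() as pool:
        for j in range(num_segments):
            lo = max(0, j * core - overlap)          # extend the segment by the overlap
            hi = min(n, (j + 1) * core + overlap)
            jobs.append((j, lo, pool.submit(solve_segment, y[lo:hi])))
        pieces = []
        for j, lo, fut in jobs:
            x_seg = fut.result()
            keep = slice(j * core - lo, j * core - lo + core)   # drop the overlaps
            pieces.append(x_seg[keep])
    return np.concatenate(pieces)                    # concatenation approximates (2)
```

Each worker then handles only one segment of the data plus its overlaps, which is the source of the potential speed-up discussed above.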

1.2 Summary of the approach and relation to existing works

We shall approach the solutions of (2) indexed by and their tendency to the Viterbi process in an infinite dimensional Hilbert space, , where is a parameter depending on the model ingredients and which we shall relate to the rate of convergence to the Viterbi process. This approach is new and has three main benefits:

  1. It allows interpretable quantitative bounds accompanying (3) to be obtained, measuring the distance to the Viterbi process in a norm on which gives a stronger notion of convergence than the pointwise convergence in (3).

  2. Via a new “decay-convexity” property of which may be of independent interest, our approach provides a characterization of the Viterbi process as the fixed point of an infinite dimensional ordinary differential equation which arises in the limit .

  3. In turn this allows natural connections to be made to gradient descent algorithms, and estimates of their rates of convergence are easily obtained.

In totality, the collection of assumptions we make is neither stronger nor weaker than the collection of assumptions of [5, Thm 3.1]. Comparisons between some individual assumptions are discussed in section A.5. One commonality is that both our assumptions (see the decay-convexity condition in Theorem 1 combined with Lemma 1) and the assumptions of [5, Thm 3.1] imply that is strongly log-concave, in the sense of [17].

From a statistical modelling perspective, this strong log-concavity might seem quite restrictive. However, any assessment of the merit of assuming strong log-concavity must also take into account its attractive mathematical and computational consequences: strong convexity of objective functions and strong log-concavity of target probability densities endow gradient-descent algorithms and certain families of diffusion Markov chain Monte Carlo algorithms with dimension-free convergence rates [2, 6] and play a role in dimension-free contraction rates for the filtering equations of hidden Markov models [21]. The notion of decay-convexity introduced here extends these dimension-free phenomena; in particular, under our assumptions we shall illustrate that the parameter controlling the rate of convergence to the Viterbi process does not necessarily depend on .

The proof techniques of [5, Thm 3.1] are quite different to ours. There the existence of the limit (3) is established using a converging series argument to bound terms in a dynamic programming recursion. A quantitative bound on the Euclidean distance between and is given in [5, eqs. (3.13) and (3.15)]; we address a stronger notion of convergence on the Hilbert space . The proof of [5, Thm 3.1] is given only in the case , but the same approach may be applicable more generally.

Earlier works concerning discrete-state hidden Markov models [4, 11] establish the existence of the limit in (3) by identifying stopping times which divide the optimal path into unrelated segments. [5, Example 2.1] illustrates that this approach to existence of the limit can also be made to work when the state-space is , but it seems not to easily yield quantitative bounds.

In the broader literature on convex optimization, a theory of sensitivity of optimal points with respect to constraints in convex network optimization problems has been introduced by [15, 16]. The notions of scale-free optimization developed there are similar in spirit to the objectives of the present paper, but the results are not directly comparable to ours since they concern a constrained optimization problem. In the context of unconstrained convex optimization problems with separable objective functions which allow for the structure of (2), [12] addressed the convergence of a min-sum message passing algorithm. Again some aspects of their analysis are similar in spirit to ours, but their aims and results are quite different.

Amongst our main assumptions will be continuous differentiability of the terms on the right of (1). Considering (2) as a regularized maximum likelihood problem, where the regularization comes from and , it would be particularly interesting to relax the differentiability assumption in order to accommodate sparsity-inducing Lasso-type regularizers [19], but this is beyond the scope of the present paper.

Lastly, a further comment about generality: whilst we restrict our attention to the objective functions in (1) associated with hidden Markov models where represents time, the techniques we develop could easily be generalized to objective functions which are additive functionals across tuples (rather than just pairs) of variables, and to situations where the arguments of the objective function are indexed over some set with a spatio-temporal (rather than just temporal) interpretation. Indeed many of the techniques presented here are not specific to hidden Markov models at all and may be of wider interest.

2 Main results

2.1 Definitions and assumptions

With considered fixed, we shall associate with a generic vector the vectors , each in , such that . With and the usual Euclidean inner product and norm on , define the inner product and norm on associated with a given ,

Let be the Hilbert space consisting of the set equipped with the inner-product and the usual element-wise addition and scalar multiplication of vectors over field . For each , denotes the subspace consisting of those such that for , with the convention that . Note that does not actually depend on , but this notation seems natural since we shall often encounter projections from to .
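For orientation, one concrete choice consistent with the role of the parameter as governing decay (written in our own notation, since the display above did not survive extraction, and not necessarily the paper's exact definition) is a geometrically weighted inner product and norm on sequences of R^d-valued blocks:

\[
\langle x, z \rangle_\alpha \;=\; \sum_{n \geq 0} \alpha^{n} \langle x_n, z_n \rangle, \qquad
\| x \|_\alpha \;=\; \Big( \sum_{n \geq 0} \alpha^{n} \| x_n \|^2 \Big)^{1/2},
\]

where $\langle \cdot, \cdot \rangle$ and $\| \cdot \|$ are the Euclidean inner product and norm on $\mathbb{R}^d$; the space then consists of those sequences for which the weighted sum is finite.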

For define

(5)
(6)

and let and be the vectors in whose th entries are the partial derivatives of and with respect to the th entry of (the existence of such derivatives is part of Condition 1 below).

Then for each , define the vector field:

(7)

With these definitions, the first elements of the vector are the partial derivatives of with respect to the elements of , whilst the other elements of the vector are zero. This zero-padding of to make an infinitely long vector is a mathematical convenience which will allow us to treat as a sequence of vector fields on .

Define also

(8)
(9)
Condition 1.

a) , , and , , are everywhere strictly positive and continuously differentiable.

b) there exist constants such that , and for all and

2.2 Viterbi process as the limit of a Cauchy sequence in

Theorem 1.

Assume that Condition 1 holds, and with as therein, let be any value in such that:

(10)

Then with any such that:

(11)

and any ,

(12)

Amongst all the vectors in , there is a unique vector such that , and

(13)

The proof of Theorem 1 is in section A.3.

Remark 1.

Since , the first elements of the vector solve the estimation problem (A.1), and since , the remaining elements of are zero.

Remark 2.

When Condition 1 holds, there always exists satisfying (10) and satisfying (11) because and Condition 1 requires . The case is of interest because if the right hand side of (13) converges to zero as , then is a Cauchy sequence in , yielding the existence of the Viterbi process, as per the following corollary.

Corollary 1.

If in addition to the assumptions of Theorem 1, and , then there exists in such that .

The assumptions of Corollary 1 on , and implicitly involve the observation sequence . A more explicit discussion of the impact of is given in section 3.

2.3 Interpretation of the decay-convexity condition

From here on, (12) will be referred to as “decay-convexity” of . To explain the “convexity” part of this term, note that when , (12) says exactly that is -strongly log-concave, in the sense of [17].

To explain the “decay” part of decay-convexity, let us now address the case . It is well known that strong convexity of a continuously differentiable function is closely connected to exponential contraction properties of the associated gradient-flow ODE. This connection underlies the convergence analysis of gradient-descent algorithms; see, for example, [13, chapter 2]. The inequality (12) can be interpreted similarly for any : using standard arguments for finite-dimensional ODEs (a more general Hilbert space setting is given a full treatment in Proposition 1 in section A.2), it can be shown that when Condition 1a) and (12) hold, there is a unique, globally-defined flow which solves:

(14)

Here is a vector in , and the derivative with respect to time is element-wise. Noting the zero-padding of in (7), the first elements of together constitute the gradient flow associated with , whilst each of the remaining elements is the identity mapping on . Thus for , can be written as a sum of finitely many terms and by simple differentiation,

Since implies , it follows from (12) that

(15)

To see the significance of the case , suppose that the initial conditions are such that for all . Then writing for the first elements of the vector , it follows from (15) that

Thus when , (12) ensures that as , the influence of on decays as with rate given by .
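As a finite-dimensional numerical illustration of the contraction phenomenon behind (14)-(15) (an illustrative sketch of the standard strongly convex case, not the paper's construction): two gradient-flow trajectories started from different points approach each other exponentially fast, at a rate given by the strong convexity constant.

```python
import numpy as np

# Gradient flow dx/dt = -grad U(x) for the strongly convex quadratic
# U(x) = 0.5 * x^T A x (A symmetric positive definite, illustrative choice):
# trajectories contract towards each other at rate lambda_min(A).
rng = np.random.default_rng(1)
d = 10
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                  # ensures lambda_min(A) >= 1
grad_U = lambda x: A @ x

def gradient_flow(x0, t_end, dt=1e-3):
    """Forward-Euler integration of dx/dt = -grad U(x)."""
    x = x0.copy()
    for _ in range(round(t_end / dt)):
        x -= dt * grad_U(x)
    return x

x0, z0 = rng.normal(size=d), rng.normal(size=d)
t = 2.0
gap = np.linalg.norm(gradient_flow(x0, t) - gradient_flow(z0, t))
bound = np.exp(-np.linalg.eigvalsh(A)[0] * t) * np.linalg.norm(x0 - z0)
print(gap, "<=", bound)                   # exponential contraction in t
```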

Turning to the inequalities in (10), observe that if is fixed, these inequalities are satisfied if is large enough. Further observe that if Condition 1b) is satisfied in the extreme case , then each (respectively ) is (respectively )-strongly concave in and then it is immediate that (12) holds. Discussion of how Condition 1b) relates to the model ingredients , , is given in section 3.

Before moving on to an ODE perspective on the Viterbi process, the following lemma addresses the relationship between the cases and , hence explaining the conjunction of “decay” and “convexity” in the name we give to (12).

Lemma 1.

If is twice continuously differentiable and (12) holds for some and , then it also holds with that same and .

The proof is given in section A.3. Lemma 1 can perhaps be generalized from twice to once continuous differentiability by function approximation arguments, but this is a rather technical matter which we do not pursue here.

2.4 Viterbi process as the fixed point of an ODE on

Now define the vector field:

(16)

An important note here about notation and interpretation: element-wise, the vector is the limit as of the vector . Indeed it can be read off from (7) that each element of the vector is constant in for all large enough. However, may not be interpreted as the gradient of the limit of the sequence of functions , because the pointwise limit is in general not well-defined. This reflects the fact that on an infinite time horizon, the prior and posterior probability measures over the entire state sequence are typically singular, so that a density “” does not exist. Hence there is no sense in characterizing the Viterbi process as “”, and the correct characterization is , as Theorem 2 shows via a counterpart of (14)-(15) in the case .

Theorem 2.

In addition to the assumptions of Theorem 1 and with as therein, assume a)-c):

a) there exists a finite constant such that for all and ,

b) ,

c) is continuous in .

Then with as in Theorem 1,

(17)

and there exists a unique and globally defined flow which solves the Fréchet ordinary differential equation,

(18)

This flow has a unique fixed point, , and for all and . Furthermore with as in Theorem 1,

(19)

The proof of Theorem 2 is in section A.3.

The assumptions a)-b) in Theorem 2 ensure that maps to itself. Combined with the continuity in assumption c) and (17), this allows an existence and uniqueness result of [7] for dissipative ordinary differential equations on Banach spaces to be applied in the proof of Theorem 2. It is from here that the Fréchet derivative (18) arises. Background information about Fréchet derivatives is given in section A.1.

3 Discussion

3.1 Bound on the segment-wise error in the parallelized scheme

The error associated with the first segment in the parallelization scheme described in section 1.1 can be bounded using (A.1) or (19); we focus on the latter for simplicity of presentation.

Corollary 2.

If the assumptions of Theorem 2 hold,

(20)

Recall that here is the “overlap” between segments. To see the impact of the observations , recall from (5), (6) and (8) that depends on only through , where is the gradient with respect to . Therefore whether or not the right hand side of (20) converges to zero as depends on the behaviour of as . There are a wide variety of assumptions on which would suffice for convergence to zero. In the particular case of stationary observations, the rate of convergence is exponential:

Lemma 2.

Let be any -valued, stationary stochastic process such that . In particular, need not be distributed according to the hidden Markov model described in section 1.1. Then for any , there exists a stationary process such that almost surely, and such that if in (1)-(2) the sequence is replaced by the random variables , then for any ,

The proof is given in section A.3 and the interested reader can deduce an explicit expression for from the details there.

Bounds on the errors associated with the other segments in the parallelization scheme can be obtained by arguments very similar to those in the proof of Theorem 2. Presenting all the details would involve substantial repetition. However, as a rough approximation, due to additivity of the squared Euclidean distance, the overall error can be expected to scale like (which is the number of segments) times the error in (20).
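To spell out this rough accounting in notation of our own choosing (an illustrative calculation, not a statement from the paper): if the concatenated estimate is $\widehat{x}$, the exact solution of (2) is $x^{\star}$, there are $p$ segments and $x^{\star}_{(j)}$ denotes the block of $x^{\star}$ corresponding to the $j$-th retained core, then

\[
\| \widehat{x} - x^{\star} \|^2 \;=\; \sum_{j=1}^{p} \| \widehat{x}_{(j)} - x^{\star}_{(j)} \|^2
\;\leq\; p \, \max_{j} \| \widehat{x}_{(j)} - x^{\star}_{(j)} \|^2 ,
\]

so the overall squared error is at most the number of segments times the worst per-segment squared error, each of which is controlled by a bound of the form (20).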

3.2 Verifying Condition 1 and dimension independence of and

The following lemma provides an example of a model class satisfying Condition 1, allowing us to illustrate that the constants and appearing in Theorems 1 and 2 do not necessarily have any dependence on the dimension of the state-space, . Here the smallest and largest eigenvalues of a real, symmetric matrix, say , are denoted , .

Lemma 3.

Assume a) and b):

a) The unobserved process satisfies

(21)

where for , is independent of other random variables, , and are positive definite, is a matrix and is a length- vector.

b) For each , is strictly positive, continuously differentiable and there exists such that for all and ,

(22)

If the inequality is satisfied by:

(23)
(24)
(25)

then Condition 1 holds.

The proof is in section A.4.

Remark 3.

The condition (22) is called semi-log-concavity of , generalizing log-concavity by allowing rather than only .

Remark 4.

The fact that , and in (23)-(25) depend only on eigenvalues of , and and on the semi-log-concavity parameter means that they, and consequently and , do not necessarily depend on dimension. As a simple example consider the case: , and , with and . In this situation holds, the inequalities in (10) are satisfied with , and satisfies (10).
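A trivial numerical check of this dimension-independence under illustrative isotropic choices (the specific scalars here are assumptions of this sketch, and the formulas (23)-(25) themselves are not reproduced):

```python
import numpy as np

# For isotropic choices A = a*I_d, Sigma_0 = s0^2*I_d, Sigma = s^2*I_d, the extreme
# eigenvalues entering (23)-(25) equal a, s0**2 and s**2 for every d, so any
# constant built from them is the same in dimension 2 as in dimension 500.
a, s0, s = 0.5, 1.0, 1.0
for d in (2, 50, 500):
    for name, mat in (("A", a * np.eye(d)),
                      ("Sigma0", s0 ** 2 * np.eye(d)),
                      ("Sigma", s ** 2 * np.eye(d))):
        eigs = np.linalg.eigvalsh(mat)
        print(d, name, eigs[0], eigs[-1])   # min and max eigenvalues, independent of d
```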

Remark 5.

The condition can be interpreted as balancing the magnitude of temporal correlation in (21) against the size of the fluctuations of and the degree to which the likelihood is informative about . As the mapping becomes more strongly log-concave, and by inspection of (23)-(25) the condition can always be achieved if takes a large enough negative value, with other quantities on the right of the equations (23)-(25) held constant. On the other hand, if , which implies for any value of , the condition can be achieved if is small enough.

Remark 6.

Considering the case for ease of presentation, a likelihood function which satisfies (22) for some , but not for any is the -centered Student’s t-density, for example in the particular case of degree of freedom: .
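As a worked illustration, taking the Cauchy density (one degree of freedom) as a representative case of our own choosing: with $g(x, y) \propto \left(1 + (y - x)^2\right)^{-1}$,

\[
\frac{\partial^2}{\partial x^2} \log g(x, y) \;=\; \frac{2 (y - x)^2 - 2}{\left(1 + (y - x)^2\right)^2},
\]

which is bounded below by $-2$ (the value at $y = x$) but is strictly positive whenever $|y - x| > 1$. Such a likelihood therefore satisfies a lower bound of the form (22) with a finite negative constant, while $x \mapsto \log g(x, y)$ is not concave.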

Remark 7.

If the likelihood functions are sufficiently strongly log-concave, that is if (22) holds with a sufficiently large negative value, neither log-concavity of the distribution nor linearity of the evolution equation (21) is necessary for Condition 1 to hold – an example is presented in section A.4.

3.3 Gradient descent algorithms

In finite dimensions, it is well known that with a suitably small step size, gradient algorithms associated with strongly convex and gradient-Lipschitz objective functions converge exponentially fast [13, Ch. 2]. Analogous conclusions hold on for the vector field of which the Viterbi process is the fixed point in Theorem 2: starting from some , define:

where indexes algorithm time and is a step size. If assumption c) of Proposition 1 holds, then for all . If also, with some constant , the vector field is Lipschitz continuous:

and satisfies the dissipative assumption b) of Proposition 1, then using one may estimate:

So is sufficient for convergence.
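A minimal finite-dimensional sketch of this iteration (with a quadratic objective standing in for the negative log-posterior, and the dissipativity and Lipschitz constants computed directly from its Hessian; all choices here are illustrative):

```python
import numpy as np

# Gradient descent x_{k+1} = x_k - gamma * F(x_k), where F is the gradient of a
# strongly convex quadratic: F is rho-dissipative and L-Lipschitz, and a step
# size gamma < 2 * rho / L**2 gives a geometric contraction towards the unique
# fixed point F(x*) = 0.
rng = np.random.default_rng(2)
d = 20
M = rng.normal(size=(d, d))
A = M @ M.T / d + np.eye(d)                    # Hessian of the quadratic objective
F = lambda x: A @ x                            # gradient field; fixed point x* = 0
eigs = np.linalg.eigvalsh(A)
rho, L = eigs[0], eigs[-1]                     # dissipativity and Lipschitz constants
gamma = rho / L ** 2                           # satisfies gamma < 2 * rho / L**2
factor = np.sqrt(1 - 2 * gamma * rho + gamma ** 2 * L ** 2)   # per-step contraction

x = x0 = rng.normal(size=d)
for _ in range(500):
    x = x - gamma * F(x)
print(np.linalg.norm(x), "<=", factor ** 500 * np.linalg.norm(x0))
```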

Appendix A Appendix

a.1 Fréchet derivatives

The following definitions can be found in [9, App. A]. For Banach spaces over , with respective norms , , a function has a directional derivative at in direction if there exists such that

The function is Gâteaux differentiable at if exists for all and is a bounded linear operator from to , in which case is called the Gâteaux derivative at . The function is additionally Fréchet differentiable at if

(26)

in which case the operator is called the Fréchet derivative at .
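For reference, the standard definitions being invoked here, written in notation of our own choosing since the displays did not survive extraction (see [9, App. A]): for Banach spaces $\mathcal{U}, \mathcal{V}$, the directional derivative of $f : \mathcal{U} \to \mathcal{V}$ at $u$ in direction $v$ is

\[
\delta f(u; v) \;=\; \lim_{t \to 0} \frac{f(u + t v) - f(u)}{t};
\]

$f$ is Gâteaux differentiable at $u$ if $\delta f(u; v)$ exists for every $v$ and $v \mapsto \delta f(u; v)$ is a bounded linear operator, written $D f(u)$; and $f$ is Fréchet differentiable at $u$ if, in addition,

\[
\lim_{\| v \|_{\mathcal{U}} \to 0} \frac{\left\| f(u + v) - f(u) - D f(u) v \right\|_{\mathcal{V}}}{\| v \|_{\mathcal{U}}} \;=\; 0 ,
\]

which corresponds to (26).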

a.2 ODE’s on the Hilbert space

In the following proposition the operator of orthogonal projection from to is written .

Proposition 1.

For a given triple consisting of a constant , a mapping and , assume that a)-c) hold:

a) is continuous with respect to the norm on ,

b) there exists such that for all ,

c) for all , and the image of by is .

Then there exists a unique and globally defined flow solving the Fréchet ODE,

This flow has a unique fixed point in , , and for all and .

The proof is postponed.

The term in Proposition 1 is an application of the Fréchet derivative of with respect to , that is in (26), is equipped with the Euclidean norm, is the Hilbert space , and is the map , where in the latter the argument is regarded as fixed. Similarly with fixed, and denoting the Fréchet derivative of at by , the quantity is precisely . Thus in particular,

which, in general, is a stronger condition than the element-wise convergence of to .
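Indeed, under a geometrically weighted norm of the kind sketched in section 2.1 (our illustrative form, not necessarily the paper's exact definition), $\alpha^{n} \| x_n - z_n \|^2 \leq \| x - z \|_\alpha^2$ for every $n$, so convergence in $\| \cdot \|_\alpha$ forces every block to converge, whereas block-wise convergence alone need not control the whole weighted sum.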

The following Lemma will be used in the proof of Proposition 1.

Lemma 4.

If a triple satisfies the assumptions of Proposition 1, then with as therein and any ,

(27)
Proof.

In the case , assumption c) of Proposition 1 implies that only the first elements of the vector depend on , and in that case the lemma can be proved by the chain rule of elementary differential calculus. The following proof is valid for any and uses the chain rule of Fréchet differentiation.

Pick any and write them as , with each . The first step is to prove that the mapping is Fréchet differentiable everywhere in , with Fréchet derivative .

Consider the existence of directional derivatives. For let denote the vector in