 # An Imprecise Probabilistic Estimator for the Transition Rate Matrix of a Continuous-Time Markov Chain

We consider the problem of estimating the transition rate matrix of a continuous-time Markov chain from a finite-duration realisation of this process. We approach this problem in an imprecise probabilistic framework, using a set of prior distributions on the unknown transition rate matrix. The resulting estimator is a set of transition rate matrices that, for reasons of conjugacy, is easy to find. To determine the hyperparameters for our set of priors, we reconsider the problem in discrete time, where we can use the well-known Imprecise Dirichlet Model. In particular, we show how the limit of the resulting discrete-time estimators is a continuous-time estimator. It corresponds to a specific choice of hyperparameters and has an exceptionally simple closed-form expression.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Continuous-time Markov chains (CTMCs) are mathematical models that describe the evolution of dynamical systems under (stochastic) uncertainty . They are pervasive throughout science and engineering, finding applications in areas as disparate as medicine, mathematical finance, epidemiology, queueing theory, and others. We here consider time-homogeneous CTMCs that can only be in a finite number of states.

The dynamics of these models are uniquely characterised by a single transition rate matrix . This describes the (locally) linearised dynamics of the model, and is the generator of the semi-group of transition matrices

that determines the conditional probabilities

. In this expression, denotes the uncertain state of the system at time , and so contains the probabilities for the system to move from any state at time zero to any state at time .

In this work, we consider the problem of estimating the matrix from a single realisation of the system up to some finite point in time. This problem is easily solved in both the classical frequentist and Bayesian frameworks, due to the likelihood of the corresponding CTMC belonging to an exponential family; see e.g. the introductions of [7, 3]. The novelty of the present paper is that we instead consider the estimation of in an imprecise probabilistic [14, 1] context.

Specifically, we approach this problem by considering an entire set of Bayesian priors on the likelihood of , leading to a set-valued estimator for . In order to obtain well-founded hyperparameter settings for this set of priors, we recast the problem by interpreting a continuous-time Markov chain as a limit of discrete-time Markov chains. This allows us to consider the imprecise-probabilistic estimators of these discrete-time Markov chains, which are described by the popular Imprecise Dirichlet Model (IDM) . The upshot of this approach is that the IDM has well-known prior hyperparameter settings which can be motivated from first principles [15, 4].

This leads us to the two main results of this work. First of all, we show that the limit of these IDM estimators is a set of transition rate matrices that can be described in closed-form using a very simple formula. Secondly, we identify the hyperparameters of our imprecise CTMC prior such that the resulting estimator is equivalent to the estimator obtained from this discrete-time limit. The proofs of our results can be found in the appendix.

The immediate usefulness of our results is two-fold. From a domain-analysis point of view, where we are interested in the parameter values of the process dynamics, our imprecise estimator provides prior-insensitive information about these values based on the data. If we are instead interested in robust inference about the future behaviour of the system, our imprecise estimator can be used as the main parameter of an imprecise continuous-time Markov chain [13, 5, 8, 6].

## 2 A Brief Refresher on Stochastic Processes

Intuitively, a stochastic process describes the uncertainty in a stochastic system’s behaviour as it moves through some state space as time progresses over some time dimension . A fundamental choice is whether we are considering processes in discrete time, in which case typically , or in continuous time, in which case . Here we write for the natural numbers, and let . The real numbers are denoted by , the positive reals by , and the non-negative reals by . We briefly recall the basic definitions of stochastic processes below; for an introductory work we refer to e.g. .

Formally, a realisation of a stochastic process is a sample path, which is a map . Here represents the state of the process at time . We collect all sample paths in the set and, when , these paths are assumed to be càdlàg under the discrete topology on . With this domain in place, we then consider some abstract underlying probability space , where is some appropriate (-)algebra on , and where is a (countably-)additive probability measure.

The stochastic process can now finally be defined as a family of random variables

associated with this probability space. In particular, for fixed , the quantity is a random variable . Conversely, for a fixed realisation , is a deterministic map .

Well-known and popular kinds of stochastic processes are Markov chains:

###### Definition 1 (Markov Chain)

Fix , and let be a stochastic process. We call this process a Markov chain if, for all for which , it holds that for all . If then , we call a discrete-time Markov chain (DTMC). If instead , we call it a continuous-time Markov chain (CTMC).

Furthermore, attention is often restricted to homogenous Markov chains:

###### Definition 2 (Homogeneous Markov Chain)

Let be a Markov chain. We call this Markov chain (time-)homogeneous if, for all , , and all , it holds that .

This homogeneity property makes such processes particularly easy to describe.

In what follows, we will say that a matrix is a transition matrix

, if it is a real-valued and row stochastic matrix, i.e. if

and for all . We write for the space of all transition matrices. The elements of can be used to describe the single-step conditional probabilities of a (homogeneous) DTMC:

###### Proposition 1 ()

Let be a homogeneous DTMC. Then this process is completely and uniquely characterised by a probability mass function on and some . In particular, and, for all and all , , where is the matrix power of .

On the other hand, to describe CTMCs we need the concept of a (transition) rate matrix: a real-valued matrix with non-negative off-diagonal elements and zero row-sums, i.e. and for all such that . We write for their entire space. A rate matrix describes the “speed” with which a CTMC moves between its states:

###### Proposition 2 ()

Let be a homogeneous CTMC. Then this process is completely and uniquely characterised by a probability mass function on and some . In particular, and, for all and all , , where is the matrix exponential of . Furthermore, for small enough and all , it holds that , where

is the identity matrix.

## 3 Estimation of a CTMC’s Rate Matrix

In what follows, we will derive methods to estimate the rate matrix of a homogeneous CTMC from a realisation that was observed up to some finite point in time . We denote with the restriction of to this interval , and we consider this (finite-duration) observation to be fixed throughout the remainder of this paper.

For any such that , we let denote the number of transitions from state to state in . Furthermore, we let denote the total duration spent in state , that is, we let , where is the indicator of , defined by if and otherwise. We assume in the remainder that for all . Finally, for notational convenience, we define for all .

### 3.1 Precise Estimators

Under the assumption that the realisation was generated by a homogeneous continuous-time Markov chain with rate matrix

, it is well known that the process dynamics can be modelled using exponentially distributed random variables whose parameters are given by the elements of Q. For various of such interpretations, we refer to e.g.

. What matters to us here is that, regardless of the interpretation, we can use this to obtain the following likelihood result (see e.g. ): for a given , the likelihood for a rate matrix is

 L(˜ω|Q)=∏x,y∈Xx≠y(qxy)nxye−qxydx. (1)

The corresponding maximum-likelihood estimator is easily found : if and , where the final expression follows from the (implicit) constraint that the rows of a rate matrix should sum to zero.

Inspection of the likelihood in (1

) reveals that it belongs to an exponential family. This implies that there exists a conjugate prior for the rate matrix

, such that its posterior distribution, given

, belongs to the same family as this prior. This prior is given by a product of Gamma distributions, specifically on the off-diagonal elements

, , of the corresponding rate matrix . We here use a slightly more general joint prior on whose “density” is given by

 f(Q|α,β)\coloneqq∏x,y∈Xx≠y(qxy)αxy−1e−qxyβx∝∏x,y∈Xx≠yGamma(qxy|αxy,βx),−3pt (2)

with shapes and rates in ; we write for the joint parameters.

Note that we have only defined the prior to equal a product of Gamma distributions up to normalisation, so that the prior may be improper. This has the advantage that it allows us to close the parameter domains and allow prior hyperparameters and

, for which the Gamma distribution is not properly defined. We acknowledge that the use of such improper priors is not entirely uncontroversial, and that their interpretation as a prior probability (which it indeed is not) leaves something to be desired. We will nevertheless, in this specific setting, be able to motivate their use here as a consequence of Theorem

4.1 further on.

Also, despite being improper, we can of course combine the prior (2) with the likelihood (1) and fix the normalisation in the posterior. As is well known, the means of the marginals of this posterior are then of the form111The assumption prevents division by zero in (3). However, might be zero and, if then also , the posterior cannot be normalised and will still be improper. Nevertheless, using an intuitive (but formally cumbersome) argument we can still identify this posterior for with the (discrete) distribution putting all mass at zero. Alternatively, we can motivate (3) by continuous extension from the cases where , similarly yielding the estimate at .

 E[qxy|α,β,˜ω]=αxy+nxyβx+dx∀x,y∈X,x≠y. (3)

Furthermore, the (joint) posterior mean is well-known to be a Bayes estimator for under quadratic loss and given the prior  .

The question now remains of how to a priori settle on a “good” choice for these hyperparameters , in the sense that they should adequately represent our prior beliefs. This is a non-trivial problem, and no general solution can be given. A popular (but not uncontroversial) attempt to characterise a non-informative prior consists in choosing the improper prior with ; the posterior mean (Bayes) estimator then equals .

### 3.2 An Imprecise Probabilistic Estimator

Generalising the above Bayesian approach, we here suggest an imprecise probabilistic treatment. Following for example[14, 11], this approach consists in using an entire set of prior distributions. Specifically, we consider a set of the form

 {f(⋅|α,β)∣∣(α,β)∈C}, (4)

with as in (2), and where is a set of possible prior parameters. In this way, we do not have to restrict our attention to one specific choice of the parameters ; rather, we can include all the parameter settings that we deem reasonable, by collecting them in . Inference from is then performed by point-wise updating each of these priors; we thereby obtain a set of posterior distributions on the space of all rate matrices. Each of these posteriors has a mean of the form (3), which is a Bayes estimator for under a specific prior in the set (4). This leads us to consider the imprecise, i.e., set-valued, estimator

 QC\coloneqq{ Q∈Q∣∣∣(∀x,y∈X,x≠y:qxy=αxy+nxyβx+dx),(α,β)∈C}.

Note that even in this imprecise probabilistic approach, we still need to somehow specify the (now set-valued) prior model. That is, we need to be specific about the set . Inspired by the well-known imprecise Dirichlet model , we may choose an “imprecision parameter” , which can be interpreted as a number of “pseudo-counts”, to constrain for all , and to then vary all over their domain . Unfortunately, similar to what is noted in , this leads to undesirable behaviour. For example, as is readily seen from e.g. (3), including unbounded allows the off-diagonal elements to get arbitrarily close to zero, causing the model to a posteriori believe that transitions leaving may be impossible, no matter the number of such transitions that we actually observed in ! Hence, we prefer a different choice of .

One way to circumvent this undesired behaviour is to constrain the range within which each may be varied, to some interval , say. The downside is that this introduces a large number of additional hyperparameters; we then need to (“reasonably”) choose a value for each . Fortunately, our main result – Theorem 4.1 further on – suggests that setting (and therefore ) is in fact a very reasonable choice. This identification is obtained in the next section, using a limit result of discrete-time estimators, for which the hyperparameter settings follow entirely from first principles.

In summary, we keep the “imprecision parameter” and the constraint for all , and simply set for all . We then define to be the largest set of parameters that satisfies these properties. Every in this set can be conveniently identified with the off-diagonal elements of a matrix , with a transition matrix. Our set-valued estimator can thus be written as

 Qs\coloneqq{Q∈Q∣∣∣(∀x,y∈X,x≠y:qxy=sA(x,y)+nxydx),A∈T}. (5)

## 4 Discrete-Time Estimators and Limit Relations

A useful intuition is that we can consider a CTMC as a limit of DTMCs, where we assign increasingly shorter durations to the time steps at which the latter operate. In this section, we will use this connection to relate estimators for DTMCs to estimators for CTMCs. We start by discretising the observed path.

Because the realisation was only observed up to some time , we can discretise the (finite-duration) realisation into a finite number of steps. For any , we write , and we define the discretised path as for all .

For any and , we let denote the number of transitions from state to in , and we let denote the total number of time steps that started in state .

### 4.1 Discrete-Time Estimators

For fixed , we can interpret the discretised path as a finite-duration ( steps long) realisation of a homogeneous discrete-time Markov chain with transition matrix , with keeping track of the discretisation level. Each transition along the path , from state to , say, is then a realisation of a categorical distribution with parameters . The likelihood for , given , is therefore proportional to a product of independent multinomial likelihoods. Hence, the maximum likelihood estimator follows straightforwardly and as expected: for all ; see  for details.

In a Bayesian analysis, and following e.g. , for fixed we can model our uncertainty about the unknown by putting independent Dirichlet priors on the rows . We write this prior as , where is a “prior strength” parameter, and is a prior location parameter. Note that we take in the interior of – under the metric topology on – so that each row corresponds to a strictly positive probability mass function.

After updating with , the posterior mean is an estimator for that is Bayes under quadratic loss and for the specific prior ; due to conjugacy, the posterior is again a product of independent Dirichlet distributions , whence the elements of the posterior mean are

 E[T(x,y)∣∣s,A,w(m)]=sA(x,y)+n(m)xys+n(m)x∀x,y∈X.

What remains is again to determine a good choice for and . However, in an imprecise probabilistic context we do not have to commit to any such choice: the popular Imprecise Dirichlet Model generalises the above approach using a set of Dirichlet priors. This set is given by and can be motivated from first principles [15, 4]. Observe that only a parameter remains, which controls the “degree of imprecision”. In particular, we no longer have to commit to a location parameter ; instead this parameter is freely varied over its entire domain .

Element-wise updating with yields a set of posteriors which, due to conjugacy, are again independent products of Dirichlet distributions. The corresponding set of posterior means thus contains estimators for that are Bayes for a specific prior from the IDM, and is easily verified to be

 T(m)s=⎧⎨⎩T∈T∣∣∣⎛⎝∀x,y∈X:T(x,y)=sA(x,y)+n(m)xys+n(m)x⎞⎠,A∈int(T)⎫⎬⎭.

### 4.2 Limits of Discrete-Time Estimators

As noted in Proposition 2, a rate matrix is connected to the transition probabilities in the sense that for small . Hence, for small , we have that . This becomes exact in the limit for going to zero.

This interpretation can also be used to connect discrete-time estimators for to estimators for . For example, if we let , then . Similarly, we can connect our set-valued estimators for the discretised models to the set-valued continuous-time estimator in (5):

###### Theorem 4.1

For all , let . Then the Painlevé-Kuratowski limit exists, and equals .

## 5 Discussion

We have derived a set-valued estimator for the transition rate matrix of a homogeneous CTMC. It can be motivated both as a set of posterior means of a set of Bayesian models in continuous-time, and as a limit of set-valued discrete-time estimators based on the Imprecise Dirichlet Model. The only parameter of the estimator is a scalar that controls the degree of imprecision. In the special case where there is no imprecision, and then .

The set-valued representation is convenient when one is interested in the numerical values of the transition rates, e.g. for domain-analysis. If one aims to use the estimator to describe an imprecise CTMC [13, 8], a representation using the lower transition rate operator is more convenient. This operator is the lower envelope of a set of rate matrices; for it is given, for all , by

 [Q––h](x)\coloneqqinfQ∈Qs∑y∈XQ(x,y)h(y)=sdxminy∈X(h(y)−h(x))+∑y∈X∖{x}nxydx(h(y)−h(x)),

for all . Hence, is straightforward to evaluate. This implies that when our estimator is used to learn an imprecise CTMC from data, the lower expectations of this imprecise CTMC can be computed efficiently .

## Acknowledgements

The work in this paper was partially supported by H2020-MSCA-ITN-2016 UTOPIAE, grant agreement 722734. The authors wish to thank two anonymous reviewers for their helpful comments and suggestions.

## References

•  Augustin, T., Coolen, F.P.A., De Cooman, G., Troffaes, M.C.M. (eds.): Introduction to Imprecise Probabilities. John Wiley & Sons (2014)
•  Berger, J.O.: Statistical decision theory and Bayesian analysis. Springer (1985)
•  Bladt, M., Sørensen, M.: Statistical inference for discretely observed Markov jump processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(3), 395–410 (2005)
• 

De Cooman, G., De Bock, J., Diniz, M.A.: Coherent predictive inference under exchangeability with imprecise probabilities. Journal of Artificial Intelligence Research 52, 1–95 (2015)

•  De Bock, J.: The limit behaviour of imprecise continuous-time markov chains. Journal of Nonlinear Science 27(1), 159–196 (2017)
•  Erreygers, A., De Bock, J.: Imprecise continuous-time markov chains: Efficient computational methods with guaranteed error bounds. In: Proceedings of ISIPTA 2017. pp. 145–156 (2017)
•  Inamura, Y.: Estimating continuous time transition matrices from discretely observed data. Bank of Japan (2006)
•  Krak, T., De Bock, J., Siebes, A.: Imprecise continuous-time Markov chains. International Journal of Approximate Reasoning 88, 452–528 (2017)
•  Norris, J.R.: Markov chains. Cambridge university press (1998)
•  Quaeghebeur, E.: Learning from samples using coherent lower previsions. Ph.D. thesis
•  Quaehebeur, E., De Cooman, G.: Imprecise probability models for inference in exponential families. In: Proceedings of ISIPTA 2005 (2005)
•  Rockafellar, T.R., Wets, R.J.B.: Variational Analysis. Springer (1997)
•  Škulj, D.: Efficient computation of the bounds of continuous time imprecise Markov chains. Applied Mathematics and Computation 250(C), 165–180 (2015)
•  Walley, P.: Statistical reasoning with imprecise probabilities. Chapman and Hall, London (1991)
•  Walley, P.: Inferences from multinomial data: learning about a bag of marbles. Journal of the Royal Statistical Society, Series B 58, 3–57 (1996)
•  Whitt, W.: Stochastic-process limits: an introduction to stochastic-process limits and their application to queues. Springer Science & Business Media (2002)

## Appendix 0.A Proofs of Main Results

In this appendix, we will assume that the paths are functions in continuous-time, that is, that . As stated in Section 2, we then assume all these paths to be càdlàg under the discrete topology on .

Therefore, and because the realisation is only observed up to time , there are only a finite number of state transitions in and, furthermore, each distinct visit lasts for a strictly positive (but finite) duration; the lemma below makes this formal. The result is essentially well-known, but we had some trouble finding a satisfactory reference for the elementary case where takes at most finitely many values; we therefore prove it as a (somewhat trivial) specialisation of [16, Theorem 12.2.1].

###### Lemma 1

Let be càdlàg. Then there is a finite collection of time points , , with , , and if , such that is constant on for all , and for all .

###### Proof

By assumption is càdlàg on under the discrete topology on . Since is finite, we can identify it without loss of generality with the set , with the number of states. Now let be defined as for all , so that takes values in . Then is simply with its co-domain replaced by ; is therefore by construction càdlàg under the discrete topology on .

For any , any open neighbourhood of in the Euclidean topology contains the set , which is an open neighbourhood of in the discrete topology. It follows that any sequence in that is convergent with limit in the discrete topology, is also convergent with limit in the Euclidean topology; the sequence is eventually in for any open neighbourhood of in the Euclidean topology. Therefore the left-sided limits and right-continuity of under the discrete topology, hold identically under the Euclidean topology; so is also càdlàg under the Euclidean topology on .

By [16, Theorem 12.2.1], has at most a finite number of discontinuities, under the Euclidean norm on , on the interval . Denote these points of discontinuity as , and assume without loss of generality that for all for which . We next include the endpoints of the interval. Note first that , because time cannot be a point of discontinuity due to the càdlàg property; we can therefore introduce . For the endpoint we need to consider two cases, because there is possibly a discontinuity there. If there is no such discontinuity, whence we introduce and set ; otherwise we simply let . We will now verify the properties in the lemma’s statement.

Clearly, by construction, we have that for all , that and and that if . Now fix any ; we know that has no discontinuities on the interval under the Euclidean norm on , and therefore, since is right-continuous at , it has no discontinuities on either. Since only takes values in , it follows that must be constant on . This implies that also is constant on .

Finally, choose any . There is then a discontinuity in at . Since, as we have just shown, is constant on both and , this implies that , and therefore also .∎

In other words, for , the time points are the distinct time points on which state-changes occurred in , and the intervals are time intervals during which the process remained in the same state it had at time . The boundaries and constitute special cases that are included for use in the proof of the next lemma. Notably, there is never a state-change at , and there might, but need not be, a state-change at time .

The above guarantees that the properties of the discretised realisations converge to the properties of the original . Unfortunately, a straightforward statement of the results that we need is (as before, apparently) so elementary that we are unable to find a satisfactory reference. We therefore provide an explicit proof below.

###### Lemma 2

For any càdlàg and all :

1. [label=()]

2. for all ;

3. .

###### Proof

Let , be the finite set of time points whose existence is guaranteed by Lemma 1. Let be the minimum distance between these time points.

We start by proving Property 1. Fix such that . Because , there is some such that, for all , . Fix any such .

Recall that is the number of transitions from to in . Thus,

 n(m)xy=m∑j=1Ix(w(m)(j−1))Iy(w(m)(j)).

Let consist of the indices of the time points at which the actual switches from state to occurred in ; so, let

 Ixy\coloneqq{i∈{1,…,M}:˜ω(ti−1)=x,˜ω(ti)=y}.

Then, clearly, is the true number of transitions from to .

Choose any . Clearly, since , there is a unique such that

 ti−1≤ti−Δ<(ji−1)δ(m)

Then and because and . Therefore . Because this holds for all , and because each has a unique , this implies that .

Conversely, it trivially holds that because the discretisation cannot introduce more state switches. Therefore we must have that . Because this holds for all , it holds that , which concludes the proof for Property 1.

We next prove property Property 2; choose any , and recall that

 dx\coloneqq∫tmax0Ix(˜ω(t))dt, (6)

where is the indicator of . By Lemma 1, has only finitely many discontinuities on the interval . Therefore, the composite function also has only finitely many discontinuities on this interval. It follows that the integral in (6) can be interpreted in the Riemann sense.

Now consider any . Then we have

 δ(m)n(m)x=m−1∑i=0Ix(w(m)(i))δ(m)=m−1∑i=0Ix(˜ω(imtmax))tmaxm,

which we see is a Riemann sum whose limit defines the integral in (6); we immediately conclude that , as claimed. ∎

Proof of Theorem 4.1. We need to show that exists in the Painlevé-Kuratowski sense , and that it is equal to . This requires us to consider the inner limit of —the set of limit points of sequences , with for all —and the outer limit—the set of all accumulation points of such sequences—and to show that they are equal to each other and to . We start by considering the inner limit.

Fix any . It then follows from (5) that and that there is some such that

 Q(x,y)=sA(x,y)+nxydx for all x,y∈X such % that x≠y. (7)

Since , we know that there must be sequence such that . Consider any such sequence.

For all , we now let be the element of that corresponds to , and we let be the corresponding element of . Consider now any such that . For all , we then find that

 Qm(x,y)=sAm(x,y)+n(m)xyδ(m)s+δ(m)n(m)x−I(x,y)δ(m)=sAm(x,y)+n(m)xyδ(m)s+δ(m)n(m)x, (8)

because implies . Furthermore, we also know that

 limm→+∞Am(x,y)=A(x,y),limm→+∞n(m)xy=nxy,limm→+∞δ(m)s=0,limm→+∞δ(m)n(m)x=dx,

making use of Lemma 2 for the equalities that involve and . Therefore, the numerator and denominator converge separately, and we find that

 limm→+∞Qm(x,y)=sA(x,y)+nxydx=Q(x,y).

It remains to look at the diagonal elements. Fix any . Then

 limm→+∞Qm(x,x) =limm→+∞−∑y∈X∖{x}Qm(x,y) =−∑y∈X∖{x}limm→+∞Qm(x,y)=−∑y∈X∖{x}Q(x,y)=Q(x,x),

where the first equality follows from the fact that each is a transition rate matrix, the third equality follows from our earlier result that converges to , and the last equality follows because is a transition rate matrix. We conclude that . Since for all , this implies that , where is the inner limit of . Because this holds for all , we conclude that

 Qs⊆liminfm→+∞Q(m)s. (9)

Next, consider any element of the outer limit. By definition, there is then a sequence , with for all , and a subsequence , such that .

For every , since , we know that there is some that satisfies (8) for all such that . Furthermore, because and is compact, it follows from the Bolzano-Weierstrass theorem that the sequence has a convergent subsequence whose limit belongs to . Hence, without loss of generality, we can assume that , with .

Using completely analogous argumentation as that in the first part of this proof, it now follows that satisfies (7). It follows that the off-diagonal elements of are real-valued and, similar to what we found above, that the rows of sum to zero; hence, the diagonal elements are real-valued as well. Therefore is a transition rate matrix and, since it satisfies (7), this implies that . Since was arbitrary, we conclude that . Since the inner limit is trivially included in the outer one, the result now follows from (9).∎