1 Introduction
Continuous-time Markov chains (CTMCs) are mathematical models that describe the evolution of dynamical systems under (stochastic) uncertainty [9]. They are pervasive throughout science and engineering, finding applications in areas as disparate as medicine, mathematical finance, epidemiology, and queueing theory. We here consider time-homogeneous CTMCs that can only be in a finite number of states.
The dynamics of these models are uniquely characterised by a single transition rate matrix $Q$. This matrix describes the (locally) linearised dynamics of the model, and is the generator of the semigroup of transition matrices $T_t = e^{tQ}$, $t \in \mathbb{R}_{\ge 0}$, that determines the conditional probabilities $P(X_t = y \mid X_0 = x) = T_t(x,y)$. In this expression, $X_t$ denotes the uncertain state of the system at time $t$, and so $T_t$ contains the probabilities for the system to move from any state at time zero to any state at time $t$.

In this work, we consider the problem of estimating the matrix $Q$ from a single realisation of the system up to some finite point in time. This problem is easily solved in both the classical frequentist and Bayesian frameworks, due to the likelihood of the corresponding CTMC belonging to an exponential family; see e.g. the introductions of [7, 3]. The novelty of the present paper is that we instead consider the estimation of $Q$ in an imprecise probabilistic [14, 1] context.
Specifically, we approach this problem by considering an entire set of Bayesian priors on the likelihood of $Q$, leading to a set-valued estimator for $Q$. In order to obtain well-founded hyperparameter settings for this set of priors, we recast the problem by interpreting a continuous-time Markov chain as a limit of discrete-time Markov chains. This allows us to consider the imprecise-probabilistic estimators of these discrete-time Markov chains, which are described by the popular Imprecise Dirichlet Model (IDM) [10]. The upshot of this approach is that the IDM has well-known prior hyperparameter settings which can be motivated from first principles [15, 4].
This leads us to the two main results of this work. First of all, we show that the limit of these IDM estimators is a set of transition rate matrices that can be described in closed form using a very simple formula. Secondly, we identify the hyperparameters of our imprecise CTMC prior such that the resulting estimator is equivalent to the estimator obtained from this discrete-time limit. The proofs of our results can be found in the appendix.
The immediate usefulness of our results is twofold. From a domain-analysis point of view, where we are interested in the parameter values of the process dynamics, our imprecise estimator provides prior-insensitive information about these values based on the data. If we are instead interested in robust inference about the future behaviour of the system, our imprecise estimator can be used as the main parameter of an imprecise continuous-time Markov chain [13, 5, 8, 6].
2 A Brief Refresher on Stochastic Processes
Intuitively, a stochastic process describes the uncertainty in a stochastic system’s behaviour as it moves through some state space $\mathcal{X}$ as time progresses over some time dimension $\mathbb{T}$. A fundamental choice is whether we are considering processes in discrete time, in which case typically $\mathbb{T} = \mathbb{N}_0$, or in continuous time, in which case $\mathbb{T} = \mathbb{R}_{\ge 0}$. Here we write $\mathbb{N}$ for the natural numbers, and let $\mathbb{N}_0 := \mathbb{N} \cup \{0\}$. The real numbers are denoted by $\mathbb{R}$, the positive reals by $\mathbb{R}_{>0}$, and the non-negative reals by $\mathbb{R}_{\ge 0}$. We briefly recall the basic definitions of stochastic processes below; for an introductory work we refer to e.g. [9].
Formally, a realisation of a stochastic process is a sample path, which is a map $\omega : \mathbb{T} \to \mathcal{X}$. Here $\omega(t)$ represents the state of the process at time $t \in \mathbb{T}$. We collect all sample paths in the set $\Omega$ and, when $\mathbb{T} = \mathbb{R}_{\ge 0}$, these paths are assumed to be càdlàg under the discrete topology on $\mathcal{X}$. With this domain in place, we then consider some abstract underlying probability space $(\Omega, \mathcal{F}, P)$, where $\mathcal{F}$ is some appropriate ($\sigma$-)algebra on $\Omega$, and where $P$ is a (countably) additive probability measure.
The stochastic process can now finally be defined as a family of random variables $(X_t)_{t \in \mathbb{T}}$ associated with this probability space. In particular, for fixed $t \in \mathbb{T}$, the quantity $X_t$ is a random variable $X_t : \Omega \to \mathcal{X} : \omega \mapsto \omega(t)$. Conversely, for a fixed realisation $\omega \in \Omega$, the assignment $t \mapsto X_t(\omega)$ is a deterministic map from $\mathbb{T}$ to $\mathcal{X}$.

Well-known and popular kinds of stochastic processes are Markov chains:
Definition 1 (Markov Chain)
Fix $\mathbb{T}$, and let $(X_t)_{t \in \mathbb{T}}$ be a stochastic process. We call this process a Markov chain if, for all $t_1 < t_2 < \cdots < t_n$ in $\mathbb{T}$ for which $n > 2$, it holds that $P(X_{t_n} = x_n \mid X_{t_1} = x_1, \ldots, X_{t_{n-1}} = x_{n-1}) = P(X_{t_n} = x_n \mid X_{t_{n-1}} = x_{n-1})$ for all $x_1, \ldots, x_n \in \mathcal{X}$. If $\mathbb{T} = \mathbb{N}_0$, we call it a discrete-time Markov chain (DTMC). If instead $\mathbb{T} = \mathbb{R}_{\ge 0}$, we call it a continuous-time Markov chain (CTMC).
Furthermore, attention is often restricted to homogeneous Markov chains:
Definition 2 (Homogeneous Markov Chain)
Let $(X_t)_{t \in \mathbb{T}}$ be a Markov chain. We call this Markov chain (time-)homogeneous if, for all $t, s \in \mathbb{T}$ and all $x, y \in \mathcal{X}$, it holds that $P(X_{t+s} = y \mid X_t = x) = P(X_s = y \mid X_0 = x)$.
This homogeneity property makes such processes particularly easy to describe.
In what follows, we will say that a matrix $T$ is a transition matrix if it is a real-valued and row-stochastic matrix, i.e. if $\sum_{y \in \mathcal{X}} T(x,y) = 1$ and $T(x,y) \ge 0$ for all $x, y \in \mathcal{X}$. We write $\mathcal{T}$ for the space of all transition matrices. The elements of $\mathcal{T}$ can be used to describe the single-step conditional probabilities of a (homogeneous) DTMC:

Proposition 1 ([9])
Let $(X_t)_{t \in \mathbb{N}_0}$ be a homogeneous DTMC. Then this process is completely and uniquely characterised by a probability mass function $p_0$ on $\mathcal{X}$ and some $T \in \mathcal{T}$. In particular, $P(X_0 = x) = p_0(x)$ and, for all $n \in \mathbb{N}$ and all $x, y \in \mathcal{X}$, $P(X_n = y \mid X_0 = x) = T^n(x,y)$, where $T^n$ is the $n$-th matrix power of $T$.
On the other hand, to describe CTMCs we need the concept of a (transition) rate matrix: a real-valued matrix $Q$ with non-negative off-diagonal elements and zero row sums, i.e. $\sum_{y \in \mathcal{X}} Q(x,y) = 0$ for all $x \in \mathcal{X}$, and $Q(x,y) \ge 0$ for all $x, y \in \mathcal{X}$ such that $x \ne y$. We write $\mathcal{Q}$ for their entire space. A rate matrix describes the “speed” with which a CTMC moves between its states:
Proposition 2 ([9])
Let $(X_t)_{t \in \mathbb{R}_{\ge 0}}$ be a homogeneous CTMC. Then this process is completely and uniquely characterised by a probability mass function $p_0$ on $\mathcal{X}$ and some $Q \in \mathcal{Q}$. In particular, $P(X_0 = x) = p_0(x)$ and, for all $t \in \mathbb{R}_{\ge 0}$ and all $x, y \in \mathcal{X}$, $P(X_t = y \mid X_0 = x) = e^{tQ}(x,y)$, where $e^{tQ}$ is the matrix exponential of $tQ$. Furthermore, for small enough $\Delta \in \mathbb{R}_{\ge 0}$ and all $x, y \in \mathcal{X}$, it holds that $e^{\Delta Q}(x,y) \approx (I + \Delta Q)(x,y)$, where $I$ is the identity matrix.
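To make the role of the matrix exponential concrete, the following sketch computes $e^{tQ}$ for a small made-up rate matrix and checks the approximation $e^{\Delta Q} \approx I + \Delta Q$ from Proposition 2. The implementation (a plain Taylor series with scaling-and-squaring) and all numerical values are illustrative choices of ours, not part of the paper.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(Q, t, terms=30, squarings=10):
    """Approximate e^{tQ}: Taylor series for exp(hQ) with h = t / 2^squarings,
    then square the result `squarings` times (scaling-and-squaring)."""
    n = len(Q)
    h = t / (2 ** squarings)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # I
    term = [[float(i == j) for j in range(n)] for i in range(n)]    # (hQ)^k / k!
    for k in range(1, terms):
        term = mat_mul(term, [[h * q / k for q in row] for row in Q])
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# A 2-state rate matrix: non-negative off-diagonals, rows sum to zero.
Q = [[-1.0, 1.0],
     [2.0, -2.0]]

# Each row of e^{tQ} is a conditional probability mass function.
T1 = mat_exp(Q, 1.0)
for row in T1:
    assert abs(sum(row) - 1.0) < 1e-9

# For small Delta, e^{Delta Q} is close to I + Delta Q, as in Proposition 2.
delta = 1e-4
T_delta = mat_exp(Q, delta)
approx = [[(1.0 if i == j else 0.0) + delta * Q[i][j] for j in range(2)]
          for i in range(2)]
max_err = max(abs(T_delta[i][j] - approx[i][j])
              for i in range(2) for j in range(2))
assert max_err < 1e-6
```

The error of the linearisation $I + \Delta Q$ is of order $\Delta^2$, which is why it only becomes exact in the limit $\Delta \to 0$.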
3 Estimation of a CTMC’s Rate Matrix
In what follows, we will derive methods to estimate the rate matrix $Q$ of a homogeneous CTMC from a realisation that was observed up to some finite point in time $t_{\max} \in \mathbb{R}_{>0}$. We denote with $\omega$ the restriction of this realisation to the interval $[0, t_{\max}]$, and we consider this (finite-duration) observation $\omega$ to be fixed throughout the remainder of this paper.
For any $x, y \in \mathcal{X}$ such that $x \ne y$, we let $n_{xy}$ denote the number of transitions from state $x$ to state $y$ in $\omega$. Furthermore, we let $T_x$ denote the total duration spent in state $x$, that is, we let $T_x := \int_0^{t_{\max}} \mathbb{I}_x(\omega(t)) \,\mathrm{d}t$, where $\mathbb{I}_x$ is the indicator of $x$, defined by $\mathbb{I}_x(y) := 1$ if $x = y$ and $\mathbb{I}_x(y) := 0$ otherwise. We assume in the remainder that $T_x > 0$ for all $x \in \mathcal{X}$. Finally, for notational convenience, we define $n_{xx} := -\sum_{y \in \mathcal{X} \setminus \{x\}} n_{xy}$ for all $x \in \mathcal{X}$.
3.1 Precise Estimators
Under the assumption that the realisation $\omega$ was generated by a homogeneous continuous-time Markov chain with rate matrix $Q$, it is well known that the process dynamics can be modelled using exponentially distributed random variables whose parameters are given by the elements of $Q$. For various such interpretations, we refer to e.g. [9]. What matters to us here is that, regardless of the interpretation, we can use this to obtain the following likelihood result (see e.g. [7]): for a given $\omega$, the likelihood for a rate matrix $Q$ is

(1)  $L(\omega \mid Q) \propto \prod_{x \in \mathcal{X}} \prod_{y \in \mathcal{X} \setminus \{x\}} Q(x,y)^{n_{xy}} \, e^{-Q(x,y) T_x}.$

The corresponding maximum-likelihood estimator is easily found [7]: $\hat{Q}(x,y) = n_{xy}/T_x$ if $x \ne y$, and $\hat{Q}(x,x) = n_{xx}/T_x = -\sum_{y \in \mathcal{X} \setminus \{x\}} n_{xy}/T_x$, where the final expression follows from the (implicit) constraint that the rows of a rate matrix should sum to zero.
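The counting statistics and the maximum-likelihood estimator can be sketched as follows. The representation of a path as a list of (state, duration) sojourns, and all names and values, are hypothetical choices of ours for illustration.

```python
from collections import defaultdict

def count_statistics(sojourns):
    """Given a path as sojourns [(state, duration), ...], return (n, T):
    n[x][y] = number of transitions x -> y, T[x] = total duration in x."""
    n = defaultdict(lambda: defaultdict(int))
    T = defaultdict(float)
    for i, (x, d) in enumerate(sojourns):
        T[x] += d
        if i + 1 < len(sojourns):
            n[x][sojourns[i + 1][0]] += 1
    return n, T

def mle_rate_matrix(sojourns, states):
    """Off-diagonal: q_xy = n_xy / T_x; the diagonal is fixed by the
    zero-row-sum constraint on rate matrices."""
    n, T = count_statistics(sojourns)
    Q = {x: {y: (n[x][y] / T[x] if x != y else 0.0) for y in states}
         for x in states}
    for x in states:
        Q[x][x] = -sum(Q[x][y] for y in states if y != x)
    return Q

# Example path a -> b -> a -> b, observed up to time 4.0.
path = [("a", 1.0), ("b", 0.5), ("a", 1.5), ("b", 1.0)]
Q_hat = mle_rate_matrix(path, ["a", "b"])
assert abs(Q_hat["a"]["b"] - 2 / 2.5) < 1e-12   # n_ab = 2, T_a = 2.5
assert abs(Q_hat["b"]["a"] - 1 / 1.5) < 1e-12   # n_ba = 1, T_b = 1.5
assert abs(sum(Q_hat["a"].values())) < 1e-12    # rows sum to zero
```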
Inspection of the likelihood in (1) reveals that it belongs to an exponential family. This implies that there exists a conjugate prior for the rate matrix $Q$, such that its posterior distribution, given $\omega$, belongs to the same family as this prior. This prior is given by a product of Gamma distributions, specifically on the off-diagonal elements $Q(x,y)$, $x \ne y$, of the corresponding rate matrix [3]. We here use a slightly more general joint prior on $Q$ whose “density” is given by

(2)  $f_{(\alpha,\beta)}(Q) \propto \prod_{x \in \mathcal{X}} \prod_{y \in \mathcal{X} \setminus \{x\}} Q(x,y)^{\alpha_{xy} - 1} \, e^{-\beta_{xy} Q(x,y)},$

with shapes $\alpha_{xy}$ and rates $\beta_{xy}$ in $\mathbb{R}_{\ge 0}$; we write $(\alpha, \beta)$ for the joint parameters.
Note that we have only defined the prior to equal a product of Gamma distributions up to normalisation, so that the prior may be improper. This has the advantage that it allows us to close the parameter domains and allow prior hyperparameters $\alpha_{xy} = 0$ and $\beta_{xy} = 0$, for which the Gamma distribution is not properly defined. We acknowledge that the use of such improper priors is not entirely uncontroversial, and that their interpretation as a prior probability (which it indeed is not) leaves something to be desired. We will nevertheless, in this specific setting, be able to motivate their use here as a consequence of Theorem 4.1 further on.

Also, despite being improper, we can of course combine the prior (2) with the likelihood (1) and fix the normalisation in the posterior. As is well known, the means of the marginals of this posterior are then of the form¹

(3)  $\hat{Q}(x,y) = \dfrac{n_{xy} + \alpha_{xy}}{T_x + \beta_{xy}}$  for all $x, y \in \mathcal{X}$ with $x \ne y$.

¹ The assumption $T_x > 0$ prevents division by zero in (3). However, $\alpha_{xy}$ might be zero and, if then also $n_{xy} = 0$, the posterior cannot be normalised and will still be improper. Nevertheless, using an intuitive (but formally cumbersome) argument we can still identify this posterior for $Q(x,y)$ with the (discrete) distribution putting all mass at zero. Alternatively, we can motivate (3) by continuous extension from the cases where $n_{xy} + \alpha_{xy} > 0$, similarly yielding the estimate $\hat{Q}(x,y) = 0$.
Furthermore, the (joint) posterior mean is well known to be a Bayes estimator for $Q$ under quadratic loss and given the prior (2) [2].

The question now remains of how to a priori settle on a “good” choice for these hyperparameters $(\alpha, \beta)$, in the sense that they should adequately represent our prior beliefs. This is a non-trivial problem, and no general solution can be given. A popular (but not uncontroversial) attempt to characterise a non-informative prior consists in choosing the improper prior with $\alpha_{xy} = \beta_{xy} = 0$; the posterior mean (Bayes) estimator then equals the maximum-likelihood estimate $n_{xy}/T_x$.
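As a sketch of the posterior-mean estimator (3), assuming the hyperparameter reconstruction above (off-diagonal means $(n_{xy} + \alpha_{xy})/(T_x + \beta_{xy})$, diagonal fixed by the zero-row-sum constraint), one might compute:

```python
def posterior_mean(n, T, alpha, beta, states):
    """Posterior-mean estimate of the rate matrix for Gamma hyperparameters
    alpha (shapes) and beta (rates), indexed per off-diagonal element."""
    Q = {}
    for x in states:
        Q[x] = {}
        for y in states:
            if y != x:
                Q[x][y] = (n[x].get(y, 0) + alpha[x][y]) / (T[x] + beta[x][y])
        Q[x][x] = -sum(Q[x][y] for y in states if y != x)
    return Q

# Counts and durations read off an example path (made up for illustration).
states = ["a", "b"]
n = {"a": {"b": 2}, "b": {"a": 1}}
T = {"a": 2.5, "b": 1.5}

# The "non-informative" choice alpha = beta = 0 recovers the MLE n_xy / T_x.
zero = {x: {y: 0.0 for y in states} for x in states}
Q0 = posterior_mean(n, T, zero, zero, states)
assert abs(Q0["a"]["b"] - 2 / 2.5) < 1e-12
assert abs(Q0["b"]["a"] - 1 / 1.5) < 1e-12
```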
3.2 An Imprecise Probabilistic Estimator
Generalising the above Bayesian approach, we here suggest an imprecise probabilistic treatment. Following for example [14, 11], this approach consists in using an entire set of prior distributions. Specifically, we consider a set of the form

(4)  $\bigl\{\, f_{(\alpha,\beta)} \,:\, (\alpha, \beta) \in \mathcal{P} \,\bigr\},$

with $f_{(\alpha,\beta)}$ as in (2), and where $\mathcal{P}$ is a set of possible prior parameters. In this way, we do not have to restrict our attention to one specific choice of the parameters $(\alpha, \beta)$; rather, we can include all the parameter settings that we deem reasonable, by collecting them in $\mathcal{P}$. Inference from this model is then performed by pointwise updating each of these priors; we thereby obtain a set of posterior distributions on the space of all rate matrices. Each of these posteriors has a mean of the form (3), which is a Bayes estimator for $Q$ under a specific prior in the set (4). This leads us to consider the imprecise, i.e., set-valued, estimator consisting of all these posterior means.
Note that even in this imprecise probabilistic approach, we still need to somehow specify the (now set-valued) prior model. That is, we need to be specific about the set $\mathcal{P}$. Inspired by the well-known imprecise Dirichlet model [15], we may choose an “imprecision parameter” $s \in \mathbb{R}_{>0}$, which can be interpreted as a number of “pseudo-counts”, to constrain $\sum_{y \in \mathcal{X} \setminus \{x\}} \alpha_{xy} \le s$ for all $x \in \mathcal{X}$, and to then vary all $\beta_{xy}$ over their domain $\mathbb{R}_{\ge 0}$. Unfortunately, similar to what is noted in [11], this leads to undesirable behaviour. For example, as is readily seen from e.g. (3), including unbounded $\beta_{xy}$ allows the off-diagonal elements to get arbitrarily close to zero, causing the model to a posteriori believe that transitions leaving $x$ may be impossible, no matter the number of such transitions that we actually observed in $\omega$! Hence, we prefer a different choice of $\mathcal{P}$.
One way to circumvent this undesired behaviour is to constrain the range within which each $\beta_{xy}$ may be varied, to some interval $[0, \overline{\beta}_{xy}]$, say. The downside is that this introduces a large number of additional hyperparameters; we then need to (“reasonably”) choose a value $\overline{\beta}_{xy}$ for each pair $x \ne y$. Fortunately, our main result – Theorem 4.1 further on – suggests that setting $\overline{\beta}_{xy} = 0$ (and therefore $\beta_{xy} = 0$) is in fact a very reasonable choice. This identification is obtained in the next section, using a limit result of discrete-time estimators, for which the hyperparameter settings follow entirely from first principles.
In summary, we keep the “imprecision parameter” $s \in \mathbb{R}_{>0}$ and the constraint $\sum_{y \in \mathcal{X} \setminus \{x\}} \alpha_{xy} \le s$ for all $x \in \mathcal{X}$, and simply set $\beta_{xy} = 0$ for all $x \ne y$. We then define $\mathcal{P}$ to be the largest set of parameters that satisfies these properties. Every $\alpha$ in this set can be conveniently identified with the off-diagonal elements of a matrix $sP$, with $P$ a transition matrix. Our set-valued estimator can thus be written as

(5)  $\hat{\mathcal{Q}}_s := \Bigl\{\, \hat{Q} \in \mathcal{Q} \,:\, (\exists P \in \mathcal{T}) \;\; \hat{Q}(x,y) = \frac{n_{xy} + s P(x,y)}{T_x} \text{ for all } x, y \in \mathcal{X} \text{ with } x \ne y \,\Bigr\}.$
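Assuming the reconstruction of (5) above, each off-diagonal estimate ranges over the interval $[n_{xy}/T_x, (n_{xy}+s)/T_x]$; within a row these choices are coupled, since the extra mass $sP(x, \cdot)$ allocated over the row cannot exceed $s$ in total. The following sketch computes the element-wise bounds (all names and values are illustrative):

```python
def rate_bounds(n, T, s, states):
    """Element-wise bounds on the off-diagonal rate estimates in the
    set-valued estimator: [n_xy / T_x, (n_xy + s) / T_x]."""
    lower = {x: {y: n[x].get(y, 0) / T[x] for y in states if y != x}
             for x in states}
    upper = {x: {y: (n[x].get(y, 0) + s) / T[x] for y in states if y != x}
             for x in states}
    return lower, upper

n = {"a": {"b": 2}, "b": {"a": 1}}
T = {"a": 2.5, "b": 1.5}

lower, upper = rate_bounds(n, T, s=1.0, states=["a", "b"])
assert abs(lower["a"]["b"] - 0.8) < 1e-12
assert abs(upper["a"]["b"] - 1.2) < 1e-12

# As s shrinks to zero, the interval collapses to the MLE point estimate.
l0, u0 = rate_bounds(n, T, s=0.0, states=["a", "b"])
assert l0["a"]["b"] == u0["a"]["b"]
```

Note that observed data dominates the imprecision: as $n_{xy}$ and $T_x$ grow, the relative width $s/T_x$ of each interval vanishes.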
4 Discrete-Time Estimators and Limit Relations
A useful intuition is that we can consider a CTMC as a limit of DTMCs, where we assign ever shorter durations to the time steps at which the latter operate. In this section, we will use this connection to relate estimators for DTMCs to estimators for CTMCs. We start by discretising the observed path.
Because the realisation was only observed up to some time $t_{\max}$, we can discretise the (finite-duration) realisation $\omega$ into a finite number of steps. For any $m \in \mathbb{N}$, we write $\Delta_m := t_{\max}/m$, and we define the discretised path $\omega^{(m)}$ as $\omega^{(m)}(k) := \omega(k \Delta_m)$ for all $k \in \{0, \ldots, m\}$.

For any $x, y \in \mathcal{X}$ and $m \in \mathbb{N}$, we let $n^{(m)}_{xy}$ denote the number of transitions from state $x$ to $y$ in $\omega^{(m)}$, and we let $n^{(m)}_x := \sum_{y \in \mathcal{X}} n^{(m)}_{xy}$ denote the total number of time steps that started in state $x$.
4.1 Discrete-Time Estimators
For fixed $m \in \mathbb{N}$, we can interpret the discretised path $\omega^{(m)}$ as a finite-duration ($m$ steps long) realisation of a homogeneous discrete-time Markov chain with transition matrix $T^{(m)}$, with $m$ keeping track of the discretisation level. Each transition along the path $\omega^{(m)}$, from state $x$ to $y$, say, is then a realisation of a categorical distribution with parameters $T^{(m)}(x, \cdot)$. The likelihood for $T^{(m)}$, given $\omega^{(m)}$, is therefore proportional to a product of independent multinomial likelihoods. Hence, the maximum likelihood estimator follows straightforwardly and as expected: $\hat{T}^{(m)}(x,y) = n^{(m)}_{xy} / n^{(m)}_x$ for all $x, y \in \mathcal{X}$; see [7] for details.
In a Bayesian analysis, and following e.g. [10], for fixed $m$ we can model our uncertainty about the unknown $T^{(m)}$ by putting independent Dirichlet priors on the rows $T^{(m)}(x, \cdot)$. We write this prior as $\mathrm{Dir}(s, t)$, where $s \in \mathbb{R}_{>0}$ is a “prior strength” parameter, and $t \in \mathcal{T}$ is a prior location parameter. Note that we take $t$ in the interior $\mathrm{int}(\mathcal{T})$ of $\mathcal{T}$ – under the metric topology on $\mathcal{T}$ – so that each row $t(x, \cdot)$ corresponds to a strictly positive probability mass function.

After updating with $\omega^{(m)}$, the posterior mean is an estimator for $T^{(m)}$ that is Bayes under quadratic loss and for the specific prior $\mathrm{Dir}(s, t)$; due to conjugacy, the posterior is again a product of independent Dirichlet distributions [10], whence the elements of the posterior mean are

$\hat{T}^{(m)}_{(s,t)}(x,y) = \dfrac{n^{(m)}_{xy} + s\, t(x,y)}{n^{(m)}_x + s}$  for all $x, y \in \mathcal{X}$.
What remains is again to determine a good choice for $s$ and $t$. However, in an imprecise probabilistic context we do not have to commit to any such choice: the popular Imprecise Dirichlet Model generalises the above approach using a set of Dirichlet priors. This set is given by $\{\mathrm{Dir}(s, t) : t \in \mathrm{int}(\mathcal{T})\}$ and can be motivated from first principles [15, 4]. Observe that only a parameter $s \in \mathbb{R}_{>0}$ remains, which controls the “degree of imprecision”. In particular, we no longer have to commit to a location parameter $t$; instead this parameter is freely varied over its entire domain $\mathrm{int}(\mathcal{T})$.

Element-wise updating with $\omega^{(m)}$ yields a set of posteriors which, due to conjugacy, are again independent products of Dirichlet distributions. The corresponding set of posterior means thus contains estimators for $T^{(m)}$ that are Bayes for a specific prior from the IDM, and is easily verified to be

$\hat{\mathcal{T}}^{(m)}_s := \bigl\{\, \hat{T}^{(m)}_{(s,t)} \,:\, t \in \mathrm{int}(\mathcal{T}) \,\bigr\}.$
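The discretisation and the resulting IDM bounds can be sketched as follows, assuming the reconstruction above: element-wise, the posterior means range over the open interval $(n^{(m)}_{xy}/(n^{(m)}_x + s),\, (n^{(m)}_{xy} + s)/(n^{(m)}_x + s))$ as $t$ varies over $\mathrm{int}(\mathcal{T})$. The path representation and all values are illustrative choices of ours.

```python
def discretise(sojourns, m, t_max):
    """Sample a piecewise-constant (cadlag) path, given as (state, duration)
    sojourns, at the grid times k * t_max / m for k = 0, ..., m."""
    jumps, t = [], 0.0          # (start_time, state) of each sojourn
    for state, d in sojourns:
        jumps.append((t, state))
        t += d
    out = []
    for k in range(m + 1):
        tk = k * t_max / m
        # state at time tk: the last sojourn starting at or before tk
        out.append(max((j for j in jumps if j[0] <= tk),
                       key=lambda j: j[0])[1])
    return out

def idm_bounds(disc_path, s, x, y):
    """Element-wise IDM posterior-mean bounds for the transition x -> y."""
    n_xy = sum(1 for a, b in zip(disc_path, disc_path[1:]) if (a, b) == (x, y))
    n_x = sum(1 for a in disc_path[:-1] if a == x)
    return n_xy / (n_x + s), (n_xy + s) / (n_x + s)

path = [("a", 1.0), ("b", 0.5), ("a", 1.5), ("b", 1.0)]
disc = discretise(path, m=8, t_max=4.0)   # step length 0.5
assert disc == ["a", "a", "b", "a", "a", "a", "b", "b", "b"]

lo, hi = idm_bounds(disc, s=1.0, x="a", y="b")
assert lo < hi   # imprecision: a genuine interval for s > 0
```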
4.2 Limits of Discrete-Time Estimators
As noted in Proposition 2, a rate matrix $Q$ is connected to the transition probabilities in the sense that $e^{\Delta Q} \approx I + \Delta Q$ for small $\Delta$. Hence, for small $\Delta$, we have that $Q \approx (e^{\Delta Q} - I)/\Delta$. This becomes exact in the limit for $\Delta$ going to zero.

This interpretation can also be used to connect discrete-time estimators for $T^{(m)}$ to estimators for $Q$. For example, if we let $\hat{Q}^{(m)} := (\hat{T}^{(m)} - I)/\Delta_m$, then $\lim_{m \to \infty} \hat{Q}^{(m)} = \hat{Q}$, with $\hat{Q}$ the maximum-likelihood estimator from Section 3.1. Similarly, we can connect our set-valued estimators for the discretised models to the set-valued continuous-time estimator in (5):
Theorem 4.1
For all $m \in \mathbb{N}$, let $\hat{\mathcal{Q}}^{(m)}_s := \bigl\{ (\hat{T} - I)/\Delta_m : \hat{T} \in \hat{\mathcal{T}}^{(m)}_s \bigr\}$. Then the Painlevé–Kuratowski [12] limit $\lim_{m \to \infty} \hat{\mathcal{Q}}^{(m)}_s$ exists, and equals $\hat{\mathcal{Q}}_s$.
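The limit in Theorem 4.1 can be checked numerically, under the reconstructions above and for a made-up path: rescaling the element-wise IDM bounds by the step length and letting the discretisation level grow, the bounds approach the continuous-time bounds $n_{xy}/T_x$ and $(n_{xy}+s)/T_x$ from (5).

```python
def state_at(sojourns, t):
    """State of a piecewise-constant (cadlag) path at time t."""
    elapsed = 0.0
    for state, d in sojourns:
        elapsed += d
        if t < elapsed:
            return state
    return sojourns[-1][0]

def rescaled_idm_bounds(sojourns, t_max, m, s, x, y):
    """Element-wise IDM bounds at level m, divided by the step length."""
    delta = t_max / m
    disc = [state_at(sojourns, k * delta) for k in range(m + 1)]
    n_xy = sum(1 for a, b in zip(disc, disc[1:]) if (a, b) == (x, y))
    n_x = sum(1 for a in disc[:-1] if a == x)
    return n_xy / ((n_x + s) * delta), (n_xy + s) / ((n_x + s) * delta)

path = [("a", 1.0), ("b", 0.5), ("a", 1.5), ("b", 1.0)]
s, t_max = 1.0, 4.0
n_ab, T_a = 2, 2.5                        # counts/duration read off the path
target = (n_ab / T_a, (n_ab + s) / T_a)   # continuous-time bounds from (5)

lo, hi = rescaled_idm_bounds(path, t_max, m=2 ** 16, s=s, x="a", y="b")
assert abs(lo - target[0]) < 1e-3
assert abs(hi - target[1]) < 1e-3
```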
5 Discussion
We have derived a set-valued estimator for the transition rate matrix of a homogeneous CTMC. It can be motivated both as a set of posterior means of a set of Bayesian models in continuous time, and as a limit of set-valued discrete-time estimators based on the Imprecise Dirichlet Model. The only parameter of the estimator is a scalar $s$ that controls the degree of imprecision. In the special case where $s = 0$ there is no imprecision, and then the estimator reduces to the maximum-likelihood estimate $\hat{Q}$.
The set-valued representation is convenient when one is interested in the numerical values of the transition rates, e.g. for domain analysis. If one aims to use the estimator to describe an imprecise CTMC [13, 8], a representation using the lower transition rate operator $\underline{Q}$ is more convenient. This operator is the lower envelope of a set of rate matrices; for $\hat{\mathcal{Q}}_s$ it is given, for all $f : \mathcal{X} \to \mathbb{R}$, by

$[\underline{Q} f](x) = \sum_{y \in \mathcal{X} \setminus \{x\}} \frac{n_{xy}}{T_x} \bigl(f(y) - f(x)\bigr) + \frac{s}{T_x} \min_{y \in \mathcal{X}} \bigl(f(y) - f(x)\bigr)$

for all $x \in \mathcal{X}$. Hence, $\underline{Q}$ is straightforward to evaluate. This implies that when our estimator is used to learn an imprecise CTMC from data, the lower expectations of this imprecise CTMC can be computed efficiently [6].
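Evaluating the lower transition rate operator can be sketched as follows, assuming the reconstructed formula above: the observed counts contribute a fixed drift term, and the free prior mass $s$ is placed on the least favourable destination (the minimum over all $y$, where $y = x$ contributes $0$). All names and values are illustrative.

```python
def lower_rate_operator(n, T, s, states, f):
    """Evaluate [Qf](x) for the lower envelope of the estimated set:
    drift from observed counts plus the worst-case allocation of mass s."""
    result = {}
    for x in states:
        drift = sum(n[x].get(y, 0) / T[x] * (f[y] - f[x])
                    for y in states if y != x)
        worst = min(f[y] - f[x] for y in states)  # y = x contributes 0
        result[x] = drift + (s / T[x]) * worst
    return result

states = ["a", "b"]
n = {"a": {"b": 2}, "b": {"a": 1}}
T = {"a": 2.5, "b": 1.5}
f = {"a": 0.0, "b": 1.0}

Qf = lower_rate_operator(n, T, s=1.0, states=states, f=f)
# From state a, f can only increase, so the prior mass stays put (min is 0):
assert abs(Qf["a"] - (2 / 2.5) * 1.0) < 1e-12
# From state b, the prior mass is put on the decrease towards a:
assert abs(Qf["b"] - ((1 / 1.5) * (-1.0) + (1.0 / 1.5) * (-1.0))) < 1e-12
```

The cost of one evaluation is linear in the number of off-diagonal elements of the row, which is what makes the envelope cheap to use inside imprecise-CTMC computations.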
Acknowledgements
The work in this paper was partially supported by H2020-MSCA-ITN-2016 UTOPIAE, grant agreement 722734. The authors wish to thank two anonymous reviewers for their helpful comments and suggestions.
References
 [1] Augustin, T., Coolen, F.P.A., De Cooman, G., Troffaes, M.C.M. (eds.): Introduction to Imprecise Probabilities. John Wiley & Sons (2014)
 [2] Berger, J.O.: Statistical decision theory and Bayesian analysis. Springer (1985)
 [3] Bladt, M., Sørensen, M.: Statistical inference for discretely observed Markov jump processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(3), 395–410 (2005)
 [4] De Cooman, G., De Bock, J., Diniz, M.A.: Coherent predictive inference under exchangeability with imprecise probabilities. Journal of Artificial Intelligence Research 52, 1–95 (2015)
 [5] De Bock, J.: The limit behaviour of imprecise continuous-time Markov chains. Journal of Nonlinear Science 27(1), 159–196 (2017)
 [6] Erreygers, A., De Bock, J.: Imprecise continuous-time Markov chains: Efficient computational methods with guaranteed error bounds. In: Proceedings of ISIPTA 2017. pp. 145–156 (2017)
 [7] Inamura, Y.: Estimating continuous time transition matrices from discretely observed data. Bank of Japan (2006)
 [8] Krak, T., De Bock, J., Siebes, A.: Imprecise continuous-time Markov chains. International Journal of Approximate Reasoning 88, 452–528 (2017)
 [9] Norris, J.R.: Markov chains. Cambridge University Press (1998)
 [10] Quaeghebeur, E.: Learning from samples using coherent lower previsions. Ph.D. thesis, Ghent University (2009)
 [11] Quaeghebeur, E., De Cooman, G.: Imprecise probability models for inference in exponential families. In: Proceedings of ISIPTA 2005 (2005)
 [12] Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer (1997)
 [13] Škulj, D.: Efficient computation of the bounds of continuous time imprecise Markov chains. Applied Mathematics and Computation 250(C), 165–180 (2015)
 [14] Walley, P.: Statistical reasoning with imprecise probabilities. Chapman and Hall, London (1991)
 [15] Walley, P.: Inferences from multinomial data: learning about a bag of marbles. Journal of the Royal Statistical Society, Series B 58, 3–57 (1996)
 [16] Whitt, W.: Stochastic-process limits: an introduction to stochastic-process limits and their application to queues. Springer Science & Business Media (2002)
Appendix 0.A Proofs of Main Results
In this appendix, we will assume that the paths are functions in continuous time, that is, that $\mathbb{T} = \mathbb{R}_{\ge 0}$. As stated in Section 2, we then assume all these paths to be càdlàg under the discrete topology on $\mathcal{X}$.

Therefore, and because the realisation is only observed up to time $t_{\max}$, there are only a finite number of state transitions in $\omega$ and, furthermore, each distinct visit lasts for a strictly positive (but finite) duration; the lemma below makes this formal. The result is essentially well-known, but we had some trouble finding a satisfactory reference for the elementary case where $\omega$ takes at most finitely many values; we therefore prove it as a (somewhat trivial) specialisation of [16, Theorem 12.2.1].
Lemma 1
Let $\omega$ be càdlàg. Then there is a finite collection of time points $t_0, \ldots, t_n$, $n \in \mathbb{N}$, with $t_0 = 0$, $t_n = t_{\max}$, and $t_{i-1} < t_i$ if $1 \le i \le n$, such that $\omega$ is constant on $[t_{i-1}, t_i)$ for all $1 \le i \le n$, and $\omega(t_{i-1}) \ne \omega(t_i)$ for all $0 < i < n$.
Proof
By assumption $\omega$ is càdlàg on $[0, t_{\max}]$ under the discrete topology on $\mathcal{X}$. Since $\mathcal{X}$ is finite, we can identify it without loss of generality with the set $\{1, \ldots, N\} \subset \mathbb{R}$, with $N$ the number of states. Now let $f$ be defined as $f(t) := \omega(t)$ for all $t \in [0, t_{\max}]$, so that $f$ takes values in $\mathbb{R}$. Then $f$ is simply $\omega$ with its codomain replaced by $\mathbb{R}$; $f$ is therefore by construction càdlàg under the discrete topology on $\mathbb{R}$.

For any $c \in \mathbb{R}$, any open neighbourhood of $c$ in the Euclidean topology contains the set $\{c\}$, which is an open neighbourhood of $c$ in the discrete topology. It follows that any sequence in $\mathbb{R}$ that is convergent with limit $c$ in the discrete topology, is also convergent with limit $c$ in the Euclidean topology; the sequence is eventually in $\{c\}$, and hence eventually in any open neighbourhood of $c$ in the Euclidean topology. Therefore the left-sided limits and right-continuity of $f$ under the discrete topology hold identically under the Euclidean topology; so $f$ is also càdlàg under the Euclidean topology on $\mathbb{R}$.

By [16, Theorem 12.2.1], $f$ has at most a finite number of discontinuities, under the Euclidean norm on $\mathbb{R}$, on the interval $[0, t_{\max}]$. Denote these points of discontinuity as $t_1, \ldots, t_k$, and assume without loss of generality that $t_{i-1} < t_i$ for all $i$ for which both are defined. We next include the endpoints of the interval. Note first that $t_1 > 0$, because time $0$ cannot be a point of discontinuity due to the càdlàg property; we can therefore introduce $t_0 := 0$. For the endpoint $t_{\max}$ we need to consider two cases, because there is possibly a discontinuity there. If there is no such discontinuity, then $t_k < t_{\max}$, whence we introduce $n := k + 1$ and set $t_n := t_{\max}$; otherwise we simply let $n := k$. We will now verify the properties in the lemma’s statement.

Clearly, by construction, we have that $t_0 = 0$ and $t_n = t_{\max}$, and that $t_{i-1} < t_i$ if $1 \le i \le n$. Now fix any $1 \le i \le n$; we know that $f$ has no discontinuities on the interval $(t_{i-1}, t_i)$ under the Euclidean norm on $\mathbb{R}$, and therefore, since $f$ is right-continuous at $t_{i-1}$, it has no discontinuities on $[t_{i-1}, t_i)$ either. Since $f$ only takes values in $\{1, \ldots, N\}$, it follows that $f$ must be constant on $[t_{i-1}, t_i)$. This implies that also $\omega$ is constant on $[t_{i-1}, t_i)$.

Finally, choose any $0 < i < n$. There is then a discontinuity in $f$ at $t_i$. Since, as we have just shown, $f$ is constant on both $[t_{i-1}, t_i)$ and $[t_i, t_{i+1})$, this implies that $f(t_{i-1}) \ne f(t_i)$, and therefore also $\omega(t_{i-1}) \ne \omega(t_i)$.∎
In other words, for $0 < i < n$, the time points $t_i$ are the distinct time points on which state changes occurred in $\omega$, and the intervals $[t_{i-1}, t_i)$ are time intervals during which the process remained in the same state it had at time $t_{i-1}$. The boundaries $t_0 = 0$ and $t_n = t_{\max}$ constitute special cases that are included for use in the proof of the next lemma. Notably, there is never a state change at $t_0$, and there might, but need not be, a state change at time $t_n$.
The above guarantees that the properties of the discretised realisations $\omega^{(m)}$ converge to the properties of the original $\omega$. Unfortunately, a straightforward statement of the results that we need is (as before, apparently) so elementary that we are unable to find a satisfactory reference. We therefore provide an explicit proof below.
Lemma 2
For any càdlàg $\omega$ and all $x, y \in \mathcal{X}$ with $x \ne y$:
(i) $\lim_{m \to \infty} n^{(m)}_{xy} = n_{xy}$;
(ii) $\lim_{m \to \infty} n^{(m)}_x \Delta_m = T_x$.
Proof
Let $t_0, \ldots, t_n$ be the finite set of time points whose existence is guaranteed by Lemma 1. Let $\delta := \min\{t_i - t_{i-1} : 1 \le i \le n\}$ be the minimum distance between these time points.

We start by proving Property (i). Fix $x, y \in \mathcal{X}$ such that $x \ne y$. Because $\lim_{m \to \infty} \Delta_m = 0$, there is some $m_0 \in \mathbb{N}$ such that, for all $m \ge m_0$, $\Delta_m < \delta$. Fix any such $m$.
Recall that $n^{(m)}_{xy}$ is the number of transitions from $x$ to $y$ in $\omega^{(m)}$. Thus,

$n^{(m)}_{xy} = \bigl|\{ k \in \{1, \ldots, m\} : \omega^{(m)}(k-1) = x, \; \omega^{(m)}(k) = y \}\bigr|.$

Let $I_{xy}$ consist of the indices of the time points at which the actual switches from state $x$ to $y$ occurred in $\omega$; so, let

$I_{xy} := \{ i \in \{1, \ldots, n-1\} : \omega(t_{i-1}) = x, \; \omega(t_i) = y \}.$

Then, clearly, $|I_{xy}| = n_{xy}$ is the true number of transitions from $x$ to $y$.
Choose any $i \in I_{xy}$. Clearly, since $\Delta_m < \delta$, there is a unique $k \in \{1, \ldots, m\}$ such that

$(k-1)\Delta_m < t_i \le k\Delta_m.$

Then $\omega^{(m)}(k-1) = x$ and $\omega^{(m)}(k) = y$ because $(k-1)\Delta_m \in [t_{i-1}, t_i)$ and $k\Delta_m \in [t_i, t_{i+1})$. Therefore $\omega^{(m)}$ contains a transition from $x$ to $y$ at step $k$. Because this holds for all $i \in I_{xy}$, and because each $i \in I_{xy}$ has a unique such $k$, this implies that $n^{(m)}_{xy} \ge n_{xy}$.

Conversely, it trivially holds that $n^{(m)}_{xy} \le n_{xy}$ because the discretisation cannot introduce more state switches. Therefore we must have that $n^{(m)}_{xy} = n_{xy}$. Because this holds for all $m \ge m_0$, it holds that $\lim_{m \to \infty} n^{(m)}_{xy} = n_{xy}$, which concludes the proof for Property (i).
We next prove Property (ii); choose any $x \in \mathcal{X}$, and recall that

(6)  $T_x = \int_0^{t_{\max}} \mathbb{I}_x(\omega(t)) \,\mathrm{d}t,$

where $\mathbb{I}_x$ is the indicator of $x$. By Lemma 1, $\omega$ has only finitely many discontinuities on the interval $[0, t_{\max}]$. Therefore, the composite function $\mathbb{I}_x \circ \omega$ also has only finitely many discontinuities on this interval. It follows that the integral in (6) can be interpreted in the Riemann sense.

Now consider any $m \in \mathbb{N}$. Then we have

$n^{(m)}_x \Delta_m = \sum_{k=0}^{m-1} \mathbb{I}_x\bigl(\omega(k\Delta_m)\bigr)\,\Delta_m,$

which we see is a Riemann sum whose limit defines the integral in (6); we immediately conclude that $\lim_{m \to \infty} n^{(m)}_x \Delta_m = T_x$, as claimed. ∎
Proof of Theorem 4.1. We need to show that $\lim_{m \to \infty} \hat{\mathcal{Q}}^{(m)}_s$ exists in the Painlevé–Kuratowski sense [12], and that it is equal to $\hat{\mathcal{Q}}_s$. This requires us to consider the inner limit of $\hat{\mathcal{Q}}^{(m)}_s$ – the set of limit points of sequences $(\hat{Q}^{(m)})_{m \in \mathbb{N}}$, with $\hat{Q}^{(m)} \in \hat{\mathcal{Q}}^{(m)}_s$ for all $m \in \mathbb{N}$ – and the outer limit – the set of all accumulation points of such sequences – and to show that they are equal to each other and to $\hat{\mathcal{Q}}_s$. We start by considering the inner limit.
Fix any $\hat{Q} \in \hat{\mathcal{Q}}_s$. It then follows from (5) that $\hat{Q} \in \mathcal{Q}$ and that there is some $P \in \mathcal{T}$ such that

(7)  $\hat{Q}(x,y) = \dfrac{n_{xy} + s P(x,y)}{T_x}$  for all $x, y \in \mathcal{X}$ with $x \ne y$.

Since $P \in \mathcal{T}$, we know that there must be a sequence $(t^{(m)})_{m \in \mathbb{N}}$ in $\mathrm{int}(\mathcal{T})$ such that $\lim_{m \to \infty} t^{(m)} = P$. Consider any such sequence.

For all $m \in \mathbb{N}$, we now let $\hat{T}^{(m)}_{(s,t^{(m)})}$ be the element of $\hat{\mathcal{T}}^{(m)}_s$ that corresponds to $t^{(m)}$, and we let $\hat{Q}^{(m)}$ be the corresponding element of $\hat{\mathcal{Q}}^{(m)}_s$. Consider now any $x, y \in \mathcal{X}$ such that $x \ne y$. For all $m \in \mathbb{N}$, we then find that

(8)  $\hat{Q}^{(m)}(x,y) = \dfrac{n^{(m)}_{xy} + s\, t^{(m)}(x,y)}{(n^{(m)}_x + s)\,\Delta_m},$

because $x \ne y$ implies $I(x,y) = 0$. Furthermore, we also know that

$\lim_{m \to \infty} \bigl( n^{(m)}_{xy} + s\, t^{(m)}(x,y) \bigr) = n_{xy} + s P(x,y)$  and  $\lim_{m \to \infty} (n^{(m)}_x + s)\,\Delta_m = T_x,$

making use of Lemma 2 for the equalities that involve $n^{(m)}_{xy}$ and $n^{(m)}_x \Delta_m$. Therefore, the numerator and denominator converge separately, and we find that

$\lim_{m \to \infty} \hat{Q}^{(m)}(x,y) = \dfrac{n_{xy} + s P(x,y)}{T_x} = \hat{Q}(x,y).$
It remains to look at the diagonal elements. Fix any $x \in \mathcal{X}$. Then

$\lim_{m \to \infty} \hat{Q}^{(m)}(x,x) = \lim_{m \to \infty} \Bigl( -\sum_{y \ne x} \hat{Q}^{(m)}(x,y) \Bigr) = -\sum_{y \ne x} \lim_{m \to \infty} \hat{Q}^{(m)}(x,y) = -\sum_{y \ne x} \hat{Q}(x,y) = \hat{Q}(x,x),$

where the first equality follows from the fact that each $\hat{Q}^{(m)}$ is a transition rate matrix, the third equality follows from our earlier result that $\hat{Q}^{(m)}(x,y)$ converges to $\hat{Q}(x,y)$, and the last equality follows because $\hat{Q}$ is a transition rate matrix. We conclude that $\lim_{m \to \infty} \hat{Q}^{(m)} = \hat{Q}$. Since $\hat{Q}^{(m)} \in \hat{\mathcal{Q}}^{(m)}_s$ for all $m \in \mathbb{N}$, this implies that $\hat{Q}$ belongs to the inner limit of $\hat{\mathcal{Q}}^{(m)}_s$. Because this holds for all $\hat{Q} \in \hat{\mathcal{Q}}_s$, we conclude that

(9)  $\hat{\mathcal{Q}}_s \subseteq \liminf_{m \to \infty} \hat{\mathcal{Q}}^{(m)}_s.$
Next, consider any element $\hat{Q}^*$ of the outer limit. By definition, there is then a sequence $(\hat{Q}^{(m)})_{m \in \mathbb{N}}$, with $\hat{Q}^{(m)} \in \hat{\mathcal{Q}}^{(m)}_s$ for all $m \in \mathbb{N}$, and a subsequence $(\hat{Q}^{(m_k)})_{k \in \mathbb{N}}$, such that $\lim_{k \to \infty} \hat{Q}^{(m_k)} = \hat{Q}^*$.

For every $k \in \mathbb{N}$, since $\hat{Q}^{(m_k)} \in \hat{\mathcal{Q}}^{(m_k)}_s$, we know that there is some $t^{(m_k)} \in \mathrm{int}(\mathcal{T})$ that satisfies (8) for all $x, y \in \mathcal{X}$ such that $x \ne y$. Furthermore, because $t^{(m_k)} \in \mathcal{T}$ and $\mathcal{T}$ is compact, it follows from the Bolzano–Weierstrass theorem that the sequence $(t^{(m_k)})_{k \in \mathbb{N}}$ has a convergent subsequence whose limit belongs to $\mathcal{T}$. Hence, without loss of generality, we can assume that $\lim_{k \to \infty} t^{(m_k)} = P$, with $P \in \mathcal{T}$.

Using completely analogous argumentation as that in the first part of this proof, it now follows that $\hat{Q}^*$ satisfies (7). It follows that the off-diagonal elements of $\hat{Q}^*$ are real-valued and, similar to what we found above, that the rows of $\hat{Q}^*$ sum to zero; hence, the diagonal elements are real-valued as well. Therefore $\hat{Q}^*$ is a transition rate matrix and, since it satisfies (7), this implies that $\hat{Q}^* \in \hat{\mathcal{Q}}_s$. Since $\hat{Q}^*$ was arbitrary, we conclude that the outer limit is included in $\hat{\mathcal{Q}}_s$. Since the inner limit is trivially included in the outer one, the result now follows from (9).∎