# Simple, fast and accurate evaluation of the action of the exponential of a rate matrix on a probability vector

Given a time-homogeneous, finite-statespace Markov chain with a rate matrix, or infinitesimal generator of Q, an initial distribution of ν and a time interval of t, the distribution across states at time t, ν^ (t):=ν^ (Q t) is typically required for one or many time intervals, t, either as a contribution to the likelihood to enable statistical inference, or to understand the evolution of the process itself. When, as in many statistical applications in, for example, epidemics, systems biology and ecology, Q arises from a reaction network, or when Q corresponds to a negative graph Laplacian, it is usually sparse. Building on a relatively recent development for the action of the exponential of a general sparse matrix on a vector, we use the special properties of rate matrices to create the Single Positive Series method, which accurately and quickly evaluates the distribution across states at a single time point in O((r+1)ρ d) operations, where d is the dimension of the statespace, ρ=_i=1,...,d|Q_ii| and r is the average number of positive entries in the rows of Q. We also present the Multiple-Use Single Positive Series algorithm which simultaneously evaluates the distribution vector at k different time points in O((k+r)ρ d) operations. We demonstrate across a range of examples that the Single Positive Series method is both more accurate and more than an order of magnitude more efficient than the nearest generic and generally available competitor, and that the Multiple-Use Single Positive Series algorithm improves upon this still further.

## Authors

• 14 publications
• ### Convergence of Contrastive Divergence Algorithm in Exponential Family

This paper studies the convergence properties of contrastive divergence ...
03/17/2016 ∙ by Tung-yu Wu, et al. ∙ 0

• ### Efficient construction of an HSS preconditioner for symmetric positive definite ℋ^2 matrices

In an iterative approach for solving linear systems with ill-conditioned...
11/15/2020 ∙ by Xin Xing, et al. ∙ 0

• ### A Monte Carlo method for computing the action of a matrix exponential on a vector

A Monte Carlo method for computing the action of a matrix exponential fo...
04/29/2019 ∙ by Juan A. Acebron, et al. ∙ 0

• ### Variable and Fixed Interval Exponential Smoothing

Exponential smoothers are a simple and memory efficient way to compute r...
02/11/2015 ∙ by Javier R. Movellan, et al. ∙ 0

• ### On the Computational Complexity of Linear Discrepancy

Many problems in computer science and applied mathematics require roundi...
07/31/2020 ∙ by Lily Li, et al. ∙ 0

• ### Estimating entropy rate from censored symbolic time series: a test for time-irreversibility

In this work we introduce a method for estimating entropy rate and entro...
09/23/2020 ∙ by Raul Salgado-Garcia, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

For many applications, statistical methods require the exponentiation of large, sparse, matrices. Such calculations are needed, for example, to evaluate the likelihood for continuous-time Markov models

(e.g. Andersson and Britton, 2000) or to evaluate certain metrics for networks (e.g. Hammond et al., 2013). Standard algorithms for evaluating the matrix exponential (such as expm from the Matrix package in R) scale poorly with dimension, . However, in many applications interest lies only in the action of the matrix exponential on a vector; furthermore, when the matrix to be exponentiated is sparse, the calculation can be performed in operations. Recently, Al-Mohy and Higham (2011) introduced an algorithm which in the vast majority of examples considered was both faster and more accurate than its competitors, but this has largely been ignored in the statistics literature. Here we present further developments of the algorithm of Al-Mohy and Higham (2011) which take advantage of the special structure of matrix exponential calculations that are common in statistics. Our key algorithm is more straightforward to implement than that of Al-Mohy and Higham (2011), yet faster, and at least as accurate, as well as being an order of magnitude faster than its nearest widely-available competitor, making accurate likelihood-based inference feasible and straightforward for systems with statespace sizes in the tens of thousands.

The exponential of a square matrix, is defined via its infinite series: . As might be anticipated from the definition, algorithms for evaluating take operations see Moler and Van Loan (2003) for a review of many such methods. However, for a -vector, , the product is the solution to the initial value problem , , and is the key component of the solution to more complex differential equations such as . For this reason the numerical evaluation of the action of a matrix exponential on a vector has received considerable attention of itself (e.g. Gallopoulos and Saad, 1992; Saad, 1992; Sidje, 1998; Al-Mohy and Higham, 2011).

When is dense, evaluating

 eMv=∞∑i=01i!Miv (1)

takes operations. However, motivated by the examples we detail in Section 1.1 our interest lies in large sparse matrices, and the number of operations then reduces to , where is the average number of entries in each row of . In this case, as exemplified in the popular Expokit FORTRAN routines Sidje (1998)

, it is usual to estimate

via its projection on to the Krylov subspace , where .

With double-precision arithmetic, real numbers are stored to an accuracy of approximately . Thus, when a scalar, , is very large and negative, naive numerical evaluation of the series for can be innaccurate due to the cancellation of very large positive and negative terms. Evaluation of is, typically, similarly affected, and to control these errors, algorithms for evaluating (e.g. Sidje, 1998) use the identity

 eMv=(eM/s)sv=eM/seM/s…eM/sv, (2)

where the integer is chosen so that is small enough (for a suitable choice of norm) that cancellation of positive and negative terms which are ‘too large’ does not occur. The calculation on the right of (2) typically involves many more numerical operations than the direct calculation on the left and so should be the smallest integer that mitigates against the cancellation of large positive and negative terms, so as to minimise the accumulation of rounding errors.

The relatively recent, ground-breaking paper Al-Mohy and Higham (2011) presents a new algorithm for evaluating the action of the exponential of a sparse matrix on a vector. Rather than reducing the dimension using Krylov subspaces and then applying standard techniques, it uses (2) directly, together with straightforward truncation of the series (1) for (first with , then with etc.). The three key contributions which make this algorithm typically both faster and more accurate than that of Sidje (1998) are detailed in Section 2.1; however, to motivate our own contribution, we briefly note two of these now. Firstly, a detailed error analysis allows for the choice of sensible values for and the truncation point of each series, . The process by which these values are chosen is, however, relatively complex, requiring a look-up table and other bespoke code from Al-Mohy and Higham (2009), and may relate to the reason why the only implementation we were able to find was the original matlab code from the authors. Secondly, the identity

 eMv=eaeM−aIdv (3)

for any scalar, , is used to reduce the Frobenius norm of by setting its trace to .

Our interest lies specifically in the left action of the exponential of a scalar multiple of the rate matrix (or infinitesimal generator), , of a Markov chain, on a probability vector, :

 ν(t)⊤=ν⊤eQt=∞∑i=0tii!ν⊤Qi. (4)

Our key observation is that, since only the diagonal elements of are non-positive, it is possible to shift using (3) so that no elements are negative. Since probability vectors are also non-negative, all elements of all vectors in the series in (4) then become non-negative. Hence there is no cancellation of large positive and negative terms and no need to scale by a factor . This Single Positive Series algorithm (SPS) reduces the number of matrix-vector operations compared to that in the algorithm of Al-Mohy and Higham (2011), typically by a factor of , concomittantly reducing the opportunities for rounding errors and speeding up the evaluation. The SPS algorithm is straightforward to implement, since

is simply a high quantile of the Poisson distribution; we do, however, provide a method for evaluating this quantile which is typically faster than standard methods. Our second contribution lies in the evaluation of (

4) for multiple time points, , for example to chart the evolution of the Markov chain. These distributions are typically evaluated sequentially, since . Using the fact that, in our case, , we provide the Multiple-Use Single Positive Series algorithm (MUSPS) which evaluates all at once from a single evaluation of the series, rather than sequentially.

With our algorithm, or that of Al-Mohy and Higham (2011), for a Markov chain with thousands of states, evaluation of for values of that see a notable change in

takes only a small fraction of a second, enabling direct maximum likelihood estimation of the rate parameters when the network is observed precisely at a discrete set of time points, and making Bayesian inference (via latent variables for the states at observation times) feasible for partially observed networks. As well as being faster than the algorithm of

Al-Mohy and Higham (2011), the SPS algorithm is short, self-contained and straightforward to code should our C++ implementation not fit with the user’s preferred platform. For common epidemic models in particular, the alternative statespace formulation of Ho et al. (2018) (see Section 2.2) makes Bayesian inference feasible for relatively large populations; for example, Ho et al. (2018) perform inference for the SIR model using data from the Ebola outbreak in regions of Guinea. This reduced statespace is usable, and indeed improvable, by the SPS algorithm. In systems biology and elsewhere, reaction networks often have a countably infinite statespace. Georgoulas et al. (2017) provides a method for dealing with such cases using matrix exponentiation by applying the random truncation algorithm of Glynn and Rhee (2014) to the statespace itself; this method would also benefit from using the Single Positive Series algorithm, rather than the algorithm of Al-Mohy and Higham (2011).

In the remainder of this article, Section 1.1 provides several motivating examples, which will form the basis of the numerical comparisons of SPS and MUSPS against competitors in Section 4. Section 2 provides further background on the algorithm of Al-Mohy and Higham (2011) and aspects of the algorithm of Ho et al. (2018), whilst the SPS and MUSPS algorithms themselves are detailed in Section 3.

### 1.1 Examples and motivation

Both by way of motivation and because we shall use them later to illustrate our method, we now present three examples of continuous-time Markov processes, and one example motivated by networks, where a finite, sparse rate matrix contains all of the information about the dynamics.

For each Markov process, the set of possible states can be placed in one-to-one correspondance with a subset of the non-negative integers . The off-diagonal elements of the rate matrix, , are all non-negative, and the th diagonal element is . A chain that is currently in state leaves this state upon the first event of a Poisson process with a rate of ; the state to which it transitions is with a probability of . Whilst the rate matrix, , is a natural description of the process, the likelihood for typical observation regimes involves the transition matrix, , the th element of which is exactly .

Given a network, or graph with vertices (or nodes), the adjacency matrix, is the matrix with and, for , if nodes and are connected then is some positive weight representing the strength of the connection, else . The graph Laplacian is , where is the diagonal matrix with . The negative Laplacian, , has the same properties as a rate matrix and, as we shall see in Example 4, this connection can be related to certain key properties of a graph.

###### Example 1.

The SIR model for epidemics. The SIR model for a disease epidemic has ‘species’: those who are susceptible to the epidemic, , those both infected and infectious, , and those who have recovered from the epidemic and play no further part in the dynamics, . The non-negative counts of each species are denoted by , , and . For relatively short epidemics the population, , is assumed to be fixed, and so the state of the Markov chain, represented by , is subject to the additional constraint of . The two possible reactions and their associated rates are:

 S+IβSI⟶2I,   and   IγI⟶R.
###### Example 2.

The SEIRS model for epidemics. The SEIRS epidemic model starts from the SIR model and inserts an additional state of infected but not yet infections, (‘exposed’). The number of these individuals is denoted by and the state is subject to the constraint that . Removed subjects also gradually lose their immunity so that the reactions are:

 S+IβSI⟶E+I,   EδE⟶I,   IγI⟶R   and   RηI⟶S.
###### Example 3.

The Moran model for allele frequency descibes the time evolution of the frequency of two alleles, and in a population with a fixed size of . Individuals with allele reproduce at a rate of , and those with reproduce at a rate of . When an individual dies it is replaced by the offspring of a parent chosen uniformly at random from the whole population (including the individual that dies). The allele that the parent passes to the offspring usually matches its own, however as it is passed down an allele may mutate; allele switching to with a probability of , and switching to with a probability of . Let be the number of individuals with allele . The two reactions are

 A1λN⟶A2   and   A2μN⟶A1.

Setting , the corresponding infinitesimal rates are

 λN=(1−fN)[αfN(1−u)+β(1−fN)v]   and   μN=fN[β(1−fN)(1−v)+αfNu],

where the unit of time is the expectation of the exponentially distributed time for an individual to die and be replaced. ∎

Our fourth example provides a characterisation of the discrepancy between two graphs in terms of the transmission of information. Given two graphs, and

with the same vertices but different edges (e.g. representing different observations of the connections, possible imputations given incomplete information, or observations of two different types of connections), and Laplacians

and , Hammond et al. (2013) defines the graph diffusion distance between and as:

 ξ(G1,G2;t):=||exp(−L1t)−exp(−L2t)||fr,

where is the Frobenius norm. The idea is that two graphs might be deemed similar if they enable similar patterns of information transmission. The negative Laplacian is a rate matrix, and the th row of represents the distribution over the network after time of a unit of information that was placed on node at time . The Frobenius norm, therefore, represents an average over all possible initial starting nodes of the discrepancies in the time- distributions.

Calculating and storing the full matrix exponential for a set of large graphs is prohibitive both in computational time and storage space; moreover, some initial conditions will be of much more interest than others.

###### Example 4.

Random graphs Given a particular, shared initial distribution across nodes,

, the evolution of the discrepancy between the probability distributions can provide a useful measure of the difference in the transmission properties of the graphs. We have:

 dν(G1,G2;t):=||νTexp(−L1t)−νTexp(−L2t)||1. (5)

Other examples of finite reaction networks include the dimerisation reactions and the Michaelis-Menten reaction kinetics (e.g. Wilkinson, 2012) and other epidemic models such as the SEIR and SIRS models (e.g. Andersson and Britton, 2000).

## 2 Further Background

Redefining , we require , where and are probability -vectors and is a rate matrix for a reaction network. We now detail the algorithm of Al-Mohy and Higham (2011) and summarise the key numerical results from that article. We then briefly describe the alternative statespace formulation for epidemic models, from Ho et al. (2018).

### 2.1 The algorithm of Al-Mohy and Higham

As stated in the introduction the algorithm presented in Al-Mohy and Higham (2011), which we, henceforth refer to as AMH, uses simple series truncation to evaluate using the decomposition in (2). AMH benefits from three key contributions. Firstly, an error analysis allows for sensible choices of and the value, , at which the Taylor series should be truncated; more detail on this is given below. Secondly, the identity (3) is used to reduce the Frobenius norm of by setting its trace to . Finally, the evaluation of each series for is truncated early, at the th term, if the sum of the sizes of the two most recent terms, (where ), is less than the product of some tolerance and the size of the partial sum, . This leads to a problem-specific speed up which can be substantial.

When is required for multiple time points, , the natural procedure is to calculate and then etc. However, when the set of time points is dense this can lead to a large accumulation of rounding errors. AMH reduces the build up of rounding errors by using, for example, whenever there would be no additional computational cost over using .

The complexity of the AMH algorithm, and perhaps the reason why we could only find it in the authors’ original matlab code, lies in the calculation of appropriate values for and , which, in particular, requires further, bespoke matlab code (by the same authors) for approximating the induced norm of a matrix (and its powers) using one or several matrix-vector multiplications for each power. With the hyper parameters for the algorithm given by the authors and hard-coded into the matlab code, except when the approximation of is found to be sufficiently small with ‘sufficient’ depending on the error tolerance chosen (double or single), is required for each between and , and each of these is approximated using a single matrix-vector multiplication.

In Al-Mohy and Higham (2011) the new algorithms are compared against many competitors across a very wide variety of examples. Whilst they do not dominate all other algorithms across all examples, against a given competitor the relevant new algorithm is both more accurate and faster in the vast majority of cases. In particular, the single-time algorithm is found to be both more accurate and typically between and times faster than expv. On the only occasion when the new algorithm is found to slower than expv the relative slow down is due to the matrix norm estimation, which is not required by SPS.

### 2.2 Birth formulation of the SIR and SEIR processes

Ho et al. (2018) reformulate the SIR and SEIR models in terms of births of the (for SEIR only), and species. A recursive formula for the Laplace transform of the transition probability to a given new state in terms of transition probabilities for old states then permits estimation of the transition vector from a known initial starting point in operations, where is the dimension of the statespace. When exact observations are available, the size of the statespace can then be reduced dramatically.

Consider, first, an SIR model with observations of and at times of and . The number of births is , and the number of births is . The total size of the statespace of possible events given the observations is therefore . For an SEIR model a similar argument leads to a product of the number of births of , and .

We may use the same statespace formulation as Ho et al. (2018), provided we include an additional coffin state, , with for all . Any births that would leave the statespace (and hence contradict the observation at time ) instead go to .

## 3 New algorithms

Given a vector , a matrix and a precision , the Single Positive Series (SPS) algorithm estimates , whilst the Multiple-Use Single Positive Series (MUSPS) algorithm estimates for a sequence of times . Unlike any previous algorithms, however, both the SPS and MUSPS algorithms operate by truncating a single series none of whose terms can be negative, rather than truncating multiple series where terms may change sign. Given an , our algorithms calculate such that the (guaranteed to be non-negative) true missing probability mass in each of the dimensions is controlled:

 0<1−d∑i=1^μ∗i<ϵ   and   0<1−d∑i=1^ν(tj)∗i<ϵ   (j=1,…,n),

where and are the probability vectors that would be calculated if there were no rounding errors, and the only errors were due to the truncation of the infinite series. Typically we aim for the error to be similar to the machine’s precision. We control the absolute truncation error since with any truncation of the power series it is impossible to obtain general control of the relative error in a given component of , . Consider, for example, a Moran process (Example 3), where the matrix is tridiagonal. Then is also banded, with a band width of . For any given , and , set . The truncated approximation to gives a transition probability of for all states above , yet, in truth there is a non-zero probability of such a transition.

### 3.1 The Single Positive Series algorithm

Let

 ρ:=maxi=1,…,d|Qii|   and   P=(1/ρ)Q+I. (6)

P is a Markov transition matrix. However, . Thus

 μ⊤=ν⊤eQ=e−ρν⊤∞∑i=0ρii!Pi≈e−ρm∑i=0ρii!ν⊤Pi=:^μ∗⊤.

Since no element of or is negative, there is no possibility for cancellation of terms of opposite sign. Now

 d∑i=1^μ∗i=^μ∗⊤1=e−ρm∑i=0ρii!ν⊤Pi1=e−ρm∑i=0ρii!.

So the absolute relative error, or missing probability mass, due to truncation is

 rm(ρ):=e−ρ∞∑i=m+1ρii!,

the tail probability of a random variable. Of direct interest to us is

 mϵ(ρ):=inf{m∈N:rm(ρ)≤ϵ},

the smallest required to achieve an error of at most , or, essentially, the quantile function for a random variable, evaluated at . Theorem 1, which is proved in Appendix A, provides relatively tight bounds on for any and that might be used.

###### Theorem 1.

If , , and if , . More generally: , where

 m+:=ρ−13logϵ{1+(1−18ρlogϵ)1/2}−1. (7)

Furthermore,

 ⌊m−⌋≤mϵ(ρ)≤⌈m++⌉,

where both inequalities require and the latter also requires and , where

 m− :=ρ+{2ρ}1/2{−log(ϵ√2π)−32logA+log(A−1)}1/2, (8) m++ :=ρ+B−logϵ3{1+(1+18ρB−logϵ)1/2}, (9)
 A:=2ρh(m++1ρ),   B:=−12log4πρh(m−ρ),

and .

The bound (7) arises from a standard argument, whereas those in (8) and (9) are derived from extremely sharp but intractable bounds on in Short (2013); our bounds use only elementary functions and so are much quicker to compute than the quantile upper bound in Short (2013), yet from Figure 1 they are still very sharp. The bounds in (8) and (9) together with the alternative form in (10) permit a simple but fast binary search for (see also Section 3.4). Equation (10) follows from the equivalence between at least events of a Poisson process with a unit rate occuring by time and the time until the th event being at most :

 rm(ρ)=1Γ(m+1)∫ρ0xme−xdx. (10)

As well as providing a tight strict upper bound on the number of sparse vector-matrix multiplications required by our algorithm, the bounds lead directly to the following asymptotic cost:

###### Corollary 1.

For any fixed error tolerance, , .

The Single Positive Series (SPS) algorithm is presented as Algorithm 1. For large values of , although there is no problem with large positive and negative terms cancelling, it is possible that the partial sum might exceed the largest floating point number storable on the machine. Our algorithm circumvents this problem by occasionally renormalising the vector partial sum when the most recent contribution is large, and compensating for this at the end; see lines 5, 12 and 14. One could, alternatively, renormalise when the partial sum itself becomes large, but the translation of the algorithm to multiple time points (Section 3.3) becomes both messier and computationally slower, with no benefits for the accuracy of the algorithm for any values for which the algorithm might feasibly be used.

### 3.2 Options

We now describe two optional extensions to the SPS algorithm: renormalisation and two-tailed truncation.

Since there is no need to keep track of the logarithmic offset ( in Algorithm 1). Instead the final vector ( in Algorithm 1) is renormalised at the end so that its components sum to . Renormalisation when a rate matrix is being exponentiated is included as an option in expoRkit Sidje (1998), for example. In experiments detailed in Section 4.1 and 4.2 we find that it improves the accuracy of the algorithm by between a factor of and several orders of magnitude at no additional computational cost.

In contrast to the general applicability of renormalisation, two-tailed truncation is unique to the SPS algorithm and reduces the computational cost with no loss of accuracy. When is moderate or large, the total mass of probability from the initial value of and the early values accumulated into (Steps 6 and 10 of Algorithm 1) is negligible (has a relative value smaller than , say) compared with the sum of the later values. In such cases may be initialised to and step 10 omitted for values of beneath some . Proposition 1 (see Appendix B for the proof) shows that if is chosen such that then setting ensures that the missing probability mass is no more than . For large , , so with two-tailed truncation the cumulative cost of Step 10 dwindles compared with the other costs, all of which are repeated times.

###### Proposition 1.

Given , let , and let . Then for , .

### 3.3 The Multiple-Use SPS algorithm

The SPS algorithm evaluates for a single time, . Suppose, now, that transitions for multiple time points, , are required. We use the identity

 ν⊤eQt=e−ρtν⊤eρPt=e−ρt∞∑i=0(tt∗)i×ρiti∗i!ν⊤Pi,

where is as defined in (6), and .

Then, with , and as in Algorithm 1, for each additional time interval, , since , we initialise . Within the iterations, we update and as in Algorithm 1 but for some relevant (rather than ) and, effectively, include the additional step for each : . We, therefore, avoid the potential build up of errors from partitioning the total time interval into sub-intervals. Furthermore, we reduce the cost since the vector-matrix multiplication, which involves multiplications and additions is only performed once for each ; for all times , the only update is to , and this involves a single addition and division. Moreover, although the loop over is repeated times, for any given , is only updated when the contribution to the eventual total would be non-negligible.

Both renormalisation and two-tailed truncation can be applied to the MUSPS algorithm if desired, and the algorithm with both of these options (MUSPS2r) typically provides the best combination of speed and accuracy. The bare bones of MUSPS2r is provided in Algorithm 2; for simplicity of presentation the version does not deal with numerical underflow or overflow, and the necessary additional steps are provided in Appendix C.

### 3.4 Implementation details

C++ code for SPS and MUSPS together with some working examples is (for now) besides the link to this paper at https://www.maths.lancs.ac.uk/sherlocc/Publications/publications.html

For a single time point, our binary search algorithm homes in on the required using the upper and lower bounds of Theorem 1 together with the identity (10), the right hand side of which can be evaluated quickly and accurately using the standard C++ toolbox, boost. This is quicker than the standard implementation of the Poisson quantile function (e.g. as implemented in boost), which uses the Cornish-Fisher expansion to approximate the quantile, hence needing an expensive evaluation of , and then conducts a local search. When evaluating for multiple times , presented in ascending order, our function can be speeded up still further for in by using as a lower bound for .

The speed of a vector multiplication by a sparse-matrix depends on the implementation of the sparse matrix algorithm. In R R Core Team (2018) and in C++ Armadillo Sanderson and Curtin (2016, 2018), sparse matrices are stored in column-major order. Hence pre-multiplication of the sparse matrix by a vector, , is much quicker than post multiplcation, . In other languages, such as Matlab, sparse matrices are stored in row-major order and post-multiplication is the quicker operation, so should be stored and used, rather than .

## 4 Numerical comparisons

In Al-Mohy and Higham (2011) the new algorithm (henceforth referred to as AMH) is compared across many examples against state-of-the-art competitors, including, in particular, the expokit function expv Sidje (1998). In most of the experiments, for single times and multiple time points, AMH is found to give comparable or superior accuracy together with superior computational speed. Given these existing comparisons and that comparing direct timings of algorithms coded in different languages confounds the efficiencies of the algorithms and of the compilers, we perform a relatively short comparison of accuracy and clock time between SPS, expv and a bespoke algorithm for epidemic processes, before comparing SPS and AMH in terms of accuracy and computational complexity across a range of examples. We then move on to multiple time points and exhibit additional gains in both speed in accuracy, via the MUSPS algororithm.

The highest accuracy available in C++ using sparse matrices and the armadillo linear algebra library is double precision, which we used throughout in our implementation of both our algorithm and that of Al-Mohy and Higham (2011). For the SPS algorithm we consider two settings: (for accuracy comparable with the algorithm of Ho et al. (2018); see Section 2.2) and , where the missing probability mass is just below the available precision. For the SPS algorithm, following the four options in Section 3.2; the label ‘r’ indicates that renormalisation was used and the label ‘2’ indicates that two-tailed truncation was used.

### 4.1 Comparison with state-of-the art algorithms other than AMH

We consider the collection (see the first three rows of Table 1) of (susceptible and infected) values for the Eyam plague that originated in Raggett (1982) and were used in Ho et al. (2018) to examine the speed and accuracy of the algorithm for the birth representation of the SIR model briefly described in Section 2.2. We set the parameters to their maximum-likelihood estimates, and consider the likelihood for the data in Table 1. In addition, to mimic the size of potential changes between observation times and the size of the elements of the rate matrix from a larger population, we also evaluated the likelihood for the jump directly from the data at time to the data at time . The final two rows of Table 1 refer to the rate matrix for the transition between consecutive observations and provide the dimension the matrix and the absolute value of is largest entry ().

For the SPS algorithm, with and , the algorithm of Ho et al. (2018) and the FORTRAN routine expv from the expoRkit package Sidje (1998) we found the CPU time for estimations of the likelihood ( estimates for the likelihood for the jump from to ); each experiment was performed three times, and the range of timings is given. We also recorded the error in the evaluation of the log likelihood. Given that the true likelihood is not known, the error using SPS with was approximately bounded by examining the discrepancy from the result produced by AMH with chosen for double-precision arithmetic.

The results are presented in Table 2. Since using renormalisation and two-tailed truncation together produced the fastest and most accurate evaluations, we only considered this combination when using . When using expv we did not include the (considerable) cost of creating the sparse rate matrix for each transition, since this was necessarily performed in R. The timings for the SPS algorithm do, however, include matrix-creation costs, which accounted for around of the total cost.

As will be discussed further in Section 4.2, the choice of tolerance, , typically has only a small effect on the speed of SPS. For the full likelihood evaluation, SPS() is over times as fast as HCS and more accurate. The more accurate SPS() is roughly times faster than HCS as well as being more accurate even than expv and over twice as fast as expv; however this timing does not take into account that expv was using eight times the processor power of the other algorithms (expv is automatically optimised to the machine on which it is compiled, and attempting to restrict it to a single core caused it to slow down by a factor of , rather than ). This factor also does not account for the fact that FORTRAN is known to deal more efficiently with matrices than does C or C++ (e.g. https://modelingguru.nasa.gov/docs/DOC-2625).

For the single large jump between observations, we see the same pattern in terms of accuracy. The gain in efficiency by using two-tailed-truncation is larger because is larger ( and ), but despite this, HCS is now more efficient than SPS, although considerably less accurate than SPS(). However, in addition to the savings in state-space size made by Ho et al. (2018), further savings, are possible for SPS and expv. For example, for the SIR model, the constraint that: implies that . For the single big observation jump this gives that , almost halfing the number of states that need to be considered. With this saving, SPS would be of a similar speed to HCS. Analogous savings of up to a factor of would be possible for an SEIR model.

We consider a further example, Example 5, chosen for tractability so that the values can be calculated using greater precision and the true errors ascertained.

###### Example 5.

Immigration-death A single species has members in a confined space with available slots. The species dies out with a rate proportional to the current frequency, and immigration occurs with a rate proportional to the number of empty slots.

 XμX⟶∅   and   ∅γ(n−X)⟶X.

Transition probabilities from are derived in Appendix D. ∎

Setting , , and , we timed calculations of the probability vector and examined . Using 8 cores at , expv took seconds, with an error discrepancy of , whereas on a single core, SPS2r() took seconds with a discrepancy of ; the accuracy of SPSr was almost identical, whereas the discrepancies of both SPS and SPS2 where ; the timings were all within of that of SPS2r, which was the quickest. We then repeated the experiment but with and repetitions. Using cores at , expv took seconds, with an accuracy of , whereas using a single core, SPS2r() took seconds with an accuracy of (SPS2r/SPSr) and (SPS/SPS2).

Even allowing for a reduction in efficiency because of communication costs between cores, the results for the immigration-death model and those from Table 2 suggest that SPS is at least an order of magnitude faster than its nearest generic and easily available competitor, expv from expoRkit, as well as several orders of magnitude more accurate.

### 4.2 Comparison with the algorithm of Al-Mohy and Higham

We now compare the computational complexity and the accuracy of our method and that of Al-Mohy and Higham (2011) across four reaction networks. Complexity is measured in terms of , the number of (sparse) vector-matrix multiplications required in evaluating . The first example is chosen specifically because its transition probabilities are tractable and exact errors in the transition probability vectors can, therefore, be provided. Since an absolute truth is not available for the second and third examples, for a transition of time from the initial condition, accuracy for these examples is indicated by the absolute discrepancy in the probability vector: , where and represent two different estimates of the vector . We refer to the double and single precision error tolerance settings for the algorithm of Al-Mohy and Higham (2011) as AMHd and AMHs respectively (double and single precision arithmetic correspond to relative errors in the representation of any real number of approximately and ). AMHd and AMHs differ only in the choice of and , with AMHs being set up for single precision arithmetic and requiring fewer multiplications in all.

We compare the algorithms on four reaction networks, evaluating transitions of time from an initial condition to a set of logarithmically-spaced values up to a maximum of , which, in all examples, is a natural, large but reasonable value to consider. The first two examples lead to a rate matrix with a maximum (at ) and ; the third leads to a rate matrix with a maximum and the fourth to a matrix with a maximum .

Our first example is the tractable immigration-death model (Example 5), where we used the same parameters as in Section 4.1, with . Our second example is the Moran model (Example 3) with , and . The parameter values imply that the state is absorbing and, more importantly, that the rate matrix is poorly conditioned and exponentiation is prone to producing large errors (see Crawford and Suchard, 2012). Finally, ; i.e., by the final time most of the population has the new allele. The third example is an SIR model (Example 1) for a population of with . Setting , so that at the start of the epidemic an infective is expected to infect approximately four susceptibles, we find that ; there is a high probability that the epidemic is over by the final time considered. The final example is an SEIRS model with , and so that , and epidemics that have not died out have typically suffered from two ‘waves’ of the disease.

In Figure 2 we first examine the immigration-death model, where accuracies are, effectively, exact. For values of , AMHs is much less accurate than AMHd, SPS2 and SPS2r, which all have similar accuracy. However, for larger values of , AMHs, AMHd and SPS2 all have comparable accuracy, which degrades approximately linearly (on the log scale) as , and hence the number of vector-matrix multiplcations and, thus, the accumulated error, increases. The renormalisation step in SPS2r appears to mitigate against much of the degradation with . Accuracy results for SPS with single-tailed truncation are indistinguishable from those for SPS with two-tailed truncation and so are not shown. For the four combinations of SPS with (not shown), for accuracies are between and , with greater accuracy for . SPSr is the least accurate and SPS2 is the most accurate because arithmetic is still carried out to double precision so errors are entirely due to the premature truncation. SPS2 truncates with the upper tail at and the lower strictly below and so enjoys greater accuracy; the renormalisation in SPSr and SPS2r redistributes the missing mass over the existing entries rather than placing some of it on the entries which are currently zero yet should not be.

For small values of , SPS() and AMHd have similar complexities, as do AMHs and SPS(), this latter explaining the relative innacuracy of AMHs compared with AMHd for low values. For moderate and large the complexities of SPS() and SPS() are very similar (since for both) and are roughly half of an order of magnitude smaller than the complexities of both AMHd and AMHs. Indeed, at the largest values the ratio for AMHd to SPS() is . The main reason for the efficiency gain is that to preserve precision in the face of potential cancellation of large positive and negative terms, the algorithm of Al-Mohy and Higham (2011) increases so as to limit the highest series power to . For large this leads to a cost of approximately and for AMHd and AMHs respectively, whereas for our algorithm, from Corollary 1, the cost is approximately . The true cost of AMHs and AMHd, visible in Figure 2, is less than suggested above because, as described in Section 2.1, the algorithms include a stopping rule for early truncation of each series expansion.

Largely the same pattern as above is observed for the Moran, SIR and SEIRS models and so Figure 2 only includes the plots for the SIR model. The following minor differences are evident: (i) for low values, AMHs is (Moran), (SIR) and (SEIRS) rather than orders of magnitude less accurate than AMHd and SPS(); (ii) with no ground truth we are unable to ascertain the relative accuracies of AMHd and SPS() but can observe that their discrepancy increases with rho and that they agree to within approximately for ; (iii) for SPS() renormalisation improves the accuracy slightly for the SIR and SEIRS models but reduces it slightly for the Moran and the Immigration-Death models. Using two-tailed truncation always improves the accuracy slightly. At the largest values the ratio of complexities for AMHd to SPS() are (Moran), (SIR) and (SEIRS).

For the algorithm of Al-Mohy and Higham (2011), because of the further complication of including the norm-approximation algorithm of Al-Mohy and Higham (2009), we did not approximate the norms , but calculated them exactly. The cost of these evaluations was not included in the algorithm complexity; the vector-matrix multiplications that would have been used for these calculations in Al-Mohy and Higham (2009) were also not included in the complexity.

### 4.3 Simultaneous evaluation at multiple time points

Finally, we investigate the MUSPS algorithm of Section 3.3; its accuracy, speed and usefulness. We examine the immigration-death and SEIRS algorithms with the parameterisation as specified in Section 4.2 and four variations on Barábasi-Albert random network model that will be described, briefly, below. For each system we considered time points, equally spaced between and , where for the immigration-death system, for the SEIRS system, and for each of the four the random networks; in each case the upper bound, , was sensible given the system and (for SEIRS and random graphs) application of interest. Using time points for the intractable systems was not feasible as the cost of obtaining a set of appoximate truths using SPS (see below) is quadratic in .

We created a set of four random networks based upon a pair of independent realisations from the Barabasi-Albert model (see Appendix E), and , with parameters . Each of and has nodes, with an average degree of , a minimum degree of and the probability that the degree of a randomly chosen node is being proportional to (e.g. Durrett, 2007, Chapter 4). Our four graphs are obtained by connecting and through a single edge with a weight of , as follows. Let and be the nodes in and with the highest degree, and let and be the last node added to and respectively (this has the lowest possible degree). Joining to gives , joining to gives , joining to gives and joining to gives . We took and .

For all three applications we ran MUSPS and a sequential SPS which evolved a current vector by a time of to obtain a new current vector times. We compared the speeds of the two algorithms and their accuracies across all of the times from to ; for the immigration-death system, accuracy was relative to the known true transition probabilities; for the other two systems it was relative to an individual, direct calculation for each using SPS. We used MUSPS2r and SPS2r throughout.

Each application was run times; Table 3 shows the dimension and value for each system as well as the range of (i) times for MUSPS and (ii) ratios of the time for MUSPS to the time for sequential SPS. For the random networks model, for a given realisation of and , the variation in timings for was less than , so we only show the results for . Total time is roughly proportional to , as expected. For SEIRS and the Barabási-Albert model MUSPS is more efficient than SPS. For the immigration-death system MUSPS is more expensive than sequential SPS for two reasons. Firstly, the dimension is small so the relative cost of the -dependent overheads of the MUSPS algorithm is large; moreover, has only entries per column and so each vector-matrix multiplication has a similar cost to the vector-scalar multiplication and vector-vector addition that MUSPS must still perform. The additional cost may, however, be justified if accuracy is important, as we shall see.

Figure 3 plots the accuracy (base logarithm of the discrepancy from the ‘truth’, with removed as all algorithms produce the same probability vector) of MUSPS against that of SPS for the immigration-death system, the SEIRS system and the first set of realisations from the random-network model (the pattern was persistent across replications). Whenever there is a substantial difference in the accuracies, MUSPS is more accurate than sequential SPS.

The left and central panels of Figure 4 demonstrates an example use of MUSPS within an SEIRS model by plotting the probability of extinction against (left panel) and the expected number of people infected or exposed against amongst infections which have not died out (central panel). The kink in the first graph around corresponds to the peak in the second. However, the central panel shows a dip in the presence of infections before (conditional) equilibrium is reached. Interestingly, the solution to the initial value problem with and

 dSdt=−θ1SI+θ4(npop−S−E−I), dEdt=θ1SI−θ2E, dIdt=θ2E