 # Low-rank tensor methods for Markov chains with applications to tumor progression models

Continuous-time Markov chains describing interacting processes exhibit a state space that grows exponentially in the number of processes. This state-space explosion renders the computation or storage of the time-marginal distribution, which is defined as the solution of a certain linear system, infeasible using classical methods. We consider Markov chains whose transition rates are separable functions, which allows for an efficient low-rank tensor representation of the operator of this linear system. Typically, the right-hand side also has low-rank structure, and thus we can reduce the cost for computation and storage from exponential to linear. Previously known iterative methods also allow for low-rank approximations of the solution but are unable to guarantee that its entries sum up to one as required for a probability distribution. We derive a convergent iterative method using low-rank formats satisfying this condition. We also perform numerical experiments illustrating that the marginal distribution is well approximated with low rank.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The dynamics of cancer can be studied using tumor progression models, cf. [Beerenwinkel2014]. These models describe the evolving genotype of a tumor as a continuous-time Markov chain. A model includes genomic loci that may or may not be mutated. It starts out with all mutations absent and then progressively accumulates mutations. The number of possible states of the tumor is thus . In typical applications one is interested in probability distributions over this state space that are far from stationary. Extrapolating the future course of a given tumor requires the computation of transient distributions. However, since the age of a tumor and thus the time point of an observation of the Markov chain is generally unknown, we study the time-marginal distribution, which is defined as the solution of a certain linear system. Today at least genes are known to drive tumor progression, cf. [Bailey2018], i.e., a single distribution would require the storage of more entries than there are atoms in the visible universe. This phenomenon is called state-space explosion [Buchholz2007] and renders classical methods for calculating or storing distributions infeasible. This problem is also known in other application areas, e.g., chemical-reaction networks [Anderson2010], the chemical master equation [Kazeev2014], Hamiltonian dynamics [Lubich2016], queuing networks [Chan1987]

, or stochastic neural networks

[Yamanaka1997]. To overcome the state-space explosion, we describe such Markov chains via Stochastic Automata Networks [Plateau2000] which provide a sparse representation of the infinitesimal generator as a short sum of Kronecker products. The main idea is to split up a large Markov chain into smaller ones, the so-called automata, which are interconnected. In tumor progression, for example, each automaton represents a driver mutation and encodes whether it is present or absent. When describing transitions, there are effects between individual automata, e.g., some mutations may favor or inhibit the occurrence of others. This leads to entries within the sparse representation of the generator that are functions of the current state. The explicit evaluation of these functions at each state would result in exponential cost. However, if these functions are separable, the structure as a short sum of Kronecker products without functional entries can be preserved and the exponential cost avoided. Assuming this separability, we represent the generator and also the operator of the system as a short sum of Kronecker products. We make the common assumption that the Markov chain starts always at a certain state, which implies that the right-hand side of the system is a Kronecker product. To determine the marginal distribution, i.e., to solve the linear system, in a sparse way we exploit this structure and turn to so-called low-rank tensor formats. These are known to handle the state-space explosion by reducing the exponential cost in the number of automata to linear cost. In the context of low-rank tensors, a sum of Kronecker products is referred to as CANDECOMP/PARAFAC (CP) format [Carroll1970, Harshman1970]. For large Markov chains, the CP format has been used extensively to approximate, in particular, stationary distributions, e.g., in [Kulkarni2011], and to derive conditions for their existence, e.g., in [Fourneau2008]. However, the problem of finding a CP decomposition of a given tensor is NP-hard in general, cf. [Hastad1990]. Therefore we focus on tree tensor networks, especially on the tensor-train [Oseledets2009, Ostlund1995, White1992, Loan2008] and the hierarchical Tucker format [Grasedyck2010, Hackbusch2009]. There are several known methods to determine an approximate solution of a linear system using low-rank tensor formats, see, e.g., [Grasedyck2013] for an overview. These methods can roughly be divided into optimization-based approaches and iterative procedures including truncation. Truncation, i.e., approximation of a tensor with one of lower rank, is needed to keep computational and storage costs low during the iteration. The tensor-train format was successfully used for the computation of, e.g., transient distributions [Johnson2010], mean times to failure [Masetti2019], and stationary distributions [Kressner2016, Kressner2014]. There mainly optimization-based approaches were presented. In [Buchholz2016] the hierarchical Tucker format was applied to reduce the storage cost for distributions and the computational cost for performing basic operations. Continuing in [Buchholz2017] adaptive truncation strategies for the computation of stationary distributions using iterative methods were presented. To allow for a probabilistic interpretation of an approximation, the latter should be a probability distribution itself, i.e., all entries should be non-negative, and the sum of all entries should be equal to one. Similar to [Kressner2016, Buchholz2017, Kressner2014, Kulkarni2011, Masetti2019] we neglect the non-negativity to ensure an error-controlled approach, see, e.g., [Kim2014] for an overview. We focus on an iterative solver to determine time-marginal distributions which guarantees that the sum of all entries of the approximation is one. In [Kressner2014]

a power iteration based on a formulation of the stationary solution as an eigenvalue problem was derived, where after each application of the operator in the power iteration the current approximation was rescaled to ensure that it sums up to one. However, the marginal distribution we are interested in is not stationary, i.e., the right-hand side of the linear system is non-zero. Rescaling the current approximation is not an option in this case since the final result would then solve the rescaled system rather than approximate the marginal distribution. Alternatively, one could formulate the system as an eigenvalue problem, but this is not straightforward because it is unclear how the right-hand side should be modified for this purpose. Other popular iteration methods, e.g., projection methods, which are compatible with low-rank tensors do not allow us to implement our normalization condition either since the convergence of these methods cannot be guaranteed if rescaling takes place after every step. To overcome this problem, we derive a novel iterative method based on the Neumann series

[Dubois1979] and the uniformization method [Grassmann1977] using low-rank tensor formats. We verify that this method results in an approximation that sums up to one and prove its convergence. In our numerical experiments, we focus on the concept of Mutual Hazard Networks [Schill2019] for tumor progression. Our experiments illustrate that the marginal distribution can be approximated by low-rank tensors using the new algorithm. This work is organized as follows. In creftype 2 we derive the linear system that defines the time-marginal distribution of a continuous-time Markov chain. We represent its operator and right-hand side in a sparse way based on Stochastic Automata Networks. In creftype 3 we introduce the concept of tensors and tree tensor networks. We present the tensor-train and the hierarchical Tucker format. Using these formats we derive an iterative method and prove its convergence. In creftype 4 we perform numerical experiments based on the concept of Mutual Hazard Networks. In creftype 5 we conclude.

## 2 Statement of the problem

First we explain the main problem we address.

### 2.1 Time-marginal distribution

A continuous-time Markov chain is defined by its state space , its infinitesimal generator and an initial distribution . In this paper we assume that the state space is discrete. The generator is an operator which stores the rates of transition from state to another state in and the rates of staying in a state in . The probability distribution at time is defined by

 ddt\p(t)=\Q\pN or \p(t)=exp(t\Q)\pN. (1)

We make the common assumption that every trajectory starts at the same state, i.e., the initial distribution

is a canonical unit vector. Here, we assume that the observation time

is unavailable, such as, e.g., in tumor progression modeling, and thus must be treated as a random variable. Therefore, we are interested in a so-called

time-marginal distribution which is independent of the time and which we will call marginal distribution for brevity. Each entry of the marginal distribution indicates the probability of observing a state

at a random time. We follow the common assumption that the sampling time is an exponentially distributed random variable with mean

, i.e., . Similar approaches can be found, e.g., in [Beerenwinkel2009, Hjelm2006, Schill2019]. Thus, we have

 \p=∞∫0exp(−t)\p(t) dt=∞∫0exp(t(\Q−\Id))\pN dt. (2)

In the following we assume that the spectrum is contained in the negative complex half plane . In this case the improper integral exists and

 \p=(\Id−\Q)−1\pN. (3)

Hence, the marginal distribution is defined as the unique solution of a linear system, since the operator is regular. To overcome the state-space explosion, we focus on the concept of Stochastic Automata Networks [Plateau2000] which offer a sparse representation of the infinitesimal generator .

### 2.2 Stochastic Automata Networks

For a continuous-time Markov chain of interacting processes the discrete state space factorizes in a natural way into the state spaces of the individual processes. Each process is itself a Markov chain over its own state space and is called a stochastic automaton . The set of stochastic automata is called a Stochastic Automata Network, cf. [Plateau2000]. The full state space consists of all possible combinations of states in each automaton, i.e., for and it is given by

 (4)

Each state specifies its state in each automaton . We now consider transitions between states in the full state space , which are given by one or more transitions between states within the individual state spaces . Such transitions are called functional/constant and synchronized/local according to the following definitions.

• A transition is called functional if the transition rate is a function of the current state in . Otherwise it is called constant.

• A transition in a state space may force or prevent transitions in another state space at the same time. Such transitions in are called synchronized. Otherwise they are called local.

It is known that the infinitesimal generator of a Stochastic Automata Network with automata can be represented as a sum of Kronecker products, cf. [Plateau2000]. To this end, the infinitesimal generator is split into two parts describing local and synchronizing transitions, respectively, i.e., . Since the local transitions only change the state in their corresponding automaton, we represent the local part as

 (5)

where denotes the identity in and describes the local transition in . The synchronized transitions are represented similarly. Instead of one term of Kronecker products, two terms are needed for each synchronized transition, cf. [Plateau2000]. The whole infinitesimal generator for automata and synchronized transitions is given by

 \Q=\Qlocal+\Qsync=2s+d∑ν=1 d⨂μ=1\Q(μ)ν. (6)

Instead of entries, one only has to store entries. Following the current state of the art in tumor progression models we only consider local transitions from now on.

### 2.3 Description of functional transitions

For Markov chains describing interacting processes, i.e., automata, we are interested in functional transitions, for which the entries of are functions. Every time such an entry is needed one would have to evaluate the function, which requires knowledge of the states of all automata. To analyze this problem more precisely, we concentrate on Markov chains satisfying the following assumptions:

1. The continuous-time Markov chain can be represented as a Stochastic Automata Network with automata. Each automaton has a state space with , and the state space of the network is given by .

2. There are only local and functional transitions.

3. For each automaton there exists an ordering of the states such that there is no transition from state to state for all . We denote the set of possible transitions in by .

4. The transition rate from state to in , where and differ only in automaton and satisfy , is separable and can be represented with parameters by

 Q[y,x]=d∏μ=1\PM(x[ν],y[ν]),x[μ].
5. The time is an exponentially distributed random variable with mean , i.e., .

In the following we discuss these assumptions in more detail. Due to assumption 2 and 3, all possible transitions are covered in assumption 4. All other transition rates are zero. Each diagonal entry of the generator is given by minus its off-diagonal column sum, since each column sum is zero. Hence, the whole infinitesimal generator is already defined by the parameters for all states and , and we denote the corresponding generator as . The parameter with can be interpreted as a direct multiplicative effect of the state on the local transition from to in , while is a baseline rate of transition from to in . If a state has no direct effect on the transition from to in , then . As usually assumed, an automaton does not affect many others directly, and therefore many multiplicative effects, i.e., parameters , are equal to one. This corresponds to a sparsity of the interactions between the automata. Assumption 4 allows us to separate the functional transition rates and to transfer the parameters to the matrices corresponding to automaton within each Kronecker product in creftype 5. Thus, we represent all functional transitions within the Kronecker product structure without any functional entries:

 \QPM =d∑ν=1 ∑(i,j)∈Tν d⨂μ=1\QCore (7) with \QCore

where has only one non-zero entry . By assumption 3, the generator is similar to a lower triangular matrix where the similarity transformation is obtained by permuting states following the order of each automaton. Hence, the spectrum consists of all diagonal entries

 σ(\QPM−\Id)={−1−d∑ν=1 ∑(x[ν],j)∈Tν d∏μ=1\PM(x[ν],j),x[μ] ∣∣∣ x∈S}⊆\R≤−1

and the improper integral in creftype 3 exists.

### 2.4 Mutual Hazard Networks

A class of Markov chains satisfying assumptions 1 to 5 occurs in the context of tumor progression, the Mutual Hazard Network model [Schill2019]. There tumor progression is modeled as a Stochastic Automata Network over a discrete state space , and a state represents the genotype of a tumor. Each automaton represents a genomic event such as a point mutation, a copy number alteration, or a change in DNA methylation. The state space of each automaton is given by , where indicates that the genomic event has not occurred yet and that it has. Hence, the state space of the Markov chain is given by . All events are assumed to occur one after another and irreversibly, i.e., there are only local transitions from state to in . If an event has not yet occurred, this has no direct effect on all other events, i.e., for all . A parameter is interpreted as a promoting effect and as an inhibiting effect. Furthermore, time is considered as a random variable , cf. [Schill2019]. In summary, all assumptions 1 to 5 are satisfied. Similar conclusions can be obtained for several types of progression models, e.g., [Beerenwinkel2009, Hjelm2006]. The Mutual Hazard Network with genomic events corresponds to parameters and for all . Thus, the parameters can be collected in a matrix with . To allow for visualization and interpretation of the network, one assumes sparsity of the Mutual Hazard Network, i.e., many automata or events do not interact directly. In [Schill2019] optimal parameters

are found using maximum-likelihood estimation for a data set of tumors. Since the age of a detected tumor is unknown, the likelihood is given by the marginal distribution

, i.e., each entry is the probability under the model that a tumor has the genotype when it is observed at a random time. This requires for fixed parameters in each optimization step. Due to the state-space explosion, the computation of the marginal distribution in [Schill2019] based on classical methods was limited to about automata.

### 2.5 Structure of the linear system

As the improper integral in creftype 2 exists, the time-marginal distribution depending on parameters is uniquely defined by

 (\Id−\QPM)\pPM=\pN. (8)

Similar to the representation in creftype 7, the identity is a Kronecker product of smaller identities corresponding to the state spaces . The initial distribution is a canonical unit vector and thus a Kronecker product of canonical unit vectors corresponding to each . Hence, the operator and the right-hand side of creftype 8 have a representation as a short sum of Kronecker products, which allows for efficient storage.

## 3 Low-rank method

We now compute the marginal distribution in a sparse way. To avoid losing sparse structures when performing arithmetic operations we keep the Kronecker products unexecuted and understand our operators and distributions as tensors.

### 3.1 Low-rank tensor formats

We view tensors as multidimensional generalizations of vectors and matrices, i.e., one-dimensional and two-dimensional tensors.

###### Definition 1 (tensor)

Let and be a Cartesian product of discrete index sets . An object is called a tensor of dimension . Each direction is called a mode of , and the cardinality of the -th index set is called the -mode size.

In our case, the index set corresponds to the state space , our distributions , , are tensors of dimension , and the automata correspond to the modes with sizes . Similar to the reshaping of the distribution vectors into tensors, one can reshape a tensor into a matrix. This concept, called matricization, corresponds to an unfolding of the tensor and is defined, in analogy to [Grasedyck2010], as follows.

###### Definition 2 (matricization)

Let and with and . The matricization of corresponding to is defined as with and

 B(t)[(iμ)μ∈t,(iμ)μ∈s]=B[i1,…,id]

for all . In particular .

In the language of tensors, the structure of in creftype 7 can be generalized to the so-called CANDECOMP/PARAFAC (CP) format introduced in [Carroll1970, Harshman1970].

###### Definition 3 (CP format)

A tensor has a CP representation if there exist such that

 B=r∑ν=1d⨂μ=1b(μ)ν. (9)

Then is called the CP representation rank, and the are called the CP cores of .

The infinitesimal generator in creftype 7 has CP representation rank , while the identity as well as the right-hand side have rank . A core advantage of the CP format is the data sparsity in case of small representation rank : The representation creftype 9 of a tensor has storage cost in in contrast to for . The problem of finding conditions that guarantee the existence of a low-rank approximation of a given tensor is an active research topic of its own, see, e.g., [Bachmayr2017, Kressner2016, Plateau2007]. Since this goes beyond the scope of this article, we concentrate on numerical experiments which indicate that the marginal distribution can be approximated with low rank. To this end, we have to solve a linear system whose operator and right-hand side have a CP representation. For an operator with CP rank it is unknown how to calculate its inverse analytically. Hence, we need a solver and arithmetic operations to compute the solution numerically. Most of these arithmetic operations lead to an increase in the representation rank and thus in storage. To avoid such an increase one would like to truncate a given CP tensor to lower rank, i.e., approximate the tensor with one of lower rank. As the set of tensors with CP rank at least is not closed for , low-rank approximation within the CP format is an ill-posed problem, cf. [Silva2008]. To overcome this drawback, we use the following low-rank tensor formats which allow for truncation in an error-controlled way.

### 3.2 Tree tensor network graphs

Tensors and many interactions between them can be illustrated via an undirected graph as a tensor network. Each vertex of a tensor network graph represents a particular tensor. The dimension of this tensor is given by the number of its edges and half-edges, where a half-edge is an edge that is connected to only one vertex. creftypecap 1 shows some examples of tensor network graphs. Figure 1: Tensor network graphs: vector, matrix, three-dimensional and five-dimensional tensor (from left to right).

An edge between two vertices can be interpreted as a contraction over the corresponding mode, i.e., one sums over the index set of this mode. The resulting network can again be viewed as a tensor whose dimension is given by the number of half-edges. For example, the contraction of a two-dimensional tensor and a one-dimensional tensor is the matrix-vector product with

 (B⋅C)[x]=∑y∈\I2B[x,y]C[y]

for all . The corresponding tensor network graph is shown in creftype 2 where the mode sizes are denoted by . Figure 2: Matrix-vector product B⋅C∈\R\I1 as a tensor network graph with B∈\R\I1×\I2, C∈\R\I2, and nμ=|\Iμ|.

This concept can be extended to higher dimensions. For a two-dimensional tensor , i.e., a matrix, one can use these networks to illustrate the singular value decomposition as shown in creftype 3. There, the mode sizes corresponding to the half-edges equal the original ones of , and the mode sizes corresponding to the edges are given by the matrix rank .111For a two-dimensional tensor the matrix rank equals the minimal CP representation rank, cf. [Hackbusch2012]. Figure 3: Singular value decomposition B=UΣVT∈\R\I1×\I2 as a tensor network graph with matrix rank r and nμ=|\Iμ|.

Again this can be extended to higher dimensions, i.e., higher-dimensional tensors can also be represented as contractions of auxiliary tensors. In analogy to the two-dimensional case, the mode sizes of the half-edges correspond to those of the original tensor, and the mode sizes of the edges generalize the concept of matrix rank. This generalized rank allows for an error-controlled way of truncation in tree tensor networks. Several tensor formats based on tree tensor networks are known. Here, we focus on the tensor-train format [Oseledets2009, Ostlund1995, White1992, Loan2008] and the hierarchical Tucker format [Grasedyck2010, Hackbusch2009].

### 3.3 Tensor-train and hierarchical Tucker format

The tensor-train format was first introduced to the numerical analysis community in [Oseledets2009]. It is also known in other areas as matrix product states [Ostlund1995, White1992] or as linear tensor network [Loan2008]. A tensor of dimension in the tensor-train format is represented as contractions of so-called core tensors. Each core tensor represents one mode and is connected to its forward and backward neighbors. The core tensors themselves are of dimension three, or two in case of the first and the last mode. For a given tensor-train representation, its rank is defined as the tuple of all mode sizes corresponding to edges. creftype 4 shows an illustrative example. Figure 4: Tensor-train representation of a tensor with dimension d=4, mode sizes all equal to n and representation rank r=(r,r,r).

Instead of the original tensor one needs to store the core tensors. Thus, the required storage cost for a tensor-train representation with dimension , mode sizes all equal to and representation rank component-wise bounded by is in . In our case the dimension of the distribution tensors is equal to the number of automata, and the mode sizes are given by the number of states in each automaton. Note that in general the representation rank can depend on the ordering of the modes/automata, cf. [Grasedyck2011, Oseledets2009].
The hierarchical Tucker format was introduced in [Hackbusch2009] and further analyzed in [Grasedyck2010]. As the name suggests, the hierarchical Tucker format is based on a hierarchical binary tree structure defining the arrangement of the auxiliary tensors. Similar to the tensor-train format there are auxiliary tensors each of which represents one mode of the original tensor. These two-dimensional tensors, the so-called frames, build the leaves of the binary tree. The two-dimensional tensor at the root of the tree and the three-dimensional tensors at all other levels are called transfer tensors. For a given hierarchical Tucker representation, its rank is defined as the tuple of all mode sizes corresponding to the edges of the tree, i.e., where denotes the given tree. creftype 5 shows an illustrative example. Figure 5: Hierarchical Tucker representation of a tensor with dimension d=4, mode sizes all equal to n and representation rank r=(r)t∈T using a balanced tree T.

Note that in the hierarchical Tucker format we have many options for the structure of the binary tree and that the rank depends on our choice, cf. [Grasedyck2011]. Instead of the original tensor the frames and transfer tensors have to be stored. Thus, the storage cost for a hierarchical Tucker representation of a -dimensional tensor with maximum mode size and a representation rank component-wise bounded by is in . In our case the dimension of the distribution tensors is again equal to the number of automata, and the mode sizes are the number of states for each automaton.
Thus, in both the tensor-train and the hierarchical Tucker format, the storage cost grows only linearly in the dimension . This allows for efficient storage and prevents the state-space explosion provided we have low ranks. It is easily possible to convert a CP representation of rank to a tensor-train or a hierarchical Tucker representation, where all rank components are bounded by , cf. [Hackbusch2012]. Similar to the CP format both formats provide arithmetic operations like addition, application of operators, or scalar products. Since the storage cost increases with the square or cube of the rank and arithmetic operations also increase the rank, we need a way to truncate a tensor to lower rank in an error-controlled way. In tree tensor networks we use a generalization of the singular value decomposition to truncate in a quasi-optimal way, see [Oseledets2009] for the tensor-train and [Grasedyck2010] for the hierarchical Tucker format. There it is shown that the error in the Frobenius norm caused by the truncation of a tensor to rank for a given tree is bounded by

 ∥B−τr(B)∥2≤∑t∈T∑m>rtσ2t,m≤ Cd ∥B−Bbest∥2 (10)

where denotes the truncation operator, is the -th singular value of the matricization , is a best rank- approximation, and is a small constant. According to creftype 10, a steep decay in the singular values of the corresponding matricization leads to a small relative truncation error, i.e., the tensor can be well approximated with one of low rank. Inspired by [Hackbusch2012]creftype 1 summarizes some arithmetic operations and their cost in the tensor-train and the hierarchical Tucker format, see [Grasedyck2010, Hackbusch2012, Oseledets2009]creftypecap 1 shows that using those low-rank tensor formats reduces the exponential computational cost in the dimension to linear cost.

Since the hierarchical Tucker format is based on a binary tree, it is possible to do many of the arithmetic operations and especially the costly truncation in parallel level by level of the tree as described in [Grasedyck2017]. Thus, for a balanced tree and dimension the runtime can often be reduced from to , cf. [Grasedyck2017].

### 3.4 Low-rank method

We now make use of these low-rank tensor formats to approximate the marginal distribution as the solution of creftype 8. Since the exact solution is a probability distribution, its entries sum up to one, i.e., it fulfills

 ⟨\e,\pPM⟩=1 (11)

where is the tensor of all ones. To allow for a probabilistic interpretation of an approximation of , we need to treat creftype 11 as an additional constraint. To this end, we now present an iterative method based on the Neumann series [Dubois1979] and the uniformization method [Grassmann1977]. Let be a bound on the diagonal entries of the infinitesimal generator, then

 \pPM=(\Id−\QPM)−1\pN=11+γ(\Id−(γ1+γ\Id+11+γ\QPM))−1\pN.

Owing to the ordering assumption 3 and the fact that all diagonal entries of are non-positive, the spectral radius is smaller than one, i.e., , and so the Neumann series converges. Hence, we can represent the marginal distribution as

 \pPM=11+γ ∞∑m=0 (γ1+γ)m(\Id+1γ\QPM)m\pN. (12)

Alternatively we can derive creftype 12 using the uniformization method. Here, the idea is to describe a continuous-time Markov chain by a discrete-time Markov chain with a time increment that is exponentially distributed. With this interpretation, we write the time-dependent probability distribution in creftype 1 as

 \pPM(t)=∞∑m=0(γt)mm!exp(−γt)(\Pgamma)m\pNwith\Pgamma=\Id+1γ\QPM

for a time . Note that is the transition probability matrix of a discrete-time Markov chain, since is a bound on the diagonal entries of . Marginalization of time similar to creftype 2 and substitution leads to creftype 12:

 \pPM =∞∫0exp(−t) \pPM(t) dt = ∞∑m=0 γmm! ∞∫0tm exp(−(γ+1)t) dt (\Pgamma)m\pN =∞∑m=0 γm Γ(m+1)m!(1+γ)m+1 (\Pgamma)m\pN = 11+γ ∞∑m=0 (γ1+γ \Pgamma)m\pN

where denotes the gamma function. A natural approximation based on creftype 12 would be

 \pktilde=11+γ k∑m=0 (γ1+γ \Pgamma)m\pN

for . Based on the properties of the Neumann series this sequence converges linearly to , but does not satisfy the normalization condition creftype 11 for any . As is a transition probability matrix, its application to a probability distribution leads to a probability distribution again satisfying creftype 11, and thus

 ⟨\e,k∑m=0 (γ1+γ \Pgamma)m\pN⟩=k∑m=0 (γ1+γ)m=(1+γ)k+1−γk+1(1+γ)k

for all . By scaling each element of the sequence, we obtain

 \pk=(1+γ)k(1+γ)k+1−γk+1 k∑m=0 (γ1+γ \Pgamma)m\pN (13)

for , which is now an approximation to that satisfies creftype 11. We prove its linear convergence to in the following theorem.

###### Theorem 1

Let be some given parameters, be the solution of the corresponding linear system creftype 8, and be defined by creftype 13 for all . Then converges linearly to as approaches infinity, i.e.,

 limk→∞\pk=\pPMwith∥∥\pk−\pPM∥∥≤c⋅(γ1+γ)kfor all k∈\N

where .

###### Proof

For , and any we have

 |αk−α| =∣∣∣γk+1(1+γ)((1+γ)k+1−γk+1)∣∣∣<γ1+γ |αk−1−α| <(γ1+γ)k |α0−α|=(γ1+γ)k+1.

Furthermore we obtain

 ∥∥\pk−\pPM∥∥ ≤∥∥\pk−\pktilde∥∥+∥∥\pktilde−\pPM∥∥

Since , the linear convergence follows.

In order to employ low-rank tensor formats, we extend the iterative method corresponding to creftype 13 by truncation. As shown in [Hackbusch2008], a convergent iterative method combined with truncation still converges if the truncation error is sufficiently small. However, the truncation could violate condition creftype 11. Therefore, we have to extend our method to ensure that

 ⟨\e,τ((\Pgamma)m\pN)⟩=1and⟨\e,τ(\pk)⟩=1

for all where denotes the truncation operator. We implement these conditions by rescaling after truncation as shown in creftype 1, see lines and . The procedure should be continued until the norm of the relative residual is smaller than a given tolerance .

In creftype 1 we have to choose a bound on the diagonal entries of . The cost for computing all diagonal entries to find the maximum is in . In case of parameters that vary little,

 γ=d∑ν=1 maxiν⎛⎝∑(iν,jν)∈Tν d∏μ=1 maxiμ \PM(iν,jν),iμ⎞⎠ (14)

can be used as an inexpensive bound. For the computation of creftype 14 we need comparisons and evaluations222Strictly speaking, the computational cost of the method is therefore quadratic, rather than linear, in . However, only needs to be precomputed once, the computational cost of which is negligible. of the parameter where is the maximum number of possible transitions in an automaton. In the case of strongly varying parameters creftype 14 may be a gross overestimation for the diagonal entries of . The question of how to determine a tighter bound with polynomial effort using low-rank tensor formats will be dealt with in future work.

## 4 Numerical experiments

We illustrate our method for the computation of time-marginal distributions in numerical experiments using the following setting based on the model of Mutual Hazard Networks.

### 4.1 Setting

We consider automata and sparse parameters with a particular block-diagonal form. Each block of size characterizes a subset of automata which directly affect one another. There are no direct effects between automata from different blocks. In each block, we would like the effects between neighboring automata to be stronger than those between automata that are farther apart. Therefore, we draw the entries

of the blocks from a normal distribution with mean

(and restrict them to ). Two possible examples for automata and block size are given below:

Parameters of the same color follow the same distribution. For the application of creftype 1 we always choose the bound as in creftype 14, which for Mutual Hazard Networks can be reduced to

 γ=d∑ν=1 d∏μ=1max{1,\PM[ν,μ]}. (15)

We use a canonical balanced tree tensor network for the application of the hierarchical Tucker format, i.e., the automata are assigned to the leaves following their ordering, see creftype 6. We perform all experiments for randomly generated sample parameters for each combination of block size and number of automata. Unless stated otherwise, we compute low-rank approximations of the marginal distribution using creftype 1 with a maximum relative truncation error of . The algorithm stops when the norm of the relative residual is smaller than a tolerance value of . The mean values we compute are arithmetic means.

### 4.2 Study of singular values

One fundamental assumption for solving linear systems using low-rank tensor methods is that a solution can be well approximated by a tensor of low rank. According to the error bound in creftype 10, the truncation error is determined by the singular values of the matricization corresponding to the chosen tree network. A fast decay of those singular values indicates that a tensor can be approximated accurately by one of low rank. To analyze this issue, we solve creftype 8 using classical matrix methods of MATLAB [Matlab2019] and compute the singular values of the corresponding matricization using the htucker toolbox [Kressner2012]creftypecap 6 shows the decay of the singular values of each matricization corresponding to the canonical balanced tree for automata and blocks of size . For each vertex the semi-logarithmic plot displays the means of the singular values. Figure 6: Mean of singular values of matricizations of \pPM for the canonical balanced tree and 100 sample parameters with d=8 automata and block size b=4.

We observe that the singular values, especially those close to the root, exhibit an exponential decay. The two matricizations closest to the root are transposes of each other, and therefore their singular values are identical. The smallest of the singular values are indistinguishable from zero and therefore cannot be displayed in the semi-logarithmic plot. The exponential decay of the singular values indicates that the marginal distribution is well approximated with low rank.

### 4.3 Study of the tree structure

In general, for tree tensor networks the rank depends on the structure of the tree. As already observed for large Markov chains using the tensor-train format, a change in the ordering of the automata within the tensor-train tree network can affect the rank, cf. [Masetti2019]. In the following, we focus on the hierarchical Tucker format and study how the ordering of the automata in the leaves of the tree affects the low-rank approximability. We preserve the balanced binary structure of the tree because this is advantageous for parallelization, cf. [Grasedyck2017]. We already studied the singular values of the matricizations corresponding to the canonical balanced tree, see creftype 6. We change only the arrangement of the automata in the leaves of the tree and compute the marginal distribution again. The decay of the singular values for each matricization is shown in creftype 7 where the semi-logarithmic plot at each vertex displays the means of the singular values. Figure 7: Mean of singular values of matricizations of \pPM for a modified balanced tree and 100 sample parameters with d=8 automata and block size b=4.

Comparing creftypeplural 76, we observe that the choice of the canonical binary tree, i.e., the original ordering in the leaves, results in a significantly faster decay of the singular values close to the root. creftypecap 7 shows that the singular values closest to the root also have an exponential decrease, but at a much slower rate. Note that blocks of size indicate that the automata interact directly with one another and are thus highly correlated. The same holds for the automata . There are no direct interactions between automata of different blocks, i.e., these are only weakly correlated. Hence, the modified balanced tree separates highly correlated automata, which explains the slower decline of the singular values in level (level being the root). In contrast, the canonical balanced tree separates weakly correlated automata, leading to a faster decline of the singular values. We will make similar observations when studying the approximation rank.

### 4.4 Study of the approximation rank

As the rank in a tree tensor network is a tuple depending on the underlying tree , we consider the maximum rank and the effective rank . The effective rank of a tensor representation is defined such that the storage cost for this representation equals the cost to store one with rank . Since intuitively a rank should be an integer, is rounded up to the nearest integer. We compute low-rank approximations of the marginal distribution using creftype 1 as described in creftype 4.1

. In the plots we only show the mean values of the ranks, since their variance among the

realizations of was very small. creftypepluralcap 88 show the maximum and effective approximation rank of as a function of the number of automata for blocks of size .

We observe that is approximated to a tolerance of with low rank in all cases. In particular, for all block sizes considered, the effective rank increases less than linearly in the number of automata and is bounded by . As expected, the maximum rank reacts much more sensitively to changes in and . For constant block sizes and no smooth increase in the number of automata can be detected. Instead, maximum and effective rank oscillate in the number of automata. Particularly low ranks typically occur when the strongly correlated automata are not separated by the tree structure. For example, in case of automata with a block size of the strongly correlated automata are split only in the lower levels of the tree, and we observe a very low rank. For automata with a block size of , however, the strongly correlated automata , for example, are separated already in the first level of the tree, and we observe a much higher maximum rank. A closer look at the rank components reveals an increase especially at the points where the blocks are split up. In the case of the block size increases linearly in the number of automata. In creftype 8 we see a strong increase of the maximum rank for automata. However, the effective rank grows smoothly and logarithmically in the number of automata, see creftype 8. This indicates that the maximum rank occurs only sporadically. A closer look shows that the maximum ranks again occur only in isolated cases at the vertices where highly correlated automata are separated. This observation and the fact that the effective rank is small implies that most rank components are small as well. These results suggest that neither the number of automata nor the size of the blocks alone are responsible for an increase in rank, but that especially the distribution of the automata in the tree has a large effect. How to construct an appropriate tree in order to keep ranks low using a-priori information on the parameters is a topic of ongoing research.
We also analyze how the approximation rank is influenced by the maximum relative truncation error allowed within the algorithm. To this end, we compute a low-rank approximation of the marginal distribution for automata and blocks of size using creftype 1. The iteration stops when the norm of the relative residual for creftype 8 is smaller than . In order to achieve this for all sample parameters, has to be chosen smaller than . If we allow for a higher truncation error, the iteration stagnates in some cases. We again restrict ourselves to plotting mean values since the variances are small.  creftypepluralcap 99 show semi-logarithmic plots of the mean of the maximum and effective rank of the approximation of with automata and block size as a function of the maximum relative truncation error for several values of the tolerance .

For maximum and effective rank are nearly constant with and . For , we observe that there are only small differences in maximum and effective rank, respectively. In these cases, the maximum rank increases logarithmically in the maximum relative truncation error , see creftype 9, and the effective rank is nearly constant for all truncation errors, see creftype 9. For very small all curves converge. This observation suggests, on the one hand, that can be accurately approximated with a tensor of maximum rank and effective rank , since a more accurate truncation, i.e., smaller , has only very small impact on the approximation ranks. On the other hand, the conclusion that the ranks are low and nearly independent of the tolerance value indicates that the ranks during the iteration are also low. This allows not only for efficient storage of the resulting approximation but also for efficient computation using creftype 1.

### 4.5 Study of the method

We now study the convergence of creftype 1.  creftypecap 10 and creftype 10 show semi-logarithmic plots of the norm of the relative residual as a function of the iteration step for block size and automata.  creftypecap 10 displays the mean value of the relative residual and creftype 10 additionally the corresponding box plot illustrating the variances for automata.

We observe a linear convergence of the method for all values of . For larger number of automata the convergence slows down. This can be explained by the fact that the upper bound increases in , i.e., the convergence rates are closer to one. Especially for large and strongly varying parameters a tighter estimation for the maximum diagonal entries using low-rank formats might be advantageous to speed up the convergence. In creftype 10 the ranges for

given by the boxes are small, which indicates that there are only a few outliers given by the whiskers. We have confirmed our results using different numbers

of automata and block sizes (not shown). The size of the blocks as well as the ranks during the iteration have smaller impact on the convergence. In our tests we observe that the number of iteration steps needed to achieve a certain tolerance grows linearly in the number of automata but is independent of the block size . This indicates that the number of iteration steps is independent of changes in the ordering of automata and consequently of changes in the rank.

## 5 Conclusion and future work

Inspired by current research in tumor progression models, we considered a class of continuous-time Markov chains that describe interacting processes and suffer from the problem of state-space explosion. The corresponding time-marginal distribution is uniquely defined as the solution of a certain linear system. By representing the Markov chain via a Stochastic Automata Network with separable functional transitions, we obtained a low-rank tensor representation of the operator and the right-hand side of this linear system. This enabled us to derive an iterative method to compute a low-rank tensor approximation of the time-marginal distribution and hence to overcome the state-space explosion. The method guarantees that the entries of the approximation sum up to one as required for a probability distribution. We proved the convergence of the method. In numerical experiments focused on the concept of Mutual Hazard Networks we illustrated that the time-marginal distribution is well approximated with low rank. The method allows for consistently low ranks during the iteration, and linear convergence was observed independently of the number of processes/automata. A probability distribution, in addition to being normalized to one, must be non-negative. An approximation of a probability distribution should also satisfy these conditions. How to guarantee non-negativity and at the same time convergence will be part of our future research. Moreover, we observed that the approximation rank for the time-marginal distribution depends strongly on the structure of the tree tensor network and on the effects between automata. To minimize the approximation rank we plan to develop a strategy to determine an optimal tree tensor network structure a-priori.