1 Introduction
The dynamics of cancer can be studied using tumor progression models, cf. [Beerenwinkel2014]. These models describe the evolving genotype of a tumor as a continuoustime Markov chain. A model includes genomic loci that may or may not be mutated. It starts out with all mutations absent and then progressively accumulates mutations. The number of possible states of the tumor is thus . In typical applications one is interested in probability distributions over this state space that are far from stationary. Extrapolating the future course of a given tumor requires the computation of transient distributions. However, since the age of a tumor and thus the time point of an observation of the Markov chain is generally unknown, we study the timemarginal distribution, which is defined as the solution of a certain linear system. Today at least genes are known to drive tumor progression, cf. [Bailey2018], i.e., a single distribution would require the storage of more entries than there are atoms in the visible universe. This phenomenon is called statespace explosion [Buchholz2007] and renders classical methods for calculating or storing distributions infeasible. This problem is also known in other application areas, e.g., chemicalreaction networks [Anderson2010], the chemical master equation [Kazeev2014], Hamiltonian dynamics [Lubich2016], queuing networks [Chan1987]
, or stochastic neural networks
[Yamanaka1997]. To overcome the statespace explosion, we describe such Markov chains via Stochastic Automata Networks [Plateau2000] which provide a sparse representation of the infinitesimal generator as a short sum of Kronecker products. The main idea is to split up a large Markov chain into smaller ones, the socalled automata, which are interconnected. In tumor progression, for example, each automaton represents a driver mutation and encodes whether it is present or absent. When describing transitions, there are effects between individual automata, e.g., some mutations may favor or inhibit the occurrence of others. This leads to entries within the sparse representation of the generator that are functions of the current state. The explicit evaluation of these functions at each state would result in exponential cost. However, if these functions are separable, the structure as a short sum of Kronecker products without functional entries can be preserved and the exponential cost avoided. Assuming this separability, we represent the generator and also the operator of the system as a short sum of Kronecker products. We make the common assumption that the Markov chain starts always at a certain state, which implies that the righthand side of the system is a Kronecker product. To determine the marginal distribution, i.e., to solve the linear system, in a sparse way we exploit this structure and turn to socalled lowrank tensor formats. These are known to handle the statespace explosion by reducing the exponential cost in the number of automata to linear cost. In the context of lowrank tensors, a sum of Kronecker products is referred to as CANDECOMP/PARAFAC (CP) format [Carroll1970, Harshman1970]. For large Markov chains, the CP format has been used extensively to approximate, in particular, stationary distributions, e.g., in [Kulkarni2011], and to derive conditions for their existence, e.g., in [Fourneau2008]. However, the problem of finding a CP decomposition of a given tensor is NPhard in general, cf. [Hastad1990]. Therefore we focus on tree tensor networks, especially on the tensortrain [Oseledets2009, Ostlund1995, White1992, Loan2008] and the hierarchical Tucker format [Grasedyck2010, Hackbusch2009]. There are several known methods to determine an approximate solution of a linear system using lowrank tensor formats, see, e.g., [Grasedyck2013] for an overview. These methods can roughly be divided into optimizationbased approaches and iterative procedures including truncation. Truncation, i.e., approximation of a tensor with one of lower rank, is needed to keep computational and storage costs low during the iteration. The tensortrain format was successfully used for the computation of, e.g., transient distributions [Johnson2010], mean times to failure [Masetti2019], and stationary distributions [Kressner2016, Kressner2014]. There mainly optimizationbased approaches were presented. In [Buchholz2016] the hierarchical Tucker format was applied to reduce the storage cost for distributions and the computational cost for performing basic operations. Continuing in [Buchholz2017] adaptive truncation strategies for the computation of stationary distributions using iterative methods were presented. To allow for a probabilistic interpretation of an approximation, the latter should be a probability distribution itself, i.e., all entries should be nonnegative, and the sum of all entries should be equal to one. Similar to [Kressner2016, Buchholz2017, Kressner2014, Kulkarni2011, Masetti2019] we neglect the nonnegativity to ensure an errorcontrolled approach, see, e.g., [Kim2014] for an overview. We focus on an iterative solver to determine timemarginal distributions which guarantees that the sum of all entries of the approximation is one. In [Kressner2014]a power iteration based on a formulation of the stationary solution as an eigenvalue problem was derived, where after each application of the operator in the power iteration the current approximation was rescaled to ensure that it sums up to one. However, the marginal distribution we are interested in is not stationary, i.e., the righthand side of the linear system is nonzero. Rescaling the current approximation is not an option in this case since the final result would then solve the rescaled system rather than approximate the marginal distribution. Alternatively, one could formulate the system as an eigenvalue problem, but this is not straightforward because it is unclear how the righthand side should be modified for this purpose. Other popular iteration methods, e.g., projection methods, which are compatible with lowrank tensors do not allow us to implement our normalization condition either since the convergence of these methods cannot be guaranteed if rescaling takes place after every step. To overcome this problem, we derive a novel iterative method based on the Neumann series
[Dubois1979] and the uniformization method [Grassmann1977] using lowrank tensor formats. We verify that this method results in an approximation that sums up to one and prove its convergence. In our numerical experiments, we focus on the concept of Mutual Hazard Networks [Schill2019] for tumor progression. Our experiments illustrate that the marginal distribution can be approximated by lowrank tensors using the new algorithm. This work is organized as follows. In creftype 2 we derive the linear system that defines the timemarginal distribution of a continuoustime Markov chain. We represent its operator and righthand side in a sparse way based on Stochastic Automata Networks. In creftype 3 we introduce the concept of tensors and tree tensor networks. We present the tensortrain and the hierarchical Tucker format. Using these formats we derive an iterative method and prove its convergence. In creftype 4 we perform numerical experiments based on the concept of Mutual Hazard Networks. In creftype 5 we conclude.2 Statement of the problem
First we explain the main problem we address.
2.1 Timemarginal distribution
A continuoustime Markov chain is defined by its state space , its infinitesimal generator and an initial distribution . In this paper we assume that the state space is discrete. The generator is an operator which stores the rates of transition from state to another state in and the rates of staying in a state in . The probability distribution at time is defined by
(1) 
We make the common assumption that every trajectory starts at the same state, i.e., the initial distribution
is a canonical unit vector. Here, we assume that the observation time
is unavailable, such as, e.g., in tumor progression modeling, and thus must be treated as a random variable. Therefore, we are interested in a socalled
timemarginal distribution which is independent of the time and which we will call marginal distribution for brevity. Each entry of the marginal distribution indicates the probability of observing a stateat a random time. We follow the common assumption that the sampling time is an exponentially distributed random variable with mean
, i.e., . Similar approaches can be found, e.g., in [Beerenwinkel2009, Hjelm2006, Schill2019]. Thus, we have(2) 
In the following we assume that the spectrum is contained in the negative complex half plane . In this case the improper integral exists and
(3) 
Hence, the marginal distribution is defined as the unique solution of a linear system, since the operator is regular. To overcome the statespace explosion, we focus on the concept of Stochastic Automata Networks [Plateau2000] which offer a sparse representation of the infinitesimal generator .
2.2 Stochastic Automata Networks
For a continuoustime Markov chain of interacting processes the discrete state space factorizes in a natural way into the state spaces of the individual processes. Each process is itself a Markov chain over its own state space and is called a stochastic automaton . The set of stochastic automata is called a Stochastic Automata Network, cf. [Plateau2000]. The full state space consists of all possible combinations of states in each automaton, i.e., for and it is given by
(4) 
Each state specifies its state in each automaton . We now consider transitions between states in the full state space , which are given by one or more transitions between states within the individual state spaces . Such transitions are called functional/constant and synchronized/local according to the following definitions.

A transition is called functional if the transition rate is a function of the current state in . Otherwise it is called constant.

A transition in a state space may force or prevent transitions in another state space at the same time. Such transitions in are called synchronized. Otherwise they are called local.
It is known that the infinitesimal generator of a Stochastic Automata Network with automata can be represented as a sum of Kronecker products, cf. [Plateau2000]. To this end, the infinitesimal generator is split into two parts describing local and synchronizing transitions, respectively, i.e., . Since the local transitions only change the state in their corresponding automaton, we represent the local part as
(5) 
where denotes the identity in and describes the local transition in . The synchronized transitions are represented similarly. Instead of one term of Kronecker products, two terms are needed for each synchronized transition, cf. [Plateau2000]. The whole infinitesimal generator for automata and synchronized transitions is given by
(6) 
Instead of entries, one only has to store entries. Following the current state of the art in tumor progression models we only consider local transitions from now on.
2.3 Description of functional transitions
For Markov chains describing interacting processes, i.e., automata, we are interested in functional transitions, for which the entries of are functions. Every time such an entry is needed one would have to evaluate the function, which requires knowledge of the states of all automata. To analyze this problem more precisely, we concentrate on Markov chains satisfying the following assumptions:

The continuoustime Markov chain can be represented as a Stochastic Automata Network with automata. Each automaton has a state space with , and the state space of the network is given by .

There are only local and functional transitions.

For each automaton there exists an ordering of the states such that there is no transition from state to state for all . We denote the set of possible transitions in by .

The transition rate from state to in , where and differ only in automaton and satisfy , is separable and can be represented with parameters by

The time is an exponentially distributed random variable with mean , i.e., .
In the following we discuss these assumptions in more detail. Due to assumption 2 and 3, all possible transitions are covered in assumption 4. All other transition rates are zero. Each diagonal entry of the generator is given by minus its offdiagonal column sum, since each column sum is zero. Hence, the whole infinitesimal generator is already defined by the parameters for all states and , and we denote the corresponding generator as . The parameter with can be interpreted as a direct multiplicative effect of the state on the local transition from to in , while is a baseline rate of transition from to in . If a state has no direct effect on the transition from to in , then . As usually assumed, an automaton does not affect many others directly, and therefore many multiplicative effects, i.e., parameters , are equal to one. This corresponds to a sparsity of the interactions between the automata. Assumption 4 allows us to separate the functional transition rates and to transfer the parameters to the matrices corresponding to automaton within each Kronecker product in creftype 5. Thus, we represent all functional transitions within the Kronecker product structure without any functional entries:
(7)  
where has only one nonzero entry . By assumption 3, the generator is similar to a lower triangular matrix where the similarity transformation is obtained by permuting states following the order of each automaton. Hence, the spectrum consists of all diagonal entries
and the improper integral in creftype 3 exists.
2.4 Mutual Hazard Networks
A class of Markov chains satisfying assumptions 1 to 5 occurs in the context of tumor progression, the Mutual Hazard Network model [Schill2019]. There tumor progression is modeled as a Stochastic Automata Network over a discrete state space , and a state represents the genotype of a tumor. Each automaton represents a genomic event such as a point mutation, a copy number alteration, or a change in DNA methylation. The state space of each automaton is given by , where indicates that the genomic event has not occurred yet and that it has. Hence, the state space of the Markov chain is given by . All events are assumed to occur one after another and irreversibly, i.e., there are only local transitions from state to in . If an event has not yet occurred, this has no direct effect on all other events, i.e., for all . A parameter is interpreted as a promoting effect and as an inhibiting effect. Furthermore, time is considered as a random variable , cf. [Schill2019]. In summary, all assumptions 1 to 5 are satisfied. Similar conclusions can be obtained for several types of progression models, e.g., [Beerenwinkel2009, Hjelm2006]. The Mutual Hazard Network with genomic events corresponds to parameters and for all . Thus, the parameters can be collected in a matrix with . To allow for visualization and interpretation of the network, one assumes sparsity of the Mutual Hazard Network, i.e., many automata or events do not interact directly. In [Schill2019] optimal parameters
are found using maximumlikelihood estimation for a data set of tumors. Since the age of a detected tumor is unknown, the likelihood is given by the marginal distribution
, i.e., each entry is the probability under the model that a tumor has the genotype when it is observed at a random time. This requires for fixed parameters in each optimization step. Due to the statespace explosion, the computation of the marginal distribution in [Schill2019] based on classical methods was limited to about automata.2.5 Structure of the linear system
As the improper integral in creftype 2 exists, the timemarginal distribution depending on parameters is uniquely defined by
(8) 
Similar to the representation in creftype 7, the identity is a Kronecker product of smaller identities corresponding to the state spaces . The initial distribution is a canonical unit vector and thus a Kronecker product of canonical unit vectors corresponding to each . Hence, the operator and the righthand side of creftype 8 have a representation as a short sum of Kronecker products, which allows for efficient storage.
3 Lowrank method
We now compute the marginal distribution in a sparse way. To avoid losing sparse structures when performing arithmetic operations we keep the Kronecker products unexecuted and understand our operators and distributions as tensors.
3.1 Lowrank tensor formats
We view tensors as multidimensional generalizations of vectors and matrices, i.e., onedimensional and twodimensional tensors.
Definition 1 (tensor)
Let and be a Cartesian product of discrete index sets . An object is called a tensor of dimension . Each direction is called a mode of , and the cardinality of the th index set is called the mode size.
In our case, the index set corresponds to the state space , our distributions , , are tensors of dimension , and the automata correspond to the modes with sizes . Similar to the reshaping of the distribution vectors into tensors, one can reshape a tensor into a matrix. This concept, called matricization, corresponds to an unfolding of the tensor and is defined, in analogy to [Grasedyck2010], as follows.
Definition 2 (matricization)
Let and with and . The matricization of corresponding to is defined as with and
for all . In particular .
In the language of tensors, the structure of in creftype 7 can be generalized to the socalled CANDECOMP/PARAFAC (CP) format introduced in [Carroll1970, Harshman1970].
Definition 3 (CP format)
A tensor has a CP representation if there exist such that
(9) 
Then is called the CP representation rank, and the are called the CP cores of .
The infinitesimal generator in creftype 7 has CP representation rank , while the identity as well as the righthand side have rank . A core advantage of the CP format is the data sparsity in case of small representation rank : The representation creftype 9 of a tensor has storage cost in in contrast to for . The problem of finding conditions that guarantee the existence of a lowrank approximation of a given tensor is an active research topic of its own, see, e.g., [Bachmayr2017, Kressner2016, Plateau2007]. Since this goes beyond the scope of this article, we concentrate on numerical experiments which indicate that the marginal distribution can be approximated with low rank. To this end, we have to solve a linear system whose operator and righthand side have a CP representation. For an operator with CP rank it is unknown how to calculate its inverse analytically. Hence, we need a solver and arithmetic operations to compute the solution numerically. Most of these arithmetic operations lead to an increase in the representation rank and thus in storage. To avoid such an increase one would like to truncate a given CP tensor to lower rank, i.e., approximate the tensor with one of lower rank. As the set of tensors with CP rank at least is not closed for , lowrank approximation within the CP format is an illposed problem, cf. [Silva2008]. To overcome this drawback, we use the following lowrank tensor formats which allow for truncation in an errorcontrolled way.
3.2 Tree tensor network graphs
Tensors and many interactions between them can be illustrated via an undirected graph as a tensor network. Each vertex of a tensor network graph represents a particular tensor. The dimension of this tensor is given by the number of its edges and halfedges, where a halfedge is an edge that is connected to only one vertex. creftypecap 1 shows some examples of tensor network graphs.
An edge between two vertices can be interpreted as a contraction over the corresponding mode, i.e., one sums over the index set of this mode. The resulting network can again be viewed as a tensor whose dimension is given by the number of halfedges. For example, the contraction of a twodimensional tensor and a onedimensional tensor is the matrixvector product with
for all . The corresponding tensor network graph is shown in creftype 2 where the mode sizes are denoted by .
This concept can be extended to higher dimensions. For a twodimensional tensor , i.e., a matrix, one can use these networks to illustrate the singular value decomposition as shown in creftype 3. There, the mode sizes corresponding to the halfedges equal the original ones of , and the mode sizes corresponding to the edges are given by the matrix rank .^{1}^{1}1For a twodimensional tensor the matrix rank equals the minimal CP representation rank, cf. [Hackbusch2012].
Again this can be extended to higher dimensions, i.e., higherdimensional tensors can also be represented as contractions of auxiliary tensors. In analogy to the twodimensional case, the mode sizes of the halfedges correspond to those of the original tensor, and the mode sizes of the edges generalize the concept of matrix rank. This generalized rank allows for an errorcontrolled way of truncation in tree tensor networks. Several tensor formats based on tree tensor networks are known. Here, we focus on the tensortrain format [Oseledets2009, Ostlund1995, White1992, Loan2008] and the hierarchical Tucker format [Grasedyck2010, Hackbusch2009].
3.3 Tensortrain and hierarchical Tucker format
The tensortrain format was first introduced to the numerical analysis community in [Oseledets2009]. It is also known in other areas as matrix product states [Ostlund1995, White1992] or as linear tensor network [Loan2008]. A tensor of dimension in the tensortrain format is represented as contractions of socalled core tensors. Each core tensor represents one mode and is connected to its forward and backward neighbors. The core tensors themselves are of dimension three, or two in case of the first and the last mode. For a given tensortrain representation, its rank is defined as the tuple of all mode sizes corresponding to edges. creftype 4 shows an illustrative example.
Instead of the original tensor one needs to store the core tensors. Thus, the required storage cost for a tensortrain representation with dimension , mode sizes all equal to and representation rank componentwise bounded by is in .
In our case the dimension of the distribution tensors is equal to the number of automata, and the mode sizes are given by the number of states in each automaton.
Note that in general the representation rank can depend on the ordering of the modes/automata, cf. [Grasedyck2011, Oseledets2009].
The hierarchical Tucker format was introduced in [Hackbusch2009] and further analyzed in [Grasedyck2010].
As the name suggests, the hierarchical Tucker format is based on a hierarchical binary tree structure defining the arrangement of the auxiliary tensors.
Similar to the tensortrain format there are auxiliary tensors each of which represents one mode of the original tensor. These twodimensional tensors, the socalled frames, build the leaves of the binary tree. The twodimensional tensor at the root of the tree and the threedimensional tensors at all other levels are called transfer tensors. For a given hierarchical Tucker representation, its rank is defined as the tuple of all mode sizes corresponding to the edges of the tree, i.e., where denotes the given tree. creftype 5 shows an illustrative example.
Note that in the hierarchical Tucker format we have many options for the structure of the binary tree and that the rank depends on our choice, cf. [Grasedyck2011].
Instead of the original tensor the frames and transfer tensors have to be stored. Thus, the storage cost for a hierarchical Tucker representation of a dimensional tensor with maximum mode size and a representation rank componentwise bounded by is in .
In our case the dimension of the distribution tensors is again equal to the number of automata, and the mode sizes are the number of states for each automaton.
Thus, in both the tensortrain and the hierarchical Tucker format, the storage cost grows only linearly in the dimension . This allows for efficient storage and prevents the statespace explosion provided we have low ranks.
It is easily possible to convert a CP representation of rank to a tensortrain or a hierarchical Tucker representation, where all rank components are bounded by , cf. [Hackbusch2012].
Similar to the CP format both formats provide arithmetic operations like addition, application of operators, or scalar products.
Since the storage cost increases with the square or cube of the rank and arithmetic operations also increase the rank, we need a way to truncate a tensor to lower rank in an errorcontrolled way.
In tree tensor networks we use a generalization of the singular value decomposition to truncate in a quasioptimal way, see [Oseledets2009] for the tensortrain and [Grasedyck2010] for the hierarchical Tucker format.
There it is shown that the error in the Frobenius norm caused by the truncation of a tensor to rank for a given tree is bounded by
(10) 
where denotes the truncation operator, is the th singular value of the matricization , is a best rank approximation, and is a small constant. According to creftype 10, a steep decay in the singular values of the corresponding matricization leads to a small relative truncation error, i.e., the tensor can be well approximated with one of low rank. Inspired by [Hackbusch2012], creftype 1 summarizes some arithmetic operations and their cost in the tensortrain and the hierarchical Tucker format, see [Grasedyck2010, Hackbusch2012, Oseledets2009]. creftypecap 1 shows that using those lowrank tensor formats reduces the exponential computational cost in the dimension to linear cost.
Operation  Cost in  Cost in 

Storage  
Addition  
Evaluation  
Scalar product  
Operator application  
Truncation 
Operations and their cost in tensortrain and hierarchical Tucker format for dimensional tensors with maximum mode size and rank componentwise bounded by .
Since the hierarchical Tucker format is based on a binary tree, it is possible to do many of the arithmetic operations and especially the costly truncation in parallel level by level of the tree as described in [Grasedyck2017]. Thus, for a balanced tree and dimension the runtime can often be reduced from to , cf. [Grasedyck2017].
3.4 Lowrank method
We now make use of these lowrank tensor formats to approximate the marginal distribution as the solution of creftype 8. Since the exact solution is a probability distribution, its entries sum up to one, i.e., it fulfills
(11) 
where is the tensor of all ones. To allow for a probabilistic interpretation of an approximation of , we need to treat creftype 11 as an additional constraint. To this end, we now present an iterative method based on the Neumann series [Dubois1979] and the uniformization method [Grassmann1977]. Let be a bound on the diagonal entries of the infinitesimal generator, then
Owing to the ordering assumption 3 and the fact that all diagonal entries of are nonpositive, the spectral radius is smaller than one, i.e., , and so the Neumann series converges. Hence, we can represent the marginal distribution as
(12) 
Alternatively we can derive creftype 12 using the uniformization method. Here, the idea is to describe a continuoustime Markov chain by a discretetime Markov chain with a time increment that is exponentially distributed. With this interpretation, we write the timedependent probability distribution in creftype 1 as
for a time . Note that is the transition probability matrix of a discretetime Markov chain, since is a bound on the diagonal entries of . Marginalization of time similar to creftype 2 and substitution leads to creftype 12:
where denotes the gamma function. A natural approximation based on creftype 12 would be
for . Based on the properties of the Neumann series this sequence converges linearly to , but does not satisfy the normalization condition creftype 11 for any . As is a transition probability matrix, its application to a probability distribution leads to a probability distribution again satisfying creftype 11, and thus
for all . By scaling each element of the sequence, we obtain
(13) 
for , which is now an approximation to that satisfies creftype 11. We prove its linear convergence to in the following theorem.
Theorem 1
Let be some given parameters, be the solution of the corresponding linear system creftype 8, and be defined by creftype 13 for all . Then converges linearly to as approaches infinity, i.e.,
where .
Proof
For , and any we have
Furthermore we obtain
Since , the linear convergence follows.
In order to employ lowrank tensor formats, we extend the iterative method corresponding to creftype 13 by truncation. As shown in [Hackbusch2008], a convergent iterative method combined with truncation still converges if the truncation error is sufficiently small. However, the truncation could violate condition creftype 11. Therefore, we have to extend our method to ensure that
for all where denotes the truncation operator. We implement these conditions by rescaling after truncation as shown in creftype 1, see lines and . The procedure should be continued until the norm of the relative residual is smaller than a given tolerance .
In creftype 1 we have to choose a bound on the diagonal entries of . The cost for computing all diagonal entries to find the maximum is in . In case of parameters that vary little,
(14) 
can be used as an inexpensive bound. For the computation of creftype 14 we need comparisons and evaluations^{2}^{2}2Strictly speaking, the computational cost of the method is therefore quadratic, rather than linear, in . However, only needs to be precomputed once, the computational cost of which is negligible. of the parameter where is the maximum number of possible transitions in an automaton. In the case of strongly varying parameters , creftype 14 may be a gross overestimation for the diagonal entries of . The question of how to determine a tighter bound with polynomial effort using lowrank tensor formats will be dealt with in future work.
4 Numerical experiments
We illustrate our method for the computation of timemarginal distributions in numerical experiments using the following setting based on the model of Mutual Hazard Networks.
4.1 Setting
We consider automata and sparse parameters with a particular blockdiagonal form. Each block of size characterizes a subset of automata which directly affect one another. There are no direct effects between automata from different blocks. In each block, we would like the effects between neighboring automata to be stronger than those between automata that are farther apart. Therefore, we draw the entries
of the blocks from a normal distribution with mean
(and restrict them to ). Two possible examples for automata and block size are given below:Parameters of the same color follow the same distribution. For the application of creftype 1 we always choose the bound as in creftype 14, which for Mutual Hazard Networks can be reduced to
(15) 
We use a canonical balanced tree tensor network for the application of the hierarchical Tucker format, i.e., the automata are assigned to the leaves following their ordering, see creftype 6. We perform all experiments for randomly generated sample parameters for each combination of block size and number of automata. Unless stated otherwise, we compute lowrank approximations of the marginal distribution using creftype 1 with a maximum relative truncation error of . The algorithm stops when the norm of the relative residual is smaller than a tolerance value of . The mean values we compute are arithmetic means.
4.2 Study of singular values
One fundamental assumption for solving linear systems using lowrank tensor methods is that a solution can be well approximated by a tensor of low rank. According to the error bound in creftype 10, the truncation error is determined by the singular values of the matricization corresponding to the chosen tree network. A fast decay of those singular values indicates that a tensor can be approximated accurately by one of low rank. To analyze this issue, we solve creftype 8 using classical matrix methods of MATLAB [Matlab2019] and compute the singular values of the corresponding matricization using the htucker toolbox [Kressner2012]. creftypecap 6 shows the decay of the singular values of each matricization corresponding to the canonical balanced tree for automata and blocks of size . For each vertex the semilogarithmic plot displays the means of the singular values.
We observe that the singular values, especially those close to the root, exhibit an exponential decay. The two matricizations closest to the root are transposes of each other, and therefore their singular values are identical. The smallest of the singular values are indistinguishable from zero and therefore cannot be displayed in the semilogarithmic plot. The exponential decay of the singular values indicates that the marginal distribution is well approximated with low rank.
4.3 Study of the tree structure
In general, for tree tensor networks the rank depends on the structure of the tree. As already observed for large Markov chains using the tensortrain format, a change in the ordering of the automata within the tensortrain tree network can affect the rank, cf. [Masetti2019]. In the following, we focus on the hierarchical Tucker format and study how the ordering of the automata in the leaves of the tree affects the lowrank approximability. We preserve the balanced binary structure of the tree because this is advantageous for parallelization, cf. [Grasedyck2017]. We already studied the singular values of the matricizations corresponding to the canonical balanced tree, see creftype 6. We change only the arrangement of the automata in the leaves of the tree and compute the marginal distribution again. The decay of the singular values for each matricization is shown in creftype 7 where the semilogarithmic plot at each vertex displays the means of the singular values.
Comparing creftypeplural 76, we observe that the choice of the canonical binary tree, i.e., the original ordering in the leaves, results in a significantly faster decay of the singular values close to the root. creftypecap 7 shows that the singular values closest to the root also have an exponential decrease, but at a much slower rate. Note that blocks of size indicate that the automata interact directly with one another and are thus highly correlated. The same holds for the automata . There are no direct interactions between automata of different blocks, i.e., these are only weakly correlated. Hence, the modified balanced tree separates highly correlated automata, which explains the slower decline of the singular values in level (level being the root). In contrast, the canonical balanced tree separates weakly correlated automata, leading to a faster decline of the singular values. We will make similar observations when studying the approximation rank.
4.4 Study of the approximation rank
As the rank in a tree tensor network is a tuple depending on the underlying tree , we consider the maximum rank and the effective rank . The effective rank of a tensor representation is defined such that the storage cost for this representation equals the cost to store one with rank . Since intuitively a rank should be an integer, is rounded up to the nearest integer. We compute lowrank approximations of the marginal distribution using creftype 1 as described in creftype 4.1
. In the plots we only show the mean values of the ranks, since their variance among the
realizations of was very small. creftypepluralcap 88 show the maximum and effective approximation rank of as a function of the number of automata for blocks of size .We observe that is approximated to a tolerance of with low rank in all cases.
In particular, for all block sizes considered, the effective rank increases less than linearly in the number of automata and is bounded by .
As expected, the maximum rank reacts much more sensitively to changes in and .
For constant block sizes and no smooth increase in the number of automata can be detected.
Instead, maximum and effective rank oscillate in the number of automata.
Particularly low ranks typically occur when the strongly correlated automata are not separated by the tree structure.
For example, in case of automata with a block size of the strongly correlated automata are split only in the lower levels of the tree, and we observe a very low rank.
For automata with a block size of , however, the strongly correlated automata , for example, are separated already in the first level of the tree, and we observe a much higher maximum rank.
A closer look at the rank components reveals an increase especially at the points where the blocks are split up.
In the case of the block size increases linearly in the number of automata.
In creftype 8 we see a strong increase of the maximum rank for automata.
However, the effective rank grows smoothly and logarithmically in the number of automata, see creftype 8.
This indicates that the maximum rank occurs only sporadically.
A closer look shows that the maximum ranks again occur only in isolated cases at the vertices where highly correlated automata are separated.
This observation and the fact that the effective rank is small implies that most rank components are small as well.
These results suggest that neither the number of automata nor the size of the blocks alone are responsible for an increase in rank, but that especially the distribution of the automata in the tree has a large effect.
How to construct an appropriate tree in order to keep ranks low using apriori information on the parameters is a topic of ongoing research.
We also analyze how the approximation rank is influenced by the maximum relative truncation error allowed within the algorithm.
To this end, we compute a lowrank approximation of the marginal distribution for automata and blocks of size using creftype 1. The iteration stops when the norm of the relative residual for creftype 8 is smaller than .
In order to achieve this for all sample parameters, has to be chosen smaller than . If we allow for a higher truncation error, the iteration stagnates in some cases.
We again restrict ourselves to plotting mean values since the variances are small.
creftypepluralcap 99 show semilogarithmic plots of the mean of the maximum and effective rank of the approximation of with automata and block size as a function of the maximum relative truncation error for several values of the tolerance .
For maximum and effective rank are nearly constant with and . For , we observe that there are only small differences in maximum and effective rank, respectively. In these cases, the maximum rank increases logarithmically in the maximum relative truncation error , see creftype 9, and the effective rank is nearly constant for all truncation errors, see creftype 9. For very small all curves converge. This observation suggests, on the one hand, that can be accurately approximated with a tensor of maximum rank and effective rank , since a more accurate truncation, i.e., smaller , has only very small impact on the approximation ranks. On the other hand, the conclusion that the ranks are low and nearly independent of the tolerance value indicates that the ranks during the iteration are also low. This allows not only for efficient storage of the resulting approximation but also for efficient computation using creftype 1.
4.5 Study of the method
We now study the convergence of creftype 1. creftypecap 10 and creftype 10 show semilogarithmic plots of the norm of the relative residual as a function of the iteration step for block size and automata. creftypecap 10 displays the mean value of the relative residual and creftype 10 additionally the corresponding box plot illustrating the variances for automata.
We observe a linear convergence of the method for all values of . For larger number of automata the convergence slows down. This can be explained by the fact that the upper bound increases in , i.e., the convergence rates are closer to one. Especially for large and strongly varying parameters a tighter estimation for the maximum diagonal entries using lowrank formats might be advantageous to speed up the convergence. In creftype 10 the ranges for
given by the boxes are small, which indicates that there are only a few outliers given by the whiskers. We have confirmed our results using different numbers
of automata and block sizes (not shown). The size of the blocks as well as the ranks during the iteration have smaller impact on the convergence. In our tests we observe that the number of iteration steps needed to achieve a certain tolerance grows linearly in the number of automata but is independent of the block size . This indicates that the number of iteration steps is independent of changes in the ordering of automata and consequently of changes in the rank.5 Conclusion and future work
Inspired by current research in tumor progression models, we considered a class of continuoustime Markov chains that describe interacting processes and suffer from the problem of statespace explosion. The corresponding timemarginal distribution is uniquely defined as the solution of a certain linear system. By representing the Markov chain via a Stochastic Automata Network with separable functional transitions, we obtained a lowrank tensor representation of the operator and the righthand side of this linear system. This enabled us to derive an iterative method to compute a lowrank tensor approximation of the timemarginal distribution and hence to overcome the statespace explosion. The method guarantees that the entries of the approximation sum up to one as required for a probability distribution. We proved the convergence of the method. In numerical experiments focused on the concept of Mutual Hazard Networks we illustrated that the timemarginal distribution is well approximated with low rank. The method allows for consistently low ranks during the iteration, and linear convergence was observed independently of the number of processes/automata. A probability distribution, in addition to being normalized to one, must be nonnegative. An approximation of a probability distribution should also satisfy these conditions. How to guarantee nonnegativity and at the same time convergence will be part of our future research. Moreover, we observed that the approximation rank for the timemarginal distribution depends strongly on the structure of the tree tensor network and on the effects between automata. To minimize the approximation rank we plan to develop a strategy to determine an optimal tree tensor network structure apriori.
Acknowledgment
We thank Tim A. Werthmann for his critical reading of and suggestions for this article.