## 1 Introduction

Cellular processes are often subject to inherent stochasticity. The dynamics of gene expression, for instance, is influenced by single random events (e.g. transcription factor binding); hence, models that take this randomness into account must monitor discrete molecular counts and the reaction events that change these counts. Discrete-state continuous-time Markov chains have successfully been used to describe networks of chemical reactions over time that correspond to the basic events of such processes. The time-evolution of the corresponding probability distribution is given by the chemical master equation, a system of differential equations with one equation for each possible combination of molecular counts. Its numerical solution, however, is extremely challenging because of the enormous size of the underlying state-space.

In contrast, analysis approaches based on sampling, such as the Stochastic Simulation Algorithm (SSA) [gillespie77], can be applied independently of the size of the model's state-space. However, statistical approaches are costly, since a large number of simulation runs is necessary to reduce the statistical inaccuracy of estimators. This problem is particularly severe if reactions occur on multiple time scales or if the event of interest is rare. A particularly popular technique to speed up simulations is τ-leaping, which applies multiple reactions in one step of the simulation. However, such multi-step simulations rely on certain assumptions about the number of reactions in a given time interval. These assumptions are typically only approximately fulfilled and therefore introduce approximation errors on top of the statistical uncertainty of the considered point estimators.

Variance reduction techniques are an alternative to approaches that decrease the computational costs of each SSA run. By reducing the variance of the estimators, these methods need fewer runs to achieve high statistical accuracy.

In this work, we approach the variance reduction problem by considering an (infinite) set of differential equations for the evolution of the statistical moments of the molecular counts. Instead of applying a moment closure and performing numerical integration [ale2013general, engblom2006computing], we use these equations to derive a combination of moment constraints. Such moment constraints have already been used for parameter estimation [backenkohler2018moment] and for computing moment bounds using semi-definite programming [dowdy2018dynamic, ghusinga2017exact]. Here, we interpret these constraints as random variables that are correlated with the estimators of interest, which are usually given as functions of population variables. These constraints can be used as (linear) control variates in order to improve the final estimate and reduce its variance [lavenberg1982statistical, szechtman2003control]. The method is easy to grasp on an intuitive level: if a control variate is positively correlated with the function to be estimated, then we can use the estimate of the variate to adjust the target estimate.

The incorporation of control variates into the SSA introduces additional simulation costs for the calculation of the constraint values. These values are integrals over time, which we accumulate along the piece-wise constant trajectories. This introduces a trade-off between the variance reduction achieved by the control variates and the increased simulation cost. This trade-off is expressed as the product of the variance reduction ratio and the cost increase ratio.

For a good trade-off, it is crucial to find an appropriate set of control variates. Here we propose a class of constraints which is parameterized by a moment vector and a weighting parameter, resulting in infinitely many choices.

In previous work [backenkohler2019control], we have proposed an algorithm that learns a set of control variates through refinement of an initial set. This initial set of control variates is based on samples of the time-weighting parameter. Each control variate is then checked for effectiveness in isolation. Furthermore, the set is refined by considering variables pairwise to determine redundancies.

In this work, we improve on the initial selection of control variates. The initial set is built using a splitting approach akin to sequential Monte Carlo methods: over multiple rounds, new control variates are sampled based on their performance in prior iterations. This way, we construct a set of candidate variates and select a subset using a greedy approach, which takes into account the correlation between variates. A benefit of this algorithm is that it is less sensitive to user input. In particular, no heuristic redundancy threshold has to be fixed, making this approach more flexible.

This approach applies to the Monte Carlo estimation of any quantity that can be expressed in terms of expected values, such as probabilities of complex path properties. Another advantage of our technique is that the increased efficiency does not come at the price of an additional approximation error, as is the case for methods based on moment approximations or multi-step simulations.

This paper is structured as follows. In Section 2 we give a brief survey of methods and tools related to efficient stochastic simulation and moment techniques. In Section 3 we introduce the common stochastic semantics of stochastic reaction networks. From these semantics we show in Section 4 how to derive constraints on the moments of the transient distribution. The variance reduction technique of control variates is described in Section 5. We show the design of an algorithm using moment constraints to reduce sample variance in Section 6. The efficiency and other characteristics of this algorithm are evaluated on four non-trivial case studies in Section 7. Finally, we discuss the findings and give possibilities for further work in Section 8.

## 2 Related Work

Much research has been directed at the efficient analysis of stochastic reaction networks. Usually this research focuses on improving efficiency by making certain approximations.

If the state-space is finite and small enough, one can deal with the underlying Markov chain directly. There are also cases where the transient distribution has infinitely large support but one can still work with explicit state probabilities. To this end, one can fix a finite state-space that should contain most of the probability [munsky2006finite]. Refinements of this method work dynamically and adjust the state-space according to the transient distribution [andreychenko2011parameter, henzinger2009sliding, mateescu2010fast].

On the other end of the spectrum there are mean-field approximations, which model the mean densities faithfully in the system size limit [bortolussi2013continuous]. In between lie techniques such as moment closure [singh2006lognormal], which consider not only the mean but also the variance and other higher-order moments. These methods depend on ad-hoc approximations of higher-order moments to close the ODE system given by the moment equations. Yet another class of methods treats molecular counts as continuous and approximates the dynamics in this continuous space, e.g. the system size expansion [van1992stochastic] and the chemical Langevin equation [gillespie2000chemical].

While the moment closure method uses ad-hoc approximations of higher-order moments to facilitate numerical integration, such approximations can be avoided in some contexts. For the equilibrium distribution, for example, the time-derivative of all moments is equal to zero. This directly yields constraints that have been used for parameter estimation at steady-state [backenkohler2018moment] and for bounding moments of the equilibrium distribution using semi-definite programming [ghusinga2017estimating, ghusinga2017exact, kuntz2017rigorous]. The latter technique of bounding moments has been successfully adapted to the context of transient analysis [dowdy2018dynamic, sakurai2017convex, sakurai2019bounding]. We adapt the constraints proposed in these works to improve statistical estimations via stochastic simulation (cf. Section 4).

While the above techniques give a deterministic output, stochastic simulation generates single executions of the stochastic process [gillespie77]. This necessitates accumulating large numbers of simulation runs to estimate quantities, which adds a significant computational burden. Consequently, some effort has been directed at lowering this cost. A prominent technique is τ-leaping [gillespie2001approximate], which performs multiple reactions in one step instead of only a single one. Another approach is to find approximations that are specific to the problem at hand, such as approximations based on time-scale separations [cao2005slow, bortolussi2015efficient].

Multilevel Monte Carlo methods have been applied to time-inhomogeneous SRNs [anderson2018low]. In this technique, estimates obtained at different approximation levels are combined.

In the case of rare events, approaches based on importance sampling [gillespie2009refining, roh2010state] and importance splitting [jegourel2013importance] have been adapted to the setting of reaction networks. Importance sampling relies on a suitable change of the underlying probability measure, which is often handcrafted for each model, continually refined using the cross-entropy method [daigle2011automated, roh2011state], or derived from Gaussian approximations of the process [gillespie2019guided]. Importance splitting decomposes the state space into level sets and estimates the rare event probability by a level-based splitting of sample paths before reaching the set of interest. It requires constructing a model-specific level function together with the corresponding splitting thresholds.

Recently, randomized quasi-Monte Carlo (RQMC) approaches have been adapted to the application area of stochastic reaction networks [beentjes2019quasi] and improved for the case of long simulation horizons with an extension to array-RQMC [puchhammer2021variance]. These methods are based on a tau-leaping approach in which time is discretized, and they require a level/importance function or a costly multivariate sort.

## 3 Stochastic Reaction Networks

A stochastic reaction network (SRN) describes the interactions between a set of species $S_1, \dots, S_n$ in a well-stirred reactor. Since we assume that all reactant molecules are spatially uniformly distributed, we just keep track of the overall amount of each molecule. Therefore the state-space is given by $\mathcal{S} \subseteq \mathbb{N}^n$. The interactions are expressed as a set of *reactions* with certain inputs and outputs, given by the vectors $v_j^-$ and $v_j^+$ for reaction $j$, respectively. Such reactions are denoted as

$$\sum_{i=1}^{n} v_{ji}^-\, S_i \xrightarrow{c_j} \sum_{i=1}^{n} v_{ji}^+\, S_i. \qquad (1)$$

The reaction rate constant $c_j$ gives us information on the propensity of the reaction. If just a constant is given, *mass-action* propensities are assumed. In a stochastic setting, for some state $x$ these are

$$\alpha_j(x) = c_j \prod_{i=1}^{n} \binom{x_i}{v_{ji}^-}. \qquad (2)$$

The system's behavior is described by a stochastic process $\{X_t\}_{t \geq 0}$. The propensity function $\alpha_j$ gives the infinitesimal probability of a reaction of type $j$ occurring, given a state $x$. That is, for a small time step $\Delta t$,

$$\Pr(X_{t + \Delta t} = x + v_j \mid X_t = x) = \alpha_j(x)\,\Delta t + o(\Delta t), \qquad (3)$$

where $v_j = v_j^+ - v_j^-$ is the net change caused by reaction $j$.

This induces a corresponding continuous-time Markov chain (CTMC) on $\mathcal{S}$ with generator matrix (assuming a fixed enumeration of the state space)

$$Q_{x,y} = \sum_{j:\, v_j = y - x} \alpha_j(x) \quad \text{for } x \neq y, \qquad Q_{x,x} = -\sum_{j} \alpha_j(x). \qquad (4)$$

Accordingly, the time-evolution of the process' distribution, given an initial distribution $\pi_0$, is given by the Kolmogorov forward equation, i.e. $\frac{d}{dt}\pi_t = \pi_t Q$, where $\pi_t(x) = \Pr(X_t = x)$. Written out for a single state, it is commonly referred to as the *chemical master equation* (CME)

$$\frac{d}{dt}\pi_t(x) = \sum_{j} \big( \alpha_j(x - v_j)\,\pi_t(x - v_j) - \alpha_j(x)\,\pi_t(x) \big). \qquad (5)$$

A direct solution of (5) is usually not possible. If the part of the state-space carrying non-negligible probability is suitably small, a state-space truncation can be performed. That is, (5) is integrated on a possibly time-dependent subset of the state-space [henzinger2009sliding, munsky2006finite, spieler2014numerical]. Instead of directly analyzing (5), one often resorts to simulating trajectories. A trajectory over the interval $[0, T]$ is a sequence of states $x_0, x_1, \dots, x_k$ and corresponding jump times $t_0 \leq t_1 \leq \dots \leq t_k \leq T$. We can sample trajectories of $X$ by using stochastic simulation [gillespie77].
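To make the sampling step concrete, a direct-method SSA for a simple birth-death system can be sketched in a few lines of Rust (the language used for the experiments in Section 7). This is purely an illustrative sketch, not the paper's implementation: the function names, the rate-constant arguments, and the small linear-congruential generator standing in for a proper RNG library are assumptions.

```rust
// Minimal direct-method SSA sketch for a birth-death system
// (illustrative; names, constants, and RNG are assumptions).
struct Lcg(u64);
impl Lcg {
    // Tiny linear-congruential generator standing in for a real RNG crate.
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 11) as f64) / ((1u64 << 53) as f64)
    }
}

/// Simulate the reactions `0 -> A` (rate c1) and `A -> 0` (rate c2 * x)
/// up to the horizon `t_max`, returning jump times and states.
fn ssa_birth_death(c1: f64, c2: f64, x0: u64, t_max: f64, rng: &mut Lcg) -> Vec<(f64, u64)> {
    let (mut t, mut x) = (0.0, x0);
    let mut traj = vec![(t, x)];
    loop {
        let a1 = c1;            // birth propensity
        let a2 = c2 * x as f64; // death propensity
        let a0 = a1 + a2;       // total exit rate of the current state
        // Exponentially distributed waiting time with rate a0.
        t += -(1.0 - rng.next_f64()).ln() / a0;
        if t > t_max {
            break;
        }
        // Choose a reaction with probability proportional to its propensity.
        if rng.next_f64() * a0 < a1 {
            x += 1;
        } else {
            x -= 1;
        }
        traj.push((t, x));
    }
    traj
}
```

Each trajectory is piece-wise constant between jump times, which is exactly the structure exploited later when accumulating time-weighted integrals along a run.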

Consider the birth-death model below as an example.

###### Model 1 (Birth-death process)

A single species $A$ has a constant production and a decay that is linear in the current amount of molecules. Therefore the model consists of two mass-action reactions

$$\varnothing \xrightarrow{c_1} A, \qquad A \xrightarrow{c_2} \varnothing,$$

where $\varnothing$ denotes no reactant or no product, respectively.

For Model 1 the change of probability mass in a single state is described by expanding (5):

$$\frac{d}{dt}\pi_t(x) = c_1\,\pi_t(x - 1) + c_2\,(x + 1)\,\pi_t(x + 1) - (c_1 + c_2\,x)\,\pi_t(x).$$

We can generate trajectories of this model by choosing either reaction with a probability that is proportional to its rate given the current state $x$. The jump time is determined by sampling from an exponential distribution with rate $c_1 + c_2\,x$.

## 4 Moment Constraints

The time-evolution of $\mathbb{E}[f(X_t)]$ for some function $f$ can be directly derived from (5) by computing the sum $\sum_x f(x)\,\frac{d}{dt}\pi_t(x)$, which yields

$$\frac{d}{dt}\,\mathbb{E}[f(X_t)] = \sum_{j} \mathbb{E}\big[\alpha_j(X_t)\,\big(f(X_t + v_j) - f(X_t)\big)\big]. \qquad (6)$$

While many choices of $f$ are possible, for this work we will restrict ourselves to
monomial functions $f(x) = x^m = \prod_{i=1}^{n} x_i^{m_i}$,
i.e. the *non-central moments* of the process.
The *order* of a moment is the sum over the exponents,
i.e. $\sum_{i=1}^{n} m_i$.
The integration of (6) with such functions is well-known in the context of
moment approximations of SRN models.
For most models the arising ODE system is infinitely large, because the time-derivative of
low order moments usually depends on the values of higher order moments.
To close this system, *moment closures*, i.e. ad-hoc approximations of higher order moments
are applied [schnoerr2015comparison].
The main drawback of this kind of analysis is that it is not known whether the chosen closure gives
an accurate approximation for the case at hand.
Here, such approximations are not necessary, since we will apply the moment dynamics in the context
of stochastic sampling instead of trying to integrate (6).

Apart from integration strategies, setting (6) to zero has been used as a constraint for parameter estimation at steady-state [backenkohler2018moment] and for bounding moments at steady-state [dowdy2018bounds, ghusinga2017exact, kuntz2017rigorous]. The extension of the latter has recently led to the adaptation of these constraints to a transient setting [dowdy2018dynamic, sakurai2019bounding]. Both transient constraint variants are derived analogously: multiplying (6) with a time-dependent, differentiable weighting function $w$ and integrating on $[0, T]$ yields [dowdy2018dynamic, sakurai2019bounding]

$$w(T)\,\mathbb{E}[f(X_T)] - w(0)\,\mathbb{E}[f(X_0)] = \int_0^T \dot{w}(t)\,\mathbb{E}[f(X_t)]\,dt + \int_0^T w(t) \sum_j \mathbb{E}\big[\alpha_j(X_t)\big(f(X_t + v_j) - f(X_t)\big)\big]\,dt. \qquad (7)$$

In the context of computing moment bounds via semi-definite programming, the choices $w(t) = t^k$ [sakurai2019bounding] and $w(t) = e^{\lambda (T - t)}$ [dowdy2018dynamic] have been proposed. While both choices proved to be effective in different case studies, relying solely on the latter, exponential type of weighting was sufficient. We can further forgo the time inversion such that $w(t) = e^{\lambda t}$.
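For completeness, the step behind this family of constraints is an ordinary integration by parts; in the notation above (with a generic differentiable weight $w$), it reads:

```latex
% Multiply (6) by w(t) and integrate over [0, T]; integrating the
% left-hand side by parts moves the derivative onto the weight:
\int_0^T w(t)\,\frac{d}{dt}\,\mathbb{E}[f(X_t)]\,dt
  = w(T)\,\mathbb{E}[f(X_T)] - w(0)\,\mathbb{E}[f(X_0)]
  - \int_0^T \dot{w}(t)\,\mathbb{E}[f(X_t)]\,dt .
```

Substituting (6) for the derivative inside the left-hand integral and rearranging gives the transient constraint.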

By expanding the rate functions $\alpha_j$ and the differences $f(X_t + v_j) - f(X_t)$ in (7) and substituting the exponential weight function, we can re-write (7) as

$$e^{\lambda T}\,\mathbb{E}[f(X_T)] - \mathbb{E}[f(X_0)] = \sum_{m} c_m \int_0^T e^{\lambda t}\,\mathbb{E}[X_t^m]\,dt \qquad (8)$$

with coefficients $c_m$ and moment vectors $m$ defined accordingly. Assuming the moments remain finite on $[0, T]$, we can define the random variable

$$V = e^{\lambda T}\,f(X_T) - f(X_0) - \sum_{m} c_m \int_0^T e^{\lambda t}\,X_t^m\,dt \qquad (9)$$

with $\mathbb{E}[V] = 0$.

Note that a realization of $V$ depends on the whole trajectory over $[0, T]$. Thus, for the integral terms in (9) we have to compute sums

$$\int_0^T e^{\lambda t}\,X_t^m\,dt = \sum_{\ell} x_\ell^m\,\frac{e^{\lambda t_{\ell + 1}} - e^{\lambda t_\ell}}{\lambda} \qquad (10)$$

over a given trajectory, exploiting that the state is constant between jump times. This accumulation is best done during the simulation to avoid storing the whole trajectory. Algorithm 1 specifies the stochastic simulation of system trajectories while computing the values of the integrals (10) alongside the trajectory itself.
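Because the trajectory is piece-wise constant, each segment contributes a closed-form term to the weighted integral. A minimal Rust sketch of this accumulation (the function name and interface are illustrative assumptions; for clarity it takes a finished trajectory, whereas the actual algorithm accumulates during simulation):

```rust
/// Accumulate the time-weighted integral of x^m over a piece-wise constant
/// trajectory given as (jump time, state) pairs: on each segment the state is
/// constant, so the integral of x^m * e^{lambda t} has a closed form.
/// (Illustrative sketch, not the paper's implementation.)
fn weighted_moment_integral(traj: &[(f64, f64)], t_max: f64, m: u32, lambda: f64) -> f64 {
    let mut acc = 0.0;
    for l in 0..traj.len() {
        let (t_start, x) = traj[l];
        let t_end = if l + 1 < traj.len() { traj[l + 1].0 } else { t_max };
        // Integral of e^{lambda t} over [t_start, t_end].
        let weight = if lambda != 0.0 {
            ((lambda * t_end).exp() - (lambda * t_start).exp()) / lambda
        } else {
            t_end - t_start // e^{0 * t} = 1: plain time integral
        };
        acc += x.powi(m as i32) * weight;
    }
    acc
}
```

For a trajectory that stays at a constant value, the accumulated sum coincides with the exact integral, which gives a convenient sanity check for an implementation.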

A simulation using this algorithm is more expensive than a plain SSA run. For the method to be efficient, the variance reduction (Section 5) needs to overcompensate for this increased cost.

## 5 Control Variates

Now, we are interested in the estimation of some quantity $\mathbb{E}[Z]$ by stochastic simulation. Let $Z^{(1)}, \dots, Z^{(n)}$ be independent samples of $Z$. Then the sample mean $\hat{Z}_n = \frac{1}{n}\sum_{i=1}^{n} Z^{(i)}$ is an estimate of $\mathbb{E}[Z]$. By the central limit theorem, $\hat{Z}_n$ is asymptotically normally distributed with mean $\mathbb{E}[Z]$ and variance $\sigma^2 / n$, where $\sigma^2 = \mathrm{Var}(Z)$.

Now suppose we know of a random variable $V$ with $\mathbb{E}[V] = 0$.
The variable $V$ is called a *control variate*.
If a control variate is correlated with $Z$, we can
use it to
reduce the variance of $\hat{Z}_n$ [glasserman2005large, nelson1990control, szechtman2003control, wilson1984variance].
For example, suppose we are running a set of simulations and consider a single
constraint.
If the estimated value of this constraint is larger than zero and we estimate a positive correlation
between the constraint and $Z$, we would, intuitively, like to decrease our
estimate accordingly.
This results in an estimation of the mean of the random variable

$$Z(b) = Z - b\,V$$

instead of $Z$. The variance of this variable is

$$\mathrm{Var}(Z(b)) = \mathrm{Var}(Z) - 2b\,\mathrm{Cov}(Z, V) + b^2\,\mathrm{Var}(V).$$

The optimal choice $b^* = \mathrm{Cov}(Z, V) / \mathrm{Var}(V)$ can be computed by considering the minimum of $\mathrm{Var}(Z(b))$. Then

$$\mathrm{Var}(Z(b^*)) = \mathrm{Var}(Z) - \frac{\mathrm{Cov}(Z, V)^2}{\mathrm{Var}(V)} = \big(1 - \rho_{ZV}^2\big)\,\mathrm{Var}(Z).$$

Therefore the variance is reduced by the factor $1 - \rho_{ZV}^2$, where $\rho_{ZV}$ is the correlation of $Z$ and $V$.
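On finite samples, the optimal coefficient is estimated from the sample covariance and variance. A minimal Rust sketch (generic names and synthetic data in place of actual simulation output; this is an illustration of the standard estimator, not the paper's code):

```rust
/// Estimate E[Z] using a single control variate V with E[V] = 0.
/// Returns (adjusted estimate, estimated optimal coefficient b).
/// (Illustrative sketch with generic names.)
fn cv_estimate(z: &[f64], v: &[f64]) -> (f64, f64) {
    let n = z.len() as f64;
    let mean = |s: &[f64]| s.iter().sum::<f64>() / n;
    let (mz, mv) = (mean(z), mean(v));
    // Sample covariance Cov(Z, V) and sample variance Var(V).
    let cov: f64 = z.iter().zip(v).map(|(a, b)| (a - mz) * (b - mv)).sum::<f64>() / n;
    let var: f64 = v.iter().map(|b| (b - mv) * (b - mv)).sum::<f64>() / n;
    let b = cov / var; // estimate of b* = Cov(Z, V) / Var(V)
    (mz - b * mv, b)   // sample mean of Z - b V, since E[V] = 0
}
```

In the degenerate case where each sample satisfies an exact linear relation to the variate, the adjusted estimate recovers the true mean from the samples alone, mirroring the perfect-correlation discussion below.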

If we have multiple control variates, we can proceed in a similar fashion. Now, let $V = (V_1, \dots, V_m)^\top$ denote a vector of control variates and let $\Sigma_V$ be the covariance matrix of $V$. As above, we estimate the mean of $Z - b^\top V$. The ideal choice of $b$ is the result of an ordinary least squares regression between $V$ and $Z$, i.e. $b = \Sigma_V^{-1}\,\mathrm{Cov}(V, Z)$. Then, asymptotically, the variance of this estimator is [szechtman2003control]

$$\big(1 - R^2\big)\,\frac{\sigma^2}{n}, \qquad (12)$$

where $R$ is the coefficient of multiple correlation between $Z$ and $V$. The factor $1 - R^2$ is commonly known as the fraction of variance unexplained [freedman2009statistical]. In practice, however, $b$ is unknown and needs to be replaced by an estimate $\hat{b}$. This leads to an increase in the estimator's variance. Under the assumption of $Z$ and $V$ having a multivariate normal distribution [cheng1978analysis, lavenberg1982statistical], the variance of the estimator is

$$\frac{n - 2}{n - m - 2}\,\big(1 - R^2\big)\,\frac{\sigma^2}{n}. \qquad (13)$$
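For a small number of variates, the regression coefficients can be computed explicitly. The following Rust sketch solves the two-variate case with an explicit 2x2 inverse (generic names; a real implementation would use a general linear solver):

```rust
/// Coefficient vector b = Sigma_V^{-1} Cov(V, Z) for two control variates,
/// computed from samples via an explicit 2x2 matrix inverse.
/// (Illustrative sketch with generic names.)
fn cv_coefficients_2(z: &[f64], v1: &[f64], v2: &[f64]) -> (f64, f64) {
    let n = z.len() as f64;
    let mean = |s: &[f64]| s.iter().sum::<f64>() / n;
    let (mz, m1, m2) = (mean(z), mean(v1), mean(v2));
    // Centered sample cross-moment of two series.
    let dot = |a: &[f64], ma: f64, b: &[f64], mb: f64| {
        a.iter().zip(b).map(|(x, y)| (x - ma) * (y - mb)).sum::<f64>() / n
    };
    // Entries of Sigma_V and of the vector Cov(V, Z).
    let (s11, s22, s12) = (dot(v1, m1, v1, m1), dot(v2, m2, v2, m2), dot(v1, m1, v2, m2));
    let (c1, c2) = (dot(v1, m1, z, mz), dot(v2, m2, z, mz));
    let det = s11 * s22 - s12 * s12;
    // b = Sigma_V^{-1} * (c1, c2)^T via the closed-form 2x2 inverse.
    ((s22 * c1 - s12 * c2) / det, (s11 * c2 - s12 * c1) / det)
}
```

If the samples follow an exact linear model in the two variates, the recovered coefficients are exact, which is a useful unit test for an implementation.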

Clearly, a control variate is “good” if it is highly correlated with $Z$. The constraint in (11) is an example of the extreme case. When we use this constraint as a control variate for the estimation of the mean at some time point, it has a perfect correlation with the estimator, since it describes the mean at that time precisely. Therefore the variance is reduced to zero. We thus aim to pick control variates that are highly correlated with $Z$.

Consider, for example, the above case of the birth-death process. If we choose (11) as a constraint, it would always yield the exact difference between the exact mean and the sample mean, and would therefore have a perfect correlation. Clearly, $R^2$ reduces to 1 and the variance of the adjusted estimator vanishes.

## 6 Finding Efficient Control Variates

In this work, we propose a novel algorithm to synthesize an efficient set of control variates. As we have seen in the previous section, effective control variates have a high correlation with the target random variable. In the case of a single variate, the variance reduction is directly proportional to $\rho^2$, where $\rho$ is the correlation. In our case, infinitely many choices of control variates are available. Our goal is to choose a subset that satisfies two objectives: firstly, every selected control variate should reduce the estimator's variance; secondly, the subset should not be too large, i.e. we want to avoid redundancies to achieve a good overall computational efficiency of the variance reduction.

Accordingly, we define an *efficiency* value to estimate
whether the reduction in variance compensates for the associated cost increase.
A natural baseline for any variance reduction is that it outweighs its associated additional costs.
Let $\sigma_0^2$ and $c_0$ be the variance and per-sample cost of the standard estimator, and
$\sigma_{cv}^2$ and $c_{cv}$ the corresponding quantities of the control variate estimator.
The *efficiency* of the method is the ratio of the necessary
cost to achieve a similar variance with the standard estimate compared to
the CV estimate [l1994efficiency, mcbook], i.e.

$$E = \frac{c_0\,\sigma_0^2}{c_{cv}\,\sigma_{cv}^2}. \qquad (14)$$

The cost ratio depends on both the specific implementation and the technical setup. The cost increase is mainly due to the computation of the integrals in (10). The accumulation over the trajectory directly increases the cost of a single simulation, which is the critical part of the estimation.
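In work-normalized terms, this trade-off reduces to comparing cost-times-variance products, a standard formulation in the variance reduction literature [l1994efficiency]. A trivial Rust helper makes the criterion explicit (names are illustrative):

```rust
/// Work-normalized efficiency of a control-variate scheme: the ratio of
/// (per-sample cost x estimator variance) of the standard estimator to that
/// of the CV estimator. Values above 1 mean the variance reduction more
/// than pays for the extra per-run cost. (Illustrative helper.)
fn efficiency(cost_std: f64, var_std: f64, cost_cv: f64, var_cv: f64) -> f64 {
    (cost_std * var_std) / (cost_cv * var_cv)
}
```

For example, a scheme that doubles the per-run cost must reduce the variance by more than a factor of two to be worthwhile under this criterion.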

If a variate's variance reduction does not adequately compensate for the cost of computing it, we do not want to include it. Balancing both objectives is challenging because control variates often correlate with each other. Such correlations indicate redundancy between different variates. This also becomes clear when considering that the overall variance reduction depends on the coefficient of multiple correlation.

Here we follow a resampling paradigm:
We start by building up a set of candidates using a particle splitting approach.
After each splitting step, we generate a small number of SSA samples to estimate correlations.
Promising candidates are chosen based on the *improvement* they provide and their time-weighting parameter is resampled (see Fig. 2).
The main benefit of this bottom-up approach is its lower dependence on the initial sampling distribution of $\lambda$.
Moreover, the procedure spends less time evaluating unpromising candidates.
After generating a set of control variates, the overall covariance matrix is estimated using stochastic simulations.
Using this information, we construct an efficient subset using a greedy scheme, taking into account the redundancies between control variates.
We discuss Algorithm 2 in more detail below.

#### Initialization

A tuple $(m, \lambda)$ of a moment vector and a time-weighting parameter uniquely identifies a control variate. The algorithm starts out with a small initial set of control variates. That is, we use monomial functions $f(x) = x^m$ and the weighting $w(t) = e^{\lambda t}$ in (7). For a given set of time-weighting parameters, we use all moments up to some fixed order (line 2). For a fixed moment vector, the time-weighting parameter $\lambda$ can lead to vastly different correlations with the quantity of interest. The best choices of $\lambda$ are usually not known beforehand. Therefore, we sample an initial set of $\lambda$'s from a fixed distribution (line 2). Here, we use a standard normal distribution, because its mean corresponds to the neutral weighting $\lambda = 0$ and extreme values are unlikely.

#### Resampling

Promising candidates are chosen from all control variates based on the estimated *improvement ratio* they
provide, i.e.

(15) |

following (13). Specifically, a control variate is chosen with probability proportional to its improvement ratio (line 2). The covariances of (only) the new variates are roughly estimated using very few SSA samples. For the selected variates, the time-weighting parameter $\lambda$ is resampled using a step distribution. There is some freedom in the specifics of this resampling procedure. In particular, the number of splits and the number of descendants for each candidate control the number of additional candidates. The algorithm performs a fixed number of rounds of resampling. Figure 2 illustrates this part of the algorithm.

#### Covariance Estimation

After sampling a set of candidates this way, we need to select the most promising ones. For this, we are also interested in the covariances between all control variates. Since the resampling does not provide such estimates, we evaluate all candidates together for a fixed number of simulations (line 2).

#### Selection

The selection part of the algorithm (line 2) proceeds in a greedy fashion with respect to the potential estimated improvement given by any variate. However, covariates often have high mutual correlations. For example, two variates that share the same moment vector and have close time-weighting parameters are typically highly correlated, often more with each other than with the objective. We want to avoid the unnecessary computational overhead of computing nearly redundant information, as well as numerical problems due to the covariance matrix inversion (see (12)). As a solution, we normalize the estimated improvement vector by the product of the fractions of variance explained by the already selected covariates. Therefore we choose the most promising candidate given a selection as

(16) |

in line 2. This selection continues until the potential improvement falls below some lower threshold (line 2).
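The greedy loop can be sketched on pre-estimated correlations. In the following Rust sketch, each candidate's score is its squared correlation with the target, discounted by the variance of the candidate already explained by the selected set; this mirrors the normalization described above but is a simplified stand-in for the exact criterion (16), and all names and inputs are illustrative.

```rust
/// Greedy selection sketch: repeatedly pick the unselected candidate with the
/// highest discounted score until the best remaining score drops below
/// `threshold`. `rho_target[i]` is the estimated correlation of candidate i
/// with the target; `rho[i][j]` is the correlation between candidates i and j.
/// (Simplified illustrative scoring, not the paper's exact criterion.)
fn greedy_select(rho_target: &[f64], rho: &[Vec<f64>], threshold: f64) -> Vec<usize> {
    let mut selected: Vec<usize> = Vec::new();
    loop {
        let mut best: Option<(usize, f64)> = None;
        for i in 0..rho_target.len() {
            if selected.contains(&i) {
                continue;
            }
            // Product of unexplained-variance fractions w.r.t. selected set:
            // a candidate redundant with an already chosen one scores ~0.
            let discount: f64 = selected.iter().map(|&j| 1.0 - rho[i][j].powi(2)).product();
            let score = rho_target[i].powi(2) * discount;
            if best.map_or(true, |(_, s)| score > s) {
                best = Some((i, score));
            }
        }
        match best {
            Some((i, score)) if score >= threshold => selected.push(i),
            _ => return selected,
        }
    }
}
```

With two perfectly correlated candidates, only one of them is ever selected: once the first is in the set, the second's discount collapses to zero, which is exactly the redundancy-avoidance behavior intended above.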

#### Estimation

## 7 Case Studies

The simulation is implemented in the Rust programming language (https://www.rust-lang.org).
The model description is parsed from a high-level specification.
Rate functions are compiled to stack programs for fast evaluation.
To estimate the base-line cost, 1000 estimations were performed without
considering any control variates.

We first consider two case studies that have already been used in previous work [backenkohler2019control]. The first model is a simple dimerization process, albeit with a countably infinite state-space. The second model is a switch model with a more complex bi-modal behavior. We now describe the models and the estimated quantities in detail.

###### Model 2 (Dimerization)

We first examine a simple dimerization model on an unbounded state-space

with initial condition .

Despite the model's simplicity, the moment equations are not closed for this system, due to the non-linear second reaction. Therefore a direct analysis of the expected value would require a closure. For this model we will estimate an expected molecule count at a fixed time horizon.

The following model is bimodal, i.e. it possesses two stable regimes between which it can switch stochastically. We choose the initial conditions such that the process will move towards either attracting region with equal probability.

###### Model 3 (Distributive Modification)

This model was introduced in [cardelli2012cell]. It consists of the reactions

with initial conditions .

The expected value of interest here is .

We applied the presented algorithm for an estimation using a fixed number of simulations. Initial samples for the time-weighting parameter were drawn from a standard normal distribution. Constraints corresponding to each first-order moment, i.e. the process' expectations, were generated. The covariance estimation during resampling used a small number of samples.

We evaluated the algorithm both with and without resampling for these first two case studies. The algorithm without resampling leaves out lines 2–2 from Alg. 2. The evaluation without resampling provides a good point of comparison to the performance of our previous heuristic on these cases. In the case of dimerization we observe a variance reduction close to the best-case reduction of our previous work. This close performance, however, has to be balanced against very different slowdown factors: the slowdown of the new heuristic is much smaller than that of the previous one. Therefore the new method clearly outperforms the old one in terms of efficiency. This is mainly due to the higher number of covariates used by the simple threshold heuristic. In contrast, the new method takes into account redundancies between covariates while still retaining good performance. This becomes apparent when comparing the average number of used variates.

The variance reduction factor for the distributive modification model is similar for the old and the new method. Noticeably, the new method uses fewer CVs on average than the previous heuristic with the best efficiency. The overall efficiency of the new algorithm is slightly lower than the previous best value, due to a higher slowdown. It is, however, important to note that the trade-off differs significantly between different heuristics used in the previous algorithm. Furthermore, the lower average number of control variates would reduce the slowdown factor further if more trajectories were generated.

In Figure 3 we contrast the variance improvement ratio with and without the resampling algorithm. For the dimerization model, we see a clear improvement in variance reduction. This improvement is due to the fact that the strongest correlations are present in a region of the time-weighting parameter space (cf. Figure 2) that is less likely to be covered by the initial samples from the standard normal distribution. Therefore the resampling procedure is especially beneficial if the better parameters are farther from the origin. In the case of the distributive modification case study, we see a slight improvement. Here, the best parameters are close to zero and thereby more likely to be sampled by a standard normal. Still, the resampling improves covariate performance for the most frequent cases of 2–4 covariates being selected (the case of 5 covariates has only a few instances). Note that the additional cost incurred by the resampling procedure is comparatively small, because at most 4 candidates are evaluated in each iteration.

Next, we turn to the estimation of probabilities. In particular, we consider the event of a species being below a threshold at a fixed time (one designated species each for the dimerization and the distributive modification model). In Figure 4 we summarize the results of this study for varying threshold levels. In both case studies we observe that control variates are efficient for probabilities not close to either zero or one. In this case control variates are able to reduce the variance of the estimated probabilities whilst maintaining a beneficial reduction-slowdown trade-off. This region is larger for the distributive modification model because of its bimodal behavior. If the probability to be estimated is close to either one or zero, the event occurs too often or too rarely, respectively, to adequately explain variance using linear correlations. We note that the worst-case efficiency is close to one. This is because the algorithm then discards all covariate candidates, leaving us with a standard estimation. Only the initial covariate evaluation and resampling cause a slowdown, driving the efficiency slightly below one. Naturally, this cost decreases with more samples.

Control variates based on test functions restricted to intervals did not lead to an improvement (data not shown).

Finally, with the lac operon model we consider a larger case study. This model consists of 11 species and 25 partly non-linear reactions.

###### Model 4 (Lac operon)

This is a well-known model of genetic regulation with positive feedback [stamatakis2009comparison]. Its reactions are

Initially, and while all other abundancies are zero. The parameters are , , , , , , , , , , , , , , , , .

We estimate the expected abundance of LacY after one time unit. LacY is encoded by the gene Y and facilitates the lactose import via reactions 14 and 16. A typical simulation of the system up to the time horizon takes well above one minute of computational time. Therefore we reduce the number of trajectories used. The other settings remain as above.

Despite the high dimensionality, we observe a good efficiency value. The slowdown caused by the method is approximately 1.98. A big part of this slowdown is due to the initial search for covariates. Initially, 10 covariates are generated for each first-order moment, i.e. for each of the 11 species. The number of additionally resampled covariates is similar to the previous case studies. Thus the main cost of the initial resampling and selection is due to the first iteration of the resampling loop and the simulation loop of the selection procedure. This part still has potential for optimization: not all known covariates need to be reconsidered at the selection stage. Instead, unpromising candidates could be discarded prior to that stage.

Still, the high variance reduction more than compensates for this increase in computational cost, leading to the good overall efficiency. This shows that, even for more complex models, the method is applicable and can be extremely beneficial for Monte Carlo estimation.

## 8 Conclusion

In the context of Monte Carlo simulation, variance reduction techniques offer an elegant way of improving the performance without introducing approximation errors in addition to the statistical uncertainty.

For stochastic reaction networks, we show that it is possible to exploit constraints derived from the statistical moment equations for the construction of control variates. We propose a robust method to select an appropriate subset from the large set of all possible variates. In particular, we improve an initial subset by selecting particularly effective variates and removing redundant variates. By resampling the time-weighting parameter we ensure that appropriate values are flexibly explored. In the worst case, all variates are dropped and the performance approaches the standard SSA. In most cases, however, a suitable subset is found together with the corresponding choices of .

We analyze the performance of the method when estimating event probabilities and not only average molecule counts. Our largest case study has 11 species and 24 reactions.

In the future, we will further explore the algorithmic design space. For example, the resampling distribution could be adjusted using decaying standard deviations. Furthermore, we will look at different test functions weighting the state space more flexibly.

### 8.0.1 Acknowledgements

This work was supported by the DFG project MULTIMODE.