Computational modelling in the sciences is founded on the notion of abstraction, the process of identifying and representing mathematically the salient features and interactions of a real system. Abstraction is a human led and interdisciplinary activity: many factors influence the decision of which features/ interactions are eventually represented in the abstracted model, including the specialist interests of the scientists formulating the model, as well as computational constraints on the level of detail which can feasibly be implemented on the available hardware. Such factors inevitably vary between different research groups and at different times, leading to a proliferation of different models representing the same underlying phenomena.
Systems biology is a prominent field where models with different level of abstraction coexist. As an example, consider the process of gene expression, whereby genetic material stored in the DNA is transcribed into messenger RNA and eventually translated into protein. At the highest level of abstraction, which is frequently employed when studying high-throughput data, the process may be considered as a black box, and one may only be interested in characterising the statistical structures underlying observed levels of gene expression in different conditions . Alternatively, one may want to mechanistically represent the dynamics of some of the steps involved in the process. The choice of which steps to include is again largely driven by extrinsic factors: examples in the literature range from highly detailed models where the synthesis of mRNA/ protein is modelled through the binding and elongation of polymerase/ ribosomes, to models where protein production is modelled as a single reaction with first order kinetics [1, 11].
Representing the same physical process by multiple models at different levels of abstraction immediately engenders the question of how different model outputs can be reconciled. As the models all represent the same underlying physical system, it can be plausibly assumed that such models will agree at least qualitatively for suitable parametrisations. In general, however, models may not agree quantitatively, and their discrepancy may be a function of the parameters. Understanding and quantifying such discrepancies would often be very valuable: first of all, it can shed light on how simplifications within models affect predictions, and secondly it may open the opportunity to construct computationally efficient surrogates of complex models. Such surrogates can be precious when modelling requires intrinsically computationally intensive tasks, like inference.
In this paper, we approach the problem of reconciling models from a statistical machine learning angle. We start by sampling a sparse subset of the parameter space over which we evaluate the models’ outputs (generally by simulation). These evaluations are used as a training set to learn a correction map
via a non-parametric regression approach based on Gaussian Processes. We show that our approach yields a consistent stochastic equivalence between models, which provably reconciles the predictions of the two models up to the second moment. We demonstrate the approach on two biological examples, showing that it can lead to non-trivial insights into the structure of the models, and provide an efficient way to simulate a complex model via a simpler model.
The rest of the paper is organised as follows: we start by giving a high level description of the problem and how we attack it. This is followed by a formal definition and a thorough illustration of the proposed solution, discussing its desirable theoretical properties. We then demonstrate the approach on two proof of principle examples, showing the potential for practical application of the method. We conclude the paper by discussing the relationship of our approach to existing ideas in model reduction and statistical emulation, as well as some possible extensions of our results.
2 Problem definition
High level description of the approach.
In this paper we consider the problem of performing analyses which require exhaustive sampling of a model’s outputs, , the dynamics of which are expensive to compute. We are not interested in the origin of such a complexity, and we assume to be given an abstraction/surrogate of this model, , which is reasonably less challenging to analyze111 could be complex to analyze either because of its structure, e.g., it might have many variables, or its numerical hurdles, e.g., the degree of non-linearity or parameters stiffness. For similar reasons, we do not care whether is has been derived by means of independent domain-knowledge or automatic techniques.. For this reason, we want to investigate a general methodology to use as a reliable proxy to get statistics over . Possible applications of this framework, which we discuss in §4, regard model selection and synthesis, inference, and process equivalence/control.
In general, we will assume both models to be stochastic processes, e.g. continuous time Markov Chains (CTMCs). Furthermore, we assume that the highly detailed modeland the reduced model share some parameters and some observables/ state variables , but the large model will generally have many more state variables and parameters. In general we can compute some statistics of the shared state variables (e.g. mean), and that such computation will be considerably more expensive using the detailed model .
As both models are devised as abstractions of the same physical system, it is not unreasonable to assume that the expected values of the shared variables will be similar for some parametrisations of the models. However, it is in general not the case that the distribution of the shared variables implied by the two models will be the same, and, as we vary the shared parameters , even the expected values may show non-negligible discrepancies. Our aim is to develop a generally applicable machine learning approach to correct the output of the reduced model, in order to match the distribution of the larger model. This has strong practical motivations: one of the primary uses of models in general is to test hypotheses statistically against observed data, and it is therefore essential to capture as accurately as possible the implied variability on observable variables.
The strategy we will follow is simple: we start by sampling values of the shared parameters
, and compute the first two statistics of the observed variables as implied by both the large and reduced models (by simulation). In general, one may expect the variance implied by the larger model to be larger, as a more detailed model will imply more stochastic steps. We can therefore correct the first two statistics (mean and variance) of the reduced model output by adding a random function of the shared parameters, which can be learned by rephrasing it as a regression task. We will work with heteroschedastic Gaussian Processes .
2.1 Formal problem statement
We assume to be given two models of the same underlying physical system:
a highly detailed model , with state variables and parameters .
a reduced model , with state variables and parameters .
We will have and .
In general, the problem we want to tackle draws immediate connection to that of using as a statistical emulation of . However, to exploit solutions from regression analysis and machine learning, we make further assumptions and discuss their limitations thorough the paper.
(Observables) we assume that it exists a non-empty set of state variables (or, more generally, observables) , common to both models, that is sufficient to compute our statistics of interest.
(Parameters) we assume that model is fully parametrized by
, a vector of real-valued parameters that breaks down as, with shared between models and . Here, we assume that is fully parametrized222In principle, even might have a set of free variables, with respect to . However, as we have full control over that model, we could assume a parametrization of such variables and all what follows would be equivalent. by , which ranges in a domain set . We term free parameters in , given
. We further assume to have a probability densityfor the free parameters, and a probability density for the shared parameters, encoding our knowledge about them.
(Sampling) we assume that, for every parametrization, each model’s dynamics is computable, i.e. it can be simulated.
In this work, we will correction maps conditioned to a reference statistic of interest, in the following sense.
Definition 1 (Statistic)
A statistic is any observable that we can compute from one, or from an ensemble of simulations of a model. We denote its estimator computed from model with parameters as , and its true value .
Valid examples of such statistics are, e.g., a single value or the expectation of a variable in , the satisfiability of a temporal logical formula depending on variables that could be model-checked, etc. The richer the estimator, the higher number of samples are required for the estimator to converge to the true statistics. We will make use of estimators that are consistent in the following sense: as the number of samples goes to infinity.
Definition 2 (Correction map)
We define an -correction map for a statistics to be a function that for any , satisfies
where is the precision, and is the expectation of the statistics computed from , with respect to its free parameters . is our corrected prediction of .
Thus, can correct the outcomes of assessed over , , to match (with a given precision) those that we would have by computing the statistics over . Notice that the corrected outcome has no more dependence from the free parameters, as this is a correction in expectation with respect to .
Notice that the correction is a -dimensional vector, and hence is used as distance metric, and that the term allows for tolerance in the correction’s precision. It is easy to define the optimal, zero-error, correction map:
However, the correction function is impossible to compute exactly, as we cannot compute neither nor its marginalisation over . Hence, we will learn an approximation of trying to keep its error low so to satisfy Definition 2. We turn this problem into a regression task, as graphically explained in Figures 1 and 2, and exploit Gaussian Processes.
3 Learning the Correction Map
In this section we will present our machine learning strategy in more detail. We consider the case of a scalar statistics, as -dimensional ones can be treated by solving independent learning problems.
In order to evaluate (2), we need to be able to compute or approximate the value with respect to the free parameters of , for a any given . As this integral cannot be treated analytically or numerically, due to the large dimensionality involved (the cost is exponential in ), we will resort to statistical approximations. Before discussing them, let us comment on the distribution
, which is an input for our method. In particular, this distribution encodes our knowledge on the more plausible values of the free parameters. In case we have no information, we can choose an uniform distribution. On the other side of the spectrum, we may know the true valueof , and choose a delta Dirac distribution, which will dramatically simplify the evaluation of the integral. In this case, we can evaluate (2) as
Moreover, the approximation, is appropriate when the distribution is tightly concentrated around its mode .
In general, however, when does not have this special form, we can resort to downsampling , by generating samples , , and approximating . In the following, however, we will not necessarily aggregate the values , and treat them individually to better account for the variance in the observed predictions.
3.2 Gaussian Processes
We will solve the learning problem resorting to Gaussian Process (GP) regression 
. GPs are random functions, i.e. probability distributions over a function space, in our case functions, with the property that any finite dimensional projection , ,
is a multidimensional Gaussian random variable. It follows that GP are defined by a mean function, returning the mean at any point in , and by a covariance or kernel function , for giving the covariance between any pair of points in . GP can be used to solve regression tasks in a Bayesian setting. The idea is as follows: we put a GP prior on the space of functions , typically by assuming constant zero mean and fixing a kernel function333
In this work, we use the classic Gaussian kernel fixing hyperparameters by maximising the type-II likelihood; see., and then consider the posterior distribution given a set of observations , , at input points . If we assume that , with a zero mean Gaussian noise term with variance , then we obtain that the posterior distribution given the observed data is still a GP, with mean and kernel functions that can be obtained analytically, see  for further details. GP regression is essentially the same if the observation noise is different at different input points, i.e. , in which case we talk about heteroschedastic regression.
3.3 The regression task
Let for some index set be the input space, and our training points. In case we use eq. (3) to evaluate the correction map, each is a scalar value, and a standard regression schema based on Gaussian Processes can be used. In that case we assume samples of the correction map to be observations from a random variable centered at a value given by the latent function
In this standard Gaussian Processes regression the noise model in the observations is assumed to be a constant for all sampled points.
In the more general case we work with downsampling solutions that exploit samples for the free variable, . In that case, we have correction values per training point, , that we can use in a straightforward way to reduce to the above schema, or to estimate the variance of conditioned to its free variables. In these cases, the training set is , namely we do regression on the point-wise expectation of the correction (i.e., ).
Estimator 1 (Empirical -estimator.)
Set to the empirical estimate of the variance across all correction values to exploit the schema in eq. (4) with .
Besides the simple first case, it is more interesting to account for a model of the variance in the observations of the predictions from a model; we discuss how this could be done in two ways.
Estimator 2 (Point-wise -estimator)
Let the empirical estimator of the variance of the correction values, per training-point
then define a model of the variance as a point-wise function of the regression parameter, that is
In this case, the variance that we expect in each prediction of the latent function is estimated from the data, thus leading to a form of heteroscedastic regression.
We can estimate with higher precision a model of the variation in the variance across the input space; to do that we perform regression of the variance change, and then inform the outer regression task of that prediction.
Estimator 3 (Nested -estimator)
Consider the same estimator of the variance as above, and collect the variance estimates as . Learn a latent function model of the true variance , which we assume to have a fixed variance
This is also a form of heteroschedastic regression, but nesting two GP regressions to account in a finer way for the variance’s changes.
3.4 Properties of the correction map
The correction map that we learn, for all variance schemes, is a statistically sound estimator of , in the sense of being consistent.
Theorem 1 (Correctness)
Let and be consistent estimators of and , then is a consistent estimator of , for any estimation strategy of .
The correction map is estimated from samples of the system, hence it is an approximation of the exact map defined by eq. (2). Thus, it is a correction map in the sense of Definition 2. However, being the result of a GP regression, is in fact a random function. Therefore, in measuring the error according to eq. (1), we need to consider the average value of the random function . The variance , instead, provides a way of computing the error , but in a statistical sense with confidence : can be estimated by averaging over (with respect to ) the half-width of the region containing % of the probability mass of .
The cost of all our approaches inherently depends on how many samples we pick from , the way parameters in are accounted for and the number of parameters in . The sampling cost in general grows exponentially with , and each Gaussian regression is cubic in the number of sampled values. Notice that, asymptotically, the cost of the empirical and nested -estimators is the same, as the two regressions are executed in series.
We discuss now several potential applications of our framework. The advantages of using our approach are mostly computational: the reduced model is simpler to analyze, yet it retains a mechanistic interpretation, compared to statistical emulation.
Many common problems in the area of dynamical systems can be tackled by resorting to correction maps.
Problem 1 (Model selection)
Let be a model, and , , a set of candidate models for correction, each one having a correction map , , . The smallest-correction model for a statistic is .
This criterion is certainly alternative to structural Bayesian approaches , which can be used to select the structurally smaller model within an equivalence class (see below). Also, allows to frame a model synthesis problem.
Problem 2 (Model synthesis)
For a model with parameters and for a statistic : partition into sets and , generate a finite set of plausible reduced models with parameters and select the best one, according to the above model selection criterion.
In this case, the partition might aim at identifying in the model’s parameters which contribute the most to the variance for the statistics . Opportunities for control are also plausible if one can reduce the problem of controlling to “controlling and correcting” a reduced model . This should be easier as is structurally smaller than , and so is lower the complexity of solving a controller synthesis problem.
Another application of our framework is in Bayesian parameter estimation of the parameters of the big model , given observations of variables , using the small model and the corrected statistics to build approximations of the likelihood function to plug into a Bayesian approach (i.e. to compute an approximate posterior). For instance, using the mean of variables (and the variance of the correction map as a proxy of the variance in after marginalisation of ), we can easily compute the distribution of
under a linear noise approximation and use a Bayesian inference scheme such as.
Correction maps can also be used to compare processes, via weak forms of bisimilarity with respect to the observations and a statistics.
Definition 3 (Model equivalence)
Models and , for a statistic and parameter sets and , are -bisimilar conditioned to , , if and only if for all , it holds . A class of equivalence of models with respect to and is the set of all such bisimilar models.
This notion of bisimilarity resembles conditional dependence, as we are saying that two models are equivalent if an observer corrects both the same way. In this case, is a universal corrector of , as it can correct for all the models in the class. The class of models that are equivalent to a model is – i.e., the class of models with correction zero; notice that in this case . The previous definition can be relaxed by asking that , for all .
Criterion is a weaker form of probabilistic bisimilarity, namely if are bisimilar, then for some and any statistics of interest. For instance, for CTMCs, this follows because implies that and have the same infinitesimal generators for any parameter and , hence the outcomes of and are indistinguishable, and so are their statistics.
|Full model ()||Reduced model ()|
E+ S[bend left=15]rk_1 & ES [bend left=15]lk_-1 rk_2 & E + PR Srf & PR
G_act^∗rα[bend left=25]dk_offPR & mRNA^∗rβdδ_RNA & PRdδ_P
G_in[bend left=25]uk_on & ∅& ∅ G_act^∗rβ[bend left=25]dk_offPR & PRdδ_P
G_in[bend left=25]uk_on & ∅
We investigate two examples to better illustrate our method.
5.1 Model reduction via Qssa
Consider the irreversible canonical enzyme reaction with its exact representation (here, model ), for enzyme E, complex ES , substrate S and product PR (Figure 3, top left panel). When the concentration of the intermediate complex does not change on the time-scale of product formation, product is produced at rate where and . This is the Henri-Michaelis-Menten kinetics and is technically derived by a quasi-steady-state assumption (QSSA), i.e., , that is in place when , where is the initial amount of species . is thus the QSSA reduced model (Figure 3, top right panel).
For non-degenerate parameter values both models predict the same equilibrium state, where a total transformation of substrate into product has happened, for large . Thus, we are not interested in correcting the dynamics of for long-run times, but rather in the transient (i.e., for small ).
Also, as the QSSA hypothesis does not hold for certain initial conditions, we set as the regression variable, and set and . The other parameters are , and with unit (mole per second). In terms of regression, we pick samples of the initial enzyme amount from , and set as it is a time in the transient (manual tuning). This particular example is rather simple, as the free parameters of are part of the Michaelis-Menten constant and fixed, so we use the simpler correction of eq. (3). Also, knowing when the QSSA holds gives us an interval where we expect the correction to shrink towards zero. The map is shown in Figure 4, which depicts the expected concordance among the correction map and validity of the QSSA.
5.2 Model reduction via time-scale separation
Consider a gene switching among active and inactive states of mRNA transcription to be ruled by a telegraph process with rates /. A reaction model of such gene , protein , messenger mRNA with transcription/translation rates / as in Figure 3, bottom left panel.
Here the gene switches among active and inactive states, with rates and , and PR feedbacks positively on inactivation. Proteins and mRNAs are degraded with rates and . In the uppermost part of the diagram species marked with a symbol are not consumed by a reaction, i.e., mRNA transcription is . This model can be easily encoded in a Markov process, as discussed in 0.A.1.
We can derive an approximation to where the transcription step is omitted. This is valid when (time-scales separation), namely for every copy of mRNA a protein is immediately translated.
Correction of protein dynamics.
We build a correction map with . In this case the telegraph process is common to both models, but and are free variables of ; here we assume to have a prior delta distribution over the values of mRNA’s degradation, so we set . For some values of the transcription rate condition C does not hold; in this case it is appropriate to account for ’s contribution to the variance in the statistics that we want to correct, when we do a regression over . Note that also is part of .
Model leads to stochastic bursts in PR’s expression when the baseline gene switching is slower than the characteristic times of translation/transcription. Here we set , and assume mRNA’s lifespan to be longer than protein’s ones (also in light of condition C), so . We simulate both models with one active gene, ; example dynamics are shown in Figure 5, for and . We observe that, when does not hold (
) the protein bursts increases of one order of magnitude, and the long-run probability density function for the proteins,, becomes multimodal.
We define two statistics. One measures the first moment of the random variable that models the number of proteins in the long run; the other is a metric interval temporal logic formula , expressing the probability of a protein burst within the first 100 time units of simulation.
The former is evaluated by a unique long-run simulation of the model, as its dynamics are ergodic. For the latter we estimate the satisfaction probability of the formula via multiple ensembles, as in a parametric statistical model checking problem .
For the regression task we sample values of , in the range . For , instead, we sample 50 random values in the same interval, for each value of ; notice that with high probability we pick values where does not hold, so we might expect high correction factors. Data generated and the regression results are shown in Figure 6, for both fixed-variance regression, empirical -estimator in eq. (4) and with the -estimator, eq. (6). Because variance spans over many orders of magnitude, regression is performed in the logarithmic space. Results highlight a general difference between the posterior variance between the two estimators.
For the second statistics, data is generated from an initial condition where one gene is inactive, . Notice that the expected time for the gene to trigger its activation is (the time upper-bound of the formula), so for some parametrization there will be non-negligible probability of having no protein spike above threshold . The formula satisfaction probability is evaluated with 100 independent model runs, and data generated are shown in Appendix Figure 8. Regression results are shown in Figure 7, where the point-wise -estimator and the empirical
-estimator are compared, highlighting the more robustness of the former with respect to the sampled bottom-left outlier.
Abstraction represents a fundamental tool in the armoury of the computational modeller. It is ubiquitously used in biology as an effective mean to reduce complexity, however systematic analysis tools to quantify discrepancies introduced by abstraction are lacking. Prominent examples, beyond the ones already mentioned, include models with delays, generally introduce to approximately lump multiple biochemical steps , or physiological models of organ function which adopt multi-scale abstractions to model phenomena at different organisational levels. These include some of the most famous success stories of systems biology, such as the heart model of , which also constitutes perhaps the most notable example of a physical systems which has been modelled multiple times at different levels of abstraction. Employing our techniques to clarify the quantitative relationship between models of cardiac electrophysiology would be a natural and very interesting next step.
Our approach has deep roots in the statistical literature. In this paper we have focussed on the scenario where we try to reconcile the predictions of two separate models, however the complex model was simply used as a sample generator black box. As such, nothing would change if instead of a complex model we used a real system which can be interrogated as we vary some control parameters. Our approach would then reduce to fitting the simple model with a structured, parameter dependent error term. This is closely related with the use of Gaussian Processes in the geostatistics literature , where simple (generally linear) models are used to explain spatially distributed data, with the residual variance being accounted for by a spatially varying random function. Another important connection with the classical statistics literature is with the notion of emulation 
. Emulation constructs a statistical model of the output of a complex computational model by interpolating sparse model results with a Gaussian Process. Our approach can be viewed as apartial emulation, whereby we are interested in retaining mechanistic detail for some aspects of the process, and emulate statistically the residual variability. In this light, our work represents a novel approach to bridge the divide between the model-reduction techniques of formal computational modelling and the statistical approximations to complex models.
-  S.Aitken, R.D.Alexander, and J.D Beggs. A rule-based kinetic model of rna polymerase ii c-terminal domain phosphorylation. J R Soc Int, 10(86):20130438, 2013.
-  R. Alur, T. Feder, and T.A. Henzinger. The Benefits of Relaxing Punctuality. J. ACM, 43(1):116–146, 1996.
-  D. Barber. Bayesian reasoning and machine learning. Cambridge Un Press, 2012.
-  L. Bortolussi, D. Milios, and G. Sanguinetti. Smoothed model checking for uncertain Continuous-Time Markov Chains. Inf and Comp, 247:235–253, 2016.
-  L. Bortolussi and G. Sanguinetti. Learning and Designing Stochastic Processes from Logical Constraints. In Proc. of the 10th Int Conf on Quantitative Evaluation of Systems, volume 8054 of LNCS, pages 89–105, 2013.
-  G. Caravagna. Formal modeling and simulation of biological systems with delays. PhD thesis, University of Pisa, 2011.
-  N.Cressie and C.K.Wikle. Statistics for spatio-temporal data. Wiley & Sons, 2015.
-  D.C. Hoyle, M Rattray, R. Jupp, and A. Brass. Making sense of microarray data distributions. Bioinformatics, 18(4):576–584, 2002.
-  M.C. Kennedy and A. O’Hagan. Bayesian calibration of computer models. J Ro Stat Soc: Series B (Statistical Methodology), 63(3):425–464, 2001.
-  M. Komorowski, B. Finkenstadt, C.V. Harper, and D.A. Rand. Bayesian inference of biochemical kinetic parameters using the linear noise approximation. BMC Bioinf, 10:343, 2009.
-  N.D Lawrence, G. Sanguinetti, and M. Rattray. Modelling transcriptional regulation using gaussian processes. In Adv Neural Inf Process Syst, 785–792, 2006.
-  D. Noble. Modeling the heart–from genes to cells to the whole organ. Science, 295(5560):1678–1682, 2002.
-  C.E. Rasmussen and C.K.I. Williams. Gaussian processes for machine learning. MIT Press, Cambridge, Mass., 2006.
Appendix 0.A Appendix
All the code that replicate these analysis is available at the corresponding author’s webpage, and hosted on Github (repository GP-correction-maps).
0.a.1 Further details on the examples
The two models from §5.1 correspond to these systems of differential equations
which we solved in MATLAB with the ode45 routine with all parameters (InitialStep, MaxStep, RelTol and AbsTol) set to .
Concerning the Protein Translation Network (PTN) in §5.2, the set of reactions and their propensity functions that we can use to derive a Continuous Time Markov Chain model of the network are the following. Here denotes a generic state of the system and, for instance, the number of mRNA copies in .
The reduced PTN model is a special of this reactions set where transcription and mRNA decay are omitted. In this case we used StochPy to simulate the models and generate the input data per regression – see http://stochpy.sourceforge.net/; data sampling exploits python parallelism to reduce execution times.
For regression, we used the Gaussian Processes for Machine Learning toolbox for fixed-variance regression, see http://www.gaussianprocess.org/gpml/code/matlab/doc/ and a custom implementation of the other forms of regression.
Proof of Theorem 1.
Both the empiricals and nested estimator rely on an unbiased estimator of the mean/variance, which means that if , i.e., we sample all possible values for the free variables, we would have a true model of /. This means that, for each sampled value from , even the simplest
-estimator would be equivalent, in expectation, to the marginalization of the free variables. This is enough, combined with properties of Gaussian Processes regression (i.e., the convergence to the true model with infinite training points), to state that the overall approach leads to an unbiased estimator of the correction map.