Separable nonlinear least-squares parameter estimation for complex dynamic systems

08/10/2019 ∙ by Itai Dattner, et al. ∙ Powered by Pure, Scopus & Elsevier Fingerprint Engine Georgia Institute of Technology University of Haifa 0

Nonlinear dynamic models are widely used for characterizing functional forms of processes that govern complex biological pathway systems. Over the past decade, validation and further development of these models became possible due to data collected via high-throughput experiments using methods from molecular biology. While these data are very beneficial, they are typically incomplete and noisy, so that inferring parameter values for complex dynamic models is associated with serious computational challenges. Fortunately, many biological systems have embedded linear mathematical features, which may be exploited, thereby improving fits and leading to better convergence of optimization algorithms. In this paper, we explore options of inference for dynamic models using a novel method of separable nonlinear least-squares optimization, and compare its performance to the traditional nonlinear least-squares method. The numerical results from extensive simulations suggest that the proposed approach is at least as accurate as the traditional nonlinear least-squares, but usually superior, while also enjoying a substantial reduction in computational time.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Nonlinear dynamic models are widely used for characterizing functional forms of processes that govern complex biological pathway systems. Of particular interest in this context are so-called canonical formats, which are very flexible in their possible responses, yet involve a very restricted domain of functional forms. Outside linear systems, the best-known canonical formats are Lotka-Volterra (LV) models (Lotka (1956), Volterra (1926), Peschel & Mende (1986), and May (2001)), which use binomial terms, and power-law systems within the framework of Biochemical Systems Theory (BST), which exclusively use products of power functions. BST was originally devised for the analysis of biochemical and gene regulatory systems, but has subsequently found much wider application in various biomedical and other areas (Savageau (1976) and Voit (2013)

). Whereas it is easy to set up an LV or BST model for a complex biological system in a symbolic format, the identification of optimal parameter values continues to be a significant challenge. As a consequence, estimating parameters of systems of ordinary differential equations (ODEs) remains to be an active research area that attracts contributions from a variety of scientific fields (e.g.,

Gennemark & Wedelin (2007), Chou & Voit (2009), Girolami & Calderhead (2011), McGoff et al. (2015), Ramsay & Hooker (2017), and Schittkowski (2002)). Indeed, numerous optimization methods for ODE models have been proposed in recent years, but none works exceptionally well throughout wide ranges of application.

The main focus of this paper is not a new estimation method per se

. Instead, we are interested in a more general and high-level point of view regarding parameter estimation. Specifically, our work addresses parameter estimation for dynamic models whose mathematical form contains significant linear features, that allow a natural separation of parameters and system states. A trivial example is a linear ODE where the vector field

is linear in the parameter , with denoting the derivative of with respect to . As a more interesting example, the ODE vector field may be partially linear in the parameters, as it is the case for so-called S-system models in BST.

Example 1.

An S-system (see Voit (2000)) is defined as


Here are positive rate constants, while are real-valued kinetic orders that reflect the strength and directionality of the effect that a variable has on a given influx or efflux. Informally, one can view this system as a regression equation, where the “covariates” are the variables

on the righthand side, whereas the “response” variables are the derivatives

on the left-hand side. Note that the vector field is linear in the rate constants , but nonlinear in the kinetic orders . ∎

Estimation methods that exploit separability of parameters and system states in dynamic models have a long history; see, e.g., Himmelblau et al. (1967) for a special case. However, a rigorous statistical analysis of one such method has been achieved only recently (Dattner & Klaassen (2015)). In a classical paper on the inference for dynamic models, Varah (1982) mentions in passing that “one can use the idea of separability or variable projection (see Golub & Pereyra (1973) or Ruhe & Wedin (1980)), in which the linear parameters are implicitly solved for, the resulting (fully) nonlinear least squares problem is solved for the nonlinear parameters, and then the linear parameters are obtained using their representation in terms of the nonlinear parameters. Since this reduces the size of the nonlinear least squares problem to be solved, it is worthwhile.” Somewhat surprisingly, given that parameter estimation for ODEs is commonly thought as a computational bottleneck in modeling dynamic processes, Varah’s suggestion has not been widely followed in practice. In fact, in the vast literature dedicated to parameter fitting techniques for dynamic models, we are aware of only two relevant references: Dattner et al. (2017), using the direct integral approach, applied the separable nonlinear least-squares to the inference of parameters in a predator–prey system acting in a heterogeneous environment, while Wu et al. (2019) used separability to estimate parameters of high-dimensional linear ODE systems. Moreover, Varah’s idea of exploiting separability for estimation of ODE parameters has been implemented only recently in a publicly available software package (Yaari & Dattner (2019)).

The analysis in this paper is hoped to convince the reader that Varah’s idea is indeed worth pursuing. Specifically, we explore and compare two general data fitting approaches for dynamic models: the traditional nonlinear least-squares method (NLS), and a proposed separable nonlinear least-squares method (SLS). Through extensive Monte-Carlo simulations of representative complex models, arising in different scientific fields, we identify and quantify significant statistical and computational gains when using this separation method. We will ultimately come to the conclusion that model separability can be very beneficial and that the SLS approach should be considered for complex dynamic systems with significant linear features.

The paper is organized as follows. In Section 2 we present details of the SLS methodology in the context of dynamic models. Section 3 describes the simulation setup, quantifies the statistical measures we use in order to compare the performance of SLS and NLS, and presents numerical results, while conclusions are provided in Section 4.

2. Separable nonlinear least-squares (SLS) and Varah’s idea

2.1. Generalities

Following Varah’s original idea within the context of inference in dynamic models, the main advantages of exploiting separability for parameter estimation are the following (Golub & Pereyra (2003)):

  1. fewer initial guesses are required for optimization,

  2. the optimization problem is better conditioned, and

  3. convergence is faster.

These advantages have been convincingly demonstrated in several publications. For example, see Mullen (2008) for an implementation and applications in physics and chemistry; Chung & Nagy (2010) for a high-dimensional case, where the number of parameters is substantially larger than the number of observations; Gan et al. (2018), who compared the performance of several algorithms for SLS problems; and Erichson et al. (2018)

, who studied sparse principal component analysis via variable projection. Separable models are of broad practical applicability, and as such form a subject of active theoretical and applied research. For instance, when analyzing the “reduced” nonlinear optimization problem of a separable structure, simplified conditions are required for establishing a variety of theoretical results concerning numerical and statistical properties of the resulting estimators, compared to the original NLS problem (e.g.,

Basu & Bresler (2000) and Dattner & Klaassen (2015)).

In the following we focus on complex dynamic models and, specifically, consider a system of ordinary differential equations given by


where takes values in and Let


where , and the symbol stands for the matrix transpose. Here , a vector of size , stands for the “nonlinear” parameters that are not separable from the state variables , while , a vector of size , contains the “linear” parameters; note that . As the vector field in (3) is separable in the linear parameter vector , we refer to the corresponding ODE system as linear in the parameter

(cf. the case of a linear regression model), although the

solution to the system might be highly nonlinear in , or even implicit.

Example 2.


and Then one sees that equation (1) is a special case of (2)–(3). ∎

2.2. Solution strategy

Let be the solution of the initial value problem (2). We assume that noisy measurements on the system are collected at time points , where


Here the random variables

are unobservable, independent measurement errors (not necessarily Gaussian) with zero mean and finite variance.

Varah’s approach to parameter estimation in ODE models works as follows. Let stand for a smoother of the data, obtained, e.g., using splines or local polynomials (see, e.g., Fan & Gijbels (1996), Green & Silverman (1994), and Wasserman (2006) for a treatment of various smoothing methods and an extensive bibliography). This smoother approximates the solution to the ODE (2). Varah suggests to insert the smoother into equation (2), which will now be satisfied only approximately, and to minimize the resulting discrepancy over the parameters and A minimizer is then an estimator of This idea was put on a solid statistical foundation in Brunel (2008) and Gugushvili & Klaassen (2012). Varah’s original approach requires the use of the derivative as an estimator of which is a disadvantage, as it is well-known that estimating derivatives from noisy and sparse data may be rather inaccurate: see, e.g., Vilela et al. (2007) and Chou & Voit (2009), or more generally Fan & Gijbels (1996). Recent research (Dattner & Klaassen (2015), Dattner (2015), Vujačić et al. (2015), Chen et al. (2017), Vissing Mikkelsen & Hansen (2017), Dattner et al. (2017), Yaari et al. (2018), Dattner & Gugushvili (2018), Yaari & Dattner (2018), Dattner & Huppert (2018)) has shown that it is more fruitful to transplant Varah’s idea to the solution level of equation (2). The corresponding approach is referred to as integral estimation and proceeds as follows. Define the integral criterion function


where is the Euclidean norm. A minimizer of (5) over gives a parameter estimator. In practice, the integral has to be discretized and replaced by a sum, so that minimization can be performed using the nonlinear least-squares method,


The NLS solution does not take into account the specific linear form of the ODEs in (3), but uses the general form in (2).

It is at this stage that Varah suggested to utilize separability, without actually investigating such an approach. However, details are easy to work out. Denote

Then, with kept fixed, a minimizer of (5) is given by

where denotes the identity matrix. The notation and emphasizes the dependence of the solution on the nonlinear parameters . This solution is plugged back into (5), yielding the reduced integral criterion function


Once is minimized over and a solution

is obtained, estimators for and follow immediately, and are given (with some abuse of the matrix transpose notation) by


respectively. Note that the nonlinear optimization is applied only for estimating the nonlinear parameters , which, in comparison to the NLS approach, can substantially reduce the dimension of the nonlinear optimization problem.

From the above derivation it is clear that SLS problems are a special class of NLS problems, with linear and nonlinear objective functions for different sets of variables. While the idea of using separability for improving parameter estimation was presented already in Lawton & Sylvestre (1971), it seems that much of the subsequent literature is based on the variable projection method proposed by Golub & Pereyra (1973). Golub & Pereyra (2003) reviewed 30 years of research into this problem.

3. Simulation framework and results

In order to investigate the relative performance of SLS and NLS, we designed and performed a large Monte-Carlo simulation, whose results are presented in this section.

All the computations were done on an Amazon EC2 m5a.4xlarge instance and were carried out in R using the simode package of Yaari & Dattner (2019), which is designed specifically for using separability properties of ODEs. We note in passing that in the context of nonlinear regression, the variable projection method of Golub & Pereyra (1973) is implemented in R in the nls command; see Venables & Ripley (2002), pp. 218–220 for an example of its application; see also the TIMP package of Mullen & van Stokkum (2007). We used default smoothing and optimization settings in simode, and in that respect both SLS and NLS received equal treatment; in particular, simode uses cross-validation (see, e.g., Wasserman (2006)) to determine the optimal amount of smoothing. The code to reproduce our numerical results can be accessed on GitHub.111See Code First Submission For plotting, we relied on the ggplot2 package in R, see Wickham (2009).

3.1. Monte-Carlo study design

We chose several representative and challenging ODE models arising in a variety of scientific disciplines. These were:

  1. SIR for simulating an infection process;

  2. Lotka-Volterra population model with sinusoidal seasonal adjustment;

  3. Generalised Mass Action (GMA) system within BST, e.g., for metabolic pathway systems;

  4. FitzHugh-Nagumo system of axon potentials.

Further mathematical details on these systems and the specific experimental setups we used are given below.

In each case, we generated observations by numerically integrating the system and next adding independent Gaussian noise to the time courses. We considered various parameter setups, sample sizes, and noise levels, as specified below. The ODE parameters were estimated via both NLS and SLS, defined in equations (6) and (8), respectively.

As performance criteria, the time required to perform optimization and accuracy of the resulting parameter estimates were used. While comparing computation times is trivial, numerous options are available for comparing accuracy. We focused on the main difference between the two optimization schemes, namely the way they deal with the estimation of linear parameters. SLS does not require initial guesses for these parameters. By contrast, NLS does require a good initial guess for each linear parameter, or otherwise it might diverge or get stuck in a local minimum: finding “good” solutions to nonlinear optimization problems requires “good” initial guesses in the parameter space. Thus some “prior information” regarding these parameters is of crucial importance for optimization purposes. The key insight is that this prior information is encapsulated in the mathematical form of the ODEs themselves, such as (3). Importantly, while NLS does not take into account the special mathematical features of the ODEs and treats all the parameters in a uniform manner, this is not the case for SLS. Thus, one might a priori expect SLS to be more efficient and possibly more accurate than NLS, when prior information regarding the linear parameters is of low quality. On the other hand, when one has high quality prior information regarding the linear parameters, we expect that SLS and NLS to perform similarly. One might note that the nonlinear parameters in almost all GMA and S-systems are very tightly bounded, usually between -1 and +2, and that their sign is often known, whereas the linear parameters are unbounded in GMA systems and non-negative in S-systems, and nothing is known about their magnitudes (see Chapter 5 of Voit (2000)). Thus, not needing prior information on the linear parameters in SLS can be a tremendous advantage.

For the Monte-Carlo study, we varied the prior information by using high, medium and low quality initial guesses for the parameter values. Here higher quality means that the initial guesses were closer to the truth. To be more specific, the initial guesses for the linear parameters used by NLS were Gaussian randoms variables centered on the true parameter values and having standard deviations equal to the true parameter multiplied by a prior information value (thus the prior information value can also be understood as the coefficient of variation of the “prior distribution”). The specific quantification of “high”, “medium” and “low” is admittedly somewhat subjective, and varied across the different ODE models, as specified below. For the sake of better and faster convergence of the optimization algorithms (especially NLS), the nonlinear parameters were constrained to a given range, and this range was the same no matter how we varied the prior information on linear parameters. Further, in each Monte-Carlo iteration we used exactly the same (pseudo-random) initial guess for nonlinear parameters for both NLS and SLS. Thus, as far as the information on nonlinear parameters is concerned, this was kept invariant for each benchmark model, irrespective of the prior on linear parameters. Consequently, both algorithms received the same prior information regarding nonlinear parameters, and hence none was treated preferentially.

The noise level (signal-to-noise ratio, SNR) we used is defined as follows. For a given solution of an ODE equation, we calculate the standard deviation . Then SNR of, say, and is given by , and , respectively, where is the standard deviation of a Gaussian measurement error as defined in equation (4). We will refer to these SNRs as “low noise” and “high noise”, respectively (cf. Johnstone & Silverman (2005), albeit in a different context). We then compared the mean square errors (MSE) of the resulting parameter estimates, which leads to a valid comparison in statistically identifiable ODE models (see, e.g., Dattner & Klaassen (2015) for relevant definitions and results). As another accuracy measure we used the criteria (5) and (7) evaluated at optimal parameter values. The two criteria we propose, though reasonable, are different. Hence, they are not expected to be in agreement in every experimental setup. However, the global conclusions reached with them in Section 4 are coherent and are in favor of SLS.

We now provide the mathematical details on the models and the experimental setups.

3.1.1. Age-group SIR

The system is an epidemiological model of an SIR-type (Susceptible – Infected – Recovered), and includes age group and seasonal components. The epidemic in each age group and each season is described using two equations for the proportion of susceptible () and infected () individuals in the population (the proportion of recovered individuals is given by ):


The parameters of the model are the transmission matrix , the recovery rate , and , which signify the relative infectivity of, e.g., influenza virus strains circulating in seasons compared to season ( is used as a reference and is fixed at ). As shown in Yaari et al. (2018), taking into account separability characteristics of this model is advantageous for data fitting purposes. Specifically, (9) is nonlinear in the initial values , which are typically unknown and have to be estimated. For our purposes it suffices to consider a model with one age group and one season. The following parameter setup was used: . We considered two sample sizes, and , and two noise levels, and . The prior information used was , corresponding to high, medium and low quality, respectively. The size of the Monte-Carlo study was simulations.

3.1.2. Lotka-Volterra with seasonal forcing

As another benchmark we considered an extension of a classical predator-prey model, namely the Lotka-Volterra model including the seasonal forcing of the predation rate, using two additional parameters that control the amplitude () and phase () of the forcing:

The nonlinear parameters are and . We considered the dynamics within the time interval . The parameter setup is given by

and initial values are . Four experimental scenarios where studied, corresponding to sample sizes of and , and SNRs of and . The prior information values were , corresponding to high, medium and low quality, respectively. The size of the Monte-Carlo study was simulations.

3.1.3. GMA system

The GMA system we analyzed consists of three differential equations in three variables (Voit (2000), pp. 84–85). They are:

Here the linear parameters are the rate constants , while the nonlinear ones are the kinetic orders . Note that the parameters are allowed to become negative and their sign might not be known too. We considered the dynamics of the system within the time interval . The parameter setup is the one presented in Voit (2000), namely

and initial values are . Four experimental scenarios were studied: sample sizes of and , and SNRs of and . The prior information values were , corresponding to high, medium and low quality, respectively. The size of the Monte-Carlo study was simulations. Parameter estimation for GMA systems is considered to be a challenging numerical task (Voit (2000)).

3.1.4. FitzHugh-Nagumo system

The FitzHugh-Nagumo system (FitzHugh (1961), Nagumo et al. (1962), FitzHugh (1969), FitzHugh (1969)

) models spike potential activity in a neuron. It is given by


This system with two state variables was proposed as a simplification of the model presented in Hodgkin & Huxley (1952) for studying and simulating the animal nerve axon. The model is used in neurophysiology as an approximation of the observed spike potential.

The system (10) is linear in parameters and , but nonlinear in . We considered two sample sizes, and , and two SNRs of and . The parameters were set to . The initial values were . The true solutions were obtained over the time interval . The prior information used here was , corresponding to high, medium and low quality, respectively.222The initial guesses for parameters were assured to be positive. The size of the Monte-Carlo study was simulations. Many researchers studied the problem of parameter estimation for the FitzHugh-Nagumo model. In particular, Ramsay et al. (2007), Campbell & Steele (2012) and Ramsay & Hooker (2017) pointed out several difficulties in estimating the parameters for this ODE system.

3.2. Results of the Monte-Carlo analysis

Our findings are presented through charts and tables. The primary summaries are Tables 1 and 2, where we report the ratios of the mean square errors (square errors averaged over Monte-Carlo simulations) for estimates of linear parameters (for nonlinear parameters, see the discussion at the end of this section). Several conclusions can be gleaned from the tables.

  1. Given high-quality prior information, the accuracy of NLS and SLS is comparable, and neither method has a clear lead throughout the variety of experimental setups.333At least some of the differences that one sees from the raw numbers in the tables are plausibly attributable to the Monte-Carlo simulation error and as such appear to be insignificant.

  2. When the quality of prior information degrades to medium or low, the performance of SLS becomes in most cases significantly better than that of NLS (with an extent depending on the specific experimental setup).

  3. For a fixed noise level, as the sample size increases, the advantage of SLS becomes more pronounced.

  4. For a fixed sample size, as the noise level increases, the SLS is still better than NLS, but to a lesser extent.

Low noise High noise
Prior sir ltk gma ftz sir ltk gma ftz
low 8.3 5.8 4.0 2.3 2.8 2.2 2.2 1.3
medium 3.9 1.8 3.1 1.7 1.4 1.2 1.8 1.2
high 0.9 1.3 0.9 2.0 0.4 1.0 0.6 1.1
The table gives the MSE ratios (computed by averaging square errors over Monte-Carlo simulation runs) of NLS and SLS for estimating the linear parameters in various benchmark models and under different experimental setups (see Section 3.1 for detailed specifications). To identify model names, self-explanatory abbreviations are used. The values in the table are rounded off to one digit after zero. The sample size is for the GMA and Lotka-Volterra models, for the FitzHugh-Nagumo system, and for SIR model. The noise levels are and Values larger than in the table correspond to the cases where SLS performs better than NLS. Note the decreasing pattern in the columns, reflecting the effect of the quality of prior information on the performance of NLS.
Table 1. MSE ratios for linear parameters (small samples)
Low noise High noise
Prior sir ltk gma ftz sir ltk gma ftz
low 12.0 9.6 3.9 3.9 3.8 3.2 2.2 2.2
medium 4.1 4.3 2.8 1.9 1.6 1.6 1.6 1.3
high 1.0 2.2 0.7 2.3 0.7 1.2 0.5 1.5
The sample size is for the GMA and Lotka-Volterra models; for the FitzHugh-Nagumo system; and for the SIR model. The noise levels are and For an interpretation of the results, see Table 1. Note an increased advantage of SLS over NLS in comparison to Table 1.
Table 2. MSE ratios for linear parameters (large samples)
Figure 1. The plot gives MSEs on a log scale (computed as averages over Monte-Carlo simulation runs) for linear parameters plotted against the quality of prior information. In the top panel, labeled A, the comparison is on the basis of the noise level. The graph indicates that the performance of NLS worsens with lowering of the quality of prior information. On the other hand, the performance of SLS is not affected by the quality of prior information, in agreement with the experimental design. Except for the rare case of high quality prior information, where NLS is better, SLS clearly outperforms NLS. In the bottom panel, labeled B, the comparison is based on the sample size. The overall pattern is similar to that in the top panel.

These points can also be visualized through a combination of simple statistical charts. Thus, Figure 1 displays the line graphs that compare MSEs of the two methods under several experimental setups. Whereas the numbers in Tables 1 and 2 are ratios of MSEs, in the figures we present the absolute MSE values. From the graphs, an advantage of SLS over NLS is apparent for less than ideal prior information. Note that in this specific setting SLS performed worse than NLS for high quality prior information. A plausible explanation lies in the fact that while under our experimental setup the amount of information used by SLS via (3) is fixed throughout simulations, NLS can in principle receive arbitrarily precise initial guesses on linear parameters. One may therefore envision existence of a point, from where on using the latter kind of information outweighs the benefits of using the structural relationship (3). However, a precise quantification of the phenomenon is hardly possible beyond an observation that it appears to manifest itself in scenarios with excellent knowledge on likely parameter values. In reality, ideal prior information is rare.

The bottom panel of Figure 1 further suggests that in the specific scenarios that we report there, SLS improves when the noise level diminishes; this is unlike NLS in that figure.

Figure 2. The plot visualizes the performance (on a log scale) of NLS and SLS according to criteria (5) and (7), which are evaluated at the optimal parameter estimates. Points in the scatterplot are colored according to the quality of prior information used to compute the NLS estimates. The 45° diagonal line passing through the origin has been added for reference and intuitive assessment. The scatterplot is supplemented with marginal density estimates using the same color coding. The density estimates indicate that, as the quality of prior information degrades, the quality of NLS results suffers, which manifests in longer right tails of the densities. By definition, performance of SLS is not affected by the quality of prior information on linear parameters. For high quality prior information, clustering of losses in the scatterplot close to the reference line suggests that the overall performance of both NLS and SLS is comparable. As the quality of prior information decreases, the point clouds spread to the right, indicating that SLS starts to perform noticeably better than NLS. Furthermore, unlike Tables 1 and 2, the scatterplot and the range frame (see Tufte (2001), pp. 130–132) convey an impression of the variability in the estimation results over multiple datasets: NLS is visually more variable than SLS.

Figure 2 is a scatterplot of NLS and SLS losses (5) and (7) (on a log scale) evaluated at optimal parameter estimates. The figure highlights in yet another way the importance of prior information for NLS: it is evident that the performance of the latter is strongly affected by the quality of initial parameter guesses. Again, NLS and SLS perform similarly when the prior information is of high quality. However, when the quality of prior information is less than ideal, as it is in most applications, NLS becomes substantially worse than SLS. The scatterplot also gives a quick impression of the variability of estimation results.

Figure 3. The plot presents a comparison of NLS and SLS. In the left panel, labeled A, boxplots of the losses (5) and (7) (on a log scale) evaluated at the optimal parameter estimates are displayed. For high quality prior information, the NLS and SLS loss distributions are close. As the quality of prior information degrades, NLS losses start to take higher values compared to SLS, and their variability increases, as evidenced by the elongation of boxplots. In the right panel, labeled B, the computation times are compared. The NLS computation times tend to be longer than those of SLS, and increase as the quality of prior information increases. In both panels, the performance of SLS does not vary with the quality of prior information, in concordance with the experimental design.

The conclusions that we drew from Figure 2 are confirmed by the left panel of Figure 3, which presents boxplots of NLS and SLS losses (on a log scale) measured according to criteria (5) and (7). The pattern is clear: SLS is better than NLS, and the inferiority for NLS becomes more dramatic with degrading prior information.

The right panel of Figure 3 summarizes computation times. SLS is in general much faster. The execution time of NLS is affected by the quality of prior information, and interestingly, increases with this quality.

Low noise High noise
Prior sir ltk gma ftz sir ltk gma ftz
low 23.0 1.7 1.1 1.1 6.6 1.1 1.0 1.0
medium 8.7 1.2 1.0 1.0 2.6 0.9 1.0 1.0
high 1.0 1.0 0.9 1.0 0.4 0.9 0.9 1.0
The table gives the MSE ratios (computed through square errors averaged over Monte-Carlo simulations) of NLS and SLS for estimating the nonlinear parameters. The experimental setup is as in Table 1. Values larger than in the table correspond to the cases where SLS performs better than NLS. Since the prior information regarding nonlinear parameters stays invariant (see Section 3.1 for details), the table in particular shows the effects that the quality of initial guesses for linear parameters has on the estimation accuracy of NLS in the case of nonlinear ones. The results suggest that in some settings a vague prior knowledge on linear parameters may have an adversary effect on the accuracy of NLS for nonlinear ones too.
Table 3. MSE ratios for nonlinear parameters (small samples)
Low noise High noise
Prior sir ltk gma ftz sir ltk gma ftz
low 29.0 4.1 1.1 1.6 6.5 1.3 1.0 1.3
medium 8.2 2.1 0.9 1.0 2.2 1.1 0.9 1.0
high 0.9 1.2 0.8 1.0 0.7 0.9 0.8 1.0
The setup is as in Table 2. For an interpretation of the results see Table 3.
Table 4. MSE ratios for nonlinear parameters (large samples)

Finally, in Tables 3 and 4 we provide information regarding the nonlinear parameters, where in the case of NLS one can observe how the prior knowledge on linear parameters propagates itself into estimation accuracy for nonlinear ones. In particular, for less than ideal prior information on the linear parameters, SLS holds a pronounced edge over NLS also in the case of nonlinear parameters.

4. Conclusions and outlook

In this work, we designed an extensive simulation study to explore the relative statistical and computational performance of two optimization schemes for inference in dynamic systems: the typical nonlinear least-squares (NLS) method and a new separable nonlinear least-squares (SLS) approach. As benchmarks, we considered several widely used ODE models arising in a variety of scientific fields. We measured statistical performance of the two methods by the mean square error (MSE) of the estimates. As another performance criterion, we employed the loss function values at the optimal parameter estimates. Computational performance of the methods was also compared by the execution times required to complete each optimization.

A pattern that emerged from our study is that SLS in general performs at least as well as, and frequently better than NLS, especially if the prior information on optimal parameters is not ideal, which is typically the case in practice. This statement is uniformly true over all models tested.

Our recommendation therefore is that parameter estimation problems for complex dynamic systems should be addressed, whenever the system contains an appreciable number of linear parameters, with the separable nonlinear least-squares method, rather than the more commonly used nonlinear least-squares method.

While the above message is simple and unambiguous, one must stress that data fitting in complex dynamical systems remains a challenging problem that cannot be treated in a cavalier fashion, even if one takes advantage of separability. For instance, in order to uncover the patterns in Section 3 of this work, we had to carefully design the experimental study, because otherwise simulations might not have converged, or might have converged to poor solutions. This was true for both NLS and SLS, but whenever they were observed, convergence issues were much more severe for NLS (especially sensitive was the case of the FitzHugh-Nagumo system). This result highlights the crucial role of prior information regarding the parameters, or expressed differently, the quality of the initial parameter guesses used for the optimization. We focused here primarily on the effects of the prior information on the linear parameters. However, it also became clear that prior information on the nonlinear parameters has an equally crucial role for optimization purposes, this being true for both NLS and SLS (data not shown).

As a result of our exploratory work, we envision the following promising research directions for the future.

  1. Numerical implementation of SLS for dynamic systems. All the computations in our paper were done in R using the simode package of Yaari & Dattner (2019). However, the idea of using separability properties of ODEs is independent of a particular programming language and can be implemented within other software packages quite as well. Indeed, much work has been done in the context of the variable projection method since it was first introduced in Golub & Pereyra (1973). We are aware of the TIMP package of Mullen & van Stokkum (2007), which implements the variable projection method. Thus, a next step could be to combine the strengths of both packages, simode and TIMP, in order to develop advanced software for variable projection in the context of dynamic systems.

  2. Customized algorithms for specific classes of complex dynamical systems. It is well-known that the performance of an optimization scheme depends crucially on the underlying mathematical model used for description of the data. Thus, it appears that different classes of dynamic models require specific algorithms tailored to their peculiarities. For instance, parameter estimation for GMA systems has different challenges than those encountered when working with SIR (see Section 3). We expect that there is much to gain from focusing future research on specific classes of models and developing stable algorithms for their parameter estimation.

  3. Theoretical properties of SLS in the context of dynamic systems. Gugushvili & Klaassen (2012) studied the statistical properties of NLS in the general context of smoothing, while Dattner & Klaassen (2015) specifically addressed ODE systems that are linear in (functions of) the parameters. One might expect that some assumptions used in Gugushvili & Klaassen (2012) can be relaxed when the problem is closer to the one considered in Dattner & Klaassen (2015).

  4. Extensions to partially observed, high-dimensional, and misspecified dynamic systems. Recent work dealing with inference in high-dimensional ODE models suggests that exploiting linearity in parameters is crucial for developing a successful estimation methodology (see, e.g., Chen et al. (2017) and Wu et al. (2019)). More generally, it would be interesting to study with the variable projection method the cases of partially observed, high-dimensional, and possibly misspecified dynamic systems. This might additionally require the use of high-dimensional regularization techniques (e.g., Chen et al. (2017)) for balancing data and model, and specifically taking into account a potential model misspecification (see Ramsay et al. (2007)).


  • Basu & Bresler (2000) Basu, S., & Bresler, Y. (2000). The stability of nonlinear least squares problems and the Cramér-Rao bound. IEEE Transactions on Signal Processing, 48(12), 3426–3436.
  • Brunel (2008) Brunel, N. J. B. (2008). Parameter estimation of ODE’s via nonparametric estimators. Electron. J. Stat., 2, 1242–1267.
  • Campbell & Steele (2012) Campbell, D., & Steele, R. J. (2012). Smooth functional tempering for nonlinear differential equation models. Stat. Comput., 22(2), 429–443.
  • Chen et al. (2017) Chen, S., Shojaie, A., & Witten, D. M. (2017). Network reconstruction from high-dimensional ordinary differential equations. Journal of the American Statistical Association, 112(520), 1697–1707.
  • Chou & Voit (2009) Chou, I.-C., & Voit, E. O. (2009). Recent developments in parameter estimation and structure identification of biochemical and genomic systems. Math. Biosci., 219(2), 57–83.
  • Chung & Nagy (2010) Chung, J., & Nagy, J. G. (2010). An efficient iterative approach for large-scale separable nonlinear inverse problems. SIAM J. Sci. Comput., 31(6), 4654–4674.
  • Dattner (2015) Dattner, I. (2015). A model-based initial guess for estimating parameters in systems of ordinary differential equations. Biometrics, 71(4), 1176–1184.
  • Dattner & Gugushvili (2018) Dattner, I., & Gugushvili, S. (2018). Application of one-step method to parameter estimation in ODE models. Stat. Neerl., 72(2), 126–156.
  • Dattner & Huppert (2018) Dattner, I., & Huppert, A. (2018). Modern statistical tools for inference and prediction of infectious diseases using mathematical models. Stat. Methods Med. Res., 27(7), 1927–1929.
  • Dattner & Klaassen (2015) Dattner, I., & Klaassen, C. A. J. (2015). Optimal rate of direct estimators in systems of ordinary differential equations linear in functions of the parameters. Electron. J. Statist., 9(2), 1939–1973.
  • Dattner et al. (2017) Dattner, I., Miller, E., Petrenko, M., Kadouri, D. E., Jurkevitch, E., & Huppert, A. (2017). Modelling and parameter inference of predator–prey dynamics in heterogeneous environments using the direct integral approach. Journal of The Royal Society Interface, 14(126), 20160525.
  • Erichson et al. (2018) Erichson, N. B., Zheng, P., Manohar, K., Brunton, S. L., Kutz, J. N., & Aravkin, A. Y. (2018). Sparse principal component analysis via variable projection. arXiv e-prints.
  • Fan & Gijbels (1996) Fan, J., & Gijbels, I. (1996). Local polynomial modelling and its applications, vol. 66 of

    Monographs on Statistics and Applied Probability

    London: Chapman & Hall.
  • FitzHugh (1961) FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophysical Journal, 1(6), 445–466.
  • FitzHugh (1969) FitzHugh, R. (1969). Mathematical models of excitation and propagation in nerve. In H. P. Schwan (Ed.) Biological engineering, vol. 9 of Inter-university electronics series, chap. 1, (pp. 1–85). New York, NY: McGraw-Hill.
  • Gan et al. (2018) Gan, M., Chen, C. L. P., Chen, G., & Chen, L. (2018). On some separated algorithms for separable nonlinear least squares problems. IEEE Transactions on Cybernetics, 48(10), 2866–2874.
  • Gennemark & Wedelin (2007) Gennemark, P., & Wedelin, D. (2007). Efficient algorithms for ordinary differential equation model identification of biological systems. IET Systems Biology, 1(2), 120–129.
  • Girolami & Calderhead (2011) Girolami, M., & Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. With discussion and authors’ reply. J. R. Stat. Soc., Ser. B, Stat. Methodol., 73(2), 123–214.
  • Golub & Pereyra (2003) Golub, G., & Pereyra, V. (2003). Separable nonlinear least squares: The variable projection method and its applications. Inverse Probl., 19(2), 1–26.
  • Golub & Pereyra (1973) Golub, G. H., & Pereyra, V. (1973). The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM J. Numer. Anal., 10(2), 413–432.
  • Green & Silverman (1994) Green, P. J., & Silverman, B. W. (1994). Nonparametric regression and generalized linear models: a roughness penalty approach, vol. 58 of Monographs on Statistics and Applied Probability. London: Chapman & Hall.
  • Gugushvili & Klaassen (2012) Gugushvili, S., & Klaassen, C. A. J. (2012). -consistent parameter estimation for systems of ordinary differential equations: bypassing numerical integration via smoothing. Bernoulli, 18(3), 1061–1098.
  • Himmelblau et al. (1967) Himmelblau, D., Jones, C., & Bischoff, K. (1967). Determination of rate constants for complex kinetics models. Industrial & Engineering Chemistry Fundamentals, 6(4), 539–543.
  • Hodgkin & Huxley (1952) Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4), 500–544.
  • Johnstone & Silverman (2005) Johnstone, I. M., & Silverman, B. W. (2005). Empirical Bayes selection of wavelet thresholds. Ann. Stat., 33(4), 1700–1752.
  • Lawton & Sylvestre (1971) Lawton, W. H., & Sylvestre, E. A. (1971). Elimination of linear parameters in nonlinear regression. Technometrics, 13(3), 461–467.
  • Lotka (1956) Lotka, A. J. (1956). Elements of mathematical biology. New York: Dover Publications, Inc. Unabridged republication of the first edition published under the title: Elements of physical biology.
  • May (2001) May, R. M. (2001). Stability and complexity in model ecosystems. With a new introduction by the author. Princeton, NJ: Princeton University Press, 2nd ed.
  • McGoff et al. (2015) McGoff, K., Mukherjee, S., & Pillai, N. (2015). Statistical inference for dynamical systems: A review. Statist. Surv., 9, 209–252.
  • Mullen & van Stokkum (2007) Mullen, K., & van Stokkum, I. (2007). TIMP: An R package for modeling multi-way spectroscopic measurements. Journal of Statistical Software, 18(3), 1–46.
  • Mullen (2008) Mullen, K. M. (2008). Separable nonlinear models: theory, implementation and applications in physics and chemistry. Phd thesis, Vrije Universiteit.
  • Nagumo et al. (1962) Nagumo, J., Arimoto, S., & Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proceedings of the IRE, 50(10), 2061–2070.
  • Peschel & Mende (1986) Peschel, M., & Mende, W. (1986). The predator-prey model: do we live in a Volterra world?. Springer.
  • Ramsay & Hooker (2017) Ramsay, J., & Hooker, G. (2017). Dynamic data analysis. Modeling data with differential equations. Springer Series in Statistics. New York, NY: Springer.
  • Ramsay et al. (2007) Ramsay, J. O., Hooker, G., Campbell, D., & Cao, J. (2007). Parameter estimation for differential equations: a generalized smoothing approach. J. R. Stat. Soc., Ser. B, Stat. Methodol., 69(5), 741–796.
  • Ruhe & Wedin (1980) Ruhe, A., & Wedin, P. Å. (1980). Algorithms for separable nonlinear least squares problems. SIAM Review, 22(3), 318–337.
  • Savageau (1976) Savageau, M. A. (1976). Biochemical systems analysis. A study of function and design in molecular biology, vol. 6739 of Advanced Book Program. London etc.: Addison-Wesley Publishing Company.
  • Schittkowski (2002) Schittkowski, K. (2002). Numerical data fitting in dynamical systems. A practical introduction with applications and software. Dordrecht: Kluwer Academic Publishers.
  • Tufte (2001) Tufte, E. R. (2001). The visual display of quantitative information. Cheshire, CT: Graphics Press, 2nd ed.
  • Varah (1982) Varah, J. (1982). A spline least squares method for numerical parameter estimation in differential equations. SIAM Journal on Scientific and Statistical Computing, 3(1), 28–46.
  • Venables & Ripley (2002) Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S. Statistics and Computing. New York, NY: Springer, 4th ed.
  • Vilela et al. (2007) Vilela, M., Borges, C. C. H., Vinga, S., Vasconcelos, A. T. R., Santos, H., Voit, E. O., & Almeida, J. S. (2007). Automated smoother for the numerical decoupling of dynamics models. BMC Bioinformatics, 8(1), 305.
  • Vissing Mikkelsen & Hansen (2017) Vissing Mikkelsen, F., & Hansen, N. R. (2017). Learning large scale ordinary differential equation systems. arXiv e-prints.
  • Voit (2000) Voit, E. O. (2000). Computational analysis of biochemical systems: A practical guide for biochemists and molecular biologists. Cambridge University Press.
  • Voit (2013) Voit, E. O. (2013). Biochemical systems theory: a review. ISRN Biomathematics, 2013, Article ID 897658.
  • Volterra (1926) Volterra, V. (1926). Fluctuations in the abundance of a species considered mathematically. Nature, 118, 558–560.
  • Vujačić et al. (2015) Vujačić, I., Dattner, I., González, J., & Wit, E. (2015). Time-course window estimator for ordinary differential equations linear in the parameters. Stat. Comput., 25(6), 1057–1070.
  • Wasserman (2006) Wasserman, L. (2006). All of nonparametric statistics. Springer Texts in Statistics. New York, NY: Springer.
  • Wickham (2009) Wickham, H. (2009). ggplot2. Elegant graphics for data analysis. Use R! New York, NY: Springer.
  • Wu et al. (2019) Wu, L., Qiu, X., Yuan, Y.-x., & Wu, H. (2019). Parameter estimation and variable selection for big systems of linear ordinary differential equations: A matrix-based approach. J. Amer. Statist. Assoc., 114(526), 657–667.
  • Yaari & Dattner (2018) Yaari, R., & Dattner, I. (2018). simode: R Package for statistical inference of ordinary differential equations using separable integral-matching. arXiv e-prints.
  • Yaari & Dattner (2019) Yaari, R., & Dattner, I. (2019). simode: Statistical inference for systems of ordinary differential equations using separable integral-matching. R package version 1.1.4.
  • Yaari et al. (2018) Yaari, R., Dattner, I., & Huppert, A. (2018). A two-stage approach for estimating the parameters of an age-group epidemic model from incidence data. Stat. Methods Med. Res., 27(7), 1999–2014.