I Introduction
Datadriven system identification [19] is omnipresent in many scientific and engineering domains: Given timeseries data describing observed system behaviour, the task is to find a model, i.e., identify an appropriate model structure and estimate the values of the model parameters. Identifying the model structure is a challenging problem in many practical applications, especially in domains where structure inference from first principles is not an option. In such cases, one can opt for the blackbox or the graybox modeling paradigm.
In blackbox modeling, we conjecture that the appropriate model structure belongs to a general class of structures, such as neural networks and fuzzy models
[16] or regression trees [1]. While nonlinear blackbox models can achieve highly accurate reconstruction of the observed system behavior, they do not reveal the structure of the observed system. Namely, there is no clear correspondence between the assumed model structure (e.g., neural network) and the structure of the system at hand.Graybox models, on the other hand, aim primarily at revealing the system structure. They are usually formalized as equations that scientists and engineers can comprehend and relate to [20]. To construct graybox models, methods for symbolic regression [15, 24] and equation discovery [32]
are commonly used. Symbolic regression addresses the graybox modeling task with evolutionary methods, i.e., genetic algorithms
[24]. Following these, equation structures are evolved from an initial set of variables, operators, functions and constants. Users have little or no control over the space of equation structures considered during search. Thus, the obtained models often include too complex and incomprehensible equations that make little contact with the modeling knowledge in the domain of use. Hence, the mapping between the structure of the model equations and the system structure is nontrivial and has to be inferred by human modelers.Equation discovery methods [18], on the other hand, allow the user to employ domainspecific knowledge and specify the appropriate space of candidate models. To this end, different formalisms for specifying equation fragments as components for building mathematical models have been proposed [9]. While these formalisms can, in principle, be used in the context of evolutionary methods [22], symbolic regression approaches rarely use them. Instead, equation discovery approaches often rely upon exhaustive or greedy search methods [31]. The most recent equation discovery approach, processbased modeling (PBM), is based on a knowledge representation formalism that establishes an explicit mapping between the model structure and the structure of the observed system [32, 5]. It is closely related to componentbased approaches for computeraided design of electronic circuits [2], where models are composed from a library of standardized, wellcharacterized circuit components. These are with fixed constant parameters, used for manual model construction, where the modeler establishes a configuration of model components. PBM employs a library of similar components in the context of computeraided inputoutput modeling, where the selection of an appropriate configuration of components (and the values of their parameters) is based on outputerror fit against measured data.
Processbased modeling has been applied in several practically relevant case studies. However, so far it has only been evaluated partially in limited setting and has not been systematically evaluated on a proper system identification benchmark yet. On one hand, PBM has been used to establish new models of aquatic ecosystems from data and domainspecific knowledge but with no reference to other known or wellestablished models [3]. On the other hand, PBM has been used to reconstruct the results of manual modeling efforts in the domain of systems biology [29]. These efforts of establishing new and reconstructing existing models, used ProBMoT, a software tool for processbased modeling [33]. Through the performed experiments, we have identified a number of design decisions that lead to an improved version of ProBMoT.
In this paper, we perform a systematic evaluation of the latest version of ProBMoT on a cascaded watertank system [35]
, a wellknown benchmark for nonlinear system identification. In particular, we analyze the robustness of PBM on synthetic data with different noise variances obtained by simulating the benchmark model. Furthermore, we also evaluate the PBM ability to reconstruct a valid model of the system dynamics from measurement data. Therefore, we present here a first proper evaluation of processbased modeling in the context of nonlinear system identification.
The remainder of the paper is organized as follows. Section discusses the relation of processbased modeling to alternative graybox modeling approaches. Section introduces the approach of processbased modeling by illustrating it on an example of modeling cascading water tanks. Section elaborates the design of the modeling experiments, the results of which are reported and discussed in Section . Finally, Section concludes the paper with a summary and a brief outline of several directions for further research.
Ii Related work
Iia Symbolic regression approaches
A number of graybox modeling approaches to system identification aim at both structure and parameter identification of an observed system. Most notably, symbolic regression performed with genetic programming
[17] has often been used to approach system identification [15]. The basic idea of performing symbolic regression with genetic programming is to evolve populations of equations, where the fitness function being optimized measures the discrepancy between the observed data and the simulation of the equations. The equations are represented as tree structures, where the internal nodes correspond to mathematical operations and functions, while terminal nodes correspond to system variables and model parameters.The modeler can control the space of candidate models considered by genetic programming, by specifying the set of primitive operators and functions that can be used in the internal nodes of the evolving trees. In addition, the modeler can also specify a heuristic for model selection that goes beyond the usual degreeoffit measure mentioned above. Madar et al.
[21] propose the use of parsimony principle when designing the fitness function by introducing a penalty for the complexity of the model equations. RodriguezVazquez et al.[23] and Ferariu & Patelli [12] propose multiobjective approaches to follow the parsimony principle, where different aspects of degreeoffit and model complexity are considered as objective functions.The common limitation of these methods is the coarse control given to the modeler over the space of candidate models. Often, modeling assumptions can not be easily or comprehensibly encoded as inputs to the genetic programming approaches: This leads to a risk of inferring models with implausible structure and/or parameter values. To address this issue, Whigham & Recknagel [34] and Cao et al. [7] propose approaches where the space of model structures, considered by genetic programming, is constrained by using grammars [22]. While the grammars allow for flexible specification of the space of candidate models, their use is cumbersome and different for systems’ modelers. In addition, the mapping between the structure of the obtained model and that of the observed system is nontrivial to establish: The inference of the mapping is left to the human modeler.
IiB Componentbased approaches
On the other hand, the mapping between the structure of the model and observed system is explicit in componentbased approaches [2], where models are composed from a library of standardized, wellcharacterized circuit components. Note that the components have fixed constant parameter values, that correspond to physical circuit components and are used for manual construction and design of electrical circuits. No attempts have been made to use libraries of components in the context of computeraided modeling from knowledge and input/output observational data.
While componentbased approaches rely on physical components, blockoriented nonlinear approaches [14] build models out of domainindependent linear, dynamic and nonlinear, static components. In turn, such blocks can be combined into different structures, where each block models a certain aspect of the system dynamics. While some of these combinations are well known and correspond to physical phenomena [27], in principle, different combinations and variations of such components can yield a large number of different structures without a direct mapping to the structure of the observed dynamics at hand. Also, blockoriented approaches usually focus on the manual construction of models, where the block configuration is a priori prescribed by the modeler [26]. M.Schoukens et al. [28] and J.Schoukens et al. [25] propose an approach that requires some modeler’s decision regarding model construction by decomposing the structure identification task. Here, an initial model structure is established first by evaluating the performance of different combinations of linear components using best linear approximation against the nonlinear measured output. The identified initial structure, is then used by the modeler as the basis for subsequent steps of adding additional nonlinear blocks to complete model structure.
IiC Blackbox approaches for graybox modeling
Recently, some approaches to nonlinear blackbox modeling have been also applied in the context of inferring graybox models. In particular, Brunton et al. [6]
combine linear regression with the construction of nonlinear terms and Lasso regularization to obtain nonlinear equationsbased models that can be considered graybox. The system SINDy introduces nonlinear terms by using a set of userspecified parameterfree transformations of the original system variables. These nonlinear terms are linearly combined with parameters estimated using simple linear regression. In this respect, SINDy is similar to the early equation discovery system LAGRANGE
[11]. A novelty in SINDy with the respect to LAGRANGE is the Lasso regularization, used to select terms that are relevant for modeling the observed data. Note however, that both methods are limited to inferring models with structure that is linear with respect to the parameters. Genetic programming, grammarbased equation discovery and processbased modeling are more general approaches that can infer models with structure that is nonlinear with respect to the model parameters.Iii Processbased modeling
Processbased modeling (PBM) is an approach to Equation Discovery that allows the modeler to formalize and use modeling knowledge specific to the domain of interest. We are going to introduce the PBM approach in the two following subsections. The first subsection introduces a general framework for representing modeling knowledge. In the second subsection, we present the method that employs the formalized knowledge to discover continuoustime models from observation data.
Iiia Formalization of the modeling knowledge
The PBM framework for formal representation of modeling knowledge is based on two general concepts of entities and processes. Entities correspond to the static components of the observed system, while processes relate to the dynamic interactions among entities. Each entity includes variables and constants that denote timevarying and timeinvariant properties of the corresponding system component, respectively. While entity constants correspond to constant (timeinvariant) model parameters, entity variables correspond to the state, input and output variables of the model. The system state is therefore represented with the set of the state variables included in the model entities. On the other hand, each process specifies the interacting entities, process constants and equations. The latter provide the mathematical model of the process influence on the interacting entities, more specifically, on the temporal change, i.e., time derivatives of their state variables. Each process equation can include constants and both system and input/output variables of the interacting entities as well as process constants. When several processes influence the same system variable, their mathematical models are summed up to obtain the equation for modeling the dynamics of the particular variable.
Figure 1 provides an illustrative example of a processbased model of a dynamic system consisting of two cascaded water tanks. It consists of three entities and three processes as shown in Figure 1b. The three entities correspond to the three system components depicted in Figure 1a: the two tanks and the water pump that fills the upper tank. Each tank entity includes a single state (endogenous) variable h, corresponding to the height of the water level in the tank, as well as two constants A and a denoting the areas of the tanks and their effluent areas, respectively. The third entity pump includes the input (exogenous) variable v representing the input voltage signal applied to the pump: the higher the voltage – the higher the water inflow in the upper tank. The voltagetoflow conversion constant of the pump is denoted with k. The three processes correspond to the entity interactions, which, in this particular example, represent the water flow between the tanks and the pump. The first process inflow models the flow from the pump to the upper tank, the second process valveTransmission models the flow from the upper to the lower tank, and the third models the outflow of water from the lower tank and the system. The latter two processes assume a squareroot influence of the height of the water in the tanks on the intensity of the tank valve outflow. The single constant parameter of these two processes specifies the valve transition constant G that equals (where is gravitational acceleration).
Processbased models provide additional structure to the model equations with a modular decomposition of the equations into building blocks. These correspond to the knowledge about building models in the domain of interest. In the particular example, the correspondence between the physical structure of the system and the processbased model structure is obvious and explicit. The benefits of this property are twofold. On one hand, it improves the descriptive power of the equations by adding an explanatory layer on top of them. In particular, the equations in Figure 1c do not explicitly reveal the meaning of the repetitive expression
. In contrast, the processbased model ties that expression to the flow from the upper to the lower tank. On the other hand, and even more importantly, the decomposition into building blocks provides means for formulating the task of inducing processbased model from data as a combinatorial optimization task. The latter is then constrained to those models that have plausible structure and are therefore acceptable to domain experts. Note, finally, that the processbased models can be easily and automatically rewritten into the standard mathematical form of differential equations that allows for model simulation and analysis.
The processbased formalism for representing modeling knowledge follows the model decomposition into building blocks. In particular, a modeling expert can specify the set of available building blocks in the domain of interest as a library of entity and process templates. The templates represent generic building blocks that, in a particular modeling scenario, can be instantiated into specific components of the model of the observed system. Each entity template prescribes the set of constant and variable properties important for modeling the corresponding static system component in the domain. Similarly, the process templates are generic interactions and prescribe template mathematical expressions for modeling them.
template entity Tank { 
vars: h {aggregation:sum, range:<0,500>}; 
template entity Pump { 
vars: v; 
consts: k {range:<1.0E3,30>};} 
template process Inflow(p: Pump, t: Tank){ 
equations: ; } 
template process ValveTransmission(t1: Tank, t2: Tank) { 
consts : G {range:<0,10>}; // } 
template process SquareRoot:ValveTransmission { 
template process Linear:ValveTransmission { 
template process Exponential:ValveTransmission { 
template process Outflow(t: Tank) { 
consts : G {range:<0,10>}; 
template process SquareRoot:Outflow { 
equations: ;} 
template process Linear:Outflow { 
equations: ;} 
template process Exponential:Outflow { 
equations: ;} 
Table I provides an example library of template entities and processes for modeling systems of connected water tanks and pumps. The two generic entities of Tank and Pump can represent an arbitrary component of any system in the watertanks domain. The entities tank1 and tank2 in the example processbased model from Figure 1b represent two specific instances of the template entity Tank. Note that the modeling knowledge encoded in the template entities include specifications of the constant and variable properties of the corresponding system components with their operational ranges rather than specific values. The aggregation: sum declaration specifies that when an entity Tank participates in more than one process, the multiple influences of the different processes on the entity variable h are summed up.
The template processes are organized in a hierarchy. In Table I, the ValveTransmission template represents the root node of a hierarchy with three descendants, i.e. SquareRoot, Linear and Exponential, that correspond to three alternative mathematical models of valve transmission. The first assume the squareroot, the second linear and the third exponential influence of the water level in the tank on the tankvalve outflow. Similarly, the Outflow hierarchy of template processes specifies three alternative models of water flow from the system to its environment.
The processbased library provides a general recipe for building models in the domain of use. Following the encoded knowledge, models of arbitrary complexity can be established: They can include an arbitrary number of entities and interactions among them. In particular, the library from Table I can be reused for modeling watertanks systems with an arbitrary number and configurations of tanks, pumps and flows among them and the system environment. Such knowledge also allows the modeler to explore various modeling assumptions. In situations where the valve transmission function is known, one can prescribe that the specific process should be an instance of, e.g., the SquareRoot template. On the other hand, if the valve transmission function is unknown and should be induced from the observational data, one should prescribe the root template of the hierarchy, i.e., ValveTransmission.
The alternative configurations of model components and corresponding mathematical expressions can be automatically explored by the algorithm for inducing processbased models from data and knowledge, introduced in the following subsection.
IiiB Inducing processbased models
Figure 2 outlines the ProBMoT algorithm for inducing processbased models, which takes three inputs. The first input is the processbased library introduced in the previous subsection. The second input specifies the modeling scenario including the specific entities that appear in the observed system, the known configuration of the interactions among them and the modeling assumptions. The third is the training and validation data sets of measurements of the state and input/output variables of the observed system. Given its inputs, ProBMoT proceeds in three steps: (1) instantiating the templates from the library into specific model components, (2) generating the candidate model structures corresponding to combinations of the model components, and (3) estimating the constant parameters and the fit of each model structure against data. Finally, the output of ProBMoT is a list of continuoustime processbased models ranked according to their performance, e.g. the discrepancy between the model simulation and the validation data set.
In the first step, given the entities of the observed system and the modeling assumptions, each template process is instantiated into a number of specific instance processes. For example, given two tank entities, the template process ValveTransmission has six instances. The first three of them correspond to the situation where the water flows form the first to the second tank, with each of the three corresponding to one valve transmission form. The remaining three correspond to the flows from the second task to the first. Table II introduces a specific modeling scenario for an observed system consisting of a cascade of a pump and two water tanks: the pump fills the first tanks, there is a flow between the first and the second tank, and the water of the second tank outflows into the system environment. In this particular scenario, the Inflow generic process has a single instance and each of the ValveTransmission and Outflow process has three instances.
entity tank1 : Tank { 
vars: h {role: endogenous; initial: 0.38086;}; 
entity tank2 : Tank { 
vars: h {role: endogenous; initial: 0.20508;}; 
entity pump : Pump { 
vars: v {role: exogenous;}; 
consts: k = null;} 
process inflow(pump,tank1): Inflow {} 
process valveTransmision(tank1, tank2): 
ValveTransmision { consts: G=4.429;} 
process outflow(tank2): Outflow { consts: G=4.429;} 
The seven process instances represent components for building processbased models of the observed system in the second step of the algorithm. Finding an optimal model is equivalent to the task of finding an optimal combination of the model components. This is a task of combinatorial optimization, which can be approached by an arbitrary algorithm for combinatorial search. In the particular modeling scenario, from the seven model components, we need to compose a model that has an Inflow, ValveTransmission and an Ouflow component. This leads to candidate combinations, so we employ exhaustive search, i.e., enumerate all the candidates. Alternatively, when facing more complex modeling scenarios leading to large search spaces, one can employ scalable method for combinatorial optimization that perform incomplete, heuristic search through the space of candidate combinations. To this end, symbolic regression methods often employ evolutionary approaches, such as genetic algorithms [24], while equation discovery methods employ greedy search algorithms [31].
ValveTransmision  SquareRoot  SquareRoot  SquareRoot  Linear  Linear  Linear  Exponential  Exponential  Exponential [t] 
Outflow  SquareRoot  Linear  Exponential  SquareRoot  Linear  Exponential  SquareRoot  Linear  Exponential [b] 
Model  SS  SL  SE  LS  LL  LE  ES  EL  EE 
In the final, third step of the algorithm, ProBMoT evaluates each candidate model structure by estimating the values of its constant parameters. Recall that the list of constant parameters and the ranges of their values are included in the processbased library. ProBMoT approaches the parameter estimation as a numerical optimization task. The objective function for numerical optimization is the output error, i.e., the discrepancy between the simulated model response and the measured system response in the training data set [4], defined as:
(1) 
In Eq. (1) denotes the number of measurements, and correspond to the measured and simulated values of the output variables and denotes the mean value of in the data set. ProBMoT employs the CVode library [8]
for simulating continuoustime models in the form of ordinary differential equations. CVode implements a generalpurpose ODE solver with linear multistep variablecoefficient methods for integration. Note that we perform longterm simulation of the model equations given the initial conditions only: This is in contrast to most simulation approaches that rely on valuecorrections of the simulation response at each simulation step
[36]. Given that the objective function includes simulation of nonlinear continuoustime models and an outputerror objective function, ProBMoT employs nonlinear methods for numerical optimization. In particular, it uses the Differential Evolution algorithm as implemented in the metaheuristic optimization framework jMetal [10]. Differential evolution is arguably a powerful stochastic optimization algorithm with encouraging results related to its asymptotic convergence towards a global optima [13].Iv Experimental design
To properly test the hypothesis that the processbased modeling approach can accurately perform nonlinear system identification, we propose investigating the two aspects of: (1) structure identification and (2) parameter estimation. To this end, the experiments consider modeling a cascading water tank system with two tanks as shown in Figure 1a, where we focus on modeling the dynamics of the lower tank . In the first line of experiments, we investigate ProBMoT’s ability to discriminate among different instantiated candidate structures. Next, we analyze its ability to correctly reconstruct the values of the model parameters.
In particular, we consider two modes of operation of ProBMoT: (1) multistage mode – Multistage ProBMoT and (2) singlestage mode – Singlestage ProBMoT. The former, relates to a standard approach in system identification where the problem is decomposed so that the dynamics of each signal is modeled separately [25]. Namely, using ProBMoT we first attempt to establish the structure and parameters of model of the first tank, using the signal in the learning process. In turn, the best performing (intermediate) model is used for establishing the structure and parameters of the lower tank using the signal . On the other hand, Singlestage ProBMoT, refers to the standard operation of ProBMoT, where it attempts to learn the whole model for both signals and simultaneously.
Regarding the data, in our experiments we consider both synthetic data, by varying the values of the signals and at each time point, as well as measured data [35]. The latter consists of 2500 measurements of the input signal and the two output signals, i.e., the water levels of the two tanks, and . To properly evaluate model performance, we split these data into a train, a validation, and a test set, consisting of 1000, 500 and 1000 measurements, respectively. In particular, we learn the model structure and parameters using the train data, we select an optimal model with respect to its performance on a validation data set, and evaluate the performance of the selected model on the test set.
The synthetic data is generated by simulating the model from Fig 1b using arbitrarily selected parameter values of , , , and . For the input signal , we used the real measurements, while for and , we used the simulated responses with Gaussian noise added with the equation , where
denotes a normaldistribution random variable with mean 0 and variance
. Note that we do not add noise to the input signal, since it comes from the measurements. We generate five different sets of synthetic responses with variance of 0.01, 0.02, 0.05, 0.1 and 0.2, where the uncorrupted responses correspond to variance 0. This yields six different synthetic experimental cases/data sets.In both the singlestage and the multistage modes, ProBMoT uses the processbased library of domain knowledge for modeling cascadingtank dynamics, presented in Table I. The library combined with the modeling scenario given in Table II, yields 9 candidate model structures shown in Table III. For instance, model SS denotes a model where the processes ValveTransmission and Outflow have sublinear SquareRoot dynamics. Note that, the artificial data was generated with the model SS with the parameter values stated in the previous paragraph. The different modeling scenarios used in the experiments with singlestage and the multistage ProBMoT are given in Appendix A.
Furthermore, the following parameter settings for Differential Evolution were used: a population size of 60, strategy rand/1/bin, differential weight (
) and a crossover probability (
) of 0.9. The limit on the number of evaluations of the objective function is per parameter. The particular parameter setting of DE is based on previous studies [30]. Finally, for simulating the models, we used the CVode simulator [8] with a standard setting of absolute and relative tolerances, and , respectively. For generating the synthetic data as well as perform model simulation in ProBMoT, we use the backward differentiation method combined with Newton iteration and a preconditioned Krylov method with steps implemented in [8].V Results
We investigate the ability of ProBMoT to reconstruct the correct models in both a synthetic and a realworld setting. In the first line of experiments, we examine ProBMoT’s ability to recover the groundtruth model SS from synthetic data. In particular, we present performance comparisons between the two modes of ProBMoT, Multistage ProBMoT and Singlestage ProBMoT, in terms of and on the six synthetic experimental cases. Next, we investigate the ability of the two modes of ProBMoT to estimate the groundtruth parameters from the synthetic data. In the second line of experiments, we test the ability of ProBMoT to identify the structure and parameter of the model from measurements of the watertanks system.
Va Results on synthetic data
Figure 3 depicts the errors of the models learned with Multistage ProBMoT. The top two graphs in Figure 3 refer to the errors of the three "intermediate" candidate models of the dynamics of considered in the first stage. Figure 3a shows that ProBMoT can identify the structure of the process ValveTransmission for correctly. Moreover, Figure 3b shows that the identified intermediate model with a SquareRoot alternative of the process ValveTransmission for the top tank has a constant and substantially better test performance than the other two models (with the Linear or Exponential alternative).
The two bottom graphs in Figure 3 depict the errors of the three final models that complete the correctly identified intermediate model of in the first stage. Figure 3c shows that ProBMoT selects the correct complete model SS for all noise variances. In particular, the validation error of the model SS in the noisefree scenario is almost percentage points lower than the second ranked model SL, and almost percentage points lower for the 0.2 noise variance. This result is also inline with the test performance of the SS model, depicted in Figure 3d, which, in general, performs up to percentage points better than the second ranked model SL and up to percentage points better than the thirdranked SE.
While Multistage ProBMoT is able to correctly identify the groundtruth model, it follows a procedure that requires an intermediate intervention by the modeler. The decomposition and the ordering of stages is to be provided by a human expert that, in turn, has to run ProBMoT for each stage separately. In contrast, Singlestage ProBMoT does not require any intervention by a modeler. Its identification capabilities are presented in Figure 4. Singlestage ProBMoT explores the complete space of nine candidate model structures, using both signals and as input, and ranks the models according to the summed error .
Figure 4a shows that Singlestage ProBMoT is able to reconstruct the correct model structure SS: In particular, based on the performance on a separate validation set, Singlestage ProBMoT accurately identifies the correct structure for all 6 synthetic cases. Note that here the relative difference in performance between the best identified structure and the remaining structures is substantial and higher than percentage points. Moreover, in terms of performance on unseen test data, Singlestage ProBMoT exhibit stable behavior, where for all synthetic cases the discrepancy between the model’s response of the output signal and the actual response (in the synthetic test data) remains stable and lower than .
In sum, both modes of using ProBMoT accurately identify the correct structure of the model. Figure 5 presents a comparison of the errors of the three candidate models considered with Multistage ProBMoT and Singlestage ProBMoT, measured on the output signal . Figure 5 shows that both methods lead to models with virtually identical errors on both validation (Figure 5a) and test (Figure 5b) data. This means that for the task of structure identification, Singlestage ProBMoT without any intervention by a modeler, explores the complete (and larger) space of candidate model structures, performs almost identically to Multistage ProBMoT. The latter embodies the stateofthe art approach in system identification, where the modeling task is tackled via decomposition into multiple stages, and requires intervention by a modeler.
Finally, we assess the capability of both Multistage ProBMoT and Singlestage ProBMoT to reconstruct the values of the model parameters from noisy data. Figure 6 depicts the relative difference between the estimated parameter values in the learned models and the original parameter values used for generating the synthetic data. To obtain reliable parameter estimates, we performed 100 repetitions of parameter estimation with Differential Evolution (by varying the random seed) for each of the six synthetic cases. Note that, while the model structure has 5 parameters, in Fig 6, we present the relative errors for the constants that appear in the equation from Fig Ic, i.e. the ratios of the parameters , , and , respectively.
As we can see from Fig 6, both modes of using ProBMoT can accurately estimate the original parameters values regardless of the noise variance. In general, the relative difference between the estimated and the original parameter values for both methods is lower than 5% across all experimental cases. Note, however, that Singlestage ProBMoT performs much better, lowering the variance of the parameter estimates to below 1%. However, for the particular modeling task at hand, the differences in the parameter values obtained with the two modes of ProBMoT is not substantial, since both sets of parameter values can accurately model the dynamics of the watertanks system.
VB Results on measured data
Here we focus on testing the ability of the processbased modeling approach to accurately reconstruct the benchmark system from measured data. Note that, for this modeling attempt, we assume that the correct structure of the system corresponds to the one given in Figure 1c while the parameters are unknown and remain to be found. More specifically, since both Multistage ProBMoT and Singlestage ProBMoT performed identically on the synthetic experiments, here we only use Singlestage ProBMoT to correctly recover the structure of the model and estimate the models parameters. Table IV presents the results of this experiment.
Model  [t]  

VALIDATION  TEST [b]  
SS  0.481  0.23 [t] 
SL  0.539  0.306 
SE  1.038  0.805 
LS  0.44  0.249 
LL  0.494  0.313 
LE  0.999  0.805 
ES  0.977  0.248 
EL  1.125  0.37 
EE  1.592  0.818 [b] 
From Table IV we can see that Singlestage ProBMoT struggles to identify the assumed correct structure from the measured data. Namely, we can see that the LS model performs better than the SS model in terms of summed on both signals. Nevertheless, in terms of test performance, the SS model has the best performance on the test set with the model LS being second. Note however, that the difference in test performance between SS and LS remains low and insignificant in favor of the model SS. To properly assess these behaviors, Fig 7 presents the simulation responses on the test data of the SS and LS models. From the figure, we can see that the simulated responses are indistinguishable. We conjecture that this is due to an implicit noise in the data i.e. from unforeseen nonlinearities in the measuring setup, which have not been included in the knowledge provided at input.
VC Discussion
In general, the results from the empirical analyses show that processbased modeling can be successfully applied to the task of nonlinear systems identification. In particular, from the experiments performed in the synthetic cases, we can conclude that ProBMoT, using domain knowledge and data, can accurately identify the correct structure of the system as well as estimate the model’s parameter values with high accuracy, regardless of the signal corruption present in the data. In the measureddata experiments, however, ProBMoT performs moderately well by discarding the majority of the models with incorrect structures, but nevertheless struggling to identify the "correct" one among the top two ranked. Earlier, we conjectured that such a performance might be an artifact from some nonlinearities in the measuring setup which were not captured in the domain knowledge provided at input. Such lack of knowledge can be explicitly handled by the modeler, where one can additionally decide on the correct structure. However, in the case of processbased modeling, such behavior can also be handled by extending the modeling knowledge provided at input: Either by encoding new processes in the library of domain knowledge or by weakening some of the constraints that relate to the dynamics of the individual processes.
For this particular task, we extend the modeling knowledge by weakening the constraint on the two processes ValveTransmission and Outflow. Given that ProBMoT mostly struggled to differentiate between strictly linear and nonlinear (squareroot) components for the interactions that involve the two tanks and , we add a component that allows for all sublinear interactions. We encode this in the processbased library Table V, by adding a new process alternative Power to each of the processes ValveTransmission and Ouflow. These two alternatives in turn are instantiated to two additional components which allow for modeling any sublinear dynamics in ValveTransmission and Ouflow. In particular, both processes introduce two additional parameters and , where:
which correspond to the exponents of the two signals and , respectively, in the equations. Table VI presents the results of the modeling experiment given the extended processbased library. To test this extension, we perform experiments using measured data, where we evaluate the performance of a new model PP and compare its performance against the models SS and LS.
⋮ 
template process Power:ValveTransmission { 
equations : 
⋮ 
template process Power:Outflow { 
equations: 
; } 
⋮ 
The results reported in Table VI show that the model PP performs substantially better on the validation set () than both SS and LS models. This result confirms our hypothesis that, for this particular modeling task, extending the modeling knowledge allows for better identification. Note that the estimated parameter values of and for the signals and are and , respectively. This results is also inline with our hypothesis that there are some implicit sublinear dynamics captured in the measured data, which does not correspond to the original theoretical model from Figure 1c. Moreover, Figure 8 presents the simulated responses on the test set of the three models PP, SS and LS. From the figure, we can see that, for some of the trends where SS and LS struggle, PP is able to capture them with high accuracy. In turn, this accuracy translates to predictive error on the output signal of in the case of PP, in the case of SS and of the model LS.
Model  [t]  

VALIDATION  TEST [b]  
SS  0.481  0.23 [t] 
LS  0.44  0.249 
PP  0.238  0.164 [b] 
In sum, we showed that by weakening the constraints of the processes ValveTransmission and Outflow in the processbased library, ProBMoT is able to discriminate among the model structures PP, on one hand, and the remaining two, on the other, therefore more accurately modeling the benchmark system from measured data. Note that, while we implemented this extension by adding two additional parameters, the performance improvement is a direct result of changing the structure of the model. To test this hypothesis, we performed an additional parameter estimation experiment for estimating the values of and using the synthetic data sets. For this task, we performed 100 repetitions for each of the 6 different synthetic cases. The results for both parameters are given in Figure 9.
From Figure 9 we can see that ProBMoT is estimates the values of both parameters to be in the range from
(with standard deviation
) in the synthetic case with noise variance to (with standard deviation ) in the case with noise variance . This stable behavior means that, in a controlled scenario where the appropriate amount of knowledge is present in the processbased library, ProBMoT is able to identify a structure identical to the underlying correct model structure (SS) regardless of the corrupted responses of the input/output signals.Vi Conclusion
In this paper, we present the first proper evaluation of processbased modeling in the context of system identification. Processbased modeling, a recent equitation discovery approach, addresses the task of computeraided modeling of nonlinear dynamic systems from measurements and domainspecific modeling knowledge. To test its utility, we perform a systematic evaluation of its utility for reconstructing a standard system identification benchmark model of cascaded watertank system. To this end, we test the latest implementation of processbased modeling, i.e. the ProBMoT system, which allows for computeraided construction, longterm simulation and evaluation of processbased models of dynamic systems. More specifically, we first analyze the robustness of ProBMoT on a series of identification tasks involving synthetic data with different noise variances, obtained by simulating the benchmark model. Next, we investigate ProBMoT’s ability to establish a valid model of a real benchmark system from measurement data.
The empirical evaluation shows that ProBMoT can reconstruct both the structure and the parameters of a system from synthetic data with various levels of corruption, as well as measured data. In particular, in the syntheticdata experiments, ProBMoT is able to accurately identify the correct structure of the system, as well as to estimate the model’s parameter values with high accuracy, regardless of the signal corruption present in the data. On the other hand, in the measureddata experiments, where ProBMoT initially struggled to discriminate between the two topranked models, we were able to identify implicit nonlinearities that were present in the data and not in the modeling knowledge provided at input. Consequently, with a minimal extension of the processbased library, we showed that ProBMoT can successfully reconstruct both the structure and the parameters of the model from measured data.
The processbased modeling approach, and in particular its application to the task of nonlinear system identification, is the main contribution of this paper. Namely, PBM has several distinguishing properties. First, PBM is a "whiteoff" graybox modeling approach [20] that constructs models from components with a clear mapping to the modeled physical phenomena. Next, the resulting processbased models retain the utility of mathematical models and therefore can be simulated and analyzed using well established numerical approaches. Finally, PBM is general and can be applied to any domain and to any other task of system identification. Moreover, processbased knowledge employed by the PBM is domainspecific in the sense that it captures the basic modeling principles in a given domain and can be reused for different modeling applications within the same domain.
Several directions for further work can be followed. While our empirical study is limited to one domain, i.e. modeling the dynamics of a cascaded water tank system with two tanks, the proposed approach to learning processbased models can be applied to many other tasks of system identification. An immediate continuation of the work presented here would be to investigate whether the findings of this paper also apply to other standard nonlinear benchmark systems in both realworld and synthetic settings. Next, we intend to compare the processbased modeling approach to other stateoftheart graybox modeling approaches, both in terms of accuracy of reconstructing the dynamics of a nonlinear system and in terms of the comprehensibility of the models that these approaches can provide.
Acknowledgment
We would like to acknowledge the financial support of the Slovenian Research Agency, via the grants P20103, P20001, P50093 and N20056.
Appendix A: Modeling Scenarios
entity tank1 : Tank { 
vars: h {role: endogenous; initial: 0.38086;}; 
entity tank2 : Tank { 
vars: h {role: endogenous; initial: 0.20508;}; 
entity pump : Pump { 
vars: v {role: exogenous;}; 
consts: k = null;} 
process inflow(pump,tank1): Inflow {} 
process valveTransmision(tank1, tank2): 
ValveTransmision { consts: G=4.429;} 
process outflow(tank2): Outflow { consts: G=4.429;} 
//STAGE 1 

entity tank1 : Tank { 
vars: h {role: endogenous; initial: 0.38086;}; 
entity pump : Pump { 
vars: v {role: exogenous;}; 
consts: k = null;} 
process inflow(pump,tank1): Inflow {} 
process valveTransmision(tank1): 
ValveTransmision { consts: G=4.429;} 
//STAGE 2 
entity tank1 : Tank { 
vars: h {role: exogenous; initial: 0.38086;}; 
entity tank2 : Tank { 
vars: h {role: endogenous; initial: 0.20508;}; 
process valveTransmision(tank1, tank2): 
ValveTransmision { consts: G=4.429;} 
process outflow(tank2): Outflow { consts: G=4.429;} 
References
 [1] D. Aleksovski, J. Kocijan, and S. Džeroski. Ensembles of fuzzy linear model trees for the identification of multioutput systems. IEEE Transactions on Fuzzy Systems, 24(4):916–929, 2016.
 [2] P. J. Ashenden. The Designer’s Guide to VHDL. Morgan Kaufmann Inc., San Francisco, CA, USA, 3 edition, 2008.
 [3] N. Atanasova, L. Todorovski, S. Džeroski, R. Remec, F. Recknagel, and B. Kompare. Automated modelling of a food web in Lake Bled using measured data and a library of domain knowledge. Ecological Modelling, 194(13):37–48, 2006.
 [4] L. Breiman, J. H. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. Chapman & Hall, London, UK, 1984.
 [5] W. Bridewell, P. W. Langley, L. Todorovski, and S. Džeroski. Inductive process modeling. Machine Learning, 71:1–32, 2008.
 [6] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.

[7]
H. Cao, F. Recknagel, L. Cetin, and B. Zhang.
Processbased simulation library SALMOOO for lake ecosystems. Part 2: Multiobjective parameter optimization by evolutionary algorithms.
Ecological Informatics, 3(2):181 – 190, 2008.  [8] S. D. Cohen and A. C. Hindmarsh. CVode, a stiff/nonstiff ODE solver in C. Journal of Computational Physics, 10(2):138–143, 1996.
 [9] S. Džeroski and L. Todorovski. Equation discovery for systems biology: Finding the structure and dynamics of biological networks from time course data. Current Opinion in Biotechnology, 19(4):360–368, 2008.
 [10] J. J. Durillo and A. J. Nebro. jMetal: A Java framework for multiobjective optimization. Advances in Engineering Software, 42:760–771, 2011.

[11]
S. Džeroski and L. Todorovski.
Discovering dynamics: From inductive logic programming to machine discovery.
Journal of Intelligent Information Systems, 4(1):89–108, 1995.  [12] L. Ferariu and A. Patelli. Multiobjective Genetic Programming for Nonlinear System Identification, pages 233–242. Springer Berlin, Germany, 2009.
 [13] S. Ghosh, S. Das, A. V. Vasilakos, and K. Suresh. On convergence of differential evolution over a class of continuous functions with unique global optimum. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(1):107–124, 2012.
 [14] F. Giri and E.W. Bai, editors. Blockoriented Nonlinear System Identification, volume 404 of Lecture Notes in Control and Information Sciences, Springer, Berlin, Germany, 2010.
 [15] G. J. Gray, D. J. MurraySmith, Y. Li, K. C. Sharman, and T. Weinbrenner. Nonlinear model structure identification using genetic programming. Control Engineering Practice, 6(11):1341 – 1352, 1998.
 [16] G. Kerschen, K. Worden, A. F. Vakakis, and J.C. Golinval. Past, present and future of nonlinear system identification in structural dynamics. Mechanical Systems and Signal Processing, 20(3):505 – 592, 2006.
 [17] J. R. Koza. Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press, Cambridge, MA, USA, 1994.
 [18] P. W. Langley, H. A. Simon, G. Bradshaw, and J. M. Zytkow. Scientific Discovery: Computational Explorations of the Creative Processes. MIT Press, Cambridge, MA, USA, 1987.
 [19] L. Ljung. System identification  Theory for the User. PrenticeHall, 1999.
 [20] L. Ljung. Perspectives on system identification. IFAC Proceedings Volumes, 41(2):7172 – 7184, 2008.
 [21] J. Madar, J. Abonyi, and F. Szeifert. Genetic programming for system identification. In Proceedings of Intelligent Systems Design and Applications, ISDA 2004, Budapest, Hungary, 2004.
 [22] R. I. Mckay, N. X. Hoai, P. A. Whigham, Y. Shan, and M. O’Neill. Grammarbased genetic programming: A survey. Genetic Programming and Evolvable Machines, 11(34):365–396, 2010.
 [23] K. RodriguezVazquez, C. M. Fonseca, and P. J. Fleming. Identifying the structure of nonlinear dynamic systems using multiobjective genetic programming. IEEE Transactions on Systems, Man, and Cybernetics  Part A: Systems and Humans, 34(4):531–545, 2004.
 [24] M. Schmidt and H. Lipson. Distilling freeform natural laws from experimental data. Science, 324(5923):81–85, 2009.
 [25] J. Schoukens, R. Pintelon, Y. Rolain, M. Schoukens, K. Tiels, L. Vanbeylen, A. V. Mulders, and G. Vandersteen. Structure discrimination in blockoriented models using linear approximations: A theoretic framework. Automatica, 53:225 – 234, 2015.
 [26] J. Schoukens, M. Vaes, A. Esfahani, and R. Relan. Challenges in the identification of discrete time block oriented models for continuous time nonlinear systems. IFACPapersOnLine, 48(28):596 – 601, 2015.
 [27] M. Schoukens and K. Tiels. Identification of blockoriented nonlinear systems starting from linear approximations: A survey. Automatica, 85:272 – 292, 2017.
 [28] M. Schoukens, K. Tiels, M. Ishteva, and J. Schoukens. Identification of parallel WienerHammerstein systems with a decoupled static nonlinearity. IFAC Proceedings Volumes, 47(3):505 – 510, 2014.
 [29] J. Tanevski, L. Todorovski, and S. Džeroski. Learning stochastic processbased models of dynamical systems from knowledge and data. BMC Systems Biology, 10(1):30, 2016.
 [30] K. Taškova, J. Šilc, N. Atanasova, and S. Džeroski. Parameter estimation in a nonlinear dynamic model of an aquatic ecosystem with metaheuristic optimization. Ecological Modelling, 226:36–61, 2012.

[31]
L. Todorovski, W. Bridewell, O. Shiran, and P. W. Langley.
Inducing hierarchical process models in dynamic domains.
In K. S. Veloso M.M., editor,
Proceedings of the 20th National Conference on Artificial Intelligence, NCAI ’05
, pages 892–897. AAAI Press, Pittsburgh, PA, USA, 2005.  [32] L. Todorovski and S. Džeroski. Integrating domain knowledge in equation discovery. In L. Todorovski and S. Džeroski, editors, Computational Discovery of Scientific Knowledge, volume 4660 of Lecture Notes in Computer Science, pages 69–97. Springer, Berlin, Germany, 2007.
 [33] D. Čerepnalkoski, K. Taškova, L. Todorovski, N. Atanasova, and S. Džeroski. The influence of parameter fitting methods on model structure selection in automated modeling of aquatic ecosystems. Ecological Modelling, 245(0):136 – 165, 2012.
 [34] P. Whigham and F. Recknagel. Predicting chlorophylla in freshwater lakes by hybridising processbased models and genetic algorithms. Ecological Modelling, 146(1–3):243 – 251, 2001.
 [35] T. Wigren and J. Schoukens. Three free data sets for development and benchmarking in nonlinear system identification. In Proceedings of the 2013 European Control Conference, ECC 2013, pages 2933–2938. IEEE, 2013.
 [36] Q. Zhang. Nonlinear system identification with output error model through stabilized simulation. IFAC Proceedings Volumes, 37(13):501 – 506, 2004.