## 1 Introduction

Inference of computational model parameters from empirical data can be referred to as *model calibration* [Kennedy2001]. Model calibration aims to both obtain model parameters that are theoretically plausible and generate model predictions that fit the observations. The inferred model parameters often represent physical quantities that are not directly observable or observed, i.e., they are not directly obtained from sensor measurements. Therefore, the inference of physics-based model parameters enables one to understand the underlying reasons for a discrepancy between physics-based model predictions and observations, i.e., the *reality gap* (see Figure 1). This is of particular relevance for scientific and engineering disciplines where one is interested in improving the physics-based model analytically or explaining the observed process in light of a given physics-based model structure. Applications can be found in multiple areas, including geology [Elsheikh2015], climatology [Sanso2008], biology [Henderson2009], health [Rutter2009], finance [Liu2019, Deng2008], cognitive science [Kangasraasio2019], mechanical engineering [Kumar2013], and applied physics [Higdon2008].

A particularly important field of application aiming at a reasoned analysis of discrepancies between model predictions and observations is model-based system health diagnostics of safety-critical engineered systems. Diagnostics involves detecting when a fault occurs, isolating the root cause, and identifying the extent of the damage [Roychoudhury2013]. In model-based health diagnostics, the discrepancy between model and observation is interpreted as a deteriorated or anomalous response of the system. Therefore, model-based health diagnostics addresses the diagnostics problem by inferring the value of model parameters, representing the health condition of the sub-components of a system that make the physics-based model predictions fit the observations. In this way, anomalies in the system’s behavior are detected and characterized by the value of model parameters.

Because of the relevance of model calibration in applications such as the one presented above, it is important that model calibration provide accurate inference of the model parameters while being robust to uncertainty in the observations and the physics-based model structure. However, calibration in real-world scenarios faces computational and statistical difficulties. Computational issues are related to the need for running time-consuming simulations using optimisation and inference techniques that generally imply a trade-off between inference accuracy and computation time. Scaling the method to large datasets, high dimensional spaces and complex dynamic models (such as a model with flow field calculation) further exacerbates the problem. Statistical issues arise from a) the incompleteness of the model representation, b) the existence of multiple solutions, i.e., confounding solutions that match the observations, and c) the uncertainty of the observations. Some safety-critical applications, such as model-based diagnostics of aircraft engines, require simultaneous speed, accuracy, and robustness in the inference of the model parameters to enable a fast and reliable state assessment. The necessity of fulfilling all of these requirements at the same time makes the development of methods for reliable dynamical model calibration challenging.

Several methods have been proposed to address the problem of dynamical model calibration. When the physics-based model structure is well founded on known physical principles (e.g., aircraft thermodynamic engine models), the majority of the available methods for parameter inference are probabilistic or estimation approaches developed in the fields of optimal control [Crassidis2011] and statistics [Sacks1989]. Some examples of popular estimation methods include iteratively reweighted least squares schemes [AriasChao2015], Kalman filters (KF) [Kalman1960], extended Kalman filters (EKF) [Einicke1999, Borguet2012], unscented Kalman filters (UKF) [Julier1997, Turner2010], particle filters [Kantas2015], and Bayesian inference methods using Markov chain Monte Carlo [Rutter2009]. Approaches of this type scale relatively well to high-dimensional calibration problems and, with their probabilistic nature, handle observation noise reasonably well. These estimation methods have achieved good results in practical applications and are considered the state-of-the-art in several applications such as model-based diagnostics. Yet, despite these attractive properties, they all suffer, at least to some degree, from various computational and statistical difficulties in real-world scenarios. In particular, estimation with these methods involves multiple evaluations of the computational model, which makes them unsuitable for real-time calibration of models based on large datasets when the available computational resources are limited. Moreover, these methods are particularly affected by the inadequacy of the physics-based model structure, resulting in an inaccurate characterization of the reality gap.

More recently, data-driven approaches have been proposed to calibrate physics-based models. Aiming to avoid time-consuming simulations of previous calibration methods and achieve real-time model calibration, some researchers have deviated from the probabilistic formulation of the calibration problem. The most common approach is to address the calibration problem as a supervised learning problem [Liu2019]. In this case, a neural network is trained to learn the inverse relation between the observations and the model parameters. Although these methods provide a real-time calibration approach (only a forward pass over a neural network is required at deployment time), the accuracy of the methods depends strongly on how representative the training datasets are. As a result, this model calibration approach is not able to adapt to new scenarios without re-training. To mitigate this limitation, an exhaustive mapping of possible system responses under different operating conditions and values of model parameters is required. In practice, in high-dimensional calibration problems with systems operating under a large range of conditions, an exhaustive mapping is infeasible. In addition, such methods exhibit poor performance in scenarios involving noisy observations, limiting their implementation in practical applications.

Where a real-world system’s behavior is not well represented by a physics-based model’s structure, a popular framework for model calibration is the probabilistic framework proposed by [Kennedy2001]. In this framework, both the physics-based model response and the model discrepancy are modelled with Gaussian Processes (GP). While GP is an elegant solution to emulate the response of a physics-based model and is well suited for uncertainty quantification, the GP representation can a) limit the class of functions that can be modelled, b) restrict the scalability to large datasets [Rasmussen2006], and also c) suffer from poor extrapolation ability. Additional computational issues arise from the use of Markov chain Monte Carlo to perform inference. Several recent developments have been proposed to mitigate these limitations, including the extension of the modelling capabilities of GP with Deep GP [damianou13a] or considering variational inference [Marmin2018VariationalCO]. Although the representation of complex physics-based models with Deep GP reduces the scalability limitations of classical GPs, for large-scale calibration problems, the scalability and run-time cost of Deep GP-based methods for real-time model calibration in real-world scenarios remain a limitation [Marmin2018VariationalCO].

Because of the issues mentioned above, the dynamic, real-time, robust, and accurate inference of physics-based model parameters of complex engineered systems remains challenging. However, recent developments in model-free reinforcement learning (RL) have fostered a great deal of progress in addressing related control problems [Zhang2020]. In fact, RL has proven to be effective in finding optimal control policies for non-linear stochastic systems when the dynamics are either unknown or affected by severe uncertainty [bucsoniu2018reinforcement], including complicated robotic locomotion and manipulation [kumar2016learning, xie2019iterative, hwangbo2019learning]. The policies learnt via RL have the ability to adapt to new scenarios and scale well to large-scale problems at run time. In fact, the decision-making of reinforcement learning takes place through a neural network without any further optimization, which overcomes the inference speed problem at deployment time. Therefore, model-free RL [sutton1992reinforcement] is a compelling alternative to traditional inference methods for physics-based model calibration.

One can realize the potential of utilizing model-free reinforcement learning for the inference of physics-based model parameters if one leverages the strong connection between inference in probabilistic models and reinforcement learning [levine2018reinforcement]. In fact, as highlighted in [levine2018reinforcement], the connection between probabilistic inference and optimal control has been covered in the literature under different names: a) the Kalman duality [Todorov2008], b) Kullback–Leibler (KL) divergence control [Kappen2009], c) stochastic optimal control [Toussaint2009], and d) maximum entropy reinforcement learning [Ziebart2010].

In this work, we propose a novel formulation of the calibration problem as a tracking problem that is modeled by a Markov decision process. Based on this formulation, we apply maximum entropy deep reinforcement learning to train an agent that controls the physics-based model parameters to keep the model response matching the observations. In order to achieve greater robustness to observation uncertainty and model inadequacy, we propose a novel constrained Lyapunov-based actor-critic (CLAC) algorithm. The proposed CLAC algorithm adds constraints on the stability of the policy network and is an extension of the Lyapunov-based actor-critic (LAC) algorithm.

Without any knowledge of the physics-based model or simulator, the agent is able to exploit the full dynamics of the model and produce robust control (i.e., calibration) logic. Therefore, the proposed framework overcomes the difficulties of traditional optimal control methods and data-driven approaches. It provides: a) accurate real-time dynamical calibration, b) a policy that can adapt to new scenarios without having been specifically trained on them, c) scalability to large datasets and high-dimensional spaces, and d) robustness to observation and model uncertainty.

The proposed framework is summarized in Figure 2. In the first step we identify the parameters of the physics-based model that are subjected to inference. In a second step, we use a physics-based model or, alternatively, a deep neural network (DNN) model that emulates the expected system response for measured properties (i.e., observations). In the third step, we use the DNN model to train the calibration policy network via RL. At deployment time, the trained calibration policy is directly deployed to obtain the physics-based model parameters at run time (step 4). The resulting calibration policy is computationally efficient at run time. Most importantly, the calibration policy is robust to uncertainty in the observations and the physics-based model. The proposed methodology is demonstrated and evaluated on a model-based diagnostics test case utilizing two different physics-based models of a turbofan engine: the Advanced Geared Turbofan 30,000 (AGTF30) and Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) from NASA.

The contribution of this paper is two-fold: 1) We propose a solution to the problem of real-time dynamic calibration of physics-based models. In particular, we present a very general reinforcement-based model calibration framework that enables real-time inference of system model parameters without any online optimization and could be easily implemented on any system model. 2) From the methodological perspective, we propose the constrained Lyapunov-based actor-critic (CLAC) algorithm, which provides more action stability, especially on parameter tracking problems, compared to the state-of-the-art LAC reinforcement learning algorithm. This makes the proposed approach robust to noise and high variability.

## 2 Background

In this section, we briefly review the basic concepts and notations related to physics-based model calibration and reinforcement learning as they are the building blocks of the framework and method proposed in this work. In addition, we briefly introduce two traditional calibration approaches (unscented Kalman filters and end-to-end mappings with deep neural networks) to which we compare the performance of our proposed methodology.

### 2.1 Calibration of physics-based models

The problem of calibration of physics-based models corresponds in its general form to the problem of modelling a physical process as approximated by a physics-based model. Observations of the real system response are given in the form of sensor readings taken at variable inputs representing, for instance, the operating conditions at time . The physics-based model provides approximations of the real process at input condition given some values of the calibration inputs . Model calibration aims to infer the (unknown) value of that makes the model predictions follow the observations, i.e., . Following the formulation in [Marmin2018VariationalCO], the calibration problem can be generalized as:

(1) |

In this formulation, the observations are the result of an unknown stochastic warping mapping over the system model and the inputs . It is worth pointing out that the original formulation in [Kennedy2001] is obtained when applies the identity to and the mismatch between the physics-based model and the reality (i.e., model discrepancy ) is modelled by the warping function over the input variables :

(2) |
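As a hedged reading of the two formulations described above, the relations might be written as follows. Every symbol here is an assumption introduced for illustration, since the original notation is not reproduced in this text: observations $x_s$, inputs $w$, calibration parameters $\theta$, physics-based model $F$, warping $\mathcal{W}$, discrepancy $\delta$, and observation noise $\varepsilon$. The second line is the commonly quoted simplified form of the formulation in [Kennedy2001].

```latex
% Assumed notation; a sketch, not the authors' exact Equations (1) and (2).
\begin{align}
  x_s &= \mathcal{W}\big(F(w, \theta),\, w\big)
      && \text{(general warping of model output and inputs)} \\
  x_s &= F(w, \theta) + \delta(w) + \varepsilon
      && \text{(identity on } F \text{, additive discrepancy over } w\text{)}
\end{align}
```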

### 2.2 Reinforcement Learning

Reinforcement learning is a sub-field of machine learning that focuses on how an agent interacts with the environment to achieve a specific goal. The environments are typically stated in the form of a Markov decision process (MDP), which provides a mathematical description of decision-making processes. Under the right problem formulation, MDPs can be useful for solving optimization and inference problems, such as the one described above for physics-based model calibration, via reinforcement learning. The details of the MDP formulation of physics-based model calibration will be discussed in Sec. 3.

In conventional reinforcement learning, an agent is trained to interact with the environment and seek rewards on the basis of its actions. The agent receives a successor state from the environment as feedback in response to a decision (i.e., action ) taken at time-step . The goal is to find a policy that maximizes the discounted cumulative reward [sutton1998introduction], which is given by the following expression:

(3) |

where is the discount factor.
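For a single finite trajectory, the discounted cumulative reward is the sum of each reward weighted by the discount factor raised to its time index. The sketch below illustrates that standard definition; the function name `discounted_return` and the argument names are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For instance, three unit rewards with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75. The policy objective is the expectation of this quantity over trajectories generated by the policy.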

Maximum entropy RL. The maximum entropy reinforcement learning framework considers a more general objective, aiming to learn a stochastic policy which jointly maximises the expected discounted cumulative reward and its expected entropy [ziebart2010modeling]:

(4) |

where is the temperature parameter that controls the stochasticity of the optimal policy over the reward. Therefore, the resulting stochastic policies balance the exploration-exploitation trade-off and add robustness to the policy. Soft Actor-Critic (SAC) [haarnoja2018soft] is one of the state-of-the-art off-policy reinforcement learning algorithms based on the maximum entropy reinforcement learning framework.
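In the standard notation of the maximum entropy literature, the objective is usually written as below. The symbols ($\pi$ for the policy, $r$ for the reward, $\gamma$ for the discount factor, $\alpha$ for the temperature, $\mathcal{H}$ for the entropy) are assumptions chosen for this sketch and may differ from the authors' Equation (4).

```latex
% Standard maximum entropy RL objective in assumed notation.
\begin{equation}
  J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}
           \Big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right]
\end{equation}
```

Setting $\alpha = 0$ removes the entropy bonus and recovers the conventional discounted cumulative reward objective.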

Stability guaranteed RL. The maximum entropy reinforcement learning framework can also include a closed-loop stability guarantee of the system dynamics. Such a stability guarantee is particularly relevant when dealing with control problems in real-world applications. Recently, the Lyapunov-based actor-critic (LAC) method [tian2019model], implementing a stability guarantee, showed state-of-the-art performance on tracking tasks. From a control-theoretic perspective, the task of tracking can be addressed ensuring that the closed-loop system is asymptotically stable. In other words, starting from an initial point, the trajectories of states always converge to a single point or reference trajectory. Therefore, in [tian2019model], a stability-guaranteed reinforcement learning framework is proposed under the following definition of stability:

Stability Definition. Suppose is the cost function, . The system is said to be mean square stable (MSS) if holds for any initial condition .

Under this definition, the stability objective is given by Equation 5. The stability objective defines an energy decreasing condition that drives the trajectory asymptotically to the null space of the cost function, producing predictable behaviour of the agent. Here, we use the Lyapunov function to denote the system’s energy, so that the state goes in the direction of decreasing the value of the Lyapunov function and eventually converges to the origin or a sub-level set of the Lyapunov function.

(5) |

where the term controls the energy decreasing speed.
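One way to make the energy-decreasing condition concrete is to check it on a batch of transitions. The sketch below assumes the condition takes the form used in LAC-style methods, namely that the expected change of the Lyapunov function is bounded above by a negative multiple of the expected cost. The exact form of Equation 5 is not reproduced here, and the names `lyapunov_decrease_gap` and `alpha3` are assumptions.

```python
import numpy as np


def lyapunov_decrease_gap(l_current, l_next, cost, alpha3):
    """Sample estimate of E[L(s') - L(s) + alpha3 * c(s)].

    A non-positive value means the batch satisfies the assumed
    energy-decreasing condition E[L(s') - L(s)] <= -alpha3 * E[c(s)].
    """
    l_current = np.asarray(l_current, dtype=float)
    l_next = np.asarray(l_next, dtype=float)
    cost = np.asarray(cost, dtype=float)
    return float(np.mean(l_next - l_current + alpha3 * cost))
```

As a toy check, a contracting scalar system with next state 0.5 s and L(s) = c(s) = s² yields a gap of (alpha3 - 0.75) times the mean of s², so the condition holds for alpha3 = 0.5 and fails for alpha3 = 1.0. A larger alpha3 therefore demands a faster decrease of the energy.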

### 2.3 State-Update Method: Unscented Kalman Filter

Estimation of the physics-based model parameters from a transient data stream can be addressed with a traditional state-space formulation. In this solution strategy, the state vector comprises the model parameters and is modelled as a random walk. The measurement equation depends on the states and the input signals at the present time step , which is available from the system model . Under this formulation, a UKF can be applied to a non-linear discrete time system of the form:

(6) |

(7) |

where is a Gaussian noise with covariance and is a Gaussian noise with covariance .
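The random-walk state model and the non-linear measurement equation can be illustrated with a scalar unscented Kalman filter. This is a minimal sketch under stated assumptions, not the filter used in the experiments: `toy_model` is a hypothetical stand-in for the system model, the parameter is one-dimensional, and the sigma-point spread `kappa` as well as the noise variances `q` and `r` are illustrative choices.

```python
import numpy as np


def toy_model(w, theta):
    """Hypothetical stand-in for the system model (not the engine simulator)."""
    return w * theta + 0.1 * theta ** 2


def ukf_step(theta, var, w, y, q, r, kappa=2.0):
    """One scalar UKF update for a random-walk parameter theta."""
    # Prediction: a random walk keeps the mean and inflates the variance.
    theta_pred, var_pred = theta, var + q
    # Sigma points and weights of the unscented transform for dimension n = 1.
    n = 1
    spread = np.sqrt((n + kappa) * var_pred)
    sigma = np.array([theta_pred, theta_pred + spread, theta_pred - spread])
    weights = np.array([kappa, 0.5, 0.5]) / (n + kappa)
    # Propagate the sigma points through the measurement equation.
    y_sigma = toy_model(w, sigma)
    y_mean = weights @ y_sigma
    var_y = weights @ (y_sigma - y_mean) ** 2 + r
    cov_ty = weights @ ((sigma - theta_pred) * (y_sigma - y_mean))
    # Kalman-style correction with the measured response y.
    gain = cov_ty / var_y
    return theta_pred + gain * (y - y_mean), var_pred - gain * var_y * gain


def run_demo(true_theta=0.8, steps=200, seed=0):
    """Track a constant parameter from noisy synthetic measurements."""
    rng = np.random.default_rng(seed)
    theta, var = 0.5, 1.0
    for _ in range(steps):
        w = rng.uniform(1.0, 2.0)
        y = toy_model(w, true_theta) + rng.normal(0.0, 0.01)
        theta, var = ukf_step(theta, var, w, y, q=1e-6, r=1e-4)
    return theta
```

Each update evaluates the model once per sigma point. For a parameter vector of dimension n, the unscented transform uses 2n + 1 sigma points, so the number of model evaluations per time step grows with the dimension of the calibration problem.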

### 2.4 End-to-End Learning

An alternative approach to the calibration problem is to define a supervised learning set-up aimed at discovering a direct mapping from the condition monitoring data to the target . Different machine learning algorithms can be applied for this task. This approach is valid under the assumption that the training dataset is representative of the testing dataset. In this case, the supervised models can generalize well on the test set. However, the extrapolation capabilities of such approaches are limited, which can be a significant limitation for real applications in evolving environments.

The end-to-end learning strategy requires one to train a neural network in the inverse relation to the measurement equation of a state-update method:

(8) |

Since it is a supervised learning setup, this approach calls for an initial training set with the ground truth for the calibration parameters that are used as labels. This would require solving the inverse problem by other methods, using the results of other calibration methods as labels for the learning approaches, or using synthetically generated labels. These are crucial limitations of the end-to-end learning approaches.

## 3 Proposed Framework - Calibration Policy

### 3.1 Model calibration defined as tracking problem

In this work, we formulate the real-time model calibration problem as a tracking problem, which is modelled by an MDP, and use reinforcement learning to find the optimal tracking policy. The rationale behind this solution strategy is that learning to track observations of a real system response () by changing the model parameters () results in a control policy that makes the physics-based model yield a sound approximation of the physical process (), i.e., reducing the reality gap. Consequently, the tracking policy also serves as a calibration policy. It is worth noticing that this formulation of the calibration problem involves a system identification problem by tracking [Ljung1990].

Under a tracking solution strategy, the MDP describing the problem is given as the tuple (), where the set of states () comprises the current model output , the target value of the system response (observations of the real system) , and the operating conditions , i.e., . The set of actions () defines the model parameters that need to be calibrated, i.e., . The reward/cost function evaluates how good the tracking is. The state transition probability function () corresponds to the dynamics of the system that can be modelled by a physics-based model or surrogate model.

In order to speed up the learning process of the RL algorithm, a discrete time counterpart of the physics-based model is used. The resulting dynamical system or simulator is modelled by a deep neural network that approximates the dynamic transition equation describing how the expected system response changes given the current observations , the control variables , and model parameters , resulting in:

(9) |

For the tracking problem there is, therefore, a desired state that we would like the system to be in at each time step, i.e., . The task of the agent is to find a control policy that minimizes the cost based on a specific distance metric representing the reality gap of the physics-based model. In particular, given the dynamical system above and a target system trajectory (i.e., observations), we train the control policy to keep the simulator state matching the real system state by maximizing the cumulative reward as given in Equation 3. The complete reinforcement learning loop is shown in Figure 3.
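The tracking formulation can be sketched as a small environment in which the state collects the model output, the target observation, and the operating condition, the action is the calibration parameter, and the cost is a squared tracking error. Everything in this example is an assumption made for illustration: the class `ToyTrackingEnv`, its static `simulate` surrogate (the simulator described above is dynamic and learned by a neural network), and the choice of a squared-error cost.

```python
import numpy as np


class ToyTrackingEnv:
    """Minimal tracking MDP: the action is the calibration parameter."""

    def __init__(self, conditions, true_theta):
        self.conditions = np.asarray(conditions, dtype=float)
        # Synthetic observations of the "real" system (assumed noise-free).
        self.targets = self.simulate(self.conditions, true_theta)
        self.t = 0

    @staticmethod
    def simulate(w, theta):
        """Hypothetical surrogate of the system response."""
        return w * theta + 0.1 * theta ** 2

    def reset(self, initial_theta=1.0):
        """Return the initial state (model output, target, condition)."""
        self.t = 0
        output = self.simulate(self.conditions[0], initial_theta)
        return np.array([output, self.targets[0], self.conditions[0]])

    def step(self, theta):
        """Apply parameter theta and return (next_state, cost, done)."""
        w, target = self.conditions[self.t], self.targets[self.t]
        output = self.simulate(w, theta)
        cost = float((output - target) ** 2)  # squared tracking error
        self.t += 1
        done = self.t >= len(self.conditions)
        nxt = min(self.t, len(self.conditions) - 1)
        state = np.array([output, self.targets[nxt], self.conditions[nxt]])
        return state, cost, done
```

In this toy setting, an action equal to the true parameter drives the cost to zero, which is why a policy that tracks the observations well also acts as a calibration policy.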

### 3.2 Learning Algorithm

In this work, we adopt Lyapunov-based actor-critic (LAC) [tian2019model] as the learning algorithm, which is based on the soft actor-critic (SAC) [haarnoja2018soft] algorithm and also incorporates a stability guarantee objective. The stability guarantee objective enables a control policy that stabilizes the system in the case of interference by unseen disturbances or uncertainties in the system dynamics. Most importantly, the LAC algorithm yields the best performance on tracking problems [tian2019model].

Based on the maximum entropy actor-critic framework, LAC uses the Lyapunov function as the critic in the policy gradient formulation. The objective function of is given as follows:

(10) |

(11) |

where is the approximation target for as typically used in RL methods [mnih2015human, lillicrap2015continuous]. has the same structure as , but the parameter is updated through an exponential moving average of the weights of controlled by a hyperparameter .

The objective function for the policy network is given by:

(12) | ||||

where is parameterized by a neural network , and is an input vector consisting of Gaussian noise. The is the replay buffer that stores the MDP tuples. In the above objective, and are positive Lagrange multipliers that control the relative importance of policy entropy versus the stability guarantee. As in [haarnoja2018soft2], the entropy of the policy is expected to remain above the target entropy . The value of is adjusted through a gradient method, thereby maximizing the objective:

(13) |

and the is adjusted by the gradient method, thus maximizing the objective:

(14) |
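Two of the update rules described in this subsection are simple enough to sketch without a deep learning library: the exponential moving average that refreshes the target network and the gradient step that adjusts a positive Lagrange multiplier. The sketch assumes that each multiplier objective is linear in the multiplier, as in SAC-style temperature tuning, and that positivity is enforced by optimizing the logarithm of the multiplier. The names `soft_update`, `ascend_multiplier`, and `tau` are hypothetical.

```python
import numpy as np


def soft_update(target_weights, online_weights, tau):
    """Exponential moving average of online weights into the target copy."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]


def ascend_multiplier(log_multiplier, constraint_estimate, learning_rate):
    """One gradient-ascent step on J(m) = m * constraint_estimate.

    The multiplier m = exp(log_multiplier) stays positive by construction.
    """
    multiplier = np.exp(log_multiplier)
    # dJ/d(log m) = m * constraint_estimate by the chain rule.
    return log_multiplier + learning_rate * multiplier * constraint_estimate
```

With this parameterization the multiplier grows while the batch estimate of its constraint term is positive and shrinks when it is negative, which is the usual behaviour of dual variables in constrained policy optimization.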

Under conditions of high sensor noise and simulator bias resulting from an incomplete representation of the system model (i.e., irreducible reality gap), the policy network can exhibit large variance. Such a situation is undesirable in many real-world applications where it is important to obtain a stable or smooth action over time. Therefore, in order to stabilize the action, we introduce the constrained Lyapunov-based actor-critic (CLAC) algorithm, a modification of the LAC, which significantly improves the action stability under model uncertainty and sensor noise. In CLAC, the objective function has an additional term that aims to obtain a policy network that has a similar optimal action when given a similar or near state () and is given by:

(15) |

where is a positive Lagrange multiplier, and outputs the action with the largest probability. In our case, we use the state at an adjacent time step, or , to approximate .
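The additional CLAC term can be pictured as a penalty on how much the most likely action changes between neighbouring states. The sketch below uses a squared Euclidean distance between the policy's mean actions at a state and at its adjacent time-step state. This distance, the function name `action_smoothness_penalty`, and the weight `nu` are assumptions, since the exact form of Equation 15 is not reproduced in this text.

```python
import numpy as np


def action_smoothness_penalty(policy_mean, states, next_states, nu):
    """nu * mean squared distance between actions at adjacent states.

    policy_mean maps a batch of states (rows) to a batch of actions (rows).
    """
    a_now = np.asarray(policy_mean(states), dtype=float)
    a_next = np.asarray(policy_mean(next_states), dtype=float)
    return float(nu * np.mean(np.sum((a_now - a_next) ** 2, axis=-1)))
```

A policy whose output does not change between adjacent states incurs no penalty, while a policy that reacts strongly to small state changes, for example changes caused by sensor noise, is penalized in proportion to `nu`.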

The entire procedure for training the proposed constrained Lyapunov actor-critic is provided in Algorithm 1 and all the hyper-parameter settings may be found in the Appendix.

## 4 Case Study: Diagnostics of safety-critical systems

### 4.1 Introduction to model-based diagnostics

Model-based diagnostics aims to detect, isolate, and explain anomalies in the behaviour of a system by finding health-related model parameters that approximate the observed system response. Diagnostics of safety-critical systems, such as aircraft engines, is an active research area [Li2002, Fentaye2019] with a long history going back to the original work of [Urban1973]. Because of the potentially catastrophic impact of failures in such systems, it is important to provide accurate and robust inference of the health-related model parameters but also to perform this task in real-time to promptly raise the alarm and take mitigation actions with minimal delay. Yet current model-based diagnostics methods only offer a compromise between speed, robustness, accuracy, and scalability.

### 4.2 Experiments

The proposed framework and method are demonstrated and evaluated on two datasets generated with two different physics-based models focusing on the diagnostics of safety-critical systems represented by turbofan engines. Each dataset explores different aspects of real-world calibration problems. Dataset #1 corresponds to a one-dimensional calibration problem () under a wide range of real (i.e., noisy) flight conditions from a small fleet of ten units (). With 6.7M samples, Dataset #2 is a large dataset that explores complex failure modes affecting four components of the system simultaneously (). Therefore, Dataset #2 explores a calibration problem under complex system responses. In contrast to Dataset #1, it contains only data from one single unit and, consequently, has a more limited range of operating conditions. An overview of the two calibration problems is provided in Table 1.

| Parameter | Dataset #1 | Dataset #2 |
|---|---|---|
| Model Name | C-MAPSS | AGTF30 |
| Samples | 0.5M | 6.7M |
| | 20 | 8 |
| Calibration parameters | 1 | 4 |
| Units | 10 | 1 |
| Fault Type | Continuous | Discrete |
| Fault scenarios | 10 | 1315 |
| Alt [ft] | 35.0k - 10.0k | 29.0k - 25.7k |
| XM [-] | 0.75 - 0.26 | 0.74 - 0.67 |
| TRA [%] | 87.8 - 23.6 | 82.4 - 69.1 |

The performance of the proposed CLAC method is evaluated and compared to two alternative calibration models: an unscented Kalman filter (UKF) and a supervised end-to-end mapping with a deep learning algorithm (E2E). The evaluation also covers variants of Dataset #1 designed to evaluate the robustness of the different methods to uncertainty in the observations and system model predictions.

### 4.3 Dataset #1: A Small Fleet of Turbofan Engines

Dataset #1 provides degradation trajectories of a small fleet comprising ten turbofan engines with unknown and different initial health conditions. The trajectories are given in the form of multivariate time-series of sensor readings (i.e., ). The dataset was generated with the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dynamical model [Frederick2007]. Real flight conditions (), as recorded on board a commercial jet, were taken as input to the C-MAPSS model [DASHlink]. Figure 4 (left) shows a typical flight profile given by the scenario-descriptor variables (): altitude (alt), flight Mach number (XM), and throttle-resolver angle (TRA) for ten units (). All the units are affected by the same fault mode corresponding to the degradation of the high pressure turbine (HPT) efficiency. Figure 4 (right) shows degradation profiles of each unit of the fleet given by the trace of the true HPT Eff. . The degradation of the HPT evolves following a stochastic process with a linear *normal degradation* followed by a steeper *abnormal degradation*. The degradation rate of each component varies within the fleet. More details about the generation process can be found in [AriasChao2020].

As discussed above, training a supervised end-to-end deep learning model requires access to the ground truth labels, i.e., . Therefore, for training the E2E algorithm, we assumed that the labels are available for a subset of the units (Units 1, 4, 7, and 9) corresponding to low altitude and short flights. This experiment design generates a training dataset that is not fully representative of the possible system responses present in the test set, where higher altitude and longer flights are present.

### 4.4 Dataset #2: A set of fault scenarios in turbofan engines

Dataset #2 provides simulated condition monitoring data (i.e., ) of an advanced gas turbine during three flight profiles and multiple fault scenarios. The dataset was synthetically generated with the AGTF30 (Advanced Geared Turbofan 30k lbf) dynamical model [Chapman2017] taking as input real flight conditions as recorded on board a commercial jet [DASHlink]. Concretely, three different flight trajectories with a duration of 5000 s are considered. The dataset consists of concatenated time series of sensor readings (i.e., ) resulting from faulty engine conditions. The fault conditions are induced and simultaneously affect four health-related model parameters representing model modifiers of the high pressure turbine (HPT) and low pressure turbine (LPT) flow and efficiency. A total of 1315 different fault scenarios are generated by factorial design of a finite set of possible degradation intensities for each component. No additional noise was added to the model response since the flight conditions are already noisy.

## 5 Results

The aim of the proposed framework is to enable for the first time accurate, real-time, and robust model calibration for large-scale problems. Therefore, in this section, the performance of the proposed method is analysed based on six evaluation criteria: inference accuracy, computational cost, robustness to system model uncertainty, robustness to observation noise, scalability to large datasets, and tracking accuracy.

Inference Accuracy. The primary objective of model calibration is to infer the values of the model parameters . From the application perspective of model-based diagnostics, this objective corresponds to inferring the true underlying degradation parameters. Therefore, we compare the estimated degradation parameters () with the ground truth and report the inference accuracy in the form of the root mean square error (RMSE). Table 2 shows the inference performance of the unscented Kalman filter (UKF), end-to-end mapping (E2E), and the proposed method (CLAC) in both datasets. With the lowest RMSE, the policy obtained with CLAC shows the best overall performance in both datasets. The improvement is particularly significant under complex fault modes (i.e., Dataset #2). The E2E model yields the worst overall performance in Dataset #1, which highlights the limitations of supervised learning in cases where the training dataset is not fully representative of the test conditions. Figure 5 shows the inferred unobserved model parameters obtained with the three methods in Dataset #1. It is worth mentioning that unlike the end-to-end mapping, which needs the ground truth degradation parameters for training, our framework does not need any prior knowledge about the degradation parameters. This makes the approach more flexible and more applicable to real scenarios.

| Method | Dataset #1 | Dataset #2 |
|---|---|---|
| UKF | 3.42e-04 | 3.51e-03 |
| E2E | 1.36e-03 | |
| CLAC | 3.30e-04 | 2.50e-03 |
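The accuracy metric used for this comparison is the root mean square error between the estimated and the true degradation parameters. A minimal sketch of that computation follows; the function name `rmse` and the shape check are choices made for this example.

```python
import numpy as np


def rmse(estimates, ground_truth):
    """Root mean square error between two arrays of equal shape."""
    estimates = np.asarray(estimates, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    if estimates.shape != ground_truth.shape:
        raise ValueError("estimates and ground_truth must have the same shape")
    return float(np.sqrt(np.mean((estimates - ground_truth) ** 2)))
```

Because the errors are squared before averaging, a few large deviations, such as those produced by an unstable estimate, weigh more heavily than many small ones.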

Computational Cost. One crucial aspect of the proposed method is the ability to perform real-time calibration. Therefore, we evaluate the time required to perform inference of the model parameters at deployment. Table 3 reports the average time required to calibrate a single sample and the total training time for the three methods. In terms of deployment computational cost, the proposed method is roughly four orders of magnitude faster than the UKF: according to Table 3, inference with CLAC takes 4.0e-04 s per sample on a CPU thread, against 6 s for the UKF. This deployment speed is comparable to the E2E model, as both methods only require a forward pass over a deep neural network. By contrast, the UKF needs to perform model evaluations, which for Dataset #1 amounts to 6 s. The CLAC method requires on the order of hours of training on an ordinary PC (6200 s in Table 3) and therefore incurs all the computational cost in the training phase, which is typically not critical for practical applications. For real-time applications, the main limiting factor is the deployment time. Therefore, in terms of computational cost, the proposed method has a clear advantage over the current state-of-the-art approaches.

Method | Deployment Time [s] | Training Time [s] |
---|---|---|
UKF | 6 | 0 |
E2E | 4.2e-04 | 200 |
CLAC | 4.0e-04 | 6200 |

Robustness to Environment Uncertainty. Robustness to model inaccuracy is an important aspect of model calibration. It is also a well-known limitation of model-based methods such as the UKF. To evaluate the sensitivity of the different approaches to inaccuracies in the models, we apply a model bias to the output of the dynamic system model to emulate an inadequacy of the system model structure (i.e., an inaccurate simulator). We also consider a case in which Gaussian noise is added to the output of the dynamic system model. It is worth noting that adding noise to the output of the simulator transforms the deterministic model into a stochastic system model.
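
The two perturbations can be sketched as a thin wrapper around a model output. This is a minimal illustration that assumes an additive bias and additive zero-mean Gaussian noise; the exact perturbation form and magnitudes used in the experiments are not reproduced here, and `perturb_model_output`, `toy_model`, `bias`, and `noise_std` are hypothetical names.

```python
import numpy as np


def perturb_model_output(y, bias=0.0, noise_std=0.0, rng=None):
    """Return y + bias + eps, with eps ~ N(0, noise_std^2) drawn per element.

    Assumed (hypothetical) perturbation form: additive bias and additive
    zero-mean Gaussian noise on the simulator output.
    """
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, noise_std, size=y.shape) if noise_std > 0 else 0.0
    return y + bias + noise


def toy_model(x):
    """Stand-in deterministic model; not the engine model from the paper."""
    return 2.0 * np.asarray(x, dtype=float)


x = np.linspace(0.0, 1.0, 5)
y_clean = toy_model(x)

# Biased simulator: every call returns the same shifted output.
y_biased = perturb_model_output(y_clean, bias=0.05)

# Noisy simulator: repeated calls return different outputs (stochastic model).
rng = np.random.default_rng(0)
y_noisy_a = perturb_model_output(y_clean, noise_std=0.1, rng=rng)
y_noisy_b = perturb_model_output(y_clean, noise_std=0.1, rng=rng)
```

The biased variant remains deterministic, whereas the noisy variant returns a different output on every call, which is what turns the deterministic model into a stochastic one.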

From the RL perspective, coping with an inaccurate simulator is known as the sim-to-real transfer problem. Sim-to-real transfer is a critical problem in reinforcement learning because the agent is trained in a simulated environment that may differ from the real world. In our case, we use a surrogate DNN model to accelerate the training, which introduces an unavoidable error between the DNN surrogate model and the physics-based engine model. Hence, even when no noise is added, the agent needs to make decisions based on noisy DNN model outputs at every time step.

To test the trained policy under biased and noisy simulators, we evaluated two variants in which we added a fixed bias and 10% Gaussian noise, respectively, to the output of the DNN model. Table 4 shows that the policy obtained with the CLAC model provides very good inference even under fairly large uncertainty, demonstrating better robustness than the UKF, which failed to produce a stable inference. The superior inference performance of the CLAC model under fixed bias is visualised in Figure 6.

Perturbation | Intensity | UKF | CLAC |
---|---|---|---|
Model bias | | 2.04e-3 | 3.30e-04 |
Model noise | | # | 4.22e-04 |

Scalability to Large Datasets and High-Dimensional Model Calibration Parameters. When the dimensionality of the physics-based model parameters increases, the complexity of the inference increases as well. Due to the non-linear correlation among the degradation parameters, and between the degradation parameters and the observations, solving the calibration problem in high-dimensional spaces can lead to confounded solutions. In scenarios with noisy observations and systems with poor observability, methods that solve the inverse problem, such as the UKF, may spuriously associate calibration factors that produce similar system outputs. To test the scalability of our policies, we performed experiments with 1, 2, and 4 degradation parameters in the AGTF30 experiments (i.e., Dataset #2). Figure 8 shows the inferred and ground truth traces of a four-dimensional degradation parameter vector in Dataset #2 with the UKF (left) and CLAC (right) approaches. As in the previous plots, the values for 1315 fault intensities are stacked one after the other, thus generating a single time sequence. We can observe that the UKF solution confounds or smears the source of degradation. Moreover, as observed for Dataset #1, the predictions show a large bias at the beginning of each fault combination. The proposed CLAC method resolves both of these issues.

Robustness to Sensor Noise. In real scenarios, the observations are always noisy. Therefore, it is also important to obtain a policy that is robust to sensor noise. To evaluate this effect, we modelled the engine sensor noise and generated a noisy dataset by adding Gaussian noise with a signal-to-noise ratio (SNR) of 70 dB to the original dataset. Table 5 shows the impact of noise on the inference performance of the UKF and CLAC methods. In this case, although our policy still shows good inference ability, the UKF is more robust to sensor noise.

Perturbation | Intensity | UKF | CLAC |
---|---|---|---|
Observation noise | | 3.72e-04 | 7.18e-04 |
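
Adding Gaussian noise at a prescribed SNR can be sketched as follows. The sketch assumes that the signal power is the mean square of each sensor channel and that the noise is zero-mean and independent across samples; the function name `add_noise_at_snr` and the synthetic signals are illustrative assumptions, not the engine sensor model used for the experiments.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng=None):
    """Add zero-mean Gaussian noise so that each channel has the given SNR in dB.

    SNR_dB = 10 * log10(P_signal / P_noise), hence
    P_noise = P_signal / 10**(SNR_dB / 10).
    """
    signal = np.asarray(signal, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    # Per-channel signal power (time runs along axis 0).
    signal_power = np.mean(signal ** 2, axis=0, keepdims=True)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(size=signal.shape) * np.sqrt(noise_power)
    return signal + noise


# Synthetic two-channel "sensor" signal (made-up data).
t = np.linspace(0.0, 1.0, 20000)
clean = np.stack(
    [np.sin(2 * np.pi * 5 * t) + 2.0, np.cos(2 * np.pi * 3 * t) + 1.0], axis=1
)
noisy = add_noise_at_snr(clean, snr_db=70.0, rng=np.random.default_rng(0))

# Empirical SNR per channel, which should be close to the requested value.
measured_snr_db = 10.0 * np.log10(
    np.mean(clean ** 2, axis=0) / np.mean((noisy - clean) ** 2, axis=0)
)
print(measured_snr_db)
```

At 70 dB, the noise power is seven orders of magnitude below the signal power, so the perturbation is small in absolute terms.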

Tracking Accuracy. We formulate the calibration problem as a tracking problem and use reinforcement learning to track the operational trajectories of the real system (i.e., the observations) while constraining the policy to be stable. Therefore, we evaluate the error between the observed real system response and the calibrated model output. Figure 7 shows that our policies exhibit good tracking ability for the model outputs. Table 6 provides a complete overview of the root mean square error (RMSE) for each of the evaluated test cases. Although the CLAC framework shows good tracking ability in all setups, the UKF achieves better tracking. This is expected with the current RL formulation because the reinforcement learning agent is solving a more complicated problem. In particular, the current state contains the output of the DNN model instead of the historical observation, as a result of which small errors accumulate. On the other hand, it is precisely this aspect that ensures that the proposed policy generalizes well to unseen degradation trajectories.

Method | Dataset #1 | Dataset #2 |
---|---|---|
UKF | 0.62 | 1.78 |
CLAC | 0.98 | 5.54 |

### 5.1 Ablation Study

Comparison between LAC and CLAC algorithms. We extend LAC to CLAC to improve the stability of the policy under noisy conditions. To demonstrate the benefit of the proposed extension, we compared the inference performance of both algorithms, LAC and CLAC, on Dataset #1. In the C-MAPSS experiments, the flight conditions are very diverse, and the DNN model is not very accurate and is particularly noisy. Therefore, the DNN model may lead to an unstable policy. Figure 9 shows the policy's actions with the LAC algorithm (orange squares) and the ground truth (blue dots) for the entire trajectories and demonstrates a significant reduction in the variance of the policy. Concretely, LAC results in an RMSE of 1.3e-03, while CLAC achieves an RMSE of 3.3e-04. CLAC therefore reduces the inference RMSE by roughly a factor of four.

## 6 Conclusions and future work

We proposed a maximum entropy reinforcement learning framework and the constrained Lyapunov-based actor-critic (CLAC) algorithm for model calibration. The proposed calibration methodology achieves high inference accuracy and robustness while reducing the computational load to a level that makes the proposed methodology applicable to real-time, noisy, and large-scale calibration problems. This capability was achieved purely on the basis of training in a simulation environment without any tedious sampling or computationally expensive solution of an inverse problem. Moreover, and in contrast to the end-to-end learning architectures, the proposed methodology only requires access to the model and the observations, eliminating the need for any ground truth calibration parameters for training. Overall, the proposed CLAC algorithm achieves more precise and faster inference than the prior state-of-the-art while being more robust to system model uncertainty.

The proposed framework can be combined with various RL algorithms and can be extended to meta-RL [rakelly2019efficient, finn2017model] or hierarchical RL [dietterich2000hierarchical, barto2003recent]. All our experiments are currently performed in a simulated environment. As a next step, we plan to evaluate the resulting policies on real industrial plants or robots.

Although the learning framework presented in this work is demonstrated on a model-based diagnostics task, it is applicable to any physics-based model, including those used in so-called "digital twins". Therefore, the results presented in this paper suggest a promising research direction in the field of model calibration. From an application perspective, the targeted model-based diagnostics problem was solved using exclusively a set of three deep neural networks. Therefore, the proposed framework represents a paradigm shift in the field of model-based diagnostics. Starting with a model-based problem, we demonstrate that a suitable arrangement of deep neural networks can learn both the relevant physics of a complex system and the inference techniques required for diagnostics. It is worth pointing out that the deep neural networks serve very different roles (e.g., as the surrogate of a physics-based system model or as an inference network in a decision-making problem). The proposed framework demonstrates the great potential of fusing physics-based and deep learning models.

## References

## Appendix A Neural Network Architectures and Hyper-parameters

### A.1 Reinforcement learning

The proposed framework and method require three neural networks: the policy, Lyapunov, and dynamical model networks. The overall network structure of the proposed method is shown in Figure 10.

Policy and Lyapunov Networks. For the policy network, we use a fully-connected multi-layer perceptron (MLP) with two hidden layers of 256 units, outputting the mean and standard deviation of a Gaussian distribution. We apply the invertible squashing function technique proposed in [haarnoja2018soft2] to the output layer of the policy network. For the Lyapunov network, we use a fully-connected MLP with two hidden layers of 256 units, outputting the Lyapunov value. All the hidden layers use the leaky-ReLU [maas2013rectifier] activation function.

Simulator Network. The system dynamics are approximated with an MLP with four layers. The hidden layers have 100 units. The output layer has the dimension of the sensor reading vector. The ReLU activation function is used throughout the hidden layers, and the output layer uses the identity activation.
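
The policy network described above can be sketched in plain NumPy as a forward pass only. This is a minimal illustration, not the authors' implementation: the state and action dimensions, the leaky-ReLU slope of 0.01, the log-standard-deviation parameterization with clipping to [-20, 2], and all function names are assumptions. The squashing step uses `tanh` with the usual change-of-variables correction to the log-probability.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    # Assumed slope; the source does not state it.
    return np.where(x > 0, x, slope * x)


def glorot_uniform(rng, fan_in, fan_out):
    # Xavier/Glorot uniform initialization.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def init_policy(rng, state_dim, action_dim, hidden=256):
    """Two hidden layers of 256 units; the head outputs mean and log-std."""
    sizes = [state_dim, hidden, hidden, 2 * action_dim]
    return [
        (glorot_uniform(rng, m, n), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def policy_sample(params, state, rng):
    """Sample a squashed Gaussian action and its log-probability."""
    h = np.asarray(state, dtype=float)
    for w, b in params[:-1]:
        h = leaky_relu(h @ w + b)
    w, b = params[-1]
    mean, log_std = np.split(h @ w + b, 2, axis=-1)
    std = np.exp(np.clip(log_std, -20.0, 2.0))  # assumed clipping range
    u = mean + std * rng.standard_normal(mean.shape)  # pre-squash sample
    action = np.tanh(u)  # invertible squashing to (-1, 1)
    # Gaussian log-density of u ...
    log_prob = np.sum(
        -0.5 * ((u - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi),
        axis=-1,
    )
    # ... minus the log-determinant of the tanh Jacobian.
    log_prob -= np.sum(np.log(1.0 - action ** 2 + 1e-6), axis=-1)
    return action, log_prob


rng = np.random.default_rng(0)
params = init_policy(rng, state_dim=6, action_dim=2)  # hypothetical dimensions
actions, log_probs = policy_sample(params, np.zeros((5, 6)), rng)
print(actions.shape, log_probs.shape)
```

A Lyapunov network with the same hidden structure would differ only in its head, which would output a single scalar value instead of a mean and a standard deviation.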

The optimization of the networks’ weights was carried out with mini-batch stochastic gradient descent (SGD) using the Adam algorithm [Kingma2014Adam]. The Xavier initializer [Glorot] was used for the weight initializations. The following table provides a detailed overview of the hyperparameters used for the experiments.

| Hyperparameters | Value |
|---|---|
| Minibatch size | 256 |
| Learning rate - Actor | 1e-4 |
| Learning rate - Critic | 3e-4 |
| Learning rate - E2E | 1e-4 |
| Target entropy | -d |
| Target smoothing coefficient | 0.005 |
| Discount | 0.99 |
| | 1 |
| Initial | 2 |
| | 0.1 |
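
A target smoothing coefficient of 0.005 commonly denotes the rate of an exponential moving average (Polyak) update of target network weights in actor-critic methods. The sketch below assumes that this is its role here; `soft_update` and the toy parameter lists are hypothetical.

```python
import numpy as np


def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau towards its online counterpart."""
    return [
        (1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)
    ]


# Toy parameters (made-up shapes): target starts at 0, online weights are 1.
target = [np.zeros((3, 3)), np.zeros(3)]
online = [np.ones((3, 3)), np.ones(3)]

for _ in range(100):
    target = soft_update(target, online, tau=0.005)

# After n updates, each entry equals 1 - (1 - tau)**n.
print(target[1][0], 1.0 - (1.0 - 0.005) ** 100)
```

With a small coefficient, the target weights lag the online weights by several hundred updates, which smooths the targets used for the value estimates.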

### A.2 E2E and UKF

E2E Network. To evaluate the different calibration methods under equivalent models, the E2E network is also an MLP. In this way, we separate the effect of regularization in the form of the choice of model and learning strategy from other inductive biases in the form of the choice of neural network type. The hidden layers have 100 units. The output layer has the dimension of the sensor reading vector. The ReLU activation function is used throughout the hidden layers, and the output layer uses the identity activation. The resulting architecture is the result of a grid search.

Set-up of the UKF algorithm. The UKF algorithm requires the definition of two covariance matrices. We assumed both covariance matrices to be diagonal with normalized standard deviations, i.e., each is a scaled identity matrix of the corresponding dimension.
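
A scaled-identity covariance and the sigma points that a standard unscented transform derives from it can be sketched as follows. The standard deviation `sigma = 0.01`, the dimension `n = 4`, and the scaling parameters `alpha` and `kappa` are illustrative assumptions, not the values used in the experiments. The sketch shows why the filter needs several model evaluations per sample: the system model is evaluated once for each of the 2n + 1 sigma points.

```python
import numpy as np


def sigma_points(mean, cov, alpha=1.0, kappa=0.0):
    """Return the 2n + 1 sigma points and mean weights of the unscented transform."""
    mean = np.asarray(mean, dtype=float)
    n = mean.size
    lam = alpha ** 2 * (n + kappa) - n
    # Columns of the Cholesky factor of (n + lam) * cov give the offsets.
    sqrt_cov = np.linalg.cholesky((n + lam) * np.asarray(cov, dtype=float))
    points = np.vstack([mean, mean + sqrt_cov.T, mean - sqrt_cov.T])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
    weights[0] = lam / (n + lam)
    return points, weights


n = 4          # hypothetical number of calibration parameters
sigma = 0.01   # hypothetical normalized standard deviation
cov = sigma ** 2 * np.eye(n)  # diagonal (scaled identity) covariance

points, weights = sigma_points(np.zeros(n), cov)
print(points.shape)  # one row per sigma point, i.e. per model evaluation
```

The weighted sample mean and covariance of the sigma points recover the mean and covariance they were generated from, which is the property the filter relies on when propagating them through the non-linear model.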