1 Introduction
Aging of critical assets is an omnipresent phenomenon in any production environment, causing significant maintenance expenditures or leading to production losses. The understanding and anticipation of the underlying degradation processes is therefore of great importance for a reliable and economic plant operation, both in discrete manufacturing and in the process industry.
With a focus on the chemical industry, notorious aging phenomena include the deactivation of heterogeneous catalysts [forzatti1999catalyst] due to coking [barbier1986deactivation], sintering [HARRIS1995], or poisoning [nielsen1995poisoning]; plugging of process equipment, such as heat exchangers or pipes, on process side due to coke layer formation [cai2002coke] or polymerization [wu2018data]; fouling of heat exchangers on water side due to microbial or crystalline deposits [muller2000heat]; erosion of installed equipment, such as injection nozzles or pipes, in fluidized bed reactors [wang1996erosion, werther2000fluidized]; and more.
Despite the large variety of affected asset types in these examples, and the completely different physical or chemical degradation processes that underlie them, all of these phenomena share some essential characteristics:

The considered critical asset has one or more key performance indicators (KPIs), which quantify the progress of degradation. ^{1}^{1}1Such a KPI can either be measured directly or often only indirectly through proxy variables. For example, while catalyst activity is not measured directly in process data, it manifests itself in reduced yield and/or conversion of the process.

On a time scale much longer than the typical production time scales (i.e., batch time for discontinuous processes; typical time between set point changes for continuous processes), the KPIs drift more or less monotonically to ever higher or lower values, indicating the occurrence of an irreversible degradation phenomenon. (On shorter time scales, the KPIs may exhibit fluctuations that are not driven by the degradation process itself, but rather by varying process conditions or background variables such as, e.g., the ambient temperature.)

The KPIs return to their baseline after maintenance events, such as cleaning of a fouled heat exchanger, replacement of an inactive catalyst, etc.

The degradation is no ‘bolt from the blue’ – such as, e.g., the bursting of a flawed pipe –, but is rather driven by creeping, inevitable wear and tear of process equipment.
Any aging phenomenon with these general properties is addressed by the present work.
Property (4) suggests that the evolution of a degradation KPI is to a large extent determined by the complex process conditions, and not by uncontrolled, external factors. This sets the central goal of the present work: To forecast the evolution of the degradation KPI over a certain time horizon, given the planned process conditions in this time frame.
For virtually any important aging phenomenon in chemical engineering, the respective scientific community has developed a detailed understanding of their microscopic and macroscopic driving forces. This understanding has commonly been condensed into sophisticated mathematical models. Examples of such mechanistic degradation models deal with coking of steamcracker furnaces [GAO2009501, DeSchepper2010, BERRENI20112876], sintering [ruckenstein1973growth, li2017modeling] or coking [froment2001modeling] of heterogeneous catalysts, or crystallization fouling of heat exchangers [brahim2003numerical].
While these models give valuable insights into the dynamics of experimentally nonaccessible quantities, and can help to verify or falsify hypotheses about the degradation mechanism in general, they are usually not (or only with significant modeling effort) transferable to the specific environment in a realworld apparatus: Broadly speaking, they often describe ‘clean’ observations of the degradation process in a lab environment, and do not reflect the ‘dirty’ reality in production, where additional effects come into play that are hard or impossible to model mechanistically. To mention only one example, sintering dynamics of supported metal catalysts are hard to model quantitatively even in the ‘clean’ system of Wulffshaped particles on a flat surface [li2017modeling] – while in real heterogeneous catalysts, surface morphology and particle shape may deviate strongly from this assumption. Consequently, mechanistic models are rarely used in a production environment to forecast degradation dynamics of critical assets.
When dealing with the complexity of modeling realworld aging processes, statistical approaches have proven successful in a variety of applications. For example, datadriven methods for fault detection of chemical plants [russell2012data]
, such as multivariate anomaly detection with fisher discriminant
[chiang2000fault]or principal component analysis
[Kresta1991pca], are routinely applied nowadays to monitor process equipment. However, we emphasize that most of these applications focus on the detection or monitoring of degradation, not on the prediction of its progression.Examples of datadriven models that predict degradation dynamics deal with batchtobatch fouling of heat exchangers as a function of the polymer type produced in the respective batch [wu2018data], or fouling of the crude preheat train in petroleum refineries [radhakrishnan2007heat, aminian2008evaluation]. However, to the best of our knowledge, all published work in this field is either based on classical statistical regression methods, such as ordinary or partial least squares [wu2018data], or on smallscale machine learning (ML) methods, such as small feedforward neural networks (FFNN) trained with limited datasets [radhakrishnan2007heat, aminian2008evaluation]. So far, advanced ML algorithms, such as recurrent neural networks (RNN), trained with years or even decades of historical plant data, have not been studied in depth in the context of predicting degradation of chemical process equipment [LEE2018111]. It is the aim of this work to investigate the prospects of advanced ML methods for this problem, compare them to classical regression methods and undestand potential limitations.
The rest of this paper is structured as follows: First, we formalize the general IAP problem setting (Section 2). Then we describe our two datasets (Section 3), as well as quickly introduce the five ML models that we evaluated for this task (Section 4). Finally, we present the prediction results of the different models on both datasets (Section 5) and conclude the paper with a discussion (Section 6).
2 Problem Definition
The general industrial aging process (IAP) forecasting problem is illustrated in Fig. 1: The aim is to model the evolution of one or several degradation KPIs within an upcoming time window between two maintenance events, referred to as the th degradation cycle, as a function of the planned process conditions in this cycle:
(1) 
where denotes stochastic noise that disturbs the deterministic relation between and .^{2}^{2}2Put differently, contains the influence of uncontrolled factors, which are always present, but which should not dominate the degradation dynamics according to property (4) in the introduction.
Degradation phenomena can exhibit pronounced memory effects, which means that a certain input pattern may affect the output only at much later times . In addition, these memory effects can also occur across multiple time scales, which makes these processes notoriously hard to model. As an example, consider a heat exchanger suffering from coking of the inner tube walls. The observed heat transfer coefficient serves as KPI , and process conditions comprise mass flow, chemical composition and temperature of the processed fluid. The time horizon is one cycle between two cleaning procedures (e.g. burnoff). If at an early time in the cycle an unfavorable combination of low mass flow, high content of coke precursors, and high temperature occurs, first coke patches can form at the wall, which are not yet big enough to impact heat transfer significantly. However, they serve as a nuclei for further coke formation later in the cycle, so that drops faster at compared to a cycle where the process conditions were not unfavorable around , but with very similar process conditions throughout the rest of the cycle.
An additional complication arises from the fact that in real application cases, the distinction between degradation KPI , process conditions , and uncontrolled influencing factors is not always clearcut. Consider, for example, the case of a heterogeneous catalyst subject to deactivation, where the loss of catalytic activity leads to a decreased conversion rate. In this case, the conversion rate could serve as a target degradation KPI , while process conditions, such as the temperature, which are manually controlled by the plant operators, would be considered input variables for the model. However, the plant operators might try to keep the conversion rate at a certain set point, which can be achieved by raising the temperature to counteract the effects of the catalyst degradation. This introduces a feedback loop between the conversion rate and the temperature, which means the temperature can not be considered an independent variable anymore, as its actual value (partially) depends on the target. Therefore, care has to be taken, since including such a dependent variable as an input in a model could lead one to report overly optimistic prediction errors that would not hold up when the model is later used in reality.
3 Datasets
To gain insights into and evaluate different ML models for the IAP forecasting problem, we consider two datasets: one synthetic, which we generated ourselves using a mechanistic model, and one containing realworld data from a large plant at BASF. Both datasets are described in more detail below.
The reason for working with synthetic data is that this allows us control two important aspects of the problem: data quantity and data quality. Data quantity is measured, e.g., by the number of catalyst lifecycles in the dataset, which can be chosen as large as we want for synthetic data, to test even the most datahungry ML methods. Data quality refers to the level of noise in the dataset, or, in other words, the degree to which the degradation KPI is uniquely determined by the provided process conditions in the dataset. In a synthetic dataset based on a deterministic degradation model, we know that there is a functional mapping between and , i.e., there exists no fundamental reason that could prevent a ML model from learning this relation with vanishing prediction errors. In contrast, with real data, a bad prediction error can either be a problem of the method, and/or of the dataset, which might not contain sufficient information on the input side to accurately predict the output quantity .
3.1 Synthetic dataset
For the synthetic dataset, we modeled the widespread phenomenon of slow, but steady loss of catalytic activity in a continuously operated fixedbed reactor. Ultimately, the catalyst deactivation leads to unacceptable conversion or selectivity rates in the process, necessitating a catalyst regeneration or replacement, which marks the end of one cycle.
The chemical process in the reactor under consideration is the gasphase oxidation of an olefine. To generate the time series for all variables, we used a mechanistic process model with the following ingredients (further details can be found in Section 7.1 in the supplement):

Mass balance equations for all five relevant chemical species (olefinic reactant, oxygen, oxidized product, CO, water) in the reactor, which is, for simplicity, modeled as an isothermal plug flow reactor, assuming ideal gas law. The reaction network consists of the main reaction (olefine + O product) and one side reaction (combustion of olefine to CO).

A highly nonlinear deactivation law for the catalyst activity, which depends on reaction temperature, flow rate, and inflowing oxygen, as well as the activity itself.

Kinetic laws for the reaction rates.

A stochastic process determining the process conditions (temperature, flow rate, etc.).
Based on the current process conditions and hidden states of the system, the mechanistic model generates a multivariate time series for roughly 2000 degradation cycles. The final dataset includes for each time point as input the five process parameters (mass flow rate, reactor pressure, temperature, and mass fractions of the two reactants olefine and O) and two degradation KPIs (conversion and selectivity).
To give an impression of the simulated time series, one month of data is shown in Fig. 2. The duration of deactivation cycles is around 810 days. The catalyst activity is a hidden state and therefore not part of the dataset, but is only shown to illustrate the dynamics of the problem: System output (selectivity and conversion) is not only affected by the current process parameters , but also the current catalyst activity , which is nonlinearly decreasing over each cycle.
In addition to the process parameters, the cumulative feed of olefine in the current cycle is also added to the dataset as a potential input quantity. This variable is often taken as a rough predictor of the catalyst activity. Therefore, it is usually calculated and monitored in the plant. In the language of machine learning, this variable represents an engineered feature of the raw input time series. This way, some basic domain knowledge about catalyst deactivation is added to the dataset.
3.2 Realworld dataset
The second dataset contains process data for the production of an organic substance in a continuous worldscale production plant at BASF. The process is a gas phase oxidation in a multitubular fixedbed reactor.
The catalyst particles in the reactor suffer from coking, i.e., surface deposition of elementary carbon in form of graphite. This leads to reduced catalytic activity and increased fluid resistance. The latter is the more severe consequence and leads to an increasing pressure drop over the reactor, as measured by the difference of gas pressure before and after the reactor.
When exceeds a predefined threshold, the socalled endofrun (EOR) criterion is reached. Then, the coke layer is burned off in a dedicated regeneration procedure, by inserting air and additional nitrogen into the reactor at elevated temperatures for a variable number of hours. Operational reasons can lead to a delayed burnoff with exceeding the EOR threshold, or, vice versa, a premature burnoff when has not yet reached the EOR threshold. Some exemplary cycles for are shown in Fig. 3.
Since coke is not removed perfectly by this burnoff procedure, coke residues accumulate from regeneration to regeneration, making the pressure drop issue ever more severe. Therefore, the entire catalyst bed must be replaced every 624 months.
Suspected influencing factors for the coking rate are:

mass flow through the reactor (“feed load”)

ratio of organic reactant to oxygen in the feed

intensity of previous regeneration procedures

length of the previous degradation cycle
The dataset contains seven years of process data from the four most relevant sensors, extracted from the plant information management system (PIMS) of the plant, as listed in Table 2
in the supplement. Given the time scale of 4 to 7 days between two burnoff procedures, this corresponds to 375 degradation cycles belonging to three different catalyst batches. The sampling rate is 1/hour for all variables with a linear interpolation to that time grid.
The task is to predict, at an intermediate moment
during a degradation cycle, the cokinginduced pressure drop over the entire remaining duration of the cycle. Of particular interest is a prediction of the time point at which the EOR threshold is reached.As mentioned above, several relevant process parameters may serve as input variables of the model (Table 2 in the supplement). Furthermore, engineered features, built from either those process parameters or from the degradation KPI in the previous cycles, can be used as additional inputs (see Table 3 in the supplement).^{3}^{3}3The latter features are in particular relevant as they may encode information about the longterm effects in the system, such as coke residues accumulating on the time scale of months and years. Forecasting these longterm effects is not the subject of the IAP forecasting problem for this dataset; we rather focus on forecasting the currently running cycle. Therefore, by the time the forecast is generated, the previous cycle has already been observed and may be used for feature engineering.
4 Machine Learning Methods
We will now frame the IAP forecasting problem defined in Section 2 in a machine learning setting. To this end, the mapping defined in Eq. (1) is expressed as a concrete function that returns
, an estimate of the KPIs at a time point
in the th degradation cycle, based on the process conditions at this time point as well as possibly up to hours before :(2) 
The task is to predict for the complete cycle (i.e., up to ), typically starting from about 24 hours after the last maintenance event that concluded the previous cycle. ^{4}^{4}4In Eq. (2), the prediction function is defined as a function of the current and past input variables . Since usually the values of the degradation KPIs are known for at least the first 24 hours of each cycle, in principle the set of input variables of could be extended to also include for . However, while this might improve the predictions at the beginning of the cycle, since our aim is to predict the complete cycle starting after the first 24 hours, for the predictions for most time points, not the real values could be used as input, but instead their predicted values would have to be used. Since these predicted values typically contain at least a small error, the forecast for time points further in the future would be based on noisier and noisier input data, as the prediction errors in the input variables would quickly accumulate. Therefore, the only explicit inputs to the model are the predefined process conditions .
The exact form of the function thereby depends on the kind of machine learning method that is chosen for the forecasting task. Yet, while the chosen ML model determines the form of the function, its exact parameters need to be adapted to fit the dataset at hand in order to yield accurate predictions. For this, first the available data is split into socalled “training” and “test” sets, where each of the two sets contains the entire multivariate time series from several mutually exclusive degradation cycles from the original dataset, i.e., multiple inputoutput pairs consisting of the planned conditions and degradation KPIs of the given process. Then, using the data in the training set, the ML algorithm learns the optimal parameters of by minimizing the expected error between the predicted KPIs and the true KPIs [vapnik1995nature, muller2001introduction, hansen2013assessment]. After the ML model has been trained, i.e., when predicts as accurately as possible on the training set, the model should be evaluated on new data to give an indication of its performance when later used in reality. For this, the test set is used. If the performance on the training set is much better than on the test set, the model does not generalize well to new data and is said to have “overfit” on the training data.
In addition to the regular parameters of
, many ML models also require setting some hyperparameters, that, for example, determine the degree of regularization (i.e., how much influence possible outliers in the training set can have on the model parameters). To find adequate hyperparameters, crossvalidation
[stone1974cross] can be used: here, in multiple iterations the training set is split further into a validation and a training part and a model with a specific hyperparameter setting is trained on the training part and evaluated on the validation part. Those hyperparameter settings that produce the best results on the validation splits are then used when training a final model on the whole training set, which is then evaluated on the setaside test set as described above.The machine learning models for time series prediction used in this paper can be divided into two main subgroups: stateless and stateful models (Fig. 4).
A stateless model directly predicts the output given the current inputs, independent of the predictions for previous time points. Stateful models, on the other hand, maintain an internal hidden state of the system that encodes information about the past and which is utilized in addition to the current process conditions when making a prediction.
Stateless models include most typical machine learning regression models, ranging from linear regression models to many types of neural networks
[draper2014applied, bishop2006pattern]. The stateless regression models that we will explore in this paper are linear ridge regression (LRR), kernel ridge regression (KRR), and feedforward neural networks (FFNN), i.e., one linear and two nonlinear prediction models.The most commonly used stateful models for the modeling of sequential data are recurrent neural networks (RNNs) [mandic2001recurrent]. While RNNs are some of the most powerful neural networks, capable of approximating any function or algorithm [siegelmann1995computational], they are also more involved to train [doya1992bifurcations, pascanu2013difficulty]
. Consequently, in this paper we chose to model IAPs using two different RNN architectures that are designed precisely to deal with the problems arising while training regular RNNs: echo state networks (ESN) and long short term memory (LSTM) networks.
The five ML models are briefly introduced in the following paragraphs, while a more detailed description can be found in Section 7.3 in the supplement. For simplicity, in many cases we only write and , omitting the reference to the current cycle and time points in questions, while might include the process conditions for multiple time points from a fixed time window in the past (i.e. up to ).
4.1 Stateless models
Stateless models are machine learning models that base their forecast only on the inputs within a fixed time window in the past, i.e., exactly as stated in Eq. (2).
Linear ridge regression (LRR)
LRR is an ordinary linear regression model with an added regularization term that prevents the weights from taking on extreme values due to outliers in the training set. The target variables are predicted as a linear combination of the input variables , i.e.,
where is a weight matrix, i.e., the model parameters of that are learned from the training data. The simple model architecture, globally optimal solution, and regularization of LRR all contribute to reducing overfitting of the model. Additionally, training and evaluating the model is not computationally expensive, making it a viable model for large amounts of data as well. Despite their relative simplicity, linear models are widely used in many application scenarios and can often be used to approximate realworld processes at fairly high accuracies, especially if additional (nonlinear) handengineered features are available [horn2019autofeat]. Furthermore, considering the limited amount of training data that is usually available for realworld IAP problems, reliably estimating the parameters of more complex nonlinear prediction models such as deep neural networks needs to be done with great care [lecun2012efficient], while linear models provide a more robust solution as they provide a globally optimal solution and are less likely to overfit given their linear nature.
Kernel ridge regression (KRR)
KRR is a nonlinear regression model that can be derived from LRR using the so called ‘kernel trick’ [scholkopf1998nonlinear, muller2001introduction, vapnik1995nature, scholkopf2002learning, muller1997predicting]. Instead of using the regular input features , the features are mapped to a high (and possibly infinite) dimensional space using a feature map , corresponding to some kernel function such that . By computing the nonlinear similarity between a new data point and the training examples for , the targets can be predicted as
where are the learned model parameters.
The nonlinear KRR model can adapt to more complex data compared to LRR, and the fact that the globally optimal solution can be obtained analytically have made KRR one of the most commonly
used nonlinear regression algorithms. However, the performance of the model is also more sensitive to the choice of hyperparameters, so a careful selection
and optimization of the hyperparameters is necessary. Additionally, the fact that computing the kernel matrix scales quadratically with the number of training examples makes it difficult to apply KRR to problems with large training sets.
Feedforward neural networks (FFNN)
FFNNs were the first and most straightforward type of neural networks to be conceived, yet, due to their flexibility, they are still successfully applied to many different types of machine learning problems ranging from classification and regression tasks to data generation, unsupervised learning, and more
[bishop1995neural, goodfellow2016deep, schmidhuber2015deep]. Analogously to LRR, FFNNs learn a direct mapping between some input parameters and some output values. However, unlike a linear model, FFNNs can approximate also highly nonlinear dependencies between the inputs and the outputs. This is achieved by transforming the input using a succession of “layers”, where each layer is usually composed of a linear transformation followed by a nonlinear operation
:FFNNs can be difficult to train since the error function is highly nonconvex and the optimization procedure usually only finds a local minimum, in contrast to the globally optimal solution found by LRR and KRR. However, the losses in these local minima are often similar to the global optimum [choromanska2015loss], so this properties does not significantly impact the performance of a properly trained neural network. Additionally, due to a FFNN’s large number of parameters () and high flexibility, if not properly trained (see [lecun2012efficient]) it may overfit, especially when using smaller training sets.
4.2 Stateful models
In contrast to stateless models, stateful models only explicitly use the input , not the past inputs , to forecast the output for some time point . Instead, they maintain a hidden state of the system that is continuously updated with each new time step and thus contains information about the entire past of the time series. The output can then be predicted utilizing both the current input conditions, as well as the hidden state of the model: .
The two stateful models that we are considering for this paper both belong to the class of recurrent neural networks (RNNs). RNNs are a powerful method for modeling time series, however they can be difficult to train since their depth increases with the length of the time series. If training is not performed carefully, this can lead to bifurcations of the gradient during the error backpropagation training procedure, which can result in a very slow convergence (“vanishing gradients problem”), if the optimization converges at all
[doya1992bifurcations, pascanu2013difficulty].Echo state networks (ESN)
ESNs are an alternative RNN architecture that can alleviate some of the above mentioned training related problems of RNNs by not using error backpropagation for training at all [jaeger2004harnessing]. Instead, ESNs use very large randomly initialized weight matrices, which essentially act as a random feature expansion of the input (similar to the implicit feature map used in KRR), combined with a recurrent mapping of the past inputs; collectively called the “reservoir”. This way, ESNs can keep track of the hidden state (with ) of the system by updating at each time step to contain a weighted sum of the previous hidden state and a combination of the randomly expanded input features and randomly recurrently mapped . The final prediction of the output is then computed using LRR on the inputs and hidden state, i.e.,
In general, echo state networks are a very powerful type of RNN, whose performance on dynamical system forecasting is often on par with or even better than that of other, more popular and complex RNN models (LSTM, GRU, etc.) [jaeger2004harnessing, bianchi2017overview]. Since the only learned parameters are the weights of the linear model used for the final prediction, ESNs can also be trained on smaller datasets without risking too much overfitting.
LSTM networks
Another very popular architecture for dealing with the vanishing gradients problem in RNNs is the long short term memory (LSTM) architecture, which was developed specifically for this purpose [hochreiter1997long]. LSTMs are trained using error backpropagation as usual, but avoid the problem of vanishing gradients by using an additional state vector called the “cell state”, alongside the usual hidden state. This cell state is the core component of the LSTM and runs through the entire recurrent chain while being updated slowly at each time step using only linear updates, making it capable of preserving long term dependencies in the data and maintaining a stable gradient over long sequences. The inclusion of new or removal of old information to the cell state is carefully regulated by special neural network layers called gates. While the updates of the hidden state of an LSTM network are much more complex compared to ESNs, the final prediction is again only a linear transformation of the network’s internal hidden state:
However, in this case, the parameter values of are optimized together with the other parameters of the LSTM network, instead of using a separate LRR model.
Due to the multiple layers needed to model the gates that regulate the cell state, the LSTM typically requires larger amounts of training data to avoid overfitting. Though despite its complexity, the stability of the
gradients of the LSTM make it very well suited for time series problems with longterm dependencies. This is
why LSTMs have become one of the most widely used machine learning models for sequential data, such as
speechtotext recognition [graves2014towards], machine translation [sutskever2014sequence], video
classification [donahue2015long], text [karpathy2015unreasonable] and image
generation [oord2016pixel].
5 Results
In this section, we report our evaluation of the five different ML models introduced in Section 4 using the synthetic and realworld datasets described in Section 3. To measure the prediction errors of the ML models, we use the mean squared error (MSE), which, due to the subdivision of our datasets into cycles, we define slightly differently than usual: Let the dataset be composed of cycles, and let denote the KPIs at time point within the th cycle, where is the length of the th cycle. Then, given the corresponding model predictions , the MSE of a model for the entire dataset is calculated as
Since the synthetic and realworld datasets are very different, they were used to examine different aspects of the models. The synthetic dataset was used to examine how the models perform in a nearly ideal scenario, where data is freely available and the noise is very low or even nonexistent. On the other hand, the realworld dataset was used to test the robustness of the models, since it contains only a limited amount of training samples and a relatively high noise level.
5.1 Synthetic dataset
In order to systematically evaluate the performance of the different methods in a controlled environment, a synthetic dataset was generated as described in Section 3. A total of 50 years of historical data were generated, consisting of 2153 cycles for a total of 435917 time points. Roughly 10% of the cycles of the dataset were randomly selected as the outofsample test set, resulting in a training set consisting of 1938 cycles (391876 time points), and a test set consisting of 215 cycles (44041 time points). Only results for conversion as a degradation KPI are discussed; results for selectivity are similar.
The hyperparameters for the LRR, KRR, and ESN models were selected using a 10fold crossvalidation within the training set. The FFNN and LSTM models were trained using stochastic gradient descent, using Nesterov momentum for the parameter updates
[bengio2013advances]. The hyperparameters for the neural network models were determined based on the performance on a validation set consisting of a random selection of 15% of the cycles in the training set. The number of the training epochs was chosen using early stopping, with training being stopped if the validation set error had not improved in the last 6 epochs. More details about the hyperparameters and their optimized values can be found in Section
7.4 in the supplement.For the stateless models, i.e. LRR, KRR, and FFNN, the input vector at time point consisted of the process parameters for the past 24 hours, giving the models a time window into the past, i.e. . Further increasing this time window did not yield any noticeable improvements in performance for either model. Since the stateful models are capable of encoding the past into their hidden state, the input for the ESN and LSTM at any time point only consisted of the process parameters at the current time point, i.e. .
Fig. 5
shows the mean squared errors (MSE) for each of the models on the training and test sets across different training set sizes. For most of the models, the error converges relatively early, meaning that even with a fraction of the complete dataset, the models manage to learn an accurate approximation of the dynamics of the synthetic dataset, as far as the respective model complexity permits. This also indicates that the existing errors in the models are largely due to the limitations on the flexibility of the models themselves, and not due to the training set not being large enough. This is clearly evident with LRR, which essentially achieves its maximum performance using 5% of the total dataset size. Since LRR is a linear model, it can only learn the linear relations between the inputs and outputs. While this high bias prevents the model from learning most of the nonlinear dynamics regardless of the training set size, this also means that the model has low variance, i.e., it tends not to overfit on the training data
[friedman2001elements]. For the FFNN, the error slowly declines as the number of samples increases, though at an ever slower rate, with the error using the full training dataset being significantly lower that LRR. As for ESN and LSTM, both methods seem to somewhat overfit for the smaller training set sizes, judging by the differences between training and test errors, however, even then the test errors are much lower compared to the three stateless models. The errors of both models converge at around 50% of the full dataset, after which there is virtually no overfitting and no significant improvement of the performance for larger dataset sizes. The general lack of overfitting can be explained by the fact that the training and test set are generated using the exact same model, i.e., they are taken from the same distribution, which is the optimal setting for any machine learning problem. Additionally, the lack of noise in the synthetic dataset also helps explain the lack of overfitting, since overfitting usually involves the model fitting the noise instead of the actual signal/patterns. Across all dataset sizes, the LSTM model is clearly the best performing, with its error when using the full dataset being 5 times smaller than the error of the ESN model. The corresponding numeric error values are reported in Section 7.5 in the supplement.Given the great performance of the ESN and especially the LSTM model, these experiment clearly demonstrate that even with smaller amounts of highquality data, entire degradation cycles can in principle be predicted with very high accuracy.
Fig. 6 shows plots of the true and predicted conversion rates of the different models for some randomly selected cycles from the training and test sets. These show that all the models are capable of accurately predicting the instantaneous effects of the input parameters on the output, since this relation is largely linear and not time dependent. However, where the models differ the most is in the nonlinear long term degradation, where the stateless models only predict a roughly linear trend, with FFNN coming slightly closer to the actual degradation trend due to its nonlinearity, while the ESN model predicts the degradation better but fails to capture the rapid decline near the end of each cycle. The LSTM model, on the other hand, manages to capture the short and long term effects almost perfectly, with only small errors at the very ends of the cycles where there is smaller amounts of data, due to the varying length of the cycles.
5.2 Realworld dataset
The realworld dataset is much smaller than the synthetic, consisting of a total of 375 cycles. After removing some outlier cycles (shorter than 50 hours), the final size of the dataset is 327 cycles for a total of 36058 time points, i.e., it is more than 10 time smaller than the full synthetic dataset. As the realworld dataset stretches over 3 time periods with different catalyst charges in the reactor, we test the performance in a realistic manner by selecting the third catalyst charge as the test set, which makes it possible to see to what extent the models are able to extrapolate across the different conditions caused by the catalyst exchange. This resulted in a training set consisting of 256 cycles (28503 time points), while the test set consists of 71 cycles (7555 time points).
The hyperparameters for the realworld dataset were selected in an analogous manner to the synthetic dataset, only that due to the smaller size of the dataset, and thus shorter epochs, early stopping was triggered when the validation error had not improved in the last 30 epochs. Once again, more details about the optimized hyperparameters can be found in Section 7.4 in the supplement.
For this dataset, the input for both the stateful and stateless models at time point only consisted of the process conditions at that time point . Extending a time window for additional hours into the past only reduced the performance, since it reduces the size of the training set (if hours from the past are taken, the inputs for each cycle have to start hours later, leading to the loss of samples per cycle) and increases the number of input features, making overfitting more likely for all models.
Fig. 7 shows the mean squared errors for each of the models on the training and test sets. Due to the larger noise and the smaller amount of data, the results here are different compared to the ones for the synthetic dataset: The more complex models show more overfitting, since the test errors are significantly larger than the corresponding training errors, especially for KRR, which also has the largest test error of all models. On the other hand, LRR shows almost no overfit and its performance on the test set is much closer to that of the other models. Once again, ESNs and LSTMs outperform the stateless models, but this time, the margin is much slimmer and both models show a very similar performance. This is likely due to the larger potential for overfitting of the LSTM model here, given the larger level of noise and the smaller number of samples. Again, the specific numbers for the errors can be found in Section 7.5 in the supplement.
Fig. 8 again shows the plots of the true and predicted pressure differences of the different models for some randomly selected cycles from the training and test sets of the realworld dataset. In this case, the outputs are much noisier and none of the models captures the dynamics perfectly. Once again all of the models capture the instantaneous dynamics fairly accurately, though not as well as in the synthetic dataset, and all of the models struggle with the nonlinear degradation trend. The ESN and LSTM models capture the dynamics in the training set fairly accurately, but this seems to be a consequence of overfitting, since both models fail to accurately predict the degradation trend at the end of the cycles for the selected test cycles.
6 Discussion
Formulating accurate mathematical models of industrial aging processes (IAP) is essential for predicting when critical assets need to be replaced or restored. In worldscale chemical plants such predictions can be of great economic value, as they increase plant reliability and efficiency. While mechanistic models are useful for elucidating the influencing factors of degradation processes under laboratory conditions, it is notoriously difficult to adapt them to the specific circumstances of individual plants. Datadriven machine learning methods, on the other hand, are able to learn a model and make predictions based on the historical data from a specific plant and are therefore capable of adapting effortlessly to a multitude of conditions, provided enough data is available. While simpler, especially linear prediction models have previously been studied in the context of predictive maintenance [LEE2018111], a detailed examination of more recent and complex ML models, such as recurrent neural networks, was missing so far.
In this paper, we address the task of predicting a KPI, which indicates the slow degradation of critical equipment, over the time frame of an entire degradation cycle, based solely on the initial process conditions and how the process will be operated in this period. To this end, we have compared a total of five different prediction models: three stateless models, namely linear ridge regression (LRR), nonlinear kernel ridge regression (KRR) and feedforward neural networks (FFNN), and two recurrent neural network (RNN) based stateful models, echo state networks (ESN) and LSTMs. To assess the importance of the amount of available historical data on the models’ predictions, we have first tested them on a synthetic dataset, which contained essentially unlimited, noisefree data points. In a second step, we examined how well these results translate to realworld data from a largescale chemical plant at BASF.
While the stateless models (LRR, KRR, and FFNN) accurately captured instantaneous changes in the KPIs resulting from changing process conditions, they were mostly unable to pick up on the underlying trend caused by the slower degradation effects. ESN and LSTMs, on the other hand, were able to additionally correctly predict longterm changes, however at the expense of requiring a large amount of training data to do so. With more parameters to tune, the nonlinear models often overfit on specific patterns observed in the training data and therefore made comparatively more mistakes on new test samples. In general, all models, especially those based on RNNs, yielded very promising predictions, which are accurate enough to improve scheduling decisions for maintenance events in production plants. The choice of the optimal model in a particular case depends on the amount of available data. For very large datasets, we found that LSTMs can yield almost perfect forecasts over long horizons. However, if only a few cycles are available for training or the data is very noisy, it can be advantageous to resort to a simpler (linear) regression model. In these cases, more extensive feature engineering could help to improve the prediction [horn2019autofeat]. Overall, ESNs might be a reasonable compromise, as they automatically expand the feature space and keep track of the internal hidden state using the randomly parametrized “reservoir” and therefore only require training to fit the set of output weights.
While machine learning models are very good at interpolating, i.e., predicting values for data points from the same distribution as they were trained on, extrapolating beyond the patterns observed in the historical data is much harder and clearly a limitation of most machine learning models. Unfortunately, this is often what is required when dealing with realworld data. For example, on our realworld dataset, the machine learning models struggled to transfer their model of the training data to the test data, which came from a different catalyst charge and thus from a different distribution than the training data. These types of continuous changes and improvements to the setup of a production plant make it very difficult to compile a large enough dataset containing the relevant information for a model to predict future events. These time dependent changes in the data also require extra care as not to overfit on the past, e.g., when selecting hyperparameters for the models on the training set. Some of these effects might be tackled by explicitly addressing such nonstationarity between training and test set [sugiyama2007covariate, sugiyama2012machine]
. For future research it might furthermore be interesting to examine the effects of training on a larger collection of more diverse historical data and using transfer learning to be able to adapt the models to new datasets with smaller amounts of training data
[aswolinskiy2017unsupervised, fawaz2018transfer]. This way, more expressive models such as LSTMs could be trained to learn some general patterns on a larger dataset and then finetuned on the more recent time points (which should be closer to the current degradation dynamics) to yield more accurate predictions for the future.Especially when there is reason to believe that the predictions of a machine learning model may not always be perfectly on point, for example, when the model was trained on a very small dataset and predictions need to be made for inputs from a sightly different distribution, a predictive maintenance system could be further improved by providing confidence intervals, which quantify the uncertainty of the model for individual predictions
[efthymiou2012predictive]. This could, for example, be achieved by incorporating probabilistic approaches [blei2017variational, kolassa2016evaluating, kasiviswanathan2013quantification, feindt2006neurobayes]. Predicting intervals instead of point estimates may further facilitate planing by making it possible to assess the best and worst case scenarios.While accurate predictions of IAPs will improve the production process by allowing for longer planing horizons, ensuring an economic and reliable operation of the plant, the ultimate goal is of course to gain a better understanding of and subsequently minimize the degradation effects themselves. While mechanistic and linear models are fairly straightforward to interpret, neural network models have long been shunned for their nontransparent predictions. However, this is changing thanks to novel interpretation techniques such as layerwise relevance propagation (LRP) [bach2015pixel, montavon2018methods, montavon2017explaining, arras2017explaining, kindermans2018learning, lapuschkin2019unmasking], which make it possible to visualize the contributions of individual input dimensions to the final prediction. With such a method, the forecasts of RNNs such as LSTMs could be made more transparent, therefore shedding light on the influencing factors and production conditions contributing to the aging process under investigation, which could furthermore be used to help improve the underlying process engineering [zhao2019causal].
References
7 Supplement
7.1 Mechanistic model to generate the synthetic dataset
The mechanistic process model generating the dataset has the following ingredients:

Mass balance equations for all five relevant chemical species (olefinic reactant, oxygen, oxidized product, CO, water) in the reactor, which is for simplicity modeled as an isothermal plug flow reactor, assuming ideal gas law. The reaction network consists of the main reaction (olefine + O product) and one side reaction (combustion of olefine to CO). With this, the mass flow of each species at the reactor outlet is determined by the volumetric reaction rates , stoichiometric matrix , molar weights , reactor volume , and residence time as follows:
(3) The residence time is approximated as
(4) Here, is the total mass flow through the reactor, which is conserved, and are the molar concentrations at the reactor inlet. With known reactor temperature and pressure , mass flow rates can be converted into mass fractions and molar concentrations :
(5) (6) where denotes the universal gas constant.

A highly nonlinear deactivation law for catalyst activity , which depends on the reaction temperature (in Kelvin), flow rate , inflowing oxygen , activity itself, and kinetic parameters and :
(7) Catalyst activity is expressed in relative units (0% to 100%). As the deactivation can not be observed directly, is a hidden state variable of the system.

Kinetic laws for the reaction rates , with kinetic parameters and :
Note that only the main reaction is catalyzed and depends on .

The relation for selectivity and conversion of the process:
(8) (9)
Parameter values for , , are provided in Table 1.
Parameter  Value  Unit 
30000  
42  
15000  
45  
2.7E10  
50  
4.712E02 
The cumulative feed of olefine for each cycle is calculated as:
where is the beginning of the current deactivation cycle, i.e., the last catalyst regeneration event with .
With the mechanistic model, the time series of the process dynamics are generated in the following way:

Beginning of new deactivation cycle: Set catalyst activity to .

Pick discrete random values for the process parameters , , , , according to the procedure described below, and calculate the residence time , Eq. (4).

Set the duration of the next time window of constant process parameters [in hours] by a random integer number between 1 and 24.

For each hourly interval in this time window

Integrate the reaction kinetics over the residence time .

If : End of cycle criterion met. Catalyst is regenerated. Begin new cycle (step 1).

Reduce catalyst activity according to the deactivation dynamics, Eq. (7).


Go to step 2.
Generation of random process parameters
Each of the process parameters , , , and is generated independently through the following random process:

At initial time , pick uniformly from the range of potential values for that parameter. This range is for ; is in ; is in ; and in .

At the time of the th change of process parameters, pick
from a Gaussian distribution that is centered at the previous value
and has a standard deviation of
. 
After generating the random values in this way, round them them to six equidistant values in . This way, e.g., the potential values of mass flows are , with .
Finally, set . The remaining mass fractions at the reactor inlet are zero.
7.2 Variables of the realworld dataset (Tables 2 and 3)
Variable Name  Unit  Description  Type 

PD  mbar  pressure difference over reactor  y 
T  C  reaction temperature  x 
F_R  kg/h  inflow of organic reactant into reactor  x 
F_AIR  kg/h  mass inflow air into reactor  x 
Variable Name  Unit  Description 

operation_mode    logical variable indicating state of operation (1: reaction; 2: regeneration; 0: other mode of operation, e.g., shutdown) 
cat_no    counter to index different catalyst batches; incremented whenever catalyst is replaced 
cycle_no    counter to index different cycles; incremented when new reaction phase begins 
t_react  h  duration of current cycle, i.e., hours of operation in reaction phase after last regeneration procedure 
last_PD  mbar  pressure loss (PD) at the end of the previous cycle 
F_AIR / F_R    ratio air to organic reactant in feed 
F_AIR + F_R  kg/h  total feed rate (organic reactant + air) 
7.3 ML models in detail
7.3.1 Linear ridge regression (LRR)
The LRR model assumes that the process parameters and KPIs at a time are linearly related. For this purpose, a weight matrix is used to model how each of the process parameters affects each KPI, which can be used to predict the KPIs in up to some noise that can not be explained by the model:
In order to reduce the influence the outliers on the model and to prevent overfitting, we use a linear ridge regression model, which imposes L2regularization on the weight matrix of the standard linear regression model. The amount of regularization is controlled using the regularization parameter . Let denote the matrix containing the inputs of all cycles concatenated along the time dimension, while denotes the corresponding output matrix constructed analogously to . The optimal weight matrix can then be found by solving the following optimization problem:
(10) 
The advantage of ridge regression is that the optimization problem can be solved analytically to obtain the globally optimal weight matrix as
7.3.2 Kernel ridge regression
The linear ridge regression optimization problem in Eq. (10) is rewritten by applying the feature map on the inputs to get
leading to the analogous analytic solution
Since the feature space is not known explicitly, can not be calculated directly. However, using some algebraic transformations and the kernel trick [scholkopf1998nonlinear, muller2001introduction, scholkopf2002learning], the following solution can be derived:
The kernel function used in this paper is the radial basis function (RBF) kernel, also known as the Gaussian kernel:
The kernel width parameter , along with the regularization parameter results in a total of two hyperparameters to optimize for the KRR model. We do this using crossvalidation, while making sure that trials belonging to the same cycle do not end up in the both the training and the test set.
7.3.3 Feedforward neural networks (FFNN)
Let denote the feature vector resulting after the transformation at the th layer, and let be the linear transformation matrix, the bias and the nonlinear function applied at layer . Then a generic layer feedforward neural network can be defined by the following sequence of operations:
An illustration of this typical FFNN architecture is also shown in Fig. 9.
The 1st layer is usually called the input layer and the th layer the output layer, while all the layers inbetween are known as hidden layers. By changing the values of the weight matrices of the different layers, the network can be adjusted to approximate the target function. In fact, multilayer FFNN are proven to be universal function approximators, meaning that the weights can be adjusted to fit any continuous function (see e.g. [hornik1989multilayer]
). The most common method for training neural networks to fit a given function is using error backpropagation to iteratively update the values of the parameters of the network, usually consisting of the weight matrices and the bias vectors. Error backpropagation is a way to efficiently perform gradient descent on the FFNN parameters by minimizing the error/loss of the network’s outputs with respect to some target labels. One of the most commonly used loss functions for regression problems, and also the one we use in this paper, is the squared error (SE), which for a given FFNN, denoted by
, where denotes the network’s parameters, an input time series , and corresponding target labels is given by:By taking the gradient of the loss function with respect to the FFNN’s parameters, using backpropagation, one can efficiently perform gradient descent to update the parameters and minimize the loss of the network [bishop1995neural].
7.3.4 Echo state networks (ESN)
The basic architecture of an ESN is displayed in Fig. 10. The ESN is composed of two randomly initialized weight matrices, the input weight matrix and the (usually sparse) recurrent weight matrix , where is the dimensionality of the hidden state vector and is usually much larger than the dimensionality of the input features . We will also refer to these to matrices collectively as the reservoir. Let be the hidden state vector at time and its update, while is called the leaking rate determining tradeoff between adding new and keeping old information in the hidden state. Additionally, let denote vertical vector/matrix concatenation. Now given an input vector , the update for the hidden state is given by
(11)  
Note that is applied elementwise here. It may be counterintuitive that the randomly generated matrices and can produce useful features without being trained, but because the transformation in Eq. (11) can be interpreted as a random feature map that produces a nonlinear, high dimensional feature expansion with memory of the previous inputs. In this sense, one can draw a parallel to kernel methods, which also map the input features to a nonlinear and high dimensional features space, where the features are more easily linearly separable [lukovsevivcius2012practical].
Finally, the predictions are made based on the timedependent features generated by Eq. (11) using a final output matrix as
where is trained to minimize the MSE of , in our case using LRR as described in Section 7.3.1. Using LRR provides a globally optimal analytical solution for , making training the ESN simple and avoiding the problems that arise when training RNNs with error backpropagation. Due to the high dimensionality and nonlinearity of the hidden state features, a simple model like LRR can still fit complex and nonlinear dependencies between the input and the output features. Additionally, since LRR is very robust against overfitting, ESNs are also very well suited for problems with smaller amounts of data.
One downside to ESN is that their performance is strongly dependent on the choice of hyperparameters, which mostly regulate the initialization of the reservoir. Since the reservoir weights are not trained, this initialization is crucial in order to ensure that the ESN has desirable properties. The main hyperparameters of the ESN include the dimensionality of the hidden state , the sparsity of the matrix , the distribution from which the nonzero elements of the reservoir are sampled, the spectral radii of each of the reservoir matrices, the scaling of the input data and finally the leakage rate . These hyperparameters, as well as general recommendations for their choice are summarized in Table 4 (for more details, see [lukovsevivcius2012practical]).
Parameter  Recommended choice 

Hidden state size  Choose as large as you can afford. 
Connectivity of  A small fixed number of nonzero elements (e.g. 10) per row on average, irrespective of the hidden state size. 
Distribution of nonzero values  Symmetric distribution centered around 0. 
Spectral radius of  Smaller than or close to 1. 
Spectral radius of  Larger for highly nonlinear and small for linear tasks 
Leakage rate  Adjust according to the dynamics of the time signal; if signal is changing slowly set close to zero, for fast dynamics set close to 1 
One of the most important hyperparameters is the spectral radius of the matrix, because keeping this close to 1 helps maintain the echo state property of the network, which is essential for the ESN to be wellbehaved. In a nutshell, the echo state property means that the hidden state should be uniquely defined by the fading history of the input , i.e., for a long enough input , the hidden state should not depend on the initial conditions that were before the input, e.g. the initialization of the reservoir.
7.3.5 LSTM networks
Fig. 11 shows the full architecture of one recurrent LSTM cell. The first gate of the LSTM is the forget gate, which is trained to regulate which information is to be removed or ‘forgotten’ from the cell state. The forget gate produces a vector
that is calculated by applying the sigmoid function, denoted by
, elementwise to a linear transformation of the current input and the previous hidden state , resulting in the equationThe nonlinearity ensures that the values of are between 0 and 1, with a value of 0 of meaning forget everything at position in the cell state, and a value of 1 meaning preserve the information at this position completely. Analogously to the forget gate, the input gate creates a vector , which regulates which values of the cell state will be updated. Additionally, another layer is used to create a vector of candidate update values that could be added to the cell state based on . This results in the following two equations:
where the sigmoid nonlinearity is once again used to obtain values between 0 and 1 for , while the candidate update is generated using a nonlinearity to ensure that the values in the cell state can be updated both by increasing and decreasing them. In the next step, the cell state is updated based on the vectors produced by the forget and input gates using elementwise multiplication, denoted by . This update is given by
Then, the next shortterm hidden state needs to be generated, which is based on the now updated cell state. A sigmoid nonlinearity is once again used to regulate to which degree the values of the cell state will be included in the hidden state, and a nonlinearity is used to compress the cell state in the interval (1,1). The new hidden state is thus calculated as
Finally, the output vector is calculated from the hidden state. This part is not always required, as many time series problems require only one output for a whole sequence. In our case however, we need an output for each time point, which is then calculated using a simple linear layer as
7.4 Model hyperparameters
All of the models used in this paper have certain hyperparameters that need to be optimized and the correct choice of these hyperparameters can in some cases strongly influence the performance of the model. Of the models used in this paper, the performance of the KRR and ESN models was strongly influenced by the choice of hyperparameters, while the performance of the other models remained stable over a wider range of hyperparameter values. For the ESN model, which had the largest number of hyperparameters, we used random search to select sets of hyperparameters for testing, while for the other models the hyperparameters were optimized using a grid search. In the following we report the hyperparameters that resulted in the lowest validation errors for each of our models.
LRR only has one hyperparameter, the regularization parameter , which was selected as for the synthetic dataset and for the realworld dataset. KRR has two hyperparameters, the regularization parameter and the kernel width . The resulting optimal hyperparameters from our crossvalidation procedure were and for the synthetic dataset and and for the realworld dataset. Since the other models have multiple hyperparameters that need to be optimized, the optimal values for each dataset are presented in Table 5 for the FFNN, Table 6 for the ESN model, and finally Table 7 for the LSTM model hyperparameters.
Parameter  Synthetic dataset  Realworld dataset 

Batch size  128  16 
Learning rate  0.000005  0.00001 
Momentum  0.95  0.95 
Dropout rate  0  0.3 
Number of layers  2  2 
Layers size  512  512 
Parameter  Synthetic dataset  Realworld dataset 

Hidden state size  343  58 
Connectivity of  32  90 
Distribution nonzero values  Standard normal  Standard normal 
Spectral radius of  1.18  0.77 
Spectral radius of  0.004  0.0016 
Bias scale for  0.68  0.08 
Leakage rate  0.05  0.012 
Parameter  Synthetic dataset  Realworld dataset 

Batch size  64  16 
Learning rate  0.001  0.005 
Momentum  0.95  0.95 
Hidden layer size  512  256 
7.5 Results tables 8 and 9
Model  Synthetic data  Realworld data  

train  test  train  test  
LRR  
KRR  
FFNN  
ESN  
LSTM 
Model  Synthetic data  Realworld data  

train  test  train  test  
LRR  
KRR  
FFNN  
ESN  
LSTM 
Comments
There are no comments yet.