# Continual learning-based probabilistic slow feature analysis for multimode dynamic process monitoring

In this paper, a novel multimode dynamic process monitoring approach is proposed by extending elastic weight consolidation (EWC) to probabilistic slow feature analysis (PSFA) in order to extract multimode slow features for online monitoring. EWC was originally introduced for sequential multi-task machine learning with the aim of avoiding the catastrophic forgetting issue, which likewise poses a major challenge in multimode dynamic process monitoring. When a new mode arrives, a set of data should be collected so that this mode can be identified by PSFA and prior knowledge. Then, a regularization term is introduced to prevent new data from significantly interfering with the learned knowledge, where parameter importance measures are estimated. The proposed method, denoted as PSFA-EWC, is updated continually and capable of achieving excellent performance for successive modes. Different from traditional multimode monitoring algorithms, PSFA-EWC furnishes backward and forward transfer ability. The significant features of previous modes are retained while new information is consolidated, which may contribute to learning new relevant modes. Compared with several known methods, the effectiveness of the proposed method is demonstrated via a continuous stirred tank heater and a practical coal pulverizing system.


## I Introduction

Data-driven process monitoring is vitally important for ensuring the safety and reliability of modern industrial processes [34, 31, 23]. Dynamic process monitoring methods have been extensively studied in recent years [30, 16, 10]. Slow feature analysis (SFA), which is effective in extracting invariant slow features from fast-changing sensing data [28], has been widely extended to process monitoring applications. It has been shown that SFA can be applied to establish a comprehensive picture of the operating status, where nominal operating deviations and real faults may be distinguished in closed-loop systems [21, 22, 32, 20, 6]. Recursive SFA (RSFA) [22] and recursive exponential SFA [32] were developed, with the associated parameters updated for adaptive monitoring; sufficient samples were required to establish the initial model when a new mode was identified. Probabilistic SFA (PSFA) was proposed as a probabilistic framework for dynamic processes [20, 33] with the advantage of effectively handling process noise and uncertainties, in which measurement noise is modeled and missing data can be handled conveniently [6].

Most industrial systems operate in multiple operating conditions due to equipment maintenance, market demands, changes of raw materials, etc. Multimode dynamic process monitoring approaches have been widely investigated, and can currently be sorted into two broad categories [18], namely single-model schemes and multiple-model methods. Single-model methods transform the multimode data to a unimodal distribution [12, 29] or establish adaptive models. Local neighborhood standardization can normalize data into the same distribution, so that popular single-mode methods can be applied [12]. However, the effectiveness may be influenced by the matching degree of training and testing samples. Recursive principal component analysis was adopted for adaptive process monitoring [4]. Although prior knowledge is not required, these algorithms are effective only for slowly changing features and may fail to track dramatic variations across the entire dataset.

The mainstream approaches of multimode monitoring are based on multiple-model schemes, where the modes are identified and local models are established within each mode. The mixture of canonical variate analysis (MCVA) was explored, with the mode identified by Gaussian mixture models [27]. Improved mixture of probabilistic principal component analysis (IMPPCA) can be utilized for multimode processes [34], where the monitoring model parameters and the mode identification parameters are jointly optimized. Generally, the number of modes must be known a priori and data from all possible modes are required before learning, which is infeasible and time-consuming [18]. When novel modes appear, sufficient data must be collected and new local models relearned correspondingly. The monitoring model is only effective for the learned modes and may fail to deliver excellent performance for similar modes. Multiple-model schemes tend to be redundant and find it difficult to identify modes accurately [8]. Furthermore, the model's capacity and storage costs increase significantly with the emergence of new modes.

In the context of multimode process monitoring, new modes often appear continuously and different modes may share similar significant features [8]. In practical applications, it is often intractable to collect data from all modes. Zhang et al. applied continual learning to multimode process monitoring [35], where EWC was employed to settle the catastrophic forgetting of principal component analysis (PCA), referred to as PCA-EWC. However, data are treated as static in each mode and the mode has to be identified by statistical characteristics of the data, which makes PCA-EWC ineffective for multimode dynamic processes and makes it difficult to distinguish operating deviations from dynamic anomalies.

Against this background, this paper considers continual learning by extending EWC to PSFA, which is regarded as the model underlying multimode dynamic processes for the observed sequential data. The proposed method is referred to as PSFA-EWC. Data from each mode arrive sequentially and unknown modes are allowed. Moreover, the proposed algorithm is designed to distinguish real faults from normal operating deviations. When a new mode is identified by PSFA and prior knowledge, it is assumed that a set of data is collected before learning. A quadratic penalty term is introduced to avoid dramatic changes of the mode-relevant parameters when a new mode is trained [2]. Similar to [35], EWC is adopted to estimate the importance of the PSFA model parameters. The information from novel modes is assimilated while the learned knowledge is simultaneously consolidated, thus delivering continual learning ability for successive modes.

The contributions are summarized as follows:

1. Continual learning-based PSFA is investigated for multimode dynamic processes for the first time, where the mode is identified by the monitoring statistic and expert experience.

2. The previously learned knowledge is retained while consolidating new information, which may aid the learning of future relevant modes. Thus, PSFA-EWC furnishes the forward and backward transfer ability.

3. Within the probabilistic framework, PSFA-EWC could provide excellent interpretability, and can deal with missing data, measurement noise and uncertainty.

The rest of this paper is organized as follows. Section II reviews PSFA and EWC succinctly and outlines the basic idea of our proposed approach. The technical core of PSFA-EWC is detailed in Section III. The monitoring procedure and comparative experiments are designed in Section IV. The effectiveness of PSFA-EWC is illustrated by a continuous stirred tank heater (CSTH) and a practical coal pulverizing system in Section V. The conclusion is given in Section VI.

## II Preliminaries

For ease of exposition, we start by introducing PSFA for a single mode, since it serves as the basic ingredient of our proposed multimode PSFA. Then the basic idea of EWC, as well as how to extend EWC to multimode PSFA, is outlined.

### II-A PSFA for a single mode

In the probabilistic framework of SFA, the objective is to identify the slowest varying latent features $\mathbf{y}_t$ from a sequence of time-varying observations $\mathbf{x}_t\in\mathbb{R}^m$, $t=1,\dots,T$, which can be represented/generated via a state-space model [33] with a first-order Markov chain architecture [25, 20]:

$$
\begin{aligned}
\mathbf{x}_t &= V\mathbf{y}_t+\mathbf{e}_t, &\quad \mathbf{e}_t&\sim\mathcal{N}(0,\Sigma_x)\\
\mathbf{y}_t &= \Lambda\mathbf{y}_{t-1}+\mathbf{w}_t, &\quad \mathbf{w}_t&\sim\mathcal{N}(0,\Sigma)\\
\mathbf{y}_1 &= \mathbf{u}, &\quad \mathbf{u}&\sim\mathcal{N}(0,\Sigma_1)
\end{aligned}
\tag{1}
$$

where the low-dimensional latent variable $\mathbf{y}_t\in\mathbb{R}^p$, $p\le m$. $\Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_p)$ with $0<\lambda_i<1$, and the constraint $\Sigma=I-\Lambda^{2}$ ensures that the stationary covariance matrix of $\mathbf{y}_t$ is the unit matrix $I$. The emission matrix is $V\in\mathbb{R}^{m\times p}$ and the measurement noise covariance is $\Sigma_x=\mathrm{diag}(\sigma_1^{2},\dots,\sigma_m^{2})$.

For a single mode, the observed data and latent slow feature sequences are denoted as $X_s=\{\mathbf{x}_1,\dots,\mathbf{x}_T\}$ and $Y_s=\{\mathbf{y}_1,\dots,\mathbf{y}_T\}$, respectively, where $T$ is the number of samples; the estimation of the latent dimension $p$ has been discussed in [6].
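To make the generative model concrete, the following sketch simulates a single-mode PSFA process. The dimensions (m = 4, p = 2), sample count and all parameter values are hypothetical, and Σ₁ = I is assumed; the point is that the constraint Σ = I − Λ² keeps each slow feature at roughly unit stationary variance.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, T = 4, 2, 20000              # hypothetical dimensions and sample count

Lam = np.diag([0.98, 0.90])        # 0 < lambda_i < 1: slowly varying features
Sig = np.eye(p) - Lam @ Lam        # Sigma = I - Lambda^2 -> unit stationary variance
V = rng.standard_normal((m, p))    # emission matrix
Sig_x = np.diag(rng.uniform(0.05, 0.1, m))  # diagonal measurement-noise covariance

Y = np.zeros((T, p))
Y[0] = rng.multivariate_normal(np.zeros(p), np.eye(p))  # y_1 ~ N(0, I) assumed
for t in range(1, T):
    Y[t] = Lam @ Y[t - 1] + rng.multivariate_normal(np.zeros(p), Sig)
X = Y @ V.T + rng.multivariate_normal(np.zeros(m), Sig_x, size=T)

# each slow feature should have roughly unit empirical variance
print(np.round(Y.var(axis=0), 1))
```

The larger λ₁ = 0.98 produces the slower feature: its trajectory decorrelates over hundreds of samples, while the observations mix both features through V.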

The joint distribution is given as [33]

$$
P(X_s,Y_s)=P(\mathbf{y}_1)\prod_{t=2}^{T}P(\mathbf{y}_t\,|\,\mathbf{y}_{t-1})\prod_{t=1}^{T}P(\mathbf{x}_t\,|\,\mathbf{y}_t)
\tag{2}
$$

Let $\theta=\{V,\Sigma_x,\Lambda,\Sigma_1\}$ and $\theta_x=\{V,\Sigma_x\}$. The objective of PSFA is to estimate the parameters $\theta$ by maximizing the complete log-likelihood function:

$$
\log P(X_s,Y_s\,|\,\theta)=\sum_{t=1}^{T}\log P(\mathbf{x}_t\,|\,\mathbf{y}_t,\theta_x)+\log P(\mathbf{y}_1\,|\,\Sigma_1)+\sum_{t=2}^{T}\log P(\mathbf{y}_t\,|\,\mathbf{y}_{t-1},\Lambda)
\tag{3}
$$

According to (1), (3) is reformulated as

$$
\begin{aligned}
\log P(X_s,Y_s\,|\,\theta)=-\frac{1}{2}\Big\{&(m+p)T\log 2\pi+(T-1)\log|\Sigma|+\mathbf{y}_1^{T}\Sigma_1^{-1}\mathbf{y}_1\\
&+T\log|\Sigma_x|+\sum_{t=1}^{T}(\mathbf{x}_t-V\mathbf{y}_t)^{T}\Sigma_x^{-1}(\mathbf{x}_t-V\mathbf{y}_t)\\
&+\log|\Sigma_1|+\sum_{t=2}^{T}(\mathbf{y}_t-\Lambda\mathbf{y}_{t-1})^{T}\Sigma^{-1}(\mathbf{y}_t-\Lambda\mathbf{y}_{t-1})\Big\}
\end{aligned}
\tag{4}
$$

The optimal parameter $\theta$ is obtained by maximizing (4) using the expectation maximization (EM) algorithm [3].

### II-B Elastic weight consolidation for multimode PSFA

In the following, we explain the basic idea of extending EWC to multimode PSFA processes and then summarize the key objectives of the proposed PSFA-EWC. Consider, still based on the PSFA model (1), the multimode scenario where the data stream is generated as new modes $M_1,M_2,\dots$ arrive, one at a time. For each mode $M_i$, normal data $X_i$ are collected, where $T_i$ is the number of samples. Correspondingly, it is assumed that slow features $Y_i$ need to be extracted from the $i$th mode. Denote the total observed data and the associated latent slow features as $X$ and $Y$, respectively.

EWC initially considers the use of the Bayesian rule for the sequential learning process, in which the most probable parameters are found by maximizing the conditional probability [11]

$$
\log P(\theta\,|\,X,Y)=\log P(X,Y\,|\,\theta)+\log P(\theta)-\log P(X,Y)
\tag{5}
$$

where $P(\theta)$ is the prior probability and $P(X,Y)$ is the data probability. For illustration, only the first two successive independent modes $M_1$ and $M_2$ are initially considered. Then, (5) can be reformulated as [11]:

$$
\log P(\theta\,|\,X,Y)=\log P(X_2,Y_2\,|\,\theta)+\log P(\theta\,|\,M_1)+\text{don't-care terms}
\tag{6}
$$

where $P(\theta\,|\,M_1)$ is the posterior probability of the parameter given the entire dataset of mode $M_1$, and $\log P(X_2,Y_2\,|\,\theta)$ represents the loss function for mode $M_2$. The posterior distribution reflects all information of mode $M_1$ [11, 35]. This equation captures the key idea of EWC in the continual learning framework: the system parameters are updated based on a composite cost function that depends both on the current parameters learned from previous data and on the new incoming data, with the posterior distribution acting as a constraint in the future objective so that the learned knowledge is not forgotten.

Note that this is the first time EWC has been extended to PSFA for monitoring; the new optimization procedures of the proposed PSFA-EWC algorithm are introduced in Section III and illustrated in Fig. 1 for three modes. The multimode slow features for each mode are extracted, while the model parameter is continually updated using only the data of the new mode, maintaining the performance on all old modes. The black, blue and red circles represent the optimal parameter regions in which the log-likelihoods for modes $M_1$, $M_2$ and $M_3$ are maximized, respectively. This process can be generalized to $K$ modes, with

$$
\log P(\theta\,|\,X,Y)=\log P(X_K,Y_K\,|\,\theta)+\log P(\theta\,|\,M_{i=1}^{K-1})+\text{don't-care terms}
\tag{7}
$$

where $M_{i=1}^{K-1}$ denotes the set of all previously learned modes $\{M_1,\dots,M_{K-1}\}$.

Note that the first term in (7) is the complete likelihood for the $K$th mode. The second term in (7) reflects the information from all previous modes, and can thus be interpreted as the log prior probability of the parameter for the $K$th mode. Since it is assumed that data from all previous modes are no longer accessible, this term cannot be obtained exactly and is instead found by recursive approximation, as detailed in Section III-A.

## III The proposed PSFA-EWC algorithm

### III-A Recursive Laplacian approximation of $P(\theta\,|\,M_{i=1}^{K-1})$

Consider the multimode PSFA process where data are collected sequentially with mode index $K$. For the sake of notational simplicity, it is assumed in the sequel that the data $X_K$ and corresponding slow features $Y_K$ start from $t=1$ at the beginning and end at $t=T_K$ of the $K$th mode. The proposed PSFA-EWC algorithm starts by solving an initial single-mode model for $K=1$. It is initially assumed that an optimal parameter, denoted as $\theta^{*}_{M_1}$, has been obtained from the first mode by solving (4). For later modes ($K\ge 2$), the monitoring model is updated recursively based on the data from the $K$th mode and the monitoring model obtained before mode $M_K$, where EM is employed [3] to solve the optimization problem of maximizing $J(\theta)$ in Section III-B.

Specifically, consider initially the case of two modes; then the term $\log P(\theta\,|\,M_1)$ in (7) is approximated by the Laplace approximation [35, 11] as

$$
\log P(\theta\,|\,M_1)\approx-\frac{1}{2}(\theta-\theta^{*}_{M_1})^{T}\big(T_1F(\theta^{*}_{M_1})+\lambda_{\mathrm{prior}}I\big)(\theta-\theta^{*}_{M_1})+\text{constant}
$$

where $F(\theta^{*}_{M_1})$ is the Fisher information matrix (FIM), computed by (26) in Appendix -A, and $\lambda_{\mathrm{prior}}I$ is the Gaussian prior precision matrix for mode $M_1$. However, the sample size $T_1$ would have a non-negligible influence on the approximation. To ensure the approximation quality, a mode-specific hyperparameter $\eta_1$ is introduced to replace $T_1$ [9], namely,

$$
\log P(\theta\,|\,M_1)=-\frac{1}{2}(\theta-\theta^{*}_{M_1})^{T}\Omega_{M_1}(\theta-\theta^{*}_{M_1})+\text{constant}
$$

where $\Omega_{M_1}=\eta_1F(\theta^{*}_{M_1})+\lambda_{\mathrm{prior}}I$.

When the $K$th mode arrives ($K\ge 3$), we approximate $\log P(\theta\,|\,M_{i=1}^{K-1})$ by the recursive Laplace approximation [35] as

$$
\log P(\theta\,|\,M_{i=1}^{K-1})\approx-\frac{1}{2}(\theta-\theta^{*}_{M_{K-1}})^{T}\Omega_{M_{K-1}}(\theta-\theta^{*}_{M_{K-1}})+\text{constant}
$$

where

$$
\Omega_{M_{K-1}}=\Omega_{M_{K-2}}+\eta_{K-1}F_{M_{K-1}},\quad K\ge 3
\tag{8}
$$

Note that $F_{M_{K-1}}$ is the FIM of mode $M_{K-1}$, and $\eta_{K-1}$ is the corresponding hyperparameter. Thus $\log P(\theta\,|\,M_{i=1}^{K-1})$ is approximated by a quadratic term centered at the current optimum, with the weighting $\Omega_{M_{K-1}}$ acting as an importance measure that aggregates the information from all previous modes.

Specifically, when the $K$th mode has been learned (see the PSFA-EWC algorithm in Section III-B), the importance measures specific to PSFA are updated and made ready for the $(K+1)$th mode:

$$
\Omega^{V}_{M_K}=\Omega^{V}_{M_{K-1}}+\eta^{V}_{K}F^{V}_{M_K}
\tag{9}
$$

$$
\Omega^{\Lambda}_{M_K}=\Omega^{\Lambda}_{M_{K-1}}+\eta^{\Lambda}_{K}F^{\Lambda}_{M_K}
\tag{10}
$$

where $F^{V}_{M_K}$ and $F^{\Lambda}_{M_K}$ are FIMs calculated in Appendix -A, and $\eta^{V}_{K}$ and $\eta^{\Lambda}_{K}$ are mode-specific hyperparameters. Similar to [11], $\eta^{V}_{K}$ and $\eta^{\Lambda}_{K}$ are optimized by hyperparameter search and fine-tuned by prior knowledge, which may play an important role in accurately estimating the probability over sequential modes. In effect, this redistributes the parameter importance away from equal weighting, since previous modes would otherwise be counted more heavily than recent ones in the recursion. Large values of $\eta^{V}_{K}$ and $\eta^{\Lambda}_{K}$ indicate that the $K$th mode is significant and the weight of previous modes decreases. Small values mean that we expect to preserve the performance on previous modes by sacrificing the performance on the current mode; in this case, it has been concluded from prior knowledge that the current mode may be unimportant and should be forgotten gracefully.
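The recursive updates (8)-(10) amount to a running, hyperparameter-weighted accumulation of Fisher information on top of the Gaussian prior precision. A minimal sketch follows; the FIMs here are random stand-in positive semi-definite matrices, not the PSFA-specific FIMs of Appendix -A.

```python
import numpy as np

def update_importance(omega_prev, fim_new, eta):
    """Recursive importance update of (8)-(10): Omega_K = Omega_{K-1} + eta_K * F_{M_K}."""
    return omega_prev + eta * fim_new

d = 3                                   # hypothetical parameter dimension
omega = 0.1 * np.eye(d)                 # lambda_prior * I, the Gaussian prior precision
for eta, seed in [(1.0, 1), (0.5, 2)]:  # two modes with mode-specific eta_K
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, d))
    fim = A @ A.T / d                   # random stand-in positive semi-definite FIM
    omega = update_importance(omega, fim, eta)

# omega now aggregates the importance of both learned modes
print(omega.shape)
```

The second mode's smaller η = 0.5 down-weights its FIM relative to the first, matching the role of the mode-specific hyperparameters described above.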

### III-B PSFA-EWC algorithm

Consider the objective of PSFA-EWC of maximizing

$$
J(\theta)=\log P(X_K,Y_K\,|\,\theta)+\log P(\theta\,|\,M_{i=1}^{K-1})
\tag{11}
$$

subject to the PSFA model (1). Recalling (3), the log-likelihood function for the current mode is represented by

$$
\log P(X_K,Y_K\,|\,\theta)=\sum_{t=1}^{T_K}\log P(\mathbf{x}_t\,|\,\mathbf{y}_t,\theta_x)+\log P(\mathbf{y}_1\,|\,\Sigma_1)+\sum_{t=2}^{T_K}\log P(\mathbf{y}_t\,|\,\mathbf{y}_{t-1},\Lambda)
\tag{12}
$$

The regularization term is designed as

$$
\log P(\theta\,|\,M_{i=1}^{K-1})\approx-\gamma_{1,K}\big\|V-V_{M_{K-1}}\big\|^{2}_{\Omega^{V}_{M_{K-1}}}-\gamma_{2,K}\sum_{i=1}^{p}\Omega^{\lambda}_{M_{K-1},i}\big(\lambda_i-\lambda_{M_{K-1},i}\big)^{2}
\tag{13}
$$

where $\Omega^{V}_{M_{K-1}}$ and $\Omega^{\lambda}_{M_{K-1}}$ measure the importance of $V$ and $\Lambda$, respectively. $\lambda_{M_{K-1},i}$ and $\Omega^{\lambda}_{M_{K-1},i}$ are the $i$th diagonal elements of $\Lambda_{M_{K-1}}$ and $\Omega^{\lambda}_{M_{K-1}}$, which correspond to the optimal parameters of the last mode $M_{K-1}$. $\gamma_{1,K}$ and $\gamma_{2,K}$ are user-defined hyperparameters, and tuning them makes it flexible to adjust the weights of the previous modes. We then distinguish the roles of $\gamma_{1,K}$ and $\gamma_{2,K}$ from those of $\eta^{V}_{K}$ and $\eta^{\Lambda}_{K}$. Combined with the importance of the current mode, $\eta^{V}_{K}$ and $\eta^{\Lambda}_{K}$ serve to reassign the importance over all previous modes, whereas $\gamma_{1,K}$ and $\gamma_{2,K}$ focus on the importance of mode $M_{K-1}$, which allows users to obtain models with more focus on a particular mode. Through a reasonable setting of these four hyperparameters, human-level performance may be obtained.

For the proposed PSFA-EWC, the total objective function for $K$ modes can be formally described by

$$
\begin{aligned}
J(\theta)=&\sum_{t=1}^{T_K}\log P(\mathbf{x}_t\,|\,\mathbf{y}_t,\theta_x)+\sum_{t=2}^{T_K}\log P(\mathbf{y}_t\,|\,\mathbf{y}_{t-1},\Lambda)+\log P(\mathbf{y}_1\,|\,\Sigma_1)\\
&-\gamma_{1,K}\big\|V-V_{M_{K-1}}\big\|^{2}_{\Omega^{V}_{M_{K-1}}}-\gamma_{2,K}\sum_{i=1}^{p}\Omega^{\lambda}_{M_{K-1},i}\big(\lambda_i-\lambda_{M_{K-1},i}\big)^{2}
\end{aligned}
\tag{14}
$$

subject to the PSFA model (1). Note that for $K\ge 2$, since the quadratic penalty is added, it slows down the changes to the parameters with respect to the previously optimal values obtained in the learned modes [15, 26]. In other words, parameter updates that would significantly deteriorate the performance on previous modes are penalized, avoiding the catastrophic forgetting problem.
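The penalty can be evaluated directly once the previous optimum and importance matrices are available. The sketch below assumes one plausible reading of the weighted norm, namely tr((V − V_prev)ᵀ Ω (V − V_prev)); all matrices and values are hypothetical.

```python
import numpy as np

def ewc_penalty(V, V_prev, Omega_V, lam, lam_prev, Omega_lam, g1, g2):
    """EWC penalty of (13): weighted quadratic distances of V and the
    diagonal of Lambda from the previous mode's optimum."""
    dV = V - V_prev
    # one plausible reading of ||V - V_prev||^2_{Omega_V}: tr(dV^T Omega_V dV)
    term_V = g1 * np.trace(dV.T @ Omega_V @ dV)
    term_lam = g2 * np.sum(Omega_lam * (lam - lam_prev) ** 2)
    return term_V + term_lam

rng = np.random.default_rng(3)
m, p = 4, 2
V_prev = rng.standard_normal((m, p))   # previous-mode emission matrix (hypothetical)
Omega_V = np.eye(m)                    # importance of V (hypothetical)
lam_prev = np.array([0.95, 0.85])      # previous-mode diagonal of Lambda
Omega_lam = np.array([2.0, 1.0])       # importance of each lambda_i

# zero penalty at the previous optimum, positive penalty away from it
print(ewc_penalty(V_prev, V_prev, Omega_V, lam_prev, lam_prev, Omega_lam, 1.0, 1.0))
print(ewc_penalty(V_prev + 0.1, V_prev, Omega_V, lam_prev, lam_prev, Omega_lam, 1.0, 1.0) > 0.0)
```

The penalty vanishes exactly at the previous optimum and grows quadratically with parameter drift, which is what slows down changes to parameters that were important for earlier modes.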

The EM algorithm [3] is employed to optimize the parameter $\theta$. Note that when $K=1$, $\gamma_{1,K}=0$ and $\gamma_{2,K}=0$, so there is no need to provide $V_{M_0}$ and $\Lambda_{M_0}$. This means that the proposed PSFA-EWC algorithm has a unified formulation: it learns sequentially as a single-mode model based on the $K$th mode data only, with the current parameters entering a quadratic penalty that is updated via the recursive Laplacian approximation between modes, as in Section III-A.

#### III-B1 E-step

Assuming that $\theta$ is available, the E-step estimates three sufficient statistics, namely $\mathrm{E}[\mathbf{y}_t\,|\,X_K]$, $\mathrm{E}[\mathbf{y}_t\mathbf{y}_t^{T}\,|\,X_K]$ and $\mathrm{E}[\mathbf{y}_t\mathbf{y}_{t-1}^{T}\,|\,X_K]$. Similar to [20, 33], the Kalman filter and the Rauch-Tung-Striebel (RTS) smoother [19] are adopted, which comprise forward and backward recursion steps.

First, the forward recursions estimate the posterior distribution sequentially: the posterior marginal $P(\mathbf{y}_t\,|\,\mathbf{x}_1,\dots,\mathbf{x}_t)\sim\mathcal{N}(\mu_t,P_t)$ is calculated recursively, where $P_t$ is the variance.

Then, the parameters of the posterior distribution are obtained by the backward recursion steps. The procedure is summarized in Algorithm 1.
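The forward-backward pass is the standard Kalman filter / RTS smoother for the model (1). The sketch below assumes known parameters, Σ₁ = I, and the transition-noise convention Σ = I − Λ²; it returns the smoothed means and covariances from which the sufficient statistics follow.

```python
import numpy as np

def e_step(X, Lam, V, Sig_x):
    """Kalman filter + RTS smoother for x_t = V y_t + e_t, y_t = Lam y_{t-1} + w_t.
    Returns smoothed means mu_s and covariances P_s."""
    T, _ = X.shape
    p = Lam.shape[0]
    Sig = np.eye(p) - Lam @ Lam                      # transition-noise covariance
    mu_f = np.zeros((T, p)); P_f = np.zeros((T, p, p))
    mu_pr = np.zeros((T, p)); P_pr = np.zeros((T, p, p))
    # forward recursion (filtering)
    for t in range(T):
        if t == 0:
            mu_pr[t], P_pr[t] = np.zeros(p), np.eye(p)   # y_1 ~ N(0, I) assumed
        else:
            mu_pr[t] = Lam @ mu_f[t - 1]
            P_pr[t] = Lam @ P_f[t - 1] @ Lam.T + Sig
        S = V @ P_pr[t] @ V.T + Sig_x                # innovation covariance
        K = P_pr[t] @ V.T @ np.linalg.inv(S)         # Kalman gain
        mu_f[t] = mu_pr[t] + K @ (X[t] - V @ mu_pr[t])
        P_f[t] = (np.eye(p) - K @ V) @ P_pr[t]
    # backward recursion (RTS smoothing)
    mu_s, P_s = mu_f.copy(), P_f.copy()
    for t in range(T - 2, -1, -1):
        J = P_f[t] @ Lam.T @ np.linalg.inv(P_pr[t + 1])
        mu_s[t] = mu_f[t] + J @ (mu_s[t + 1] - mu_pr[t + 1])
        P_s[t] = P_f[t] + J @ (P_s[t + 1] - P_pr[t + 1]) @ J.T
    return mu_s, P_s

# usage on a short simulated sequence with hypothetical parameters
rng = np.random.default_rng(1)
Lam = np.diag([0.95, 0.8])
V = rng.standard_normal((3, 2))
Sig_x = 0.05 * np.eye(3)
T = 300
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = Lam @ Y[t - 1] + rng.multivariate_normal(np.zeros(2), np.eye(2) - Lam @ Lam)
X = Y @ V.T + rng.multivariate_normal(np.zeros(3), Sig_x, size=T)
mu_s, P_s = e_step(X, Lam, V, Sig_x)
```

From `mu_s` and `P_s`, the statistics follow as E[y_t|X] = mu_s[t] and E[y_t y_tᵀ|X] = P_s[t] + mu_s[t] mu_s[t]ᵀ; the cross term additionally needs the smoother gains J.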

#### III-B2 M-step

In the M-step, the three sufficient statistics are held fixed and the parameters are updated alternately.

Since $V$ and $\Sigma_x$ are contained in $\log P(X_K,Y_K\,|\,\theta)$ and the regularization term (13),

$$
\{V^{\mathrm{new}},\Sigma_x^{\mathrm{new}}\}=\arg\max_{V,\Sigma_x}J(V,\Sigma_x)
\tag{15}
$$

Setting the derivative with respect to $V$ to zero gives

$$
\sum_{t=1}^{T_K}\mathbf{x}_t\mathrm{E}[\mathbf{y}_t^{T}\,|\,X_K]+\gamma_{1,K}\Sigma_x\Omega^{V}_{M_{K-1}}V_{M_{K-1}}=V\sum_{t=1}^{T_K}\mathrm{E}[\mathbf{y}_t\mathbf{y}_t^{T}\,|\,X_K]+\gamma_{1,K}\Sigma_x\Omega^{V}_{M_{K-1}}V
\tag{16}
$$

This is in fact a Sylvester equation, and its solution is denoted as $V^{\mathrm{new}}$.
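Equation (16) can be rearranged into the Sylvester form A V + V B = C, with A = γ₁,K Σx Ω^V, B the summed second moment of the slow features, and C the remaining terms, so a standard solver applies. A sketch with hypothetical small matrices:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(4)
m, p = 4, 2                                # hypothetical dimensions
gamma1 = 0.5                               # gamma_{1,K}
Sig_x = np.diag(rng.uniform(0.1, 0.5, m))  # measurement-noise covariance
Omega_V = np.eye(m)                        # importance of V from previous modes
G = rng.standard_normal((p, 2 * p))
Syy = G @ G.T                              # stand-in for sum_t E[y_t y_t^T | X_K]
Sxy = rng.standard_normal((m, p))          # stand-in for sum_t x_t E[y_t^T | X_K]
V_prev = rng.standard_normal((m, p))       # previous-mode emission matrix

# (16) rearranged as A V + V B = C
A = gamma1 * Sig_x @ Omega_V
B = Syy
C = Sxy + gamma1 * Sig_x @ Omega_V @ V_prev
V_new = solve_sylvester(A, B, C)

print(np.allclose(A @ V_new + V_new @ B, C))  # residual of (16) vanishes
```

Since A is positive definite and B is positive semi-definite here, their spectra cannot cancel and the Sylvester equation has a unique solution.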

For $\Sigma_x$, setting the derivative to zero gives

$$
(\sigma_i^{2})^{\mathrm{new}}=\frac{1}{T_K}\sum_{t=1}^{T_K}\Big\{\mathrm{E}[x_{i,t}^{2}]-2(\mathbf{v}_{\cdot i}^{T})^{\mathrm{new}}\mathrm{E}[\mathbf{y}_t\,|\,X_K]x_{i,t}+(\mathbf{v}_{\cdot i}^{T})^{\mathrm{new}}\mathrm{E}[\mathbf{y}_t\mathbf{y}_t^{T}\,|\,X_K](\mathbf{v}_{\cdot i})^{\mathrm{new}}\Big\}
\tag{17}
$$

where $(\mathbf{v}_{\cdot i}^{T})^{\mathrm{new}}$ is the $i$th row of the matrix $V^{\mathrm{new}}$, $i=1,\dots,m$, and $\Sigma_x^{\mathrm{new}}=\mathrm{diag}\big((\sigma_1^{2})^{\mathrm{new}},\dots,(\sigma_m^{2})^{\mathrm{new}}\big)$.

With regard to $\Sigma_1$, it is only contained in $\log P(\mathbf{y}_1\,|\,\Sigma_1)$, thus

$$
\Sigma_1^{\mathrm{new}}=\arg\max_{\Sigma_1}\mathrm{E}[\log P(\mathbf{y}_1\,|\,\Sigma_1)]=\mathrm{E}[\mathbf{y}_1\mathbf{y}_1^{T}\,|\,X_K]
\tag{18}
$$

For $\Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_p)$, $\Lambda$ is contained in both $\log P(X_K,Y_K\,|\,\theta)$ and the regularization term (13), thus

$$
\Lambda^{\mathrm{new}}=\arg\max_{\Lambda}J(\Lambda)
\tag{19}
$$

where $J(\Lambda)$ collects the $\Lambda$-dependent terms of (14). Setting the derivative with respect to $\lambda_i$ to zero, we derive the following equation

$$
a_{i5}\lambda_i^{5}+a_{i4}\lambda_i^{4}+a_{i3}\lambda_i^{3}+a_{i2}\lambda_i^{2}+a_{i1}\lambda_i+a_{i0}=0
\tag{20}
$$

where the coefficients $a_{i0},\dots,a_{i5}$ are derived from the $\Lambda$-dependent terms of (14).

Thus, the updated $\lambda_i$ can be calculated numerically as the root of (20) within the range $(0,1)$, and $\Lambda^{\mathrm{new}}=\mathrm{diag}\big(\lambda_1^{\mathrm{new}},\dots,\lambda_p^{\mathrm{new}}\big)$.
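Each λᵢ update then reduces to finding the admissible root of the quintic (20). A sketch with made-up coefficients constructed so that the only root in (0, 1) is 0.9; if several real roots fell in the interval, the one maximizing (19) should be kept, whereas this sketch simply returns the first.

```python
import numpy as np

def update_lambda(coeffs):
    """Real root of a5 x^5 + ... + a0 = 0 inside (0, 1); coeffs ordered
    highest degree first, as numpy.roots expects."""
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-9].real
    candidates = real[(real > 0.0) & (real < 1.0)]
    if candidates.size == 0:
        raise ValueError("no admissible root in (0, 1)")
    return float(candidates[0])

# made-up quintic with roots {0.9, -2, 5, +i, -i}: only 0.9 lies in (0, 1)
coeffs = np.poly([0.9, -2.0, 5.0, 1j, -1j]).real
lam_new = update_lambda(coeffs)
print(round(lam_new, 6))
```

Restricting the root to (0, 1) matches the slow-feature constraint 0 < λᵢ < 1 from the model (1).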

The learning procedure of PSFA-EWC is summarized in Algorithm 2. The learned transition and emission matrices are denoted as $\Lambda_{M_K}$ and $V_{M_K}$, respectively. Since the noise information about $\Sigma_x$ and $\Sigma_1$ is only effective for the current mode, the mode subscript is omitted for them. After the $K$th mode has been learned, the parameter importance measures should be updated by (9)-(10).

## IV Monitoring procedure and experiment design

Analogous to traditional PSFA [6], three monitoring statistics are designed to provide a comprehensive picture of the operating status. Then, several representative methods are adopted as comparisons to illustrate the advantages of the PSFA-EWC algorithm.

### IV-A Monitoring procedure

In this paper, Hotelling's $T^2$ and the SPE statistics are used to reflect the steady-state variations, and the $S^2$ statistic is calculated to evaluate the temporal dynamics [6].

According to the Kalman filter equation,

$$
\mathbf{y}_t=\Lambda_{M_K}\mathbf{y}_{t-1}+K\big[\mathbf{x}_t-V_{M_K}\Lambda_{M_K}\mathbf{y}_{t-1}\big]
\tag{21}
$$

When $t\to\infty$, the Kalman gain $K$ converges to a steady matrix and is stable after the training phase. Then, the $T^2$ statistic is defined as

$$
T^2=\mathbf{y}_t^{T}\mathbf{y}_t
\tag{22}
$$

To design the SPE statistic, we calculate the bias between the true value and the one-step prediction at instant $t$. At instant $t-1$, the inferred slow features follow a Gaussian distribution, namely,

$$
P(\mathbf{y}_{t-1}\,|\,\mathbf{x}_1,\dots,\mathbf{x}_{t-1})\sim\mathcal{N}(\mu_{t-1},P_{t-1})
$$

Then, the conditional distribution of $\mathbf{y}_t$ is described as

$$
P(\mathbf{y}_t\,|\,\mathbf{x}_1,\dots,\mathbf{x}_{t-1})\sim\mathcal{N}\big(\Lambda_{M_K}\mu_{t-1},\,\Lambda_{M_K}P_{t-1}\Lambda_{M_K}^{T}+\Sigma\big)
$$

Similarly,

$$
P(\mathbf{x}_t\,|\,\mathbf{x}_1,\dots,\mathbf{x}_{t-1})\sim\mathcal{N}\big(V_{M_K}\Lambda_{M_K}\mu_{t-1},\,\Phi_t\big)
$$

where $\Phi_t=V_{M_K}\big(\Lambda_{M_K}P_{t-1}\Lambda_{M_K}^{T}+\Sigma\big)V_{M_K}^{T}+\Sigma_x$. The prediction error follows a Gaussian distribution, namely,

$$
\varepsilon_t=\mathbf{x}_t-V_{M_K}\Lambda_{M_K}\mu_{t-1}\sim\mathcal{N}(0,\Phi_t)
\tag{23}
$$

When $t\to\infty$, $\Phi_t$ converges to a steady matrix $\Phi$. SPE is calculated by

$$
\mathrm{SPE}=\varepsilon_t^{T}\Phi^{-1}\varepsilon_t
\tag{24}
$$

The $S^2$ statistic is designed to reflect the temporal dynamics, which is beneficial for distinguishing operating variations from dynamic anomalies [21, 6]:

$$
S^2=\dot{\mathbf{y}}_t^{T}\Xi^{-1}\dot{\mathbf{y}}_t
\tag{25}
$$

where $\dot{\mathbf{y}}_t$ denotes the velocity of the slow features and $\Xi$ is its covariance matrix, analytically calculated as in [6].
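Once the steady Kalman gain and the steady precision matrices of Φ and Ξ are available, the three statistics are simple quadratic forms per sample. A sketch with stand-in matrices; the velocity is taken here as the first-order difference of the latent estimate, which is an assumption of this sketch.

```python
import numpy as np

def monitoring_stats(x_t, mu_prev, Lam, V, K, Phi_inv, Xi_inv):
    """T^2, SPE and S^2 of (22)-(25) for one new sample x_t."""
    y_pred = Lam @ mu_prev                 # one-step latent prediction
    y_t = y_pred + K @ (x_t - V @ y_pred)  # Kalman update, cf. (21)
    T2 = float(y_t @ y_t)                  # T^2 = y_t^T y_t
    eps = x_t - V @ y_pred                 # prediction error of (23)
    SPE = float(eps @ Phi_inv @ eps)
    y_dot = y_t - mu_prev                  # first-order difference as velocity (assumed)
    S2 = float(y_dot @ Xi_inv @ y_dot)
    return T2, SPE, S2

rng = np.random.default_rng(5)
p, m = 2, 3                                # hypothetical dimensions
Lam = np.diag([0.95, 0.8])
V = rng.standard_normal((m, p))
K = 0.1 * rng.standard_normal((p, m))      # stand-in steady Kalman gain
Phi_inv, Xi_inv = np.eye(m), np.eye(p)     # stand-in steady precision matrices
x_t = rng.standard_normal(m)
mu_prev = rng.standard_normal(p)

T2, SPE, S2 = monitoring_stats(x_t, mu_prev, Lam, V, K, Phi_inv, Xi_inv)
print(T2 >= 0.0 and SPE >= 0.0 and S2 >= 0.0)
```

Since all three are quadratic forms with positive semi-definite weights, the statistics are nonnegative by construction.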

The thresholds of the three statistics are calculated by kernel density estimation (KDE) [34] and serve as the corresponding control limits. The monitoring rule is summarized below:

1. If all three statistics are within their thresholds, the process is normal;

2. If $T^2$ or SPE is over its threshold while $S^2$ is below its threshold, the dynamic law remains unchanged and static variations occur. This may be caused by step faults or drifts [21, 32]. In this case, we need to further confirm whether the system is actually abnormal based on the data trend and expert experience. When a new mode occurs, a set of new data is collected to update the PSFA-EWC model in Algorithm 2. The process is monitored by the $S^2$ statistic in this period;

3. If $S^2$ is over its threshold, the dynamic behaviors are unusual and the system is out of control. A fault has occurred and the alarm is triggered.

The off-line training procedure and online monitoring phase have been summarized in Algorithm 2 and Algorithm 3, respectively. Fault detection rates (FDRs) and false alarm rates (FARs) are adopted to evaluate the performance.
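Both the KDE-based control limits and the FDR/FAR indices admit compact implementations. The sketch below estimates a 99% control limit (the confidence level is a hypothetical choice) from stand-in normal-operation statistic values and then scores a hypothetical test sequence whose second half is faulty.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_threshold(stat_values, alpha=0.99, grid_size=2000):
    """Control limit where the KDE-estimated CDF of the statistic reaches alpha."""
    kde = gaussian_kde(stat_values)
    grid = np.linspace(stat_values.min(), stat_values.max() * 1.5, grid_size)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]
    return grid[np.searchsorted(cdf, alpha)]

def fdr_far(alarms, faulty):
    """Fault detection rate and false alarm rate from boolean alarm/label arrays."""
    alarms, faulty = np.asarray(alarms, bool), np.asarray(faulty, bool)
    return alarms[faulty].mean(), alarms[~faulty].mean()

rng = np.random.default_rng(7)
t2_train = rng.chisquare(df=3, size=5000)        # stand-in normal-operation T^2 values
limit = kde_threshold(t2_train, alpha=0.99)

# hypothetical test run: 500 normal samples, then a fault shifting the statistic
t2_test = np.concatenate([rng.chisquare(3, 500), rng.chisquare(3, 500) + 15.0])
faulty = np.arange(1000) >= 500                  # fault from sample 501 onward
fdr, far = fdr_far(t2_test > limit, faulty)
print(fdr > 0.9, far < 0.05)
```

With the limit set at the 99% point of the training distribution, the FAR stays near 1% on normal data while the shifted fault segment is detected almost everywhere.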

### IV-B Comparative design

In this paper, RSFA [22], PCA-EWC [35], IMPPCA [34] and MCVA [27] are selected as the comparative methods in Table I. PSFA-EWC, RSFA and PCA-EWC can be regarded as adaptive methods, which avoid storing historical data and thus alleviate storage requirements. IMPPCA and MCVA belong to multiple-model approaches, where the mode is identified and local models are built within each mode.

For Situations 1-11, PSFA and PSFA-EWC are compared to illustrate the catastrophic forgetting issue of PSFA and the continual learning ability of PSFA-EWC for successive dynamic modes. When a new mode is identified by the $S^2$ statistic and expert experience, a set of normal data is collected and the model is then updated off-line by consolidating new information while retaining the learned knowledge. PSFA-EWC furnishes backward and forward transfer ability: the updated model is able to monitor the previous modes, and the learned knowledge is valuable for learning new relevant modes. Accordingly, the simulation results of Situations 2, 3 and 6-8 should be excellent. Conversely, the results of Situations 5, 10 and 11 are expected to be poor, thus reflecting the catastrophic forgetting issue. The RSFA monitoring model is updated in real time and expected to track the system adaptively, as illustrated by Situations 12-14. For Situations 15-20, the design process of PCA-EWC is similar to that of PSFA-EWC; PCA-EWC is expected to provide continual learning ability comparable to PSFA-EWC.

For IMPPCA and MCVA, data from all possible modes are required and stored before learning. When a novel mode arrives, sufficient samples must be collected and the model retrained on the entire dataset; for example, when the third mode appears, the model is relearned based on data from all three modes. The model can deliver optimal monitoring results for the learned modes. Intuitively, IMPPCA and MCVA should provide outstanding performance for Situations 21-30. However, it is intractable and time-consuming to collect all mode data in practical systems [18]. Besides, the computational resources required for each retraining grow with the increasing number of modes.

## V Case studies

### V-A CSTH case

The CSTH process is a nonlinear dynamic process widely utilized as a benchmark for multimode process monitoring [17, 8]. Thornhill et al. built the CSTH model, and the detailed information is described in [24]. The CSTH aims to mix hot and cold water to desirable settings. Level, temperature and flow are manipulated by PI controllers. Six critical variables are selected for monitoring in this paper.

This paper designs two cases, each with three successive modes, as listed in Table II. For each mode, 1000 normal samples are collected and 1000 testing samples are generated as follows:

Case 1: a random fault occurs in the level from the 501st sample, with fault amplitude 0.15;

Case 2: a random fault occurs in the temperature from the 501st sample, with fault amplitude 0.18.

For PSFA-EWC, PSFA and RSFA, the evaluation indices of the three monitoring statistics are summarized in Table III. The $S^2$ statistic is established to reflect the dynamic behaviors and to confirm the occurrence of a fault for multimode processes. The three comparative methods calculate two statistics, and a fault is detected when SPE or $T^2$ exceeds the corresponding threshold. The simulation results of PCA-EWC, IMPPCA and MCVA are summarized in Table IV.