1 Introduction
Cyber-physical systems (CPSs) are computer systems that monitor and control physical processes with software Wu et al. (2018). They integrate computing, communication, control and physical components. Because of potential security threats such as component performance degradation, human factors and network attacks, their security and reliability have attracted considerable attention Jiang and Yin (2017). As modern CPSs grow more complex, traditional rule-based attack detection methods can no longer meet these requirements. Recently, with the development of AIoT (AI + IoT), it has become a trend to build attack detection frameworks for CPSs on data-driven machine learning algorithms.
Attack detection for CPSs has long been an active research topic Chalapathy and Chawla (2019); Ahmed et al. (2016), and many methods have been proposed. For example, traditional unsupervised attack detection methods like isolation forest Liu et al. (2008) and DBSCAN Schubert et al. (2017) have been successfully applied to industrial production. Recently, deep learning-based models have achieved great performance. For example, prediction-based methods like recurrent neural networks Hundman et al. (2018); Salinas et al. (2020); Czarnowski et al. (2020) use features extracted from historical data to predict future values. Attacks can then be detected based on the difference between the predicted values and actual observations. Reconstruction-based methods like autoencoders Xu et al. (2018) detect attacks according to the difference between reconstructed values and inputs. In addition, probabilistic generative methods, which model the distribution of data, have been shown to be more effective than deterministic methods Su et al. (2019). To some extent, existing methods have achieved good results on various attack detection tasks in CPSs.
Despite their success, fully and effectively extracting features from the data remains a challenge. For a set of time series collected from CPSs, the data of different channels are usually correlated with each other. However, it is difficult to obtain explicit formulas for modeling the correlation between different dimensions. Methods based on recurrent neural networks Wen et al. (2017) or attention Qin et al. (2017) usually do not explicitly model the correlation of the data in different dimensions, which limits the prediction performance of the model.
In addition, learning the distribution of data is also extremely challenging. Generally, the distribution has no explicit form and the amount of data is limited. Therefore, maximum likelihood estimation cannot be used directly for distribution estimation. Existing methods usually employ tractable distributions to approximate the distribution of data, which constrains the model to a certain form Song and Kingma (2021). For example, flow-based methods Kobyzev et al. (2020) model the data distribution by constructing strictly invertible transformations, and variational autoencoders require the data to be modeled as a directed latent-variable model. These assumptions about the data distribution make the models easier to optimize. However, tractable distributions are not always suitable for modeling data distributions.
To address these problems, we propose TFDPM, a general framework based on conditional energy-based generative models. First, TFDPM simultaneously extracts temporal patterns and feature patterns from historical data. In particular, we use graph neural networks to explicitly model the correlation of the data in different channels. The extracted features then serve as the condition of a conditional probabilistic generative model. In this paper, we use energy-based models (EBMs) because they are much less restrictive in functional form and make no assumptions about the form of the data distribution compared with other generative models Song and Kingma (2021). The predicted values can be obtained through the conditional generation network. Moreover, to meet the requirements of real-time detection, an extra conditional noise scheduling network is proposed to accelerate the prediction process. The experimental results on three real-world datasets show that our method achieves significant improvement over all baseline methods. In addition, the noise scheduling network increases the detection speed by up to three times.
In summary, the contributions of this paper are as follows:

We propose TFDPM, a general framework for anomaly detection in CPSs. The network consists of two modules at each time step: a temporal pattern and feature pattern extraction module and a conditional diffusion probabilistic generative module.

To the best of our knowledge, TFDPM is the first model that uses diffusion-based probabilistic models for anomaly detection tasks in CPSs.

To meet the requirements of online real-time detection, a conditional noise scheduling network is proposed to accelerate the prediction process of TFDPM.

Experimental results demonstrate that our method outperforms other recent models on three real-world CPS datasets, and that the noise scheduling network increases the detection speed by up to three times.
The paper is organized as follows. Related work is introduced in Section 2, and preliminaries of TFDPM are given in Section 3. Section 4 introduces the network architecture, training process and prediction process of TFDPM, together with the conditional noise scheduling network. Extensive experiments, results and analysis are presented in Section 5. Finally, conclusions and future work can be found in Section 6.
2 Related work
In this section, methods for attack detection in CPSs are introduced first. Then energy-based generative models are presented. Finally, we briefly introduce graph neural networks.
2.1 Attack detection for CPSs
Many works have been proposed for data-driven attack detection in recent years, and they can be classified into three categories. The first category is machine learning-based methods (Ding et al. (2015); Huang et al. (2016); Li et al. (2018)). These methods usually divide time series into many segments and then cluster the data based on a distance, such as dynamic time warping or shape-based distance. Outliers of clusters are considered anomalies. The selection of the similarity measure is very important for clustering-based methods. The second category is reconstruction-based methods (Malhotra et al. (2016); Mirsky et al. (2018); Li et al. (2019); Su et al. (2019)). These methods usually construct an autoencoder network, which models the distribution of the entire time series and reconstructs the original input based on latent representations. Attacks are detected by the reconstruction probability (An and Cho (2015)) or the difference between reconstructed values and real values. The dimension of the latent variables is very important for autoencoder networks. The third category is forecasting-based methods (Hundman et al. (2018); Zong et al. (2018); Yan et al. (2021)), which detect attacks based on prediction errors. In this paper, we also propose a forecasting-based model for attack detection based on the collected multivariate time series data.
2.2 Energy-based generative models
Energy-based generative models (EBMs), also called non-normalized probabilistic models, directly estimate a probability density up to an unknown normalizing constant (Song and Kingma (2021)). Compared with other probabilistic methods like VAE (Kingma and Welling (2013)) and normalizing flows (Rezende and Mohamed (2015)), EBMs do not require tractability of the normalizing constant. Instead of specifying a normalized probability, they only estimate the unnormalized negative log-probability, which is called the energy function. Therefore, EBMs are more flexible and less restrictive in functional form.
Although EBMs have significant modeling advantages, exact likelihood is usually intractable and the inference process is usually slow. Recently, some works SanRoman et al. (2021); Lam et al. (2021) that construct a new inference process have been proposed for acceleration. Besides, the unknown normalizing constant makes training particularly difficult. There are currently three categories of methods for training EBMs: (1) Noise contrastive estimation Gutmann and Hyvärinen (2010); Gutmann and Hirayama (2012); Bose et al. (2018) learns an EBM by contrasting it with another known density. (2) Maximum likelihood training with MCMC. (3) Score matching-based methods (Hyvärinen and Dayan (2005)). Instead of estimating the log probability density function (PDF), they estimate the first derivatives of the log-PDF. In this paper, we adopt score matching-based methods for training, which have recently achieved state-of-the-art results (Ho et al. (2020); Song et al. (2020)) on image generation tasks.
2.3 Graph neural networks
Graph neural networks (GNNs) have achieved impressive results on representation learning over non-Euclidean data (Zhou et al. (2020)). They handle the dependencies between different nodes well. There are now two mainstream families of GNNs (Wu et al. (2020)): spectral-based methods and spatial-based methods. The main difference lies in the design of the kernel. In this paper, we use graph attention networks (Veličković et al. (2017)) to explicitly model the correlation between features of multivariate time series.
3 Preliminaries
In this section, attack detection architecture in CPSs is firstly introduced. Then graph attention network used for feature pattern extraction is presented. In the end, the denoising score matching method is presented.
3.1 Attack detection in CPSs
CPSs are integrations of computation, networking, control and physical processes Lee and Seshia (2017). As shown in Fig. 1, sensors convert observations from the physical process to electronic signals. The controller sends control commands to actuators according to the received signals. The actuators convert control commands to mechanical motion. To monitor the status of CPSs, a commonly used anomaly detection architecture is shown in Fig. 1. The control signals and sensor measurements are collected and monitored by an anomaly detector.
3.2 Graph attention networks
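Consider a graph $G = (V, H)$, where $V$ and $H$ are the node vector and corresponding feature matrix, respectively. GAT injects the graph structure by performing masked attention. Let $\mathcal{N}_i$ be the neighborhood of node $i$; the attention coefficients can then be calculated as
$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_k]\right)\right)} \quad (1)$$
where $W$ is a trainable matrix, $a$ is a trainable attention vector, $h_i$ is the feature vector of node $i$, $\|$ represents the concatenation operation and LeakyReLU is
$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \geq 0 \\ \lambda x, & x < 0 \end{cases} \quad (2)$$
with a small negative slope $\lambda$.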
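As a rough illustration of masked attention, the following NumPy sketch computes attention coefficients in the standard GAT form of Veličković et al. (2017). The attention vector `a`, the slope value 0.2 and the function name are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def gat_attention(h, W, a, adj, slope=0.2):
    """Masked attention coefficients alpha[i, j] over the neighbours of node i.

    h: (n, d_in) node features, W: (d_in, d_out), a: (2 * d_out,),
    adj: (n, n) boolean adjacency, True where j is a neighbour of i.
    Every node is assumed to have at least one neighbour (e.g. a self-loop).
    """
    z = h @ W                                   # projected features, (n, d_out)
    d_out = z.shape[1]
    # a^T [z_i || z_j] splits into a source term and a target term.
    e = (z @ a[:d_out])[:, None] + (z @ a[d_out:])[None, :]
    e = np.where(e >= 0, e, slope * e)          # LeakyReLU
    e = np.where(adj, e, -np.inf)               # mask out non-neighbours
    e = e - e.max(axis=1, keepdims=True)        # numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)     # softmax over each neighbourhood
```

Each row of the result sums to one, and entries for non-neighbours are exactly zero, which is what "masked" refers to.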
3.3 Score matching method
Hyvärinen and Dayan (2005) propose score matching for learning non-normalized statistical models. Instead of using maximum likelihood estimation, they minimize the distance between the derivatives of the log density functions of the data and estimated distributions. Denote $s_\theta(x) = \nabla_x \log p_\theta(x)$ as the score function; then, with a simple trick of partial integration, the objective function can be simplified as
$$J(\theta) = \mathbb{E}_{p_{data}(x)}\left[\mathrm{tr}\left(\nabla_x s_\theta(x)\right) + \frac{1}{2}\left\|s_\theta(x)\right\|_2^2\right] \quad (4)$$
where $p_{data}(x)$ and $p_\theta(x)$ are the data distribution and estimated distribution, respectively, and $\theta$ represents trainable parameters. The estimated distribution equals the data distribution when Eq. 4 attains its minimum.
However, Eq. 4 is difficult to calculate. Vincent (2011) connects denoising autoencoders and score matching and proposes denoising score matching for density estimation. It first adds a small amount of noise to each data point, $\tilde{x} \sim q_\sigma(\tilde{x} \mid x)$; then the objective can be formulated as
$$J(\theta) = \mathbb{E}_{q_\sigma(\tilde{x} \mid x)\, p_{data}(x)}\left[\frac{1}{2}\left\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\right\|_2^2\right] \quad (5)$$
The expectation is approximated by the average over samples and, unlike Eq. 4, no second derivatives need to be calculated.
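To make the denoising objective concrete, here is a minimal NumPy sketch of a Monte Carlo estimate of the denoising score matching loss. It assumes a Gaussian perturbation with standard deviation `sigma`; `score_fn` is a hypothetical stand-in for a learned score network.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo estimate of the denoising score matching objective.

    Assumes a Gaussian perturbation x_tilde = x + sigma * eps, for which the
    target score is grad log q(x_tilde | x) = -(x_tilde - x) / sigma**2.
    """
    eps = rng.standard_normal(x.shape)
    x_tilde = x + sigma * eps
    target = -(x_tilde - x) / sigma**2
    diff = score_fn(x_tilde) - target
    return 0.5 * np.mean(np.sum(diff**2, axis=-1))

# Usage: for data drawn from N(0, 1), the perturbed density is N(0, 1 + sigma^2),
# so its true score is -x / (1 + sigma^2).
x = np.random.default_rng(1).standard_normal((20000, 1))
sigma = 0.5
loss_true = dsm_loss(lambda z: -z / (1 + sigma**2), x, sigma, np.random.default_rng(0))
loss_zero = dsm_loss(lambda z: np.zeros_like(z), x, sigma, np.random.default_rng(0))
```

Only first-order quantities appear in the loss, so no Jacobian trace has to be evaluated.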
4 Methodology
In this section, symbols and the problem definition will be firstly introduced. Then the basic procedure for anomaly detection and the framework of TFDPM are presented. In the end, an extra model is trained for accelerating prediction.
4.1 Symbols and problem definition
Consider a multivariate time series , where is the total length and is the number of features. represents the collected data from a CPS, which is . Our target is to determine whether the system is attacked given historical data up to time and the observation at time . Prediction-based methods are used to learn the normal pattern of data in the CPS. Specifically, sliding window data is used to predict , and the prediction method is shown in Eq. 6. Attacks are then detected based on the difference between predicted values and actual observations.
(6) 
4.2 Overall structure
As shown in Fig. 2, TFDPM consists of three modules: data preprocessing, offline training and online detection. The data preprocessing module is shared by offline training and online detection. In this module, we discard missing values and encode discrete signals with one-hot vectors. The data is first processed with min-max normalization, as shown in Eq. 7, and then segmented into subsequences with sliding windows of length . The processed data is sent to the training and prediction module. The difference between predicted values and actual observations serves as the anomaly score, and a proper threshold can be selected for attack detection. In the end, the trained model and selected threshold are deployed for online detection.
$$\tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)} \quad (7)$$
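The preprocessing steps can be sketched in NumPy as follows. Taking the normalization statistics from the training split and adding a small `eps` to avoid division by zero are assumptions of this example, as are the function names.

```python
import numpy as np

def min_max_normalize(train, data, eps=1e-8):
    """Scale each channel to [0, 1] using statistics of the training split."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    return (data - lo) / (hi - lo + eps)

def sliding_windows(series, window):
    """Return (history, target) pairs: `window` past rows predict the next row."""
    hist = np.stack([series[i:i + window] for i in range(len(series) - window)])
    target = series[window:]
    return hist, target

# Usage on a toy series with 10 time steps and 2 channels.
data = np.arange(20, dtype=float).reshape(10, 2)
norm = min_max_normalize(data, data)
hist, target = sliding_windows(norm, window=3)
```

Here `hist[k]` holds the three rows preceding `target[k]`, which matches the prediction-based setup where a window of history is used to predict the next observation.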
4.3 Network architecture
The network architecture of TFDPM at time step is shown in Fig. 3. It consists of two modules at each time step: temporal and feature pattern extraction module and conditional diffusion probabilistic generative module.
4.3.1 Temporal pattern and feature pattern extraction
As shown in Fig. 3, the module extracts feature given historical data up to time . It’s a general module and many methods can be used for feature extraction. In this paper, in order to explicitly model the relationship between different channels, we use graph attention networks (GAT) to extract features. Graph attention networks can also be used for temporal pattern extraction. In addition, temporal convolutional networks (TCN) Oord et al. (2016) with dilations can also be used to capture temporal patterns. As shown in Fig. 4, we construct the following two networks for temporal pattern and feature pattern extraction module.
DoubleGAT uses onedimensional convolution with kernel size 5 to smooth the data. Then two GATs are used to obtain temporal and featureoriented representations. Then we concatenate the output of GATs and the output of onedimensional convolution and send them to GRU for feature extraction.
Onedimensional convolution with kernel size 5 is also carried out firstly to smooth the data in TCNGAT. Different from DoubleGAT, temporal convolutional layers (TCN) are used for temporal pattern extraction. As shown in Fig. 4 (c), in order to capture temporal features from different scales, we use three convolutional layers with filter sizes of , ,
. In addition, different paddings are used to align the outputs of the convolutional layers. The output of TCN is the averaged value of the outputs of convolutional layers.
As shown in Fig. 4 (b), the module that consists of a TCN and graph attention layer forms a single block of TCNGAT. Two blocks are used here for capturing temporal and feature representations. The input of the second block is the average value of the output of the first block and the output of the 1D convolutional layer. Similarly, we concatenate the outputs of the blocks and the 1D convolution and send them to GRU for further feature extraction.
4.3.2 Conditional diffusion probabilistic model
As shown in Fig. 3, the extracted feature at each time step serves as the condition for the conditional diffusion probabilistic model. Then the prediction problem in Eq. 6 can be approximated by Eq. 8, where represents trainable parameters of the temporal and feature pattern extraction module and the conditional diffusion probabilistic model.
(8) 
Let denote the distribution of the dataset given , denote a parameterized function that used to approximate . Inspired by diffusion probabilistic models Ho et al. (2020), we take a sequence of latent variables to estimate , which is . If the approximated posterior satisfies Markov property, then it can be factorized as
(9) 
The Markov chain, which gradually adds Gaussian noise to the input, is constructed based on a variance schedule
. The transition operator is formulated as(10) 
where and
are the expectation and variance of the Gaussian distribution. Denote
, , then the conditional distribution of given can be obtained as(11) 
With enough steps of state transition, the distribution of the last state gradually tends to standard Normal distribution, which is
. If the reverse process is constructed similarly with that of the approximated posterior(12) 
(13) 
then according to Jensen’s inequality, the conditional log-likelihood function can be estimated by
(14)  
The first term can be viewed as a constant. In addition, the second term can be parameterized as . Therefore, we only need to calculate . Fortunately, in the third term is tractable due to the good property of Gaussian distribution, that is
(15) 
where
(16) 
(17) 
If we set , the third term in Eq. 14 can be simplified as
(18)  
where is a constant. What’s more, combining Eq. 11 and Eq. 16, one can get , where . If is parameterized similar with , that is , then Eq. 18 can be simplified as
(19) 
where is parameterized by , . In order to balance the noise and the signal, we use a weighted variational lower bound as training objective:
(20) 
where SNR represents the signal-to-noise ratio function and can be calculated as
(21) 
In fact, Eq. 20 can be viewed as a weighted version of the denoising score matching objective in Eq. 5; the detailed derivation can be found in Appendix A.
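For intuition about the forward process, the sketch below follows the standard diffusion formulation of Ho et al. (2020): a variance schedule, the closed-form sample of a noisy state from the clean input, and a signal-to-noise ratio per step. The linear schedule, its endpoints and the definition of SNR as the ratio of the cumulative signal and noise variances are assumptions of this example; the exact weighting of Eq. 20 is not reproduced.

```python
import numpy as np

def make_schedule(num_steps=100, beta_min=1e-4, beta_max=0.1):
    """Linear variance schedule; the endpoints here are illustrative only."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, n, alpha_bars, rng):
    """Draw x^n ~ N(sqrt(alpha_bar_n) * x0, (1 - alpha_bar_n) * I) in one shot."""
    eps = rng.standard_normal(x0.shape)
    xn = np.sqrt(alpha_bars[n]) * x0 + np.sqrt(1.0 - alpha_bars[n]) * eps
    return xn, eps

def snr(alpha_bars):
    """Signal-to-noise ratio of each diffusion step: alpha_bar / (1 - alpha_bar)."""
    return alpha_bars / (1.0 - alpha_bars)

betas, alpha_bars = make_schedule()
xn, eps = q_sample(np.ones((5, 3)), 99, alpha_bars, np.random.default_rng(0))
```

The ratio decreases monotonically with the step index, so late steps are dominated by noise and early steps by signal, which is the imbalance a weighted objective is meant to address.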
For the trained conditional diffusion probabilistic model, the generation process depends on , which can be calculated as
(22) 
where
. All in all, starting from a sample from white noise
, can be reconstructed by iteratively calling Eq. 22 times.
As for the network, we design an architecture inspired by WaveNet Oord et al. (2016). It consists of 4 residual blocks, and the architecture of a single block is shown in Fig. 5. We first transform to and use Fourier feature embeddings (Tancik et al. (2020)) for . The observed values are smoothed with a 1-dimensional convolution; the output is summed with the embeddings and sent to a 1-dimensional dilated convolutional layer. Next, the sum of this output and the feature extracted from the condition serves as the input to the gated activation unit. Finally, one part of the output is fed into a 1-dimensional convolutional layer and serves as the output of the block, while the other part is summed with the feature extracted from observations and serves as the input to the next block.
4.4 Training
The training procedure at each time step is shown in Algorithm 1. Given the historical data , the feature state is first extracted by the temporal and feature pattern extraction module. Then we get samples via the tractable distribution . In addition, the signal-to-noise ratio at a certain step can be obtained given . In the end, the loss function can be calculated according to Eq. 20, where represents the trainable parameters in TFDPM.
4.5 Prediction and attack detection
The prediction process is shown in Fig. 6. Feature state is firstly obtained based on historical data . can be sampled from the target distribution . Then the feature state , samples from target distribution and parameters at corresponding step are sent to the sampler of diffusion probabilistic models for prediction. The detailed procedure of the sampler at is shown in Algorithm 2.
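The sampler can be pictured with the following NumPy sketch of standard ancestral sampling (Ho et al. (2020)), conditioned on an extracted feature state. The callable `eps_model` is a hypothetical trained noise predictor, and using the forward variance as the reverse-step variance is one common choice rather than a detail confirmed by this paper.

```python
import numpy as np

def ddpm_sample(eps_model, cond, shape, betas, rng):
    """Ancestral sampling from white noise back to a prediction.

    eps_model(x, n, cond) is a hypothetical trained noise predictor that is
    conditioned on the extracted feature state `cond`.
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                  # start from white noise
    for n in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, n, cond)
        mean = (x - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps_hat) / np.sqrt(alphas[n])
        noise = rng.standard_normal(shape) if n > 0 else 0.0
        x = mean + np.sqrt(betas[n]) * noise        # assumed variance: beta_n
    return x
```

The loop calls the network once per diffusion step, which is the cost that a shorter noise schedule aims to reduce.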
Let denote the predicted value of the i-th feature at time . In this paper, the mean square error (MSE) between the predicted value and the actual observation at each time step serves as the anomaly score, which can be obtained via Eq. 23. A larger anomaly score indicates that the system is more likely to be under attack at the corresponding time. Denote the anomaly scores on the test set as , where represents the number of observations in the test set. Anomalies can then be detected by setting a threshold, which can be tuned by various methods. For example, Siffer et al. (2017) propose the peaks over threshold (POT) algorithm based on extreme value theory, and Hundman et al. (2018) propose a dynamic thresholding method. These methods are not the focus of this paper, and we simply use the best metrics obtained from a grid search for comparison.
(23) 
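A minimal sketch of the scoring and threshold search described above is given below. The per-step score is the mean squared error over channels; the number of candidate thresholds and the function names are assumptions of this example.

```python
import numpy as np

def anomaly_scores(pred, obs):
    """Mean squared error over channels at each time step."""
    return np.mean((pred - obs) ** 2, axis=1)

def best_f1_threshold(scores, labels, num_candidates=200):
    """Grid search over thresholds; returns (best_f1, best_threshold)."""
    best_f1, best_thr = 0.0, None
    for thr in np.linspace(scores.min(), scores.max(), num_candidates):
        alarm = scores > thr
        tp = np.sum(alarm & (labels == 1))
        fp = np.sum(alarm & (labels == 0))
        fn = np.sum(~alarm & (labels == 1))
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return best_f1, best_thr

# Usage on a toy example where two time steps deviate from the prediction.
obs = np.zeros((6, 2))
pred = np.zeros((6, 2))
pred[2:4] = 1.0
labels = np.array([0, 0, 1, 1, 0, 0])
scores = anomaly_scores(pred, obs)
f1, thr = best_f1_threshold(scores, labels)
```

Because the threshold is chosen with access to test labels, the resulting F1 score is an upper bound on what a deployed threshold selection method would reach.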
4.6 Efficient sampling
The major limitation of diffusion probabilistic models is that transforming data to target distribution takes many diffusion steps, which makes the generation process slower than other generative models like VAEs and GANs. As shown in Algorithm 2, the conditional network has to be called times for each prediction step. To solve the problem, we propose a conditional noise scheduling network for the scheduling of noise sequences .
Assume the learned noise scales are , where . Similarly, we can get , . In addition, it’s easy to obtain that satisfies . As a result, we can design a neural network such that
(24) 
where represents trainable parameters in . In order to find a shorter noise schedule, we set , where is a positive integer. The constraint makes one step of diffusion using equal to steps of diffusion using .
The training of the noise scheduling network depends on the trained TFDPM. We want to optimize the gap between the log-likelihood and the variational lower bound given the condition at each time step. First, for all , the log-likelihood function can be estimated by a new variational lower bound which is equivalent to the lower bound in Eq. 14.
(25)  
Denote the optimal parameters of TFDPM as , we have . When , , the gap between and can be formulated as
(26) 
where . Detail derivation can be found in B. It’s worth noting that compared with in TFDPM, is also used as the condition in when learning new noise scales because depends on and simultaneously. The new lower bound can be formulated as
(27) 
Substituting and in , it can be simplified as Eq. 28. The corresponding training procedure is shown in Algorithm 3. Since the model contains a bilateral modeling objective for both the score network and the noise scheduling network, we call this method TFBDM.
(28) 
where
(29) 
Once the conditional noise scheduling network is trained, we can sample from the target distribution based on the new noise schedule. Since the generation process starts from the last noise scale, we need to set two hyperparameters and first. The smallest noise scale is used as a threshold to determine when to stop the sampling process. In the end, the sampler can be modified as Algorithm 4.
5 Experiments
In this section, TFDPM is evaluated on three real-world CPS datasets and the results are compared with other state-of-the-art attack detection methods. Then we compare the attack detection performance and prediction speed of TFBDM and TFDPM. In addition, we perform extensive experiments to show the effectiveness of our models.
5.1 Datasets and evaluation metrics
In this paper, we use the following three real-world CPS datasets for evaluation:

PUMP: It consists of the data collected from a water pump system from a small town. The dataset is collected every minute for 5 months.

SWAT: The dataset comes from a water treatment testbed which is a small-scale version of a modern CPS Goh et al. (2016). The system has been widely used and it is very important to detect potential attacks from malicious attackers. In our experiments, the original data samples are downsampled to one data point every 10 seconds.

WADI: Water Distribution (WADI) is a distribution system which consists of many water distribution pipelines Ahmed et al. (2017). It forms a complex system that contains water treatment, storage and distribution networks. Two weeks of normal operations are used as training data. Similarly, we downsample the samples to one data point every 10 seconds. In addition, the data from the last day is ignored because its distribution differs from that of the data from previous days.
Detailed properties of these datasets are shown in Table 1. It shows the number of sensors and actuators in each dataset. Note that the labels of anomalies are only available in the test sets.
As for evaluation metrics, we use precision, recall and F1-score to compare the performance of TFDPM and other methods: $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, where $\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$, and TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives. Moreover, anomalous points are usually contiguous and form continuous anomaly segments Su et al. (2019). Therefore, an attack is considered successfully detected if an alert is raised within the corresponding continuous anomalous segment. In this paper, we mainly compare the best F1-scores on the datasets, which indicate the upper bound of model capability.
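The segment-level counting rule can be sketched as follows: if any alert falls inside a contiguous anomalous segment, the whole segment is counted as detected before precision, recall and F1 are computed. The function names are chosen for this example only.

```python
import numpy as np

def point_adjust(alarm, labels):
    """Mark a whole anomalous segment as detected if any point in it is flagged."""
    adjusted = alarm.astype(bool).copy()
    labels = labels.astype(bool)
    start = None
    for i, is_anom in enumerate(np.append(labels, False)):
        if is_anom and start is None:
            start = i                       # segment begins
        elif not is_anom and start is not None:
            if adjusted[start:i].any():     # at least one alert inside the segment
                adjusted[start:i] = True
            start = None
    return adjusted

def precision_recall_f1(alarm, labels):
    """Point-wise precision, recall and F1 from boolean alarms and labels."""
    alarm, labels = alarm.astype(bool), labels.astype(bool)
    tp = np.sum(alarm & labels)
    fp = np.sum(alarm & ~labels)
    fn = np.sum(~alarm & labels)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

# Usage: one alert inside the first segment, none inside the second.
labels = np.array([0, 1, 1, 1, 0, 0, 1, 1])
alarm = np.array([0, 0, 1, 0, 1, 0, 0, 0])
adjusted = point_adjust(alarm, labels)
precision, recall, f1 = precision_recall_f1(adjusted, labels)
```

In this toy case the first segment is fully credited after adjustment while the second remains missed, so recall rises from 0.2 to 0.6.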




DATASET  TRAIN   TEST    SENSORS  ACTUATORS  ANOMALIES(%)
PUMP     76901   143401  44       0          10.05
SWAT     49668   44981   25       26         11.97
WADI     120960  15701   67       26         7.09
5.2 Competitive methods
We compare the performance of our proposed method with the following state-of-the-art attack detection methods.

Isolation forest: Isolation forest Liu et al. (2008) is an unsupervised attack detection method based on decision tree algorithms and has been successfully used in various fields. It usually trains quickly and generalizes well.

LSTMAE: LSTMAE Malhotra et al. (2016) uses an LSTM-based autoencoder to model time series and detects attacks based on the difference between actual observations and reconstructed values.

SparseAE: SparseAE Ng and others (2011) also adopts an autoencoder for attack detection. Compared with LSTMAE, it uses sparse hidden embeddings to get better latent representations of the input data.

LSTMPRED: LSTMPRED Goh et al. (2017) uses LSTM-based methods for prediction. It detects attacks based on the difference between predicted values and actual observations.

DAGMM: DAGMM Zong et al. (2018) combines deep generative models and Gaussian mixture models, which can generate a low-dimensional representation for each observation. Attacks are also detected based on the reconstruction error.

USAD: USAD Audibert et al. (2020) is a recently proposed method based on autoencoders. The architecture allows it to isolate anomalies and provides fast training. The adversarial training trick is also shown to be helpful for attack detection.
Since SparseAE and Isolation forest are not directly designed for time series anomaly detection, the observed values and states and actuators in time points are stacked as the inputs for the models.
5.3 Hyperparameter settings
The length of the sequence of noise scales , which is also called diffusion steps, is set as for all experiments. In addition, the sequence of noise scales for the forward process is set constants increasing from to . In addition, the batch size for training is set as 100 for all three datasets. The default length of sliding window
is set as 12. TFDPM is trained for 20 epochs with early stopping.
As for the noise scheduling network, there are three important parameters, , , . A larger makes the length of constructed noise scales shorter. We set for all datasets. In order to find a good set of initial values for , we set the range of these parameters from to and apply a grid search algorithm for parameter tuning. The noise scheduling network is also trained for 20 epochs with early stopping.
5.4 Results and Analysis
In this section, we demonstrate that TFDPM outperforms other baselines based on the performance on three CPS datasets. Then the prediction speed and performance of TFBDM and TFDPM are compared. What’s more, the influence of hyperparameters on the performance is explored.
5.4.1 Comparison of performance
We use different temporal pattern and feature pattern extraction modules for TFDPM. They first fuse the features extracted with models like GAT and TCN and then send the fused features to a GRU to generate the feature state. As a comparison, we also use a GRU alone for feature extraction, which obtains the feature state directly from the input data. Table 2 shows the best F1 scores and the corresponding precision and recall for each method on all datasets. The best two results for each dataset are shown in bold.
Isolation forest performs very well on SWAT, which has the fewest sensors. However, we find that the values observed from sensors are usually more complex than the data collected from actuators in these three datasets, and Isolation forest performs worse than some recently proposed deep learning-based methods on the other two datasets, which have many more sensors.
In addition, SparseAE performs better than LSTMAE on all datasets, which indicates that proper latent representations are very important for autoencoders. What’s more, deterministic methods like LSTMAE and SparseAE usually perform worse than those probabilistic generative models like DAGMM, OmniAnomaly and USAD. It indicates that modeling the distribution of datasets is more effective and robust than deterministic methods.
Our methods obtain the best results on all three datasets, especially on WADI and PUMP. The best F1 score of TFDPM is basically the same as that of OmniAnomaly on SWAT. We analyze the data in SWAT and find that the states of actuators change frequently and that some areas marked as abnormal are very similar to normal data. These characteristics make the recall of all the models relatively low. The best F1-scores on WADI and PUMP of TFDPM with TCNGAT for feature extraction are about 4% and 2% higher, respectively, than the best results of existing baselines. In addition, TFDPM based on DoubleGAT and GRU also obtains good results on all three datasets. This indicates that the conditional diffusion probabilistic model is very effective and can model the distribution of normal data better than a variational autoencoder. Furthermore, TFDPM based on GRU performs worse than TFDPM based on TCNGAT or DoubleGAT. Therefore, using GAT to explicitly model the relationship between different channels of the multivariate dataset is effective. Compared with TFDPM based on DoubleGAT, TFDPM based on TCNGAT usually obtains slightly better results on these three datasets, which indicates that TCN is more suitable than the time-oriented GAT for extracting features at different time scales.
Models             PUMP                   WADI                   SWAT
                   PRE    REC    F1       PRE    REC    F1       PRE    REC    F1
Isolation forest   0.977  0.852  0.729    0.826  0.772  0.798    0.975  0.754  0.850
LSTMAE             0.438  0.796  0.565    0.589  0.887  0.708    0.945  0.620  0.749
SparseAE           0.798  0.737  0.766    0.769  0.771  0.770    0.999  0.666  0.799
LSTMPRED           0.925  0.581  0.714    0.620  0.876  0.726    0.996  0.686  0.812
DAGMM              0.931  0.798  0.859    0.886  0.772  0.825    0.946  0.747  0.835
OmniAnomaly        0.937  0.840  0.886    0.846  0.893  0.869    0.979  0.753  0.851
USAD               0.984  0.682  0.731    0.806  0.879  0.841    0.987  0.740  0.846
TFDPM (GRU)        0.938  0.831  0.881    0.893  0.865  0.879    0.974  0.741  0.842
TFDPM (TCNGAT)     0.959  0.856  0.905    0.939  0.881  0.909    0.989  0.749  0.852
TFDPM (DoubleGAT)  0.893  0.906  0.899    0.916  0.878  0.897    0.988  0.742  0.848
In order to make the attack detection process more intuitive, two typical anomalous areas in PUMP are shown in Fig. 7. The figures of the first row display observed values and predicted values in the certain channel of the PUMP dataset in different periods. It shows that the observed values drop abruptly over a period of time. Compared with the observed values, the predicted values in this channel remain stable and fluctuate slightly around a certain value. It indicates that TFDPM can well model the distribution of normal data. Even if there are anomalies caused by attacks in some channels, TFDPM can still give reasonable predicted values. As a result, the difference between observed values and predicted values in these areas is larger than those in the normal area. Therefore, we can perform attack detection based on the difference between predicted values and observed values. After selecting a proper threshold, the anomalous area can be detected. The figures of the second row confirm this view and they show the mean square error of predictive values and observed values in corresponding periods. The red areas represent correct alarms of TFDPM while the points marked in red represent false positives. It’s obvious that TFDPM can well locate the anomalous areas based on the MSE values.
To show the effect of GAT in TFDPM, part of the attention weights of the feature-oriented GAT in DoubleGAT are shown in Fig. 8. The figure shows the correlation between the data in the first channel and the data in the corresponding channels. Darker colors represent lower attention scores. For a given channel, the greater the color difference between normal and abnormal status, the more its correlation with the first channel changes when the system fails. The figure therefore shows that the correlation between certain channels differs significantly from normal conditions when the system fails. It also indicates that GAT can explicitly model the correlation of data in different channels, which is important for extracting feature patterns from the data.
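The comparison of attention weights under normal and abnormal status can be made concrete with a small sketch. It assumes that the attention weights of one channel towards the other channels are already available as plain vectors; the function name `attention_shift` and the numbers are hypothetical and are not taken from Fig. 8.

```python
import numpy as np

def attention_shift(attn_normal, attn_abnormal):
    """Absolute change in the attention one channel pays to each channel.

    Both inputs are 1-D arrays of attention weights over channels.
    Returns (shift, ranking), with ranking sorted by decreasing shift.
    """
    attn_normal = np.asarray(attn_normal, dtype=float)
    attn_abnormal = np.asarray(attn_abnormal, dtype=float)
    shift = np.abs(attn_abnormal - attn_normal)
    # Channels whose correlation with the reference channel changed most come first
    ranking = np.argsort(-shift)
    return shift, ranking

# Hypothetical attention weights of the first channel towards four channels
normal = np.array([0.40, 0.30, 0.20, 0.10])
abnormal = np.array([0.40, 0.05, 0.25, 0.30])

shift, ranking = attention_shift(normal, abnormal)
print(ranking)  # channels ordered by how strongly their attention weight changed
```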
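Models  PUMP PRE  PUMP REC  PUMP F1  PUMP speed ratio  WADI PRE  WADI REC  WADI F1  WADI speed ratio  SWAT PRE  SWAT REC  SWAT F1  SWAT speed ratio  
TFBDM (TCNGAT)  0.958  0.840  0.896  3.1  0.918  0.874  0.895  2.8  0.985  0.757  0.856  2.3  
TFBDM (DoubleGAT)  0.902  0.906  0.904  3.2  0.921  0.878  0.899  2.6  0.984  0.753  0.853  2.6 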
5.4.2 Comparison of TFBDM and TFDPM
Based on the trained TFDPM models in Table 2, the corresponding noise scheduling networks are trained with Algorithm 3. Then, for the extracted feature, we construct a sequence of noise vectors and predict values according to Algorithm 4 at each time step. The results are shown in Table 3.
As shown in Table 3, the best F1 scores of TFBDM are basically the same as those of TFDPM. It is worth noting that TFBDM based on DoubleGAT performs slightly better than TFDPM based on DoubleGAT on all datasets. In addition, TFBDM based on TCNGAT performs better than TFDPM based on TCNGAT on SWAT. Therefore, given the trained TFDPM models, training the noise scheduling network, which optimizes the gap between the log-likelihood and the variational lower bound, can further improve the performance of anomaly detection. Moreover, the computation cost of the prediction procedure of TFDPM is large, which makes it unable to meet the requirements of online real-time detection. To accelerate the prediction process, we construct shorter noise sequences based on the noise scheduling network. Table 3 shows the ratio of the prediction speed of TFBDM to the prediction speed of TFDPM. The prediction speed of TFBDM is about three times that of TFDPM (between 2.3 and 3.2 times in Table 3), which greatly reduces the computation cost of the prediction process.
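The "best F1 score" used in the comparison above can be obtained by sweeping the detection threshold over the anomaly scores. The sketch below shows one simple way to do this; the function name `best_f1` and the toy scores are assumptions, and the exact evaluation protocol of the paper (for example any adjustment of alarms within an anomalous segment) is not described in this section, so it is not reproduced here.

```python
import numpy as np

def best_f1(scores, labels):
    """Return (f1, precision, recall, threshold) with the highest F1.

    scores: anomaly score per time step (higher means more anomalous).
    labels: ground-truth labels per time step (1 = attack, 0 = normal).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best = (0.0, 0.0, 0.0, None)
    # Every distinct score is a candidate threshold
    for thr in np.unique(scores):
        pred = scores >= thr
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, precision, recall, thr)
    return best

# Hypothetical scores: the two attack steps have the highest scores
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]
labels = [0, 0, 1, 1, 0, 0]
f1, precision, recall, thr = best_f1(scores, labels)
print(f1, precision, recall, thr)
```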
5.4.3 Influence of hyperparameters
For TFDPM, the length of the historical data, the number of diffusion steps and the training batch size are important hyperparameters. We therefore study their influence on the performance of anomaly detection.
Influence of batch size: We vary the training batch size while keeping all other parameters unchanged, and the results are shown in Fig. 9. Firstly, TFDPM based on GRU performs worse than the other models on all three datasets in most cases. Therefore, compared with obtaining the feature directly with GRU, we get better features of historical data by fusing the temporal pattern and the feature pattern. In addition, the performance of anomaly detection first increases and then decreases as the batch size grows. For SWAT, TFDPM based on GRU, TCNGAT and DoubleGAT get their best performance when the batch size is set to 50, 100 and 159, respectively. For the other two datasets, we get good detection results if the batch size is set to 100. Moreover, TFDPM based on TCNGAT performs slightly better than TFDPM based on DoubleGAT on all datasets, which again indicates that TCN is more suitable for temporal pattern extraction than GAT.
Influence of diffusion steps: The number of diffusion steps is very important for conditional diffusion probabilistic models. According to Eq. 11, the target distribution tends to the standard normal distribution when the number of diffusion steps is large enough. In addition, with the signal-to-noise ratio incorporated into the training objective in Eq. 20, we can get  Kingma et al. (2021). Therefore, a large number of diffusion steps can further improve the performance. However, it also increases the computational cost, so there is a trade-off between performance and computational cost. We vary the number of diffusion steps while keeping all other parameters unchanged, and the results are shown in Fig. 10. TFDPM performs better as the number of diffusion steps increases, which agrees with the derivation above. However, we cannot get better performance when  if all other parameters are kept unchanged. For the PUMP dataset, TFDPM gets the best results when . As for the SWAT dataset, TFDPM based on TCNGAT and DoubleGAT get the best results when  while  for TFDPM based on GRU. As for the WADI dataset, TFDPM based on TCNGAT, GRU and DoubleGAT get the best results when  respectively. In addition, TFDPM based on GRU usually performs worse than the other two variants, which again demonstrates that we get better representations by explicitly modeling the feature pattern.
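The claim that the target distribution approaches a standard normal distribution for a large number of steps can be checked numerically. The sketch below assumes the usual forward process of denoising diffusion models, in which the state after n steps is Gaussian with mean sqrt(alpha_bar_n) * x_0 and variance 1 - alpha_bar_n, together with a linear variance schedule. Both the schedule and its end points are assumptions for illustration; Eq. 11 is not reproduced in this section.

```python
import numpy as np

def terminal_signal_and_noise(num_steps, beta_start=1e-4, beta_end=0.1):
    """Signal coefficient and noise variance after the last diffusion step.

    Assumes a linear schedule beta_1..beta_N (hypothetical end points) and
    q(x_N | x_0) = N(sqrt(alpha_bar_N) * x_0, (1 - alpha_bar_N) * I).
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    signal = np.sqrt(alpha_bar[-1])   # weight of the clean data x_0
    noise_var = 1.0 - alpha_bar[-1]   # variance of the added Gaussian noise
    return signal, noise_var

results = {n: terminal_signal_and_noise(n) for n in (10, 50, 100, 200)}
for n, (signal, noise_var) in results.items():
    print(f"N={n:4d}  signal={signal:.4f}  noise variance={noise_var:.4f}")
```

With more steps the signal coefficient shrinks towards 0 and the noise variance grows towards 1, so the last state carries almost no information about the data. Each extra step also costs one more network evaluation at prediction time, which is the trade-off discussed above.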
Influence of window size: The length of the historical data is crucial for the temporal pattern and feature pattern extraction module. We vary the window size while keeping all other parameters unchanged, and the results are shown in Fig. 11. The best F1 scores of TFDPM based on TCNGAT and DoubleGAT on WADI fluctuate in a small range when the window size changes from 12 to 36, but the performance of these two models on WADI decreases when . As for the other datasets, the performance of TFDPM usually decreases when the window size is greater than 24. Therefore, longer sequences may not help the extraction of the temporal pattern and the feature pattern when all other parameters are kept unchanged. Moreover, TFDPM based on GRU usually performs worse than the other variants, which again indicates that we get better representations of historical data by fusing the features extracted from the temporal pattern and the feature pattern.
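The role of the window size can be illustrated by how a multivariate series is cut into history windows and prediction targets. This is a minimal sketch under the assumption of one-step-ahead prediction; the function name `make_windows` and the toy series are hypothetical.

```python
import numpy as np

def make_windows(series, window_size):
    """Split a (time_steps, channels) series into (history, target) pairs.

    histories[i] holds window_size consecutive time steps and
    targets[i] is the time step that directly follows that window.
    """
    series = np.asarray(series, dtype=float)
    n = series.shape[0] - window_size
    if n <= 0:
        raise ValueError("series must be longer than window_size")
    histories = np.stack([series[i:i + window_size] for i in range(n)])
    targets = series[window_size:]
    return histories, targets

# Toy series with 20 time steps and 2 channels
series = np.arange(40, dtype=float).reshape(20, 2)
histories, targets = make_windows(series, window_size=12)
print(histories.shape, targets.shape)
```

A larger window gives the extraction module more context per sample but also fewer samples and longer inputs, which is consistent with the observation that longer sequences do not necessarily help.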
In summary, we explore the influence of hyperparameters on the three datasets. The performance of TFDPM based on TCNGAT is generally slightly better than that of TFDPM based on DoubleGAT. In addition, TFDPM based on GRU usually performs worse than the other two models. Reasonable values for the batch size, the number of diffusion steps and the window size are 100, 100 and 12, respectively, for all datasets.
6 Conclusion and future works
In this paper, we propose TFDPM, a general framework for attack detection tasks in CPSs. It consists of two components: a temporal pattern and feature pattern extraction module and a conditional diffusion probabilistic module. In the first module, we use graph attention networks to explicitly model the correlation of data in different channels. In addition, the energy-based generative model used in the second module is less restrictive on the functional form of the data distribution. To realize real-time detection, a noise scheduling network is proposed to accelerate the prediction process. Experiments show that TFDPM outperforms existing state-of-the-art methods and that the noise scheduling network makes prediction about three times faster.
There is still much work to be done. For example, discrete flows Tran et al. (2019) can be used to model discrete signals in collected data. Besides, the conditional diffusion probabilistic model used in this paper can be viewed as the discrete form of stochastic differential equations (SDEs) Song et al. (2020). Therefore, more powerful generative models can be designed based on SDEs.
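The connection to SDEs can be written out for the standard denoising diffusion setting. The notation below ($\beta_n$ for the per-step noise variance, $N$ for the number of steps, $\mathbf{w}$ for a Wiener process) is assumed for illustration and may differ from the notation used in the earlier sections of this paper.

```latex
% Discrete forward step of a diffusion probabilistic model
\mathbf{x}_n = \sqrt{1-\beta_n}\,\mathbf{x}_{n-1} + \sqrt{\beta_n}\,\boldsymbol{\epsilon}_{n-1},
\qquad \boldsymbol{\epsilon}_{n-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad n = 1, \dots, N.

% Continuous-time limit as N -> infinity
% (variance-preserving SDE of Song et al. (2020))
\mathrm{d}\mathbf{x} = -\tfrac{1}{2}\,\beta(t)\,\mathbf{x}\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\mathbf{w}.
```

In this view the finite-step model is one particular discretization of the SDE, which is why other discretizations or other SDEs are candidates for future generative models.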
Appendix A
(30)  
In addition, according to the definition of , we can get
(31) 
Combining Eq. 30 and Eq. 31, we can get . If we set , then with similar derivation, Eq. 20 can be written as
(32)  
Therefore, the training objective in Eq. 20 is a weighted version of denoising score matching objective.
Appendix B
For the trained TFDPM, assume , then the gap between the log-likelihood and the variational lower bound can be formulated as
(33) 