1 Introduction
Anomaly detection has been one of the core research areas in machine learning for decades, with wide applications such as cyber-intrusion detection [5], medical care [72], sensor networks [3], video anomaly detection [26]
and so on. Anomaly detection may seem to be a simple two-category classification problem, i.e., learning to classify data as normal or abnormal. However, it faces the following challenges. First, training data is highly imbalanced, since anomalies are often extremely rare in a dataset compared to normal instances. Standard classifiers try to maximize classification accuracy, so they often fall into the trap of the overlapping problem: the model classifies the overlapping region as belonging to the majority class while treating the minority class as noise. Second, there is no easy way for users to manually label each training instance, especially the anomalies; in many cases, it is prohibitively hard to represent all types of anomalous behaviors. Due to the above challenges, there is a growing trend to use unsupervised learning approaches for anomaly detection rather than semi-supervised and supervised approaches, since unsupervised methods can handle imbalanced and unlabeled data in a more principled way
[48, 45, 51, 71, 8].

Nowadays, the prevalence of sensors in machine learning and pervasive computing research areas such as Health Care (HC) [7, 65] and Human Activity Recognition (HAR) [63, 64] generates a substantial amount of multivariate time-series data. Learning algorithms based on multi-sensor time-series signals give priority to the spatial-temporal correlation of multi-sensor data, and many approaches to spatial-temporal dependency amongst multiple sensors [34, 68, 46] have been studied. It seems intuitive to apply previous unsupervised anomaly detection methods to multi-sensor time-series data. Unfortunately, there are still several challenges.
First, anomaly detection in the spatial-temporal domain becomes more complicated due to the temporal component of time-series data. Conventional anomaly detection techniques such as PCA [43], k-means [31], OC-SVM [53] and Autoencoders [52] are unable to deal with multivariate time-series signals since they cannot simultaneously capture the spatial and temporal dependencies. Second, reconstruction-based models such as Convolutional AutoEncoders (CAEs) [20] and Denoising AutoEncoders (DAEs)
[62] are usually used for anomaly detection. It is generally assumed that the compression of anomalous samples differs from that of normal samples, so the reconstruction error becomes higher for anomalous samples. In reality, influenced by the high complexity of the model and the noise in the data, abnormal inputs can also be fitted well by the trained model [75, 15]; that is, the model is robust to noise and anomalies. Third, in order to reduce the dimensionality of multi-sensor data and detect anomalies, two-step approaches are widely adopted. A drawback of some such works [33, 1] is that the joint performance of the two baseline models can easily get stuck in local optima, since the two models are trained separately.

In order to solve the above three challenges, this paper presents a novel unsupervised deep-learning-based anomaly detection approach for multi-sensor time-series data called the Deep Convolutional Autoencoding Memory network (CAEM). The CAEM network consists of two main subnetworks: a characterization network and a memory network. Specifically, we employ a deep convolutional autoencoder as the feature extraction module, with attention-based Bidirectional LSTMs and an Autoregressive model as the forecasting module. By simultaneously minimizing reconstruction error and prediction error, the CAEM model can be jointly optimized. During the training phase, the CAEM model is trained to explicitly describe the normal pattern of multi-sensor time-series data. During the detection phase, the CAEM model calculates the compound objective function for each captured testing sample. By combining these errors into a composite anomaly score, a fine-grained anomaly detection decision can be made. To summarize, the main contributions of this paper are fourfold:
1) The proposed composite model is designed to characterize complex spatial-temporal patterns by concurrently performing reconstruction and prediction analysis. In reconstruction analysis, we build a Deep Convolutional Autoencoder to fuse and extract low-dimensional spatial features from multi-sensor signals. In prediction analysis, we build an Attention-based Bidirectional LSTM to capture complex temporal dependencies. Moreover, we incorporate an Autoregressive linear model in parallel to improve robustness and adapt to different use cases and domains.
2) To reduce the influence of noisy data, we improve the Deep Convolutional Autoencoder with a Maximum Mean Discrepancy (MMD) penalty. MMD is used to encourage the distribution of the low-dimensional representation to approximate a target distribution. It aims to make the distribution of noisy data close to the distribution of normal training data, thereby reducing the risk of overfitting. Experiments demonstrate that this penalty effectively enhances the robustness and generalization ability of our method.
3) CAEM is an end-to-end learning model in which the two subnetworks are co-optimized via a compound objective function with weight coefficients. This single-stage approach not only streamlines the learning procedure for anomaly detection, but also keeps the model from getting stuck in local minima through joint optimization.
4) Experiments on three multi-sensor time-series datasets demonstrate that the CAEM model has superior performance over state-of-the-art techniques. To further verify the effect of our proposed model, fine-grained analysis, effectiveness evaluation, parameter sensitivity analysis and convergence analysis show that the components of CAEM together lead to robust performance on all datasets.
The rest of the paper is organized as follows. Section 2 provides an overview of existing methods for anomaly detection. Our proposed methodology and detailed framework are described in Section 3. Performance evaluation and experimental analysis follow in Section 4. Finally, Section 5 concludes the paper and sketches directions for possible future work.
2 Related Work
Anomaly detection has been studied for decades. Based on whether labels are used in the training process, existing methods are grouped into supervised, semi-supervised and unsupervised anomaly detection. Our main focus is the unsupervised setting. In this section, we review various types of existing approaches for unsupervised anomaly detection, which can be categorized into traditional anomaly detection and deep anomaly detection.
2.1 Traditional anomaly detection
Conventional methods can be divided into three categories. 1) Reconstruction-based methods are proposed to accurately represent and reconstruct normal data with a model, for example PCA [43], Kernel PCA [54, 21] and Robust PCA (RPCA) [44]. Specifically, RPCA identifies a low-rank representation, random noise and outliers by using a convex relaxation of the rank operator. 2) Clustering analysis is used for anomaly detection, for example Gaussian Mixture Models (GMM) [32], k-means [31] and Kernel Density Estimators (KDE) [25]. These methods cluster data samples and find anomalies via a predefined outlierness score. 3) One-class learning models are also widely used for anomaly detection. For instance, the One-Class Support Vector Machine (OC-SVM) [53] and Support Vector Data Description (SVDD) [60] seek to learn a discriminative hypersphere surrounding the normal samples and then classify new data as normal or abnormal.

It is notable that these conventional anomaly detection methods are designed for static data. To capture temporal dependencies appropriately, Autoregression (AR) [17], Autoregressive Moving Average (ARMA) [19] and Autoregressive Integrated Moving Average (ARIMA) models [42]
are widely used. These models represent time series generated by passing the input through a linear or nonlinear filter that produces the output at any time using the previous output values. Once we have the forecast, we can compare it with the ground truth to detect anomalies. Nevertheless, the AR model and its variants are rarely used on multi-sensor multivariate time series due to their high computational cost.
2.2 Deep anomaly detection
In deep-learning-based anomaly detection, we discuss reconstruction models, forecasting models and composite models.
2.2.1 Reconstruction models
The reconstruction model focuses on reducing the expected reconstruction error by different methods. For instance, Autoencoders [52] are often utilized for anomaly detection by learning to reconstruct a given input. The model is trained exclusively on normal data; once it is unable to reconstruct an input with quality comparable to the reconstruction of normal data, the input sequence is treated as anomalous. The LSTM Encoder-Decoder model [40] is proposed to learn a temporal representation of the input time series with LSTM networks and uses the reconstruction error to detect anomalies. Despite its effectiveness, LSTM does not take spatial correlation into consideration. Convolutional Autoencoders (CAEs) [20] are an important method for video anomaly detection: they are capable of capturing the 2D image structure since the weights are shared among all locations in the input image. Furthermore, since Convolutional Long Short-Term Memory (ConvLSTM) can model spatial-temporal correlations by using convolutional layers instead of fully connected layers, some researchers [68, 38] add ConvLSTM layers to the autoencoder, which better encodes the change of appearance of normal data.

Variational Autoencoders (VAEs) are a special form of autoencoder that models the relationship between two random variables, a latent variable and a visible variable; the prior over the latent variable is usually a multivariate unit Gaussian. For anomaly detection, the authors of [2] define the reconstruction probability as the average probability of the original data being generated from the learned distribution. Data points with low reconstruction probability are classified as anomalies, and vice versa. Others, such as Denoising AutoEncoders (DAEs) [62], Deep Belief Networks (DBNs) [67] and the Robust Deep Autoencoder (RDA) [74], have also been reported to achieve good performance for anomaly detection.

2.2.2 Forecasting models
The forecasting model can also be used for anomaly detection. It aims to predict one or more continuous values, e.g. forecasting the current output values from the past values. RNNs and LSTMs are the standard models for sequence prediction. In [13, 11], the authors perform anomaly detection by using RNN-based forecasting models to predict values for the next time period and minimize the mean squared error (MSE) between predicted and future values. Recently, there have also been attempts to perform anomaly detection using other feedforward networks. For instance, Shalyga et al. [55] develop a Neural Network (NN)-based forecasting approach for early anomaly detection. Kravchik and Shabtai [27] apply different variants of convolutional and recurrent networks as forecasting models, and their results show that 1D convolutional networks obtain the best accuracy for anomaly detection in industrial control systems. In another work [30], Lai et al. propose a forecasting model named LSTNet, which uses a CNN and an RNN to extract short-term local dependency patterns and long-term patterns in multivariate time series, and incorporates a linear SVR model into LSTNet. Besides, other efforts have been made in [36] using GAN-based anomaly detection. The model adopts U-Net as a generator to predict the next frame in a video and leverages adversarial training to discriminate whether the prediction is real or fake; abnormal events can thus be easily identified by comparing the prediction and the ground truth.

2.2.3 Composite models
Besides single models, composite models for unsupervised anomaly detection have gained a lot of attention recently. Zong et al. [75] utilize a deep autoencoder to generate a low-dimensional representation and reconstruction error, which are further fed into a Gaussian Mixture Model to model the density distribution of the multidimensional features. However, this work does not consider the spatial-temporal dependency of multivariate time-series data. Differently, the Composite LSTM model [57] uses a single encoder LSTM and multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence and predicting the future sequence. In [41], the authors use a ConvLSTM model as a unit within the composite LSTM model, with one branch for reconstruction and another for prediction. This type of composite model is currently used to extract features from video data for action recognition tasks. Similarly, the authors of [73] propose the Spatial-Temporal AutoEncoder (STAE) for video anomaly detection, which utilizes a 3D convolutional architecture to capture spatial-temporal changes. The architecture of the network is an encoder followed by two decoder branches for reconstructing the past sequence and predicting the future sequence, respectively.
As mentioned above, unsupervised anomaly detection techniques still have many deficiencies. Traditional anomaly detection struggles to learn representations of spatial-temporal patterns in multi-sensor time-series signals. For a reconstruction model, the single task makes the model tend to store information only about the inputs memorized by the AE, while the forecasting task tends to store only the last few values that are most important for predicting the future [57, 41]. Hence, their performance is limited, since the model only learns trivial representations. As for composite models, researchers design their models for different purposes: Zong et al. [75] address the problem that the model is robust to noise and anomalies by performing density estimation in a low-dimensional space, while Zhao et al. [73] consider the spatial-temporal dependency through 3D convolutional reconstructing and forecasting architectures. However, few studies address these issues simultaneously.

Different from these works, our research makes the following contributions: 1) The proposed model is designed to characterize complex spatial-temporal dependencies, thus discovering generalized patterns of multi-sensor data; 2) Adding a Maximum Mean Discrepancy (MMD) penalty keeps the model from generalizing too well to noisy data and anomalies; 3) Combining an Attention-based Bidirectional LSTM (BiLSTM) and a traditional Autoregressive linear model boosts the model's performance at different time scales; 4) The composite baseline model is trained end-to-end, which means all the components within the model are jointly trained with a compound objective function.
Besides, some learning algorithms based on time-series data have been studied for decades. The authors of [69] propose Unsupervised Salient Subsequence Learning to extract subsequences as new representations of time-series data. Due to the inherently sequential relationship, many neural-network-based models can be applied to time series in an unsupervised learning manner. For example, some 1D-CNN models [12, 59] have been proposed to solve time-series tasks with a very simple structure and state-of-the-art performance. Moreover, multiple time-series signals usually have some kind of correlation; [66] proposes a method to learn the relation graph over multiple time series. Anomaly detection applications based on multiple time series are available for wastewater treatment [23], for the ICU [6], and for sensors [70].
3 The Proposed Method
3.1 Notation
In a multi-sensor time-series anomaly detection problem, we are given a dataset generated by multiple sensors. Without loss of generality, we assume each sensor may generate several signals (e.g., an accelerometer often generates 3-axis signals), and the signal set collects all signals from all sensors. Each signal is a sequence of readings over time. Although the sensor signals may have different lengths, we are often interested in their intersection, i.e., all sensors trimmed to the same length, so that an input sample contains the aligned signals of all sensors.
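As a concrete illustration of this windowing, equal-length sensor signals can be packed into fixed-size training samples; the function below is a minimal NumPy sketch with our own names (`pack_windows`, `window`, `step`), not code from the paper.

```python
import numpy as np

def pack_windows(signals, window, step):
    """Pack k equal-length sensor signals into (num_windows, k, window) samples.

    signals: array of shape (k, T) -- k signals trimmed to a common length T.
    """
    k, T = signals.shape
    starts = range(0, T - window + 1, step)
    return np.stack([signals[:, s:s + window] for s in starts])

# Example: 3 signals of length 10, window 4, step 2 -> 4 windows of shape (3, 4)
X = pack_windows(np.arange(30).reshape(3, 10), window=4, step=2)
```

Overlapping windows (step smaller than window) are a common choice so that every time step appears in several training samples.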
Definition 1 (Unsupervised anomaly detection).
It is nontrivial to formally define an anomaly. In this paper, we are interested in detecting anomalies in a classification setting: given a classification label set with a fixed number of classes, each sample in the dataset belongs to one of these classes. Our goal is to detect whether an input sample belongs to one of the predefined classes with high confidence; if not, we call it an anomaly. Note that in this paper we deal with an unsupervised anomaly detection problem, where the labels are unseen during training, which is obviously more challenging.
3.2 Overview
Some existing works [53, 20, 33] attempt to resolve the unsupervised anomaly detection problem. Unfortunately, they face several critical challenges. First, conventional anomaly detection techniques such as PCA [43], k-means [31] and OC-SVM [53] are unable to capture temporal dependencies appropriately because they cannot carry temporal memory states. Second, since the normal samples might contain noise and anomalies, the generalization capability of deep anomaly detection approaches such as standard Autoencoders [20, 52] is likely to be affected. Third, multi-stage approaches, in which feature extraction and predictive model building are separated [33, 1], can easily get stuck in local optima.
In this paper, we present a novel approach called the Convolutional Autoencoding Memory network (CAEM) to tackle the above challenges. Fig. 1 gives an overview of the proposed method. In a nutshell, CAEM is built upon a convolutional autoencoder whose output is fed into a predictive network. Concretely, we encode the spatial information of multi-sensor time-series signals into a low-dimensional representation via a Deep Convolutional Autoencoder (CAE). To reduce the effect of noisy data, some existing works add a Memory module [15] or a Gaussian Mixture Model (GMM) [75]. In our proposed method, we simplify these modules into a penalty term, which we call the Maximum Mean Discrepancy (MMD) penalty. Adding an MMD term encourages the distribution of the training data to approximate a common target distribution such as a Gaussian, thus reducing the risk of overfitting caused by noise and anomalies in the training data [75]. We then feed the representation and the reconstruction error to the subsequent prediction network, based on a Bidirectional LSTM (BiLSTM) with an attention mechanism and an Autoregressive (AR) model, which predicts future feature values by modeling the temporal information. Through this composite model, the spatial-temporal dependencies of multi-sensor time-series signals can be captured. Finally, we propose a compound objective function with weight coefficients to guide end-to-end training. For normal data, the reconstruction generated from the data coding is similar to the original input sequence and the predicted value is similar to the future value of the time series, while the reconstructed and predicted values generated for abnormal data deviate greatly. Therefore, during inference we can detect anomalies precisely by computing the loss function of the composite model.
3.3 Characterization Network
In the characterization network, we perform representation learning by fusing the multivariate signals of multiple sensors. The low-dimensional representation contains two components: (1) the features abstracted from the multivariate signals; (2) the reconstruction error under a distance metric such as the Euclidean or Minkowski distance. To keep the autoencoder from generalizing too well to abnormal inputs, the optimization function combines a reconstruction loss, which measures how close the reconstructed input is to the original input, and a regularization term, which measures the similarity between two distributions (i.e., the distribution of the low-dimensional features and a Gaussian distribution).
3.3.1 Deep feature extraction
We employ a deep convolutional autoencoder to learn low-dimensional features. Specifically, given the multi-sensor time series over a window, we pack them into a matrix $X$. The matrix is then fed to a deep convolutional autoencoder (CAE). The CAE model is composed of two parts, an encoder and a decoder, as in Eq. (1) and Eq. (2). Letting $\hat{X}$ denote the reconstruction with the same shape as $X$, the model computes the low-dimensional representation $z$ as follows:

$z = f_{\mathrm{enc}}(X)$   (1)

$\hat{X} = f_{\mathrm{dec}}(z)$   (2)
The encoder in Eq. (1) maps an input matrix through several convolutional and pooling layers. Each convolutional layer is followed by a max-pooling layer to reduce the dimensions of the layers. A max-pooling layer pools features by taking the maximum value over each patch of the feature maps and produces an output feature map whose size is reduced according to the size of the pooling kernel.
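For intuition, the max-pooling step described above can be sketched in a few lines of NumPy; the 2x2 kernel with stride 2 and the names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a feature map of even height/width."""
    h, w = fmap.shape
    # Split into 2x2 blocks and take the maximum of each block.
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 1, 2, 3],
                 [1, 1, 4, 0]])
pooled = max_pool_2x2(fmap)  # -> [[4, 8], [9, 4]]
```

Each output cell keeps only the strongest activation of its patch, which is how the pooling layers shrink the feature maps toward the low-dimensional code.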
The decoder in Eq. (2) maps the hidden representation back to the original input space as a reconstruction. In particular, the decoding operation needs to convert a narrow representation into a wide reconstructed matrix; therefore, transposed convolution layers are used to increase the width and height of the layers. They work almost exactly like convolutional layers, but in reverse.
The difference between the original input $X$ and the reconstruction $\hat{X}$ is called the reconstruction error $E_r$. The error typically used in the autoencoder is the Mean Squared Error (MSE), which measures how close the reconstructed input is to the original input, as in Eq. (3):

$E_r = \| X - \hat{X} \|_2^2$   (3)

where $\|\cdot\|_2$ is the $\ell_2$ norm.
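A reconstruction error of this MSE form can be sketched as follows (function name ours; the actual model computes this over batches of feature matrices):

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Squared L2 distance between an input and its reconstruction."""
    return float(np.sum((x - x_hat) ** 2))

x = np.array([1.0, 2.0, 3.0])
x_hat = np.array([1.0, 2.5, 2.0])
err = reconstruction_error(x, x_hat)  # 0.5**2 + 1.0**2 = 1.25
```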
3.3.2 Handling noisy data
To reduce the influence of noisy data, we need to observe the changes in the low-dimensional features and in the distribution over samples at a finer granularity, thus clearly distinguishing normal from abnormal data.
Inspired by [75], in order to keep the autoencoder from generalizing too well to noisy and abnormal data, we aim to detect "lurking" anomalies that reside in low-density areas of the reduced low-dimensional space. Our proposed method is conceptually similar to using a Gaussian Mixture Model (GMM) as the target distribution. The loss function is complemented by MMD as a regularization term that encourages the distribution of the low-dimensional representation to be similar to a target distribution; this makes the distribution of noisy data close to the distribution of normal training data, thereby reducing the risk of overfitting. Specifically, the Maximum Mean Discrepancy (MMD) [56] is a distance measure between samples of two distributions. Let $z$ denote the latent representation in the latent space. For the CAE with the MMD penalty, a Gaussian distribution in a reproducing kernel Hilbert space is chosen as the target distribution. We compute the kernel MMD as follows:

$\mathrm{MMD}(p, q) = \big\| \mathbb{E}_{z \sim p}[\varphi(z)] - \mathbb{E}_{z' \sim q}[\varphi(z')] \big\|_{\mathcal{H}}$   (4)

Here $p$ is the distribution of the low-dimensional representation and $q$ is the target distribution. The MMD is defined via a feature map $\varphi$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$.

During training, we apply the kernel trick to compute the MMD. It turns out that many kernels, including the Gaussian kernel, lead to the MMD being zero if and only if the distributions are identical. Letting $k(z, z') = \langle \varphi(z), \varphi(z') \rangle_{\mathcal{H}}$, we obtain an alternative characterization of the MMD:

$\mathrm{MMD}^2(p, q) = \mathbb{E}_{z, z' \sim p}[k(z, z')] - 2\,\mathbb{E}_{z \sim p,\, z' \sim q}[k(z, z')] + \mathbb{E}_{z, z' \sim q}[k(z, z')]$   (5)

Here the Gaussian kernel is defined as $k(z, z') = \exp\!\big(-\|z - z'\|^2 / (2\sigma^2)\big)$. Matching the latent representation to the Gaussian target is performed by sampling from the target distribution and approximating the expectations by averaging the kernel evaluated at all pairs of samples.
Note that neural networks are usually trained in batches, i.e., the model is trained on a subsample of the data at each iteration. In this work, we therefore compute the MMD over the set of latent representations produced in one iteration, whose size equals the batch size.
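This batch MMD computation can be sketched with a biased pairwise-kernel estimator and a Gaussian kernel; the function names and the bandwidth `sigma=1.0` are our own choices, not the paper's settings.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix between rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimator of squared MMD between samples x ~ p and y ~ q."""
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

z = np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]])       # latent batch
target = np.array([[0.0, 0.0], [0.1, 0.1], [-0.1, -0.1]])  # target samples
far = target + 5.0                                          # shifted distribution
# Identical samples give zero MMD; a shifted batch gives a larger discrepancy.
```

In training, `mmd2` would be added to the loss so that gradients push the encoder's latent batch toward the target Gaussian samples.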
3.4 Memory Network
To simultaneously capture the spatial and temporal dependencies, our proposed model characterizes complex spatial-temporal patterns by concurrently performing reconstruction analysis and prediction analysis. Considering the importance of the temporal component in time series, we propose nonlinear and linear prediction to detect anomalies by comparing the predicted and the actual next values in the feature space.

The characterization network generates feature representations that include the reconstruction error and the reduced low-dimensional features learned by the CAE at each time step. Denote the input features at time step $t$ as

$y_t = [\, z_t,\ E_r^{(t)} \,]$   (6)

Our goal is to predict the current value $y_t$ from the past values $y_{t-1}, \ldots, y_{t-w}$. The memory network combines a nonlinear-function-based predictor and a linear-function-based predictor to tackle the temporal dependency problem.
3.4.1 Nonlinear prediction
Nonlinear predictor functions come in different types, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) [16] and Gated Recurrent Units (GRUs) [9]. Original RNNs fall short of learning long-term dependencies. In this work, we adopt a Bidirectional LSTM with an attention mechanism [35], which considers the whole/local context while calculating the relevant hidden states. Specifically, the Bidirectional LSTM (BiLSTM) runs over the input in two directions, one LSTM from past to future and one from future to past. Unlike a unidirectional LSTM, the two combined hidden states can, at any point in time, preserve information from both past and future. A BiLSTM unit consists of four components: the input gate $i_t$, the forget gate $f_t$, the output gate $o_t$ and the cell activation vector $c_t$. The hidden state $h_t$ given input $y_t$ is computed as follows:

$i_t = \sigma(W_i y_t + U_i h_{t-1} + b_i)$   (7)

$f_t = \sigma(W_f y_t + U_f h_{t-1} + b_f)$   (8)

$o_t = \sigma(W_o y_t + U_o h_{t-1} + b_o)$   (9)

$\tilde{c}_t = \tanh(W_c y_t + U_c h_{t-1} + b_c)$   (10)

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$   (11)

$h_t = o_t \odot \tanh(c_t)$   (12)

$h_t^{\mathrm{out}} = \overrightarrow{h}_t + \overleftarrow{h}_t$   (13)

where $i_t$, $f_t$, $o_t$ and $c_t$ represent the values of the gates and the cell at time $t$, the $W$, $U$ and $b$ denote weight matrices and bias vectors, $\sigma$ and $\tanh$ are activation functions, and the operator $\odot$ denotes element-wise multiplication. The current cell state $c_t$ consists of two components, namely the previous memory $c_{t-1}$ and the modulated new memory $\tilde{c}_t$; the output $h_t^{\mathrm{out}}$ combines the forward and backward pass outputs. Note that the merge mode by which the forward and backward outputs are combined can take different forms, e.g. sum, multiply, concatenate or average; in this work we use the mode "sum" to obtain the output $h_t^{\mathrm{out}}$.

The attention mechanism for processing sequential data can focus on the features of key steps and reduce the impact of non-key temporal context. Hence, we adopt a temporal attention mechanism that produces a weight vector and merges the raw features from each time step into a segment-level feature vector by multiplying with the weight vector. The attention mechanism works as follows:
$e_t = \tanh(W_a h_t + b_a)$   (14)

$\alpha_t = \dfrac{\exp(v^\top e_t)}{\sum_{t'=1}^{w} \exp(v^\top e_{t'})}$   (15)

$s = \sum_{t=1}^{w} \alpha_t\, h_t$   (16)

$\hat{y}^{\,\mathrm{nl}}_t = s$   (17)

Here $W_a$ and $b_a$ are the attention weight and bias, and $v$ is a learned context vector. A weighted sum of the hidden states based on the weights $\alpha_t$ is computed as the context representation $s$, which is taken as the nonlinear prediction $\hat{y}^{\,\mathrm{nl}}_t$ for the temporal features.
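A minimal NumPy sketch of this temporal attention step follows; the parameter names (`Wa`, `ba`, `v`) are our own, and the random matrices stand in for trained weights.

```python
import numpy as np

def attention_context(H, Wa, ba, v):
    """Weight hidden states H (T x d) over time and return their weighted sum."""
    u = np.tanh(H @ Wa + ba)           # (T, d) score features
    scores = u @ v                     # (T,) alignment scores
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()        # softmax over time steps
    return alpha, alpha @ H            # attention weights and context vector

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 4))        # 5 time steps, hidden size 4
alpha, context = attention_context(H, rng.standard_normal((4, 4)),
                                   np.zeros(4), rng.standard_normal(4))
```

The weights `alpha` sum to one, so the context vector is a convex combination of the per-step hidden states, emphasizing the most relevant time steps.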
3.4.2 Linear prediction
The Autoregressive (AR) model is a regression model that uses the dependencies between an observation and a number of lagged observations. Nonlinear recurrent networks are theoretically more expressive and powerful than AR models; in practice, however, AR models also yield good results in short-term forecasting, and on specific real datasets the infinite-horizon memory of recurrent models is not always effective. Therefore, we incorporate an AR model in parallel with the nonlinear part of the memory network.
The AR model is formulated as follows:
$\hat{y}^{\,\mathrm{ar}}_t = \sum_{i=1}^{p} w_i\, y_{t-i} + b$   (18)

where the $w_i$ are the weights of the AR model, $b$ is a constant, and $\hat{y}^{\,\mathrm{ar}}_t$ is the value predicted from the past temporal values. We implement this model as a dense network layer that combines the weights and the data.
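For illustration, an order-p AR predictor can be fit by ordinary least squares; the paper implements the AR part as a trainable dense layer, so the helper below is only a sketch under our own naming.

```python
import numpy as np

def fit_ar(series, p):
    """Fit AR(p) weights (plus a constant) by least squares."""
    # Column i holds the lag-(i+1) values aligned with the targets series[p:].
    X = np.column_stack([series[p - i - 1:len(series) - i - 1] for i in range(p)])
    X = np.column_stack([X, np.ones(len(X))])      # append bias column
    w, *_ = np.linalg.lstsq(X, series[p:], rcond=None)
    return w

def predict_ar(series, w, p):
    """One-step-ahead prediction from the last p values (most recent first)."""
    lags = series[-1:-p - 1:-1]
    return float(lags @ w[:p] + w[p])

# Exact AR(1) process y_t = 0.5 * y_{t-1}: the fit recovers the dynamics.
y = 0.5 ** np.arange(10)
w = fit_ar(y, p=1)
nxt = predict_ar(y, w, p=1)
```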
In the output layer, the prediction error is obtained by computing the difference between the predictor output and the true value $y_t$. The final prediction error integrates the errors of the nonlinear and linear prediction models:

$E_p = \sum_{t} \Big( \big\| y_t - \hat{y}^{\,\mathrm{nl}}_t \big\|_F^2 + \big\| y_t - \hat{y}^{\,\mathrm{ar}}_t \big\|_F^2 \Big)$   (19)

where the sum runs over a subsample of the training data and $\|\cdot\|_F$ is the Frobenius norm.
3.5 Joint optimization
A multi-step approach can easily get stuck in local optima, since its models are trained separately. Therefore, we propose an end-to-end hybrid model that minimizes a compound objective function.
The CAEM objective has four components: an MSE (reconstruction error) term, an MMD (regularization) term, a prediction error term for the nonlinear forecasting task and a prediction error term for the linear forecasting task. Given $N$ training samples, the objective function is constructed as:

$L = \frac{1}{N} \sum_{n=1}^{N} \Big( E_r^{(n)} + \lambda_1\, \mathrm{MMD}^2 + \lambda_2\, \big\| y_t^{(n)} - \hat{y}^{\,\mathrm{nl},(n)}_t \big\|_F^2 + \lambda_3\, \big\| y_t^{(n)} - \hat{y}^{\,\mathrm{ar},(n)}_t \big\|_F^2 \Big)$   (20)

where $N$ is the batch size used for training, $t$ is the current time step, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are meta-parameters controlling the importance of each loss term.
Restating our goals more formally, we would like to:

1) Minimize the reconstruction error in the characterization network, that is, minimize the error in reconstructing the input at every time step; we compute the average error over the time steps of each sample. The purpose is to obtain a better low-dimensional representation of the multi-sensor data.

2) Minimize the MMD loss, which encourages the distribution of the low-dimensional representation to be similar to a target distribution. This makes anomalies deviate from normal data in the reduced dimensions.

3) Minimize the prediction error by integrating the nonlinear and linear predictors. We split the feature sequence obtained from the characterization network into the current value and the past values, and then obtain the predicted values by minimizing the prediction errors. The purpose is to accurately express the information of the next temporal slice using the different predictors, thus updating the low-dimensional features and the reconstruction error.

$\lambda_1$, $\lambda_2$ and $\lambda_3$ are the meta-parameters of CAEM; suitable settings usually achieve desirable results, and MMD acts as a regularization term. The parameter selection is performed in Section 4.8.1.
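Assembling the four terms into one training loss can be sketched as below; the weight values are placeholders, not the paper's tuned settings.

```python
def compound_loss(rec_err, mmd_term, nl_pred_err, ar_pred_err,
                  lam1=0.1, lam2=1.0, lam3=1.0):
    """Weighted sum of reconstruction, MMD, and the two prediction errors."""
    return rec_err + lam1 * mmd_term + lam2 * nl_pred_err + lam3 * ar_pred_err

total = compound_loss(rec_err=0.8, mmd_term=0.2, nl_pred_err=0.5, ar_pred_err=0.3)
# 0.8 + 0.1*0.2 + 1.0*0.5 + 1.0*0.3 = 1.62
```

Because all four terms share the same computation graph, one backward pass through this scalar updates the encoder, decoder, BiLSTM and AR parameters jointly.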
3.6 Inference
Given $N$ training samples, we compute the corresponding decision threshold $\tau$:

$\tau = \mu_L + \sigma_L$   (21)

where $L_n$ denotes the value of the loss function for the $n$-th training sample, $\mu_L$ is the average of the $L_n$, and $\sigma_L$ is their standard deviation. The threshold is thus set, in line with the normal training distribution, at one standard deviation above the mean. In the inference process, the decision rule is: if the loss of a testing sample exceeds $\tau$, the sample is predicted to be "abnormal", otherwise "normal".
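This mean-plus-one-standard-deviation rule can be sketched directly (names ours):

```python
import numpy as np

def decision_threshold(train_losses):
    """Threshold = mean + 1 standard deviation of the training losses."""
    losses = np.asarray(train_losses, dtype=float)
    return float(losses.mean() + losses.std())

def is_anomaly(loss, threshold):
    """A test sample is abnormal when its loss exceeds the threshold."""
    return loss > threshold

thr = decision_threshold([1.0, 1.2, 0.9, 1.1, 0.8])
```

Since the threshold depends only on training losses, it can be fixed once after training and applied unchanged to every incoming test sample.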
The complete training and inference procedure of CAEM is shown in Algorithm 1.
4 Experiments
In this section, we conduct extensive experiments to evaluate the performance of our proposed CAEM approach for anomaly detection on several real-world datasets.
4.1 Datasets
We adopt two large publicly available datasets and a private dataset: PAMAP2, CAP and the Mental Fatigue dataset. These datasets exploit multi-sensor time series for activity recognition, sleep state detection and mental fatigue detection, respectively, and are therefore ideal testbeds for evaluating anomaly detection algorithms.
The PAMAP2 [49] dataset is a mobile activity dataset from the UCI repository, containing data of 18 different physical activities performed by 9 subjects wearing 3 inertial measurement units (each with an accelerometer, gyroscope and magnetometer). For the experiments, we treat the classes with relatively fewer samples as the anomaly classes (running, ascending stairs, descending stairs and rope jumping), while the remaining categories are combined to form the normal class.
The CAP Sleep Database [61], which stands for the Cyclic Alternating Pattern (CAP) database, is a clinical dataset from the PhysioNet repository. It is characterized by periodic physiological signals occurring during wakefulness, sleep stages S1-S4 and REM sleep. The waveforms include at least 3 EEG channels, 2 EOG channels, an EMG signal, a respiration signal and an EKG signal. There are 16 healthy subjects and 92 patients in the database; the pathological recordings include patients diagnosed with bruxism, insomnia, narcolepsy, nocturnal frontal lobe epilepsy, periodic leg movements, REM behavior disorder and sleep-disordered breathing. In this task, we extracted 7 valid channels (ROC-LOC, C4-P4, C4-A1, F4-C4, P4-O2, ECG1-ECG2 and EMG1-EMG2). For detecting sleep apnea events, we chose the healthy subjects as the normal class and the patients with sleep-disordered breathing as the anomaly class.
Mental Fatigue Dataset [72] is a real-world healthcare dataset. Aiming to detect mental fatigue in a healthy group, we collected physiological signals (e.g., GSR, HR, RR intervals, and skin temperature) using a wearable device. Six healthy young subjects participated in the mental fatigue experiments. In this task, non-fatigue samples are labeled as the normal class and fatigue samples as the anomaly class; fatigue data accounts for a fifth of the total.
The detailed information of the datasets is shown in TABLE I.
4.2 Baseline Methods
In order to extensively evaluate the performance of the proposed CAEM approach, we compare it with several traditional and deep anomaly detection methods:
(1) KPCA (Kernel principal component analysis) [21], a nonlinear extension of PCA commonly used for anomaly detection. (2) ABOD (Angle-based outlier detection) [28], a probabilistic model well suited to high-dimensional data. (3) OCSVM (One-class support vector machine) [39], a one-class learning method that classifies new data as similar or different to the training set. (4) HMM [22], a finite set of states, each associated with a probability distribution; in a particular state, an observation is generated according to the associated distribution. (5) CNNLSTM [10], a forecasting model composed of convolutional and LSTM networks; it forecasts the current data and detects anomalies by comparing the forecast values with the actuals. (6) LSTMAE (LSTM-based autoencoder) [40], an unsupervised detection technique for time series that learns a representation by approximating the identity function of the data. (7) ConvLSTMCOMPOSITE [41], which uses a composite structure able to encode the input, reconstruct it, and predict its near future; for brevity we write "ConvLSTMCOMP". We choose the "conditional" version and also build a single model, ConvLSTMAE, by removing the forecasting decoder. (8) UODA (Unsupervised sequential outlier detection with deep architecture) [37], which uses autoencoders to capture the intrinsic difference between normal and abnormal samples and then integrates them with RNNs that are fine-tuned to update the DAE parameters. (9) MSCRED (Multi-scale convolutional recurrent encoder-decoder) [68], a reconstruction-based anomaly detection and diagnosis method.
Method  PAMAP2  CAP dataset  Fatigue dataset  
mPre  mRec  mF1  mPre  mRec  mF1  mPre  mRec  mF1  
KPCA  0.7236  0.6579  0.6892  0.7603  0.5847  0.6611  0.5341  0.5014  0.5173 
ABOD  0.8653  0.9022  0.8834  0.7867  0.6365  0.7037  0.6679  0.6145  0.6401 
OCSVM  0.7600  0.7204  0.7397  0.9267  0.9259  0.9263  0.5605  0.5710  0.5290 
HMM  0.6950  0.6553  0.6745  0.8238  0.8078  0.8157  0.6066  0.6076  0.6071 
CNNLSTM  0.6680  0.5392  0.5968  0.6159  0.5217  0.5649  0.5780  0.5042  0.5386 
LSTMAE  0.8619  0.7997  0.8296  0.7147  0.6253  0.6671  0.7140  0.6820  0.6870 
UODA  0.8957  0.8513  0.8730  0.7557  0.5124  0.6107  0.8280  0.7770  0.8017 
MSCRED  0.6997  0.7301  0.7146  0.6410  0.5784  0.6081  0.8016  0.6802  0.7359 
ConvLSTMAE  0.7359  0.7361  0.7360  0.8150  0.8194  0.8172  0.9010  0.9346  0.9175 
ConvLSTMCOMP  0.8844  0.8842  0.8843  0.8367  0.8377  0.8372  0.9373  0.9316  0.9344 
CAEM (Ours)  0.9608  0.9670  0.9639  0.9939  0.9952  0.9961  0.9962  0.9959  0.9960 
Improvement  7.64%  6.48%  7.96%  6.72%  6.93%  6.98%  5.89%  6.13%  6.16% 
4.3 Implementation details
For the traditional anomaly detection methods, we split the sequential data into segments and extract features from each segment. In the PAMAP2 dataset, multiple sensors are worn at three positions (wrist, chest, ankle); we extract 324 time- and frequency-domain features. In the CAP Sleep dataset, we first apply a Hanning-window low-pass filter to remove the high-frequency components of the signals and then extract 91 features from the EEG, EMG, and ECG signals [58, 47, 14]. In the Mental Fatigue dataset, we preprocess the physiological signals with interpolation and filtering algorithms and then extract 23 features from the Galvanic Skin Response (GSR), Heart Rate (HR), RR-interval, and skin temperature sensors [72]. For the Deep Anomaly Detection (DAD) methods, we filter the multi-sensor signals and pack them into matrices as input to the deep models.
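As a concrete illustration of this segmentation step, the sketch below slices a multi-sensor signal into overlapping windows and computes a few common time-domain features per segment. The window length, stride, and feature set here are illustrative choices, not the exact configuration used in the paper:

```python
import numpy as np

def segment(signal, window, stride):
    """Slice a (time, channels) array into overlapping segments."""
    n = (signal.shape[0] - window) // stride + 1
    return np.stack([signal[i * stride : i * stride + window] for i in range(n)])

def time_features(segments):
    """Per-segment, per-channel mean, std, and peak-to-peak amplitude."""
    feats = [segments.mean(axis=1), segments.std(axis=1),
             segments.max(axis=1) - segments.min(axis=1)]
    return np.concatenate(feats, axis=1)  # shape: (n_segments, 3 * channels)

# toy 2-channel signal with 1000 time steps
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
segs = segment(x, window=128, stride=64)
feats = time_features(segs)
print(segs.shape, feats.shape)  # (14, 128, 2) (14, 6)
```

Frequency-domain features (e.g., band energies from an FFT of each segment) would be appended to the same per-segment feature vector in the same way.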
We reimplement these methods based on open-source repositories (https://pyod.readthedocs.io/en/stable/, https://github.com/7fantasysz/MSCRED) or our own implementations. For KPCA, we employ a Gaussian kernel with a bandwidth of 600, 500, and 0.5 for the PAMAP2, CAP, and Mental Fatigue datasets, respectively. For ABOD, we use nearest neighbors to approximate the full method and reduce its complexity; for an observation, the variance of its weighted cosine scores to all neighbors is taken as the anomaly score. For OCSVM, we apply PCA as a dimension-reduction tool and employ a Gaussian kernel with a bandwidth of 0.1. For HMM, we build a Markov model on the extracted features and compute the anomaly probability from the state sequence generated by the model. For CNNLSTM, we define a CNN-LSTM model in Keras by first defining a 2D convolutional network as a stack of Conv2D and MaxPooling2D layers of the required depth, wrapping it in a TimeDistributed layer, and then defining the LSTM and output layers. For LSTMAE, we use a single-layer LSTM in both the encoder and the decoder. For ConvLSTMCOMPOSITE, we choose the "conditional" version and adapt the technique to anomaly detection in multivariate time series; here we also build a single model, ConvLSTMAE, by removing the forecasting decoder. For UODA, we reimplement the algorithm, customizing the number of layers and the hyperparameters. For MSCRED, we first construct multi-scale matrices for the multi-sensor data, then feed them into the MSCRED model and evaluate its performance. For our own CAEM, we use the library Hyperopt [4]
to select the best hyperparameters (i.e., the time window, the number of neurons, the learning rate, the activation function, the optimization criterion, and the number of iterations). The characterization network uses Conv1–Conv5 with 32, 64, 64, 32, and 1 kernels of size 4×4, respectively, and max-pooling of size 2×2. We use the Rectified Linear Unit (ReLU) as the activation function of the convolutional layers. The memory network contains both a nonlinear and a linear prediction component. The CAEM model is trained in an end-to-end fashion using Keras [24]. The optimization algorithm is Adam, the batch size is set to 32, and the weight coefficients of the compound objective function are set to their selected values. A properly chosen time step usually gives desirable results. Note that in addition to the complete CAEM approach, we further evaluate several of its variants as baselines to justify the effectiveness of each component:

- CAEM-w/o-Pre. CAEM with the linear and nonlinear prediction removed; this variant adopts only the characterization network with the reconstruction loss and the MMD loss.
- CAEM-w/o-Rec+MMD. CAEM with the reconstruction error and MMD removed. Unlike the CNNLSTM model, the characterization network still acts as a deep convolutional autoencoder; the latent representation, learned without a reconstruction error, is fed into the memory network.
- CAEM-w/o-ATTENTION. CAEM without the attention component.
- CAEM-w/o-AR. CAEM without the AR component.
- CAEM-w/o-MMD. CAEM without the MMD component.
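The MMD penalty that distinguishes the full model from the variant without MMD can be estimated with a Gaussian kernel between two sample batches. The sketch below is a minimal biased estimator; the bandwidth and batch shapes are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def gaussian_mmd(x, y, sigma=1.0):
    """Biased squared-MMD estimate between samples x and y with an RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 8))              # latent codes of a batch
prior = rng.normal(size=(64, 8))          # samples from the target distribution
shifted = rng.normal(loc=2.0, size=(64, 8))
# matching distributions give a smaller MMD than mismatched ones
print(gaussian_mmd(z, prior) < gaussian_mmd(z, shifted))
```

Minimizing such a term pulls the latent distribution of the characterization network toward the target distribution, which is what allows the penalty to act as a regularizer against noise and anomalies in the training data.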
Note that anomaly detection problems often involve highly imbalanced classes, hence accuracy is not a suitable evaluation metric. To thoroughly evaluate the performance of our proposed method, we follow existing works [75, 18, 68] and adopt the mean precision, recall, and F1 score as evaluation metrics. The mean precision is the average precision over the normal and abnormal classes, and likewise for the mean recall and F1 score. In the experiments, the train/validation/test sets are split following existing works [68, 37]. Concretely, for each dataset we split the normal samples into training, validation, and test sets, where the training and validation sets contain only normal samples and have no overlap with the test set; the anomalous samples are used only in the test set. The model selection criterion used for tuning the hyperparameters is the error on the validation set.
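Concretely, the mean (class-averaged) metrics can be computed as below. This pure-Python sketch averages the per-class precision, recall, and F1 over the normal (0) and abnormal (1) classes on a toy imbalanced prediction:

```python
def macro_metrics(y_true, y_pred, classes=(0, 1)):
    """Average per-class precision, recall, and F1 over the given classes."""
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

# 1 = anomaly; note the 9:1 imbalance, under which plain accuracy is misleading
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 85 + [1] * 5 + [1] * 8 + [0] * 2
m_pre, m_rec, m_f1 = macro_metrics(y_true, y_pred)
print(round(m_pre, 4), round(m_rec, 4), round(m_f1, 4))  # → 0.7962 0.8722 0.8281
```

Because both classes contribute equally to the average, a method that ignores the rare anomaly class is penalized, unlike with plain accuracy.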
Method  WAKE  S1  S2  S3  S4  REM  
mPre  mRec  mF1  mPre  mRec  mF1  mPre  mRec  mF1  mPre  mRec  mF1  mPre  mRec  mF1  mPre  mRec  mF1  
KPCA  0.9162  0.8213  0.8662  0.8267  0.7598  0.7918  0.9257  0.9353  0.9305  0.9039  0.8689  0.8861  0.9402  0.9604  0.9502  0.9536  0.9614  0.9575  
ABOD  0.9872  0.8686  0.9242  0.9347  0.5522  0.6942  0.9389  0.6550  0.7716  0.8489  0.6184  0.7155  0.6749  0.6448  0.6595  0.5915  0.5909  0.5912  
OCSVM  0.9784  0.9492  0.9636  0.9655  0.9504  0.9579  0.9395  0.9448  0.9421  0.9714  0.9499  0.9605  0.8701  0.9488  0.9077  0.9784  0.9492  0.9636  
HMM  0.8417  0.8406  0.8411  0.8790  0.8856  0.8823  0.8967  0.8887  0.8927  0.6880  0.6747  0.6813  0.7279  0.7286  0.7282  0.8024  0.8649  0.8325  
LSTMAE  0.6990  0.7178  0.7082  0.6517  0.6492  0.6504  0.7430  0.7331  0.7380  0.7689  0.7828  0.7758  0.7274  0.7569  0.7418  0.6590  0.6887  0.6735  
UODA  0.6159  0.6326  0.6241  0.6762  0.6762  0.6762  0.7290  0.5223  0.6086  0.5716  0.5766  0.5741  0.6626  0.8498  0.6807  0.5626  0.6116  0.5861  

ConvLSTMCOMP  0.9889  0.9772  0.9830  0.9755  0.9850  0.9864  0.9250  0.9127  0.9188  0.9401  0.9041  0.9217  0.8647  0.8866  0.9023  0.9675  0.9949  0.9810  
CAEM  0.9974  0.9949  0.9961  0.9958  0.9950  0.9954  0.9950  0.9950  0.9950  0.9294  0.8842  0.9063  0.9842  0.9950  0.9895  0.9681  0.9950  0.9813 
4.4 Results and Analysis
As shown in TABLE II, we compare our proposed method with traditional and deep anomaly detection methods using the mean precision, recall, and F1 score. Our method outperforms the compared methods, which demonstrates its effectiveness. From TABLE II, we can make the following observations.
For the PAMAP2 dataset, CAEM achieves the highest precision and recall among the 10 compared methods. The traditional methods perform unevenly on PAMAP2, since they are limited by their feature extraction and feature selection. Among the deep learning methods, CNNLSTM has the lowest F1 score, which suggests that prediction-based anomaly detection needs additional constraints such as careful data preprocessing and a suitable anomaly evaluation strategy. LSTMAE, MSCRED, and ConvLSTMAE are all reconstruction-based anomaly detection methods; their performance is limited by the "noisy data" problem, in which the reconstruction error for abnormal inputs can be fit too well. UODA performs reasonably well on PAMAP2, but it is not trained end-to-end: it requires pre-training denoising autoencoders (DAEs) and deep recurrent networks (RNNs) and then fine-tuning the UODA model composed of the DAE and RNN. The ConvLSTMCOMPOSITE model performs better than the other baselines; it consists of a single encoder and two decoders (a reconstruction branch and a prediction branch). Since its effectiveness depends on both the reconstruction error and the prediction error, its performance can be limited by either encoder-decoder branch.
For the CAP dataset, most methods show a low F1 score. Since the CAP dataset contains different sleep stages for each subject, some methods are limited by the high complexity of the data. OCSVM and HMM achieve better performance, benefiting from the reduction in dimensionality from the 36 dimensions of PAMAP2 to the 7 dimensions of CAP. For MSCRED, because the open-source code trains with a batch size of 1, the loss function fails to converge during training and the training speed is slow. Our proposed method achieves about a 7% improvement in F1 score over the existing methods.
For the Fatigue dataset, it is difficult to label fatigue and non-fatigue data manually; the data may therefore contain considerable noise and mislabeled patterns, and most methods fail on this problem. UODA, MSCRED, and ConvLSTM are able to cope with noise and mislabeling in the training data. Our proposed method also handles this problem successfully and achieves at least a 6% improvement in F1 score.
Besides, to test whether our proposed method differs significantly from the baselines, we apply the Wilcoxon signed-rank test [50] to the results in TABLE II. We compute the average p-value of CAEM compared with the other baselines; a p-value of 0.0077 indicates that the performance of our proposed method differs significantly from that of the other methods. The same p-value computation is applied to TABLE III.
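As a sketch of how such a significance test can be run, the snippet below applies scipy's Wilcoxon signed-rank implementation to a set of paired scores. The score values here are hypothetical illustration values, not the paper's results:

```python
from scipy.stats import wilcoxon

# illustrative paired per-setting F1 scores (hypothetical, not from the paper)
caem = [0.96, 0.99, 0.99, 0.99, 0.95, 0.98]
baseline = [0.88, 0.84, 0.93, 0.82, 0.90, 0.91]

# one-sided test: are CAEM's paired scores systematically greater?
stat, p = wilcoxon(caem, baseline, alternative="greater")
print(p < 0.05)  # a small p-value indicates a significant difference
```

Because the test is rank-based and paired, it makes no normality assumption about the score differences, which suits the small number of paired settings compared here.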
4.5 Fine-grained Analysis
In addition to the anomaly detection across classes on each dataset, we conduct a fine-grained analysis to evaluate the performance of each method within each class. Considering intra-class diversity, we run a group of experiments to detect anomalies in the different sleep stages; the physiological signals in different sleep stages differ significantly. We choose the 4 traditional methods and the 3 deep methods with good performance in the global domain as comparison methods. As shown in TABLE III, our architecture is the most robust under the same experimental settings. Several observations from these results are worth highlighting. For ABOD, the test performance is unstable in the local domain: its highest F1 score is 0.92 on WAKE, while its lowest is 0.59 on REM. For KPCA and ConvLSTMCOMPOSITE, the test performance in the local domain far exceeds that in the global domain, which shows that these two models achieve better performance when intra-class data share a similar distribution or regular pattern. For the other methods, the test performance is consistent between the local and global domains. Our proposed method achieves the best test performance in both the local and the global domain. This study clearly justifies the superior representational capacity of our architecture for handling intra-class diversity.
4.6 Effectiveness Evaluation
Method  Worst mF1  Best mF1  Mean mF1 
ABOD  0.6093  0.8507  0.7706 
ConvLSTMCOMP  0.7033  0.9224  0.8493 
UODA  0.5938  0.9336  0.7984 
CAEM  0.8009  0.9433  0.8616 
4.6.1 Leave One Subject Out
In this section, we measure the generalization ability of the models using Leave One Subject Out (LOSO). When the training and test sets contain the same subject, the model tends to learn subject-specific patterns and may be biased when facing a new subject; LOSO therefore helps to evaluate generalization ability. We choose the PAMAP2 dataset to conduct subject-independent experiments on 8 subjects. As can be seen in Fig. 2(a), we evaluate our proposed method and the three methods with relatively high F1 scores. The results show that the deep learning-based methods obtain better performance than the traditional methods. However, complex models such as deep neural networks are prone to overfitting, because their flexibility lets them memorize idiosyncratic patterns in the training set instead of generalizing to unseen data.
TABLE IV shows the best, worst, and average performance over the 8 subjects. We observe that the UODA and ConvLSTMCOMPOSITE models perform well on some specific subjects but fail to reduce overfitting to each test subject, dropping to 0.70 and 0.59 for some subjects. In contrast, CAEM generalizes well to test subjects it has not seen before, reaching an average F1 score of 0.86. Besides, we perform an analysis of variance on repeated measures within subject 1 (corresponding to the numbers in Fig. 2(a)). As shown in TABLE V, CAEM maintains a more stable performance across the repeated measurements. In summary, these results demonstrate that our model improves generalization ability.
Method  mPre  mRec  mF1 
ABOD  0.6240±0.000  0.5946±0.000  0.6090±0.000 
ConvLSTMCOMP  0.8953±0.029  0.8081±0.038  0.8488±0.019 
UODA  0.8155±0.063  0.7464±0.027  0.7782±0.031 
CAEM  0.9437±0.024  0.8191±0.003  0.8770±0.012 
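The LOSO protocol described above can be sketched as a simple group-based splitter. This is a pure-Python illustration; the subject IDs and sample counts are toy placeholders:

```python
def leave_one_subject_out(subjects):
    """Yield (held_out, train_idx, test_idx), holding out one subject at a time."""
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        yield held_out, train, test

# toy example: 8 samples from 4 subjects
subjects = [1, 1, 2, 2, 3, 3, 4, 4]
for held_out, train, test in leave_one_subject_out(subjects):
    # train contains every sample except the held-out subject's
    print(held_out, train, test)
```

Each fold trains on all subjects but one and evaluates on the unseen subject, so per-fold scores directly measure cross-subject generalization rather than within-subject memorization.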
4.6.2 Ablation Study
ID  Method  PAMAP2  CAP dataset  Fatigue dataset  
mPre  mRec  mF1  mPre  mRec  mF1  mPre  mRec  mF1  
1  CAEM-w/o-Pre  0.8103  0.8023  0.8063  0.8299  0.8101  0.8199  0.6005  0.6096  0.6050 
2  CAEM-w/o-Rec+MMD  0.5693  0.5440  0.5563  0.8896  0.7784  0.8303  0.7050  0.6814  0.6930 
3  CAEM-w/o-ATTENTION  0.9151  0.9276  0.9213  0.9251  0.9291  0.9271  0.9605  0.9551  0.9578 
4  CAEM-w/o-AR  0.9060  0.8691  0.8872  0.9634  0.9381  0.9506  0.9046  0.9048  0.9047 
5  CAEM-w/o-MMD  0.9437  0.9550  0.9493  0.9293  0.9213  0.9253  0.9407  0.9288  0.9347 
6  CAEM  0.9608  0.9670  0.9639  0.9939  0.9952  0.9961  0.9962  0.9959  0.9960 
The proposed CAEM approach consists of several components: the CAE, MMD, the attention mechanism, the BiLSTM, and the autoregressive (AR) model. To demonstrate the effectiveness of each component, we conduct ablation studies in this section; the results are shown in Fig. 2(b). The ID numbers represent CAEM without nonlinear and linear prediction, CAEM without reconstruction error and MMD, CAEM without the attention module, CAEM without AR, CAEM without MMD, and the full CAEM, respectively. The results show that removing any of these components causes a corresponding drop in F1 score. The variants without prediction or without reconstruction error achieve relatively low F1 scores, which demonstrates that our composite model is effective and necessary for anomaly detection in multi-sensor time-series data. Compared with the full CAEM model, removing the AR component causes significant performance drops on most of the datasets, which shows the critical role of the AR component in general. Moreover, the attention and MMD components also bring clear performance gains on all the datasets. More details are shown in TABLE VI, where the ID numbers correspond to those in Fig. 2(b).
4.7 Robustness to Noisy Data
In real-world applications, the collection of multi-sensor time-series data can easily be polluted by noise due to changes in the environment or the data collection devices, which poses critical challenges to unsupervised anomaly detection methods. In this section, we evaluate the robustness of different methods to noisy data. We manually control the noisy-data ratio in the training data by injecting Gaussian noise (μ=0, σ=0.3) into a randomly selected fraction of samples, with the ratio varying from 1% to 30%. In Fig. 3, we compare three methods on the PAMAP2 dataset: UODA, ConvLSTMCOMPOSITE, and CAEM, which showed good stability in the experiments above. As the noise increases, the performance of all methods decreases, but the F1 score, precision, and recall of CAEM show no significant decline. Our model remains significantly superior to the others, demonstrating its robustness to noisy data.
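The noise-injection protocol can be sketched as follows. The Gaussian parameters mirror the μ=0, σ=0.3 setting described above, while the data shapes and seed are illustrative:

```python
import numpy as np

def inject_noise(x, ratio, mu=0.0, sigma=0.3, seed=0):
    """Add Gaussian noise to a random `ratio` fraction of the samples in x."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    n_noisy = int(round(ratio * len(x)))
    idx = rng.choice(len(x), size=n_noisy, replace=False)  # samples to pollute
    x[idx] += rng.normal(mu, sigma, size=x[idx].shape)
    return x, idx

clean = np.zeros((200, 16))  # toy training set: 200 samples, 16 features
noisy, idx = inject_noise(clean, ratio=0.10)
print(len(idx), np.count_nonzero(noisy.any(axis=1)))  # 20 samples perturbed
```

Sweeping `ratio` from 0.01 to 0.30 and retraining at each level reproduces the shape of the robustness experiment: performance is measured as a function of the polluted fraction of training samples.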
4.8 Further Analysis
4.8.1 Parameter Sensitivity Analysis
In this section, we evaluate the parameter sensitivity of the CAEM model. CAEM achieves its best performance by adjusting the weight coefficients of the compound objective function. We apply the control variate technique [29] to empirically evaluate the sensitivity of each parameter over a wide range; the results are shown in Fig. 4. Since the MMD loss is larger in magnitude than the other terms, we select its weight coefficient within [1e-04, 1e-07] and the other weight coefficients within [0.1, 0.5, 1, 5, 10, 50]. We adjust one coefficient at a time while fixing the others at their optimal values. As a weight coefficient increases, the F1 score tends to decline. Overall, the performance of CAEM stays robust within a wide range of parameter choices.
4.8.2 Convergence Analysis
Since CAEM involves several components, it is natural to ask whether and how quickly it converges. In this section, we analyze convergence to answer this question. We show the results of each component on the three datasets in Fig. 5. The results demonstrate that, even though the proposed CAEM approach involves several components, it reaches steady performance within fewer than 40 iterations. Therefore, in real applications, CAEM can be applied easily, with fast and steady convergence.
5 Conclusion and Future Work
In this paper, we introduced a Deep Convolutional Autoencoding Memory network, named CAEM, to detect anomalies. The CAEM model uses a composite framework to model the generalized pattern of normal data by capturing spatial-temporal correlations in multi-sensor time-series data. We first build a Deep Convolutional Autoencoder with a Maximum Mean Discrepancy (MMD) penalty to characterize multi-sensor time-series signals and reduce the risk of overfitting caused by noise and anomalies in the training data. To better represent the temporal dependency of sequential data, we use a nonlinear Bidirectional LSTM with attention together with a linear autoregressive model for prediction. Extensive empirical studies on HAR and HC datasets demonstrate that CAEM performs better than the baseline methods.
In future work, we will focus on point-based fine-grained anomaly detection and further improve our method for multi-sensor data by designing proper sparse operations.
Acknowledgment
This work was supported by the Key-Area Research and Development Program of Guangdong Province (No. 2019B010109001), the Science and Technology Service Network Initiative, Chinese Academy of Sciences (No. KFJSTSQYZD202111001), and the Natural Science Foundation of China (No. 61972383, No. 61902377, No. 61902379).
References
 [1] (2019) A data-driven metric learning-based scheme for unsupervised network anomaly detection. Computers & Electrical Engineering 73, pp. 71–83. Cited by: §1, §3.2.
 [2] (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1). Cited by: §2.2.1.
 [3] (2017) Comprehensive survey of deep learning in remote sensing: theories, tools, and challenges for the community. Journal of Applied Remote Sensing 11 (4), pp. 042609. Cited by: §1.
 [4] (2013) Hyperopt: a python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in science conference, pp. 13–20. Cited by: §4.3.
 [5] (2019) Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §1.
 [6] (2018) Dynamic illness severity prediction via multitask rnns for intensive care unit. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 917–922. Cited by: §2.2.3.

 [7] (2020) Fedhealth: a federated transfer learning framework for wearable healthcare. IEEE Intelligent Systems 35 (4), pp. 83–93. Cited by: §1.
 [8] (2019) Cross-position activity recognition with stratified transfer learning. Pervasive and Mobile Computing 57, pp. 1–13. Cited by: §1.
 [9] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.4.1.

 [10] (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634. Cited by: §4.2.
 [11] (2017) Unsupervised and semi-supervised anomaly detection with LSTM neural networks. arXiv preprint arXiv:1710.09207. Cited by: §2.2.2.
 [12] (2020) Inceptiontime: finding alexnet for time series classification. Data Mining and Knowledge Discovery 34 (6), pp. 1936–1962. Cited by: §2.2.3.
 [13] (2017) RNN-based early cyber-attack detection for the Tennessee Eastman process. arXiv preprint arXiv:1709.02232. Cited by: §2.2.2.
 [14] (2017) An overview of feature extraction techniques of ecg. AmericanEurasian Journal of Scientific Research 12 (1), pp. 54–60. Cited by: §4.3.
 [15] (2019) Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. arXiv preprint arXiv:1904.02639. Cited by: §1, §3.2.
 [16] (2017) LSTM: a search space odyssey. IEEE transactions on neural networks and learning systems 28 (10), pp. 2222–2232. Cited by: §3.4.1.
 [17] (2014) Robust multivariate autoregression for anomaly detection in dynamic product ratings. In Proceedings of the 23rd international conference on World wide web, pp. 361–372. Cited by: §2.1.
 [18] (2018) Multidimensional time series anomaly detection: a grubased gaussian mixture variational autoencoder approach. In Asian Conference on Machine Learning, pp. 97–112. Cited by: §4.3.
 [19] (1994) Time series analysis. Vol. 2, Princeton university press Princeton, NJ. Cited by: §2.1.
 [20] (2016) Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733–742. Cited by: §1, §2.2.1, §3.2.

 [21] (2007) Kernel PCA for novelty detection. Pattern Recognition 40 (3), pp. 863–874. Cited by: §2.1, §4.2.
 [22] (2005) Investigating hidden Markov models capabilities in anomaly detection. In Proceedings of the 43rd annual Southeast regional conference, Volume 1, pp. 98–103. Cited by: §4.2.
 [23] (2011) Fault detection using dynamic time warping (dtw) algorithm and discriminant analysis for swine wastewater treatment. Journal of hazardous materials 185 (1), pp. 262–268. Cited by: §2.2.3.
 [24] (2017) Introduction to keras. In Deep learning with Python, pp. 97–111. Cited by: §4.3.
 [25] (2012) Robust kernel density estimation. Journal of Machine Learning Research 13 (Sep), pp. 2529–2565. Cited by: §2.1.
 [26] (2018) An overview of deep learning based methods for unsupervised and semisupervised anomaly detection in videos. Journal of Imaging 4 (2), pp. 36. Cited by: §1.

 [27] (2018) Detecting cyber attacks in industrial control systems using convolutional neural networks. In Proceedings of the 2018 Workshop on Cyber-Physical Systems Security and PrivaCy, pp. 72–83. Cited by: §2.2.2.
 [28] (2008) Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 444–452. Cited by: §4.2.
 [29] (2015) Application of the control variate technique to estimation of total sensitivity indices. Reliability Engineering & System Safety 134, pp. 251–259. Cited by: §4.8.1.
 [30] (2018) Modeling longand shortterm temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104. Cited by: §2.2.2.
 [31] (2007) Outlier detection with kernel density functions. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 61–75. Cited by: §1, §2.1, §3.2.
 [32] (2009) Anomaly detection in sea traffica comparison of the gaussian mixture model and the kernel density estimator. In 2009 12th International Conference on Information Fusion, pp. 756–763. Cited by: §2.1.

 [33] (2018) Anomaly detection with generative adversarial networks for multivariate time series. arXiv preprint arXiv:1809.04758. Cited by: §1, §3.2.
 [34] (2019) MAD-GAN: multivariate anomaly detection for time series data with generative adversarial networks. In International Conference on Artificial Neural Networks, pp. 703–716. Cited by: §1.
 [35] (2019) Bidirectional lstm with attention mechanism and convolutional layer for text classification. Neurocomputing 337, pp. 325–338. Cited by: §3.4.1.
 [36] (2018) Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6536–6545. Cited by: §2.2.2.
 [37] (2017) Unsupervised sequential outlier detection with deep architectures. IEEE transactions on image processing 26 (9), pp. 4321–4330. Cited by: §4.2, §4.3.
 [38] (2017) Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 439–444. Cited by: §2.2.1.
 [39] (2003) Timeseries novelty detection using oneclass support vector machines. In Proceedings of the International Joint Conference on Neural Networks, 2003., Vol. 3, pp. 1741–1745. Cited by: §4.2.
 [40] (2016) LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148. Cited by: §2.2.1, §4.2.
 [41] (2016) Anomaly detection in video using predictive convolutional long shortterm memory networks. arXiv preprint arXiv:1612.00390. Cited by: §2.2.3, §2.2.3, §4.2.
 [42] (2008) Arima model for network traffic prediction and anomaly detection. In 2008 International Symposium on Information Technology, Vol. 4, pp. 1–6. Cited by: §2.1.
 [43] (2013) Spacetime signal processing for distributed pattern detection in sensor networks. IEEE Journal of Selected Topics in Signal Processing 7 (1), pp. 38–49. Cited by: §1, §2.1, §3.2.
 [44] (2018) Robust pca for anomaly detection in cyber networks. arXiv preprint arXiv:1801.01571. Cited by: §2.1.
 [45] (2021) Deep learning for anomaly detection: a review. ACM Computing Surveys (CSUR) 54 (2), pp. 1–38. Cited by: §1.
 [46] (2015) Spatiotemporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309. Cited by: §1.
 [47] (2009) A novel feature extraction for robust emg pattern recognition. arXiv preprint arXiv:0912.3973. Cited by: §4.3.
 [48] (2018) Unsupervised anomaly detection for high dimensional data—an exploratory analysis. In Computational Intelligence for Multimedia Big Data on the Cloud with Engineering Applications, pp. 233–251. Cited by: §1.
 [49] (2012) Introducing a new benchmarked dataset for activity monitoring. In 2012 16th International Symposium on Wearable Computers, pp. 108–109. Cited by: TABLE I, §4.1.
 [50] Wilcoxon signed-rank test. Springer Berlin Heidelberg. Cited by: §4.4.
 [51] (2020) A unifying review of deep and shallow anomaly detection. Cited by: §1.
 [52] (2014) Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4. Cited by: §1, §2.2.1, §3.2.
 [53] (2001) Estimating the support of a highdimensional distribution. Neural computation 13 (7), pp. 1443–1471. Cited by: §1, §2.1, §3.2.

 [54] (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10 (5), pp. 1299–1319. Cited by: §2.1.
 [55] (2018) Anomaly detection for water treatment system based on neural network with automatic architecture optimization. arXiv preprint arXiv:1807.07282. Cited by: §2.2.2.
 [56] (2007) A hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pp. 13–31. Cited by: §3.3.2.
 [57] (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §2.2.3, §2.2.3.
 [58] (2010) EEG signal analysis: a survey. Journal of medical systems 34 (2), pp. 195–212. Cited by: §4.3.
 [59] (2020) Rethinking 1dcnn for time series classification: a stronger baseline. arXiv preprint arXiv:2002.10061. Cited by: §2.2.3.
 [60] (2004) Support vector data description. Machine learning 54 (1), pp. 45–66. Cited by: §2.1.
 [61] (2002) Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (cap) in human sleep. Sleep medicine 3 (2), pp. 187–199. Cited by: TABLE I, §4.1.
 [62] (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. Cited by: §1, §2.2.1.
 [63] (2019) Deep learning for sensor-based activity recognition: a survey. Pattern Recognition Letters 119, pp. 3–11. Cited by: §1.
 [64] (2018) Stratified transfer learning for cross-domain activity recognition. In 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 1–10. Cited by: §1.
 [65] (2018) Deep transfer learning for cross-domain activity recognition. In Proceedings of the 3rd International Conference on Crowd Science and Engineering, pp. 1–8. Cited by: §1.
 [66] (2020) Connecting the dots: multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 753–763. Cited by: §2.2.3.
 [67] (2010) Semi-supervised anomaly detection for EEG waveforms using deep belief nets. In 2010 Ninth International Conference on Machine Learning and Applications, pp. 436–441. Cited by: §2.2.1.
 [68] (2019) A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1409–1416. Cited by: §1, §2.2.1, §4.2, §4.3.
 [69] (2018) Salient subsequence learning for time series clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2193–2207. Cited by: §2.2.3.
 [70] (2018) Multimodality sensor data classification with selective attention. arXiv preprint arXiv:1804.05493. Cited by: §2.2.3.
 [71] (2021) Deep unsupervised multi-modal fusion network for detecting driver distraction. Neurocomputing 421, pp. 26–38. Cited by: §1.
 [72] (2018) A deep temporal model for mental fatigue detection. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1879–1884. Cited by: §1, TABLE I, §4.1, §4.3.
 [73] (2017) Spatiotemporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1933–1941. Cited by: §2.2.3, §2.2.3.
 [74] (2017) Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 665–674. Cited by: §2.2.1.
 [75] (2018) Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. Cited by: §1, §2.2.3, §2.2.3, §3.2, §3.3.2, §4.3.