Unsupervised Deep Anomaly Detection for Multi-Sensor Time-Series Signals

Multi-sensor technologies are now applied in many fields, e.g., Health Care (HC), Human Activity Recognition (HAR), and Industrial Control Systems (ICS), and these sensors generate substantial amounts of multivariate time-series data. Unsupervised anomaly detection on multi-sensor time-series data has proven critical in machine learning research. The key challenge is to discover generalized normal patterns by capturing the spatial-temporal correlations in multi-sensor data. Beyond this challenge, noisy data are often intertwined with the training data, which can mislead the model by making it hard to distinguish among normal, abnormal, and noisy data. Few previous studies jointly address these two challenges. In this paper, we propose a novel deep learning-based anomaly detection algorithm called the Deep Convolutional Autoencoding Memory network (CAE-M). We first build a deep convolutional autoencoder to characterize the spatial dependence of multi-sensor data, with a Maximum Mean Discrepancy (MMD) penalty to better distinguish among noisy, normal, and abnormal data. Then, we construct a memory network consisting of a linear predictor (an autoregressive model) and a non-linear predictor (a bidirectional LSTM with attention) to capture temporal dependence in the time-series data. Finally, CAE-M jointly optimizes these two subnetworks. We empirically compare the proposed approach with several state-of-the-art anomaly detection methods on HAR and HC datasets. Experimental results demonstrate that our proposed model outperforms the existing methods.


1 Introduction

Anomaly detection has been one of the core research areas in machine learning for decades, with wide applications such as cyber-intrusion detection [5], medical care [72], sensor networks [3], and video anomaly detection [26]. Anomaly detection may seem to be a simple two-class classification problem, i.e., learning to classify data as normal or abnormal. However, it faces the following challenges. First, the training data are highly imbalanced, since anomalies are extremely rare in a dataset compared to normal instances. Standard classifiers try to maximize classification accuracy, so they often fall into the trap of the overlapping problem: the model classifies the overlapping region as belonging to the majority class while treating the minority class as noise. Second, there is no easy way for users to manually label all training data, especially the anomalies; in many cases, it is prohibitively hard to enumerate all types of anomalous behavior. Due to these challenges, there is a growing trend toward unsupervised learning approaches for anomaly detection, rather than semi-supervised or supervised approaches, since unsupervised methods can handle imbalanced and unlabeled data in a more principled way [48, 45, 51, 71, 8].

Nowadays, the prevalence of sensors in machine learning and pervasive computing research areas such as Health Care (HC) [7, 65] and Human Activity Recognition (HAR) [63, 64] generates a substantial amount of multivariate time-series data. Learning algorithms for multi-sensor time-series signals must give priority to the spatial-temporal correlation of multi-sensor data, and many approaches for modeling spatial-temporal dependencies among multiple sensors have been studied [34, 68, 46]. It may seem straightforward to apply previous unsupervised anomaly detection methods to multi-sensor time-series data; unfortunately, several challenges remain.

First, anomaly detection in the spatial-temporal domain becomes more complicated due to the temporal component of time-series data. Conventional anomaly detection techniques such as PCA [43], k-means [31], OCSVM [53], and Autoencoders [52] are unable to deal with multivariate time-series signals since they cannot simultaneously capture the spatial and temporal dependencies. Second, reconstruction-based models such as Convolutional AutoEncoders (CAEs) [20] and Denoising AutoEncoders (DAEs) [62] are commonly used for anomaly detection. It is generally assumed that anomalous samples compress differently from normal samples, so the reconstruction error is higher for anomalous samples. In reality, influenced by high model complexity and noise in the data, the reconstruction error for abnormal inputs can also be fit well by the trained model [75, 15]; that is, the model becomes robust to noise and anomalies, which is undesirable here. Third, two-step approaches are widely adopted to reduce the dimensionality of multi-sensor data and then detect anomalies. A drawback of such works [33, 1] is that the joint performance of the two baseline models can easily get stuck in local optima, since the two models are trained separately.

To address the above three challenges, this paper presents a novel unsupervised deep learning-based anomaly detection approach for multi-sensor time-series data called the Deep Convolutional Autoencoding Memory network (CAE-M). The CAE-M network is composed of two main sub-networks: a characterization network and a memory network. Specifically, we employ a deep convolutional autoencoder as the feature extraction module, with attention-based Bidirectional LSTMs and an autoregressive model as the forecasting module. By simultaneously minimizing the reconstruction error and the prediction error, the CAE-M model can be jointly optimized. During the training phase, CAE-M is trained to explicitly describe the normal patterns of multi-sensor time-series data. During the detection phase, CAE-M evaluates the compound objective function for each captured test sample. By combining these errors into a composite anomaly score, a fine-grained anomaly detection decision can be made. To summarize, the main contributions of this paper are four-fold:

1) The proposed composite model is designed to characterize complex spatial-temporal patterns by concurrently performing reconstruction and prediction analysis. In the reconstruction analysis, we build a deep convolutional autoencoder to fuse and extract low-dimensional spatial features from multi-sensor signals. In the prediction analysis, we build an attention-based Bidirectional LSTM to capture complex temporal dependencies. Moreover, we incorporate an autoregressive linear model in parallel to improve robustness and adapt to different use cases and domains.

2) To reduce the influence of noisy data, we augment the deep convolutional autoencoder with a Maximum Mean Discrepancy (MMD) penalty. MMD encourages the distribution of the low-dimensional representation to approximate a target distribution. It aims to make the distribution of noisy data close to the distribution of normal training data, thereby reducing the risk of overfitting. Experiments demonstrate that this effectively enhances the robustness and generalization ability of our method.

3) CAE-M is an end-to-end learning model in which the two sub-networks are co-optimized by a compound objective function with weight coefficients. This single-stage approach not only streamlines the learning procedure for anomaly detection, but also prevents the model from getting stuck in local minima through joint optimization.

4) Experiments on three multi-sensor time-series datasets demonstrate that the CAE-M model has superior performance over state-of-the-art techniques. To further verify the effect of the proposed model, fine-grained analysis, effectiveness evaluation, parameter sensitivity analysis, and convergence analysis show that the components of CAE-M together lead to robust performance on all datasets.

The rest of the paper is organized as follows. Section 2 provides an overview of existing methods for anomaly detection. Our proposed methodology and detailed framework are described in Section 3. Performance evaluation and analysis of the experiments follow in Section 4. Finally, Section 5 concludes the paper and sketches directions for possible future work.

2 Related Work

Anomaly detection has been studied for decades. Based on whether labels are used in the training process, methods are grouped into supervised, semi-supervised, and unsupervised anomaly detection. Our main focus is the unsupervised setting. In this section, we review existing approaches for unsupervised anomaly detection, which can be categorized into traditional anomaly detection and deep anomaly detection.

2.1 Traditional anomaly detection

Conventional methods can be divided into three categories. 1) Reconstruction-based methods aim to accurately represent and reconstruct normal data with a model, for example, PCA [43], Kernel PCA [54, 21], and Robust PCA [44]. Specifically, RPCA identifies a low-rank representation in the presence of random noise and outliers by using a convex relaxation of the rank operator. 2) Clustering analysis is also used for anomaly detection, e.g., Gaussian Mixture Models (GMM) [32], k-means [31], and Kernel Density Estimation (KDE) [25]. These methods cluster data samples and find anomalies via a predefined outlierness score. 3) One-class learning models are likewise widely used for anomaly detection. For instance, the One-Class Support Vector Machine (OCSVM) [53] and Support Vector Data Description (SVDD) [60] seek to learn a discriminative hypersphere surrounding the normal samples and then classify new data as normal or abnormal.
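The one-class idea above can be illustrated with a deliberately simplified sketch: instead of solving the full SVDD optimization, we fix the hypersphere center at the mean of the normal training data and set the radius from a distance quantile. All function names and parameters here are illustrative, not from the paper or any library:

```python
import numpy as np

def fit_hypersphere(X_train, quantile=0.95):
    """Naive SVDD-style hypersphere: center = mean of the normal
    training data, radius = the given quantile of training distances."""
    center = X_train.mean(axis=0)
    dists = np.linalg.norm(X_train - center, axis=1)
    radius = np.quantile(dists, quantile)
    return center, radius

def is_anomaly(x, center, radius):
    """Points falling outside the learned hypersphere are flagged."""
    return np.linalg.norm(x - center) > radius

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))   # synthetic "normal" data
center, radius = fit_hypersphere(normal)
print(is_anomaly(np.array([8.0, 8.0]), center, radius))  # far point -> True
```

Real SVDD additionally optimizes the center and allows slack for outliers in the training set; this sketch only conveys the decision rule.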

It is notable that these conventional anomaly detection methods are designed for static data. To capture temporal dependencies appropriately, the Autoregressive (AR) [17], Autoregressive Moving Average (ARMA) [19], and Autoregressive Integrated Moving Average (ARIMA) [42] models are widely used. These models represent time series generated by passing the input through a linear or nonlinear filter that produces the output at each time step from previous output values. Once we have a forecast, we can compare it with the ground truth to detect anomalies. Nevertheless, the AR model and its variants are rarely used for multi-sensor multivariate time series due to their high computational cost.
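As a toy illustration of this forecasting-then-compare scheme (not the paper's method), one can fit an AR model by ordinary least squares and flag time steps whose one-step-ahead residual is unusually large. `fit_ar` and `forecast_residuals` are hypothetical helper names:

```python
import numpy as np

def fit_ar(series, order):
    """Least-squares fit of an AR(order) model: x_t ~ sum_i a_i * x_{t-i} + c."""
    X = np.column_stack([series[order - i - 1 : len(series) - i - 1]
                         for i in range(order)])
    X = np.column_stack([X, np.ones(len(X))])       # intercept column
    y = series[order:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def forecast_residuals(series, coef, order):
    """Absolute one-step-ahead prediction errors, usable as anomaly scores."""
    X = np.column_stack([series[order - i - 1 : len(series) - i - 1]
                         for i in range(order)])
    X = np.column_stack([X, np.ones(len(X))])
    return np.abs(series[order:] - X @ coef)

t = np.linspace(0, 20, 400)
x = np.sin(t)
x[200] += 3.0                                    # injected point anomaly
coef = fit_ar(x, order=5)
res = forecast_residuals(x, coef, order=5)
print(int(np.argmax(res)) + 5)                   # index near the injected spike
```

The largest residuals cluster around the injected spike, because both the prediction of the spike itself and predictions that use the corrupted lag go wrong.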

2.2 Deep anomaly detection

In deep learning-based anomaly detection, we discuss reconstruction models, forecasting models, and composite models.

2.2.1 Reconstruction models

The reconstruction model focuses on reducing the expected reconstruction error through different methods. For instance, Autoencoders [52] are often utilized for anomaly detection by learning to reconstruct a given input. The model is trained exclusively on normal data; when it cannot reconstruct an input with quality comparable to the reconstruction of normal data, the input sequence is treated as anomalous. The LSTM Encoder-Decoder model [40] learns a temporal representation of the input time series with LSTM networks and uses the reconstruction error to detect anomalies. Despite its effectiveness, LSTM does not take spatial correlation into consideration. Convolutional Autoencoders (CAEs) [20] are an important method for video anomaly detection; they are capable of capturing the 2D image structure since the weights are shared among all locations in the input image. Furthermore, since the Convolutional LSTM (ConvLSTM) can model spatial-temporal correlations by using convolutional layers instead of fully connected layers, some researchers [68, 38] add ConvLSTM layers to the autoencoder, which better encodes the change of appearance for normal data.

Variational Autoencoders (VAEs) are a special form of autoencoder that models the relationship between two random variables, a latent variable z and a visible variable x. The prior over z is usually a multivariate unit Gaussian N(0, I). For anomaly detection, the authors of [2] define the reconstruction probability as the average probability of the original data being generated from the latent distribution; data points with low reconstruction probability are classified as anomalies, and vice versa. Others, such as Denoising AutoEncoders (DAEs) [62], Deep Belief Networks (DBNs) [67], and the Robust Deep Autoencoder (RDA) [74], have also been reported to perform well for anomaly detection.

2.2.2 Forecasting models

The forecasting model can also be used for anomaly detection. It aims to predict one or more continuous values, e.g., forecasting the current value from the past values. RNNs and LSTMs are the standard models for sequence prediction. In [13, 11], the authors perform anomaly detection by using RNN-based forecasting models to predict values for the next time period and minimize the mean squared error (MSE) between the predicted and actual future values. Recently, there have also been attempts to perform anomaly detection using other feed-forward networks. For instance, Shalyga et al. [55] develop a neural network (NN) based forecasting approach for early anomaly detection. Kravchik and Shabtai [27] apply different variants of convolutional and recurrent networks as forecasting models, and their results show that 1D convolutional networks obtain the best accuracy for anomaly detection in industrial control systems. In another work [30], Lai et al. propose a forecasting model, LSTNet, which uses a CNN and an RNN to extract short-term local dependency patterns and long-term patterns in multivariate time series, and incorporates a linear SVR model. Besides, other efforts [36] use GAN-based anomaly detection: the model adopts U-Net as a generator to predict the next frame in a video and leverages adversarial training to discriminate whether the prediction is real or fake, so abnormal events can be identified by comparing the prediction with the ground truth.

2.2.3 Composite models

Besides single models, composite models for unsupervised anomaly detection have gained a lot of attention recently. Zong et al. [75] utilize a deep autoencoder to generate a low-dimensional representation and reconstruction error, which are further fed into a Gaussian Mixture Model to model the density distribution of the multi-dimensional features. However, this approach does not consider the spatial-temporal dependencies in multivariate time-series data. Different from this work, the Composite LSTM model [57] uses a single encoder LSTM and multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence and predicting the future sequence. In [41], the authors use a ConvLSTM model as a unit within the composite LSTM model, with one branch for reconstruction and another for prediction. This type of composite model is currently used to extract features from video data for action recognition. Similarly, the authors of [73] propose the Spatial-Temporal AutoEncoder (STAE) for video anomaly detection, which utilizes a 3D convolutional architecture to capture spatial-temporal changes. The network is an encoder followed by two decoder branches for reconstructing the past sequence and predicting the future sequence, respectively.

As discussed above, unsupervised anomaly detection techniques still have many deficiencies. For traditional anomaly detection, it is hard to learn representations of spatial-temporal patterns in multi-sensor time-series signals. For a reconstruction model, the single task can make the model store information only about the inputs memorized by the autoencoder, while a forecasting model can suffer from storing only the last few values that matter most for predicting the future [57, 41]. Hence, their performance is limited, since the model learns only trivial representations. Composite models have been designed for different purposes: Zong et al. [75] address the problem of the model being robust to noise and anomalies by performing density estimation in a low-dimensional space, and Zhao et al. [73] consider spatial-temporal dependencies through 3D convolutional reconstruction and forecasting architectures. However, few studies address these issues simultaneously.

Different from these works, our research makes the following contributions: 1) the proposed model is designed to characterize complex spatial-temporal dependencies, thus discovering generalized patterns of multi-sensor data; 2) adding a Maximum Mean Discrepancy (MMD) penalty keeps the model from generalizing too well to noisy data and anomalies; 3) combining an attention-based Bidirectional LSTM (BiLSTM) with a traditional autoregressive linear model boosts the model's performance at different time scales; 4) the composite model is trained end-to-end, which means all components are jointly trained with a compound objective function.

Besides, learning algorithms based on time-series data have been studied for decades. The work in [69] proposes Unsupervised Salient Subsequence Learning to extract subsequences as new representations of time-series data. Due to the inherently sequential relationships, many neural network-based models can be applied to time series in an unsupervised manner; for example, several 1D-CNN models [12, 59] solve time-series tasks with a very simple structure and state-of-the-art performance. Moreover, multiple time-series signals usually exhibit correlations, and [66] proposes a method to learn the relation graph over multiple time series. Anomaly detection applications based on multiple time series are available for wastewater treatment [23], ICUs [6], and sensors [70].

Fig. 1: The overview of the proposed CAE-M model.

3 The Proposed Method

3.1 Notation

In a multi-sensor time-series anomaly detection problem, we are given a dataset D generated by m sensors (m ≥ 1). Without loss of generality, we assume each sensor generates k signals (e.g., an accelerometer often generates 3-axis signals). Denoting the signal set by S, we have n = m × k signals in total. For each signal s_i ∈ S, s_i = (s_i^1, …, s_i^{T_i}), where T_i denotes the length of signal s_i. Note that even though each sensor signal may have a different length, we are often interested in their intersection, i.e., all sensors having the same length T; thus X ∈ R^{n×T} denotes an input sample containing all sensors.

Definition 1 (Unsupervised anomaly detection).

It is non-trivial to formally define an anomaly. In this paper, we are interested in detecting anomalies in a classification problem. Let Y denote the classification label set and C the total number of classes; then the dataset is D = {(X_i, y_i)} with y_i ∈ Y. Our goal is to detect whether an input sample X belongs to one of the C predefined classes with high confidence; if not, we call X an anomaly. Note that in this paper we deal with an unsupervised anomaly detection problem, where the labels are unseen during training, which is obviously more challenging.

3.2 Overview

Several existing works [53, 20, 33] attempt to resolve the unsupervised anomaly detection problem. Unfortunately, they face several critical challenges. First, conventional anomaly detection techniques such as PCA [43], k-means [31], and OCSVM [53] are unable to capture temporal dependencies appropriately because they carry no temporal memory state. Second, since the normal samples might contain noise and anomalies, standard deep anomaly detection approaches such as plain Autoencoders [20, 52] are likely to suffer in generalization capability. Third, multi-stage approaches, in which feature extraction and predictive model building are separated [33, 1], can easily get stuck in local optima.

In this paper, we present a novel approach called the Deep Convolutional Autoencoding Memory network (CAE-M) to tackle the above challenges. Fig. 1 gives an overview of the proposed method. In a nutshell, CAE-M is built upon a convolutional autoencoder whose output is fed into a predictive network. Concretely, we encode the spatial information of multi-sensor time-series signals into a low-dimensional representation via a deep convolutional autoencoder (CAE). To reduce the effect of noisy data, some existing works add a memory module [15] or a Gaussian Mixture Model (GMM) [75]. In our proposed method, we simplify these modules into a penalty term, which we call the Maximum Mean Discrepancy (MMD) penalty. Adding an MMD term encourages the distribution of the training data representation to approximate a common target distribution, such as a Gaussian distribution, thus reducing the risk of overfitting caused by noise and anomalies in the training data [75]. We then feed the representation and the reconstruction error into the subsequent prediction network, based on a Bidirectional LSTM (Bi-LSTM) with an attention mechanism and an autoregressive (AR) model, which predicts future feature values by modeling the temporal information. Through this composite model, the spatial-temporal dependencies of multi-sensor time-series signals can be captured. Finally, we propose a compound objective function with weight coefficients to guide end-to-end training. For normal data, the reconstruction generated from the data encoding is similar to the original input sequence and the predicted value is similar to the future value of the time series, while the reconstructed and predicted values for abnormal data deviate greatly. Therefore, during inference we can detect anomalies precisely by computing the loss function of the composite model.

3.3 Characterization Network

In the characterization network, we perform representation learning by fusing the multivariate signals from multiple sensors. The low-dimensional representation contains two components: (1) the features abstracted from the multivariate signals, and (2) the reconstruction error under a distance metric such as the Euclidean or Minkowski distance. To keep the autoencoder from generalizing too well to abnormal inputs, the optimization function combines a reconstruction loss, which measures how close the reconstructed input is to the original input, and a regularization term, which measures the similarity between two distributions (i.e., the distribution of the low-dimensional features and a Gaussian distribution).

3.3.1 Deep feature extraction

We employ a deep convolutional autoencoder to learn the low-dimensional features. Specifically, given n time series of length T, we pack them into a matrix X ∈ R^{n×T} of multi-sensor time-series data. The matrix X is then fed to the deep convolutional autoencoder (CAE). The CAE model is composed of two parts, an encoder f_e and a decoder f_d, as in Eq. (1) and Eq. (2). Assuming that X̂ denotes the reconstruction with the same shape as X, the model computes the low-dimensional representation z as follows:

z = f_e(X)    (1)
X̂ = f_d(z)    (2)

The encoder in Eq. (1) maps an input matrix X to a hidden representation z through several convolutional and pooling layers. Each convolutional layer is followed by a max-pooling layer to reduce the dimensions of the feature maps. A max-pooling layer pools features by taking the maximum value over each patch of the feature maps and produces an output feature map whose size is reduced according to the pooling kernel size.
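The max-pooling operation described above can be sketched in a few lines of numpy; this is a minimal illustration of non-overlapping pooling, not the paper's implementation (which would use a deep learning framework's pooling layers):

```python
import numpy as np

def max_pool2d(x, k=2):
    """Non-overlapping k x k max pooling over a 2-D feature map.
    Trailing rows/columns that do not fill a full window are dropped."""
    h, w = x.shape[0] // k * k, x.shape[1] // k * k
    x = x[:h, :w].reshape(h // k, k, w // k, k)
    return x.max(axis=(1, 3))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 1, 2, 0],
                 [5, 6, 3, 1]])
print(max_pool2d(fmap))   # [[4 8]
                          #  [9 3]]
```

Each 2×2 block of the 4×4 input is reduced to its maximum, halving both spatial dimensions.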

The decoder in Eq. (2) maps the hidden representation z back to the original input space as a reconstruction. In particular, the decoding operation needs to expand a narrow representation into a wide reconstructed matrix, so transposed convolution layers are used to increase the width and height of the layers. They work almost exactly like convolutional layers, but in reverse.

The difference between the original input X and the reconstruction X̂ is called the reconstruction error e. The error typically used in the autoencoder is the Mean Squared Error (MSE), which measures how close the reconstructed input X̂ is to the original input X, as in Eq. (3):

e = ||X − X̂||_2^2    (3)

where ||·||_2 is the ℓ2-norm.
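For concreteness, a per-sample MSE reconstruction error in the spirit of Eq. (3) can be computed as below; the function name and toy arrays are illustrative only:

```python
import numpy as np

def reconstruction_error(X, X_hat):
    """Mean squared reconstruction error per sample (Eq. (3)-style MSE),
    averaging the squared differences over all non-batch dimensions."""
    return np.mean((X - X_hat) ** 2, axis=tuple(range(1, X.ndim)))

X = np.array([[0.0, 1.0], [1.0, 0.0]])
X_hat = np.array([[0.0, 1.0], [0.0, 0.0]])   # second sample poorly reconstructed
print(reconstruction_error(X, X_hat))         # [0.  0.5]
```

The poorly reconstructed sample receives the larger error, which is exactly the signal used downstream for anomaly scoring.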

3.3.2 Handling noisy data

To reduce the influence of noisy data, we need to observe changes in the low-dimensional features and changes in the distribution over samples at a finer granularity, so that normal and abnormal data can be distinguished more clearly.

Inspired by [75], in order to keep the autoencoder from generalizing too well to noisy and abnormal data, we aim to detect "lurking" anomalies that reside in low-density areas of the reduced low-dimensional space. Our proposed method is conceptually similar to using a Gaussian Mixture Model (GMM) as the target distribution. The loss function is complemented by MMD as a regularization term that encourages the distribution of the low-dimensional representation to be similar to a target distribution. This makes the distribution of noisy data close to the distribution of normal training data, thereby reducing the risk of overfitting. Specifically, the Maximum Mean Discrepancy (MMD) [56] is a distance measure between samples of two distributions. Consider the latent representation z ∈ Z, where Z is a latent space (usually R^d). For the CAE with the MMD penalty, a Gaussian distribution in a reproducing kernel Hilbert space is chosen as the target distribution. We compute the kernel MMD as follows:

MMD(P, Q) = || E_{z∼P}[φ(z)] − E_{z'∼Q}[φ(z')] ||_H    (4)

Here P is the distribution of the low-dimensional representation and Q is the target distribution over the set Z. The MMD is defined via a feature map φ: Z → H, where H is a reproducing kernel Hilbert space (RKHS).

During training, we can apply the kernel trick to compute the MMD, and it turns out that many kernels, including the Gaussian kernel, lead to the MMD being zero if and only if the distributions are identical. Letting k(z, z') = ⟨φ(z), φ(z')⟩_H, we obtain an alternative characterization of the (squared) MMD:

MMD²(P, Q) = E_{z,z'∼P}[k(z, z')] − 2 E_{z∼P, u∼Q}[k(z, u)] + E_{u,u'∼Q}[k(u, u')]    (5)

The latent representation is matched to the Gaussian target by sampling from Q and approximating the expectations by averaging the kernel evaluated over all pairs of samples.
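The pairwise-average estimate described above can be sketched with a Gaussian kernel as follows. This is a plain biased empirical estimator under an assumed bandwidth sigma, not the paper's exact implementation:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(z, z_target, sigma=1.0):
    """Biased empirical estimate of squared MMD between two samples,
    following the three-term kernel expansion."""
    return (gaussian_kernel(z, z, sigma).mean()
            - 2 * gaussian_kernel(z, z_target, sigma).mean()
            + gaussian_kernel(z_target, z_target, sigma).mean())

rng = np.random.default_rng(0)
latent = rng.normal(size=(256, 4))             # stand-in for encoder outputs
target = rng.normal(size=(256, 4))             # samples from the N(0, I) target
shifted = latent + 3.0                         # distribution far from the target
print(mmd2(latent, target) < mmd2(shifted, target))  # True
```

A representation whose distribution matches the target yields a near-zero estimate, while a shifted one is heavily penalized, which is exactly the behavior the regularizer relies on.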

Note that neural networks are usually trained in batches, i.e., the model is trained on a subsample of the data at each iteration. In this work, we compute the MMD over the set of latent representations produced in one iteration; that is, the latent representations are denoted as z_1, …, z_N, where N is the batch size.

3.4 Memory Network

To simultaneously capture the spatial and temporal dependencies, our model is designed to characterize complex spatial-temporal patterns by concurrently performing reconstruction analysis and prediction analysis. Considering the importance of the temporal component in time series, we propose a non-linear prediction and a linear prediction that detect anomalies by comparing the future prediction with the next value that actually appears in the feature space.

The characterization network generates feature representations, which include the reconstruction error e and the reduced low-dimensional features z learned by the CAE over T time steps. Denote the input features as h_t for t = 1, …, T:

h_t = [z_t; e_t]    (6)

Our goal is to predict the current value h_t from the past values h_1, …, h_{t−1}. The memory network combines a non-linear predictor and a linear predictor to tackle the temporal dependency problem.

3.4.1 Non-linear prediction

A non-linear predictor can take different forms, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) [16], and the Gated Recurrent Unit (GRU) [9]. Plain RNNs fall short of learning long-term dependencies. In this work, we adopt a Bidirectional LSTM with an attention mechanism [35], which considers the whole/local context while calculating the relevant hidden states. Specifically, the Bidirectional LSTM (BiLSTM) runs over the input in two directions, one LSTM from past to future and one LSTM from future to past. Unlike a unidirectional LSTM, the two combined hidden states are able, at any point in time, to preserve information from both past and future. A BiLSTM unit consists of four components: an input gate i_t, a forget gate f_t, an output gate o_t, and a cell activation vector c_t. The hidden state s_t given input h_t is computed as follows:

i_t = σ(W_i [s_{t−1}, h_t] + b_i)    (7)
f_t = σ(W_f [s_{t−1}, h_t] + b_f)    (8)
o_t = σ(W_o [s_{t−1}, h_t] + b_o)    (9)
c̃_t = tanh(W_c [s_{t−1}, h_t] + b_c)    (10)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t    (11)
s_t = o_t ⊙ tanh(c_t)    (12)
s_t = s_t^f + s_t^b    (13)

where i_t, f_t, o_t, and c_t denote the values of the input gate, forget gate, output gate, and cell state at time t, respectively; the W and b terms denote the weight matrices and bias vectors; σ and tanh are activation functions; and the operator ⊙ denotes element-wise multiplication. The current cell state c_t consists of two components, the previous memory f_t ⊙ c_{t−1} and the modulated new memory i_t ⊙ c̃_t. The output in Eq. (13) combines the forward-pass output s_t^f and the backward-pass output s_t^b. Note that the mode by which the forward and backward outputs are merged can vary, e.g., sum, multiply, concatenate, or average; in this work, we use the "sum" mode to obtain the output s_t.

The attention mechanism for processing sequential data focuses on the features of key time steps to reduce the impact of non-key temporal context. Hence, we adopt a temporal attention mechanism that produces a weight vector and merges the raw features from each time step into a segment-level feature vector by multiplying by the weight vector. The attention mechanism works as follows:

u_t = tanh(W_a s_t + b_a)    (14)
α_t = exp(u_t) / Σ_j exp(u_j)    (15)
c = Σ_t α_t s_t    (16)
ĥ_t^N = c    (17)

Here W_a and b_a are the attention weight and bias. A weighted sum of the hidden states s_t based on the weights α_t is computed as the context representation c, which is taken as the non-linear prediction ĥ_t^N of the temporal features.
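The score-softmax-weighted-sum pattern of temporal attention can be sketched in numpy. This is one common parameterization (with an extra context vector u_w), offered as an illustration rather than the paper's exact formulation; all names are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(H, W_a, b_a, u_w):
    """Score each time step, softmax the scores into weights, and return
    the attention-weighted context vector over hidden states H (T x d)."""
    scores = np.tanh(H @ W_a + b_a) @ u_w      # one scalar score per step
    alpha = softmax(scores)                    # attention weights, sum to 1
    return alpha, alpha @ H                    # context = sum_t alpha_t * h_t

rng = np.random.default_rng(0)
T, d = 6, 4
H = rng.normal(size=(T, d))                    # stand-in BiLSTM hidden states
W_a, b_a, u_w = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
alpha, context = temporal_attention(H, W_a, b_a, u_w)
print(np.isclose(alpha.sum(), 1.0), context.shape)   # True (4,)
```

The softmax guarantees the weights form a convex combination, so the context vector always lies within the span of the hidden states it summarizes.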

3.4.2 Linear prediction

The autoregressive (AR) model is a regression model that uses the dependencies between an observation and a number of lagged observations. Non-linear recurrent networks are theoretically more expressive and powerful than AR models; in practice, however, AR models also yield good results for short-term forecasting, and on specific real datasets the infinite-horizon memory of recurrent models is not always effective. Therefore, we incorporate an AR model in parallel with the non-linear part of the memory network.

The AR model is formulated as follows:

ĥ_t^L = Σ_{i=1}^{w} a_i h_{t−i} + b    (18)

where the a_i are the weights of the AR model, b is a constant, and ĥ_t^L represents the predicted value based on the past temporal values. We implement this model using a dense network layer that combines the weights and the data.

In the output layer, the prediction error is obtained by computing the difference between the output of the predictor and the true value h_t. The final prediction error integrates the outputs of the non-linear and linear prediction models:

L_pred = ||ĥ_t^N − h_t||_F^2 + ||ĥ_t^L − h_t||_F^2    (19)

where the error is averaged over a subsample (batch) of the training data and ||·||_F is the Frobenius norm.

3.5 Joint optimization

A multi-step approach can easily get stuck in local optima, since its models are trained separately. Therefore, we propose an end-to-end hybrid model trained by minimizing a compound objective function.

The CAE-M objective has four components: an MSE (reconstruction error) term, an MMD (regularization) term, a non-linear prediction error term, and a linear prediction error term. Given the samples, the objective function is constructed as:

(20)

where the batch size used for training and the current time step are as defined above, and the remaining meta-parameters control the relative importance of each loss term.

Restating our goals more formally, we would like to:

  • Minimize the reconstruction error in the characterization network, that is, the error in reconstructing the input at every time step; we compute the average error over all time steps of each sample. The purpose is to obtain a better low-dimensional representation of the multi-sensor data.

  • Minimize the MMD loss, which encourages the distribution of the low-dimensional representation to be similar to a target distribution. This makes anomalies deviate from normal data in the reduced dimensions.

  • Minimize the prediction error by integrating the non-linear and linear predictors. We split the sequence obtained from the characterization network into the current value and the past values, and then obtain the predicted values by minimizing the prediction errors. The purpose is to accurately express the information of the next temporal slice using the different predictors, thereby refining the low-dimensional features and the reconstruction error.

  • The meta-parameters of CAE-M control the loss weighting; in practice, the default settings usually achieve desirable results, with MMD acting as a regularization term. The parameter selection is examined in Section 4.8.1.
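The four goals above can be sketched as one compound loss in numpy, with a biased RBF-kernel MMD estimator standing in for the paper's MMD term; the weights and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased squared-MMD estimate between sample sets X and Y (RBF kernel)."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

def cae_m_loss(x, x_rec, z, z_prior, x_t, pred_nl, pred_lin,
               lam1=1e-4, lam2=1.0, lam3=1.0):
    """Compound objective: reconstruction MSE + weighted MMD + two prediction errors."""
    rec = np.mean((x - x_rec) ** 2)           # reconstruction term
    mmd = rbf_mmd2(z, z_prior)                # regularizer on the latent codes
    pre_nl = np.mean((x_t - pred_nl) ** 2)    # non-linear (BiLSTM+attention) prediction
    pre_lin = np.mean((x_t - pred_lin) ** 2)  # linear (AR) prediction
    return rec + lam1 * mmd + lam2 * pre_nl + lam3 * pre_lin
```

A perfect reconstruction and perfect predictions with matching latent distributions drive every term, and hence the total loss, to zero.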

Dataset Domain Instances Dimensions Classes Permissions
PAMAP2 [49] Activity Recognition 1,140,000 27 18 Public
CAP [61] Sleep Stage Detection 921,700,000 21 8 Public
Mental Fatigue Dataset [72] Fatigue Detection 1,458,648 4 2 Private
TABLE I: The detailed statistics of three datasets

3.6 Inference

Given the samples of the training dataset, we compute the corresponding decision threshold (THR):

(21)

where the score of each sample is the sum of its loss-function terms, and the threshold is set according to the normal training distribution: one standard deviation above the mean of the per-sample losses.

In the inference process, the decision rule is: if the loss of a testing sample in a sequence exceeds THR, the sample is predicted to be “abnormal”; otherwise, “normal”.
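The thresholding rule above can be sketched in a few lines (function names are ours): THR is one standard deviation above the mean per-sample training loss, and samples whose loss exceeds THR are flagged.

```python
import numpy as np

def decision_threshold(train_losses):
    """THR = mean + 1 std of the per-sample loss on normal training data."""
    losses = np.asarray(train_losses, dtype=float)
    return losses.mean() + losses.std()

def classify(losses, thr):
    """Flag samples whose loss exceeds the threshold as abnormal."""
    return ["abnormal" if loss > thr else "normal" for loss in losses]

thr = decision_threshold([1.0, 1.2, 0.8, 1.0])
labels = classify([0.9, 2.0], thr)
```

Because the threshold is computed on normal training data only, no anomaly labels are needed to set it.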

Training process

Input:  Normal dataset, time steps, batch size, and hyperparameters.
Output:  Anomaly decision threshold THR and model parameter.
1:  Transform each sample into in the time axis;
2:  Randomly initialize parameter ;
3:  while not converge do
4:     Calculate low-dimensional representation and reconstruction error at each time step; // Eq. (1)  (3)
5:     Calculate MMD between and Gaussian distribution ; // Eq. (5)
6:     Combine and into for each sample; // Eq. (6)
7:     Predict the current value from the past values by the Attention-based BiLSTM and AR models; // Eq. (7-18)
8:     Update by minimizing the compound objective function; // Eq. (20)
9:  end while
10:  Calculate the decision threshold THR by the training samples; // Eq.(21)
11:  return  Optimal and THR.

Inference process

Input:  Normal and anomalous dataset, threshold THR, model parameter, and hyperparameters.
Output:  Labels of all testing samples.
1:  for all  do
2:     Calculate the loss ; // denotes CAE-M
3:     if  THR then
4:         = “anomaly”;
5:     else
6:         = “normal”;
7:     end if
8:  end for
9:  return  Label of all .
Algorithm 1 Training and Inference procedure of CAE-M

The complete training and inference procedure of CAE-M is shown in Algorithm 1.

4 Experiments

In this section, we conduct extensive experiments to evaluate the performance of our proposed CAE-M approach for anomaly detection on several real-world datasets.

4.1 Datasets

We adopt two large publicly available datasets and a private dataset: PAMAP2, CAP, and the Mental Fatigue dataset. These datasets exploit multi-sensor time series for activity recognition, sleep state detection, and mental fatigue detection, respectively. Therefore, they are ideal testbeds for evaluating anomaly detection algorithms.

PAMAP2 [49] is a mobile activity dataset from the UCI repository containing data of 18 different physical activities performed by 9 subjects wearing 3 inertial measurement units (including accelerometer, gyroscope, and magnetometer sensors). For our experiments, we treat the classes with relatively fewer samples (running, ascending stairs, descending stairs, and rope jumping) as the anomaly classes, while the remaining categories are combined to form the normal class.

The CAP Sleep Database [61], which stands for the Cyclic Alternating Pattern (CAP) database, is a clinical dataset from the PhysioNet repository. It is characterized by periodic physiological signals occurring during wake, the S1-S4 sleep stages, and REM sleep. The waveforms include at least 3 EEG channels, 2 EOG channels, an EMG signal, a respiration signal, and an EKG signal. There are 16 healthy subjects and 92 patients in the database; the pathological recordings include patients diagnosed with bruxism, insomnia, narcolepsy, nocturnal frontal lobe epilepsy, periodic leg movements, REM behavior disorder, and sleep-disordered breathing. In this task, we extracted 7 valid channels (ROC-LOC, C4-P4, C4-A1, F4-C4, P4-O2, ECG1-ECG2, and EMG1-EMG2). For detecting sleep apnea events, we chose the healthy subjects as the normal class and the patients with sleep-disordered breathing as the anomaly class.

The Mental Fatigue Dataset [72] is a real-world healthcare dataset. Aiming to detect mental fatigue in a healthy group, we collected physiological signals (e.g., GSR, HR, R-R intervals, and skin temperature) using a wearable device. Six healthy young subjects participated in the mental fatigue experiments. In this task, non-fatigue samples are labeled as the normal class and fatigue samples as the anomaly class. Fatigue data account for a fifth of the total.

The detailed information of the datasets is shown in TABLE I.

4.2 Baseline Methods

In order to extensively evaluate the performance of the proposed CAE-M approach, we compare it with several traditional and deep anomaly detection methods:

(1) KPCA (Kernel principal component analysis) [21], a non-linear extension of PCA commonly used for anomaly detection. (2) ABOD (Angle-based outlier detection) [28], a probabilistic model well suited for high-dimensional data. (3) OCSVM (One-class support vector machine) [39], a one-class learning method that classifies new data as similar or different to the training set. (4) HMM (Hidden Markov Model) [22], a finite set of states, each associated with a probability distribution; in a particular state, an observation is generated according to the associated distribution. (5) CNN-LSTM [10], a forecasting model composed of convolutional and LSTM networks; it obtains a forecast of the current data and detects anomalies by comparing the forecast value with the actuals. (6) LSTM-AE (LSTM-based autoencoder) [40], an unsupervised detection technique for time series that learns a representation by approximating the identity function of the data. (7) ConvLSTM-COMPOSITE [41], which utilizes a composite structure that encodes the input, reconstructs it, and predicts its near future; for brevity, we denote it “ConvLSTM-COMP”. We choose the “conditional” version, and also build a single model called ConvLSTM-AE by removing the forecasting decoder. (8) UODA (Unsupervised sequential outlier detection with deep architecture) [37], which utilizes autoencoders to capture the intrinsic difference between normal and abnormal samples, and then integrates the model with RNNs that perform fine-tuning to update the parameters of the DAE. (9) MSCRED (Multi-scale convolutional recurrent encoder-decoder) [68], a reconstruction-based anomaly detection and diagnosis method.

Method PAMAP2 CAP dataset Fatigue dataset
mPre mRec mF1 mPre mRec mF1 mPre mRec mF1
KPCA 0.7236 0.6579 0.6892 0.7603 0.5847 0.6611 0.5341 0.5014 0.5173
ABOD 0.8653 0.9022 0.8834 0.7867 0.6365 0.7037 0.6679 0.6145 0.6401
OCSVM 0.7600 0.7204 0.7397 0.9267 0.9259 0.9263 0.5605 0.5710 0.5290
HMM 0.6950 0.6553 0.6745 0.8238 0.8078 0.8157 0.6066 0.6076 0.6071
CNN-LSTM 0.6680 0.5392 0.5968 0.6159 0.5217 0.5649 0.5780 0.5042 0.5386
LSTM-AE 0.8619 0.7997 0.8296 0.7147 0.6253 0.6671 0.7140 0.6820 0.6870
UODA 0.8957 0.8513 0.8730 0.7557 0.5124 0.6107 0.8280 0.7770 0.8017
MSCRED 0.6997 0.7301 0.7146 0.6410 0.5784 0.6081 0.8016 0.6802 0.7359
ConvLSTM-AE 0.7359 0.7361 0.7360 0.8150 0.8194 0.8172 0.9010 0.9346 0.9175
ConvLSTM-COMP 0.8844 0.8842 0.8843 0.8367 0.8377 0.8372 0.9373 0.9316 0.9344
CAE-M (Ours) 0.9608 0.9670 0.9639 0.9939 0.9952 0.9961 0.9962 0.9959 0.9960
Improvement 7.64% 6.48% 7.96% 6.72% 6.93% 6.98% 5.89% 6.13% 6.16%
TABLE II: The mean precision, recall and F1 score of baselines and our proposed method, * p-value = 0.0077.

4.3 Implementation details

For traditional anomaly detection, we segment the sequential data and extract features from each segment. In the PAMAP2 dataset, multiple sensors are worn at three different positions (wrist, chest, ankle); hence, we extract 324 features, including time- and frequency-domain features. In the CAP Sleep dataset, we first apply a Hanning-window low-pass filter to remove the high-frequency components of the signals, and then extract 91 features for the EEG, EMG, and ECG signals [58, 47, 14]. In the Mental Fatigue dataset, we preprocess the physiological signals by interpolation and filtering, and then extract 23 features for the Galvanic Skin Response (GSR), Heart Rate (HR), R-R interval, and skin temperature sensors [72].

For Deep Anomaly Detection (DAD) method, we filter multi-sensor signals and then pack these signals into matrix as input to construct the deep model.

We reimplement these methods based on open-source repositories¹ or our own implementations. For KPCA, we employ a Gaussian kernel with a bandwidth of 600, 500, and 0.5 for the PAMAP2, CAP, and Mental Fatigue datasets, respectively. For ABOD, we use nearest neighbors to approximate the complexity reduction; for an observation, the variance of its weighted cosine scores to all neighbors is viewed as the anomaly score. For OCSVM, we adopt PCA as a dimension-reduction tool and employ a Gaussian kernel with a bandwidth of 0.1. For HMM, we build a Markov model after extracting features and calculate the anomaly probability from the state sequence generated by the model. For CNN-LSTM, we define the model in Keras by first building a 2D convolutional network comprised of Conv2D and MaxPooling2D layers ordered into a stack of the required depth, wrapping them in a TimeDistributed layer, and then defining the LSTM and output layers. For LSTM-AE, we use a single-layer LSTM in both the encoder and the decoder. For ConvLSTM-COMPOSITE, we choose the “conditional” version and adapt this technique to anomaly detection in multivariate time series; here we also build a single model called ConvLSTM-AE by removing the forecasting decoder. For UODA, we reimplement the algorithm, customizing the number of layers and the hyper-parameters. For MSCRED, we first construct multi-scale matrices for the multi-sensor data, then feed them into the MSCRED model and evaluate the performance.

¹https://pyod.readthedocs.io/en/stable/, https://github.com/7fantasysz/MSCRED

For our own CAE-M, we use the Hyperopt library [4] to select the best hyper-parameters (i.e., time window, number of neurons, learning rate, activation function, optimization criteria, and iterations). The characterization network uses Conv1-Conv5 with 32 kernels of size 4×4, 64 kernels of size 4×4, 64 kernels of size 4×4, 32 kernels of size 4×4, and 1 kernel of size 4×4, followed by max-pooling of size 2×2. We use the Rectified Linear Unit (ReLU) as the activation function of the convolutional layers. The memory network contains the non-linear and linear prediction subnetworks. The CAE-M model is trained in an end-to-end fashion using Keras [24]. The optimization algorithm is Adam and the batch size is set to 32. The parameters of the compound objective function are set as reported in Section 4.8.1, and the time step is chosen from a small set of values that usually gives desirable results.

Note that in addition to the complete CAE-M approach, we further evaluate its several variants as baselines to justify the effectiveness of each component:

  • CAE-M w/o Pre. The CAE-M model with the linear and non-linear predictions removed; that is, this variant adopts only the characterization network with the reconstruction loss and the MMD loss.

  • CAE-M w/o Rec+MMD. The CAE-M model with the reconstruction error and MMD removed. Unlike the CNN-LSTM model, the characterization network is still a deep convolutional autoencoder; we feed the latent representation, without the reconstruction error, into the memory network.

  • CAE-M w/o ATTENTION. The CAE-M model without the attention component.

  • CAE-M w/o AR. The CAE-M model without the AR component.

  • CAE-M w/o MMD. The CAE-M model without the MMD component.

Note that anomaly detection problems often involve highly imbalanced classes, so accuracy is not a suitable evaluation metric. In order to thoroughly evaluate the performance of our proposed method, we follow existing works [75, 18, 68] and adopt the mean precision, recall, and F1 score as the evaluation metrics. The mean precision is the average precision over the normal and abnormal classes; the mean recall and mean F1 score are defined analogously.
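The class-averaged metrics can be computed as in this small pure-Python sketch (function and label names are ours):

```python
def macro_scores(y_true, y_pred, classes=("normal", "abnormal")):
    """Mean precision/recall/F1: per-class metrics, then averaged over classes."""
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

mpre, mrec, mf1 = macro_scores(
    ["normal", "normal", "abnormal", "abnormal"],
    ["normal", "abnormal", "abnormal", "abnormal"])
```

Averaging over both classes keeps the rare abnormal class from being drowned out, which is why these metrics suit imbalanced detection problems.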

In the experiments, the train-validation-test sets are split following existing works [68, 37]. Concretely, for each dataset we split the normal samples into training, validation, and test sets with a fixed ratio, where the training and validation sets contain only normal samples and have no overlap with the test set. The anomalous samples are used only in the test set. The model-selection criterion used for tuning hyperparameters is the error on the validation set.
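A sketch of this split protocol, assuming generic train/validation fractions (the concrete ratio follows the cited works and is not restated here, so the fraction values below are placeholders):

```python
import numpy as np

def split_sets(normal, anomalous, train_frac=0.6, val_frac=0.2, seed=0):
    """Train/val come from normal data only; test = held-out normal + all anomalies."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(normal))
    n_tr = int(train_frac * len(normal))
    n_va = int(val_frac * len(normal))
    train = normal[idx[:n_tr]]
    val = normal[idx[n_tr:n_tr + n_va]]
    test = np.concatenate([normal[idx[n_tr + n_va:]], anomalous])
    return train, val, test

normal = np.arange(10.0)
anomalous = np.array([100.0, 101.0, 102.0])
train, val, test = split_sets(normal, anomalous)
```

Keeping every anomalous sample out of training is what makes the setting fully unsupervised.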

Method WAKE S1 S2 S3 S4 REM
mPre mRec mF1 mPre mRec mF1 mPre mRec mF1 mPre mRec mF1 mPre mRec mF1 mPre mRec mF1
KPCA 0.9162 0.8213 0.8662 0.8267 0.7598 0.7918 0.9257 0.9353 0.9305 0.9039 0.8689 0.8861 0.9402 0.9604 0.9502 0.9536 0.9614 0.9575
ABOD 0.9872 0.8686 0.9242 0.9347 0.5522 0.6942 0.9389 0.6550 0.7716 0.8489 0.6184 0.7155 0.6749 0.6448 0.6595 0.5915 0.5909 0.5912
OCSVM 0.9784 0.9492 0.9636 0.9655 0.9504 0.9579 0.9395 0.9448 0.9421 0.9714 0.9499 0.9605 0.8701 0.9488 0.9077 0.9784 0.9492 0.9636
HMM 0.8417 0.8406 0.8411 0.8790 0.8856 0.8823 0.8967 0.8887 0.8927 0.6880 0.6747 0.6813 0.7279 0.7286 0.7282 0.8024 0.8649 0.8325
LSTM-AE 0.6990 0.7178 0.7082 0.6517 0.6492 0.6504 0.7430 0.7331 0.7380 0.7689 0.7828 0.7758 0.7274 0.7569 0.7418 0.6590 0.6887 0.6735
UODA 0.6159 0.6326 0.6241 0.6762 0.6762 0.6762 0.7290 0.5223 0.6086 0.5716 0.5766 0.5741 0.6626 0.8498 0.6807 0.5626 0.6116 0.5861
ConvLSTM-COMP 0.9889 0.9772 0.9830 0.9755 0.9850 0.9864 0.9250 0.9127 0.9188 0.9401 0.9041 0.9217 0.8647 0.8866 0.9023 0.9675 0.9949 0.9810
CAE-M 0.9974 0.9949 0.9961 0.9958 0.9950 0.9954 0.9950 0.9950 0.9950 0.9294 0.8842 0.9063 0.9842 0.9950 0.9895 0.9681 0.9950 0.9813
TABLE III: Detection performance in different sleep stages of baselines and our proposed method, *p-value = 0.0074.

4.4 Results and Analysis

As shown in TABLE II, we compare our proposed method with traditional and deep anomaly detection methods using the mean precision, recall and F1 score. We can see that our method outperforms most of the existing methods, which demonstrates the effectiveness of our method. From TABLE II, we can observe the following results.

For the PAMAP2 dataset, CAE-M achieves the highest precision and recall compared with 10 popular methods. Traditional methods perform unevenly on PAMAP2, since they are limited by the feature extraction and selection methods. Among the deep learning methods, CNN-LSTM has the lowest F1 score; this means that prediction-based anomaly detection needs additional constraints, such as a data preprocessing method and an anomaly evaluation strategy. LSTM-AE, MSCRED, and ConvLSTM-AE are all reconstruction-based anomaly detection methods; their performance is limited by the “noisy data” problem, as the reconstruction error for abnormal inputs can be fit too well. UODA performs reasonably well on the PAMAP2 dataset, but it is not trained end-to-end: it requires pre-training denoising autoencoders (DAEs) and deep recurrent networks (RNNs), and then fine-tuning the UODA model composed of the DAE and RNN. The ConvLSTM-COMPOSITE model performs better than the other baselines; it consists of a single encoder and two decoders forming a reconstruction branch and a prediction branch. However, since its effectiveness is influenced by the reconstruction error and the prediction error separately, its performance can be limited by either of the encoder-decoder models.

For the CAP dataset, most methods show a low F1 score. As the CAP dataset contains different sleep stages of the subjects, some methods are limited by the high complexity of the data. OCSVM and HMM achieve better performance thanks to the dimensionality reduction from the 36 dimensions of the PAMAP2 dataset to 7 dimensions. For MSCRED, because the batch size is 1 in the open-source training code, the loss function fails to converge during training and the training speed is slow. Our proposed method achieves an improvement of about 7% in F1 score compared with the existing methods.

For the Fatigue dataset, it is difficult to label fatigue and non-fatigue data manually; therefore, there may be considerable noise or mislabeled patterns in the data, causing most methods to fail. UODA, MSCRED, and ConvLSTM have some ability to overcome noise and mislabeling in the training data. Our proposed method also handles this problem successfully and achieves at least a 6% improvement in F1 score.

Besides, in order to test for significant differences between our proposed method and the other baselines, we use the Wilcoxon signed-rank test [50] to analyze the results in TABLE II. We compute the average p-value of CAE-M compared with the other baselines; a p-value of 0.0077 indicates that the performance of our proposed method differs significantly from that of the other methods. The corresponding p-value for TABLE III is computed in the same way.

4.5 Fine-grained Analysis

In addition to the anomaly detection of different classes on each dataset, we conduct a fine-grained analysis to evaluate the performance of each method within each class. Considering intra-class diversity, we conduct a group of experiments to detect anomalies in different sleep stages; the physiological signals in different sleep stages in fact differ significantly. We choose 4 traditional methods and 3 deep methods with good performance in the global domain as comparison methods. As shown in TABLE III, our architecture is the most robust across the same experimental settings. Several observations from these results are worth highlighting. For ABOD, the testing performance is unstable in the local domain: the highest F1 score is 0.92 in WAKE and the lowest is 0.59 in REM. For KPCA and ConvLSTM-COMPOSITE, the testing performance in the local domain far exceeds that in the global domain, demonstrating that these two models achieve better performance when intra-class data have a similar distribution or a regular pattern. For the other methods, the testing performance is consistent between the local and global domains. Our proposed method achieves the best testing performance in both the local and the global domain. This study clearly justifies the superior representational capacity of our architecture in handling intra-class diversity.

4.6 Effectiveness Evaluation

Method Worst mF1 Best mF1 Mean mF1
ABOD 0.6093 0.8507 0.7706
ConvLSTM-COMP 0.7033 0.9224 0.8493
UODA 0.5938 0.9336 0.7984
CAE-M 0.8009 0.9433 0.8616
TABLE IV: The evaluation results on LOSO cross validation approach, including the best, the worst and the mean F1 score of 8 subjects.

4.6.1 Leave One Subject Out

In this section, we measure the generalization ability of the models using Leave One Subject Out (LOSO) evaluation. When the training and testing sets contain the same subject, the model is likely to learn subject-specific patterns and may be biased when applied to a new subject; LOSO therefore helps to evaluate generalization ability. We choose the PAMAP2 dataset, which contains 8 subjects, to conduct subject-independent experiments. As can be seen in Fig. 2(a), we evaluate our proposed method and the three baselines with relatively high F1 scores. Examining the results, one can easily notice that deep learning-based methods obtain better performance than traditional methods. However, complex models such as deep neural networks are prone to overfitting because of their flexibility in memorizing idiosyncratic patterns in the training set instead of generalizing to unseen data.
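The LOSO protocol can be sketched as a generator over subjects (names are ours, not the paper's code):

```python
def loso_splits(subject_ids):
    """Yield (held-out subject, train indices, test indices) per LOSO fold."""
    for s in sorted(set(subject_ids)):
        test = [i for i, sid in enumerate(subject_ids) if sid == s]
        train = [i for i, sid in enumerate(subject_ids) if sid != s]
        yield s, train, test

# one sample-to-subject assignment per recording (illustrative)
folds = list(loso_splits([1, 1, 2, 2, 3]))
```

Each fold trains on all but one subject and tests on the held-out subject, so every subject is unseen exactly once.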

(a) Leave One Subject Out Evaluation
(b) Ablation Study
Fig. 2: Effectiveness evaluation using LOSO method and ablation study.

TABLE IV shows the best, the worst, and the average performance among the 8 subjects. We can observe that the UODA and ConvLSTM-COMPOSITE models perform well on some specific subjects, but fail to reduce the effects of overfitting on each test subject, dropping to 0.70 and 0.59 for some subjects. Compared with these methods, CAE-M generalizes well to test subjects it has not seen before, reaching an average F1 score of 0.86. Besides, we perform a repeated-measures analysis of variance on subject 1 (corresponding to the numbers in Fig. 2(a)). As shown in TABLE V, CAE-M maintains more stable performance over repeated measurements. In summary, the above demonstrates that our model improves generalization ability.

Method mPre mRec mF1
ABOD 0.6240±0.000 0.5946±0.000 0.6090±0.000
ConvLSTM-COMP 0.8953±0.029 0.8081±0.038 0.8488±0.019
UODA 0.8155±0.063 0.7464±0.027 0.7782±0.031
CAE-M 0.9437±0.024 0.8191±0.003 0.8770±0.012
TABLE V: The repeated measures analysis of variance on LOSO cross validation approach of one subject.

4.6.2 Ablation Study

ID Method PAMAP2 CAP dataset Fatigue dataset
mPre mRec mF1 mPre mRec mF1 mPre mRec mF1
1 CAE-M w/o Pre 0.8103 0.8023 0.8063 0.8299 0.8101 0.8199 0.6005 0.6096 0.6050
2 CAE-M w/o Rec+MMD 0.5693 0.5440 0.5563 0.8896 0.7784 0.8303 0.7050 0.6814 0.6930
3 CAE-M w/o ATTENTION 0.9151 0.9276 0.9213 0.9251 0.9291 0.9271 0.9605 0.9551 0.9578
4 CAE-M w/o AR 0.9060 0.8691 0.8872 0.9634 0.9381 0.9506 0.9046 0.9048 0.9047
5 CAE-M w/o MMD 0.9437 0.9550 0.9493 0.9293 0.9213 0.9253 0.9407 0.9288 0.9347
6 CAE-M 0.9608 0.9670 0.9639 0.9939 0.9952 0.9961 0.9962 0.9959 0.9960
TABLE VI: The mean precision, recall and F1 score from variants.

The proposed CAE-M approach consists of several components, including the CAE, MMD, attention mechanism, BiLSTM, and autoregressive model. To demonstrate the effectiveness of each component, we conduct ablation studies in this section, as shown in Fig. 2(b). The ID numbers represent CAE-M without the non-linear and linear predictions, CAE-M without the reconstruction error and MMD, CAE-M without the attention module, CAE-M without AR, CAE-M without MMD, and the full CAE-M, respectively. The experimental results indicate that removing any of these components causes a corresponding drop in F1 score. We observe that the CAE-M model without the prediction or the reconstruction error achieves a relatively low F1 score, demonstrating that our composite model is effective and necessary for anomaly detection in multi-sensor time-series data. Compared with the full CAE-M model, removing the AR component causes significant performance drops on most of the datasets, which shows the critical role of the AR component in general. Moreover, the attention and MMD components also bring notable performance gains on all the datasets. More details are shown in TABLE VI, where the ID numbers correspond to those in Fig. 2(b).

4.7 Robustness to Noisy Data

(a) mF1
(b) mPre
(c) mRec
Fig. 3: Robustness to noisy data.

In real-world applications, the collection of multi-sensor time-series data can easily be polluted with noise due to changes in the environment or in the data collection devices. Noisy data pose a critical challenge to unsupervised anomaly detection methods. In this section, we evaluate the robustness of different methods to noisy data by manually controlling the noise ratio in the training data: we inject Gaussian noise (μ=0, σ=0.3) into a random selection of samples at a ratio varying from 1% to 30%. In Fig. 3 we compare, on the PAMAP2 dataset, the performance of three methods that showed good stability in the above experiments: UODA, ConvLSTM-COMPOSITE, and CAE-M. As the noise increases, the performance of all methods decreases; for CAE-M, however, the F1 score, precision, and recall show no significant decline. Our model remains significantly superior to the others, demonstrating its robustness to noisy data.
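The noise-injection protocol can be sketched as follows; beyond μ=0, σ=0.3 and the sample ratio stated above, the sampling details and names are our assumptions.

```python
import numpy as np

def inject_noise(X, ratio, mu=0.0, sigma=0.3, seed=0):
    """Add Gaussian noise to a randomly chosen `ratio` of training samples."""
    rng = np.random.default_rng(seed)
    Xn = X.copy()
    n = max(1, int(round(ratio * len(Xn))))           # number of polluted samples
    idx = rng.choice(len(Xn), size=n, replace=False)  # which samples get noise
    Xn[idx] += rng.normal(mu, sigma, size=Xn[idx].shape)
    return Xn, idx

X = np.zeros((20, 4))
Xn, idx = inject_noise(X, ratio=0.3)
```

Sweeping `ratio` from 0.01 to 0.30 reproduces the experimental setting described above.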

4.8 Further Analysis

4.8.1 Parameter Sensitivity Analysis

In this section, we evaluate the parameter sensitivity of the CAE-M model. Note that CAE-M achieves its best performance by adjusting the weight coefficients of the compound objective function. We apply the control variates technique [29] to empirically evaluate the sensitivity of each parameter over a wide range. The results are shown in Fig. 4. Since the value of the MMD loss is larger than that of the other terms, we select its weight coefficient within [1e-07, 1e-04] and the other weight coefficients within [0.1, 0.5, 1, 5, 10, 50]. We adjust one coefficient at a time while fixing the others at their optimal values. When a weight coefficient is increased, we observe that the F1 score tends to decline; the optimal parameters are selected accordingly. Overall, the performance of CAE-M stays robust within a wide range of parameter choices.

(a) MMD loss
(b) Bi-LSTM loss
(c) Autoregressive loss
Fig. 4: Parameter sensitivity analysis of the proposed CAE-M approach.
(a) PAMAP2 dataset
(b) CAP dataset
(c) Fatigue dataset
Fig. 5: Convergence analysis of the proposed CAE-M approach on different datasets.

4.8.2 Convergence Analysis

Since CAE-M involves several components, it is natural to ask whether and how quickly it can converge. In this section, we analyze the convergence to answer this question. We extensively show the results of each component on three datasets in Fig. 5. These results demonstrate that even if the proposed CAE-M approach involves several components, it could reach a steady performance within fewer than 40 iterations. Therefore, in real applications, CAE-M can be applied more easily with a fast and steady convergence performance.

5 Conclusion and Future Work

In this paper, we introduced a Deep Convolutional Autoencoding Memory network (CAE-M) for anomaly detection. The CAE-M model uses a composite framework to model generalized patterns of normal data by capturing spatial-temporal correlations in multi-sensor time-series data. We first build a Deep Convolutional Autoencoder with a Maximum Mean Discrepancy (MMD) penalty to characterize multi-sensor time-series signals and reduce the risk of overfitting caused by noise and anomalies in the training data. To better represent the temporal dependency of sequential data, we use a non-linear Bidirectional LSTM with Attention and a linear Autoregressive model for prediction. Extensive empirical studies on HAR and HC datasets demonstrate that CAE-M performs better than the baseline methods.

In future work, we will focus on point-based, fine-grained anomaly detection and further improve our method for multi-sensor data by designing proper sparse operations.

Acknowledgment

This work was supported by Key-Area Research and Development Program of Guangdong Province (No.2019B010109001), Science and Technology Service Network Initiative, Chinese Academy of Sciences (No. KFJ-STS-QYZD-2021-11-001), and Natural Science Foundation of China (No.61972383, No.61902377, No.61902379).

References

  • [1] R. Aliakbarisani, A. Ghasemi, and S. F. Wu (2019) A data-driven metric learning-based scheme for unsupervised network anomaly detection. Computers & Electrical Engineering 73, pp. 71–83. Cited by: §1, §3.2.
  • [2] J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1). Cited by: §2.2.1.
  • [3] J. E. Ball, D. T. Anderson, and C. S. Chan (2017) Comprehensive survey of deep learning in remote sensing: theories, tools, and challenges for the community. Journal of Applied Remote Sensing 11 (4), pp. 042609. Cited by: §1.
  • [4] J. Bergstra, D. Yamins, and D. D. Cox (2013) Hyperopt: a python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in science conference, pp. 13–20. Cited by: §4.3.
  • [5] R. Chalapathy and S. Chawla (2019) Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §1.
  • [6] W. Chen, S. Wang, G. Long, L. Yao, Q. Z. Sheng, and X. Li (2018) Dynamic illness severity prediction via multi-task rnns for intensive care unit. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 917–922. Cited by: §2.2.3.
  • [7] Y. Chen, X. Qin, J. Wang, C. Yu, and W. Gao (2020) Fedhealth: a federated transfer learning framework for wearable healthcare. IEEE Intelligent Systems 35 (4), pp. 83–93. Cited by: §1.
  • [8] Y. Chen, J. Wang, M. Huang, and H. Yu (2019) Cross-position activity recognition with stratified transfer learning. Pervasive and Mobile Computing 57, pp. 1–13. Cited by: §1.
  • [9] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.4.1.
  • [10] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634. Cited by: §4.2.
  • [11] T. Ergen, A. H. Mirza, and S. S. Kozat (2017) Unsupervised and semi-supervised anomaly detection with lstm neural networks. arXiv preprint arXiv:1710.09207. Cited by: §2.2.2.
  • [12] H. I. Fawaz, B. Lucas, G. Forestier, C. Pelletier, D. F. Schmidt, J. Weber, G. I. Webb, L. Idoumghar, P. Muller, and F. Petitjean (2020) Inceptiontime: finding alexnet for time series classification. Data Mining and Knowledge Discovery 34 (6), pp. 1936–1962. Cited by: §2.2.3.
  • [13] P. Filonov, F. Kitashov, and A. Lavrentyev (2017) Rnn-based early cyber-attack detection for the tennessee eastman process. arXiv preprint arXiv:1709.02232. Cited by: §2.2.2.
  • [14] M. K. Gautama and V. K. Giri (2017) An overview of feature extraction techniques of ecg. American-Eurasian Journal of Scientific Research 12 (1), pp. 54–60. Cited by: §4.3.
  • [15] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel (2019) Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. arXiv preprint arXiv:1904.02639. Cited by: §1, §3.2.
  • [16] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber (2017) LSTM: a search space odyssey. IEEE transactions on neural networks and learning systems 28 (10), pp. 2222–2232. Cited by: §3.4.1.
  • [17] N. Günnemann, S. Günnemann, and C. Faloutsos (2014) Robust multivariate autoregression for anomaly detection in dynamic product ratings. In Proceedings of the 23rd international conference on World wide web, pp. 361–372. Cited by: §2.1.
  • [18] Y. Guo, W. Liao, Q. Wang, L. Yu, T. Ji, and P. Li (2018) Multidimensional time series anomaly detection: a GRU-based Gaussian mixture variational autoencoder approach. In Asian Conference on Machine Learning, pp. 97–112. Cited by: §4.3.
  • [19] J. D. Hamilton (1994) Time series analysis. Vol. 2, Princeton university press Princeton, NJ. Cited by: §2.1.
  • [20] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis (2016) Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733–742. Cited by: §1, §2.2.1, §3.2.
  • [21] H. Hoffmann (2007) Kernel PCA for novelty detection. Pattern recognition 40 (3), pp. 863–874. Cited by: §2.1, §4.2.
  • [22] S. S. Joshi and V. V. Phoha (2005) Investigating hidden Markov models capabilities in anomaly detection. In Proceedings of the 43rd annual Southeast regional conference-Volume 1, pp. 98–103. Cited by: §4.2.
  • [23] B. Jun (2011) Fault detection using dynamic time warping (DTW) algorithm and discriminant analysis for swine wastewater treatment. Journal of hazardous materials 185 (1), pp. 262–268. Cited by: §2.2.3.
  • [24] N. Ketkar (2017) Introduction to Keras. In Deep learning with Python, pp. 97–111. Cited by: §4.3.
  • [25] J. Kim and C. D. Scott (2012) Robust kernel density estimation. Journal of Machine Learning Research 13 (Sep), pp. 2529–2565. Cited by: §2.1.
  • [26] B. Kiran, D. Thomas, and R. Parakkal (2018) An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging 4 (2), pp. 36. Cited by: §1.
  • [27] M. Kravchik and A. Shabtai (2018) Detecting cyber attacks in industrial control systems using convolutional neural networks. In Proceedings of the 2018 Workshop on Cyber-Physical Systems Security and PrivaCy, pp. 72–83. Cited by: §2.2.2.
  • [28] H. Kriegel, M. Schubert, and A. Zimek (2008) Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 444–452. Cited by: §4.2.
  • [29] S. Kucherenko, B. Delpuech, B. Iooss, and S. Tarantola (2015) Application of the control variate technique to estimation of total sensitivity indices. Reliability Engineering & System Safety 134, pp. 251–259. Cited by: §4.8.1.
  • [30] G. Lai, W. Chang, Y. Yang, and H. Liu (2018) Modeling long-and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104. Cited by: §2.2.2.
  • [31] L. J. Latecki, A. Lazarevic, and D. Pokrajac (2007) Outlier detection with kernel density functions. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 61–75. Cited by: §1, §2.1, §3.2.
  • [32] R. Laxhammar, G. Falkman, and E. Sviestins (2009) Anomaly detection in sea traffic – a comparison of the Gaussian mixture model and the kernel density estimator. In 2009 12th International Conference on Information Fusion, pp. 756–763. Cited by: §2.1.
  • [33] D. Li, D. Chen, J. Goh, and S. Ng (2018) Anomaly detection with generative adversarial networks for multivariate time series. arXiv preprint arXiv:1809.04758. Cited by: §1, §3.2.
  • [34] D. Li, D. Chen, B. Jin, L. Shi, J. Goh, and S. Ng (2019) MAD-GAN: multivariate anomaly detection for time series data with generative adversarial networks. In International Conference on Artificial Neural Networks, pp. 703–716. Cited by: §1.
  • [35] G. Liu and J. Guo (2019) Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 337, pp. 325–338. Cited by: §3.4.1.
  • [36] W. Liu, W. Luo, D. Lian, and S. Gao (2018) Future frame prediction for anomaly detection – a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6536–6545. Cited by: §2.2.2.
  • [37] W. Lu, Y. Cheng, C. Xiao, S. Chang, S. Huang, B. Liang, and T. Huang (2017) Unsupervised sequential outlier detection with deep architectures. IEEE transactions on image processing 26 (9), pp. 4321–4330. Cited by: §4.2, §4.3.
  • [38] W. Luo, W. Liu, and S. Gao (2017) Remembering history with convolutional LSTM for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 439–444. Cited by: §2.2.1.
  • [39] J. Ma and S. Perkins (2003) Time-series novelty detection using one-class support vector machines. In Proceedings of the International Joint Conference on Neural Networks, 2003., Vol. 3, pp. 1741–1745. Cited by: §4.2.
  • [40] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff (2016) LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148. Cited by: §2.2.1, §4.2.
  • [41] J. R. Medel and A. Savakis (2016) Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390. Cited by: §2.2.3, §2.2.3, §4.2.
  • [42] H. Z. Moayedi and M. Masnadi-Shirazi (2008) ARIMA model for network traffic prediction and anomaly detection. In 2008 International Symposium on Information Technology, Vol. 4, pp. 1–6. Cited by: §2.1.
  • [43] R. Paffenroth, P. Du Toit, R. Nong, L. Scharf, A. P. Jayasumana, and V. Bandara (2013) Space-time signal processing for distributed pattern detection in sensor networks. IEEE Journal of Selected Topics in Signal Processing 7 (1), pp. 38–49. Cited by: §1, §2.1, §3.2.
  • [44] R. Paffenroth, K. Kay, and L. Servi (2018) Robust PCA for anomaly detection in cyber networks. arXiv preprint arXiv:1801.01571. Cited by: §2.1.
  • [45] G. Pang, C. Shen, L. Cao, and A. van den Hengel (2021) Deep learning for anomaly detection: a review. ACM Computing Surveys (CSUR) 54 (2), pp. 1–38. Cited by: §1.
  • [46] V. Patraucean, A. Handa, and R. Cipolla (2015) Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309. Cited by: §1.
  • [47] A. Phinyomark, C. Limsakul, and P. Phukpattaranont (2009) A novel feature extraction for robust EMG pattern recognition. arXiv preprint arXiv:0912.3973. Cited by: §4.3.
  • [48] A. Ramchandran and A. K. Sangaiah (2018) Unsupervised anomaly detection for high dimensional data—an exploratory analysis. In Computational Intelligence for Multimedia Big Data on the Cloud with Engineering Applications, pp. 233–251. Cited by: §1.
  • [49] A. Reiss and D. Stricker (2012) Introducing a new benchmarked dataset for activity monitoring. In 2012 16th International Symposium on Wearable Computers, pp. 108–109. Cited by: TABLE I, §4.1.
  • [50] D. Rey and M. Neuhäuser. Wilcoxon signed-rank test. Springer Berlin Heidelberg. Cited by: §4.4.
  • [51] L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K. Müller (2020) A unifying review of deep and shallow anomaly detection. Cited by: §1.
  • [52] M. Sakurada and T. Yairi (2014) Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4. Cited by: §1, §2.2.1, §3.2.
  • [53] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural computation 13 (7), pp. 1443–1471. Cited by: §1, §2.1, §3.2.
  • [54] B. Schölkopf, A. Smola, and K. Müller (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural computation 10 (5), pp. 1299–1319. Cited by: §2.1.
  • [55] D. Shalyga, P. Filonov, and A. Lavrentyev (2018) Anomaly detection for water treatment system based on neural network with automatic architecture optimization. arXiv preprint arXiv:1807.07282. Cited by: §2.2.2.
  • [56] A. Smola, A. Gretton, L. Song, and B. Schölkopf (2007) A hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pp. 13–31. Cited by: §3.3.2.
  • [57] N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using LSTMs. In International conference on machine learning, pp. 843–852. Cited by: §2.2.3, §2.2.3.
  • [58] D. P. Subha, P. K. Joseph, R. Acharya, and C. M. Lim (2010) EEG signal analysis: a survey. Journal of medical systems 34 (2), pp. 195–212. Cited by: §4.3.
  • [59] W. Tang, G. Long, L. Liu, T. Zhou, J. Jiang, and M. Blumenstein (2020) Rethinking 1D-CNN for time series classification: a stronger baseline. arXiv preprint arXiv:2002.10061. Cited by: §2.2.3.
  • [60] D. M. Tax and R. P. Duin (2004) Support vector data description. Machine learning 54 (1), pp. 45–66. Cited by: §2.1.
  • [61] M. G. Terzano, L. Parrino, A. Smerieri, R. Chervin, S. Chokroverty, C. Guilleminault, M. Hirshkowitz, M. Mahowald, H. Moldofsky, A. Rosa, et al. (2002) Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (cap) in human sleep. Sleep medicine 3 (2), pp. 187–199. Cited by: TABLE I, §4.1.
  • [62] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. Cited by: §1, §2.2.1.
  • [63] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu (2019) Deep learning for sensor-based activity recognition: a survey. Pattern Recognition Letters 119, pp. 3–11. Cited by: §1.
  • [64] J. Wang, Y. Chen, L. Hu, X. Peng, and P. S. Yu (2018) Stratified transfer learning for cross-domain activity recognition. In 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 1–10. Cited by: §1.
  • [65] J. Wang, V. W. Zheng, Y. Chen, and M. Huang (2018) Deep transfer learning for cross-domain activity recognition. In proceedings of the 3rd International Conference on Crowd Science and Engineering, pp. 1–8. Cited by: §1.
  • [66] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang (2020) Connecting the dots: multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 753–763. Cited by: §2.2.3.
  • [67] D. Wulsin, J. Blanco, R. Mani, and B. Litt (2010) Semi-supervised anomaly detection for EEG waveforms using deep belief nets. In 2010 Ninth International Conference on Machine Learning and Applications, pp. 436–441. Cited by: §2.2.1.
  • [68] C. Zhang, D. Song, Y. Chen, X. Feng, C. Lumezanu, W. Cheng, J. Ni, B. Zong, H. Chen, and N. V. Chawla (2019) A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1409–1416. Cited by: §1, §2.2.1, §4.2, §4.3.
  • [69] Q. Zhang, J. Wu, P. Zhang, G. Long, and C. Zhang (2018) Salient subsequence learning for time series clustering. IEEE transactions on pattern analysis and machine intelligence 41 (9), pp. 2193–2207. Cited by: §2.2.3.
  • [70] X. Zhang, L. Yao, C. Huang, S. Wang, M. Tan, G. Long, and C. Wang (2018) Multi-modality sensor data classification with selective attention. arXiv preprint arXiv:1804.05493. Cited by: §2.2.3.
  • [71] Y. Zhang, Y. Chen, and C. Gao (2021) Deep unsupervised multi-modal fusion network for detecting driver distraction. Neurocomputing 421, pp. 26–38. Cited by: §1.
  • [72] Y. Zhang, Y. Chen, and Z. Pan (2018) A deep temporal model for mental fatigue detection. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1879–1884. Cited by: §1, TABLE I, §4.1, §4.3.
  • [73] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X. Hua (2017) Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1933–1941. Cited by: §2.2.3, §2.2.3.
  • [74] C. Zhou and R. C. Paffenroth (2017) Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 665–674. Cited by: §2.2.1.
  • [75] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. Cited by: §1, §2.2.3, §2.2.3, §3.2, §3.3.2, §4.3.