Deep Anomaly Detection for Time-series Data in Industrial IoT: A Communication-Efficient On-device Federated Learning Approach

07/19/2020 · by Yi Liu, et al. · Nanyang Technological University, Wuhan University of Technology

Since edge device failures (i.e., anomalies) seriously affect the production of industrial products in the Industrial IoT (IIoT), detecting anomalies accurately and in a timely manner is becoming increasingly important. Furthermore, the data collected by edge devices may contain users' private data, which challenges current detection approaches as user privacy has drawn growing public concern in recent years. With this focus, this paper proposes a new communication-efficient on-device federated learning (FL)-based deep anomaly detection framework for sensing time-series data in IIoT. Specifically, we first introduce an FL framework that enables decentralized edge devices to collaboratively train an anomaly detection model, which improves its generalization ability. Second, we propose an Attention Mechanism-based Convolutional Neural Network-Long Short Term Memory (AMCNN-LSTM) model to accurately detect anomalies. The AMCNN-LSTM model uses attention mechanism-based CNN units to capture important fine-grained features, thereby preventing memory loss and gradient dispersion problems, while retaining the advantages of the LSTM unit in predicting time-series data. Third, to adapt the proposed framework to the timeliness requirements of industrial anomaly detection, we propose a gradient compression mechanism based on Top-k selection to improve communication efficiency. Extensive experiments on four real-world datasets demonstrate that the proposed framework detects anomalies accurately and in a timely manner, and reduces the communication overhead by 50% compared with a federated learning framework that does not use a gradient compression scheme.


I Introduction

The widespread deployment of edge devices in the Industrial Internet of Things (IIoT) paradigm has spawned a variety of emerging applications with edge computing, such as smart manufacturing, intelligent transportation, and intelligent logistics [1]. Edge devices provide powerful computation resources that enable real-time, flexible, and quick decision making for IIoT applications, which has greatly promoted the development of Industry 4.0 [2]. However, IIoT applications suffer from critical security risks caused by abnormal IIoT nodes, which hinders the rapid development of IIoT. For example, in smart manufacturing scenarios, industrial devices acting as IIoT nodes, e.g., engines with sensors, that exhibit abnormal behaviors (e.g., abnormal traffic and irregular reporting frequency) may interrupt industrial production, resulting in huge economic losses for factories [3, 4]. Edge devices (e.g., industrial robots) generally collect sensing data from IIoT nodes, especially time-series data, to analyze and capture the behaviors and operating conditions of IIoT nodes via edge computing [5]. Therefore, such sensing time-series data can be used to detect the anomalous behaviors of IIoT nodes [6].

To solve the abnormality problems of IIoT devices, typical methods perform anomaly detection on the affected IIoT devices [7, 8, 9, 10]. Previous work focused on utilizing deep anomaly detection (DAD) [11] approaches to detect abnormal behaviors of IIoT devices by analyzing sensing time-series data. DAD techniques can learn hierarchical discriminative features from historical time-series data. In [12, 13, 14], the authors proposed Long Short Term Memory (LSTM) network-based deep learning models to achieve anomaly detection in sensing time-series data. Munir et al. in [15] proposed a novel DAD approach, called DeepAnT, which uses a deep Convolutional Neural Network (CNN) to predict anomalous values. Although the existing DAD approaches have achieved success in anomaly detection, they cannot be directly applied to IIoT scenarios with distributed edge devices for timely and accurate anomaly detection. The reasons are two-fold: (i) most detection models in traditional approaches are not flexible enough, i.e., the edge devices lack dynamically and automatically updated detection models for different scenarios, and hence the models fail to accurately predict frequently updated time-series data [8]; (ii) due to privacy concerns, the edge devices are not willing to share their collected time-series data with each other, so the data exists in the form of “islands.” These data islands significantly degrade the performance of anomaly detection. Furthermore, it is often overlooked that the data may contain sensitive private information, which leads to potential privacy leakage. There are privacy issues in the anomaly detection context: for example, an anomaly detection model may reveal a patient’s heart disease history when detecting the patient’s abnormal pulse [16, 17].

To address the above challenges, a promising on-device privacy-preserving distributed machine learning paradigm, called on-device federated learning (FL), was proposed for edge devices to train a global DAD model while keeping the training datasets local and without sharing raw training data [18]. Such a framework allows edge devices to collaboratively train an on-device DAD model without compromising privacy. For example, the authors in [19] proposed an on-device FL-based approach to achieve collaborative anomaly detection. Tsukada et al. in [20] utilized the FL framework to propose a Backpropagation Neural Network (BPNN)-based approach for anomaly detection. However, previous research ignores the communication overhead of FL model training among large-scale edge devices. Expensive communication may cause excessive overhead and long convergence times for edge devices, so that the on-device DAD model cannot quickly detect anomalies. Therefore, it is necessary to develop a communication-efficient on-device FL framework to achieve accurate and timely anomaly detection on edge devices.

In this paper, we propose a communication-efficient on-device FL framework that leverages an attention mechanism-based CNN-LSTM (AMCNN-LSTM) model to achieve accurate and timely anomaly detection on edge devices. First, we introduce an FL framework that enables distributed edge devices to collaboratively train a global DAD model without compromising privacy. Second, we propose the AMCNN-LSTM model to detect anomalies. Specifically, we use attention-based CNNs to extract fine-grained features of historical observation-sensing time-series data and use LSTM modules for time-series prediction. Such a model can prevent memory loss and gradient dispersion problems. Third, to further improve the communication efficiency of the proposed framework, we propose a gradient compression mechanism based on Top-k selection to reduce the number of gradients uploaded by edge devices. We evaluate the proposed framework on four real-world datasets: power demand, space shuttle, ECG, and engine. Experimental results show that the proposed framework achieves high communication efficiency as well as accurate and timely anomaly detection. The contributions of this paper are summarized as follows:

  • We introduce a federated learning framework to develop an on-device collaborative deep anomaly detection model for edge devices in IIoT.

  • We propose an attention mechanism-based CNN-LSTM model to detect anomalies, which uses a CNN to capture the fine-grained features of time-series data and an LSTM module to detect anomalies accurately and in a timely manner.

  • We propose a Top-k selection-based gradient compression scheme to improve the proposed framework’s communication efficiency. Such a scheme decreases communication overhead by reducing the exchanged gradient parameters between the edge devices and the cloud aggregator.

  • We conduct extensive experiments on four real-world datasets to demonstrate that the proposed framework can accurately detect anomalies with low communication overhead.

II Related Work

II-A Deep Anomaly Detection

Deep Anomaly Detection (DAD), which detects anomalies from data, has long been a hot topic in IIoT. Previous research on DAD can generally be divided into three categories: supervised, semi-supervised, and unsupervised DAD approaches.

Supervised Deep Anomaly Detection: Supervised deep anomaly detection typically uses the labels of normal and abnormal data to train a deep-supervised binary or multi-class classifier. For example, Erfani et al. in [21] proposed a supervised Support Vector Machine (SVM) classifier for high-dimensional data to classify normal and abnormal data. Despite the success of supervised DAD methods in anomaly detection, these methods are not as popular as semi-supervised or unsupervised methods due to the lack of labeled training data [22]. Furthermore, supervised DAD methods perform poorly on class-imbalanced data, where the total number of positive-class samples is much larger than the total number of negative-class samples [12].

Semi-supervised Deep Anomaly Detection: Since labels are easier to obtain for normal instances than for anomalies, semi-supervised DAD techniques utilize a single (normally positive-class) label to separate outliers [11]. For example, Wulsin et al. in [23] applied Deep Belief Nets (DBNs) in a semi-supervised paradigm to model Electroencephalogram (EEG) waveforms for classification and anomaly detection; the semi-supervised DBN performs comparably to a standard classifier on the EEG dataset. The semi-supervised DAD approach is popular because it can detect anomalies using only a single class of labels.

Unsupervised Deep Anomaly Detection: Unsupervised deep anomaly detection techniques use the inherent properties of data instances to detect outliers [11]. For example, Zong et al. in [24] proposed a Deep Autoencoding Gaussian Mixture Model (DAGMM) for unsupervised anomaly detection. Schlegl et al. in [25] proposed a deep convolutional generative adversarial network, called AnoGAN, which detects abnormal anatomical images by learning a variety of normal anatomical images; the model is trained in an unsupervised manner. Unsupervised DAD is widely used since it does not require labeled training data.

II-B Communication-Efficient Federated Learning

Google proposed a privacy-preserving distributed machine learning framework, called FL, to train machine learning models without compromising privacy [26]. With this framework, different edge devices can contribute to global model training while keeping the training data local. However, communication overhead is the bottleneck preventing FL from being widely used in IIoT [18]. Previous work has focused on designing efficient stochastic gradient descent (SGD) algorithms and using model compression to reduce the communication overhead of FL. Agarwal et al. in [27] proposed an efficient cpSGD algorithm to achieve communication-efficient FL. Reisizadeh et al. in [28] used periodic averaging and quantization methods to design a communication-efficient FL framework. Jeong et al. in [29] proposed a federated model distillation method to reduce the communication overhead of FL.

However, the above methods do not substantially reduce the number of gradients exchanged between edge devices and the cloud aggregator. The fact is that a large number of gradients exchanged between the edge devices and the cloud aggregator may cause excessive communication overhead for FL [30]. Therefore, in this paper, we propose a Top-k selection-based gradient compression scheme to improve the communication efficiency of FL.

III Preliminary

In this section, we briefly introduce anomalies, federated deep learning, and gradient compression as follows.

III-A Anomalies

In statistics, anomalies (also called outliers or abnormalities) are data points that differ significantly from other observations [11]. Suppose that $N_1$, $N_2$, and $N_3$ are regions that contain most of the observations, so they are regarded as normal data-instance regions. Data points $o_1$ and $o_2$ that lie far from these regions can be classified as anomalies. To define anomalies more formally, we assume that an $n$-dimensional dataset $X = \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$ follows a normal distribution with mean $\mu_j$ and variance $\sigma_j^2$ for each dimension $j$, where $x^{(i)} \in \mathbb{R}^n$ and $1 \le j \le n$. Specifically, for $x \in \mathbb{R}^n$, under the assumption of the normal distribution, we have:

$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$ (1)

If there is a new vector $x_{\mathrm{new}}$, its probability $p(x_{\mathrm{new}})$ can be calculated as follows:

$p(x_{\mathrm{new}}) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left(-\frac{(x_{\mathrm{new},j} - \mu_j)^2}{2\sigma_j^2}\right)$ (2)

We can then judge whether the vector $x_{\mathrm{new}}$ is an anomaly according to this probability value, e.g., by comparing it against a small threshold $\epsilon$.
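To make this preliminary concrete, the following minimal sketch (Python with NumPy; the function names and the threshold value are illustrative choices, not taken from the paper) fits a per-dimension Gaussian to normal observations and flags a new vector whose probability under Eq. (2) falls below a small threshold.

import numpy as np

def fit_gaussian(X):
    """Estimate per-dimension mean and variance of an (m, n) dataset."""
    return X.mean(axis=0), X.var(axis=0)

def gaussian_prob(x, mu, var):
    """Probability of x under independent per-dimension Gaussians (Eq. 2)."""
    coeff = 1.0 / np.sqrt(2.0 * np.pi * var)
    return float(np.prod(coeff * np.exp(-((x - mu) ** 2) / (2.0 * var))))

# Usage: flag x_new as anomalous when its probability is below a chosen epsilon.
X_train = np.random.randn(1000, 4)                   # stand-in normal observations
mu, var = fit_gaussian(X_train)
x_new = np.array([5.0, -4.5, 6.2, 0.1])              # a point far from the normal region
is_anomaly = gaussian_prob(x_new, mu, var) < 1e-6    # epsilon is illustrative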

III-B Federated Learning

Traditional distributed deep learning techniques require a certain amount of private data to be aggregated and analyzed at central servers (e.g., cloud servers) during the model training phase by using the distributed stochastic gradient descent (D-SGD) algorithm [31]. Such a training process exposes IIoT devices to potential data privacy leakage risks. To address these privacy challenges, a collaborative distributed deep learning paradigm, called federated deep learning, was proposed for edge devices to train a global model while keeping the training datasets local and without sharing raw training data [18]. The procedure of FL is divided into three phases: the initialization phase, the aggregation phase, and the update phase. In the initialization phase, we consider FL with $N$ edge devices and a parameter aggregator, i.e., a cloud aggregator, which distributes a global model $w$ pre-trained on public datasets (e.g., MNIST [32], CIFAR-10 [33]) to each edge device. Following that, each device $k$ uses its local dataset $D_k$ of size $n_k$ to train and improve the current global model in each iteration. In the aggregation phase, the cloud aggregator collects the local gradients uploaded by the edge nodes (i.e., edge devices). To do so, the local loss function to be optimized is defined as follows:

$F_k(w) = \frac{1}{n_k} \sum_{x_i \in D_k} \ell(w; x_i) + \rho_k(w)$ (3)

where $F_k(w)$ is the local loss function for edge device $k$, $\ell(w; x_i)$ is the per-sample loss, $\rho_k(w)$ is a regularizer function for edge device $k$, and $x_i$ is sampled from the local dataset $D_k$ on the device. In the update phase, the cloud aggregator uses the Federated Averaging (FedAVG) algorithm [26] to obtain a new global model for the next iteration, thus we have:

$w^{t+1} = \sum_{k=1}^{N} \frac{n_k}{n}\, w_k^{t+1}, \quad n = \sum_{k=1}^{N} n_k$ (4)

where $w_k^{t+1}$ denotes edge device $k$'s model update and the weighted sum denotes the average aggregation (i.e., the FedAVG algorithm). Both the edge devices and the cloud aggregator repeat the above process until the global model converges. This paradigm significantly reduces the risk of privacy leakage by decoupling model training from direct access to the raw training data on edge nodes.
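For reference, one FedAVG round can be sketched as follows (PyTorch; the helper names, the MSE loss, and the SGD settings are illustrative assumptions rather than the paper's implementation): each device trains a copy of the global model on its local data, and the aggregator forms the dataset-size-weighted average of Eq. (4).

import copy
import torch

def local_update(global_model, loader, epochs=1, lr=0.01):
    """Train a copy of the global model on one edge device's local data."""
    local = copy.deepcopy(global_model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(local(x), y).backward()
            opt.step()
    return local.state_dict()

def fedavg(client_states, client_sizes):
    """Weighted average w^{t+1} = sum_k (n_k / n) * w_k^{t+1} (Eq. 4)."""
    n = float(sum(client_sizes))
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(state[key].float() * (n_k / n)
                       for state, n_k in zip(client_states, client_sizes))
    return avg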

III-C Gradient Compression

Large-scale FL training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training [34]. In this context, Lin et al. in [34] showed that 99.9% of the gradient exchange in D-SGD is redundant. To prevent expensive communication bandwidth from limiting large-scale distributed training, gradient compression was proposed to greatly reduce the required bandwidth. Researchers generally use gradient quantization [35] and gradient sparsification [36] to achieve gradient compression. Gradient quantization reduces communication bandwidth by quantizing gradients to low-precision values, while gradient sparsification uses threshold-based selection to transmit only a subset of the gradients.

For a fully connected (FC) layer in a deep neural network, we have $y = \varphi(Wx + b)$, where $x$ is the input, $b$ is the bias, $W$ is the weight matrix, $\varphi$ is the nonlinear mapping, and $y$ is the output. This formula is the most basic operation in a neural network. For a specific neuron $i$, the above formula simplifies to $y_i = \varphi\big(\sum_j W_{ij} x_j + b_i\big)$, where $\varphi$ is the activation function. Gradient compression compresses the corresponding weight (gradient) matrix into a sparse matrix, and hence the corresponding formula is given as follows:

$y = \varphi\big(\mathrm{sparse}(W)\, x + b\big)$ (5)

where $\mathrm{sparse}(W)$ represents the compressed weight matrix and the accompanying index set represents the position information of the retained gradients in the weight matrix $W$. Such a method reduces the communication overhead by sparsifying the gradients of the weight matrix $W$.
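The idea can be illustrated with a small Top-k sparsification routine (a sketch, not the paper's code; the 0.1% keep ratio mirrors the 99.9% redundancy figure quoted above): only the largest-magnitude entries of a gradient tensor are kept, together with their positions, and everything else is dropped.

import torch

def topk_sparsify(grad, keep_ratio=0.001):
    """Keep the top keep_ratio fraction of gradient entries by absolute value.

    Returns the kept values, their flat indices (the position information),
    and the original shape needed to rebuild a dense tensor on the server.
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * keep_ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def densify(values, indices, shape):
    """Rebuild a dense (mostly zero) gradient tensor from the sparse pieces."""
    dense = torch.zeros(shape).flatten()
    dense[indices] = values
    return dense.view(shape)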

Fig. 1: The workflow of the on-device deep anomaly detection in IIoT.

IV System Model

We consider the generic setting for on-device deep anomaly detection in IIoT, where a cloud aggregator and edge devices work collaboratively to train a DAD model by using a given training algorithm (e.g., LSTM) for a specific task (i.e., anomaly detection task), as illustrated in Fig. 1. The edge devices train a shared global model locally on their own local dataset (i.e., sensing time series data from IIoT nodes) and upload their model updates (i.e., gradients) to the cloud aggregator. The cloud aggregator uses the FedAVG algorithm or other aggregation algorithms to aggregate these model updates and obtains a new global model. In the end, the edge devices will receive the new global model sent by the cloud aggregator and use it to achieve accurate and timely anomaly detection.

IV-A System Model Limitations

The proposed framework focuses on a DAD model learning task involving distributed edge devices and a cloud aggregator. In this context, the framework has two limitations: missing labels and communication overhead.

For the missing-label limitation, we assume that the labels of a certain proportion of the training samples are missing. The lack of labels causes a class-imbalance problem, thereby reducing the accuracy of the DAD model. For the communication-overhead limitation, we consider that excessive communication overhead arises when a large number of gradients are exchanged between the edge devices and the cloud aggregator, which may prevent the model from converging [29].

The above restrictions hinder the deployment of DAD models on edge devices, which motivates us to develop a communication-efficient FL-based unsupervised DAD framework that achieves accurate and timely anomaly detection.

Fig. 2: The overview of on-device communication-efficient deep anomaly detection framework in IIoT. This framework’s workflow consists of five steps, as follows: (i) The edge device uses the sensing time series data collected from IIoT nodes as a local dataset (as shown in 1⃝). (ii) The edge device performs local model (i.e., AMCNN-LSTM model) training on the local dataset (as shown in 2⃝). (iii) The edge device uploads the sparse gradients to the cloud aggregator by using a gradient compression mechanism (as shown in 3⃝). (iv) The cloud aggregator obtains a new global model by aggregating the sparse gradients uploaded by the edge device (as shown in 4⃝). (v) The cloud aggregator sends the new global model to each edge device. The above steps are executed cyclically until the global model reaches optimal convergence (as shown in 5⃝). Decentralized devices can use this optimal global model to perform anomaly detection tasks.

IV-B The Proposed Framework

We consider an on-device communication-efficient deep anomaly detection framework that involves multiple edge devices for collaborative model training in IIoT, as illustrated in Fig. 2. In particular, this framework consists of a cloud aggregator and edge devices. Furthermore, the proposed framework also includes two mechanisms: an anomaly detection mechanism and a gradient compression mechanism. More details are described as follows:

  • Cloud Aggregator: The cloud aggregator is generally a cloud server with strong computing power and rich computing resources. It performs two functions: (1) it initializes the global model and sends it to all edge devices; (2) it aggregates the gradients uploaded by the edge devices until the model converges.

  • Edge Devices: Edge devices are generally agents and clients, such as appliances (e.g., whirlpools), wind turbines, and vehicles, which contain local models and functional mechanisms (see below for more details). Each edge device uses its local dataset (i.e., sensing time-series data from IIoT nodes) to train the global model sent by the cloud aggregator and uploads its gradients to the cloud aggregator until the global model converges. The local model is deployed on the edge device and performs anomaly detection. In this paper, we use the AMCNN-LSTM model to detect anomalies, which uses a CNN to capture the fine-grained features of sensing time-series data and an LSTM module to detect anomalies accurately and in a timely manner.

The functions of mechanisms are described as follows:

  • Deep Anomaly Detection Mechanism: The deep anomaly detection mechanism is deployed in the edge devices, which can detect anomalies to reduce economic losses.

  • Gradient Compression Mechanism: The gradient compression mechanism is deployed in the edge devices, which can compress the local gradients to reduce the number of gradients exchanged between the edge devices and the cloud aggregator, thereby reducing communication overhead.

IV-C Design Goals

In this paper, our goal is to develop an on-device communication-efficient FL framework for deep anomaly detection in IIoT. First, the proposed framework needs to detect anomalies accurately in an unsupervised manner. The proposed framework uses an unsupervised AMCNN-LSTM model to detect anomalies. Second, the proposed framework can significantly improve communication efficiency by using a gradient compression mechanism. Third, the performance of the proposed framework is comparable to traditional FL frameworks.

V A Communication-Efficient On-device Deep Anomaly Detection Framework

In this section, we first present the attention mechanism-based CNN-LSTM model, which uses a CNN to capture the fine-grained features of sensing time-series data and an LSTM module to detect anomalies accurately and in a timely manner. We then propose a deep gradient compression mechanism to further improve the communication efficiency of the proposed framework.

Fig. 3: The overview of the attention mechanism-based CNN-LSTM Model.

V-A Attention Mechanism-based CNN-LSTM Model

We present an unsupervised AMCNN-LSTM model consisting of an input layer, an attention mechanism-based CNN unit, an LSTM unit, and an output layer, as shown in Fig. 3. First, the preprocessed data is fed into the input layer. Second, we use the CNN to capture fine-grained features of the input and utilize the attention mechanism to focus on the important ones among the captured features. Third, the output of the attention mechanism-based CNN unit is used as the input of the LSTM unit, which predicts future time-series data. Finally, we propose an anomaly detection score to detect anomalies.

Preprocessing: We normalize the sensing time series data collected by the IIoT nodes into [0,1] to accelerate the model convergence.
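A minimal sketch of this step (assuming per-feature min-max scaling whose bounds are computed on the training split and reused at test time):

import numpy as np

def minmax_normalize(x, lo=None, hi=None):
    """Scale a time series into [0, 1]; reuse training-set lo/hi at test time."""
    lo = x.min(axis=0) if lo is None else lo
    hi = x.max(axis=0) if hi is None else hi
    return (x - lo) / (hi - lo + 1e-8), lo, hi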

Attention Mechanism-based CNN Unit: First, we introduce an attention mechanism into the CNN unit to improve the focus on important features. In cognitive science, due to the bottleneck of information processing, humans selectively focus on important parts of information while ignoring other visible information [37]. Inspired by this, attention mechanisms have been proposed for various tasks, such as computer vision and natural language processing [37, 38, 39]. Therefore, the attention mechanism can improve the performance of the model by paying attention to important features. The formal definition of the attention mechanism is given as follows:

$e_t = \mathrm{score}(q, k_t), \quad \alpha_t = \frac{\exp(e_t)}{\sum_{j}\exp(e_j)}, \quad c = \sum_{t} \alpha_t k_t$ (6)

where $q$ is the matching feature vector based on the current task and is used to interact with the context, $k_t$ is the feature vector of a timestamp in the time series, $e_t$ is the unnormalized attention score, $\alpha_t$ is the normalized attention score, and $c$ is the context feature of the current timestamp calculated based on the attention scores and the feature sequence $\{k_t\}$. In most instances, $\mathrm{score}(q, k_t) = q^{\top} W k_t$, where $W$ is the weight matrix.

Second, we use the CNN unit to extract fine-grained features of the time-series data. The CNN module is formed by stacking multiple layers of one-dimensional (1-D) CNNs, and each layer includes a convolution layer, a batch normalization layer, and a non-linear layer. The module implements sampling aggregation by using pooling layers and creates hierarchical structures that gradually extract more abstract features through the stacking of convolutional layers. The module outputs a set of feature sequences whose size is determined by the number of channels and the sequence length. To further extract significant time-series features, we propose a parallel feature extraction branch that combines the attention mechanism with the CNN. The attention mechanism module is composed of feature aggregation and scale restoration. The feature aggregation part uses stacked convolution and pooling layers to extract key features from the sequence and uses a small convolution kernel to mine linear relationships. The scale restoration part restores the key features to the same size as the output features of the CNN module and then uses the sigmoid function to constrain the values to [0, 1].

Third, we multiply the output features of the CNN module element-wise by the importance weights produced by the corresponding attention mechanism module. We assume the input sequence is $X = \{x_1, x_2, \dots, x_T\}$. The output of the sequence processed by the CNN module is represented by $C_{i,c}$, and the output of the corresponding attention module is represented by $A_{i,c}$. We multiply the two outputs element by element, as follows:

$F_{i,c} = C_{i,c} \odot A_{i,c}$ (7)

where $\odot$ represents element-wise multiplication, $i$ is the corresponding position of the time series in the feature layer, and $c$ is the channel. We use the final feature layer $F$ as the input of the LSTM block.

We introduce the attention mechanism to expand the receptive field of the input, which allows the model to obtain more comprehensive contextual information, thereby learning the important features of the current local sequence. Furthermore, we use the attention module to suppress the interference of unimportant features to the model, thereby solving the problem that the model cannot distinguish the importance of the time series data features.
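A compact PyTorch sketch of this parallel branch is given below (the channel counts, kernel sizes, and module names are illustrative assumptions, not the paper's exact architecture): a stacked 1-D CNN extracts features, a parallel attention branch produces sigmoid weights of the same size, and the two are multiplied element-wise as in Eq. (7).

import torch
import torch.nn as nn

class AttentionCNNUnit(nn.Module):
    """Parallel CNN + attention branch; output = CNN features * sigmoid weights."""
    def __init__(self, in_ch=1, ch=32):
        super().__init__()
        # Feature branch: stacked 1-D conv -> batch norm -> non-linearity.
        self.features = nn.Sequential(
            nn.Conv1d(in_ch, ch, kernel_size=3, padding=1), nn.BatchNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=3, padding=1), nn.BatchNorm1d(ch), nn.ReLU(),
        )
        # Attention branch: aggregate with pooling, then restore the temporal scale.
        self.attention = nn.Sequential(
            nn.Conv1d(in_ch, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(ch, ch, kernel_size=1),              # small kernel mines linear relations
            nn.Upsample(scale_factor=2, mode="nearest"),   # scale restoration
            nn.Sigmoid(),                                  # constrain weights to [0, 1]
        )

    def forward(self, x):                  # x: (batch, in_ch, length), length assumed even
        feats = self.features(x)           # (batch, ch, length)
        weights = self.attention(x)        # (batch, ch, length)
        return feats * weights             # element-wise product, as in Eq. (7)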

LSTM Unit: In this paper, we use a variant of the recurrent neural network, called LSTM, to accurately predict the sensing time-series data for anomaly detection, as shown in Fig. 3. LSTM uses a well-designed “gate” structure to remove or add information to the cell state; the “gate” structure is a method of selectively passing information. LSTM cells include forget gates $f_t$, input gates $i_t$, and output gates $o_t$. The calculations of the three gate structures and the state updates are defined as follows:

$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$, $i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$, $o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$, $c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c[h_{t-1}, x_t] + b_c)$, $h_t = o_t \odot \tanh(c_t)$ (8)

where $W_f, W_i, W_o, W_c$ and $b_f, b_i, b_o, b_c$ are the weight matrices and the bias vectors for input vector $x_t$ at time step $t$, respectively, $\sigma$ is the activation function, $\odot$ represents element-wise multiplication, $c_t$ represents the cell state, $h_{t-1}$ is the state of the hidden layer at time step $t-1$, and $h_t$ is the state of the hidden layer at time step $t$.
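For reference, the gate computations of Eq. (8) correspond to a single LSTM step; the sketch below (plain PyTorch tensors with illustrative dimensions) spells them out.

import torch

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eq. (8); W maps [h_{t-1}, x_t] to 4 * hidden units."""
    hidden = h_prev.shape[-1]
    z = torch.cat([h_prev, x_t], dim=-1) @ W + b       # joint projection
    f_t = torch.sigmoid(z[..., 0*hidden:1*hidden])     # forget gate
    i_t = torch.sigmoid(z[..., 1*hidden:2*hidden])     # input gate
    o_t = torch.sigmoid(z[..., 2*hidden:3*hidden])     # output gate
    g_t = torch.tanh(z[..., 3*hidden:4*hidden])        # candidate cell state
    c_t = f_t * c_prev + i_t * g_t                     # cell state update
    h_t = o_t * torch.tanh(c_t)                        # hidden state
    return h_t, c_t

# Usage with illustrative sizes: input dimension 8, hidden dimension 16.
x_t, h, c = torch.randn(1, 8), torch.zeros(1, 16), torch.zeros(1, 16)
W, b = torch.randn(16 + 8, 4 * 16), torch.zeros(4 * 16)
h, c = lstm_step(x_t, h, c, W, b)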

Anomaly Detection: We use the AMCNN-LSTM model to predict real-time and future sensing time-series data on different edge devices:

$\hat{y}_t = f(x_1, x_2, \dots, x_{t-1})$ (9)

where $f(\cdot)$ is the prediction function. In this paper, we use the LSTM unit for time-series prediction. We use an anomaly score for anomaly detection, which is defined as follows:

$a_t = (e_t - \mu)^{\top} \Sigma^{-1} (e_t - \mu)$ (10)

where $a_t$ is the anomaly score, $e_t = |y_t - \hat{y}_t|$ is the reconstruction error vector, and the error vectors for the time series in the sequences are used to estimate the parameters $\mu$ and $\Sigma$ of a Normal distribution using Maximum Likelihood Estimation.

In an unsupervised setting, a point in a sequence is predicted to be “anomalous” when $a_t > \tau$, where the threshold $\tau$ is selected to maximize $F_{\beta} = (1+\beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$, with $P$ the precision, $R$ the recall, and $\beta$ the parameter; otherwise the point is predicted to be “normal”.
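A minimal sketch of this scoring step (NumPy; the validation-set handling and the threshold value are illustrative simplifications): fit the Normal distribution to prediction-error vectors on normal data by maximum likelihood, compute the score of Eq. (10), and flag points whose score exceeds the threshold.

import numpy as np

def fit_error_distribution(errors):
    """MLE of the mean and covariance of error vectors e_t = |y_t - y_hat_t|."""
    mu = errors.mean(axis=0)
    sigma = np.cov(errors, rowvar=False) + 1e-6 * np.eye(errors.shape[1])
    return mu, np.linalg.inv(sigma)

def anomaly_scores(errors, mu, sigma_inv):
    """Score a_t = (e_t - mu)^T Sigma^{-1} (e_t - mu) for each error vector (Eq. 10)."""
    diff = errors - mu
    return np.einsum("ti,ij,tj->t", diff, sigma_inv, diff)

# Usage: errors from normal validation data fix mu/Sigma; tau flags test points.
val_errors = np.abs(np.random.randn(500, 3)) * 0.1     # stand-in for |y - y_hat|
mu, sigma_inv = fit_error_distribution(val_errors)
test_errors = np.abs(np.random.randn(100, 3))
tau = 9.0                                              # illustrative threshold
is_anomalous = anomaly_scores(test_errors, mu, sigma_inv) > tau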

V-B Gradient Compression Mechanism

If the gradients reach 99.9% sparsity, only the 0.1% of gradients with the largest absolute values are useful for model aggregation [30]. Therefore, we only need to aggregate the gradients with larger absolute values to update the model. This reduces the byte size of the gradient matrix, which in turn reduces the number of gradients exchanged between the device and the cloud and improves communication efficiency, especially for distributed machine learning systems. Inspired by these facts, we propose a gradient compression mechanism to reduce the gradients exchanged between the cloud aggregator and the edge devices. We expect this mechanism to further improve the communication efficiency of the proposed framework.

When we select gradients with larger absolute values, we may encounter the following situations: (1) none of the gradient values in the gradient matrix are greater than the given threshold; (2) some gradient values in the gradient matrix are very close to the given threshold. If we simply set the gradients that do not meet the threshold requirement to 0, information is lost. Therefore, the device uses a local gradient accumulation scheme to prevent information loss. Specifically, the smaller gradients are returned to the device rather than being filtered out; the device keeps them in a buffer and accumulates them until the accumulation reaches the given threshold. Note that we use D-SGD for iterative updates, and the loss function to be optimized is defined as follows:

$F(w) = \frac{1}{|D|} \sum_{x \in D} f(x, w)$ (11)

$w_{t+1} = w_t - \eta \frac{1}{N b} \sum_{k=1}^{N} \sum_{x \in B_{k,t}} \nabla f(x, w_t)$ (12)

where $F(w)$ is the loss function, $f(x, w)$ is the loss function for the local device, $w$ are the weights of the model, $N$ is the total number of edge devices, $\eta$ is the learning rate, $B_{k,t}$ represents the mini-batch of $b$ data samples used by device $k$ in the $t$-th round of training, and each local dataset is of size $|D_k|$.

When the gradient sparsification reaches a high level (e.g., 99%), it will affect the model convergence. Following [30, 36], we use momentum correction and local gradient clipping to mitigate this effect. Momentum correction makes the accumulated small local gradients converge toward the gradients with larger absolute values, thereby accelerating the model’s convergence. Local gradient clipping is used to alleviate the problem of gradient explosions

[30]. Next, we show that the local gradient accumulation scheme does not affect model convergence. We assume that $g_j^{(t)}$ is the $j$-th gradient at iteration $t$, $G_j^{\mathrm{std}}$ denotes the sum of the gradient updates applied by the aggregation algorithm in [26], $G_j^{\mathrm{acc}}$ denotes the sum of the gradient updates applied under the local gradient accumulation scheme, and $\eta$ is the rate of gradient descent. If the $j$-th gradient does not exceed the threshold until the $(t+T)$-th iteration, which then triggers the model update, we have:

$G_j^{\mathrm{acc}} = -\eta \sum_{\tau=t}^{t+T-1} g_j^{(\tau)}$ (13)

$w_j^{(t+T)} = w_j^{(t)} + G_j^{\mathrm{acc}} = w_j^{(t)} - \eta \sum_{\tau=t}^{t+T-1} g_j^{(\tau)}$ (14)

then we can update $w_j$ and reset the accumulation buffer to zero. If instead the $j$-th gradient reaches the threshold at each iteration, the model update is triggered immediately, thus we have:

$w_j^{(t+1)} = w_j^{(t)} - \eta g_j^{(t)}$ (15)

$w_j^{(t+T)} = w_j^{(t)} - \eta \sum_{\tau=t}^{t+T-1} g_j^{(\tau)} = w_j^{(t)} + G_j^{\mathrm{std}}$ (16)

Then we can update $w_j$, and since (14) and (16) coincide, the result of using the local gradient accumulation scheme is consistent with that of the optimization algorithm in [26].

The specific implementation phases of the gradient compression mechanism are given as follows:

  1. Phase 1, Local Training: Edge devices use the local dataset to train the local model. In particular, we use the gradient accumulation scheme to accumulate small local gradients.

  2. Phase 2, Gradient Compression: Each edge device uses Algorithm 1 to compress the gradients and uploads sparse gradients (i.e., only gradients larger than a threshold are transmitted) to the cloud aggregator. Note that the edge devices send the remaining local gradients to the cloud aggregator once the local gradient accumulation exceeds the threshold.

  3. Phase 3, Gradient Aggregation: The cloud aggregator obtains the global model by aggregating the sparse gradients and sends this global model to the edge devices.

The gradient compression algorithm is thus presented in Algorithm 1.

Input: the local dataset D_k, local mini-batch size b, learning rate eta, the edge node's loss function f, the gradient threshold thr, and the optimization function SGD.
Output: Parameter w.
Initialize parameter w;
G_k <- 0 (local gradient accumulation buffer);
for each round t = 0, 1, 2, ... do
      for i = 1, ..., b do
            Sample data x from D_k;
            G_k <- G_k + (1/b) * grad f(x; w_t);
      if Gradient Clipping then
            G_k <- clip(G_k);
      foreach gradient entry j of G_k do
            Select the threshold thr;
            if |G_k[j]| > thr then
                  Send this gradient entry to the cloud aggregator;
            if |G_k[j]| <= thr then
                  The edge node uses the local gradient accumulation scheme to keep accumulating G_k[j] until it reaches thr;
      Aggregate the received sparse gradients at the cloud aggregator;
      w_{t+1} <- SGD(w_t, aggregated gradients).
return w.
Algorithm 1 Gradient compression mechanism on edge node k.
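To make Algorithm 1 concrete, here is a minimal Python/PyTorch companion sketch of the per-node loop (the clipping bound, the threshold value, and the function names are illustrative assumptions): compute the local gradient, add it to the accumulation buffer, clip, and send only the entries whose accumulated magnitude exceeds the threshold, keeping the rest in the buffer.

import torch

def compress_step(model, loss_fn, batch, buffers, thr=1e-3, clip_val=5.0):
    """One local step of threshold-based gradient compression with accumulation."""
    x, y = batch
    model.zero_grad()
    loss_fn(model(x), y).backward()
    sparse_update = {}
    for name, p in model.named_parameters():
        buf = buffers.setdefault(name, torch.zeros_like(p))
        buf += p.grad                                   # local gradient accumulation
        buf.clamp_(-clip_val, clip_val)                 # simple element-wise clipping
        mask = buf.abs() > thr                          # entries large enough to send
        sparse_update[name] = (mask.nonzero(), buf[mask].clone())
        buf[mask] = 0.0                                 # reset what was sent; keep the rest
    return sparse_update                                # uploaded to the cloud aggregator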

VI Experiments

In this section, the proposed framework is applied to four real-world datasets, i.e., power demand (https://archive.ics.uci.edu/ml/datasets/), space shuttle (https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)), ECG (https://physionet.org/about/database/), and engine (https://archive.ics.uci.edu/ml/datasets.php), for performance demonstration. These are time-series datasets collected by different types of sensors from different fields [6]. For example, the power demand dataset is composed of electricity consumption data recorded by electricity meters. These datasets contain both normal and anomalous subsequences. As shown in Table I, N, N_n, and N_a denote the number of original sequences, normal subsequences, and anomalous subsequences, respectively. For the power demand dataset, an anomalous subsequence indicates that the electricity meter has failed or stopped working. Therefore, we use these datasets to train an FL model that can detect anomalies. We divide each dataset into a training set and a test set at a 7:3 ratio. We implement the proposed framework using PyTorch and PySyft [40]. The experiments are conducted on a virtual workstation with the Ubuntu 18.04 operating system, an Intel(R) Core(TM) i5-4210M CPU, 16 GB RAM, and a 512 GB SSD.

Datasets         Dimensions   N    N_n   N_a
Power Demand     1            1    45    6
Space Shuttle    1            3    20    8
ECG              1            1    215   1
Engine           12           30   240   152

TABLE I: Details of four real-world datasets

VI-A Evaluation Setup

In this experiment, to determine the hyperparameter thr (the gradient threshold) of the gradient compression mechanism, we first apply a simple CNN (i.e., a CNN with 2 convolutional layers followed by 1 fully connected layer) in the proposed framework to perform classification tasks on the MNIST and CIFAR-10 datasets. The pixels in all datasets are normalized into [0,1]. During the simulation, we fix the number of edge devices, the learning rate, the number of training epochs, and the mini-batch size, and we follow reference [41] to set the remaining hyperparameter to 0.05.

We adopt the Root Mean Square Error (RMSE) to measure the prediction performance of the AMCNN-LSTM model:

$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2}$ (17)

where $y_t$ is the observed sensing time-series data and $\hat{y}_t$ is the predicted sensing time-series data.
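Equivalently, in code (a short NumPy sketch):

import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over a predicted time series (Eq. 17)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))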

Fig. 4: The accuracy of the proposed framework with different gradient thresholds thr on the MNIST and CIFAR-10 datasets.

VI-B Hyperparameter Selection of the Proposed Framework

In the context of the deep gradient compression scheme, proper hyperparameter selection, i.e., the threshold on the absolute gradient value, is a notable factor that determines the proposed framework's performance. In this section, we investigate the performance of the proposed framework with different thresholds and try to find the best-performing one. We use the MNIST and CIFAR-10 datasets to evaluate the performance of the proposed framework under the selected thresholds. As shown in Fig. 4, we observe that the more gradient information is transmitted, the better the performance of the proposed framework. For the MNIST task, the results show that the accuracy is 97.25% under the more aggressive threshold setting and 99.08% under the looser one; the latter increases the transmitted gradient size by about 300 times but improves the accuracy by only 1.83%. We thus observe a trade-off between the gradient threshold and accuracy. Therefore, to achieve a good trade-off between the gradient threshold and accuracy, we choose the more aggressive setting as the threshold of our scheme.

Fig. 5: Performance comparison of detection accuracy for AMCNN-LSTM, CNN-LSTM, LSTM, GRU, SAEs, and SVM on different datasets: power demand, space shuttle, ECG, and engine.
Fig. 6: Performance comparison of RMSE for AMCNN-LSTM, CNN-LSTM, LSTM, GRU, SAEs, and SVM on different datasets: power demand, space shuttle, ECG, and engine.

VI-C Performance of the Proposed Framework

We compare the performance of the proposed model with that of the CNN-LSTM [42], LSTM [41], Gated Recurrent Unit (GRU) [43], Stacked Auto-Encoders (SAEs) [44], and Support Vector Machine (SVM) [45] methods under an identical simulation configuration. Among these methods, AMCNN-LSTM is an FL-based model, while the rest are centralized ones. All models are popular DAD models for general anomaly detection applications. We evaluate these models on the four real-world datasets, i.e., power demand, space shuttle, ECG, and engine.

First, we compare the accuracy of the proposed model with that of the competing methods in anomaly detection. We determine the threshold τ and the hyperparameter β based on the accuracy and recall of the model on the training set. The hyperparameter values for the power demand, space shuttle, ECG, and engine datasets are 0.75, 0.80, 0.80, and 0.60, respectively. In Fig. 5, the experimental results show that the proposed model achieves the highest accuracy on all four datasets. For example, on the power demand dataset, the accuracy of the AMCNN-LSTM model is 96.85%, which is 7.87% higher than that of the SVM model. The experimental results also show that AMCNN-LSTM is more robust across different datasets. The reason is that we use the on-device FL framework to train and update the model, which can learn time-series features from different edge devices as much as possible, thereby improving the robustness of the model. Furthermore, the FL framework provides opportunities for edge devices to update their models in a timely manner, which helps edge device owners keep the deployed models up to date.

Second, we evaluate the prediction error of the proposed model and the competing methods. As shown in Fig. 6, the experimental results show that the proposed model achieves the best performance on the four real-world datasets. For the ECG dataset, the RMSE of the AMCNN-LSTM model is 63.9% lower than that of the SVM model. The reason is that the AMCNN-LSTM model uses AMCNN units to capture important fine-grained features and to prevent the memory loss and gradient dispersion problems that often occur in encoder-decoder models such as LSTM and GRU. Furthermore, the proposed model retains the advantages of the LSTM unit in predicting time-series data.

Therefore, the proposed model not only detects anomalies accurately but also predicts time-series data accurately.

Fig. 7: Comparison of communication efficiency between FL with GCM and FL without GCM with different models.

VI-D Communication Efficiency of the Proposed Framework

In this section, we compare the communication efficiency of the FL framework with the gradient compression mechanism (GCM) against the traditional FL framework without GCM. We apply the same models (i.e., AMCNN-LSTM, CNN-LSTM, LSTM, GRU, SAEs, and SVM) in the proposed framework and in the traditional FL framework. Note that we fix the communication overhead of each round, so we can compare the running time of the models as a proxy for communication efficiency. Fig. 7 shows the running time of FL with GCM and FL without GCM for the different models. We observe that the running time of the FL framework with GCM is about 50% of that of the framework without GCM. The reason is that GCM reduces the number of gradients exchanged between the edge devices and the cloud aggregator. In Section VI-B, we showed that GCM can compress the gradients by 300 times without compromising accuracy. Therefore, the proposed communication-efficient framework is practical and effective in real-world applications.

VI-E Discussion

Given the trade-off between privacy and model performance, we discuss the privacy of the proposed framework in terms of data access and model performance:

  • Data Access: The FL framework allows edge devices to keep their datasets local while collaboratively learning deep learning models, which means that no third party can access a user's raw data. Therefore, the FL-based model can achieve anomaly detection without compromising privacy.

  • Model Performance: Although the FL-based model can protect privacy, the model performance is still an important metric to measure the quality of the model. It can be seen from the experimental results that the performance of the proposed model is comparable to many advanced centralized machine learning models, such as CNN-LSTM, LSTM, GRU, and SVM model. In other words, the proposed model makes a good compromise between privacy and model performance.

VII Conclusion

In this paper, we propose a novel communication-efficient on-device FL-based deep anomaly detection framework for sensing time-series data in IIoT. First, we introduce an FL framework that enables decentralized edge devices to collaboratively train an anomaly detection model, which solves the problem of data islands. Second, we propose an attention mechanism-based CNN-LSTM (AMCNN-LSTM) model to accurately detect anomalies. The AMCNN-LSTM model uses attention mechanism-based CNN units to capture important fine-grained features and prevent memory loss and gradient dispersion problems, while retaining the advantages of the LSTM unit in predicting time-series data. We evaluate the performance of the proposed model on four real-world datasets and compare it with the CNN-LSTM, LSTM, GRU, SAEs, and SVM methods. The experimental results show that the AMCNN-LSTM model achieves the highest accuracy on all four datasets. Third, we propose a gradient compression mechanism based on Top-k selection to improve communication efficiency. Experimental results validate that this mechanism can compress the gradients by 300 times without losing accuracy. To the best of our knowledge, this is one of the pioneering works on deep anomaly detection using on-device FL.

In the future, we will focus on privacy-enhanced FL frameworks and more robust anomaly detection models, since the FL framework is vulnerable to attacks by malicious participants and a more robust model can be applied to a wider range of application scenarios.

References