The widespread deployment of edge devices in the Industrial Internet of Things (IIoT) paradigm has spawned a variety of emerging edge computing applications, such as smart manufacturing, intelligent transportation, and intelligent logistics. Edge devices provide powerful computation resources that enable real-time, flexible, and quick decision making for IIoT applications, which has greatly promoted the development of Industry 4.0. However, IIoT applications suffer from critical security risks caused by abnormal IIoT nodes, which hinders the rapid development of IIoT. For example, in smart manufacturing scenarios, industrial devices acting as IIoT nodes (e.g., engines with sensors) that exhibit abnormal behaviors (e.g., abnormal traffic and irregular reporting frequency) may interrupt industrial production, resulting in huge economic losses for factories [3, 4]. Edge devices (e.g., industrial robots) generally collect sensing data, especially time-series data, from IIoT nodes to analyze and capture the behaviors and operating conditions of the nodes through edge computing. Therefore, such sensing time-series data can be used to detect the anomalous behaviors of IIoT nodes.
To address abnormality problems in IIoT devices, typical methods perform anomaly detection on the affected devices [7, 8, 9, 10]. Previous work focused on utilizing deep anomaly detection (DAD) approaches to detect abnormal behaviors of IIoT devices by analyzing sensing time-series data. DAD techniques can learn hierarchical discriminative features from historical time-series data. In [12, 13, 14], the authors proposed Long Short-Term Memory (LSTM) network-based deep learning models to achieve anomaly detection in sensing time-series data. Munir et al. proposed a novel DAD approach, called DeepAnT, which uses a deep Convolutional Neural Network (CNN) to predict anomalous values. Although existing DAD approaches have achieved success in anomaly detection, they cannot be directly applied to IIoT scenarios with distributed edge devices for timely and accurate anomaly detection. The reasons are two-fold: (i) most detection models in traditional approaches are not flexible enough; the edge devices lack dynamically and automatically updated detection models for different scenarios, and hence the models fail to accurately predict frequently updated time-series data; (ii) due to privacy concerns, edge devices are unwilling to share their collected time-series data with each other, so the data exists in the form of "islands." These data islands significantly degrade the performance of anomaly detection. Furthermore, it is often overlooked that the data may contain sensitive private information, which leads to potential privacy leakage. Privacy issues do arise in the anomaly detection context: for example, an anomaly detection model may reveal a patient's heart disease history when detecting the patient's abnormal pulse [16, 17].
To address the above challenges, a promising on-device privacy-preserving distributed machine learning paradigm, called on-device federated learning (FL), was proposed for edge devices to train a global DAD model while keeping the training datasets local, without sharing raw training data. Such a framework allows edge devices to collaboratively train an on-device DAD model without compromising privacy. For example, prior work proposed an on-device FL-based approach to achieve collaborative anomaly detection, and Tsukada et al. utilized the FL framework to propose a Backpropagation Neural Network (BPNN)-based approach for anomaly detection. However, previous research ignores the communication overhead of FL model training among large-scale edge devices. Expensive communication may cause excessive overhead and long convergence times for edge devices, so that the on-device DAD model cannot detect anomalies quickly. Therefore, it is necessary to develop a communication-efficient on-device FL framework to achieve accurate and timely anomaly detection for edge devices.
In this paper, we propose a communication-efficient on-device FL framework that leverages an attention mechanism-based CNN-LSTM (AMCNN-LSTM) model to achieve accurate and timely anomaly detection for edge devices. First, we introduce an FL framework to enable distributed edge devices to collaboratively train a global DAD model without compromising privacy. Second, we propose the AMCNN-LSTM model to detect anomalies: we use attention-based CNNs to extract fine-grained features of historical sensing time-series data and LSTM modules for time-series prediction. Such a model can prevent memory loss and gradient dispersion problems. Third, to further improve the communication efficiency of the proposed framework, we propose a gradient compression mechanism based on Top-k selection to reduce the number of gradients uploaded by edge devices. We evaluate the proposed framework on four real-world datasets: power demand, space shuttle, ECG, and engine. Experimental results show that the proposed framework achieves accurate and timely anomaly detection with high communication efficiency. The contributions of this paper are summarized as follows:
We introduce a federated learning framework to develop an on-device collaborative deep anomaly detection model for edge devices in IIoT.
We propose an attention mechanism-based CNN-LSTM model to detect anomalies, which uses CNN units to capture the fine-grained features of time-series data and uses an LSTM module to detect anomalies accurately and in a timely manner.
We propose a Top-k selection-based gradient compression scheme to improve the proposed framework’s communication efficiency. Such a scheme decreases communication overhead by reducing the exchanged gradient parameters between the edge devices and the cloud aggregator.
We conduct extensive experiments on four real-world datasets to demonstrate that the proposed framework can accurately detect anomalies with low communication overhead.
II Related Work
II-A Deep Anomaly Detection
Deep Anomaly Detection (DAD) has long been a central topic in IIoT, serving to detect anomalies. Previous research on DAD can generally be divided into three categories: supervised, semi-supervised, and unsupervised DAD approaches.
Supervised Deep Anomaly Detection:
Supervised deep anomaly detection typically uses the labels of normal and abnormal data to train a deep supervised binary or multi-class classifier. For example, Erfani et al. proposed a supervised Support Vector Machine (SVM) classifier for high-dimensional data to classify normal and abnormal data. Despite the success of supervised DAD methods in anomaly detection, these methods are less popular than semi-supervised or unsupervised methods due to the lack of labeled training data. Furthermore, supervised DAD methods perform poorly on data with class imbalance (where the number of samples in one class far exceeds that in the other).
Semi-supervised Deep Anomaly Detection:
Since labels for normal instances are easier to obtain than those for anomalies, semi-supervised DAD techniques use a single existing label class (normally the positive class) to separate outliers. For example, Wulsin et al. applied Deep Belief Nets (DBNs) in a semi-supervised paradigm to model Electroencephalogram (EEG) waveforms for classification and anomaly detection. The semi-supervised DBN's performance is comparable to that of standard classifiers on the EEG dataset. The semi-supervised DAD approach is popular because it can detect anomalies using only a single class of labels.
Unsupervised Deep Anomaly Detection:
Zong et al. proposed a Deep Autoencoding Gaussian Mixture Model (DAGMM) for unsupervised anomaly detection. Schlegl et al. proposed a deep convolutional generative adversarial network, called AnoGAN, which detects abnormal anatomical images by learning from a variety of normal anatomical images; they trained this model in an unsupervised manner. Unsupervised DAD is widely used since it does not require labeled training data.
II-B Communication-Efficient Federated Learning
Google proposed a privacy-preserving distributed machine learning framework, called FL, to train machine learning models without compromising privacy. Under this framework, different edge devices can contribute to global model training while keeping the training data local. However, communication overhead is the bottleneck preventing FL from being widely used in IIoT. Previous work has focused on designing efficient stochastic gradient descent (SGD) algorithms and on using model compression to reduce the communication overhead of FL. Agarwal et al. proposed an efficient cpSGD algorithm to achieve communication-efficient FL. Reisizadeh et al. used periodic averaging and quantization methods to design a communication-efficient FL framework. Jeong et al. proposed a federated model distillation method to reduce the communication overhead of FL.
However, the above methods do not substantially reduce the number of gradients exchanged between the edge devices and the cloud aggregator, and exchanging a large number of gradients may cause excessive communication overhead for FL. Therefore, in this paper, we propose a Top-k selection-based gradient compression scheme to improve the communication efficiency of FL.
III Preliminaries
In this section, we briefly introduce anomalies, federated deep learning, and gradient compression.
III-A Anomalies
In statistics, anomalies (also called outliers or abnormalities) are data points that differ significantly from other observations. Suppose that $N_1$, $N_2$, and $N_3$ are regions containing most of the observations, so they are regarded as normal data-instance regions. If data points $o_1$ and $o_2$ lie far from these regions, they can be classified as anomalies. To define anomalies more formally, assume that an $n$-dimensional dataset $\{x^{(1)}, \ldots, x^{(m)}\}$ follows a normal distribution with mean $\mu_j$ and variance $\sigma_j^2$ for each dimension $j$, where $x^{(i)} \in \mathbb{R}^n$ and $j \in \{1, \ldots, n\}$. Specifically, for each dimension $x_j$, under the normality assumption we have:

$$p(x_j; \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right).$$

If there is a new vector $x$, the probability $p(x)$ used for anomaly detection can be calculated as follows:

$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2).$$

We can then judge whether the vector $x$ is an anomaly according to whether $p(x)$ falls below a small threshold.
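The per-dimension Gaussian test above can be sketched as follows; the helper names (`fit_gaussian`, `anomaly_prob`) and the threshold value are illustrative, not from the paper:

```python
import math

def gaussian_prob(x, mu, var):
    """Density of x under a 1-D normal distribution N(mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian(data):
    """Estimate per-dimension mean and variance from a list of n-dim points."""
    n = len(data[0])
    mus = [sum(p[j] for p in data) / len(data) for j in range(n)]
    vars_ = [sum((p[j] - mus[j]) ** 2 for p in data) / len(data) for j in range(n)]
    return mus, vars_

def anomaly_prob(x, mus, vars_):
    """p(x) = product of per-dimension densities; a small p(x) suggests an anomaly."""
    p = 1.0
    for xj, mu, var in zip(x, mus, vars_):
        p *= gaussian_prob(xj, mu, var)
    return p

def is_anomaly(x, mus, vars_, eps=1e-3):
    """Flag x as anomalous when its probability falls below the threshold eps."""
    return anomaly_prob(x, mus, vars_) < eps
```

A point near the bulk of the observations yields a large product of densities, while a distant point drives the product toward zero.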
III-B Federated Learning
Traditional distributed deep learning techniques require a certain amount of private data to be aggregated and analyzed at central servers (e.g., cloud servers) during the model training phase using the distributed stochastic gradient descent (D-SGD) algorithm. Such a training process suffers from potential data-privacy leakage risks for IIoT devices. To address these privacy challenges, a collaborative distributed deep learning paradigm, called federated deep learning, was proposed for edge devices to train a global model while keeping the training datasets local, without sharing raw training data. The procedure of FL is divided into three phases: initialization, aggregation, and update. In the initialization phase, we consider FL with $N$ edge devices and a parameter aggregator, i.e., a cloud aggregator, which distributes a global model $w_t$ pre-trained on public datasets (e.g., MNIST, CIFAR-10) to each edge device. Each device $k$ then uses its local dataset $D_k$ of size $|D_k|$ to train and improve the current global model $w_t$ in each iteration. In the aggregation phase, the cloud aggregator collects the local gradients uploaded by the edge nodes (i.e., edge devices). To this end, the local loss function to be optimized is defined as follows:

$$F_k(w) = \frac{1}{|D_k|} \sum_{x \in D_k} \ell(w; x) + \lambda R_k(w),$$

where $F_k(w)$ is the local loss function for edge device $k$, $\ell(w; x)$ is the per-sample loss, $R_k(w)$ is a regularizer for edge device $k$ with coefficient $\lambda$, and $x$ is sampled from the local dataset $D_k$ on the device. In the update phase, the cloud aggregator uses the Federated Averaging (FedAVG) algorithm to obtain a new global model for the next iteration:

$$w_{t+1} = \sum_{k=1}^{N} \frac{|D_k|}{\sum_{i=1}^{N} |D_i|} \, w_t^{k},$$

where $w_t^{k}$ denotes the model update of device $k$ and the weighted sum denotes the average aggregation (i.e., the FedAVG algorithm). The edge devices and the cloud aggregator repeat the above process until the global model converges. This paradigm significantly reduces the risk of privacy leakage by decoupling model training from direct access to the raw training data on edge nodes.
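The FedAVG aggregation step can be illustrated with a minimal sketch, assuming each local model is represented as a flat list of floats and each device reports its local dataset size:

```python
def fedavg(local_weights, local_sizes):
    """Weighted average of local model weights, each weighted by its local
    dataset size, as in the FedAVG aggregation rule."""
    total = sum(local_sizes)
    dim = len(local_weights[0])
    return [
        sum(w[j] * s for w, s in zip(local_weights, local_sizes)) / total
        for j in range(dim)
    ]
```

With equal dataset sizes this reduces to a plain average; devices holding more data pull the global model proportionally closer to their local weights.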
III-C Gradient Compression
Large-scale FL training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training. In this context, Lin et al. stated that 99.9% of the gradient exchange in D-SGD is redundant. To prevent expensive communication bandwidth from limiting large-scale distributed training, gradient compression was proposed to greatly reduce the required bandwidth. Researchers generally use gradient quantization and gradient sparsification to achieve gradient compression: gradient quantization reduces communication bandwidth by quantizing gradients to low-precision values, while gradient sparsification reduces it by transmitting only gradients that exceed a threshold.
For a fully connected (FC) layer in a deep neural network, we have $y = f(Wx + b)$, where $x$ is the input, $b$ is the bias, $W$ is the weight matrix, $f$ is the nonlinear mapping, and $y$ is the output. This formula is the most basic operation in a neural network. For each specific neuron, it simplifies to $y_i = f\left(\sum_j W_{ij} x_j + b_i\right)$, where $f$ is the activation function. Gradient compression compresses the corresponding weight matrix into a sparse matrix, and hence the corresponding formula becomes:

$$y = f(W_s x + b),$$

where $W_s$ represents the compressed (sparse) weight matrix together with index information recording the positions of the retained entries in the original weight matrix $W$. Such a method reduces communication overhead by sparsifying the gradients in the weight matrix $W$.
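The sparse FC computation can be sketched as follows; the triple-based encoding `(row, column, value)` is one possible realization of the position information, and the identity activation is used for clarity:

```python
def dense_fc(x, W, b):
    """y_i = sum_j W[i][j] * x[j] + b[i] (identity activation for clarity)."""
    return [sum(Wij * xj for Wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def sparsify(W, threshold):
    """Keep only entries with |w| > threshold, stored as (i, j, value) triples
    plus the matrix shape -- the 'position information' mentioned in the text."""
    entries = [(i, j, w) for i, row in enumerate(W)
               for j, w in enumerate(row) if abs(w) > threshold]
    return entries, (len(W), len(W[0]))

def sparse_fc(x, entries, shape, b):
    """Forward pass using only the retained sparse entries."""
    y = list(b)
    for i, j, w in entries:
        y[i] += w * x[j]
    return y
```

The sparse forward pass touches only the retained entries, so transmitting `entries` instead of the full matrix is what saves bandwidth.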
IV System Model
We consider a generic setting for on-device deep anomaly detection in IIoT, where a cloud aggregator and edge devices work collaboratively to train a DAD model using a given training algorithm (e.g., LSTM) for a specific task (i.e., anomaly detection), as illustrated in Fig. 1. The edge devices train a shared global model locally on their own datasets (i.e., sensing time-series data from IIoT nodes) and upload their model updates (i.e., gradients) to the cloud aggregator. The cloud aggregator uses the FedAVG algorithm or another aggregation algorithm to aggregate these model updates into a new global model. Finally, the edge devices receive the new global model from the cloud aggregator and use it to achieve accurate and timely anomaly detection.
IV-A System Model Limitations
The proposed framework focuses on a DAD model learning task involving distributed edge devices and a cloud aggregator. In this context, the framework has two limitations: missing labels and communication overhead.
Regarding the missing-label limitation, we assume that a certain proportion of the training samples' labels are missing. The lack of sample labels causes class imbalance, thereby reducing the accuracy of the DAD model. Regarding the communication-overhead limitation, excessive communication overhead arises when a large number of gradients are exchanged between the edge devices and the cloud aggregator, which may prevent the model from converging.
The above restrictions hinder the deployment of DAD models on edge devices, which motivates us to develop a communication-efficient FL-based unsupervised DAD framework to achieve accurate and timely anomaly detection.
IV-B The Proposed Framework
We consider an on-device communication-efficient deep anomaly detection framework that involves multiple edge devices for collaborative model training in IIoT, as illustrated in Fig. 2. In particular, this framework consists of a cloud aggregator and edge devices. Furthermore, the proposed framework also includes two mechanisms: an anomaly detection mechanism and a gradient compression mechanism. More details are described as follows:
Cloud Aggregator: The cloud aggregator is generally a cloud server with strong computing power and rich computing resources. It performs two functions: (1) it initializes the global model and sends it to all edge devices; (2) it aggregates the gradients uploaded by the edge devices until the model converges.
Edge Devices: Edge devices are generally agents and clients, such as whirlpool machines, wind turbines, and vehicles, which host local models and the functional mechanisms described below. Each edge device uses its local dataset (i.e., sensing time-series data from IIoT nodes) to train the global model sent by the cloud aggregator and uploads its gradients to the cloud aggregator until the global model converges. The local model is deployed on the edge device and can perform anomaly detection. In this paper, we use the AMCNN-LSTM model to detect anomalies: it uses attention-based CNN units to capture fine-grained features of sensing time-series data and an LSTM module to detect anomalies accurately and in a timely manner.
The functions of mechanisms are described as follows:
Deep Anomaly Detection Mechanism: The deep anomaly detection mechanism is deployed in the edge devices, which can detect anomalies to reduce economic losses.
Gradient Compression Mechanism: The gradient compression mechanism is deployed in the edge devices, which can compress the local gradients to reduce the number of gradients exchanged between the edge devices and the cloud aggregator, thereby reducing communication overhead.
IV-C Design Goals
In this paper, our goal is to develop an on-device communication-efficient FL framework for deep anomaly detection in IIoT. First, the proposed framework needs to detect anomalies accurately in an unsupervised manner; it therefore uses the unsupervised AMCNN-LSTM model. Second, the proposed framework should significantly improve communication efficiency through its gradient compression mechanism. Third, the performance of the proposed framework should remain comparable to that of traditional FL frameworks.
V A Communication-Efficient On-Device Deep Anomaly Detection Framework
In this section, we first present the attention mechanism-based CNN-LSTM model. This model uses CNN to capture the fine-grained features of sensing time series data and uses LSTM module to accurately and timely detect anomalies. We then propose a deep gradient compression mechanism to further improve the communication efficiency of the proposed framework.
V-A Attention Mechanism-based CNN-LSTM Model
We present an unsupervised AMCNN-LSTM model comprising an input layer, an attention mechanism-based CNN unit, an LSTM unit, and an output layer, as shown in Fig. 3. First, we feed the preprocessed data into the input layer. Second, we use the CNN to capture fine-grained features of the input and utilize the attention mechanism to focus on the important ones among the CNN-captured features. Third, we use the output of the attention mechanism-based CNN unit as the input of the LSTM unit and use the LSTM to predict future time-series data. Finally, we propose an anomaly detection score to detect anomalies.
Preprocessing: We normalize the sensing time series data collected by the IIoT nodes into [0,1] to accelerate the model convergence.
Attention Mechanism-based CNN Unit: First, we introduce an attention mechanism into the CNN unit to improve the focus on important features. In cognitive science, due to the bottleneck of information processing, humans selectively focus on the important parts of information while ignoring other visible information [37, 38, 39]. The attention mechanism can likewise improve model performance by attending to important features. The formal definition of the attention mechanism is given as follows:
$$e_t = \mathrm{score}(q, h_t), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{t'} \exp(e_{t'})}, \qquad c = \sum_t \alpha_t h_t,$$

where $q$ is the matching feature vector based on the current task and is used to interact with the context, $h_t$ is the feature vector of a timestamp in the time series, $e_t$ is the unnormalized attention score, $\alpha_t$ is the normalized attention score, and $c$ is the context feature of the current timestamp calculated based on the attention scores and the feature sequence $\{h_t\}$. In most instances, $e_t = q^{\top} W h_t$, where $W$ is a weight matrix.
Second, we use a CNN unit to extract fine-grained features of the time-series data. The CNN module is formed by stacking multiple layers of one-dimensional (1-D) CNNs, and each layer includes a convolution layer, a batch normalization layer, and a non-linear layer. The module implements sampling aggregation via pooling layers and creates hierarchical structures that gradually extract more abstract features through the stacking of convolutional layers. The module outputs $k$ feature sequences of length $l$, i.e., features of size $(k \times l)$. To further extract significant time-series features, we propose a parallel feature-extraction branch that combines the attention mechanism with the CNN. The attention-mechanism module is composed of feature aggregation and scale restoration. The feature-aggregation part stacks multiple convolution and pooling layers to extract key features from the sequence and uses a small convolution kernel to mine linear relationships. The scale-restoration part restores the key features to size $(k \times l)$, consistent with the output features of the CNN module, and then uses the sigmoid function to constrain the values to [0,1].
Third, we multiply, element-wise, the output features of the CNN module with the importance weights produced by the corresponding attention-mechanism module. Let the output of a sequence processed by the CNN module be denoted $F$, and the output of the corresponding attention module be denoted $A$. We multiply the two outputs element by element, as follows:

$$F'_{i,c} = F_{i,c} \otimes A_{i,c},$$

where $\otimes$ represents element-wise multiplication, $i$ is the position in the time series within the feature layer, and $c$ is the channel. We use the final feature layer $F'$ as the input of the LSTM block.
We introduce the attention mechanism to expand the receptive field of the input, which allows the model to obtain more comprehensive contextual information, thereby learning the important features of the current local sequence. Furthermore, we use the attention module to suppress the interference of unimportant features to the model, thereby solving the problem that the model cannot distinguish the importance of the time series data features.
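The element-wise attention weighting above can be sketched in pure Python for clarity (a real implementation would operate on tensors, e.g., in PyTorch; the nested-list layout here is an illustrative assumption):

```python
import math

def sigmoid(z):
    """Squash a raw attention score into [0, 1]."""
    return 1.0 / (1.0 + math.exp(-z))

def attention_weight(features, scores):
    """Element-wise product F' = F * sigmoid(scores): each CNN feature value is
    scaled by an attention weight in [0, 1], suppressing unimportant positions.
    `features` and `scores` are [channel][position] nested lists of equal shape."""
    return [
        [f * sigmoid(s) for f, s in zip(f_row, s_row)]
        for f_row, s_row in zip(features, scores)
    ]
```

A strongly positive score leaves the feature almost unchanged, while a strongly negative score drives it toward zero, which is exactly the suppression of unimportant features described above.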
In this paper, we use a variant of the recurrent neural network, called LSTM, to accurately predict the sensing time-series data and thereby detect anomalies, as shown in Fig. 3. LSTM uses a well-designed "gate" structure to remove or add information to the cell state; a "gate" is a method of selectively passing information. LSTM cells include forget gates $f_t$, input gates $i_t$, and output gates $o_t$. The computations of the three gate structures are defined as follows:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f),$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i),$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c),$$
$$h_t = o_t \odot \tanh(c_t),$$

where $W_f$, $W_i$, $W_o$, $W_c$ and $b_f$, $b_i$, $b_o$, $b_c$ are the weight matrices and bias vectors for the input vector $x_t$ at time step $t$, respectively; $\sigma$ is the sigmoid activation function; $\odot$ represents element-wise multiplication; $c_t$ is the cell state; $h_{t-1}$ is the state of the hidden layer at time step $t-1$; and $h_t$ is the state of the hidden layer at time step $t$.
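The gate equations can be sketched as a single scalar LSTM step; a real implementation uses weight matrices over vectors, so the scalar, dictionary-based parameters here are purely illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a scalar LSTM cell. W and b hold the parameters of the
    forget (f), input (i), output (o), and candidate (g) gates; each gate
    sees the concatenated pair [h_prev, x_t]."""
    f = sigmoid(W["f"][0] * h_prev + W["f"][1] * x_t + b["f"])
    i = sigmoid(W["i"][0] * h_prev + W["i"][1] * x_t + b["i"])
    o = sigmoid(W["o"][0] * h_prev + W["o"][1] * x_t + b["o"])
    g = math.tanh(W["g"][0] * h_prev + W["g"][1] * x_t + b["g"])
    c_t = f * c_prev + i * g      # forget part of the old state, add the candidate
    h_t = o * math.tanh(c_t)      # expose a gated view of the cell state
    return h_t, c_t
```

With all parameters at zero, every gate outputs 0.5 and the candidate is 0, so the cell state is simply halved each step, which makes the gating behavior easy to verify.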
Anomaly Detection: We use the AMCNN-LSTM model to predict real-time and future sensing time-series data on different edge devices:

$$\hat{x}_t = f_p(x_{t-1}, x_{t-2}, \ldots),$$

where $f_p$ is the prediction function; in this paper, the LSTM unit performs the time-series prediction. We use anomaly scores for anomaly detection, defined as follows:

$$a_t = (e_t - \mu)^{\top} \Sigma^{-1} (e_t - \mu),$$

where $a_t$ is the anomaly score and $e_t = |x_t - \hat{x}_t|$ is the reconstruction error vector; the error vectors for the time series in the sequences are used to estimate the parameters $\mu$ and $\Sigma$ of a Normal distribution using Maximum Likelihood Estimation.

In an unsupervised setting, a point in a sequence is predicted to be "anomalous" when its anomaly score exceeds a threshold, and "normal" otherwise, where the threshold is chosen to maximize $F_{\beta} = (1+\beta^2) P R / (\beta^2 P + R)$, with $P$ denoting precision, $R$ denoting recall, and $\beta$ a weighting parameter.
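The anomaly score can be sketched with a diagonal covariance (a simplifying assumption; the formulation above allows a full covariance matrix), with parameters fitted by MLE over reconstruction errors:

```python
def fit_error_stats(errors):
    """MLE mean and variance of reconstruction errors, per dimension
    (i.e., a diagonal-covariance Normal distribution)."""
    n = len(errors[0])
    mu = [sum(e[j] for e in errors) / len(errors) for j in range(n)]
    var = [sum((e[j] - mu[j]) ** 2 for e in errors) / len(errors) for j in range(n)]
    return mu, var

def anomaly_score(e, mu, var):
    """Mahalanobis-style score a = (e - mu)^T diag(var)^{-1} (e - mu);
    a point is flagged 'anomalous' when the score exceeds a chosen threshold."""
    return sum((ej - mj) ** 2 / vj for ej, mj, vj in zip(e, mu, var))
```

Errors close to the fitted mean score near zero, while errors far from the training distribution score high and cross the detection threshold.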
V-B Gradient Compression Mechanism
If the gradients reach 99.9% sparsity, only the 0.1% of gradients with the largest absolute values are useful for model aggregation. Therefore, we only need to aggregate gradients with large absolute values to update the model. This reduces the byte size of the gradient matrix, which lowers the number of gradients exchanged between the device and the cloud and improves communication efficiency, especially for distributed machine learning systems. Inspired by these facts, we propose a gradient compression mechanism to reduce the gradients exchanged between the cloud aggregator and the edge devices, further improving the communication efficiency of the proposed framework.
When selecting gradients with large absolute values, two situations arise: (1) no gradient value in the gradient matrix exceeds the given threshold; (2) some gradient values are very close to the given threshold. Simply setting the gradients that do not meet the threshold to 0 would cause information loss. Therefore, each device uses a local gradient accumulation scheme to prevent such loss. Specifically, instead of filtering out the smaller gradients, the device keeps them in a buffer and accumulates them until they reach the given threshold. Note that we use D-SGD for iterative updates, and the loss function to be optimized is defined as follows:
$$F(w) = \frac{1}{N} \sum_{k=1}^{N} F_k(w), \qquad w_{t+1} = w_t - \eta \, \frac{1}{Nb} \sum_{k=1}^{N} \sum_{x \in \mathcal{B}_{k,t}} \nabla \ell(w_t; x),$$

where $F(w)$ is the global loss function, $F_k(w)$ is the loss function of local device $k$, $w$ denotes the weights of the model, $N$ is the total number of edge devices, $\eta$ is the learning rate, $\mathcal{B}_{k,t}$ represents the mini-batch of $b$ data samples drawn on device $k$ for the $t$-th round of training, and each local dataset has size $|D_k|$.
Since sparse updates delay small gradients, we use momentum correction and local gradient clipping to mitigate this effect. Momentum correction makes the accumulated small local gradients converge toward the gradients with larger absolute values, thereby accelerating the model's convergence. Local gradient clipping alleviates the gradient-explosion problem. Next, we show that the local gradient accumulation scheme does not affect model convergence. Assume $w^{(i)}$ is the $i$-th weight, $\nabla^{(i)}(w_t)$ denotes the corresponding gradient at iteration $t$ under the aggregation algorithm in D-SGD, and $\eta$ is the learning rate. If the accumulated $i$-th gradient does not exceed the threshold until the $T$-th iteration, at which point it triggers a model update, we have:

$$w_{t+T}^{(i)} = w_t^{(i)} - \eta \sum_{\tau=0}^{T-1} \nabla^{(i)}(w_{t+\tau}),$$

after which we update $w^{(i)}$ and reset the accumulated gradient to 0. If instead the $i$-th gradient reaches the threshold at every iteration, a model update is triggered each round, and we have:

$$w_{t+1}^{(i)} = w_t^{(i)} - \eta \, \nabla^{(i)}(w_t).$$

In both cases, after $T$ iterations the weight has been changed by the same accumulated quantity, so the result of the local gradient accumulation scheme is consistent with that of the optimization algorithm in D-SGD.
The specific implementation phases of the gradient compression mechanism are given as follows:
Phase 1, Local Training: Edge devices use the local dataset to train the local model. In particular, we use the gradient accumulation scheme to accumulate local small gradients.
Phase 2, Gradient Compression: Each edge device uses Algorithm 1 to compress its gradients and uploads the sparse gradients (i.e., only gradients larger than a threshold are transmitted) to the cloud aggregator. Note that an edge device also sends its remaining locally accumulated gradients to the cloud aggregator once the local gradient accumulation exceeds the threshold.
Phase 3, Gradient Aggregation: The cloud aggregator obtains the global model by aggregating sparse gradients and sends this global model to the edge devices.
The gradient compression algorithm is thus presented in Algorithm 1.
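A minimal sketch of Top-k selection with local gradient accumulation follows; it is one possible realization of the mechanism described above (momentum correction and gradient clipping are omitted, and flat lists stand in for gradient tensors):

```python
def topk_compress(residual, grad, k):
    """One round of Top-k gradient sparsification with local accumulation:
    add the new gradient into the residual buffer, upload only the k entries
    with the largest magnitude, and keep the rest locally for later rounds."""
    acc = [r + g for r, g in zip(residual, grad)]
    # Indices of the k largest-magnitude accumulated gradients.
    top = sorted(range(len(acc)), key=lambda j: abs(acc[j]), reverse=True)[:k]
    sparse = {j: acc[j] for j in top}  # what is uploaded to the cloud aggregator
    new_residual = [0.0 if j in sparse else acc[j] for j in range(len(acc))]
    return sparse, new_residual
```

Small gradients are never discarded: they stay in the residual buffer and are eventually uploaded once their accumulated magnitude ranks among the largest, mirroring the accumulation argument above.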
In this section, the proposed framework is applied to four real-world datasets for performance evaluation: power demand (https://archive.ics.uci.edu/ml/datasets/), space shuttle (https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)), ECG (https://physionet.org/about/database/), and engine (https://archive.ics.uci.edu/ml/datasets.php). These are time-series datasets collected by different types of sensors across different fields. For example, the power demand dataset is composed of electricity-consumption data recorded by electricity meters. The datasets contain both normal and anomalous subsequences; Table I reports, for each dataset, the number of original sequences, normal subsequences, and anomalous subsequences. For the power demand dataset, an anomalous subsequence indicates that the electricity meter has failed or stopped working. We therefore use these datasets to train an FL model that can detect anomalies. We divide each dataset into a training set and a test set in a 7:3 ratio. We implement the proposed framework using PyTorch and PySyft. The experiments are conducted on a virtual workstation with the Ubuntu 18.04 operating system, an Intel(R) Core(TM) i5-4210M CPU, 16 GB RAM, and a 512 GB SSD.
VI-A Evaluation Setup
In this experiment, to determine the threshold hyperparameter of the gradient compression mechanism, we first apply a simple CNN (i.e., a CNN with 2 convolutional layers followed by 1 fully connected layer) in the proposed framework to perform classification tasks on the MNIST and CIFAR-10 datasets. The pixels in all datasets are normalized into [0,1]. During the simulation, we fix the number of edge devices, the learning rate, the number of training epochs, and the mini-batch size, and, following the referenced configuration, we set the corresponding compression hyperparameter to 0.05.
We adopt the Root Mean Square Error (RMSE) to measure the prediction performance of the AMCNN-LSTM model:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2},$$

where $y_t$ is the observed sensing time-series data and $\hat{y}_t$ is the predicted sensing time-series data.
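The RMSE metric can be computed directly from the two series:

```python
import math

def rmse(observed, predicted):
    """Root Mean Square Error between observed and predicted time series."""
    n = len(observed)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(observed, predicted)) / n)
```

A perfect prediction yields an RMSE of 0, and larger deviations are penalized quadratically before the root is taken.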
VI-B Hyperparameter Selection of the Proposed Framework
In the context of the deep gradient compression scheme, proper hyperparameter selection, i.e., the threshold on the absolute gradient value, is a notable factor in the proposed framework's performance. In this section, we investigate the performance of the proposed framework under different thresholds and search for the best-performing one, using the MNIST and CIFAR-10 datasets for evaluation. As shown in Fig. 4, we observe that the more gradients are transmitted, the better the performance of the proposed framework. For the MNIST task, the accuracy under the tightest threshold is 97.25%, while the loosest threshold achieves 99.08%; that is, transmitting roughly 300 times more gradients improves accuracy by only 1.83%. We thus observe a trade-off between the gradient threshold and accuracy, and we choose the threshold that achieves a good balance between compression ratio and accuracy for our scheme.
VI-C Performance of the Proposed Framework
We compare the proposed AMCNN-LSTM model with the CNN-LSTM, LSTM, Gated Recurrent Unit (GRU), Stacked Auto-Encoders (SAEs), and Support Vector Machine (SVM) methods under an identical simulation configuration. Among these methods, AMCNN-LSTM is an FL-based model, while the rest are centralized. All of them are popular DAD models for general anomaly detection applications. We evaluate these models on four real-world datasets: power demand, space shuttle, ECG, and engine.
First, we compare the anomaly detection accuracy of the proposed model with the competing methods. We determine the threshold hyperparameters based on the precision and recall of the model on the training set; for the power demand, space shuttle, ECG, and engine datasets they are 0.75, 0.80, 0.80, and 0.60, respectively. In Fig. 5, experimental results show that the proposed model achieves the highest accuracy on all four datasets. For example, on the power demand dataset, the accuracy of the AMCNN-LSTM model is 96.85%, which is 7.87% higher than that of the SVM model. The experimental results also indicate that AMCNN-LSTM is more robust across different datasets. The reason is that we use the on-device FL framework to train and update the model, which learns time-series features from different edge devices as much as possible, thereby improving robustness. Furthermore, the FL framework allows edge-device owners to update the models on their devices in a timely manner.
Second, we evaluate the prediction error of the proposed model and the competing methods. As shown in Fig. 5, the experimental results show that the proposed model achieves the best performance on the four real-world datasets. For the ECG dataset, the RMSE of the AMCNN-LSTM model is 63.9% lower than that of the SVM model. The reason is that the AMCNN-LSTM model uses AMCNN units to capture important fine-grained features and to prevent the memory loss and gradient dispersion problems that often occur in encoder-decoder models such as LSTM and GRU. Furthermore, the proposed model retains the advantage of the LSTM unit in predicting time-series data. Therefore, the proposed model can accurately predict time-series data.
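For reference, the RMSE metric used for this comparison is the root of the mean squared prediction error, computed as below (a standard definition; the paper does not spell out its exact formula):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted series."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```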
Therefore, the proposed model not only accurately detects abnormalities, but also accurately predicts time series data.
VI-D Communication Efficiency of the Proposed Framework
In this section, we compare the communication efficiency of the FL framework with the gradient compression mechanism (GCM) against the traditional FL framework without GCM. We apply the same models (i.e., AMCNN-LSTM, CNN-LSTM, LSTM, GRU, SAEs, and SVM) in both frameworks. Note that we fix the communication overhead of each round, so the running time of the model reflects the communication efficiency. Fig. 7 shows the running time of FL with GCM and FL without GCM for the different models. We observe that the running time of the FL framework with GCM is about 50% of that of the framework without GCM. The reason is that GCM reduces the number of gradients exchanged between the edge devices and the cloud aggregator. As shown in Section VI-B, GCM can compress the gradient by 300 times without compromising accuracy. Therefore, the proposed communication-efficient framework is practical and effective in real-world applications.
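The Top-k selection underlying GCM can be sketched as follows. This is a generic sketch of Top-k gradient sparsification, not the paper's exact mechanism: an edge device transmits only the k largest-magnitude gradient entries (as index/value pairs) and keeps the remainder locally as residuals.

```python
import numpy as np

def top_k_compress(grad, k):
    """Transmit only the k gradient entries with the largest magnitude,
    as (indices, values); the rest stay on the device as residuals."""
    flat = grad.ravel()
    idx = np.argsort(np.abs(flat))[-k:]  # indices of the k largest |g|
    values = flat[idx]
    residual = flat.copy()
    residual[idx] = 0.0                  # accumulated locally for later rounds
    return idx, values, residual

# With k=2, only the two dominant entries (-0.9 and 0.2) are uploaded.
grad = np.array([0.05, -0.9, 0.2, 0.01])
idx, vals, res = top_k_compress(grad, 2)
```

Transmitting k index/value pairs instead of the full gradient vector is what yields the roughly 300x reduction in exchanged data reported above.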
Due to the trade-off between privacy and model performance, we discuss the privacy of the proposed framework in terms of data access and model performance:
Data Access: The FL framework allows edge devices to keep their datasets locally while collaboratively learning deep learning models, which means that no third party can access the users' raw data. Therefore, the FL-based model can achieve anomaly detection without compromising privacy.
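The server-side step of this collaboration can be sketched as a FedAvg-style weighted average of model parameters; only parameters, never raw data, leave the devices. This is a generic sketch under that assumption, and the paper's exact aggregation rule may differ.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Cloud-side aggregation: weighted average of model parameters
    from edge devices, proportional to local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two devices with 10 and 30 local samples; the larger dataset
# contributes 3/4 of the aggregated model.
w1 = np.array([1.0, 2.0])
w2 = np.array([3.0, 4.0])
avg = federated_average([w1, w2], [10, 30])
```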
Model Performance: Although the FL-based model can protect privacy, model performance remains an important metric for measuring the quality of the model. The experimental results show that the performance of the proposed model is comparable to that of many advanced centralized machine learning models, such as the CNN-LSTM, LSTM, GRU, and SVM models. In other words, the proposed model achieves a good compromise between privacy and model performance.
In this paper, we propose a novel communication-efficient on-device FL-based deep anomaly detection framework for sensing time-series data in IIoT. First, we introduce an FL framework that enables decentralized edge devices to collaboratively train an anomaly detection model, which solves the problem of data islands. Second, we propose an attention mechanism-based CNN-LSTM (AMCNN-LSTM) model to accurately detect anomalies. The AMCNN-LSTM model uses attention mechanism-based CNN units to capture important fine-grained features and prevent memory loss and gradient dispersion problems, while retaining the advantages of the LSTM unit in predicting time-series data. We evaluate the proposed model on four real-world datasets and compare it with the CNN-LSTM, LSTM, GRU, SAEs, and SVM methods; the experimental results show that the AMCNN-LSTM model achieves the highest accuracy on all four datasets. Third, we propose a gradient compression mechanism based on Top-k selection to improve communication efficiency. Experimental results validate that this mechanism can compress the gradient by 300 times without losing accuracy. To the best of our knowledge, this is one of the pioneering works on deep anomaly detection using on-device FL.
In the future, we will focus on privacy-enhanced FL frameworks and more robust anomaly detection models, because the FL framework is vulnerable to attacks by malicious participants, and a more robust model can be applied to a wider range of application scenarios.
-  H. Peng and X. Shen, “Deep reinforcement learning based resource management for multi-access edge computing in vehicular networks,” IEEE Transactions on Network Science and Engineering, to appear.
-  Y. Wu et al., “Dominant data set selection algorithms for electricity consumption time-series data analysis based on affine transformation,” IEEE Internet of Things Journal, vol. 7, no. 5, pp. 4347–4360, 2020.
-  Y. Peng et al., “Hierarchical edge computing: A novel multi-source multi-dimensional data anomaly detection scheme for industrial internet of things,” IEEE Access, vol. 7, pp. 111257–111270, 2019.
-  H. Peng et al., “Toward energy-efficient and robust large-scale wsns: a scale-free network approach,” IEEE Journal on Selected Areas in Communications, vol. 34, no. 12, pp. 4035–4047, 2016.
-  P. Malhotra et al., “Lstm-based encoder-decoder for multi-sensor anomaly detection,” arXiv preprint arXiv:1607.00148, 2016.
-  T. Luo et al., “Distributed anomaly detection using autoencoder neural networks in wsn for iot,” in 2018 IEEE International Conference on Communications. IEEE, 2018, pp. 1–6.
-  S. Garg et al., “A hybrid deep learning-based model for anomaly detection in cloud datacenter networks,” IEEE Transactions on Network and Service Management, vol. 16, no. 3, pp. 924–935, 2019.
-  T. D. Nguyen et al., “Dïot: A federated self-learning anomaly detection system for iot,” in 2019 IEEE 39th International Conference on Distributed Computing Systems. IEEE, 2019, pp. 756–767.
-  S. Garg et al., “A multi-stage anomaly detection scheme for augmenting the security in iot-enabled applications,” Future Generation Computer Systems, vol. 104, pp. 105–118, 2020.
-  S. Garg et al., “En-abc: An ensemble artificial bee colony based anomaly detection scheme for cloud environment,” Journal of Parallel and Distributed Computing, vol. 135, pp. 219–233, 2020.
-  R. Chalapathy and S. Chawla, “Deep learning for anomaly detection: A survey,” arXiv preprint arXiv:1901.03407, 2019.
-  P. Malhotra et al., “Long short term memory networks for anomaly detection in time series,” in Proceedings, vol. 89. Presses universitaires de Louvain, 2015.
-  W. Lu et al., “Deep hierarchical encoding model for sentence semantic matching,” Journal of Visual Communication and Image Representation, p. 102794, 2020.
-  H. Lu et al., “Deep fuzzy hashing network for efficient image retrieval,” IEEE Transactions on Fuzzy Systems, 2020.
-  M. Munir et al., “Deepant: A deep learning approach for unsupervised anomaly detection in time series,” IEEE Access, vol. 7, pp. 1991–2005, 2018.
-  E. Lundin et al., “Anomaly-based intrusion detection: privacy concerns and other problems,” Computer networks, vol. 34, no. 4, pp. 623–640, 2000.
-  I. Butun et al., “Anomaly detection and privacy preservation in cloud-centric internet of things,” in 2015 IEEE International Conference on Communication Workshop. IEEE, 2015, pp. 2610–2615.
-  Y. Liu et al., “Privacy-preserving traffic flow prediction: A federated learning approach,” IEEE Internet of Things Journal, pp. 1–1, 2020.
-  R. Ito et al., “An on-device federated learning approach for cooperative anomaly detection,” arXiv preprint arXiv:2002.12301, 2020.
-  M. Tsukada et al., “A neural network based on-device learning anomaly detector for edge devices,” arXiv preprint arXiv:1907.10147, 2019.
-  S. M. Erfani et al., “High-dimensional and large-scale anomaly detection using a linear one-class svm with deep learning,” Pattern Recognition, vol. 58, pp. 121–134, 2016.
-  T. S. Buda et al., “Deepad: A generic framework based on deep learning for time series anomaly detection,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2018, pp. 577–588.
-  D. Wulsin et al., “Modeling electroencephalography waveforms with semi-supervised deep belief nets: fast classification and anomaly measurement,” Journal of neural engineering, vol. 8, no. 3, p. 036015, 2011.
-  B. Zong et al., “Deep autoencoding gaussian mixture model for unsupervised anomaly detection,” 2018.
-  T. Schlegl et al., “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery,” in International conference on information processing in medical imaging. Springer, 2017, pp. 146–157.
-  J. Konečný et al., “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
-  N. Agarwal et al., “cpsgd: Communication-efficient and differentially-private distributed sgd,” in Advances in Neural Information Processing Systems, 2018, pp. 7564–7575.
-  A. Reisizadeh et al., “Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization,” arXiv preprint arXiv:1909.13014, 2019.
-  E. Jeong et al., “Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data,” arXiv preprint arXiv:1811.11479, 2018.
-  J. Wangni et al., “Gradient sparsification for communication-efficient distributed optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 1299–1309.
-  L. Zhao et al., “Shielding collaborative learning: Mitigating poisoning attacks through client-side detection,” arXiv preprint arXiv:1910.13111, 2019.
-  Y. LeCun et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
-  Y. Lin et al., “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
-  D. Alistarh et al., “Qsgd: Communication-efficient sgd via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
-  J. Wangni et al., “Gradient sparsification for communication-efficient distributed optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 1299–1309.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  V. Mnih et al., “Recurrent models of visual attention,” in Advances in neural information processing systems, 2014, pp. 2204–2212.
-  A. Vaswani et al., “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  T. Ryffel et al., “A generic framework for privacy preserving deep learning,” arXiv preprint arXiv:1811.04017, 2018.
-  P. Malhotra et al., “Lstm-based encoder-decoder for multi-sensor anomaly detection,” arXiv preprint arXiv:1607.00148, 2016.
-  T.-Y. Kim and S.-B. Cho, “Web traffic anomaly detection using c-lstm neural networks,” Expert Systems with Applications, vol. 106, pp. 66–76, 2018.
-  Y. Guo et al., “Multidimensional time series anomaly detection: A gru-based gaussian mixture variational autoencoder approach,” in Asian Conference on Machine Learning, 2018, pp. 97–112.
-  N. Chouhan, A. Khan et al., “Network anomaly detection using channel boosted and residual learning based deep convolutional neural network,” Applied Soft Computing, vol. 83, p. 105612, 2019.
-  S. M. Erfani et al., “High-dimensional and large-scale anomaly detection using a linear one-class svm with deep learning,” Pattern Recognition, vol. 58, pp. 121–134, 2016.