I Introduction
The widespread deployment of edge devices in the Industrial Internet of Things (IIoT) paradigm has spawned a variety of emerging edge computing applications, such as smart manufacturing, intelligent transportation, and intelligent logistics [1]. Edge devices provide powerful computation resources that enable real-time, flexible, and quick decision making for IIoT applications, which has greatly promoted the development of Industry 4.0 [2]. However, IIoT applications suffer from critical security risks caused by abnormal IIoT nodes, which hinders the rapid development of IIoT. For example, in smart manufacturing scenarios, industrial devices acting as IIoT nodes (e.g., engines with sensors) that exhibit abnormal behaviors (e.g., abnormal traffic and irregular reporting frequency) may interrupt industrial production, resulting in huge economic losses for factories [3, 4]. Edge devices (e.g., industrial robots) generally collect sensing data, especially time-series data, from IIoT nodes to analyze and capture the behaviors and operating conditions of the IIoT nodes via edge computing [5]. Therefore, such sensing time-series data can be used to detect the anomalous behaviors of IIoT nodes [6].
To address abnormality problems in IIoT devices, typical methods perform anomaly detection on the affected IIoT devices [7, 8, 9, 10]. Previous work focused on utilizing deep anomaly detection (DAD) [11] approaches to detect abnormal behaviors of IIoT devices by analyzing sensing time-series data. DAD techniques can learn hierarchical discriminative features from historical time-series data. In [12, 13, 14], the authors proposed Long Short-Term Memory (LSTM) network-based deep learning models to achieve anomaly detection on sensing time-series data. Munir et al. in [15] proposed a novel DAD approach, called DeepAnT, which uses a deep Convolutional Neural Network (CNN) to predict anomalous values. Although existing DAD approaches have achieved success in anomaly detection, they cannot be directly applied to IIoT scenarios with distributed edge devices for timely and accurate anomaly detection. The reasons are twofold: (i) most detection models in traditional approaches are not flexible enough; the edge devices lack dynamically and automatically updated detection models for different scenarios, and hence the models fail to accurately predict frequently updated time-series data [8]; (ii) due to privacy concerns, edge devices are unwilling to share their collected time-series data with each other, so the data exists in the form of "islands." These data islands significantly degrade the performance of anomaly detection. Furthermore, it is often overlooked that the data may contain sensitive private information, which leads to potential privacy leakage. For example, an anomaly detection model may reveal a patient's heart disease history when detecting the patient's abnormal pulse [16, 17]. To address the above challenges, a promising on-device privacy-preserving distributed machine learning paradigm, called on-device federated learning (FL), was proposed for edge devices to train a global DAD model while keeping the training datasets local, without sharing raw training data
[18]. Such a framework allows edge devices to collaboratively train an on-device DAD model without compromising privacy. For example, the authors in [19] proposed an on-device FL-based approach to achieve collaborative anomaly detection. Tsukada et al. in [20] utilized the FL framework to propose a Backpropagation Neural Network (BPNN)-based approach for anomaly detection. However, previous research ignores the communication overhead of model training with FL among large-scale edge devices. Expensive communication may cause excessive overhead and long convergence time for edge devices, so that the on-device DAD model cannot detect anomalies quickly. Therefore, it is necessary to develop a communication-efficient on-device FL framework to achieve accurate and timely anomaly detection for edge devices.
In this paper, we propose a communication-efficient on-device FL framework that leverages an attention mechanism-based CNN-LSTM (AMCNN-LSTM) model to achieve accurate and timely anomaly detection for edge devices. First, we introduce an FL framework to enable distributed edge devices to collaboratively train a global DAD model without compromising privacy. Second, we propose the AMCNN-LSTM model to detect anomalies. Specifically, we use attention-based CNNs to extract fine-grained features of historically observed sensing time-series data and use LSTM modules for time-series prediction. Such a model can prevent memory loss and gradient dispersion problems. Third, to further improve the communication efficiency of the proposed framework, we propose a gradient compression mechanism based on Top-k selection to reduce the number of gradients uploaded by the edge devices. We evaluate the proposed framework on four real-world datasets: power demand, space shuttle, ECG, and engine. Experimental results show that the proposed framework achieves high communication efficiency as well as accurate and timely anomaly detection. The contributions of this paper are summarized as follows:

We introduce a federated learning framework to develop an on-device collaborative deep anomaly detection model for edge devices in IIoT.

We propose an attention mechanism-based CNN-LSTM model to detect anomalies, which uses CNNs to capture the fine-grained features of time-series data and uses an LSTM module to detect anomalies accurately and in a timely manner.

We propose a Top-k selection-based gradient compression scheme to improve the proposed framework's communication efficiency. Such a scheme decreases communication overhead by reducing the gradient parameters exchanged between the edge devices and the cloud aggregator.

We conduct extensive experiments on four real-world datasets to demonstrate that the proposed framework can accurately detect anomalies with low communication overhead.
II Related Work
II-A Deep Anomaly Detection
Deep Anomaly Detection (DAD), which serves the function of detecting anomalies, has long been an active research topic in IIoT. Previous research on DAD can generally be divided into three categories: supervised, semi-supervised, and unsupervised DAD approaches.
Supervised Deep Anomaly Detection:
Supervised deep anomaly detection typically uses the labels of normal and abnormal data to train a deep supervised binary or multi-class classifier. For example, Erfani et al. in [21] proposed a supervised Support Vector Machine (SVM) classifier for high-dimensional data to classify normal and abnormal data. Despite the success of supervised DAD methods in anomaly detection, these methods are not as popular as semi-supervised or unsupervised methods due to the lack of labeled training data [22]. Furthermore, supervised DAD methods perform poorly on data with class imbalance, i.e., where the total number of positive-class instances is much larger than the total number of negative-class instances [12].

Semi-supervised Deep Anomaly Detection:
Since labels for normal instances are easier to obtain than those for anomalies, semi-supervised DAD techniques utilize labels of a single (normally the positive) class to separate outliers [11]. For example, Wulsin et al. in [23] applied Deep Belief Nets (DBNs) in a semi-supervised paradigm to model Electroencephalogram (EEG) waveforms for classification and anomaly detection. The semi-supervised DBN's performance is comparable to that of standard classifiers on the EEG dataset. The semi-supervised DAD approach is popular because it can detect anomalies using labels from only a single class.

Unsupervised Deep Anomaly Detection: Unsupervised deep anomaly detection techniques use the inherent properties of data instances to detect outliers [11]. For example, Zong et al. in [24]
proposed a Deep Autoencoding Gaussian Mixture Model (DAGMM) for unsupervised anomaly detection. Schlegl et al. in [25] proposed a deep convolutional generative adversarial network, called AnoGAN, which detects abnormal anatomical images by learning from a variety of normal anatomical images. They trained this model in an unsupervised manner. Unsupervised DAD is widely used since it does not require labeled training data.
II-B Communication-Efficient Federated Learning
Google proposed a privacy-preserving distributed machine learning framework, called FL, to train machine learning models without compromising privacy [26]. With this framework, different edge devices can contribute to global model training while keeping the training data local. However, communication overhead is the bottleneck preventing FL from being widely used in IIoT [18]. Previous work has focused on designing efficient stochastic gradient descent (SGD) algorithms and using model compression to reduce the communication overhead of FL. Agarwal et al. in [27] proposed an efficient cpSGD algorithm to achieve communication-efficient FL. Reisizadeh et al. in [28] used periodic averaging and quantization methods to design a communication-efficient FL framework. Jeong et al. in [29] proposed a federated model distillation method to reduce the communication overhead of FL. However, the above methods do not substantially reduce the number of gradients exchanged between edge devices and the cloud aggregator, and a large number of exchanged gradients may cause excessive communication overhead for FL [30]. Therefore, in this paper, we propose a Top-k selection-based gradient compression scheme to improve the communication efficiency of FL.
III Preliminary
In this section, we briefly introduce anomalies, federated deep learning, and gradient compression as follows.
III-A Anomalies
In statistics, anomalies (also called outliers or abnormalities) are data points that differ significantly from other observations [11]. Regions composed of the majority of observations are regarded as normal data instance regions; data points that lie far from these regions can be classified as anomalies. To define anomalies more formally, we assume that an n-dimensional dataset $X = \{x_1, x_2, \ldots, x_m\}$ follows a normal distribution with mean $\mu_j$ and variance $\sigma_j^2$ for each dimension $j$, where $x_i \in \mathbb{R}^n$ and $j \in \{1, \ldots, n\}$. Specifically, under the assumption of the normal distribution, we have:

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_i^{(j)}, \qquad \sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_i^{(j)} - \mu_j\right)^2. \tag{1}$$

If there is a new vector $x$, the probability $p(x)$ can be calculated as follows:

$$p(x) = \prod_{j=1}^{n} p\left(x^{(j)}; \mu_j, \sigma_j^2\right) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{\left(x^{(j)} - \mu_j\right)^2}{2\sigma_j^2}\right). \tag{2}$$

We can then judge whether the vector $x$ is an anomaly according to the probability value (e.g., $x$ is flagged as anomalous when $p(x)$ falls below a threshold).
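To make this formulation concrete, the following sketch (hypothetical helper names; assumes a diagonal Gaussian, as in Eqs. (1)-(2)) fits per-dimension means and variances on normal data and scores a new vector by the product of its per-dimension densities:

```python
import numpy as np

def fit_gaussian(X):
    """MLE of per-dimension mean and variance on (mostly) normal data."""
    return X.mean(axis=0), X.var(axis=0)

def anomaly_probability(x, mu, var):
    """Product of per-dimension Gaussian densities, p(x) in Eq. (2)."""
    densities = np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return float(np.prod(densities))

def is_anomaly(x, mu, var, epsilon):
    """Flag x as anomalous when p(x) falls below the threshold epsilon."""
    return anomaly_probability(x, mu, var) < epsilon
```

Points near the normal regions receive a much higher probability than points far away, so a single threshold separates the two cases.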
III-B Federated Learning
Traditional distributed deep learning techniques require a certain amount of private data to be aggregated and analyzed at central servers (e.g., cloud servers) during the model training phase using a distributed stochastic gradient descent (DSGD) algorithm [31]. Such a training process exposes IIoT devices to potential data privacy leakage. To address this privacy challenge, a collaborative distributed deep learning paradigm, called federated deep learning, was proposed for edge devices to train a global model while keeping the training datasets local, without sharing raw training data [18]. The FL procedure is divided into three phases: the initialization phase, the aggregation phase, and the update phase. In the initialization phase, we consider FL with $K$ edge devices and a parameter aggregator, i.e., a cloud aggregator, which distributes a global model $w$, pre-trained on public datasets (e.g., MNIST [32], CIFAR-10 [33]), to each edge device. Following that, each device $k$ uses its local dataset $D_k$ of size $n_k$ to train and improve the current global model in each iteration. In the aggregation phase, the cloud aggregator collects the local gradients uploaded by the edge nodes (i.e., edge devices). To do so, the local loss function to be optimized is defined as follows:

$$F_k(w) = \frac{1}{n_k} \sum_{x_i \in D_k} f_i(w; x_i) + \lambda R_k(w), \tag{3}$$

where $F_k(w)$ is the local loss function for edge device $k$, $k \in \{1, \ldots, K\}$, $R_k(w)$ is a regularizer function for edge device $k$ with coefficient $\lambda$, and $x_i$ is sampled from the local dataset $D_k$ on the device. In the update phase, the cloud aggregator uses the Federated Averaging (FedAVG) algorithm [26] to obtain a new global model for the next iteration; thus we have:

$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^{k}, \tag{4}$$

where $w_{t+1}^{k}$ denotes the model update of device $k$ and the weighted sum denotes the average aggregation (i.e., the FedAVG algorithm), with $n = \sum_{k} n_k$. Both the edge devices and the cloud aggregator repeat the above process until the global model converges. This paradigm significantly reduces the risk of privacy leakage by decoupling model training from direct access to the raw training data on the edge nodes.
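As a sketch of the update phase, the FedAVG aggregation can be written as a size-weighted average of the local models (hypothetical helper name; plain NumPy arrays stand in for model layers):

```python
import numpy as np

def fedavg(local_weights, dataset_sizes):
    """Weighted aggregation: the new global model is the n_k/n weighted
    average of the K local models; each model is a list of layer arrays."""
    n = float(sum(dataset_sizes))
    num_layers = len(local_weights[0])
    return [
        sum(w[layer] * (n_k / n) for w, n_k in zip(local_weights, dataset_sizes))
        for layer in range(num_layers)
    ]
```

For equal local dataset sizes this reduces to a plain average of the local models.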
III-C Gradient Compression
Large-scale FL training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training [34]. In this context, Lin et al. in [34] showed that 99.9% of the gradient exchange in DSGD is redundant. To prevent expensive communication bandwidth from limiting large-scale distributed training, gradient compression was proposed to greatly reduce the required bandwidth. Researchers generally use gradient quantization [35] and gradient sparsification [36] to achieve gradient compression. Gradient quantization reduces communication bandwidth by quantizing gradients to low-precision values. Gradient sparsification reduces communication bandwidth through threshold quantization.
For a fully connected (FC) layer in a deep neural network, we have $y = \sigma(Wx + b)$, where $x$ is the input, $b$ is the bias, $W$ is the weight matrix, $\sigma$ is the nonlinear mapping, and $y$ is the output. This formula is the most basic operation in a neural network. For each specific neuron $i$, the above formula can be simplified to $y_i = \phi\left(\sum_j W_{ij} x_j + b_i\right)$, where $\phi$ is the activation function. Gradient compression compresses the corresponding weight matrix into a sparse matrix, and hence the corresponding formula is given as follows:

$$y_i = \phi\left(\sum_{j \in I_i} W^{s}_{ij} x_j + b_i\right), \tag{5}$$

where $W^{s}$ represents the compressed (sparse) weight matrix and $I_i$ represents the position information of the retained gradients in the weight matrix $W$. Such a method reduces communication overhead by sparsifying the gradients of the weight matrix $W$.
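A minimal sketch of the value/position representation behind this idea (hypothetical helper names): only entries above a magnitude threshold are kept, together with their flat positions, from which the sparse matrix can be rebuilt on the receiving side:

```python
import numpy as np

def sparsify(grad, threshold):
    """Keep only gradient entries with |g| > threshold; return their
    values and flat positions in the matrix."""
    flat = grad.ravel()
    idx = np.flatnonzero(np.abs(flat) > threshold)
    return flat[idx], idx

def densify(values, idx, shape):
    """Rebuild the (mostly zero) matrix from values and positions."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = values
    return flat.reshape(shape)
```

Transmitting only `(values, idx)` instead of the full matrix is what reduces the communication bandwidth.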
IV System Model
We consider a generic setting for on-device deep anomaly detection in IIoT, where a cloud aggregator and edge devices work collaboratively to train a DAD model using a given training algorithm (e.g., LSTM) for a specific task (i.e., an anomaly detection task), as illustrated in Fig. 1. The edge devices train a shared global model locally on their own datasets (i.e., sensing time-series data from IIoT nodes) and upload their model updates (i.e., gradients) to the cloud aggregator. The cloud aggregator uses the FedAVG algorithm, or another aggregation algorithm, to aggregate these model updates and obtain a new global model. Finally, the edge devices receive the new global model from the cloud aggregator and use it to achieve accurate and timely anomaly detection.
IV-A System Model Limitations
The proposed framework focuses on a DAD model learning task involving distributed edge devices and a cloud aggregator. In this context, the framework faces two limitations: missing labels and communication overhead.

Regarding the missing-label limitation, we assume that the labels of a certain proportion of the training samples are missing. The lack of sample labels causes a class imbalance problem, thereby reducing the accuracy of the DAD model. Regarding the communication-overhead limitation, excessive communication overhead arises when a large number of gradients are exchanged between the edge devices and the cloud aggregator, which may prevent the model from converging [29].

These limitations hinder the deployment of DAD models on edge devices, which motivates us to develop a communication-efficient FL-based unsupervised DAD framework to achieve accurate and timely anomaly detection.
IV-B The Proposed Framework
We consider an on-device communication-efficient deep anomaly detection framework that involves multiple edge devices in collaborative model training in IIoT, as illustrated in Fig. 2. In particular, this framework consists of a cloud aggregator and edge devices. Furthermore, the proposed framework includes two mechanisms: an anomaly detection mechanism and a gradient compression mechanism. More details are described as follows:

Cloud Aggregator: The cloud aggregator is generally a cloud server with strong computing power and rich computing resources. The cloud aggregator performs two functions: (1) it initializes the global model and sends it to all edge devices; (2) it aggregates the gradients uploaded by the edge devices until the model converges.

Edge Devices: Edge devices are generally agents and clients, such as household appliances, wind turbines, and vehicles, which contain local models and functional mechanisms (see below for more details). Each edge device uses its local dataset (i.e., sensing time-series data from IIoT nodes) to train the global model sent by the cloud aggregator and uploads its gradients to the cloud aggregator until the global model converges. The local model is deployed on the edge device, where it performs anomaly detection. In this paper, we use the AMCNN-LSTM model to detect anomalies, which uses CNNs to capture the fine-grained features of sensing time-series data and uses an LSTM module to detect anomalies accurately and in a timely manner.
The functions of mechanisms are described as follows:

Deep Anomaly Detection Mechanism: The deep anomaly detection mechanism is deployed in the edge devices, which can detect anomalies to reduce economic losses.

Gradient Compression Mechanism: The gradient compression mechanism is deployed in the edge devices, which can compress the local gradients to reduce the number of gradients exchanged between the edge devices and the cloud aggregator, thereby reducing communication overhead.
IV-C Design Goals
In this paper, our goal is to develop an on-device communication-efficient FL framework for deep anomaly detection in IIoT. First, the proposed framework should detect anomalies accurately in an unsupervised manner; to this end, it uses the unsupervised AMCNN-LSTM model. Second, the proposed framework should significantly improve communication efficiency by using a gradient compression mechanism. Third, the performance of the proposed framework should be comparable to that of traditional FL frameworks.
V A Communication-Efficient On-Device Deep Anomaly Detection Framework
In this section, we first present the attention mechanism-based CNN-LSTM model. This model uses CNNs to capture the fine-grained features of sensing time-series data and uses an LSTM module to detect anomalies accurately and in a timely manner. We then propose a deep gradient compression mechanism to further improve the communication efficiency of the proposed framework.
V-A Attention Mechanism-Based CNN-LSTM Model
We present an unsupervised AMCNN-LSTM model consisting of an input layer, an attention mechanism-based CNN unit, an LSTM unit, and an output layer, as shown in Fig. 3. First, we feed the preprocessed data to the input layer. Second, we use CNNs to capture fine-grained features of the input and utilize the attention mechanism to focus on the important ones among the CNN-captured features. Third, we use the output of the attention mechanism-based CNN unit as the input of the LSTM unit and use the LSTM to predict future time-series data. Finally, we propose an anomaly detection score to detect anomalies.
Preprocessing: We normalize the sensing time series data collected by the IIoT nodes into [0,1] to accelerate the model convergence.
Attention Mechanism-Based CNN Unit: First, we introduce an attention mechanism into the CNN unit to improve its focus on important features. In cognitive science, due to the bottleneck of information processing, humans selectively focus on the important parts of information while ignoring other visible information [37]. Inspired by this, attention mechanisms have been proposed for various tasks, such as computer vision and natural language processing [37, 38, 39], where they improve model performance by attending to important features. The formal definition of the attention mechanism is given as follows:

$$e_t = \mathrm{score}(q, h_t), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{j} \exp(e_j)}, \qquad c = \sum_{t} \alpha_t h_t, \tag{6}$$

where $q$ is the matching feature vector based on the current task, used to interact with the context, $h_t$ is the feature vector of a timestamp in the time series, $e_t$ is the unnormalized attention score, $\alpha_t$ is the normalized attention score, and $c$ is the context feature of the current timestamp calculated from the attention scores and the feature sequence $\{h_t\}$. In most instances, $\mathrm{score}(q, h_t) = q^{\top} W h_t$, where $W$ is a weight matrix.
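The bilinear variant of the attention score can be sketched as follows (hypothetical helper name; `H` stacks the per-timestamp feature vectors row-wise):

```python
import numpy as np

def attention_context(q, H, W):
    """Bilinear attention: unnormalized scores e_t = q^T W h_t,
    softmax-normalized weights alpha_t, and the context vector c
    as the attention-weighted sum of the features."""
    e = H @ (W.T @ q)              # one unnormalized score per timestamp
    e = e - e.max()                # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()
    c = alpha @ H                  # context feature vector
    return alpha, c
```

The softmax guarantees the weights sum to one, so the context is a convex combination of the timestamp features.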
Second, we use the CNN unit to extract fine-grained features of the time-series data. The CNN module is formed by stacking multiple layers of one-dimensional (1D) CNNs, and each layer includes a convolution layer, a batch normalization layer, and a nonlinear layer. The module aggregates samples via pooling layers and creates a hierarchical structure that gradually extracts more abstract features through the stacking of convolutional layers. This module outputs $n$ feature sequences of length $l$, so its output size can be expressed as $(n \times l)$. To further extract significant time-series features, we propose a parallel feature extraction branch that combines the attention mechanism with the CNN. The attention mechanism module is composed of feature aggregation and scale restoration. The feature aggregation part uses stacked convolution and pooling layers to extract key features from the sequence and uses a $1 \times 1$ convolution kernel to mine linear relationships. The scale restoration part restores the key features to size $(n \times l)$, consistent with the output features of the CNN module, and then uses the sigmoid function to constrain the values to [0, 1].

Third, we multiply, element-wise, the output features of the CNN module by the important features output by the corresponding attention mechanism module. We assume an input sequence $S$. The output of the sequence processed by the CNN module is represented by $F$, and the output of the corresponding attention module is represented by $A$. We multiply the two outputs element by element, as follows:

$$F'_{i,c} = F_{i,c} \odot A_{i,c}, \tag{7}$$

where $\odot$ represents element-wise multiplication, $i$ is the corresponding position of the time series in the feature layer, and $c$ is the channel. We use the final feature layer $F'$ as the input of the LSTM block.

We introduce the attention mechanism to expand the receptive field of the input, which allows the model to obtain more comprehensive contextual information and thereby learn the important features of the current local sequence. Furthermore, we use the attention module to suppress the interference of unimportant features, solving the problem that the model otherwise cannot distinguish the importance of time-series features.
LSTM Unit:
In this paper, we use a variant of the recurrent neural network, called LSTM, to accurately predict the sensing time-series data and thereby detect anomalies, as shown in Fig. 3. The LSTM uses a well-designed "gate" structure to remove or add information to the cell state; the "gate" structure is a method of selectively passing information. LSTM cells include forget gates $f_t$, input gates $i_t$, and output gates $o_t$. The calculations on the three gate structures are defined as follows:

$$\begin{aligned} f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f), \\ i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i), \\ o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o), \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c), \\ h_t &= o_t \odot \tanh(c_t), \end{aligned} \tag{8}$$

where $W_f$, $W_i$, $W_o$, $W_c$ and $b_f$, $b_i$, $b_o$, $b_c$ are the weight matrices and bias vectors for the input vector $x_t$ at time step $t$, respectively, $\sigma$ is the activation function, $\odot$ represents element-wise multiplication, $c_t$ represents the cell state, $h_t$ is the state of the hidden layer at time step $t$, and $h_{t-1}$ is the state of the hidden layer at time step $t-1$.

Anomaly Detection: We use the AMCNN-LSTM model to predict real-time and future sensing time-series data on different edge devices:
$$\hat{x}_t = \mathcal{F}(x_1, \ldots, x_{t-1}), \tag{9}$$

where $\mathcal{F}$ is the prediction function; in this paper, we use the LSTM unit for time-series prediction. We use anomaly scores for anomaly detection, defined as follows:

$$a = (e - \mu)^{\top} \Sigma^{-1} (e - \mu), \tag{10}$$

where $a$ is the anomaly score, $e = |x - \hat{x}|$ is the reconstruction error vector, and the error vectors for the time series in the training sequences are used to estimate the parameters $\mu$ and $\Sigma$ of a Normal distribution using Maximum Likelihood Estimation. In an unsupervised setting, a point in a sequence is predicted to be "anomalous" when its anomaly score exceeds a threshold chosen to maximize $F_{\beta} = (1 + \beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$, where $P$ is precision, $R$ is recall, and $\beta$ is the parameter; otherwise it is predicted to be "normal".
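The scoring step above can be sketched as follows (hypothetical helper names; assumes a Mahalanobis-style score of the reconstruction error vector under a Normal distribution fitted by MLE):

```python
import numpy as np

def fit_error_distribution(errors):
    """MLE of a multivariate Normal over training reconstruction errors."""
    mu = errors.mean(axis=0)
    sigma = np.atleast_2d(np.cov(errors, rowvar=False, bias=True))
    return mu, sigma

def anomaly_score(e, mu, sigma):
    """Score of an error vector under the fitted Normal (higher = more anomalous)."""
    d = e - mu
    return float(d @ np.linalg.inv(sigma) @ d)
```

A point is then flagged as anomalous when its score exceeds the threshold tuned on the training set.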
V-B Gradient Compression Mechanism
When the gradients reach 99.9% sparsity, only the 0.1% of gradients with the largest absolute values are useful for model aggregation [30]. Therefore, we only need to aggregate the gradients with larger absolute values to update the model. This reduces the byte size of the gradient matrix, which in turn reduces the number of gradients exchanged between the devices and the cloud and improves communication efficiency, especially for distributed machine learning systems. Inspired by this fact, we propose a gradient compression mechanism to reduce the gradients exchanged between the cloud aggregator and the edge devices, further improving the communication efficiency of the proposed framework.
When selecting gradients with larger absolute values, two situations arise: (1) no gradient value in the gradient matrix exceeds the given threshold; (2) some gradient values in the gradient matrix are very close to the given threshold. Simply setting the gradients that do not meet the threshold to 0 would cause information loss. Therefore, each device uses a local gradient accumulation scheme to prevent such loss. Specifically, instead of filtering out the smaller gradients, the device keeps them in a buffer and accumulates them until the accumulated value reaches the given threshold. Note that we use DSGD for iterative updates, and the loss function to be optimized is defined as follows:
$$F(w) = \frac{1}{|D|} \sum_{x \in D} f(x, w), \tag{11}$$

$$w_{t+1} = w_t - \eta \frac{1}{K b} \sum_{k=1}^{K} \sum_{x \in B_{k,t}} \nabla f(x, w_t), \tag{12}$$

where $F(w)$ is the global loss function, $f(x, w)$ is the loss function on the local device, $w$ are the weights of the model, $K$ is the total number of edge devices, $\eta$ is the learning rate, $B_{k,t}$ represents the data samples of device $k$ for the $t$-th round of training, and $b$ is the size of each local mini-batch.
When the gradients' sparsification reaches a high level (e.g., 99%), it affects model convergence. Following [30, 36], we use momentum correction and local gradient clipping to mitigate this effect. Momentum correction makes the accumulated small local gradients converge toward the gradients with larger absolute values, thereby accelerating the model's convergence. Local gradient clipping is used to alleviate the gradient explosion problem
[30]. Next, we show that the local gradient accumulation scheme does not affect model convergence. We assume that $g_{\tau}^{(i)}$ is the $i$-th gradient component at iteration $\tau$, $G$ denotes the sum of the gradients under the aggregation algorithm in [26], $\tilde{G}$ denotes the sum of the gradients under the local gradient accumulation scheme, and $\eta$ is the rate of gradient descent. If the $i$-th gradient does not exceed the threshold until the $T$-th iteration, at which point it triggers the model update, we have:

$$G = \sum_{\tau=1}^{T} g_{\tau}^{(i)}, \tag{13}$$

$$\tilde{G} = \sum_{\tau=1}^{T} g_{\tau}^{(i)}, \tag{14}$$

then we can update $w^{(i)} \leftarrow w^{(i)} - \eta \tilde{G}$ and set the accumulation buffer to zero. If the $i$-th gradient reaches the threshold at the $t$-th iteration ($t < T$), the model update is triggered earlier; thus we have:

$$G = \sum_{\tau=1}^{t} g_{\tau}^{(i)}, \tag{15}$$

$$\tilde{G} = \sum_{\tau=1}^{t} g_{\tau}^{(i)}, \tag{16}$$

Then we can again update $w^{(i)} \leftarrow w^{(i)} - \eta \tilde{G} = w^{(i)} - \eta G$, so the result of using the local gradient accumulation scheme is consistent with that of the optimization algorithm in [26].
The specific implementation phases of the gradient compression mechanism are given as follows:


Phase 1, Local Training: Edge devices use the local dataset to train the local model. In particular, we use the gradient accumulation scheme to accumulate local small gradients.

Phase 2, Gradient Compression: Each edge device uses Algorithm 1 to compress its gradients and uploads the sparse gradients (i.e., only gradients larger than a threshold are transmitted) to the cloud aggregator. Note that an edge device sends the remaining locally accumulated gradients to the cloud aggregator once the local gradient accumulation exceeds the threshold.

Phase 3, Gradient Aggregation: The cloud aggregator obtains the global model by aggregating sparse gradients and sends this global model to the edge devices.
The gradient compression algorithm is thus presented in Algorithm 1.
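The per-round behavior on a single gradient tensor can be sketched as follows (a minimal sketch with hypothetical names; momentum correction and gradient clipping are omitted for brevity):

```python
import numpy as np

def compress_step(grad, residual, threshold):
    """One round of threshold-based sparsification with local gradient
    accumulation: entries whose accumulated magnitude exceeds the
    threshold are sent; the rest stay in the local residual buffer."""
    acc = residual + grad
    mask = np.abs(acc) > threshold
    sent = np.where(mask, acc, 0.0)          # sparse update for the aggregator
    new_residual = np.where(mask, 0.0, acc)  # small gradients kept locally
    return sent, new_residual
```

Accumulated small gradients are eventually transmitted once they cross the threshold, so no information is lost across rounds.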
VI Experiments
In this section, the proposed framework is applied to four real-world datasets for performance evaluation: power demand (https://archive.ics.uci.edu/ml/datasets/), space shuttle (https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)), ECG (https://physionet.org/about/database/), and engine (https://archive.ics.uci.edu/ml/datasets.php). These are time-series datasets collected by different types of sensors in different fields [6]. For example, the power demand dataset is composed of electricity consumption data recorded by electricity meters. These datasets contain both normal and anomalous subsequences; Table I lists the number of original sequences, normal subsequences, and anomalous subsequences for each dataset. For the power demand dataset, anomalous subsequences indicate that an electricity meter has failed or stopped working. We therefore use these datasets to train an FL model that can detect anomalies. We divide each dataset into a training set and a test set at a 7:3 ratio. We implement the proposed framework using PyTorch and PySyft [40]. The experiments are conducted on a virtual workstation running Ubuntu 18.04 with an Intel(R) Core(TM) i5-4210M CPU, 16 GB RAM, and a 512 GB SSD.

VI-A Evaluation Setup
In this experiment, to determine the threshold hyperparameter of the gradient compression mechanism, we first apply a simple CNN (i.e., a CNN with 2 convolutional layers followed by 1 fully connected layer) in the proposed framework to perform classification tasks on the MNIST and CIFAR-10 datasets. The pixels in all datasets are normalized into [0, 1]. During the simulation, we fix the number of edge devices, the learning rate, the number of training epochs, and the mini-batch size, and we follow reference [41] in setting the corresponding hyperparameter to 0.05.

We adopt the Root Mean Square Error (RMSE) to measure the performance of the AMCNN-LSTM model, as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left(x_t - \hat{x}_t\right)^2}, \tag{17}$$

where $x_t$ is the observed sensing time-series data and $\hat{x}_t$ is the predicted sensing time-series data.
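The RMSE metric translates directly into code (hypothetical helper name):

```python
import numpy as np

def rmse(observed, predicted):
    """Root Mean Square Error between observed and predicted series."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))
```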
VI-B Hyperparameter Selection of the Proposed Framework
In the context of the deep gradient compression scheme, proper hyperparameter selection, i.e., the threshold on the absolute gradient value, is a notable factor that determines the proposed framework's performance. In this section, we investigate the performance of the proposed framework under different thresholds and try to find the best-performing one. We use the MNIST and CIFAR-10 datasets to evaluate the performance of the proposed framework with the selected thresholds. As shown in Fig. 4, we observe that the more gradients are exchanged, the better the performance of the proposed framework. For the MNIST task, the results show an accuracy of 97.25% under the stricter threshold and 99.08% under the looser one; that is, the model exchanges about 300 times more gradients, but the accuracy improves by only 1.83%. We thus observe a trade-off between the gradient threshold and accuracy. Therefore, to achieve a good trade-off, we choose the stricter threshold (about 300x gradient compression) as the best threshold for our scheme.
VI-C Performance of the Proposed Framework
We compare the performance of the proposed model with that of CNN-LSTM [42], LSTM [41], Gated Recurrent Unit (GRU) [43], Stacked Auto-Encoders (SAEs) [44], and Support Vector Machine (SVM) [45] methods under an identical simulation configuration. Among these methods, AMCNN-LSTM is an FL-based model, while the rest are centralized ones. All are popular DAD models for general anomaly detection applications. We evaluate these models on four real-world datasets, i.e., power demand, space shuttle, ECG, and engine.

First, we compare the accuracy of the proposed model with the competing methods in anomaly detection. We determine the hyperparameters based on the accuracy and recall of the model on the training set; the hyperparameter values for the power demand, space shuttle, ECG, and engine datasets are 0.75, 0.80, 0.80, and 0.60, respectively. In Fig. 5, the experimental results show that the proposed model achieves the highest accuracy on all four datasets. For example, on the power demand dataset, the accuracy of the AMCNN-LSTM model is 96.85%, which is 7.87% higher than that of the SVM model. The experimental results indicate that AMCNN-LSTM is more robust across different datasets. The reason is that we use the on-device FL framework to train and update the model, which can learn time-series features from different edge devices as much as possible, thereby improving the model's robustness. Furthermore, the FL framework provides opportunities for edge devices to update their models in a timely manner, which helps edge device owners keep the on-device models up to date.
Second, we evaluate the prediction error of the proposed model and the competing methods. As shown in Fig. 5, the experimental results show that the proposed model achieves the best performance on the four real-world datasets. For the ECG dataset, the RMSE of the AMCNN-LSTM model is 63.9% lower than that of the SVM model. The reason is that the AMCNN-LSTM model uses AMCNN units to capture important fine-grained features and to prevent the memory loss and gradient dispersion problems that often occur in encoder-decoder models such as LSTM and GRU. Furthermore, the proposed model retains the advantages of the LSTM unit in predicting time-series data. Therefore, the proposed model can accurately predict time-series data.
In summary, the proposed model not only detects anomalies accurately but also predicts time-series data accurately.
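As a concrete reading of the prediction-error metric used above, the sketch below computes the RMSE between a true series and two synthetic forecasts, and the relative reduction (the sense in which one model's RMSE is "x% lower" than another's). The sine series and noise levels are placeholders, not one of the four real-world datasets.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length series."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

rng = random.Random(2)
truth = [math.sin(0.02 * i) for i in range(500)]          # true time series
good = [v + rng.gauss(0.0, 0.05) for v in truth]          # accurate model's forecast
poor = [v + rng.gauss(0.0, 0.20) for v in truth]          # weaker model's forecast

# Relative RMSE reduction of the accurate model over the weaker one.
reduction = 1.0 - rmse(truth, good) / rmse(truth, poor)
```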
VI-D Communication Efficiency of the Proposed Framework
In this section, we compare the communication efficiency of the FL framework with the gradient compression mechanism (GCM) against the traditional FL framework without GCM. We apply the same models (i.e., AMCNN-LSTM, CNN-LSTM, LSTM, GRU, SAEs, and SVM) in both frameworks. Note that we fix the communication overhead of each round, so the running time of a model reflects its communication efficiency. Fig. 7 shows the running time of FL with GCM and FL without GCM for the different models. We observe that the running time of the FL framework with GCM is about 50% of that of the framework without GCM. The reason is that GCM reduces the number of gradients exchanged between the edge devices and the cloud aggregator. In Section VI-B, we showed that GCM can compress the gradient by about 300 times without compromising accuracy. Therefore, the proposed communication-efficient framework is practical and effective in real-world applications.
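A back-of-the-envelope sketch of why sparse updates cut per-round traffic: a dense float32 gradient costs 4 bytes per parameter, while a sparse encoding carries a value plus an index per retained entry. The parameter count and retention rate below are assumptions chosen to land near the roughly 300× compression reported above, not figures from the paper.

```python
# Dense upload: one float32 (4 bytes) per model parameter.
def dense_bytes(n_params):
    return 4 * n_params

# Sparse upload: each retained entry carries a float32 value plus an
# int32 position index, i.e. 8 bytes per transmitted gradient entry.
def sparse_bytes(k_entries):
    return 8 * k_entries

n_params = 1_000_000          # assumed model size, for illustration
k = n_params // 600           # retain roughly 0.17% of the entries
saving = dense_bytes(n_params) / sparse_bytes(k)   # bytes saved per round, ~300x
```

The index overhead is why retaining 1/600 of the entries yields about 300× (not 600×) fewer bytes; real systems often shrink the index cost further with run-length or bitmap encodings.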
VI-E Discussion
Given the trade-off between privacy and model performance, we discuss the privacy properties of the proposed framework in terms of data access and model performance:

Data Access: The FL framework allows edge devices to keep their datasets locally while collaboratively learning deep learning models, which means that no third party can access a user's raw data. Therefore, the FL-based model can achieve anomaly detection without compromising privacy.

Model Performance: Although the FL-based model protects privacy, model performance remains an important measure of model quality. The experimental results show that the performance of the proposed model is comparable to that of many advanced centralized machine learning models, such as the CNN-LSTM, LSTM, GRU, and SVM models. In other words, the proposed model strikes a good compromise between privacy and model performance.
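The data-access point can be illustrated with a minimal FedAvg-style sketch: each device computes a model update on its private data, and only model weights, never raw samples, reach the aggregator. The toy local step (moving the weights toward the local data mean) is a stand-in for real training, and all names here are illustrative assumptions.

```python
import random

def local_update(weights, private_data, lr=0.1):
    """Toy local training step: nudge weights toward the mean of the
    device's private data. The raw samples never leave this function,
    i.e. never leave the device."""
    dim = len(weights)
    means = [sum(row[j] for row in private_data) / len(private_data)
             for j in range(dim)]
    return [w - lr * (w - m) for w, m in zip(weights, means)]

def federated_average(updates, sizes):
    """Server-side FedAvg: weight each device's update by its dataset size."""
    total = sum(sizes)
    dim = len(updates[0])
    return [sum(u[j] * s for u, s in zip(updates, sizes)) / total
            for j in range(dim)]

# Three devices with private 4-dimensional data centered at 0, 1, and 2.
rng = random.Random(3)
device_data = [[[rng.gauss(i, 1.0) for _ in range(4)] for _ in range(50)]
               for i in range(3)]
global_w = [0.0] * 4
for _ in range(50):                       # communication rounds
    updates = [local_update(global_w, data) for data in device_data]
    global_w = federated_average(updates, [len(d) for d in device_data])
```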
VII Conclusion
In this paper, we propose a novel communication-efficient on-device FL-based deep anomaly detection framework for sensing time-series data in IIoT. First, we introduce an FL framework that enables decentralized edge devices to collaboratively train an anomaly detection model, which solves the problem of data islands. Second, we propose an attention mechanism-based CNN-LSTM (AMCNN-LSTM) model to detect anomalies accurately. The AMCNN-LSTM model uses attention mechanism-based CNN units to capture important fine-grained features and prevent memory loss and gradient dispersion problems, while retaining the advantages of the LSTM unit in predicting time-series data. We evaluate the performance of the proposed model on four real-world datasets and compare it with the CNN-LSTM, LSTM, GRU, SAEs, and SVM methods. The experimental results show that the AMCNN-LSTM model achieves the highest accuracy on all four datasets. Third, we propose a gradient compression mechanism based on Top-k selection to improve communication efficiency. Experimental results validate that this mechanism can compress the gradient by about 300 times without losing accuracy. To the best of our knowledge, this is one of the pioneering works on deep anomaly detection using on-device FL.
In the future, we will focus on privacy-enhanced FL frameworks and more robust anomaly detection models, because the FL framework is vulnerable to attacks by malicious participants, and a more robust model can be applied to a wider range of application scenarios.
References

[1] H. Peng and X. Shen, "Deep reinforcement learning based resource management for multi-access edge computing in vehicular networks," IEEE Transactions on Network Science and Engineering, to appear.
[2] Y. Wu et al., "Dominant data set selection algorithms for electricity consumption time-series data analysis based on affine transformation," IEEE Internet of Things Journal, vol. 7, no. 5, pp. 4347–4360, 2020.
[3] Y. Peng et al., "Hierarchical edge computing: A novel multi-source multi-dimensional data anomaly detection scheme for industrial internet of things," IEEE Access, vol. 7, pp. 111 257–111 270, 2019.
[4] H. Peng et al., "Toward energy-efficient and robust large-scale WSNs: A scale-free network approach," IEEE Journal on Selected Areas in Communications, vol. 34, no. 12, pp. 4035–4047, 2016.
[5] P. Malhotra et al., "LSTM-based encoder-decoder for multi-sensor anomaly detection," arXiv preprint arXiv:1607.00148, 2016.
[6] T. Luo et al., "Distributed anomaly detection using autoencoder neural networks in WSN for IoT," in 2018 IEEE International Conference on Communications. IEEE, 2018, pp. 1–6.
[7] S. Garg et al., "A hybrid deep learning-based model for anomaly detection in cloud datacenter networks," IEEE Transactions on Network and Service Management, vol. 16, no. 3, pp. 924–935, 2019.
[8] T. D. Nguyen et al., "DÏoT: A federated self-learning anomaly detection system for IoT," in 2019 IEEE 39th International Conference on Distributed Computing Systems. IEEE, 2019, pp. 756–767.
[9] S. Garg et al., "A multi-stage anomaly detection scheme for augmenting the security in IoT-enabled applications," Future Generation Computer Systems, vol. 104, pp. 105–118, 2020.
[10] S. Garg et al., "En-ABC: An ensemble artificial bee colony based anomaly detection scheme for cloud environment," Journal of Parallel and Distributed Computing, vol. 135, pp. 219–233, 2020.
[11] R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: A survey," arXiv preprint arXiv:1901.03407, 2019.
[12] P. Malhotra et al., "Long short term memory networks for anomaly detection in time series," in Proceedings, vol. 89. Presses universitaires de Louvain, 2015.
[13] W. Lu et al., "Deep hierarchical encoding model for sentence semantic matching," Journal of Visual Communication and Image Representation, p. 102794, 2020.
[14] H. Lu et al., "Deep fuzzy hashing network for efficient image retrieval," IEEE Transactions on Fuzzy Systems, 2020.
[15] M. Munir et al., "DeepAnT: A deep learning approach for unsupervised anomaly detection in time series," IEEE Access, vol. 7, pp. 1991–2005, 2018.
[16] E. Lundin et al., "Anomaly-based intrusion detection: Privacy concerns and other problems," Computer Networks, vol. 34, no. 4, pp. 623–640, 2000.
[17] I. Butun et al., "Anomaly detection and privacy preservation in cloud-centric internet of things," in 2015 IEEE International Conference on Communication Workshop. IEEE, 2015, pp. 2610–2615.
[18] Y. Liu et al., "Privacy-preserving traffic flow prediction: A federated learning approach," IEEE Internet of Things Journal, pp. 1–1, 2020.
[19] R. Ito et al., "An on-device federated learning approach for cooperative anomaly detection," arXiv preprint arXiv:2002.12301, 2020.
[20] M. Tsukada et al., "A neural network based on-device learning anomaly detector for edge devices," arXiv preprint arXiv:1907.10147, 2019.
[21] S. M. Erfani et al., "High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning," Pattern Recognition, vol. 58, pp. 121–134, 2016.
[22] T. S. Buda et al., "DeepAD: A generic framework based on deep learning for time series anomaly detection," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2018, pp. 577–588.
[23] D. Wulsin et al., "Modeling electroencephalography waveforms with semi-supervised deep belief nets: Fast classification and anomaly measurement," Journal of Neural Engineering, vol. 8, no. 3, p. 036015, 2011.
[24] B. Zong et al., "Deep autoencoding gaussian mixture model for unsupervised anomaly detection," 2018.
[25] T. Schlegl et al., "Unsupervised anomaly detection with generative adversarial networks to guide marker discovery," in International Conference on Information Processing in Medical Imaging. Springer, 2017, pp. 146–157.
[26] J. Konečnỳ et al., "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[27] N. Agarwal et al., "cpSGD: Communication-efficient and differentially-private distributed SGD," in Advances in Neural Information Processing Systems, 2018, pp. 7564–7575.
[28] A. Reisizadeh et al., "FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization," arXiv preprint arXiv:1909.13014, 2019.
[29] E. Jeong et al., "Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data," arXiv preprint arXiv:1811.11479, 2018.
[30] J. Wangni et al., "Gradient sparsification for communication-efficient distributed optimization," in Advances in Neural Information Processing Systems, 2018, pp. 1299–1309.
[31] L. Zhao et al., "Shielding collaborative learning: Mitigating poisoning attacks through client-side detection," arXiv preprint arXiv:1910.13111, 2019.
[32] Y. LeCun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[33] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," 2009.
[34] Y. Lin et al., "Deep gradient compression: Reducing the communication bandwidth for distributed training," arXiv preprint arXiv:1712.01887, 2017.
[35] D. Alistarh et al., "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
[36] J. Wangni et al., "Gradient sparsification for communication-efficient distributed optimization," in Advances in Neural Information Processing Systems, 2018, pp. 1299–1309.
[37] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[38] V. Mnih et al., "Recurrent models of visual attention," in Advances in Neural Information Processing Systems, 2014, pp. 2204–2212.
[39] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[40] T. Ryffel et al., "A generic framework for privacy preserving deep learning," arXiv preprint arXiv:1811.04017, 2018.
[41] P. Malhotra et al., "LSTM-based encoder-decoder for multi-sensor anomaly detection," arXiv preprint arXiv:1607.00148, 2016.
[42] T.-Y. Kim and S.-B. Cho, "Web traffic anomaly detection using C-LSTM neural networks," Expert Systems with Applications, vol. 106, pp. 66–76, 2018.
[43] Y. Guo et al., "Multidimensional time series anomaly detection: A GRU-based Gaussian mixture variational autoencoder approach," in Asian Conference on Machine Learning, 2018, pp. 97–112.
[44] N. Chouhan, A. Khan et al., "Network anomaly detection using channel boosted and residual learning based deep convolutional neural network," Applied Soft Computing, vol. 83, p. 105612, 2019.
[45] S. M. Erfani et al., "High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning," Pattern Recognition, vol. 58, pp. 121–134, 2016.