1. Introduction
The proliferation of smart Internet of Things (IoT) devices has spurred the global deployment of smart factories, smart homes, autonomous vehicles, digital health, and so on. The gigantic network of these IoT devices generates a vast amount of sensory data that can be exploited to extract insights or detect anomalous events, by leveraging recently developed machine learning and especially deep learning techniques (Mohammadi et al., 2018; Chalapathy and Chawla, 2019; Luo and Nagarajan, 2018; Malhotra et al., 2016). Certain IoT applications, such as collision avoidance for autonomous vehicles, fire alarm systems in factories, or fault diagnosis of automated industrial processes, are time-critical and require fast anomaly detection to prevent unexpected breakdowns that can lead to costly or even fatal failures. In such cases, the traditional approach of streaming all the IoT sensory data to the cloud can be problematic, as it tends to incur high communication delay, congest the backbone network, and pose a risk to data privacy. To this end, the emerging edge or fog computing paradigm (La et al., 2019; Chen and others, 2017) provides a better alternative by performing anomaly detection (AD) in the proximity of the sources of sensory data. However, pushing computation from the cloud to the edge of networks faces resource challenges, especially when the model is complex (such as deep learning models) and the edge device has only limited computation power, storage, and energy supply, which is the case for typical IoT devices.
A possible solution is to transform a large complex model into one that fits the IoT device's capability. For example, model compression (Han et al., 2016) achieves this by pruning redundant and unimportant (near-zero) parameters as well as by quantizing weights into bins; alternatively, Hinton et al. (Hinton et al., 2015) proposed a knowledge distillation technique that transfers the knowledge learned by a large original model to a smaller distilled model, by training the distilled model to learn the soft output of the large model. However, such approaches need to handle each AD model on a case-by-case basis via fine-tuning, or are only applicable to a few specific types of deep neural networks (DNNs) with large sparsity.
There are also other proposed approaches (Teerapittayanon et al., 2017; Kang et al., 2017) to distributed anomaly detection, but overall, we identify three main issues in most existing works: (1) attempting "one size fits all"—using one AD model to handle all the input data, while overlooking the fact that different data samples often have different difficulty levels in detecting anomalous events; (2) focusing on accuracy or F1-score without giving adequate consideration to detection delay and memory footprint; (3) lacking appropriate local analysis in distributed systems and thus often transmitting data back and forth between edge sources and the cloud, which incurs unnecessary delay and bandwidth consumption.
In this paper, we propose an adaptive distributed AD approach that leverages the hierarchical edge computing (HEC) architecture by matching input data of different difficulty levels with AD models of different complexity on the fly. Specifically, we construct multiple anomaly detection DNN models (using autoencoders and LSTMs) of increasing complexity, and associate each of them with an HEC layer from bottom to top, e.g., IoT devices, edge servers, and the cloud. Then, we propose an adaptive model selection scheme that judiciously selects one of the AD models based on contextual information extracted online from the input data. We formulate this model selection problem as a contextual bandit problem, which is characterized by a single-step Markov Decision Process (MDP), and solve it using a reinforcement learning policy network. The single-step MDP enables quick decision-making on model selection, and the decisions thus made avoid unnecessary data transmission between edge and cloud, while retaining the best possible detection accuracy. We build an HEC testbed using real IoT devices, and implement our proposed contextual-bandit approach on the testbed. We evaluate our approach using both univariate and multivariate IoT datasets in our HEC testbed, in comparison with both baseline and state-of-the-art schemes. In summary, this paper makes the following contributions:

We identify three main issues in existing IoT anomaly detection approaches, namely using one universal model to fit all scenarios, a one-sided focus on accuracy, and a lack of local analysis that results in unnecessary network traffic.

We propose an adaptive AD approach that differentiates inputs by matching data of different difficulty levels with AD models of different complexity in an HEC. Our approach uses a contextual-bandit theoretical framework and the solution is obtained via a reinforcement learning policy network.

In our implementation of the proposed approach on an HEC testbed, we propose and incorporate an accelerated policy training method, which introduces parallelism by leveraging the distributed environment and achieves 4-5 times faster training than traditional, sequential training.

We build a real HEC testbed and evaluate our proposed approach against baseline and state-of-the-art schemes, using real-world IoT datasets (both univariate and multivariate). Our approach strikes the best accuracy-delay tradeoff on the univariate dataset, and achieves the best accuracy and F1-score on the multivariate dataset with only a negligibly longer delay than the best (but inflexible) scheme.
2. Related work
Anomaly detection has been extensively studied in several surveys (Mohammadi et al., 2018; Gupta et al., 2013; Chalapathy and Chawla, 2019). Here we give a brief discussion of some works related to our approach, and refer readers to these surveys for a more in-depth discussion. Deep learning is becoming increasingly popular in anomaly detection for IoT applications (Mohammadi et al., 2018; Luo and Nagarajan, 2018; Singh, 2017; Malhotra et al., 2016; Su et al., 2019; Chalapathy and Chawla, 2019). For univariate data, Luo and Nagarajan (Luo and Nagarajan, 2018) proposed an autoencoder (AE) neural network-based model to learn patterns of normal data and detect outliers (based on reconstruction errors) in a distributed fashion in IoT systems. Their AE model can be deployed at IoT devices to perform detection, but the model is fairly lightweight and may not be able to detect complex anomalous events or handle high-dimensional data. For multivariate data, an LSTM-based encoder-decoder model was proposed by (Malhotra et al., 2016) to learn normal data patterns and predict a few future time steps based on a few historical time steps. Su et al. (Su et al., 2019) proposed a complex AD model, which glues together GRU (a variant of RNN), a variational autoencoder, and planar normalizing flows, to robustly learn the temporal dependence and stochasticity of multivariate time series. However, this model does not suit resource-constrained IoT devices due to its high computational cost. On another line of research, some distributed machine learning approaches split complex models and deploy partial models at different layers of HEC (Teerapittayanon et al., 2017; Kang et al., 2017; Zhou et al., 2019; Li et al., 2018). Zhou et al. (Zhou et al., 2019) presented a method to reduce the inference time of a DNN model in HEC by splitting the model and allocating an appropriate partial DNN model to each layer of HEC. Teerapittayanon et al. (Teerapittayanon et al., 2016) proposed the BranchyNet architecture for an image classification task, which can "exit" early from a multi-layer DNN during inference based on the confidence of the inference output. Later on, the same authors (Teerapittayanon et al., 2017) deployed different sections of BranchyNet in an HEC system, in order to reuse features extracted at lower layers for inference at a higher layer. This requires less communication bandwidth and allows for faster and localized inference thanks to the shallow model at the edge. However, this approach has to make inferences sequentially from the very bottom to the top of HEC, which can lead to unnecessary delay and inference requests to lower layers when detection is hard. Also, it requires all the distributed models to use the same architecture, while our approach has the flexibility of using different models at different layers of HEC.
Our work is inspired by the observation that input data often come with different levels of difficulty in analysis (e.g., easy, medium, and complex) and should be treated differently to achieve the best inference performance. Given the generality of this idea, similar forms have appeared in prior work (Taylor et al., 2018; Lin et al., 2017; Wu et al., 2018), but there are key differences from this work. (Taylor et al., 2018) used multiple k-Nearest Neighbor (kNN) classification models to train a selection model that chooses a proper inference model (among several models within an embedded device) for a given input image and desired accuracy. Lin et al. (Lin et al., 2017) proposed a dynamic neural network pruning framework at runtime, using a reinforcement learning (RL)-based approach to decide whether to prune or keep each convolutional neural network (CNN) layer conditioned on the difficulty level of input samples, thereby reducing average inference time while achieving high accuracy overall. Similarly, (Wu et al., 2018) proposed a BlockDrop scheme that learns to dynamically drop or keep a residual block of a trained deep residual network (ResNet) during inference, to minimize the number of residual blocks while preserving recognition accuracy. The key differences of our work from the above are as follows. First, these prior works deal with the task of image classification in the computer vision domain, while our work deals with the task of anomaly detection in IoT and edge computing. Second, using multiple kNN classifiers sequentially as in (Taylor et al., 2018) does not scale well. In contrast, we use a single policy network that directly outputs a suitable model based on contextual information, without the need to check each model (e.g., kNN) one by one until a certain accuracy is met. It is also lightweight, runs quickly, and can easily scale to a large number of devices. Third, our RL-based approach deals with a distributed computing environment with multiple models, while the above prior works (Wu et al., 2018; Lin et al., 2017) deal with a single model in a centralized environment.
3. Contextual-Bandit Anomaly Detection
3.1. Overall Approach
Fig. 1 shows the overall structure of the adaptive anomaly detection approach, which consists of three flows: (1) training multiple AD models (black solid line), (2) training the policy network (transparent purple line), and (3) online adaptive detection (transparent orange line). A common stage for all three flows is Data Preprocessing, whose input is either univariate or multivariate data. In this stage, the training dataset is standardized to zero mean and unit variance. The scaler fitted on the training set is then used to transform the validation/test set, and is also used during online detection. The standardized datasets are segmented into sequences through sliding windows.
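As a concrete illustration, this preprocessing stage can be sketched as follows (the function and variable names are ours, and the window/step values are only examples):

```python
import numpy as np

def preprocess(train, test, window, step=1):
    """Standardize with statistics fitted on the training set only,
    then segment both sets into sliding-window sequences."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    train_s = (train - mean) / std        # zero mean, unit variance
    test_s = (test - mean) / std          # reuse the training scaler

    def windows(x):
        return np.stack([x[i:i + window]
                         for i in range(0, len(x) - window + 1, step)])

    return windows(train_s), windows(test_s)

# e.g., univariate power data segmented into one-week (672-step) sequences
train = np.random.randn(2000, 1)
test = np.random.randn(700, 1)
tr, te = preprocess(train, test, window=672, step=672)
```

Setting `step` equal to `window` yields non-overlapping sequences (as for the one-week power data), while a smaller step gives overlapping windows (as for the multivariate data later in the paper).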
In the first flow, the Training AD Models stage constructs multiple AD models (of increasing complexity) from the preprocessed training time-series sequences. The AD models capture normal patterns of the data (either univariate or multivariate; see Subsection 3.2). Then, the Anomaly Scores stage computes anomaly scores and thresholds for detecting anomalies during the validation and online detection phases. These trained AD models of increasing complexity, together with their respective thresholds, are deployed at different layers of HEC, from bottom to top.
In the second flow, we design and train the policy networks based on extracted contextual information (e.g., encoded feature representation) of input sequences to adaptively select the best AD models on the fly, in order to achieve high accuracy and low detection delay simultaneously (Subsection 3.3).
Finally, during the online adaptive detection phase, the trained policy network deployed at the IoT device selects the best-suited AD model on the fly.
3.2. Constructing Multiple Anomaly Detection Models in HEC
We consider a $K$-layer distributed hierarchical edge computing (HEC) system: IoT devices at layer 1, edge servers at layers 2 to $K-1$, and the cloud at layer $K$. We choose $K=3$ as a typical setting for a three-layer HEC (La et al., 2019; Mohammadi et al., 2018; Ngo et al., 2020a) (but our approach applies to any $K$ in general, i.e., multiple layers of edge servers). We consider two types of data: univariate and multivariate. In this subsection, we construct AD models for each data type, with increasing complexity, and associate those models with the HEC layers from 1 to $K$.
3.2.1. AD Models for Univariate Data
For univariate IoT data, we adapt the autoencoder (AE) model with a single hidden layer from (Luo and Nagarajan, 2018), which demonstrated the feasibility of running such a model on IoT devices. In (Luo and Nagarajan, 2018), the compression ratio between the dimension of the encoded layer and that of the input layer is 70%; in our case, we use a much lower ratio of 30% in order to fit more diverse low-cost IoT devices. Simulation shows that our model under this ratio can still reconstruct normal data very well (see Fig. 3). For edge servers (e.g., IoT gateways or micro-servers), we use an AE-based model with more hidden layers to enhance its capability of learning features that better represent the data. We add one more encoder layer and one more decoder layer to the previous AE model to obtain a model which we call AE-Edge. For the cloud, we further add one more encoder layer and one more decoder layer to obtain a deep AE model, which we refer to as AE-Cloud.
The detailed setup of the above AE-based models is shown in the first column of Fig. 2. Each number beside a layer of an AE-based model is the number of neural units of the corresponding layer. We train the AE-based models with stochastic gradient descent (SGD) to minimize the mean absolute reconstruction error between the reconstructed outputs and the expected outputs (which are equal to the inputs). The reconstruction error is $e = |x - \hat{x}|$, where $x$ is the input data and $\hat{x}$ is the corresponding reconstructed output. To avoid overfitting, we use $\ell_2$-norm regularization for the weights, and add a dropout rate of 0.3 after each hidden layer. In accordance with the different complexities of these models, we train them over 4000, 6000 and 8000 training epochs for AE-IoT, AE-Edge, and AE-Cloud, respectively.
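To illustrate how the three AE variants grow in depth, the following hypothetical helper computes symmetric layer sizes from the input dimension and the 30% compression ratio; the actual layer sizes are those in Fig. 2, which this sketch only approximates:

```python
import numpy as np

def ae_layer_sizes(input_dim, ratio=0.3, depth=1):
    """Hypothetical helper: symmetric autoencoder layer sizes whose
    bottleneck is ratio * input_dim.  `depth` counts encoder layers:
    depth=1 ~ AE-IoT, depth=2 ~ AE-Edge, depth=3 ~ AE-Cloud."""
    bottleneck = int(round(input_dim * ratio))
    # interpolate hidden sizes geometrically from input down to bottleneck
    enc = np.geomspace(input_dim, bottleneck,
                       depth + 1).round().astype(int).tolist()
    return enc + enc[-2::-1]   # mirror the encoder to form the decoder
```

For one-week univariate inputs of 672 time steps, `depth=1` gives `[672, 202, 672]`, and each deeper variant inserts one more encoder/decoder pair, mirroring the "add one more encoder layer and one more decoder layer" construction above.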
3.2.2. AD Models for Multivariate Data
The simplicity of AE-based models means they cannot capture representative features of high-dimensional IoT data well. Hence, in the multivariate case (18 dimensions in our evaluated dataset), we use a sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) based on long short-term memory (LSTM) to build an LSTM encoder-decoder as the AD model. Such models can learn representations of multi-sensor time-series data with 1 to 12 dimensions (Malhotra et al., 2016). In our case, we apply our LSTM-seq2seq model to an 18-dimensional dataset, deploy the model on the IoT device, and name it LSTM-seq2seq-IoT. The multivariate IoT data are encoded into encoded states by an LSTM encoder; then an LSTM decoder learns to reconstruct the data from the previous encoded states and the previous output, one step at a time. For the first step, a special start token is used, which in our case is a zero vector. For the edge layer, we build an LSTM-seq2seq-Edge anomaly detection model with double the number of LSTM units for both encoder and decoder, which can learn a better representation of a longer input sequence. For the cloud layer, we build a BiLSTM-seq2seq-Cloud anomaly detection model with a bidirectional LSTM (BiLSTM) encoder that processes the input sequence in both the backward and forward directions to encode information into encoded states (i.e., by concatenating the encoded states from the two directions). These are depicted in the third column of Fig. 2. To train these LSTM-seq2seq models, we use the teacher forcing method (Bengio et al., 2015) with the RMSProp optimizer and an $\ell_2$-norm kernel regularizer to minimize the mean squared reconstruction error. The output of the LSTM decoder is passed through dropout and then through a fully-connected layer with a linear activation function to generate the reconstruction sequence.
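The decoder loop with teacher forcing can be sketched as follows; the toy cell below is a stand-in for an LSTM step, and all names are ours:

```python
import numpy as np

def decode(cell, state, targets, teacher_forcing=True):
    """Decoder loop: the first input is a zero vector (the start token);
    afterwards the decoder consumes the previous ground-truth step
    (teacher forcing, used in training) or its own previous output
    (used at inference).  `cell` is a step function
    (prev_input, state) -> (output, new_state)."""
    prev = np.zeros_like(targets[0])        # start token: zero vector
    outputs = []
    for t in range(len(targets)):
        out, state = cell(prev, state)
        outputs.append(out)
        prev = targets[t] if teacher_forcing else out
    return np.stack(outputs)

# toy linear "cell" standing in for an LSTM step
def toy_cell(x, s):
    s = 0.5 * s + 0.5 * x
    return s, s

targets = np.ones((4, 3))                   # 4 timesteps, 3 features
recon = decode(toy_cell, np.zeros(3), targets)
```

With `teacher_forcing=True` the decoder is conditioned on ground truth at every step, which stabilizes training; at inference it feeds back its own outputs instead.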
3.2.3. Anomaly Score
The training process shows that the above models can capture the normal data pattern well, as indicated by low reconstruction errors for normal data and high errors for abnormal data. Therefore, the reconstruction error is a good indicator for detecting anomalies. We assume that reconstruction errors generally follow a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ (we use a univariate/multivariate Gaussian distribution for the univariate/multivariate dataset, respectively), where $\mu$ and $\Sigma$ are the mean and covariance matrix of the reconstruction errors of normal data samples. We use the logarithmic probability densities (log-PD) of the reconstruction errors as anomaly scores, as in (Singh, 2017; Malhotra et al., 2015; Ngo et al., 2020b). Normal data will have a high log-PD while anomalous data will have a low log-PD (note that log-PD is negative). We then use the minimum value of the log-PD on the normal dataset (i.e., the training set) as the threshold for detecting outliers during testing. We consider a detection as confident if the input sequence being detected satisfies one of these two conditions: (i) at least one data point has a log-PD less than a certain factor (e.g., 2x) of the threshold; (ii) the number of anomalous points is higher than a certain percentage (e.g., 5%) of the sequence size.
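A minimal numpy sketch of this scoring procedure, assuming a multivariate Gaussian fit; the threshold factor and anomalous-point fraction are the example values (2x, 5%) from the text:

```python
import numpy as np

np.random.seed(0)

def fit_gaussian(errors):
    """Fit mean and covariance of reconstruction errors on normal data."""
    mu = errors.mean(axis=0)
    cov = np.cov(errors, rowvar=False) + 1e-6 * np.eye(errors.shape[1])
    return mu, cov

def log_pd(e, mu, cov):
    """Log probability density of one reconstruction-error vector."""
    d = e - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.inv(cov) @ d
                   + logdet + len(mu) * np.log(2 * np.pi))

# stand-in reconstruction errors on normal data; threshold = min log-PD
normal_errors = np.random.randn(500, 2)
mu, cov = fit_gaussian(normal_errors)
threshold = min(log_pd(e, mu, cov) for e in normal_errors)

def is_confident(scores, threshold, factor=2.0, frac=0.05):
    """The two confidence conditions: a point far below the (negative)
    threshold, or too many points below the threshold."""
    return (scores < factor * threshold).any() or \
           (scores < threshold).mean() > frac
```

Since log-PD is negative, `factor * threshold` with `factor > 1` is a stricter (more negative) cutoff than the threshold itself.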
3.3. Adaptive Model Selection Scheme
As AD models are deployed at the IoT, edge, and cloud layers of HEC respectively, we propose an adaptive model selection scheme to select the most suitable AD model based on the contextual information of input data, so that each data sample is directly fed to its best-suited model. Note that this is in contrast to traditional approaches where input data either (i) always goes to one fixed model regardless of the hardness of detection (Chen and others, 2017), or (ii) is successively offloaded to higher layers until a required or desired accuracy or confidence is met (Teerapittayanon et al., 2017).
Our proposed adaptive model selection scheme is a reinforcement learning algorithm that adapts its model selection strategy to maximize the expected reward of the model to be selected. We frame the learning problem as a contextual bandit problem (Sutton et al., 2000; Williams, 1992) (also known as associative reinforcement learning (RL), one-step RL, associative bandits, and learning with bandit feedback) and use a single-step Markov decision process to solve it.
The contextual-bandit model selection approach is illustrated in Fig. 4. Formally, given the contextual information $s$ of an input data $x$, where $s$ is a representation of the input data, and $K$ trained AD models deployed at the $K$ layers of an HEC system, we build a policy network that takes $s$ as the input state and outputs a policy of selecting which model (or equivalently which layer of HEC) to do anomaly detection, in the form of a categorical distribution:

(1) $\pi(a \mid s; \theta) = \prod_{k=1}^{K} p_k^{a_k}$

where $a = (a_1, \ldots, a_K)$ is the action encoded as a one-hot vector which defines which model (or HEC layer) performs the task, $p = (p_1, \ldots, p_K)$ is a likelihood vector representing the likelihood of selecting each model, and $\sum_{k=1}^{K} p_k = 1$. We set $a_k = 1$ if model $k$ is selected and $a_k = 0$ otherwise, and we denote the selected action as $a^*$.
The policy network is designed as a neural network with parameters $\theta$. To make the policy network small enough to run fast on IoT devices, we use extracted features, instead of the raw input data, to represent the contextual information of the input (i.e., the state vector $s$). Specifically, for the univariate data, we define the contextual state as an extracted feature vector that includes the min, max, mean, and standard deviation of each day's sensor data. For the multivariate data, we use the encoded states of the LSTM encoder to represent the input for the policy network.
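The univariate contextual state can be computed as follows (a sketch with our own names, assuming 96 time steps per day for 15-minute data, so one week is 7 × 96 = 672 steps):

```python
import numpy as np

def univariate_context(week, steps_per_day=96):
    """Contextual state for the univariate case: min, max, mean, and
    standard deviation of each day's data in a one-week sequence,
    giving 4 features x 7 days = 28 dimensions."""
    days = week.reshape(7, steps_per_day)
    feats = [days.min(axis=1), days.max(axis=1),
             days.mean(axis=1), days.std(axis=1)]
    return np.concatenate(feats)   # shape (28,)

state = univariate_context(np.random.randn(672))
```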
The policy network is trained to find an optimal policy that maps a state (of the input) to an action (i.e., a model or layer) so as to maximize the expected reward of the selected action. We train the policy network using the policy gradient method (Williams, 1992; Sutton et al., 2000) to minimize the negative expected reward:

(2) $L(\theta) = -\mathbb{E}_{a \sim \pi(\cdot \mid s; \theta)}[R(a, s)]$

where $R(a, s)$ is the reward of action $a$ given state $s$. The gradient of $L(\theta)$ is derived as follows:

$\nabla_\theta L(\theta) = -\mathbb{E}_{a \sim \pi}\big[R(a, s)\, \nabla_\theta \log \pi(a \mid s; \theta)\big]$
To reduce the variance of the reward value and increase the convergence rate, we utilize reinforcement comparison with a baseline $b$ that is independent of the output actions (Williams, 1992). We use the best observed reward as the baseline (Sutton et al., 2000), which is empirically shown to boost the convergence rate. In addition, we add an $\ell_2$-norm regularization term to the loss function to prevent overfitting. So $L(\theta)$ is rewritten as follows:

(3) $L(\theta) = -\mathbb{E}_{a \sim \pi(\cdot \mid s; \theta)}\big[R(a, s) - b\big] + \lambda \lVert \theta \rVert_2^2$

where $\lambda$ is a regularization parameter. Similar to the original objective function, we minimize (3) by utilizing the policy gradient method with the REINFORCE algorithm (Williams, 1992) to compute the gradient of $L(\theta)$ as follows:
(4) $\nabla_\theta L(\theta) = -\mathbb{E}_{a \sim \pi}\big[(R(a, s) - b)\, \nabla_\theta \log \pi(a \mid s; \theta)\big] + 2\lambda\theta$

(5) $\nabla_\theta L(\theta) = -\mathbb{E}_{a \sim \pi}\Big[(R(a, s) - b) \sum_{k=1}^{K} a_k \nabla_\theta \log p_k\Big] + 2\lambda\theta$
where we substitute (1) into (4) to get (5). We also note that (4) is an unbiased estimator because the separated gradient term of the baseline reward is zero:

$\mathbb{E}_{a \sim \pi}\big[b\, \nabla_\theta \log \pi(a \mid s; \theta)\big] = b \sum_{a} \nabla_\theta \pi(a \mid s; \theta) = b\, \nabla_\theta \sum_{a} \pi(a \mid s; \theta) = b\, \nabla_\theta 1 = 0$
The original REINFORCE algorithm updates the policy gradient over one sample for each training step, which also causes a high-variance problem. To mitigate this, at each training step we use minibatch training with $B$ contextual states, and update the gradient by averaging over these $B$ contextual states. Equations (3) and (5) may therefore be rewritten as:

(6) $L(\theta) \approx -\frac{1}{B} \sum_{j=1}^{B} \big(R(a_j, s_j) - b\big) + \lambda \lVert \theta \rVert_2^2$

(7) $\nabla_\theta L(\theta) \approx -\frac{1}{B} \sum_{j=1}^{B} \big(R(a_j, s_j) - b\big) \sum_{k=1}^{K} a_{j,k} \nabla_\theta \log p_{j,k} + 2\lambda\theta$
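The minibatch REINFORCE update with a baseline can be sketched as below; the linear softmax policy, dimensions, and hyperparameters are illustrative stand-ins rather than the paper's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, lam, lr = 3, 28, 1e-4, 0.1
theta = rng.normal(scale=0.01, size=(D, K))  # linear policy (illustrative)

def policy(s):
    """Softmax over linear scores: the likelihood vector p."""
    z = s @ theta
    z = z - z.max()                    # numerical stability
    p = np.exp(z)
    return p / p.sum()

def train_step(states, reward_fn, baseline=0.0):
    """One minibatch update: averaged policy gradient with a baseline
    (e.g., the best observed reward) plus the l2-regularizer gradient."""
    global theta
    grad = np.zeros_like(theta)
    for s in states:
        p = policy(s)
        a = rng.choice(K, p=p)         # sample an action from the policy
        r = reward_fn(a, s)
        dlogp = -np.outer(s, p)        # d log p_a / d theta = s (1[k=a] - p_k)
        dlogp[:, a] += s
        grad -= (r - baseline) * dlogp
    theta -= lr * (grad / len(states) + 2 * lam * theta)
```

Even this toy version exhibits the expected behavior: repeatedly rewarding one action drives the policy's likelihood mass toward it.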
In order to encourage the selection of an appropriate AD model, jointly increasing accuracy and reducing the cost of offloading tasks by pushing the chosen AD model toward the edge of the network, we propose a reward function as follows:

(8) $R(a, s) = \mathrm{accuracy}(x) - C(k, x)$

where $\mathrm{accuracy}(x)$ is the accuracy of detecting an input $x$, and $C(k, x)$ is the cost of offloading the detection task for $x$ to layer $k$. We define the cost as a function that maps the end-to-end detection delay $d(k, x)$ to an equivalent accuracy reduction in the range $[0, 1]$, with the intuition that a higher delay results in a greater reduction of accuracy:

(9) $C(k, x) = \beta \cdot d(k, x)$

where $\beta$ is a tuning parameter used to trade off the end-to-end delay against the accuracy (through the reward function (8)). For example, $\beta = 0.4$ means that an end-to-end detection delay of 250 ms for offloading a sample to an edge server is equivalent to a reduction of 0.1 in accuracy. The end-to-end delay $d(k, x)$ consists of the communication delay of transmitting data from an IoT device to a server at layer $k$ of HEC, and the computing delay of executing the detection task at layer $k$.
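Assuming the linear delay-to-accuracy cost reading of the description above, the reward computation reduces to a one-liner (the arithmetic in the comment mirrors the 250 ms example):

```python
def reward(accuracy, delay_s, beta):
    """Reward = accuracy - beta * delay (delay in seconds).
    With beta = 0.4, a 250 ms delay costs 0.4 * 0.25 = 0.1 accuracy."""
    return accuracy - beta * delay_s
```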
We summarize the procedure of training the policy network in Algorithm 1.
To balance exploration and exploitation during training, we apply a decayed $\epsilon$-greedy algorithm for action selection. We train the policy network over a number of epochs with an initial exploration probability $\epsilon_0$ that decays over training steps to a final value $\epsilon_{\min}$; the actual $\epsilon$ in each training episode is computed accordingly in Algorithm 1. Then, with probability $\epsilon$ an action is randomly selected to explore, while with probability $1-\epsilon$ an action is greedily selected based on the output of the current policy network.
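The decayed exploration schedule and the resulting action selection can be sketched as follows (the linear decay form and parameter names are our assumptions):

```python
import numpy as np

def epsilon(step, total, eps_start=1.0, eps_end=0.0):
    """Linearly decayed exploration probability, reaching eps_end
    once `total` decay steps have elapsed."""
    frac = min(step / total, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(p, eps, rng):
    """Explore uniformly with probability eps; otherwise exploit the
    policy's likelihood vector p greedily."""
    if rng.random() < eps:
        return int(rng.integers(len(p)))
    return int(np.argmax(p))
```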
4. Implementation & Experiment Setup
4.1. Dataset
We evaluate our proposed approach with two public datasets. The data is standardized to zero mean and unit variance for all of the training tasks and datasets.
Univariate dataset. We use a dataset of power consumption (http://www.cs.ucr.edu/~eamonn/discords/) of a Dutch research facility that has been used in (Keogh et al., 2005; Malhotra et al., 2016; Singh, 2017). It comprises 52 weeks of power consumption, consisting of 35040 samples recorded every 15 minutes. The data has a repeating weekly cycle of 672 time steps, with five consecutive peaks for the five weekdays and two lows for the weekend. Abnormal weeks can have fewer than five consecutive peak days on weekdays (perhaps due to a holiday), or high power consumption on a weekend. Examples of normal and abnormal weeks are shown in Fig. 5. Hence, each input is a sequence of one week of data with 672 time steps. We manually label a day as abnormal if it is a weekday with low power consumption, or a weekend day with high power consumption; other days are labeled as normal. For the AD task, we split the dataset into train and test sets with ratio 70:30, or equivalently 37 weeks : 15 weeks. The training set contains only normal weeks, and the test set contains the remaining normal weeks and all 8 anomalous weeks, each having at least one abnormal day. For training the policy network, we choose a training set that contains all 8 abnormal weeks and 7 normal weeks, and use the whole dataset as the test set to verify the quality of the policy network.
Multivariate dataset. We use the MHEALTH dataset (http://archive.ics.uci.edu/ml/datasets/mhealth+dataset), which consists of 12 human activities performed by 10 different people. Each person wore two motion sensors: one on the left ankle and the other on the right wrist. Each motion sensor contains a 3-axis accelerometer, a 3-axis gyroscope, and a 3-axis magnetometer; hence the input data has 18 dimensions. The sampling rate of these sensors is 50 Hz. We use a window sequence of 128 time steps (2.56 seconds) with a step size of 64 between two windows. Adopting common practice, we choose the dominant human activity (e.g., walking) as normal and treat the other activities as anomalous. Fig. 6 shows examples of normal and abnormal window sequences. For the AD task, we select 70% of the normal samples of all the subjects (people) as the training set, and the remaining 30% of normal samples plus 5% of each of the other activities as the test set. To train the policy network, we select 30% of normal samples and 5% of each of the other activities as the training set, and use the whole dataset as the test set.
4.2. Implementation of Anomaly Detection Models and Policy Network
We use TensorFlow and Keras to implement the AD models (i.e., the three AE models and the three LSTM-seq2seq models shown in Fig. 2) and the policy network model. (As studied in (La et al., 2019; Ngo et al., 2020a), each AD model can be packaged in a Docker container, a lightweight virtualized software environment, and run within an edge or cloud server, which makes the deployment of edge computing systems much easier and more scalable.) For the univariate time-series dataset, the input is a sequence of 672 time steps, i.e., a week of measured power consumption. We use tanh as the nonlinear activation function for all hidden layers, and the linear function for the output layer of the autoencoder models. To prevent overfitting, we apply dropout with rate 30% during training, and add a regularization term. We train and test the three models separately with $k$-fold cross-validation. For the multivariate time-series dataset, the input is a sequence of 128 time steps of 18-dimensional data. The LSTM-seq2seq-IoT model, which will be deployed on a Raspberry Pi 3, consists of 50 vanilla LSTM units for each of the encoder and decoder. Since the edge server and cloud server are equipped with GPUs, we implement the LSTM-seq2seq-Edge and BiLSTM-seq2seq-Cloud models based on CuDNNLSTM units to accelerate training and inference. Before deploying the LSTM-seq2seq-IoT and LSTM-seq2seq-Edge models on the Raspberry Pi 3 and Jetson TX2, we compress them by (i) removing the trainable nodes from the graph and converting variables into constants, and (ii) quantizing the model parameters from 32-bit floating point (FP32) to FP16. We observe no performance decrease in these compressed AD models, while their average inference delays on the Raspberry Pi 3 and Jetson TX2 are reduced by 61 ms and 126.1 ms, respectively.
The policy network must have low complexity and run fast on IoT devices, without consuming so many resources as to affect the IoT device's detection task. So the state input to the policy network needs to be small, yet still represent the whole input sequence well. For the univariate data, we define the contextual state as an extracted feature vector that includes the min, max, mean, and standard deviation of each day's sensor data; thus the dimension of the contextual state is just 4 × 7 = 28. For the multivariate data, we use the encoded states of the LSTM encoder to represent the input state for the policy network; hence the dimension of the contextual state, which is the concatenation of the encoded states (h, c), is 50 + 50 = 100. Subsequently, we build the policy network as a neural network with a single hidden layer (with 100 and 300 hidden units for the univariate and multivariate data, respectively) and a softmax layer with 3 output units, i.e., $p = (p_1, p_2, p_3)$, which indicate the likelihood of choosing one of the three AD models. We train the policy network as described in Section 3.3 with 6000 and 600 episodes for the univariate and multivariate datasets, respectively. (The univariate dataset needed more episodes because it is smaller (even though we repetitively replay the training set while randomly shuffling the input sequences) and, as a result, the convergence of the reward value of the policy network was observed to be slower. In addition, the convergence rate also depends on the parameter $\beta$, and we set the number of training episodes to the maximum corresponding to the worst $\beta$.) The initial $\epsilon$ is gradually decreased to zero after half the number of episodes. We empirically select the value of $\beta$ (as shown in Section 5.3) for both the univariate and multivariate datasets to calculate the cost of executing detection as given by (9).
4.3. Accelerated Training of Policy Network
Recently, Google published a distributed mechanism to efficiently accelerate the training process for deep reinforcement learning problems (Espeholt et al., 2020). Inspired by this work, we accelerate the minibatch training of our policy network in Algorithm 1 by modifying the reward-query step as follows: (i) instead of sequentially querying the reward of each input sample, we group the inputs that belong to the same action and send them to the corresponding AD model to perform inference in parallel, in a batched manner; (ii) we can also run inference concurrently on multiple AD models at different layers of HEC if more than one AD model appears in the action outputs of the greedy method.
With this proposal (called the parallel approach), we expect to reduce the time to train the policy network because of (1) the reduced communication overhead between the training server and the multiple AD models at multiple layers of HEC, and (2) the use of minibatch inference at each AD model instead of single-input detection. The results are analyzed in Section 5.4. Note that, when training with the parallel approach, the detection delay is measured over multiple samples because multiple samples are queried concurrently. Therefore, we use the average of the end-to-end detection delays for each HEC layer, collected from the normal training approach, to calculate rewards.
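The accelerated reward query can be sketched as grouping samples by selected action and issuing one batched, concurrent inference call per AD model; the model interface here is a hypothetical stand-in for the remote AD services:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def query_rewards(actions, inputs, models):
    """Group inputs by selected action (HEC layer), then run one batched
    inference per model, concurrently across models, instead of one
    request per sample.  `models` maps action -> callable(batch)->list."""
    groups = defaultdict(list)
    for idx, (a, x) in enumerate(zip(actions, inputs)):
        groups[a].append((idx, x))

    results = [None] * len(inputs)

    def run(action, items):
        idxs, batch = zip(*items)
        outs = models[action](list(batch))   # one batched call per model
        for i, r in zip(idxs, outs):         # scatter back in input order
            results[i] = r

    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda kv: run(*kv), groups.items()))
    return results
```

Each model is queried exactly once per minibatch, and the per-model calls overlap in time, which captures both sources of speedup described above.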
4.4. Software Architecture and Experiment Setup
The software architecture of our HEC system is shown in Fig. 7. (A brief introduction of our HEC testbed is also available online: https://rebrand.ly/91a71.) It consists of a GUI, the adaptive model selection scheme based on the policy network, and the three AD models at the three layers of HEC. The GUI allows a user to select which dataset and model selection scheme to use, as well as to tune parameters, as shown in Fig. 8(a), and displays the raw sensory signals and performance results, as shown in Fig. 8(b). All the communication services use keep-alive TCP sockets in order to reduce the overhead of connection establishment. Network latency (round-trip time) as shown in Fig. 2 is configured using the Linux traffic control tool, tc, to emulate the WAN connections in HEC. The hardware setup for the HEC testbed is shown in Fig. 9. To emulate an environment with high-speed incoming IoT data that requires fast anomaly detection, we replay the datasets with increased sampling rates. For the univariate dataset (power consumption), we replay it at a sampling rate of 672 samples per second during the experiments, compared to the dataset's original sampling rate of one sample per 15 minutes. As such, a whole year of power consumption data was replayed and processed within minutes in our experiments. For the multivariate dataset (healthcare), we replay the sensory data at a simulated sampling rate of 128 samples per second, compared to the original sampling rate of 50 samples per second.
We measure end-to-end delays on the actual IoT devices, where the end-to-end delay is the interval between the time when a sample input sequence is generated at an IoT device and the time when the detection result is received back at that device. Note that the actual anomaly detection is executed on exactly one of the three layers (IoT, edge, or cloud). Based on the measured delay, we calculate the cost of detection using (9). We will evaluate different parameter settings to examine the trade-off between the offloading cost and the accuracy gain of a complex model.
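The delay measurement itself amounts to timestamping around a blocking request/response call; a minimal sketch is below, where `send_fn` is a hypothetical stand-in for the offloaded detection call over the keep-alive TCP socket:

```python
import time

def detect_with_delay(send_fn, sample):
    """Measure the end-to-end detection delay: from the moment the sample
    leaves the IoT device until the detection result comes back.
    send_fn is a blocking request/response call to the chosen HEC layer."""
    t0 = time.monotonic()
    result = send_fn(sample)                      # offloaded (or local) detection
    delay_ms = (time.monotonic() - t0) * 1000.0   # elapsed wall-clock time in ms
    return result, delay_ms
```

Using `time.monotonic()` rather than `time.time()` avoids distortions from system clock adjustments during long replay runs.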
User Actions: As shown in Fig. 8(a), we allow users to interact with and evaluate the HEC testbed using (i) either the univariate or the multivariate dataset, (ii) different fractions of normal and abnormal data in the datasets, and (iii) different schemes under evaluation: (1) the IoT Device scheme, which always detects directly on IoT devices; (2) the Edge scheme, which always offloads to an edge server; (3) the Cloud scheme, which always offloads to the cloud; (4) the Successive scheme, which first executes on IoT devices and then successively offloads to a higher layer until reaching a confident output or the cloud; and (5) the Adaptive scheme, which is our proposed adaptive model selection scheme. After the user clicks "Start", the result panel in Fig. 8(b) shows the continuously updated raw sensory data (accelerometer, gyroscope, magnetometer), the anomaly detection outcome (0 or 1) vs. the ground truth, the detection delay vs. the actions determined by our policy network, and the cumulative accuracy and F1-score.
4.5. State-of-the-Art Schemes in Comparison
Besides comparing our Proposed scheme with the baseline schemes described above (i.e., the IoT Device, Edge, Cloud, and Successive schemes), we compare it with state-of-the-art methods, referred to as kNN-sequence (Taylor et al., 2018) and Adapted-BlockDrop (Wu et al., 2018). In kNN-sequence (Taylor et al., 2018), a series of lightweight kNN classifiers provides decisions for choosing a proper inference model to use. We can directly apply this method to our datasets and our trained AD models by implementing three kNN classifiers for the univariate data and three kNN classifiers for the multivariate data. Note that, instead of using the value of k from (Taylor et al., 2018), we choose k via grid search for each dataset to achieve the best performance results.
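The per-dataset grid search for k can be sketched as follows. This is a pure-Python illustration only: the real classifiers operate on the extracted feature inputs described in (Taylor et al., 2018), and the candidate k values here are our assumptions.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Plain Euclidean kNN: majority vote among the k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def grid_search_k(train, val, candidates=(1, 3, 5, 7)):
    """Choose k by validation accuracy, as done per dataset in the text."""
    train_X, train_y = train
    def acc(k):
        return sum(knn_predict(train_X, train_y, x, k) == y
                   for x, y in zip(*val)) / len(val[1])
    return max(candidates, key=acc)
```

In our setting, the "labels" would be the HEC layer (or a binary offload decision, for kNN-sequence) recorded during the training-data generation procedure.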
Besides kNN-sequence, we also implement a variant of (Taylor et al., 2018), called kNN-single, as another basis for comparison. It directly outputs a selected layer for detecting anomalies based on extracted features of the input IoT data. For both the kNN-sequence and kNN-single schemes, we adopt a training procedure similar to that of (Taylor et al., 2018) to generate training data (i.e., feature inputs and output classes) and build the classifier models.
The prior work BlockDrop (Wu et al., 2018) is strongly tied to the computer-vision ResNet architecture, which is built from multiple stacked CNN blocks that can be bypassed via identity skip-connections to reduce complexity for each image-specific input. In contrast, our paper tackles sequential time-series data and allows different model architectures of different complexities, resulting in more flexibility for designing AD models at the HEC layers. In terms of similarity, both our approach and (Wu et al., 2018) use a policy network to make decisions, but with (i) different inputs (extracted features of a sequential window vs. an image), (ii) different outputs (a selected HEC layer for inference vs. a binary vector choosing which CNN blocks to involve in a reduced ResNet), (iii) different reward functions, and (iv) with vs. without network latency in mind. In addition, the policy network in BlockDrop (which also uses a ResNet architecture) is not lightweight and is therefore unsuitable to run on IoT devices in the HEC scenario. To make a fair comparison with BlockDrop (Wu et al., 2018) on the anomaly detection task, we follow their approach by implementing a policy network using the same extracted features of the sequential input, the same type of output as our application, and the same two-fully-connected-layer architecture as our approach. With this setup, we train this mimicking policy network, named Adapted-BlockDrop, with the reward function and parameters of (Wu et al., 2018) to penalize incorrect predictions at the IoT Device, Edge, and Cloud layers, respectively.
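The shared two-fully-connected-layer policy architecture can be sketched as a plain forward pass. The hidden size, initialization, and placeholder random weights below are illustrative assumptions; in practice the weights are learned by policy-gradient training.

```python
import math
import random

random.seed(0)

def make_policy(feat_dim, hidden=32, n_actions=3):
    """Two fully-connected layers mapping a feature vector of the input window
    to a choice among the HEC layers {IoT, Edge, Cloud}. Weights here are
    random placeholders for illustration."""
    w1 = [[random.gauss(0, 0.01) for _ in range(hidden)] for _ in range(feat_dim)]
    w2 = [[random.gauss(0, 0.01) for _ in range(n_actions)] for _ in range(hidden)]
    return w1, w2

def act_probs(policy, features):
    """Forward pass: FC -> ReLU -> FC -> softmax over the three actions."""
    w1, w2 = policy
    hidden = [max(0.0, sum(f * w for f, w in zip(features, col)))
              for col in zip(*w1)]
    logits = [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

During training, an action is sampled from these probabilities; at test time one can take the argmax.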
The idea of splitting large neural networks into smaller portions, one per HEC layer, as in BranchyNet (Teerapittayanon et al., 2017) and Neurosurgeon (Kang et al., 2017), is an interesting approach for image classification tasks and can save network bandwidth by transmitting compressed intermediate results between portions. However, these partial models must be portions of the same DNN and be jointly trained with a combined objective function, whereas our approach has the flexibility to use independent models at different HEC layers, taking advantage of whatever state-of-the-art DNN architecture is best suited for each layer. In addition, BranchyNet was designed for CNNs, and its implementation involves splitting the forward path and carefully gathering gradient losses in the backward path during backpropagation training. Transferring its idea of splitting and stacking to RNNs (such as the LSTM or GRU, as in our case) is not trivial and warrants a study of its own. Therefore, we do not directly compare with BranchyNet. On the other hand, BranchyNet (Teerapittayanon et al., 2017) is similar to the Successive scheme (which we do compare with) in the following sense: BranchyNet attempts its deep-CNN portions successively, and if an early portion does not satisfy the performance requirement, it moves on to the next portion of the deep CNN.

5. Experiment Results
5.1. Comparison Among Anomaly Detection Models
Table 1 compares the performance and complexity of the three AD models we use. For both univariate and multivariate data, the complexity of the AD models increases from IoT to cloud, as indicated by "# of Parameters" (weights and biases), which reflects the approximate memory footprint of each model, and the total "FLOP" (floating-point operations), which reflects the computation each model requires during inference. Along with this, the F1-score and accuracy increase as well; for example, these metrics are 95.5% and 19% higher for AE-Cloud than for AE-IoT, and 15% and 18.9% higher for BiLSTM-seq2seq-Cloud than for LSTM-seq2seq-IoT. To illustrate the nature of the edge model's errors, we show an example of the reconstruction performance of the AE-Edge model in Fig. 10.
On the other hand, the execution time (for running the detection algorithms) decreases from IoT to cloud, as indicated in the last row of Table 1, measured on the actual machines of our HEC testbed and averaged over five runs. This is due to the different computation capacities (communication capacity is accounted for by the end-to-end delay shown later). A further observation is that the LSTM-seq2seq models, which handle the multivariate dataset, take much longer to run (up to 591 ms) than the AE models, which handle the univariate dataset (up to 12.4 ms).
Dataset/Model    | Univariate/Autoencoder          | Multivariate/LSTM-seq2seq
Layer            | IoT     | Edge    | Cloud       | IoT    | Edge   | Cloud
# of Parameters  | 271,017 | 949,468 | 1,085,077   | 28,518 | 97,818 | 1,028,018
FLOP             | 1.35M   | 2.93M   | 5.41M       | 3.92M  | 7.84M  | 31.33M
Accuracy (%)     | 78.09   | 93.33   | 98.09       | 82.63  | 94.21  | 97.37
F1-score         | 0.465   | 0.741   | 0.909       | 0.852  | 0.955  | 0.980
Exec. time (ms)  | 12.4    | 7.4     | 4.5         | 591.0  | 417.3  | 232.3
5.2. Comparison Among Model Selection Schemes
The F1-score, accuracy, average detection delay, and total reward over the entire univariate and multivariate datasets under the four baseline schemes, three state-of-the-art schemes, and our proposed scheme are shown in Table 2. The IoT Device scheme achieves the lowest detection delay but also the poorest accuracy and F1-score among all evaluated schemes. At the other extreme, the Cloud scheme yields the best accuracy and F1-score but incurs the highest (end-to-end) detection delay. The Successive scheme leverages the distributed anomaly detectors in HEC and thus significantly reduces the average detection delay compared to the Edge and Cloud schemes; however, its accuracy and F1-score are outperformed by the Edge scheme. In contrast, our proposed scheme adaptively selects a suitable model for the AD task to jointly maximize accuracy and minimize detection delay. Thus, not only does it achieve a lower detection delay, but its F1-score and accuracy also consistently outperform those of the IoT Device, Edge, and Successive schemes. For univariate data, even though the F1-score and accuracy of our proposed scheme are marginally lower than those of the Cloud scheme, by 4.3% and 0.35% respectively, our scheme reduces the end-to-end detection delay by a substantial 84.9%. For multivariate data, we obtained the interesting result (for one setting of the tunable parameter) that the proposed scheme outperforms the Cloud scheme not only in delay (a 10.6% reduction) but also in F1-score and accuracy.
We believe that, in the multivariate case, our proposed scheme outperforms the Cloud scheme in F1-score and accuracy because of: (i) quality: the contextual information of the input data (i.e., the encoded states from the LSTM encoder) captures good feature representations of the multivariate data, which helps the policy network deliberately choose the best destination AD model for each input sequence so as to achieve the highest accuracy; and (ii) quantity: the multivariate dataset contains a larger number of training samples than the univariate dataset, which helps learn a better policy.
The kNN-single and kNN-sequence (Taylor et al., 2018) schemes achieve average delays lower than our proposed scheme. However, their average accuracy and F1-score are not competitive, being even worse than those of the Edge scheme. Therefore, while their equivalent reward values (calculated using the reward function (8)) exceed those of the baseline schemes, they are lower than that of our proposed scheme, particularly in the multivariate case. We also observe that kNN-single does not perform as well as kNN-sequence (which consists of three consecutive kNN classifiers) in terms of accuracy, F1-score, and reward.
The Adapted-BlockDrop scheme (Wu et al., 2018) achieves higher accuracy and F1-score than the IoT Device, Edge, Successive, kNN-single, and kNN-sequence schemes. However, it is outperformed by our proposed scheme on all performance metrics (accuracy, F1-score, and delay). In particular, on the univariate data, its average detection delay (301.91 ms) is 4 times that of our proposed scheme (76.12 ms).
Dataset      | Scheme            | F1    | Accuracy (%) | Delay (ms) | Reward
Univariate   | IoT Device        | 0.465 | 93.68        | 12.4       | 48.39
Univariate   | Edge              | 0.800 | 98.63        | 257.43     | 45.36
Univariate   | Cloud†            | 0.909 | 99.46        | 504.50     | 41.24
Univariate   | Successive        | 0.769 | 98.35        | 105.27     | N/A
Univariate   | kNN-single        | 0.588 | 96.15        | 31.29      | 49.27
Univariate   | kNN-sequence      | 0.741 | 98.07        | 54.92      | 49.77
Univariate   | Adapted-BlockDrop | 0.842 | 99.09        | 301.91     | 31.44‡
Univariate   | Proposed*         | 0.870 | 99.11        | 76.12      | 49.82
Multivariate | IoT Device        | 0.848 | 93.19        | 591.0      | 351.18
Multivariate | Edge              | 0.951 | 97.59        | 667.30     | 362.16
Multivariate | Cloud†            | 0.980 | 99.00        | 732.30     | 360.26
Multivariate | Successive        | 0.911 | 95.79        | 626.16     | N/A
Multivariate | kNN-single        | 0.925 | 96.39        | 597.71     | 366.22
Multivariate | kNN-sequence      | 0.929 | 96.59        | 598.79     | 367.07
Multivariate | Adapted-BlockDrop | 0.962 | 98.13        | 657.31     | 420.17‡
Multivariate | Proposed*         | 0.984 | 99.01        | 654.74     | 371.61

* Average results from at least 3 trained policy networks for the Proposed scheme.
† Cloud's marginal F1 and accuracy advantage comes at the cost of a much higher delay.
‡ These reward values are calculated according to (Wu et al., 2018), which yields a different range from the other values, calculated by our reward function; the marked values are therefore not comparable to the others.
In summary, and as also indicated in the last column, "Reward", which is a convex combination of accuracy and delay, our proposed adaptive scheme strikes the best trade-off between accuracy and detection delay for univariate data with hand-crafted feature representations of the contextual information, and achieves the best performance on both metrics for multivariate data with encoded feature representations. This is achieved by leveraging the distributed HEC architecture and our policy network, which automatically selects the best layer to execute the detection task.
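For intuition, a convex combination of accuracy and normalized delay can be sketched as below. This is an illustrative stand-in, not the exact reward function (8): the weight `alpha` and the delay normalization are our assumptions. Plugging in the univariate numbers from Table 2 shows how the Proposed scheme can earn a higher reward than Cloud despite slightly lower accuracy.

```python
def reward(accuracy, delay_ms, max_delay_ms, alpha=0.9):
    """Illustrative convex combination: reward accuracy, penalize delay
    normalized to [0, 1] by the worst-case delay; alpha trades them off."""
    return alpha * accuracy - (1.0 - alpha) * (delay_ms / max_delay_ms)

# Univariate figures from Table 2 (Cloud delay used as the normalizer):
proposed = reward(0.9911, 76.12, 504.5)
cloud = reward(0.9946, 504.5, 504.5)
print(proposed > cloud)  # -> True: the small accuracy loss is outweighed by the delay saving
```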
Execution time of the policy network: We measure the average execution time of the policy network on the actual IoT devices: 8.4 ms and 44.9 ms for univariate and multivariate data, respectively. For univariate data, although the execution time of our (small) policy network is comparable to the detection delay at the IoT layer (12.4 ms), it is only 11% of the overall average detection delay of the adaptive scheme (76.12 ms). For multivariate data, the execution time of the policy network is about 6.8% of the average end-to-end detection delay of the adaptive scheme (654.74 ms). Overall, the execution time of the policy network is negligible compared to the whole detection delay.
Among the state-of-the-art schemes, the policy network of the Adapted-BlockDrop scheme (Wu et al., 2018) uses the same neural network architecture as our policy network; hence, we do not compare execution times with Adapted-BlockDrop. We measure the average execution time of the decision-making module of the kNN-single scheme on the IoT devices (Raspberry Pi 3): 138.2 ms and 159.4 ms for univariate and multivariate data, respectively. For the kNN-sequence decision-making module, with its series of three kNN classifiers, the execution times measured on the IoT devices are 118.3 ms and 180.5 ms for univariate and multivariate data, respectively. Interestingly, kNN-single has a longer execution delay than kNN-sequence for univariate data. We examined this observation and found that most of the actions selected by kNN-sequence are the IoT layer (i.e., it exits after executing only the first kNN classifier). In addition, all the kNN classifiers in kNN-sequence are binary classifiers, whereas kNN-single is a multi-class classifier. This explains the above counter-intuitive result.
The key observation is that, even though we built kNN classifiers for embedded devices that are as lightweight as those in (Taylor et al., 2018), the execution time of the kNN-single and kNN-sequence schemes (118.3–180.5 ms) is still much longer than that of our policy network (8.4–44.9 ms), for both univariate and multivariate data.
5.3. Cost Function: A Trade-off Between Accuracy and Delay
We train different policy networks with different cost functions, varying the tunable parameter over a range of values for the univariate and multivariate data. In Fig. 11, we plot the mean and standard deviation of the accuracy (solid lines) and delay (dashed lines) of the proposed scheme (over at least 3 trained policy networks for each parameter value), and of the baseline schemes (which are independent of this parameter).
For univariate data, we can see in Fig. 11(a) that the accuracy and detection delay of the proposed scheme gradually decrease as the parameter increases. For suitable parameter values, the proposed scheme achieves accuracy as high as the Cloud scheme while its average delay is 60%–90% lower than that of the Cloud scheme. Based on Fig. 11(a), we choose the value that gives the best trade-off between accuracy and delay, at which the accuracy drops by only 0.29% compared to the Cloud scheme while the delay is even lower than that of the Successive scheme.
For multivariate data, we can see that the proposed scheme consistently achieves accuracy as high as the Cloud scheme regardless of the parameter value. As explained in Section 5.2, in some cases the accuracy actually exceeds that of the Cloud scheme. The average detection delay of the proposed scheme fluctuates around that of the Edge scheme.
5.4. Accelerated Training of Policy Network
Table 3 compares the average per-episode breakdown of delays when training policy networks under the traditional (sequential) training approach and the accelerated (parallel) training approach. As in the sequential approach, under the new parallel approach described in Section 4.3 we train the policy networks (with the same parameters and number of episodes) for univariate and multivariate data, leveraging parallel inference in the distributed environment, i.e., the distributed AD models at the multiple HEC layers. Table 3 shows that the average transmission delay per training episode under the parallel approach is significantly reduced, by 80.3% and 82.9% compared to the sequential approach for univariate and multivariate data, respectively. For the computing delay, we observe a slight increase under the parallel approach for univariate data but a significant decrease for multivariate data. The delay reduction for the multivariate data arises because the edge and cloud models use CuDNNLSTM units that leverage GPU hardware acceleration for parallel inference, while for the univariate data the autoencoder AD models run inference on CPU and take longer to compute a batch of multiple samples. For the training time itself, there is no difference between the two approaches. Note that the training time for multivariate data is 10 times that for univariate data because we split the large multivariate dataset into 10 mini-batches, while the small univariate dataset is trained with one batch per episode. The combined reduction in transmission and computing delays becomes substantial over the whole training process, which consists of hundreds of episodes. For example, for univariate data, training the policy network for 300 episodes took 7.6 hours under the sequential approach but only 1.7 hours under the parallel approach.
Dataset       | Approach   | Transmission delay (s) | Computing delay (s) | Training time (s)
Univariate*   | Sequential | 48.834 ± 5.696         | 0.661 ± 0.188       | 0.013 ± 0.006
Univariate*   | Parallel   | 9.605 ± 2.290          | 0.958 ± 0.697       | 0.015 ± 0.004
Multivariate† | Sequential | 420.78 ± 53.69         | 351.30 ± 43.67      | 0.141 ± 0.01
Multivariate† | Parallel   | 72.17 ± 9.66           | 78.06 ± 13.14       | 0.15 ± 0.12

* Training 300 episodes with 1 batch in each episode.
† Training 100 episodes with 10 mini-batches in each episode.
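The essence of the parallel approach, querying the AD models at all HEC layers concurrently for a whole mini-batch instead of one sample at a time, can be sketched with a thread pool. The `layer_fns` below are hypothetical stand-ins for the remote IoT/edge/cloud model services.

```python
from concurrent.futures import ThreadPoolExecutor

def query_all_layers(batch, layer_fns):
    """Send the mini-batch to all layer models concurrently and collect
    their outputs, cutting the per-episode transmission delay relative to
    sequential, single-sample querying."""
    with ThreadPoolExecutor(max_workers=len(layer_fns)) as pool:
        futures = {name: pool.submit(fn, batch) for name, fn in layer_fns.items()}
        return {name: f.result() for name, f in futures.items()}
```

Threads suffice here because the work is I/O-bound: each call mostly waits on the network while the remote model runs batched inference.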
5.5. Contextual Information: Hand-crafted vs. Encoded Features
In Section 5.2, we saw that the encoded states from the LSTM encoder for multivariate data, which capture good representations of the contextual information, help the policy network consistently outperform the Cloud scheme in terms of accuracy. In this section, we verify this claim for univariate data. Instead of using hand-crafted engineered features as the contextual information in the univariate case, we use an encoded vector (201 dimensions) from the encoder model as the contextual state for training the policy network. We still use a single-hidden-layer neural network with 500 hidden units as the policy network, and train multiple policy networks with different parameter values under this new setup. Fig. 12 compares the hand-crafted features and the encoded representations of the input in terms of accuracy, delay, and reward. We can see that the encoded representations boost the performance of the policy network, allowing it to achieve accuracy nearly as high as the Cloud scheme while keeping the detection delay significantly lower than that of the Cloud scheme. This result is consistent with our findings for multivariate data.
However, we note that the detection delay under the new setup is higher than with hand-crafted features. Therefore, if one needs to balance accuracy and detection delay, good hand-crafted features are a good option; however, finding good feature representations by hand requires domain-specific knowledge and considerable effort.
6. Conclusions
We identified three issues in existing IoT anomaly detection approaches, namely using one universal model to fit all data, a lopsided focus on accuracy, and a lack of local analysis. We then proposed an adaptive approach to anomaly detection for both univariate and multivariate IoT data in distributed HEC. It constructs multiple distributed AD models of increasing complexity, based on an autoencoder and LSTMs, and associates each of them with an HEC layer from bottom to top, i.e., IoT devices, edge servers, and the cloud. It then uses a reinforcement learning based adaptive scheme to select the best-suited model on the fly based on the contextual information of the input data. The scheme consists of a policy network that solves a contextual-bandit problem, characterized by a single-step MDP. We also presented an accelerated method for training the policy network that takes advantage of the distributed AD models of the HEC system. We implemented the proposed scheme and conducted experiments using two real-world IoT datasets on the HEC testbed that we built. By comparing with baseline and state-of-the-art schemes, we showed that our proposed scheme strikes the best accuracy-delay trade-off on the univariate dataset, and achieves the best accuracy and F1-score on the multivariate dataset with a delay only negligibly larger than that of the best scheme. For example, in the univariate case, the proposed scheme reduces the detection delay by 84.9% while retaining comparable accuracy and F1-score, compared to the cloud offloading approach.
References
(Author names are given only where the paper's inline citations identify the entry.)

Scheduled sampling for sequence prediction with recurrent neural networks. In Proc. NIPS, pp. 1171–1179.
Chalapathy and Chawla (2019). Deep learning for anomaly detection: a survey. arXiv:1901.03407.
Chen et al. (2017). An empirical study of latency in an emerging class of edge computing applications for wearable cognitive assistance. In Proc. ACM/IEEE SEC.
SEED RL: scalable and efficient deep-RL with accelerated central inference. In Proc. ICLR. arXiv:1910.06591.
Outlier detection for temporal data: a survey. IEEE Transactions on Knowledge and Data Engineering 26(9), pp. 2250–2267.
Han et al. (2016). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proc. ICLR.
Hinton et al. (2015). Distilling the knowledge in a neural network. NIPS 2014 Deep Learning Workshop.
Kang et al. (2017). Neurosurgeon: collaborative intelligence between the cloud and mobile edge. SIGARCH Comput. Archit. News 45(1), pp. 615–629.
HOT SAX: efficiently finding the most unusual time series subsequence. In Proc. IEEE ICDM.
La et al. (2019). Enabling intelligence in fog computing to achieve energy and latency reduction. Digital Communications and Networks 5(1), pp. 3–9.
Learning IoT in edge: deep learning for the Internet of Things with edge computing. IEEE Network 32(1), pp. 96–101.
Runtime neural pruning. In Advances in Neural Information Processing Systems, Vol. 30.
Luo and Nagarajan (2018). Distributed anomaly detection using autoencoder neural networks in WSN for IoT. In Proc. IEEE ICC.
Malhotra et al. (2016). LSTM-based encoder-decoder for multi-sensor anomaly detection. ICML Workshop.
Long short term memory networks for anomaly detection in time series. In Proc. ESANN, p. 89.
Mohammadi et al. (2018). Deep learning for IoT big data and streaming analytics: a survey. IEEE Communications Surveys & Tutorials 20(4), pp. 2923–2960.
Coordinated container migration and base station handover in mobile edge computing. In Proc. IEEE GLOBECOM 2020, pp. 1–6.
Adaptive anomaly detection for IoT data in hierarchical edge computing. AAAI Workshop. arXiv:2001.03314.
Contextual-bandit anomaly detection for IoT data in distributed hierarchical edge computing. IEEE ICDCS Demo track.
Anomaly detection for temporal data using long short-term memory (LSTM). Master's thesis.
Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proc. ACM SIGKDD, pp. 2828–2837.
Sequence to sequence learning with neural networks. In Proc. NIPS.
Policy gradient methods for reinforcement learning with function approximation. In Proc. NIPS, pp. 1057–1063.
Taylor et al. (2018). Adaptive deep learning model selection on embedded systems. ACM SIGPLAN Notices 53, pp. 31–43.
BranchyNet: fast inference via early exiting from deep neural networks. In Proc. ICPR, pp. 2464–2469.
Teerapittayanon et al. (2017). Distributed deep neural networks over the cloud, the edge and end devices. In Proc. IEEE ICDCS, pp. 328–339.
Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3–4), pp. 229–256.
Wu et al. (2018). BlockDrop: dynamic inference paths in residual networks. In Proc. CVPR, pp. 8817–8826.
AAIoT: accelerating artificial intelligence in IoT systems. IEEE Wireless Communications Letters 8(3), pp. 825–828.