## I Introduction

Machine Learning (ML) is considered to be one of the biggest innovations since the microchip, and it is well on its way to becoming one of the most prolific technologies of this century. From image recognition and natural language processing to intelligent business and production processes, ML, and deep learning in particular, is transforming almost every aspect of our daily lives [acc2019]. The progress witnessed during the last decade has been fuelled mainly by the convergence of three crucial factors, without any of which this progress would have been nearly impossible: the availability of computing power, algorithmic improvements, and data, where the latter is often considered to be today’s most valuable commodity. A major pipeline for acquiring data is the Internet of Things (IoT) [gupta2016abc]. In fact, the IoT and deep learning have a two-way, symbiotic relationship. On the one hand, the IoT is one of the main benefactors of deep learning, as it constantly produces a vast amount of data. On the other hand, the IoT ecosystem may benefit significantly from the fact that deep learning can turn the data into insights and actions for improving IoT-powered processes and services.

In the IoT, a growing army of sensors capable of registering location, voices, faces, audio, temperature, sentiment, health and the like operate autonomously, with little to no human intervention. IoT nodes are typically heavily constrained, in terms of hardware, computational abilities and energy supply. In particular, IoT nodes are battery powered and for many IoT applications these simple nodes are expected to operate for at least 10 years without battery recharging or battery replacement [gubbi2013internet]. In spite of these significant challenges, the number of connected IoT devices is continuing to grow and consequently so does the amount of available data to train ML models. However, more data does not necessarily mean good data, but it certainly means higher computational cost. In addition, using large data sets has many ripple effects, from an increased memory cost to a non-negligible environmental impact [strubell2019energy]. Luckily, when working with an ML model, many data samples are redundant and thus can be ignored without impacting the final model and the inference ability of the algorithm [schroff2015facenet]. This is due to the fact that many data samples are either not informative or can already be properly handled by the ML model.

Motivated by this idea, in this paper, we investigate how to reduce the size of big data in order to produce better data in the IoT. To this end, we consider a cell comprised of IoT nodes and an Access Point (AP), as illustrated in Fig. 1. The IoT devices are assumed to monitor their environment and generate data samples that are to be transmitted to the AP in order for the AP to make inferences. The more data samples are transmitted to the AP, the more accurate the model becomes at the cost of a higher energy expenditure, which leads to reducing the lifetime of the IoT devices. Hence, there exists a trade-off between, on the one hand, the inference accuracy of the AP, and, on the other hand, the lifetime of the IoT devices. In order to optimize this trade-off, we propose *distributed importance filtering* of the data. The basic idea is to filter out redundant data samples already at the IoT nodes, such that these data samples are not sent to the AP, thereby preserving resources. The proposed scheme is grounded on the fact that humans can learn from a small collection of examples, and a good tutor is able to pick examples that are useful for a student to learn the current lesson. In the proposed distributed filtering scheme, the ML model itself is both the student and the teacher, as the model is used to figure out which examples are currently informative for itself.

Unlike edge intelligence paradigms in which the edge nodes host neural networks, such as in federated learning [mcmahan2016communication] [konevcny2016federated], we consider generic IoT devices without any intelligence, with limited computational abilities and limited energy supply. By employing the proposed scheme, the longevity of the IoT nodes is drastically improved, the computational complexity for training the model is reduced and additional data pre-processing steps aiming to decrease the data set size are no longer needed, all at the expense of only a small reduction in inference accuracy at the AP. We evaluate the proposed scheme on real-life problems, namely leakage detection in water distribution networks, as well as air-pollution detection in urban areas. Results confirm that our scheme offers significant benefits in terms of network longevity, whilst maintaining high inference accuracy.

## II System Model

We consider a communication setup comprised of IoT nodes transmitting to a common AP, as shown in Fig. 1. The nodes in the cell are assumed to generate measurements from the environment, i.e., data samples which are denoted by $x \in \mathcal{X}$, where $\mathcal{X}$ denotes the full data set. We assume that part of the generated data samples are filtered out by the nodes and only the remaining data samples are then sent to the AP. Thereby, during a time interval of length $T$, a node is assumed to generate $N$ data samples and transmit $M$ data samples to the AP, such that $M \le N$. We assume that the data samples are placed as payloads in the data packets which are to be transmitted to the AP. The packet rate is thus given by $R = M/T$, expressed in number of data samples per second. The data packets are assumed to be received and decoded by the AP without errors. In addition, all IoT nodes are assumed to be typical in the sense that they face severe constraints on computation, energy consumption, and memory.

Contrary to the nodes, the AP is assumed to have abundant resources in terms of computation and memory. As a result, the AP hosts artificial deep neural networks of a given depth and size, and it is able to make inferences based on the received data samples $x$ (the type of inference depends on the application). In order for the AP to make inferences, the neural network is trained to approximate a desired function of the received data samples, denoted by $f(x)$, by solving

$$\hat{f}^{*} = \arg\min_{\hat{f}} \sum_{x \in \mathcal{X}} \mathcal{L}\big(f(x), \hat{f}(x)\big), \tag{1}$$

where $\mathcal{X}$ is the full set of collected data samples from all users, $\mathcal{L}$ is the loss function, and $\hat{f}$ is the approximation. To determine the optimal approximation $\hat{f}^{*}$ which minimizes the loss function $\mathcal{L}$, the weights $w$ of the neural network are iteratively adjusted as the neural network is presented with data samples from the nodes for training via the Stochastic Gradient Descent (SGD) algorithm. Thereby, in each SGD iteration $t$ the weights of the neural network are adjusted according to

$$w_{t+1} = w_{t} - \eta \nabla_{w} \mathcal{L}\big(f(x_t), \hat{f}(x_t)\big), \tag{2}$$

where $\eta$ denotes the learning rate. The time and the number of data samples required for the neural network to find the optimal approximation, $\hat{f}^{*}$, are characterized by the computational complexity and the sample complexity, respectively [goodfellow2016deep].
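To make the SGD weight update concrete, the following is a minimal sketch for a linear model with squared loss. The model, the data, and the names `sgd_step`, `w_true`, and `lr` are illustrative assumptions and are not taken from the networks used in this paper.

```python
import numpy as np

def sgd_step(w, x, y, lr):
    """One SGD update for a linear model with loss 0.5 * (w @ x - y) ** 2."""
    error = w @ x - y          # prediction error for this sample
    grad = error * x           # gradient of the loss with respect to w
    return w - lr * grad       # step against the gradient

# Toy, noiseless regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
w = np.zeros(2)
for _ in range(2000):
    x = rng.normal(size=2)
    y = w_true @ x
    w = sgd_step(w, x, y, lr=0.05)
```

In this toy setting the weights approach `w_true`. Note that a sample the model already fits well yields a near-zero gradient and therefore hardly changes the weights.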

## III Problem Formulation

Let us assume that the training error $\epsilon$ and the packet rate $R$ are related according to a decreasing function $g(\cdot)$, as higher rates produce more data samples for training [hestness2017deep]. Thereby,

$$\epsilon = g(R). \tag{3}$$

In addition, let us assume that the network longevity, denoted by $\Lambda$, and the packet rate $R$ are related according to a decreasing function $h(\cdot)$, as high packet rates increase the energy consumption of the nodes, due to the frequent use of the wireless channel. This drastically decreases the network longevity, as the nodes in question are assumed to have a limited energy supply. Thereby,

$$\Lambda = h(R). \tag{4}$$

This interplay motivates us to address the role of the rate $R$ in the inference accuracy of the neural network and the longevity of the IoT nodes. In other words, we aim to maximize the network longevity by reducing the data rate $R$, whilst keeping the inference accuracy at a desired level. Thereby, we try to solve the following optimization problem

$$\max_{R} \; \Lambda \quad \text{subject to} \quad \epsilon \le \epsilon_{0}, \tag{5}$$

where $\Lambda$ denotes the network longevity, $\epsilon$ denotes the training error, and $\epsilon_{0}$ denotes a pre-determined threshold. This is illustrated in Fig. 2. As the benchmark curve suggests, lower rates $R$ decrease the training accuracy (i.e., increase the error), as less data is available for training for a fixed training time. Conversely, higher rates lead to higher accuracy; however, higher rates yield a large number of data samples, which in turn results in a decrease of the network longevity. Thereby, by solving (5), we shift the benchmark curve to the target curve, and to this end, we resort to selecting which data packets (i.e., data samples) should be transmitted by the edge nodes to the AP, by leveraging the fact that in large-scale settings, much of the data is often redundant, though there may also be a small set of data points that are distinctive. This method, which can be thought of as a pre-processing step in classical data science, constructs a *coreset*, i.e., a small, weighted subset of the data that approximates the full data set and that can be used in many standard inference procedures to provide posterior approximations with guaranteed quality.
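Because the longevity is assumed to decrease with the packet rate, maximizing it under an error constraint amounts to picking the smallest rate whose error still meets the threshold. The sketch below illustrates this reading on a purely hypothetical error curve; `toy_error`, the candidate rates, and the threshold value are assumptions made for illustration and are not results from this paper.

```python
def min_feasible_rate(rates, error_of_rate, threshold):
    """Return the smallest rate whose error is within the threshold, or None."""
    for rate in sorted(rates):
        if error_of_rate(rate) <= threshold:
            return rate
    return None

def toy_error(rate):
    """Hypothetical decreasing error curve, for illustration only."""
    return 0.5 / (1.0 + 10.0 * rate)

candidate_rates = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
best = min_feasible_rate(candidate_rates, toy_error, threshold=0.12)
```

Here `best` is the lowest candidate rate that satisfies the constraint, and `None` is returned when no candidate is feasible.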

## IV Proposed Algorithm

In this section, we present the proposed algorithm for distributed importance filtering. To this end, we first provide some intuition in Subsection IV-A, and the proposed algorithm is explained in detail in Subsection IV-B.

### IV-A Intuition

When training ML models, not all data samples are equally important [schroff2015facenet], [loshchilov2015online]. In fact, many of them are redundant and can be ignored without impacting the final model. A useful tool which allows for filtering data samples is coreset construction [feldman2011unified]. The key idea behind coresets is to approximate the original data set of size $N$ by a (potentially) weighted set of size $M$, where $M \ll N$, and where the weights depend on the coreset type. In other words, coresets are succinct, small summaries of large data sets, created in a way to ensure that the solutions found based on the summary are provably competitive with the solution found on the full data set. Coresets were originally studied in the context of computational geometry, and older approaches often relied on computationally expensive methods such as exponential grids. The existence of coresets is trivial, as the original data set is in itself a coreset. One can even construct a coreset by simply sampling the data points uniformly. Assuming the coreset has thousands to hundreds of thousands of data points, by the law of large numbers the obtained data subset is indeed a coreset.
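As a toy illustration of a weighted coreset built by importance sampling, the sketch below summarises a data set for the simple task of estimating its sum. The task, the sampling distribution, and the names `sample_coreset` and `importance` are assumptions chosen for clarity; the scheme in this paper scores samples by their loss gradient instead.

```python
import numpy as np

def sample_coreset(values, probs, m, rng):
    """Draw m indices with probabilities probs and return (indices, weights).

    The weights 1 / (m * probs[i]) make the weighted sum an unbiased
    estimate of the sum over the full data set.
    """
    idx = rng.choice(len(values), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights

rng = np.random.default_rng(1)
n = 10_000
values = np.ones(n)
values[:20] = 500.0                      # a few rare, influential points
uniform = np.full(n, 1.0 / n)            # uniform sampling distribution
importance = values / values.sum()       # sampling proportional to magnitude

full_sum = values.sum()
idx_u, w_u = sample_coreset(values, uniform, 200, rng)
idx_i, w_i = sample_coreset(values, importance, 200, rng)
estimate_uniform = float(np.sum(w_u * values[idx_u]))
estimate_importance = float(np.sum(w_i * values[idx_i]))
```

With sampling proportional to magnitude, every weighted term contributes the same amount, so `estimate_importance` matches `full_sum` up to rounding. The uniform estimate depends heavily on whether the 20 rare points happen to be drawn.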

However, we are interested in approaches with higher practical value, and we therefore resort to constructing coresets on the basis of importance sampling. Since we neither have the full data set in advance, nor make any assumptions on the probability distribution of the data, we construct our coreset in a streaming setting. Formally, we refer to our approach as distributed importance filtering, as we propose to filter out redundant data samples at the edge. As a result of reducing the number of samples that are transmitted to the AP, we reduce the computational cost associated with training the model (or the amount of memory required to store all data samples), and also improve the network longevity. One major advantage of coresets for IoT data is that they can be constructed in parallel. In other words, each device in the cell can independently construct a coreset, and the resulting data set from all received data samples at the AP is also a coreset.

### IV-B Proposed Algorithm for Distributed Importance Filtering

Consider the weights update rule given in (2). In order to update the neural network weights $w$, in each SGD iteration the "error signal" given by the loss gradient $\nabla_{w} \mathcal{L}$ is back-propagated through the model. Samples which are redundant, correctly handled by the model, or not useful will not induce changes to the weights, and feeding more of such samples only increases the computational cost and brings no benefit. The biggest change in the weights will be induced by the data samples which are deemed informative for and by the model. In order to identify such informative data samples, we introduce leverage scores defined as

$$l_{i} = \big\lVert \nabla_{w} \mathcal{L}\big(f(x_i), \hat{f}(x_i)\big) \big\rVert_{2}, \tag{6}$$

where $\lVert \cdot \rVert_{2}$ denotes the 2-norm of the loss gradient vector for data sample $x_i$. Thereby, influential samples obtain high leverage scores and, conversely, low leverage scores are allocated to non-influential data samples. In order to obtain leverage scores which are sufficiently good proxies for the importance of the data samples, initially, all measurements are transmitted by setting the transmission probability $p_i = 1$. Thereby, initially, the node transmits all generated data samples, at the maximum rate. After the initial time slots, the AP computes leverage scores, and lowers the packet rate to $R$ by reducing the transmission probabilities. More specifically, the transmission probabilities are selected such that they are proportional to the leverage scores, i.e.,

$$p_{i} \propto l_{i}. \tag{7}$$
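The leverage scores and the proportional transmission probabilities can be sketched as follows for a logistic-regression model, for which the per-sample loss gradient has a closed form. The model choice, the scaling to a target fraction of transmitted samples, and the names `leverage_scores`, `transmission_probabilities`, and `target_fraction` are assumptions for illustration; the text only states that the probabilities are proportional to the scores.

```python
import numpy as np

def leverage_scores(w, X, y):
    """2-norm of the per-sample loss gradient for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted class-1 probability
    # The cross-entropy gradient w.r.t. w is (p - y) * x for each sample.
    return np.abs(p - y) * np.linalg.norm(X, axis=1)

def transmission_probabilities(scores, target_fraction):
    """Probabilities proportional to the scores, scaled so that their mean is
    target_fraction and then clipped to 1 (clipping can lower the mean)."""
    mean_score = scores.mean()
    if mean_score == 0.0:
        return np.full_like(scores, target_fraction)
    return np.minimum(1.0, target_fraction * scores / mean_score)

# Hypothetical samples: the last one is confidently misclassified.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([3.0, -3.0])
scores = leverage_scores(w, X, y)
probs = transmission_probabilities(scores, target_fraction=0.25)
```

Samples that the model already classifies correctly receive small scores and are rarely transmitted, while the misclassified sample receives the largest transmission probability.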

Note that, in cases when the data probability distribution is known, the transmission probability can be adjusted according to the data probability distribution. However, in practice, the data probability distribution is rarely known, or it takes a (prohibitively) long time to be properly estimated. As we do not make any assumptions about the probability distribution of the data samples, we set the transmission probability to be proportional to the leverage scores, as given in (7). Once the leverage scores and transmission probabilities are obtained according to (6) and (7), respectively, they are transmitted back to the nodes, and each node transmits new data samples to the AP during the following time interval. Upon reception, the neural network weights are updated according to (2) for each data sample, and the leverage scores are updated according to (6). The updated leverage scores are transmitted back to the nodes, and this procedure is repeated until convergence. Note that, as the proposed algorithm filters data packets which are not useful during training (as judged by the internal state of the model), the proposed algorithm can address semantic problems in IoT systems [shannon1948mathematical], [popovski2019semantic]. In addition, by employing importance filtering, data pre-processing steps aimed at reducing the data set size are no longer needed, as the pre-processing is essentially carried out by the edge nodes themselves. Further, the energy consumption profile of distributed importance filtering is much better than that of classical schemes, as fewer packets are transmitted. In practice, for a filtered sample the node consumes energy only for reception and for waking up. The former is due to the fact that the IoT nodes need to receive the leverage scores, for which the radio is engaged, and the latter is due to the fact that the nodes need to make a measurement and decide whether to transmit or not. If not, the node immediately initiates a turn-off sequence and goes back to sleep, without engaging the radio. Thereby, the node does not consume energy for radio preparation, transmission, acknowledgement reception, and turning off the radio. As the leverage score updates are far less frequent than potential transmissions, the energy consumption of the proposed scheme is lower compared to the case when all packets are transmitted. This leads to significant improvements in terms of energy consumption, and implicitly network longevity.
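The node-side behaviour described above (wake up, measure, decide, and either transmit or go back to sleep) can be sketched as a simple simulation. The function `node_round` and the assumption that the node holds one transmission probability per upcoming measurement are hypothetical simplifications, not details given in the text.

```python
import random

def node_round(measurements, probs, rng):
    """Simulate one feedback interval at a node.

    For each measurement the node wakes up, draws a Bernoulli variable with
    the given transmission probability, and either transmits the sample or
    goes back to sleep without engaging the radio.
    """
    sent, radio_uses = [], 0
    for sample, p in zip(measurements, probs):
        if rng.random() < p:       # transmit with probability p
            sent.append(sample)
            radio_uses += 1
    return sent, radio_uses

rng = random.Random(0)
measurements = [0.1 * k for k in range(100)]
probs = [0.2] * len(measurements)  # hypothetical feedback from the AP
sent, radio_uses = node_round(measurements, probs, rng)
```

Each entry counted in `radio_uses` corresponds to one transmission; all other wake-ups end without the radio being engaged, which is where the energy saving comes from.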

The only parameter which needs to be explicitly defined in the proposed algorithm is the packet rate $R$. The proposed algorithm is given as a flow chart in Fig. 3. In the spirit of reproducible science, the code for the algorithm will be made publicly available at [ikoloska].

## V Numerical Evaluation

In this section, we evaluate the proposed algorithm. To this end, the used data sets are presented in Subsection V-A, and the architecture of the neural networks is presented in Subsection V-B. The benchmark schemes are described in Subsection V-C, and finally, the results are presented in Subsection V-D.

### V-A Data Sets

We consider two representative IoT applications, i.e., leak detection in water distribution networks, and air pollution detection in urban areas.

#### V-A1 Leak detection

In a water distribution network, water is pumped from the source at one end, through a structure of pipes, to the end users. However, due to external stressors or accumulated damage, water often leaks in this structure. Such leaks in the pipes are detected by measuring the flow at different points of the distribution network [koo2015towards], [mutchek2014moving], [geetha2016internet]. The data sets used are generated by the publicly available simulator EPANET, which is widely used for simulation of water network data [rossman2000epanet]. The architecture of the considered water distribution network is given in [rossman2000epanet]. Thereby, the data set is comprised of hourly measurements from links in the water distribution network. In the time series data of flow measurements, leaks are induced according to [fagiani2015novelty], [boracchi2014exploiting].

#### V-A2 Air pollution detection

A large part of the world is experiencing chronic air pollution with severe fine particulate matter concentration (PM$_{2.5}$ in particular). High levels of PM$_{2.5}$ are observed and detected via national air pollution monitoring networks. We consider the data set from the Beijing air pollution monitoring network [zhang2017cautionary], comprised of hourly measurements from monitoring stations, in order to train our neural network to detect dangerous levels of PM$_{2.5}$ in the air.

### V-B Architecture of the Neural Networks

The neural networks for both problems have a single hidden layer, with 5 and 32 hidden units for leak detection and air pollution detection, respectively. The learning rates are set to 0.0005 and 0.0001 for leak detection and air pollution detection, respectively. Adam and SGD are the optimizers used for leak detection and air pollution detection, respectively. In addition, in the case of air pollution detection, Dropout with probability 0.2 is used to regularise the neural network. The hyper parameters are summarised in Table I.

| Hyper parameters | Leak detection | Air pollution detection |
|---|---|---|
| No. of hidden layers | 1 | 1 |
| No. of hidden units | 5 | 32 |
| Learning rate | 0.0005 | 0.0001 |
| Optimizer | SGD | Adam |
| Dropout | 0 | 0.2 |

### V-C Benchmark Schemes

We consider Uniform Filtering and Genie-aided Filtering as benchmark schemes. In order to ensure fairness among all schemes, an equal number of packets is transmitted with each scheme.

#### V-C1 Uniform filtering

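Uniform filtering is straightforward, and relies on uniform selection of packets to be transmitted, i.e., the node transmits a packet with probability

$$p = \frac{M}{N}, \tag{8}$$

where $N$ and $M$ denote the number of data samples generated and transmitted by the node, respectively.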
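A minimal sketch of this benchmark is given below, assuming that every sample is kept independently with the same probability, equal to the ratio of the number of samples to be sent to the number generated. The function name `uniform_filter` and the sample counts are illustrative.

```python
import numpy as np

def uniform_filter(num_generated, num_to_send, rng):
    """Boolean mask: each sample is sent independently with equal probability."""
    p = num_to_send / num_generated
    return rng.random(num_generated) < p

mask = uniform_filter(10_000, 1_000, np.random.default_rng(2))
num_sent = int(mask.sum())
```

The number of transmitted samples is random but concentrates around the requested budget. No information about the samples themselves is used in the decision.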

#### V-C2 Genie-aided filtering

For Genie-aided filtering, we assume the existence of a genie in the network. The genie knows not only the probability distribution of the data samples, but also the correct class probability. Since the genie knows the correct class probability, the genie is able to balance the training data set and implement a form of importance sampling by choosing to transmit data samples with probability

(9) |

In other words, the genie downsamples the majority class and upsamples the minority class, and as a result ensures that samples from both classes are transmitted. This is especially important in cases when the rate is very low. Note that, with uniform sampling, often only samples from the majority class are transmitted when the rate is very low.
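One way to realise such class balancing is to make the transmission probability inversely proportional to the class frequency. The sketch below takes this route; it is an assumed form chosen for illustration and not necessarily the exact expression used by the genie, and the names `balanced_probabilities` and `target_fraction` are hypothetical.

```python
import numpy as np

def balanced_probabilities(labels, target_fraction):
    """Per-sample probabilities inversely proportional to class frequency,
    scaled to a target mean and clipped to 1."""
    classes, counts = np.unique(labels, return_counts=True)
    inv_freq = {c: 1.0 / n for c, n in zip(classes, counts)}
    raw = np.array([inv_freq[c] for c in labels])
    scaled = target_fraction * raw / raw.mean()
    return np.minimum(1.0, scaled)

# Hypothetical imbalanced data set: 95 majority and 5 minority samples.
labels = np.array([0] * 95 + [1] * 5)
probs = balanced_probabilities(labels, target_fraction=0.1)
expected_majority = float(probs[labels == 0].sum())
expected_minority = float(probs[labels == 1].sum())
```

In this example the expected number of transmitted samples is the same for both classes, whereas uniform filtering at the same budget would send mostly majority-class samples.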

### V-D Results

We first plot the training error (MSE) as a function of the training time. During the initial hours, the rate is set to , and it is then decreased to . Initially, when all samples are transmitted, the training error drops to . However, as the rate is decreased and the node starts to transmit only the important samples, the training error drastically decreases. Note that, the exhibited decrease isn’t necessarily smooth, and a small spike in the training error can be noticed right before the final update of the leverage scores (hour ). Still, after the final update, the training error reaches its minimal value.

The maximum packet rate for both problems is packet per hour. We present the maximum achievable accuracy of the proposed distributed importance filtering scheme as a function of the average packet rate in Fig. 5 and Fig. 6. For the proposed scheme, during the initial hours, the node is allowed to transmit packets with packet rate packet per hour. Then, the number of transmitted packets is varied from packets every hours (resulting in packets per hour) to packets every hours (resulting in packets per hour). As expected, the error decreases when the packet rate increases, as more data samples are fed to the model. Fig. 5 shows that the proposed distributed importance filtering scheme with can result in equal prediction error to the case when all packets are transmitted and packets per hour in the case of leak detection. However, unlike the case when packets per hour, with the proposed scheme the communication channel is sparsely used and the computational cost to the model is lower. In practice, this translates to a significant reduction of energy consumption on the network side, and a reduction of the memory required to compute and store all data samples on the AP side. In addition, the proposed scheme significantly outperforms the uniform filtering scheme, and achieves similar performance to the genie-aided filtering scheme in the case of leak detection. In fact, both schemes result in the same detection error with only a small increase of the packet rate for the proposed scheme from to . Improvements in terms of detection error can also be seen in the case of air pollution detection, and in fact the proposed distributed importance filtering scheme offers results comparable with the genie-aided filtering scheme.

An interesting property that can be leveraged in order to improve the performance of the proposed scheme even further (as well as the benchmarks) is the nature of IoT data itself. In particular, IoT samples in many applications tend to exhibit a large degree of correlation. To demonstrate, in water distribution networks, such as the one we are considering in this section, leaks propagate through the entire network, and thereby the flow measurements in the presence of a leak are almost perfectly correlated. In Fig. 7 we plot the detection error for (at each node) for two consecutive links in the distribution network. To provide contrast, in the case of air pollution detection, we consider two neighbouring monitoring sites (a map of the sites is available in [zhang2017cautionary]), which exhibit some degree of correlation; however, due to the underlying weather conditions (such as wind speed and direction), the data samples are not perfectly correlated. We plot the detection error for (at each node) for two neighbouring monitoring stations in the network in Fig. 7. We first note the significant improvements in terms of average detection error, especially when it comes to leak detection. Because the data samples for air pollution detection are not perfectly correlated, whilst using two monitoring stations offers detection error improvements (around ), these improvements are not as significant as in the case of leak detection (around ). Second, consider the improved performance of uniform filtering in the case of air pollution detection (which can also be seen in Fig. 6). This is due to the fact that the air pollution data is more balanced. In other words, when some events are rarer than others, it is imperative to extract the data samples associated with those events. Such data samples are highly important and can be extracted via the scheme proposed in this paper. This is the main reason why distributed importance filtering, as well as genie-aided filtering, perform very well. Meanwhile, the more balanced a data set is, the better the performance of uniform filtering is going to be.

## VI Discussion and Conclusions

The proposed scheme in this paper improves the longevity of the network, reduces the computational complexity for training the model, and eliminates additional data pre-processing steps. In addition, the algorithm is not exclusively designed to cater to water leakage detection, or air-pollution detection, and as a result it can be used in many real-life scenarios. In fact, the proposed algorithm can be used for any case where the leverage scores can be found according to (6). Straightforward use cases include contamination detection in water distribution networks, gas leak detection, power fluctuations/outage detection in smart grids, blockage detection in oil pipelines etc.

Extending this work to the more complex case when packet errors are taken into account is rather straightforward. However, a degradation in performance which depends on the packet error rate is to be expected, both for the proposed scheme and for the benchmarks. In particular, higher packet rates will be needed to achieve low detection error, as the nodes will need to compensate for the lost packets.
