The recent demand for machine learning (ML) applications, such as image recognition, natural language translation, and health monitoring, has been unprecedented. These services collect data streams generated by small devices, and analyze them locally or at distant cloud servers. There is growing consensus that such applications will be ubiquitous in Internet of Things (IoT) systems. The challenge, however, is that such services are often resource intensive. On the one hand, the cloud offers powerful ML models and abundant compute resources but requires data transfers, which consume network bandwidth and might induce significant delays. On the other hand, executing these services at the devices economizes bandwidth but degrades their performance due to the devices' limited resources, e.g., memory or energy.
A promising approach to tackle this problem is to allow the devices to outsource individual ML tasks to edge infrastructure such as cloudlets [25]. This can increase their execution accuracy since the cloudlet's ML components are typically more complex, and hence offer improved results. Nevertheless, the success of such solutions presumes intelligent outsourcing algorithms. The cloudlets, unlike the cloud, have limited computing capacity and cannot support all requests. At the same time, task execution requires the transfer of large data volumes (e.g., video streams). This calls for prudent transmission decisions in order to avoid wasting device energy and bandwidth. Furthermore, unlike prior computation offloading solutions, it is crucial to only outsource the tasks that can significantly benefit from cloudlet execution.
Our goal is to design an online framework that addresses the above issues and makes intelligent outsourcing decisions. We consider a system where a cloudlet improves the execution of image classification tasks running on devices such as wireless IoT cameras. We assume that each device has a "low-precision" classifier while the cloudlet can execute the task with higher precision. The devices classify the received objects upon arrival and decide whether or not to transmit them to the cloudlet to get a better classification result. Making this decision requires an assessment of the potential performance gains, which are measured in terms of accuracy improvements. To this end, we propose the use of a predictor at each device that leverages the local classification results.
We consider the practical case where the resources' availability is unknown and time-varying, but their instantaneous values are observable. We design a distributed adaptive algorithm that decides the task outsourcing policy towards maximizing the long-term performance of analytics. To achieve this, we formulate the system's operation as an optimization problem, which is decomposed via Lagrange relaxation to a set of device-specific problems. This enables its distributed solution through an approximate (due to the unknown parameters) dual ascent method that can be applied in real time. The method is inspired by primal averaging schemes for static problems, e.g., see [22], and achieves a bounded and tunable optimality gap using a novel approximate iteration technique. Our contributions can be summarized as follows:
Edge Analytics. We study the novel problem of intelligently improving data analytics tasks using edge infrastructure, which is increasingly important for the IoT.
Decision Framework. We propose an online task outsourcing algorithm that achieves near-optimal performance under very general conditions (unknown, non i.i.d. statistics). This is a novel analytical result of independent value.
Implementation & Evaluation. The solution is evaluated in a wireless testbed using an ML application, several classifiers and datasets. We find that our algorithm increases the accuracy (up to ) and reduces the energy (down to ) compared to carefully selected benchmark policies.
Organization. Sec. II introduces the model and the problem. Sec. III presents the algorithm and Sec. IV the system implementation, experiments and trace-driven simulations. We discuss related work in Sec. V and conclude in Sec. VI. Although the paper is self-contained, the interested reader will find more results from the implementation of our system, as well as a more detailed version of the proof of our main analytical contribution, in [10].
II Model and Problem Formulation
Classifiers. There is a set of disjoint object classes and a set of edge devices. We assume a time-slotted operation where each device receives at slot a group of objects (or tasks) to be classified, e.g., frames captured by its camera. We define as the set of objects that can arrive at , and . Each device is equipped with a local classifier , which outputs the inferred class of an object and a normalized confidence value for that inference. (Footnote 1: The classifier might output only the class with the highest confidence, or a vector with the confidence for each class; our analysis holds for both cases.) The cloudlet has a classifier that can classify any object, and offers higher accuracy than all devices, i.e., .
Let denote the accuracy improvement when the cloudlet classifier is used:
Every device is also equipped with a predictor that is trained with the outcomes of the local and cloudlet classifiers. (Footnote 2: This can be a model-based or model-free solution, e.g., a regressor or a neural network; our analysis and framework work for any of these solutions. In the implementation we used a mixed-effects regressor, see [11].) This predictor can estimate the accuracy improvement offered by the cloudlet for each object:
and, in general, this assessment might be inexact, , and is the respective confidence value.
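As an illustration of the quantities involved, the sketch below extracts an inferred class and confidence from a per-class probability vector and computes an accuracy improvement. The form of the improvement (cloudlet confidence minus local confidence) is an assumption made for this example, and the names `infer` and `accuracy_improvement` are hypothetical, not the paper's notation.

```python
def infer(probabilities):
    """Return (inferred_class, confidence) from a per-class probability vector."""
    best = max(range(len(probabilities)), key=lambda c: probabilities[c])
    return best, probabilities[best]


def accuracy_improvement(local_conf, cloudlet_conf):
    """ASSUMED form of the improvement: cloudlet minus local confidence."""
    return cloudlet_conf - local_conf


# Hypothetical outputs of a local and a cloudlet classifier for one object.
local_class, local_conf = infer([0.1, 0.7, 0.2])
cloud_class, cloud_conf = infer([0.02, 0.95, 0.03])
gain = accuracy_improvement(local_conf, cloud_conf)
```

In a deployment the cloudlet confidence is not available before offloading, which is why each device relies on a trained predictor to estimate this gain from the local result alone.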
Wireless System. The devices access the cloudlet through high capacity cellular or Wi-Fi links. Each device has an average power budget of Watts. Power is a key limitation here because the devices might have a small energy budget due to protocol-induced transmission constraints, or due to user aversion for energy spending. The cloudlet has an average processing capacity of cycles/sec which is shared by the devices, and when the total load exceeds , the task delay increases and eventually renders the system non-responsive.
We consider the realistic scenario where the parameters of devices and the cloudlet change over time in an unknown fashion. Namely, they are created by random processes and , and our decision framework has access only to their instantaneous values in each slot. Unlike previous optimization frameworks [12] that assume i.i.d. or Markov-modulated processes, here we only ask that these perturbations are bounded in each slot, i.e. and their averages converge to some finite values which we do not need to know, i.e., , and similarly for . We also define .
When an object (say, image) is transmitted in slot from device to the cloudlet, it consumes part of the device's power budget . (Footnote 3: Power budgets are also affected by the local classifier computations, which are made for every object and thus do not affect the offloading decisions.) We assume that this cost, denoted , follows a random process that is uniformly upper-bounded and has well-defined mean values. (Footnote 4: This cost can reflect, e.g., the impact of time-varying channel conditions.) Also, each transmitted object requires a number of processing cycles in the cloudlet which might also vary with time, e.g., due to the different type of the objects, and we assume it follows the random process , with . We define , and . Our model is very general as the (i) requests, (ii) power and computing cost per request, and (iii) resource availability can be arbitrarily time-varying, with unknown statistics.
Problem Formulation. The IoT devices wish to involve the cloudlet only when they confidently expect high classification precision gains. Otherwise, they will consume the cloudlet’s capacity and their own power without significant performance benefits. Therefore, we make the outsourcing decision for each object based on the weighted improvement gain:
where is a risk aversion parameter set by the system designer or each user. For example, assuming normal distribution for , we could set and use a threshold rule of standard deviation. We hereafter use these modified parameters , and partition the interval of their values ( being the maximum) into subintervals such that ; with being the center point of . This quantization facilitates the implementation of our algorithm in a real system, and is without loss of generality since we can use very short intervals. Finally, let denote the number of objects with expected gain that device has created in slot . These arrivals are generated by an unknown process , with .
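The risk-adjusted gain and its quantization can be sketched as follows. The specific form used here (predicted gain minus the risk-aversion parameter times the prediction's standard deviation) and the clipping of values to the partitioned interval are assumptions for illustration; `weighted_gain`, `quantize`, `rho`, and `num_bins` are hypothetical names.

```python
def weighted_gain(predicted_gain, std, rho):
    """ASSUMED weighting: penalize the predicted gain by rho standard deviations."""
    return predicted_gain - rho * std


def quantize(w, w_max, num_bins):
    """Map a gain w to (index, center point) of its subinterval of [0, w_max].

    Values outside the interval are clipped to it (an assumption of this sketch).
    """
    w = min(max(w, 0.0), w_max)
    width = w_max / num_bins
    j = min(int(w / width), num_bins - 1)  # the maximum falls in the last subinterval
    return j, (j + 0.5) * width


# Hypothetical example: a gain of 0.26 with four equal subintervals of [0, 1].
index, center = quantize(0.26, 1.0, 4)
```

Shorter subintervals reduce the quantization error at the cost of more decision variables per device.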
Our aim is to maximize the aggregate long-term analytics performance gains, for all objects and IoT devices. This can be formulated as a mathematical program. We define variables which indicate the long term ratio of objects with expected gain of that are sent to the cloudlet (with , when all objects of in are sent), and formulate the convex problem:
where . Eq. (4b) constrains the average power budget of each device and (4c) bounds the cloudlet utilization. Clearly, based on the specifics of each system we can add more constraints, e.g., for the average wireless link capacity in case bandwidth is also a bottleneck resource. Such extensions are straightforward as they do not change the properties of the problem, nor affect our analysis below.
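To make the structure of the program concrete, the sketch below evaluates the objective and the two resource constraints for a candidate policy when the average parameters are treated as known. The linear form of the objective and of the power and compute loads is assumed here, and all names (`arrivals`, `tx_cost`, `cycles`, `budget`, `capacity`) are illustrative rather than the paper's notation.

```python
def evaluate_policy(y, centers, arrivals, tx_cost, cycles, budget, capacity):
    """Evaluate a candidate offloading policy under ASSUMED average parameters.

    y[n][j]        : fraction of device n's objects in gain interval j that are offloaded
    centers[j]     : center point (expected gain) of interval j
    arrivals[n][j] : average number of such objects per slot
    tx_cost[n]     : average power cost per transmitted object of device n
    cycles[n]      : average cloudlet cycles per object of device n
    """
    gain, load, power = 0.0, 0.0, []
    for n, row in enumerate(y):
        sent = [arrivals[n][j] * row[j] for j in range(len(row))]
        gain += sum(centers[j] * s for j, s in enumerate(sent))
        power.append(tx_cost[n] * sum(sent))   # per-device average power use
        load += cycles[n] * sum(sent)          # aggregate cloudlet load
    feasible = (
        all(0.0 <= v <= 1.0 for row in y for v in row)
        and all(p <= b for p, b in zip(power, budget))
        and load <= capacity
    )
    return gain, power, load, feasible


# Hypothetical single-device instance: offload only the high-gain interval.
result = evaluate_policy(
    y=[[0.0, 1.0]], centers=[0.25, 0.75], arrivals=[[2.0, 1.0]],
    tx_cost=[0.5], cycles=[3.0], budget=[1.0], capacity=5.0,
)
```

Because both the objective and the constraints are linear in the policy under these assumptions, the static problem with known averages would be a linear program; the difficulty addressed in the paper is that these averages are not known.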
The solution of is a policy that maximizes the aggregate (hence also average) analytics performance in the system. Such policies can be randomized, with denoting the probability of sending each object of in interval to the cloudlet (at each slot). However, in reality, the system parameters not only change with time, but are generated by processes that might not be i.i.d. and have unknown statistics (mean values, etc.). This means that in practice we cannot find . In the next section we present an online policy that is oblivious to the statistics of but indeed achieves the same performance as .
III Online Offloading Algorithm
Our solution approach is simple and, we believe, elegant. We replace the unknown parameters , , , and , in with their running averages (which we calculate as the system operates), solve the modified problem with gradient ascent in the dual space, and perform primal averaging. This gives us an online policy that applies in real time the solution while using only information made available by slot .
III-A Problem Decomposition & Algorithm Design
Let us first define the running-average function:
is the running average of process , and similarly we define , , and . Note that can be calculated at each slot, while and are unknown. We can now define a new problem:
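A running average of an observed process can be maintained incrementally, without storing past samples. The following is a minimal sketch; the class name `RunningAverage` is hypothetical.

```python
class RunningAverage:
    """Incrementally tracks the mean of all values observed so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold in the value observed in the current slot and return the new mean."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# Hypothetical per-slot observations of, e.g., a device's available power.
avg = RunningAverage()
for observed in (2.0, 4.0, 6.0):
    avg.update(observed)
```

Each device would keep one such tracker per locally observed parameter, and the cloudlet one for its capacity and per-task computing cost.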
We will use the instances to perform a dual ascent method and obtain a sequence of decisions that will be applied in real time and achieve performance that converges asymptotically to the (unknown) solution of .
We first dualize and introduce the Lagrangian (Footnote 5: For our system implementation, this relaxation means we install queues for the data transmission at the devices and for image processing at the cloudlet):
where are the non-negative dual variables for . The dual function is:
and the dual problem amounts to maximizing .
We apply a dual ascent algorithm where the iterations are in sync with the system’s time slots . Observe that does not depend on or , it is separable with respect to the primal variables, and independent of . Hence, in each iteration we can minimize by:
This yields the following easy-to-implement threshold rule:
which is a deterministic decision that offloads (or not) all requests of each device (at each ). Then we improve the current value of by updating the dual variables:
where is the update step size, and return to (7).
The detailed steps that implement our online policy are as follows (with reference to OnAlgo, Algorithm 1). Each device receives a group of objects in slot and uses its classifier to predict their classes, and the predictor to estimate the expected offloading gains (Steps 4-6). The devices update their statistics (Step 7) and compare the expected benefits with the outsourcing costs (Step 10). Finally, they update their local dual variable for the power constraint violation (Step 12). The cloudlet classifies the received objects (Step 16) and updates its parameter estimates (Step 17) and its congestion (Step 18), which is sent to the devices.
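The per-slot logic described above can be sketched as a pair of small classes. This is an illustrative reading of the threshold rule and dual updates, not a reproduction of Algorithm 1: the exact form of the threshold (gain compared against power price times transmission cost plus congestion price times computing cost) and of the projected dual updates is assumed, and `Device`, `Cloudlet`, and all parameter names are hypothetical.

```python
class Device:
    def __init__(self, step):
        self.step = step
        self.lam = 0.0  # dual variable (price) of the device's power constraint

    def decide(self, gains, tx_cost, cycles, mu):
        """Offload the objects whose expected gain exceeds the priced resource cost."""
        price = self.lam * tx_cost + mu * cycles
        return [w > price for w in gains]

    def update(self, power_used, avg_budget):
        """Projected dual step: raise the price when more power was used than budgeted."""
        self.lam = max(0.0, self.lam + self.step * (power_used - avg_budget))


class Cloudlet:
    def __init__(self, step):
        self.step = step
        self.mu = 0.0  # dual variable (congestion price) of the capacity constraint

    def update(self, load, avg_capacity):
        self.mu = max(0.0, self.mu + self.step * (load - avg_capacity))


# Two slots of a hypothetical single-device run.
dev, cloudlet = Device(step=0.1), Cloudlet(step=0.1)
gains = [0.05, 0.4]
first = dev.decide(gains, tx_cost=1.0, cycles=1.0, mu=cloudlet.mu)
dev.update(power_used=2.0, avg_budget=1.0)      # running-average budget assumed given
cloudlet.update(load=2.0, avg_capacity=1.5)     # running-average capacity assumed given
second = dev.decide(gains, tx_cost=1.0, cycles=1.0, mu=cloudlet.mu)
```

In this run both objects are offloaded while the prices are zero; after the first slot overshoots the budget and capacity, the prices rise and only the high-gain object is sent.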
III-B Performance Analysis
The gist of our approach is that, as time evolves, the sequence of problems approaches our initial problem . This is true under the following mild assumption.
Assumption 1. The perturbations of the system parameters are independent of each other, uniformly bounded, and their averages converge, e.g., .
Under this assumption it is easy to see that it holds:
Furthermore, note that due to boundedness of the parameters and we have that:
and using Minkowski’s inequality, we get the bound:
It is also easy to see that . The following Theorem is our main analytical result.
Theorem 1. Under Assumption 1, OnAlgo ensures the following optimality and feasibility gaps:
Proof. We drop bold typeface notation here, and use subscript to denote the -th slot. We first bound the distance of from vector , i.e.,
(i) Optimality Gap. From the dual problem we can write:
Dropping the non-negative term , dividing by , setting , and rearranging terms, yields:
Using the fact that , and combining the above with (14), we obtain:
All sums have diminishing terms and are divided by , hence they converge to . This yields the first part of the theorem.
(ii) Constraint Violation. If we apply recursively the dual variable update rule, we obtain:
Setting , dividing by , and using Jensen’s inequality for , we get:
The second term of the LHS converges to zero as . Our claim holds if the same is true for the RHS. Indeed, this is the case assuming the existence of a Slater vector, and the boundedness of the set of dual variables (see [22, 10]). ∎
The theorem shows that OnAlgo asymptotically achieves zero feasibility gap (no constraint violation), and a fixed optimality gap that can be made arbitrarily small by tuning the step size.
IV Implementation and Evaluation
IV-A Experimentation Setup and Initial Measurements
IV-A1 Testbed and Measurements
We used 4 Raspberry Pis (RPs) as end-nodes, placed at different distances from a laptop (cloudlet). We used a Monsoon monitor for the energy measurements, and Python libraries and TensorFlow for the classifiers. (Footnote 6: We used vanilla versions of the classifiers to facilitate observation of the results. The memory footprint of NNs can be made smaller [2], but this might affect their performance. Our analysis is orthogonal to such interventions.)
We first measured the average power consumption when RPs transmit data to the cloudlet with different rates, and then fitted a linear regression model that estimates the consumed power as a function of the transmission rate. This model is used by OnAlgo to estimate the energy cost for each transmitted image, given the data rate in each slot (which might differ for the RPs). Also, we measured the average computing costs ( cycles/task) of the classification tasks, to be used in simulations. For more details on the setup, see [10].
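A linear power model of this kind can be fitted with ordinary least squares. The sketch below uses made-up measurement values purely for illustration; the function names and the conversion from power to per-image energy (power multiplied by transmission time) are assumptions, not details taken from the testbed.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def energy_per_image(rate_mbps, image_mbit, slope, intercept):
    """ASSUMED cost model: fitted power (W) times transmission time (s)."""
    power = slope * rate_mbps + intercept
    return power * image_mbit / rate_mbps


# Hypothetical measurements: data rate (Mbps) versus average power (W).
rates = [1.0, 2.0, 3.0, 4.0]
powers = [1.5, 2.0, 2.5, 3.0]
slope, intercept = fit_line(rates, powers)
cost = energy_per_image(2.0, 4.0, slope, intercept)
```

With such a model, a device only needs the current data rate and the image size to estimate the energy it would spend by offloading.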
IV-A2 Data Sets and Classifiers
We use two well-known datasets: (i) MNIST [16], which consists of 28×28 pixel handwritten digits and includes 60K training and 10K test examples; (ii) CIFAR-10 [15], with 50K training and 10K test examples of 32×32 color images of 10 classes. We used two classifiers, the normalized-distance weighted k-nearest neighbors (KNN) [7], and the more sophisticated Convolutional Neural Network (CNN) implemented with TensorFlow [1]. They output a vector with the probabilities that the object belongs to each class. These classifiers have different performance and resource needs, and hence allow us to build diverse experiments. The predictors are trained with labeled images and the outputs of the local () and cloudlet () classifiers. These are the independent variables in our regression model that estimates (dependent variables). Recall that the latter are calculated using (1), where we additionally use that if device has given a wrong classification and if the cloudlet is mistaken.
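For reference, a distance-weighted KNN classifier that outputs a per-class probability vector can be sketched as below. The weighting follows the classical distance-weighted rule (the nearest neighbor gets weight 1, the k-th gets weight 0, with linear interpolation in between); normalizing the votes into probabilities and the function name `knn_probabilities` are choices made for this sketch.

```python
import math


def knn_probabilities(train, query, k, num_classes):
    """Distance-weighted k-NN returning a per-class probability vector.

    train is a list of (feature_tuple, label) pairs with integer labels.
    """
    neighbors = sorted((math.dist(x, query), label) for x, label in train)[:k]
    d_near, d_far = neighbors[0][0], neighbors[-1][0]
    votes = [0.0] * num_classes
    for dist, label in neighbors:
        # Nearest neighbor weighs 1, farthest of the k weighs 0; equal weights on ties.
        weight = 1.0 if d_far == d_near else (d_far - dist) / (d_far - d_near)
        votes[label] += weight
    total = sum(votes)
    return [v / total for v in votes]


# Hypothetical 2-D training set with two classes.
train = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((5.0, 5.0), 1)]
probs = knn_probabilities(train, (0.2, 0.0), k=3, num_classes=2)
```

The largest entry of the returned vector plays the role of the classifier's confidence, and its index is the inferred class. Note that this classifier must keep all training samples in memory, which matters for small devices.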
We compare OnAlgo with two algorithms: the Accuracy-Threshold Offloading (ATO) algorithm, where a task is offloaded when the confidence of the local classifier is below a threshold, without considering the resource consumption; and the Resource-Consumption Offloading (RCO) algorithm, where a task is offloaded when there is enough energy, without considering the expected classification improvement.
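The two benchmark rules can be written as one-line predicates. The sketch below is a plausible reading of the descriptions above; in particular, testing energy availability by comparing the running-average consumption plus the next transmission cost against the budget is an assumption, and the function and parameter names are hypothetical.

```python
def ato_offload(local_confidence, threshold):
    """ATO: offload when the local classifier is not confident enough."""
    return local_confidence < threshold


def rco_offload(avg_power_used, tx_cost, power_budget):
    """RCO: offload whenever the (assumed) energy check passes, ignoring the gain."""
    return avg_power_used + tx_cost <= power_budget


# Hypothetical decisions for a single task.
ato_decision = ato_offload(local_confidence=0.6, threshold=0.8)
rco_decision = rco_offload(avg_power_used=0.5, tx_cost=0.3, power_budget=1.0)
```

Neither rule looks at both sides of the trade-off: ATO ignores resources and RCO ignores the expected improvement.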
IV-A4 Limitations of Mobile Devices
We used our testbed to verify that these small resource-footprint devices require the assistance of a cloudlet. Our findings are in line with previous studies, e.g., [28]. The performance of a CNN model increases with the number of layers. We find that, even with layers, a CNN trained for CIFAR has GB size and hence cannot be stored in the RPs (see Fig. 2a). Similar conclusions hold for the KNN classifier, which needs to locally store all training samples. Clearly, despite the successful efforts to reduce the size of ML models, e.g., by using compression [2], the increasingly complex analytics and the small form factor of devices will continue to raise the local versus cloudlet execution trade-off.
IV-A5 Classifier Assessment
In Fig. 2b we see that the accuracy (ratio of successful over total predictions) of the KNN classifier improves with the size of labeled data. Figure 2c presents the accuracy gains for CNN as more hidden layers are added. The gains are higher (up to 20%) for the digits that are difficult to recognize, e.g., and . Fig. 2d shows the CNN performance on CIFAR, which is lower as this is a more complex dataset (colored images, etc.). Overall, we see that the classifier performance depends on the algorithm (KNN, CNN), the settings (datasets, layers), and the objects.
IV-B Performance Evaluation
IV-B1 Resource Availability Impact
Fig. 3 shows the average accuracy and fraction of requests offloaded to the cloudlet with OnAlgo when we vary the devices' power budget. As increases there are more opportunities to use the cloudlet (4-layer CNN) and obtain more accurate classifications than the local classifier (1-layer CNN). Furthermore, Figs. 2c and 2d show that MNIST is easier to classify and the gains of using a better classifier are smaller than with CIFAR. Hence, as increases in Fig. 3 the ratio of offloaded tasks increases at a faster pace with CIFAR than with MNIST.
IV-B2 Comparison with Benchmarks
We compare OnAlgo to ATO and RCO. No-offloading (NO) serves as a baseline for these algorithms in Fig. 4. To ensure a realistic comparison, we set the rule for all algorithms that the cloudlet will not serve any task if the computing capacity constraint is violated. For RCO, the availability of energy is determined by computing the running average consumption at each device during the experiment. We employ two testbed scenarios, and a simulation with larger number of devices.
Scenario 1: Low accuracy improvement; high resources. We set and , allowing the devices to offload many tasks and the cloudlet to serve most of them, and used MNIST (which has small improvement). (Footnote 7: We have explicitly set a small power budget so as to highlight the impact of power constraints on the system performance; higher power budgets will still be a bottleneck for higher task request rates or images of larger size.) We show the average accuracy and power consumption in Fig. 4a, where we see that OnAlgo outperforms both ATO and RCO by . Regarding power consumption, ATO achieves the best result since it gets high enough confidence on its local classifier (and rarely offloads). RCO, however, offloads almost every task as it has enough resources and does not refrain even when the improvement is low. The reason it achieves lower accuracy than OnAlgo is that it does not offload intelligently, and gets denied when the computing constraint is violated.
Scenario 2: High accuracy improvement; low resources. We set and , not allowing many offloadings and cloudlet classifications. We used the CIFAR dataset, which has a large performance difference between local and cloudlet classifiers. We see from Fig. 4b that OnAlgo achieves 28%-32% higher accuracy than both competing algorithms. RCO is constrained to very few offloadings due to the limited power budget, while ATO is resource-oblivious and offloads tasks regardless of the cloudlet's capacity. This results in many denied offloadings that reduce ATO's accuracy and unnecessarily increase the power consumption. OnAlgo consumes 60% less power than ATO, since the latter frequently offloads its low-confidence tasks.
Scenario 3: Large number of users. Finally, we simulated the algorithms for a large number of users while using the experimentally measured parameters. We observe in Fig. 5a that the accuracy gradually drops (for all algorithms) since now a smaller percentage of the tasks can be served by the cloudlet. OnAlgo constantly outperforms both ATO and RCO by about since it adapts to the available resources. This is more evident in Fig. 5b that shows the fast-increasing energy cost of the two benchmark algorithms, as they either offload tasks that do not improve the performance, or offload tasks while the cloudlet is already congested (these tasks are dropped and energy is wasted). Power consumption of OnAlgo is up to less than that of RCO.
IV-B3 Convergence of OnAlgo
Fig. 6 presents the convergence of OnAlgo for different step sizes . Based on the system parameters the bound given by Theorem 1 is approximately 0.01, 0.2 and 1 for the three values of Fig. 6. These are satisfied by the solution of OnAlgo in less than 300 iterations as observed in Fig. 6a. The convergence is faster for larger , which however is achieved at the cost of smaller convergence accuracy. The constraint violation bound is also respected as shown in Fig. 6b with the constraints being violated more often for small in the beginning, but improving as increases.
V Related Work
Edge & Distributed Computing. Most solutions partition compute-intensive mobile applications and offload them to the cloud, a solution that is unfit to enable low-latency applications. Cloudlets, on the other hand, achieve lower delay but have limited serving capacity; hence there is a need for the kind of intelligent offloading strategy that we propose here. Previous works consider simple performance criteria, such as reducing computation loads or power consumption, and focus on the architecture design. Also, MobiStreams [29] and Swing [8] focus on collaborative data stream computations. The above systems either do not optimize the offloading policy, or use heuristics that do not cater for task accuracy.
Mobile and IoT Analytics. The importance of analytics has motivated the design of wireless systems that can execute such tasks. For instance, [24, 23] tailor deep neural networks for execution in mobile devices, while  and  minimize the execution time for known system parameters and task loads. Finally, [17, 26, 13] leverage the edge architecture to effectively execute analytics for IoT devices. The plethora of such system proposals, underlines the necessity for our online decision framework that provides optimal execution of analytics.
Optimization of Analytics. Prior works in computation offloading focus on different metrics such as the number of served requests [3, 21], and hence are not applicable here. In our previous work [9], we proposed a static collaborative optimization framework, which neither employs predictions nor accounts for computation constraints. Other works, e.g.  either rely on heuristics or assume static systems and known requests. Clearly, these assumptions are invalid for many practical cases where system parameters not only vary with time, but often do not follow i.i.d. processes. This renders the application of max-weight type of policies [12] inefficient. Our approach is fundamentally different: it leads to a robust online algorithm and is inspired by dual averaging and primal recovery algorithms for static problems, see [22].
Improvement of ML Models. Clearly, despite the efforts to improve the execution of analytics at small devices, e.g., by residual learning or compression [2], the trade-off between local low-accuracy and cloudlet high-accuracy execution is still important due to the increasing number and complexity of these tasks. This observation has spurred efforts for designing fast multi-tier (cloud to edge) deep neural networks [28] and for dynamic model selection [19], among others. These works are orthogonal to our approach and can be directly incorporated in our framework.
We propose the idea of improving the execution of data analytics at IoT devices with more robust instances running at cloudlets. The key feature of our proposal is a dynamic and distributed algorithm that makes the outsourcing decisions based on the expected performance improvement and the available resources at the devices and the cloudlet. The proposed algorithm achieves near-optimal performance in a deterministic fashion, and under minimal assumptions about the system behavior. This makes it ideal for the problem at hand, where the stochastic effects (e.g., expected accuracy gains) have unknown mean values and possibly non-i.i.d. behavior.
This publication has emanated from research supported in part by SFI research grants 17/CDA/4760, 16/IA/4610 and is co-funded under the European Regional Development Fund under Grant Number 13/RC/2077.
[1] (2016) TensorFlow: a system for large-scale machine learning. In Proc. of USENIX OSDI.
[2] (2017) Compression of deep neural networks for image instance retrieval. In Proc. of DCC.
[3] (2016) Efficient multi-user computation offloading for mobile-edge cloud computing. IEEE/ACM Trans. on Networking 24 (5), pp. 2795–2808.
[4] (2011) CloneCloud: elastic execution between mobile device and cloud. In Proc. of EuroSys.
[5] (2018) Cisco global cloud index: forecast and methodology, document id:1513879861264127.
[6] (2010) Misco: a mapreduce framework for mobile systems. In Proc. of PETRA.
[7] (1976) The distance-weighted k-nearest-neighbor rule. IEEE Trans. on Sys., Man, and Cybern. 6 (4), pp. 325–327.
[8] (2018) Swing: swarm computing for mobile sensing. In Proc. of IEEE ICDCS.
[9] (2018) Optimizing data analytics in energy constrained iot networks. In Proc. of WiOpt.
[10] (2019) Improving iot analytics through selective edge execution: appendix. Note: https://1drv.ms/b/s!AoI5lEO8XUP1iQIjf1w0YeaUCa83?e=9IW474
[11] (2007) Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
[12] (2006) Resource allocation and cross-layer control in wireless networks. Found. Trends Netw. 1 (1), pp. 1–144.
[13] (2018) Multitier fog computing with large-scale iot data analytics for smart cities. IEEE Internet of Things Journal 5 (2), pp. 677–686.
[14] (2017) Machine learning paradigms for next-generation wireless networks. IEEE Wireless Comm. 24 (2), pp. 98–105.
[15] (2009) Learning multiple layers of features from tiny images. Technical report.
[16] (1998) Gradient-based learning applied to document recognition. Proc. of the IEEE 86 (11), pp. 2278–2324.
[17] (2019) Data analytics for fog computing by distributed online learning with asynchronous update. In Proc. of IEEE ICC.
[18] (2017) MobiQoR: pushing the envelope of mobile edge computing via quality-of-result optimization. In Proc. of IEEE ICDCS.
[19] (2017) Dynamic deep neural networks: optimizing accuracy-efficiency trade-offs by selective execution. arXiv:1701.00299.
[20] (2018) Selective offloading in mobile edge computing for the green internet of things. IEEE Network 32 (1), pp. 54–60.
[21] (2017) A survey on mobile edge computing: the communication perspective. IEEE Comm. Surv. Tut. 19 (4), pp. 2322–2358.
[22] (2009) Approximate primal solutions and rate analysis for dual subgradient methods. SIAM J. on Optimization 19 (4), pp. 1757–1780.
[23] (2018) DeepDecision: a mobile deep learning framework for edge video analytics. In Proc. of IEEE INFOCOM.
[24] (2017) Delivering deep learning to mobile devices via offloading. In Proc. of VR/AR Network Workshop.
[25] (2009) The case for vm-based cloudlets in mobile computing. IEEE Pervasive Computing 8 (4), pp. 14–23.
[26] (2017) Live data analytics with collaborative edge and cloud processing in wireless iot networks. IEEE Access 5, pp. 4621–4635.
[27] (2018) Analytics for the internet of things: a survey. ACM Comput. Surv. 51 (4), pp. 74:1–74:36.
[28] (2017) Distributed deep neural networks over the cloud, the edge and end devices. In Proc. of IEEE ICDCS.
[29] (2014) MobiStreams: a reliable distributed stream processing system for mobile devices. In Proc. of IEEE IPDPS.
[30] (2019) Hetero-edge: orchestration of real-time vision applications on heterogeneous edge clouds. In Proc. of IEEE INFOCOM.