I. Introduction
The world is rapidly moving towards smart cities, smart grids and the internet of everything (IoE). Consequently, the number of host devices served by edge networks has exploded, and there has been an exponential increase in the amount of data that needs to be processed. One example of such processing is machine learning (ML), which is used in all types of applications such as recognition and object segmentation in images. Transmitting these datasets to the cloud for centralized processing is prohibitive and also a burden on backbone networks [1]. The expectation is that the time-critical nature of such data will force us to do 90% of analytics on the edge servers and the nodes themselves (mobile phones, traffic cameras, UAVs and autonomous vehicles) [2]. This paradigm of edge processing has been supported by the latest works in the literature on Mobile Edge Computing (MEC) and Hierarchical MEC (H-MEC) [3, 4, 5, 6]. By taking into consideration the heterogeneous computing and communication resources of the edge nodes and links, offloading decisions are optimized to minimize certain metrics such as delay or energy consumption. While most MEC research focuses on generic computational tasks, a lot of attention has recently been given to offloading machine learning tasks in a distributed manner [7, 8, 9, 6].
Distributed learning (DL) is attracting a lot of attention in the ML community because of two practical scenarios: task parallelization and distributed datasets. In task parallelization, a central node (which may or may not be the edge server), also known as the orchestrator, distributes the learning tasks among other nodes to be performed locally on a subset of the complete data. The process cycles between randomly allocating batches for local learning, collecting back the updated ML models, and redistributing the data/model after aggregation. This is done for multiple reasons, such as faster processing or lower energy consumption. In contrast, in the distributed-datasets case, the local devices collect their own data and perform learning on the parts of it allocated by the orchestrator. The reason for keeping data local may be privacy or limited communication resources in the edge/host network.
Performing ML at the edge or in IoE environments is the manifestation of both scenarios described above. A lot of attention has been given to distributed learning as it relates to multi-core processing machines and graphical processing units. These are capable processors connected using wired protocols, for which energy consumption is not an issue. One example of such schemes is downpour Stochastic Gradient Descent (SGD) [7]. Typically, in such schemes, the orchestrator waits for all learners to complete their tasks in each epoch/iteration of the ML algorithm. The idea behind this approach is to maximize accuracy by minimizing the discrepancy or "staleness" among the gradients by having all learners do the same number of epochs per cycle. Recently, some work has been carried out on allowing some staleness so that powerful devices with good communication links may perform more updates [10]. The combined effect may be an increase in learning accuracy. These techniques were not applied to MEL until relatively recently [11]. The works of [8, 11, 12] aimed to optimize the local number of epochs per node with respect to the total number of global iterations in generic resource-constrained edge environments. However, these works did not investigate optimizing the local updates while taking into account the heterogeneous nature of communication and computation in MECs. Recently, the work of [9] investigated the impact of optimizing the allocation of learning tasks as well as the number of local updates on the learning accuracy in MEC/H-MEC environments. The results show significant gains in achieving a certain level of accuracy with respect to the global cycle time. However, there may still be room for improvement, as certain devices may be idle for long times and could perform a higher number of updates, which may raise the overall accuracy.
To the best of the authors' knowledge, this work is the first attempt to design a staleness-aware algorithm for asynchronous MEL. Here, we must clarify that by asynchronous, we mean asynchronous in the number of updates each device is allowed to perform, rather than in the execution time of one set of local updates. This paper considers optimizing the task allocation and the number of local updates per learner in order to minimize the staleness among gradients so that a high accuracy can be achieved. The formulated optimization problem is shown to be an integer linear program with quadratic constraints (ILPQC), which is relaxed to a non-convex quadratically-constrained linear program. Analytical approximate solutions are derived based on the KKT conditions and Lagrangian analysis, followed by a suggest-and-improve (SAI) approach, which is compared against solutions from available numerical solvers. The merits of the proposed solution are compared against the ETA-based asynchronous approach in [10] and the synchronous approach in [9].

II. System Model for Asynchronous MEL
The general model for distributed ML and the MEL model for the synchronous case have both been discussed and well defined in [9, 6]. For completeness, we review some of the important concepts and parameters here. Distributed learning involves running a single machine learning (ML) task over a system of K learners. Typically, ML methods are based on gradient descent (GD), and newer ML models such as deep neural networks mostly employ stochastic GD (SGD) due to its superior accuracy.
Consider a set of K learners in which learner k trains its local learning model on a batch of B_k data samples by performing τ_k learning epochs (updates/iterations). The total size of all batches is denoted by B. Fig. 1 illustrates the described MEL system. The objective is to minimize the local loss functions in order to minimize the global loss such that accuracy is maximized [8].

In an asynchronous environment, each learner will perform τ_k epochs and forward its updated set of parameters to the orchestrator. The orchestrator will aggregate the model parameters to form a globally optimized set and send the updated model back to each learner in the next cycle. Based on the channel conditions and the compute capability of each individual device, it will also offload B_k samples (task parallelization) or assign a value for the subset size B_k (distributed datasets) to each node k. In both scenarios, it will also assign the number of updates τ_k to be performed at each node. The learners will apply the ML algorithm to their assigned datasets and the process continues.
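The cycle described above can be sketched in a few lines; all names below are hypothetical, a scalar stands in for the parameter matrix, and the batch-size-weighted aggregation is one common choice rather than the paper's exact aggregation rule.

```python
# One asynchronous MEL global cycle, sketched with hypothetical names; a
# scalar stands in for the parameter matrix, and aggregation is weighted
# by batch size (a common choice, not necessarily the paper's exact rule).

def local_update(model, batch_size, epochs, lr=0.01):
    """Placeholder for tau_k local SGD epochs on a batch of B_k samples."""
    # A real learner would run SGD on its local data; here we just nudge
    # the scalar "model" proportionally to the work performed.
    return model - lr * epochs * batch_size * 1e-3

def global_cycle(model, batch_sizes, epochs_per_learner):
    updates = [local_update(model, b, tau)
               for b, tau in zip(batch_sizes, epochs_per_learner)]
    total = sum(batch_sizes)
    # Aggregate the returned models proportionally to the data each saw.
    return sum(w * b for w, b in zip(updates, batch_sizes)) / total

model = 1.0
model = global_cycle(model, batch_sizes=[3000, 2000, 1000],
                     epochs_per_learner=[4, 5, 7])   # B_k and tau_k per node
```

Note that the learners are allowed different epoch counts (4, 5 and 7 here), which is precisely the asynchrony in update counts that the rest of the paper controls.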
The time taken for offloading the optimal model and the partial dataset to each node, for each learner to perform the ML task and send back the locally updated model, and for the orchestrator to perform global aggregation is defined as t_k. In synchronous MEL, this time is bounded by T, known as the global cycle clock, and usually excludes the global aggregation process because it requires far less time than transmission and ML execution. The orchestrator continues to repeat these global cycles for a certain number of global updates (for example, until a certain level of accuracy has been attained). In the current model, we keep this time bound T but make the model asynchronous in terms of the number of updates each learner is allowed to perform.
We assume channel reciprocity, where the set of optimal weights is transmitted back to learner k on the same channel, with power P_k and gain h_k, as the one on which the orchestrator receives the locally computed parameters. This is justified because the time taken for aggregation and optimization is much smaller than the time for learning and transmission. The orchestrator will generate the global parameter matrix as described in [8] and send each learner the optimized number of updates τ_k it should perform, along with the number of data samples B_k to process and (in task parallelization) the randomly picked subset of size B_k.
Given the above description, the times of each learner k, whose sum t_k must be bounded by the global update clock T, can be detailed as follows. The time taken to transmit the global parameter set and the allocated batch to learner k is denoted by t_k^S and can be expressed as follows (note that the first term of the numerator will not exist for the distributed-datasets scenario):
t_k^S = (B_k F d_s + (B_k S_d + S_m) P_m) / R_k,   R_k = W log2(1 + P_k h_k^2 / (N_0 W))    (1)
The first term in the numerator of (1), B_k F d_s, represents the size of the transmitted data, where F is the number of features per data sample (e.g., image pixels) and d_s is the size of the datatype representing each feature (e.g., 32-bit float or 64-bit double). The second term represents the size of the optimal model sent by the orchestrator, where P_m is the precision with which model parameters are stored, and S_d and S_m represent the sizes of the model parts that scale with the dispatched dataset and with the ML model itself, respectively. The denominator R_k represents the achievable rate with respect to the channel parameters, where W is the available bandwidth and N_0 is the noise power spectral density.
The time needed by learner k to execute one update of the ML algorithm is given by (2), where C_m is the complexity of the learning technique in terms of the clock cycles required and f_k is the processing power of learner k in clock cycles per second. ML algorithms typically go over all features sequentially for one data sample at a time (i.e., per epoch), so the time for one update of one sample is multiplied by F and B_k. (In the case of batch learning at the local node, the complexity expression changes but the form of (2) remains the same.)
t_k^C = C_m F B_k / f_k    (2)
The final time, t_k^R, is the one needed for learner k to send its updated local parameter matrix back to the orchestrator. Using our assumption of channel reciprocity, t_k^R can be computed as:
t_k^R = (B_k S_d + S_m) P_m / R_k    (3)
Thus, the total time taken by learner k to complete the above three processes, with τ_k local updates, is equal to:

t_k = t_k^S + τ_k t_k^C + t_k^R    (4)
The total time can be rewritten as a quadratic expression of the optimization variables τ_k and B_k as shown in (5) (note that, for the distributed-datasets scenario, the only difference in the model is that the first term of the numerator in (1) will not exist). The quadratic, linear and constant coefficients are given by a_k = C_m F / f_k, b_k = (F d_s + 2 S_d P_m) / R_k, and c_k = 2 S_m P_m / R_k, respectively.

t_k = a_k τ_k B_k + b_k B_k + c_k    (5)
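As a concrete illustration, the sketch below evaluates the time model of (5) for one learner using the coefficient definitions assumed above; all numeric parameter values (bandwidth, noise level, complexity, powers) are illustrative, not taken from the paper's simulation setup.

```python
import math

# Sketch of the per-learner time model t_k = a_k*tau_k*B_k + b_k*B_k + c_k,
# using the assumed coefficient definitions. All numbers are illustrative.
W, N0 = 2e6, 1e-20           # bandwidth (Hz) and noise PSD (W/Hz), assumed
F, ds = 784, 32              # features per sample, bits per feature
Sd, Sm, Pm = 10, 280440, 32  # model-size terms and parameter precision
Cm = 1433                    # clock cycles per feature per update, assumed

def coefficients(Pk, hk2, fk):
    """Return (a_k, b_k, c_k) for a learner with transmit power Pk,
    channel gain hk2 and processor speed fk in cycles/s."""
    Rk = W * math.log2(1 + Pk * hk2 / (N0 * W))   # achievable rate (bps)
    a = Cm * F / fk                                # multiplies tau_k * B_k
    b = (F * ds + 2 * Sd * Pm) / Rk                # multiplies B_k
    c = 2 * Sm * Pm / Rk                           # constant term
    return a, b, c

def total_time(tau, B, Pk, hk2, fk):
    a, b, c = coefficients(Pk, hk2, fk)
    return a * tau * B + b * B + c
```

The bilinear term a_k·τ_k·B_k is what makes the later formulation quadratic in the decision variables.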
III. Problem Formulation
In the synchronous case, the number of updates τ for all learners and the dataset size B_k of each are optimized so that the global cycle clock T is not exceeded by any device. However, this means that some devices do not work for the full duration and wait for others to complete their tasks. In the asynchronous setting, we allow devices to work for the full duration, and each learner k performs τ_k updates. Hence, the system is asynchronous in the sense that each learner performs a different number of updates within the global cycle clock without waiting for slower learners to catch up. The reason for this tweak is that the objective is to minimize the maximum staleness of the gradients between any two learners, as shown below:
S_max = max_{l, m ∈ {1, …, K}} |τ_l − τ_m|    (6)
So, the staleness between any two learners is the difference between the number of epochs each has performed. It has been shown in the literature that the loss function of SGD-based ML is minimized (and thus the learning accuracy is maximized) by minimizing the staleness between the gradients in asynchronous SGD [10]. For synchronous MEL, accuracy is maximized by maximizing the number of updates in each global cycle [9]. In the staleness-aware model presented in [10], the aggregator waits until at least one learner completes a preset maximum number of updates.
In our case, this would translate into a max-constrained optimization, and our problem is already NP-hard. To ensure meaningful asynchronous operation, we instead bound each learner's dataset size such that B_min ≤ B_k ≤ B_max, ∀k. This prevents a scenario where a high-performing node with a good channel to the orchestrator receives a very small dataset just to minimize staleness at the expense of accuracy. This choice is also justified by the fact that having a very small dataset can lead to underfitting, which degrades accuracy. In future work, we will look into finding an efficient solution for the max-constrained problem.
Clearly, the relationship between t_k and the optimization variables τ_k and B_k is quadratic. Furthermore, the optimization variables τ_k and B_k are all non-negative integers. Consequently, the problem can be formulated as an integer linear program with quadratic constraints (ILPQC) as follows (note that, for the distributed-datasets scenario, the only difference in the formulation is the simpler expression of t_k; thus, the problem type and solution remain the same, with different expressions for the two scenarios):
min_{τ, B}  max_{l, m ∈ {1, …, K}} |τ_l − τ_m|    (7a)

s.t.  a_k τ_k B_k + b_k B_k + c_k ≤ T,  ∀k    (7b)

Σ_{k=1}^{K} B_k = B    (7c)

τ_k ≥ 0,  B_k ≥ 0,  ∀k    (7d)

τ_k, B_k ∈ Z,  ∀k    (7e)

B_min ≤ B_k ≤ B_max,  ∀k    (7f)
Constraint (7b) guarantees that t_k ≤ T, which means that all devices can work for the full allotted time even though they may perform different numbers of epochs. Constraint (7c) ensures that the sum of the batch sizes assigned to all learners is equal to the total dataset size B that the orchestrator needs to analyze. Constraints (7d) and (7e) are simply non-negativity and integer constraints on the optimization variables. Please note that solutions of (7) having any τ_k and/or B_k equal to zero represent conditions where MEL is not feasible for learner k. Constraint (7f) bounds the number of data points dispatched to each learner in order to ensure that each node performs learning on some part of the dataset and that no single node is burdened with too many data samples. The problem is therefore an ILPQC, which is well known to be NP-hard [13]. We will thus propose a simpler solution through relaxation of the integer constraints in the next section.
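To make the formulation concrete, the toy search below solves a two-learner instance by exhaustion: for each candidate batch split, each learner runs the largest number of updates that fits inside the cycle clock T ("working the full duration"), and the split with the smallest staleness is kept. The coefficient values and bounds are invented for illustration, and the exponential cost of such enumeration at larger K is exactly why the ILPQC is intractable at scale.

```python
import math

# Toy two-learner instance of the formulation: enumerate batch splits,
# let each learner run the most updates that fit in T, minimize the gap.
# All parameter values below are illustrative, not from the paper.
a, b, c = [4e-4, 8e-4], [1e-3, 2e-3], [0.05, 0.08]
T, Btot = 10.0, 60
Bmin, Bmax = 10, 50                      # assumed bounds standing in for (7f)

def max_updates(k, Bk):
    """Largest tau_k with a_k*tau_k*B_k + b_k*B_k + c_k <= T."""
    return math.floor((T - b[k] * Bk - c[k]) / (a[k] * Bk))

staleness, B1 = min(
    (abs(max_updates(0, B1) - max_updates(1, Btot - B1)), B1)
    for B1 in range(Bmin, Bmax + 1) if Bmin <= Btot - B1 <= Bmax
)
```

Here the faster learner (smaller a, b) is handed the larger share of the data, which is the qualitative behavior the optimized allocation aims for.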
IV. Proposed Solution

IV-A. Problem Transformation and Relaxation
Firstly, the problem is transformed using the min-max transformation by introducing a slack variable s. An additional constraint is added to ensure that the staleness between any two learners is at most s, so that minimizing s minimizes the maximum staleness. As described in the previous section, the problem of interest is NP-hard due to its integer decision variables. We simplify the problem by relaxing the integer constraints in (7d) and (7e), solving the relaxed problem, then flooring the obtained real results back into integers. The relaxed problem can therefore be written as follows:
min_{τ, B, s}  s    (8a)

s.t.  |τ_l − τ_m| ≤ s,  ∀ l ≠ m    (8b)

a_k τ_k B_k + b_k B_k + c_k ≤ T,  ∀k    (8c)

Σ_{k=1}^{K} B_k = B    (8d)

τ_k ≥ 0,  ∀k    (8e)

B_min ≤ B_k ≤ B_max,  ∀k    (8f)
Please note that the non-negativity constraint on B_k has been eliminated due to the lower bound in (8f), while the integer constraints in (7e) have been relaxed. The resulting program is a linear program with quadratic constraints. This problem can be solved using interior-point or ADMM methods, and there are efficient solvers (such as OPTI, fmincon and IPOPT) that implement these approaches.
From the analytical viewpoint, the matrix associated with each of the quadratic constraints in (8c) can be written in symmetric form. However, each of these matrices has only two non-zero entries, which are positive and equal. Its eigenvalues thus sum to zero, which means these matrices are not positive semi-definite, and hence the relaxed problem is non-convex. Consequently, we cannot derive the optimal solution of this problem analytically. Nevertheless, we can still derive upper bounds on the optimal variables and solution using the KKT conditions. The philosophy of our proposed solution is thus to calculate these upper-bound values on the optimal variables, then implement suggest-and-improve (SAI) steps until a feasible integer solution is reached. The next subsection therefore shows how to derive the upper bounds on the optimal variables using the KKT conditions on the relaxed non-convex problem.
IV-B. Upper Bounds Using Lagrangian Analysis and the KKT Conditions
Let τ = [τ_1, τ_2, …, τ_K]^T and B = [B_1, B_2, …, B_K]^T. The Lagrangian of the relaxed problem is given by:
L = s + Σ_{k=1}^{K} λ_k (a_k τ_k B_k + b_k B_k + c_k − T) + μ (Σ_{k=1}^{K} B_k − B) − Σ_{k=1}^{K} η_k τ_k + Σ_{k=1}^{K} α_k (B_min − B_k) + Σ_{k=1}^{K} β_k (B_k − B_max) + (θ⁺)^T (E τ − s 1) + (θ⁻)^T (−E τ − s 1)    (9)
where the λ_k's, μ, η_k's, and α_k/β_k's are the Lagrange multipliers associated with the time constraints of the learners in (8c), the total batch size constraint in (8d), the non-negativity constraints on the number of epochs at each node in (8e), and the lower and upper bounds in (8f), respectively. The multipliers θ⁺ and θ⁻ are associated with the constraints that the staleness between each two learners remains below the slack variable s, which we minimize over τ and s. Note that the absolute-value constraint in (8b) can be decoupled as τ_l − τ_m ≤ s and τ_m − τ_l ≤ s, ∀ l ≠ m.
The matrix E ∈ {−1, 0, 1}^{N×K} encodes these pairwise constraints, where N is the number of possible mutual staleness pairs for K users, i.e., N = K(K − 1)/2. For example, for a set of 4 users, N = 6 and the matrix of possibilities will be:
E = [ 1 −1  0  0
      1  0 −1  0
      1  0  0 −1
      0  1 −1  0
      0  1  0 −1
      0  0  1 −1 ]    (10)
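The matrix in (10) generalizes to any K: one row per unordered pair (l, m) with l < m, holding +1 in column l and −1 in column m. A small helper (hypothetical, not from the paper) builds it row by row:

```python
from itertools import combinations

# Build the N x K pairwise-difference matrix of (10): one row per pair
# (l, m) with l < m, +1 in column l and -1 in column m; N = K(K-1)/2.
def staleness_matrix(K):
    rows = []
    for l, m in combinations(range(K), 2):
        row = [0] * K
        row[l], row[m] = 1, -1
        rows.append(row)
    return rows

E = staleness_matrix(4)    # 6 rows for K = 4, matching (10)
```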
Using the KKT conditions, the following theorem gives the optimal values of τ_k and B_k in terms of the Lagrange multipliers.
Theorem 1
The optimal number of updates each learner k can perform is given by:

τ_k* = (α_k − β_k − μ − λ_k b_k) / (λ_k a_k)    (11)

Moreover, the optimal value of B_k is given by the following equation:

B_k* = (η_k − G1_k + G2_k) / (λ_k a_k)    (12)

Each element of the vectors G1 and G2 is a function of the Lagrange multipliers θ⁺ and θ⁻. Proof: The proof of this theorem can be found in Appendix A. The details of how to obtain G1 and G2 can be found in Appendix B.
As expected, because the relaxed problem is non-convex with quadratic constraints, the approach described above resulted in infeasible solutions in some situations. In such cases, we performed constraint checks and then used the initial solution to carry out suggest-and-improve (SAI) steps to reach a feasible solution. The set of feasible solutions was then used as a starting point for the less complex improve method in order to reach the optimal solution.
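A minimal sketch of the suggest-and-improve idea, assuming the quadratic time model of (5): floor the relaxed solution, strip updates from any learner that violates its time constraint, then greedily hand extra updates to the most idle learner while the constraint allows. The exact repair and improve rules used in the paper may differ.

```python
import math

# Suggest-and-improve sketch: floor a (possibly infeasible) relaxed
# solution, repair time-constraint violations by removing updates, then
# greedily add updates back to the most idle learner while feasible.
def time_k(a, b, c, tau, B):
    return a * tau * B + b * B + c

def suggest_and_improve(tau_rel, B_int, a, b, c, T):
    tau = [max(1, math.floor(t)) for t in tau_rel]       # suggest
    K = range(len(tau))
    for k in K:                                          # repair
        while tau[k] > 1 and time_k(a[k], b[k], c[k], tau[k], B_int[k]) > T:
            tau[k] -= 1
    improved = True
    while improved:                                      # improve
        improved = False
        k = min(K, key=lambda i: tau[i])                 # most idle learner
        if time_k(a[k], b[k], c[k], tau[k] + 1, B_int[k]) <= T:
            tau[k] += 1
            improved = True
    return tau
```

Adding updates only to the currently most idle learner is a deliberate choice: it can never increase the maximum staleness, which is the objective being protected.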
V. Results
This section presents the results of the proposed scheme, tested in MEL scenarios emulating realistic edge-node environments and learning tasks. We show the merits of the proposed solution compared to performing asynchronous learning with equal task allocation (ETA), in terms of both staleness and learning performance. For staleness, one of the metrics will be the maximum staleness as described in (6). In addition, we introduce the average staleness shown in (13), which measures the mutual staleness between every two learners, averaged over all learner pairs. The metric for evaluating the learning performance is validation accuracy.
S_avg = (2 / (K(K − 1))) Σ_{l=1}^{K−1} Σ_{m=l+1}^{K} |τ_l − τ_m|    (13)
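Both metrics are direct functions of the per-learner update counts; the sketch below computes them for a hypothetical vector of τ values.

```python
from itertools import combinations

# Staleness metrics over one global cycle: (6) is the worst-case gap in
# local update counts, (13) averages the gap over all learner pairs.
def max_staleness(tau):
    return max(tau) - min(tau)

def avg_staleness(tau):
    pairs = list(combinations(tau, 2))
    return sum(abs(x - y) for x, y in pairs) / len(pairs)

tau = [5, 6, 5, 7]    # hypothetical updates per learner in one cycle
```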
V-A. Simulation Environment, Dataset, and Learning Model
The simulation environment considered is an indoor environment emulating 802.11-type links between edge nodes located within a radius of 50 m. We assume that approximately half of the nodes have the processing capabilities of typical computing devices such as desktops/laptops, while the other half consists of industrial microcontroller-type nodes such as the Raspberry Pi. The employed channel model is summarized in Table 1 of [9].
As a benchmark, the MNIST dataset [14] is used to evaluate the proposed scheme. The training data comprises 60,000 28x28-pixel images, contributing 784 features each. The tested ML algorithm is a simple deep neural network with layer configuration 784-300-124-60-10. The input layer has one node per feature, and the output layer size represents the number of classes (10, one per digit). The parameter set of this model consists of four weight matrices, of sizes 784x300, 300x124, 124x60 and 60x10, and four bias vectors of lengths 300, 124, 60 and 10, respectively. Thus, the size of the model is 8,974,080 bits, which is fixed for all edge nodes. The forward and backward passes require 1,123,736 floating-point operations [15].
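The stated model size can be sanity-checked from the layer configuration; the 8,974,080 bits correspond to the weight matrices alone at 32-bit precision, with the bias vectors contributing a further 494 parameters.

```python
# Quick check of the model-size arithmetic for the 784-300-124-60-10
# fully connected network described above.
layers = [784, 300, 124, 60, 10]
weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))
biases = sum(layers[1:])          # 300 + 124 + 60 + 10 = 494
bits = weights * 32               # weight matrices at 32-bit precision
# weights = 280,440, so bits = 8,974,080, matching the stated model size
```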
V-B. Staleness Analysis
Fig. 2 shows the maximum and average staleness versus the number of nodes for global cycle times of 7.5 s and 15 s, for the asynchronous schemes with optimized task allocation (both the optimizer-based/numerical solution and SAI) as well as the ETA scheme. In general, the SAI-based approach yields staleness similar to the numerical solution from the optimizer. The general trend is that the staleness tends to increase with the number of nodes. However, for the asynchronous scheme with optimized batch allocation, the maximum staleness does not exceed roughly 1 and the average staleness stays between 0.4 and 0.6 as K increases.
For example, for our scheme with 20 users, the maximum staleness is 1 compared to 4 for ETA, and the average staleness is 0.5 compared to 1.5 for ETA, i.e., reductions by factors of 4 and 3, respectively. One curious aspect to note is that, for certain specific numbers of learners, the asynchronous scheme is able to find an optimal solution where the staleness is zero; one such example is K = 14 for one of the considered cycle times.
V-C. Learning Accuracy
Fig. 3 shows the learning accuracy under a fixed limit on the global cycle time for systems of 10, 15 and 20 learners, respectively. For example, in the case with 10 learners, the proposed scheme achieves an accuracy of 95% within 4 updates, or about 1 minute of learning, compared to the synchronous scheme, which requires 8 updates; in other words, we obtain a gain of 50%. In contrast, the asynchronous scheme with equal task allocation fails to converge to even 95% accuracy. With 15 users, an accuracy of 95% is achieved by our scheme within 3 updates whereas the other schemes require 4 updates, giving a gain of 25%. Moreover, our scheme achieves an accuracy of 97% within 8 updates whereas the other two methods require 10 global cycles, leading to a gain of 25%.
A similar gain is achieved for a system with 20 learners at the 95% accuracy mark. For the case of 97% accuracy, our scheme requires 7 updates whereas ETA needs 11 cycles, representing a gain of about 64%. On the other hand, the synchronous scheme requires 8 updates, which translates to a gain of only 12.5%. The gain appears marginal compared to the synchronous scheme because, as the number of users increases, each learner has to process less data, which means a larger number of synchronized updates can be done even in heterogeneous conditions. In contrast, the gain is significant compared to the ETA scheme because the staleness of ETA increases significantly with the number of learners for a fixed global cycle time T.
VI. Conclusion
This paper extends the work done on synchronous MEL to cover optimized task allocation for asynchronous MEL. The focus was reducing the staleness among the gradients of the MEL system by minimizing the maximum difference between the number of updates done by each learner while respecting the delay requirements of resource-constrained edge environments. The resulting optimization problem was an NP-hard ILPQC, which was relaxed to a non-convex problem and solved using readily available solvers as well as analytically, using Lagrangian analysis followed by the SAI approach. Through extensive simulations on the well-known MNIST dataset, the proposed scheme was shown to outperform the asynchronous ETA and synchronous schemes in terms of learning accuracy, and it was shown that the analytical approximation closely matched the solution of the numerical solvers.
References
[1] M. Chiang and T. Zhang, "Fog and IoT: An Overview of Research Opportunities," IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, Dec. 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7498684/
[2] R. Kelly, "Internet of Things Data To Top 1.6 Zettabytes by 2020," Campus Technology, 2015. [Online]. Available: https://campustechnology.com/articles/2015/04/15/internet-of-things-data-to-top-1-6-zettabytes-by-2020.aspx
[3] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A Survey on Mobile Edge Computing: The Communication Perspective," IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017. [Online]. Available: http://arxiv.org/abs/1701.01090
[4] C. You and K. Huang, "Mobile Cooperative Computing: Energy-Efficient Peer-to-Peer Computation Offloading," pp. 1–33, 2017. [Online]. Available: http://arxiv.org/abs/1704.04595
[5] M. Liu and Y. Liu, "Price-Based Distributed Offloading for Mobile-Edge Computing with Computation Capacity Constraints," IEEE Wireless Communications Letters, pp. 1–4, 2017.
[6] U. Mohammad and S. Sorour, "Adaptive Task Allocation for Mobile Edge Learning," Nov. 2018. [Online]. Available: http://arxiv.org/abs/1811.03748
[7] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large Scale Distributed Deep Networks," in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1–11.
[8] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "When Edge Meets Learning: Adaptive Control for Resource-Constrained Distributed Machine Learning," in INFOCOM, 2018.
[9] U. Y. Mohammad and S. Sorour, "Adaptive Task Allocation for Mobile Edge Learning," in 2019 2nd Workshop on Intelligent Computing and Caching at the Network Edge (WEdge 2019), Marrakech, Morocco, Apr. 2019.

[10] W. Zhang, S. Gupta, X. Lian, and J. Liu, "Staleness-Aware Async-SGD for Distributed Deep Learning," in IJCAI International Joint Conference on Artificial Intelligence, 2016, pp. 2350–2356.
[11] T. Tuor, S. Wang, T. Salonidis, B. J. Ko, and K. K. Leung, "Demo abstract: Distributed machine learning at resource-limited edge nodes," in INFOCOM 2018 - IEEE Conference on Computer Communications Workshops, 2018, pp. 1–2.
[12] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive Federated Learning in Resource Constrained Edge Computing Systems," IEEE Journal on Selected Areas in Communications, Early Access, pp. 1–1, 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8664630/
[13] A. D. Pia, S. S. Dey, and M. Molinaro, "Mixed-integer Quadratic Programming is in NP," pp. 1–10, 2014.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, 1998.
[15] B. Shillingford, "What is the time complexity of backpropagation algorithm for training artificial neural networks?" Quora, 2016. [Online]. Available: https://www.quora.com/What-is-the-time-complexity-of-backpropagation-algorithm-for-training-artificial-neural-networks
Appendix A Proof of Theorem 1
From the KKT optimality conditions, we have the following stationarity condition on the Lagrangian in (9):

∇_{τ, B, s} L = 0    (14)

The following sets of equations are obtained after taking the derivatives with respect to τ_k and B_k in terms of the Lagrange multipliers, as shown in (15) and (16), respectively.

∂L/∂τ_k = λ_k a_k B_k − η_k + G1_k − G2_k = 0    (15)

∂L/∂B_k = λ_k (a_k τ_k + b_k) + μ − α_k + β_k = 0    (16)

Solving (16) for τ_k and (15) for B_k gives the results shown in (11) and (12). The procedure to obtain G1_k and G2_k is given in Appendix B.
Appendix B Obtaining G1 and G2
The maximum staleness constraint in (8b) can be rewritten as two separate sets of inequalities as shown below:

E τ ≤ s 1,  i.e.,  τ_l − τ_m ≤ s, ∀ l < m    (17)

−E τ ≤ s 1,  i.e.,  τ_m − τ_l ≤ s, ∀ l < m    (18)

The k-th element of the vector G1, denoted G1_k, is associated with the Lagrange multipliers θ⁺ of the maximum staleness inequalities in (17), whereas G2_k is associated with the multipliers θ⁻ of the inequalities in (18); they are calculated as shown in (19) and (20), respectively.

G1 = E^T θ⁺    (19)

G2 = E^T θ⁻    (20)

As defined earlier, θ⁺ = [θ⁺_1, …, θ⁺_N]^T and θ⁻ = [θ⁻_1, …, θ⁻_N]^T.

In this case, after some manipulations, G1_k can be defined as the following:

G1_k = Σ_{j = n_s(k)}^{n_e(k)} θ⁺_j − Σ_{l=1}^{k−1} θ⁺_{n_s(l) + k − l − 1}    (21)

The start and end indices of the first summation in (21) are defined in (22) and (23), respectively.

n_s(k) = (k − 1)(2K − k)/2 + 1    (22)

n_e(k) = n_s(k) + K − k − 1    (23)

On the other hand, G2_k can simply be defined in the same manner as the following:

G2_k = Σ_{j = n_s(k)}^{n_e(k)} θ⁻_j − Σ_{l=1}^{k−1} θ⁻_{n_s(l) + k − l − 1}    (24)