I. Introduction
Rapid migration towards smart infrastructure (cities, cars, grids, etc.) has caused an explosion of Internet-of-Things (IoT) devices on resource-constrained wireless edge networks. A recent article reports that roughly 127 new devices connect to the internet every second, with 41 billion devices expected by 2027, and Cisco expects that 800 zettabytes of data will be generated [4] on wireless edge networks. The distributed nature of this data will place a heavy financial burden on backbone networks and raise security/privacy concerns [2]. Thus, it is anticipated that edge servers and end devices (e.g. smartphones, cameras, drones, connected vehicles, etc.) will perform 90% of the data processing locally [8].
Machine Learning (ML) techniques have been shown to perform well in many data analytics applications such as forecasting, image classification, and clustering. Many ML techniques, including regression, support vector machines (SVM), and neural networks (NN), are built on gradient-based learning. This usually involves optimizing the model parameters by iteratively moving them along the negative gradient of the loss, itself a function of the model parameters. In the distributed learning model considered in this paper, a central server called an orchestrator initiates the learning process on multiple learners. Each learner performs the ML iterations on its local dataset; the orchestrator collects the local ML models from each learner, performs the global update/aggregation, and sends back the optimal ML model for the next cycle until a stopping criterion is reached.
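The orchestrator–learner cycle described above can be sketched as follows. This is a minimal illustration using a linear-regression loss; the function names (`local_updates`, `orchestrate`) and the batch-size-weighted aggregation are our own illustrative choices, not an implementation from any cited work:

```python
import numpy as np

def local_updates(w, X, y, lr=0.01, num_iters=5):
    """One learner: run `num_iters` gradient steps on its local batch
    (least-squares loss, used here purely as an example)."""
    for _ in range(num_iters):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the local loss
        w = w - lr * grad                      # move along the negative gradient
    return w

def orchestrate(X_parts, y_parts, dim, global_cycles=10, local_iters=5):
    """Orchestrator: broadcast the global model, wait for local updates,
    aggregate the local models weighted by batch size, and repeat."""
    w = np.zeros(dim)
    total = sum(len(y) for y in y_parts)
    for _ in range(global_cycles):
        locals_ = [local_updates(w, X, y, num_iters=local_iters)
                   for X, y in zip(X_parts, y_parts)]
        # global aggregation: batch-size-weighted average of local models
        w = sum((len(y) / total) * wl for wl, y in zip(locals_, y_parts))
    return w
```

In the MEL setting studied below, the batch handed to each learner and the number of local iterations per cycle become optimization variables rather than fixed inputs.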
With the advent of Edge Artificial Intelligence (Edge AI), deploying ML models over end devices at edge networks will soon be the norm. Therefore, researchers have turned their focus to performing machine learning (ML) in a distributed manner (a.k.a. distributed learning (DL)) at the edge
[9, 10, 11, 6, 3, 12] in order to support edge analytics. In general, DL at the edge can be characterized as mobile edge learning (MEL), though the most commonly studied setup is federated learning (FL). The works of [10, 11, 3, 12] focus on jointly optimizing the number of local learning and global update cycles for FL. However, their approaches do not consider the inherent heterogeneity in the computing and communication capacities of different edge learners and links, respectively. Although the works of [1, 13] have optimized resource allocation while maintaining accuracy, they do not investigate the impact of batch allocation. The implications of wireless computation/communication heterogeneity on optimizing batch allocation to different learners for maximizing accuracy while satisfying a delay constraint were studied in [6].
To the best of the authors’ knowledge, this work is the first attempt at jointly optimizing the batch size allocated to learners and synchronizing the number of local and total iterations of the ML algorithm across all learners while satisfying time delay constraints for the MEL paradigm. Therefore, our proposed work differs from the existing literature in the following ways: 1) it considers the impact of batch size allocation; 2) it models the MEL system on actual channel parameters and device capabilities as opposed to generic resource consumption; 3) it optimizes both the number of local and total updates as opposed to maximizing the local updates per global update.
The superiority of our proposed heterogeneity-aware (HA) approach is shown by comparing its performance to the heterogeneity-unaware (HU) approach of [10, 11]. Tests on classifying the MNIST dataset [5] using a DNN show that the HA approach is superior in terms of achieving a lower loss and providing higher validation accuracy. The rest of the paper is organized as follows: Section 2 introduces the global MEL model with time constraints. The problem of interest in this paper is formulated in Section 3 and our proposed solution is described in Section 4. Section 5 presents the results and Section 6 concludes the paper.
II. MEL System Model
II-A Gradient-based Learning Preliminaries
Consider a dataset that consists of samples that can be trained using ML where each sample for has a set of features denoted by and a target . The objective is to find the relationship between and using a set of parameters
such that a loss function,
(or for short), is minimized. Because it is generally difficult to find an analytical solution, an iterative gradient descent approach is typically used to optimize the set of model parameters such that where represents the time step or iteration and is the learning rate, typically set on the interval . In deterministic gradient descent (DGD), the ML model goes over each sample one-by-one, or more commonly, batch-by-batch using a mini-batch approach, until it reaches sample #, completing one epoch. If the data is reshuffled randomly between epochs, this method is known as stochastic GD (SGD). A total of
epochs may be performed depending on the stopping criterion.
II-B Transition to MEL
An MEL system consists of an orchestrator and learners where data samples are allocated to learner , so that it performs learning iterations. Each learner has a computational capacity of in Hz and an associated communication channel to the orchestrator. An example of a DL system is illustrated in Fig. 1. We assume that perfectly orthogonal channels exist.
DL as described in Section I in an MEL setting gives rise to two possibilities: offloaded learning (OL) and federated learning (FL). In the former, the orchestrator has the complete dataset and retransmits optimally allocated batches from the randomly shuffled dataset to the learners. In the latter, the orchestrator only informs each learner of how many iterations to perform and on what sample size of a locally stored dataset; this approach has been more commonly studied in the literature. FL is thus a subset of the OL approach with the batch retransmission from the orchestrator to each learner removed. The MEL model discussion will therefore focus on the more general offloaded learning scenario, but all variations for the federated learning scenario will be clarified whenever needed.
We define as the size of the batch allocated to learner in bits and as the size of the model in bits. The variables and represent the precision with which the data and model are stored, respectively, and represents the feature vector size of for . represents the size of as defined by the ML model, whereas is the proportion of the model dependent on the dataset size.
Between any two global update cycles, the orchestrator, which owns the global model, sends the data and model parameters to each learner in parallel (in the FL scenario, the orchestrator only sends the global model to each learner; the learner selects samples from its private dataset), waits for all learners to complete local learning iterations, and then receives the locally updated model, followed by global aggregation. The communication of the models (and data) between the orchestrator and each learner occurs over a channel having a bandwidth , a channel power gain , and a noise power spectral density of . We assume that over one iteration of the global update and that the channel parameters remain constant during global aggregation.
Furthermore, learner has a local processor resource dedicated to the DL task for an ML model of complexity that requires clock cycles to perform one local iteration. Given the above descriptions, the times of each learner comprise: the orchestrator transmission time needed to send and the data samples (for the FL scenario, the only difference in the model is that the first term of the numerator will not exist) to learner , the duration needed by learner to perform one local update cycle, and the time needed by learner to send its updated local parameter matrix to the orchestrator. The times , , and , respectively, can be expressed as:
(1) 
(2) 
(3) 
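Since the exact expressions in (1)–(3) depend on the symbols defined above, the following sketch only illustrates their general shape under common assumptions: a Shannon-capacity link rate over an AWGN channel and a cycles-per-sample computation model. All function and parameter names here are illustrative, not the paper's notation:

```python
import math

def achievable_rate(W, h, P, N0):
    """Assumed link rate (bits/s) for a learner's channel:
    Shannon capacity W * log2(1 + P*h / (N0*W)) over AWGN."""
    return W * math.log2(1 + P * h / (N0 * W))

def t_send(batch_bits, model_bits, rate):
    """Orchestrator -> learner: allocated batch (OL only) plus global model."""
    return (batch_bits + model_bits) / rate

def t_compute(num_samples, cycles_per_sample, f):
    """One local update cycle: samples x clock cycles per sample / CPU Hz."""
    return num_samples * cycles_per_sample / f

def t_return(model_bits, rate):
    """Learner -> orchestrator: updated local model parameters."""
    return model_bits / rate
```

In the FL variant, `batch_bits` would be dropped from `t_send`, since only the model is transmitted to the learner.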
III. Problem Formulation
As mentioned in Section I, the objective of this paper is to optimize the task allocation, i.e. the distributed batch sizes for each learner and the associated updates to be performed locally for a total of updates, such that the global DL loss is minimized and, thus, the accuracy is maximized. To this end, the problem is formulated as a loss-function minimization problem over the optimization variables , and .
Consider that after every local iterations, a global aggregation is performed, and a total of global aggregations are performed. In any global update cycle, between any learner and the orchestrator , there will be one communication round and local updates. For now, to facilitate the analysis, let us assume that is an integer multiple of such that , and that the communication and computation related parameters remain unchanged over the complete training process. In that case, each learner needs time for one local update and for the global aggregation for . Overall, local updates and global updates will be performed. Then, the total time consumed by learner , denoted by , can be expressed as:
(4) 
Later on, we will show how the values of and for each set of local ML iterations in one global cycle are recalculated according to the latest channel parameters and computational capabilities. The total training time within which the process should be completed is bounded by . Because the iterations occur in parallel over the learners, we need the time for the most time-consuming learner to be less than , such that . Alternatively, it is sufficient for this condition to hold that .
This point differentiates our work from that of [10] in that we capture the time consumed by the parallel local update processes rather than using a generic resource consumption model. Therefore, the optimization problem can be written as:
(5)  
s.t. 
The constants , , and can be defined as:
(6a)  
(6b)  
(6c) 
It is generally impossible to find an exact expression relating the optimization variables to the objective for most ML models. Therefore, the objective will be reformulated as a function of the convergence bounds on the DL process over the edge. For more details on these bounds, the reader is referred to [10]. We will use these results and extend the discussion to our formulation, and then propose a strategy to jointly find the optimal , , and .
III-A Convergence Bounds
The convergence bounds have been derived and discussed in detail in [10]. For completeness, we present some of the important results here in order to support our analysis. Let us continue with the assumption that is an integer multiple of . Then, the global aggregation will only occur every updates, i.e. the local updates occur at every iteration and a global update occurs whenever for . For any interval defined over , define an auxiliary global model, denoted by , which would have been calculated if a global update occurred, as follows:
(7) 
Let the local model parameter set of learner be denoted by and the local loss by . Then, the optimal model at iteration can be obtained by:
(8) 
The optimal will only be visible when and for that iteration, the global loss can be defined by:
(9) 
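Although the symbols in (7) and (9) are not reproduced here, in [10] the auxiliary global model is the batch-size-weighted average of the local parameter sets, and the global loss is the same weighted average of the local losses. A sketch under that assumption (the function names are ours):

```python
import numpy as np

def aux_global_model(local_models, batch_sizes):
    """Eq. (7)-style auxiliary model: batch-size-weighted average of the
    local parameter sets (the model a global update *would* produce)."""
    total = sum(batch_sizes)
    return sum((d / total) * np.asarray(w)
               for w, d in zip(local_models, batch_sizes))

def global_loss(local_loss_fns, batch_sizes, w):
    """Eq. (9)-style global loss: batch-size-weighted average of the
    local losses evaluated at a common model w."""
    total = sum(batch_sizes)
    return sum((d / total) * F(w)
               for F, d in zip(local_loss_fns, batch_sizes))
```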
The following assumptions are made about the loss function at learner : is convex, , and for any , . These assumptions hold for ML models with convex loss functions such as linear regression and SVM. Through simulations, we will show that the proposed solutions also work for non-convex models such as neural networks with ReLU activations.
Let us also assume that the local loss function at does not diverge by more than , such that and . Furthermore, . For any , . Recall that is the learning rate and can be estimated by where:
(10) 
Based on this, the objective can be written as a function of the difference of the global loss after iteration and the optimal global loss. Given the above assumptions about the loss function, and the constraints on the optimization variables and the time taken by learner , the optimization problem can be written as:
(11a)  
s.t.  (11b)  
(11c)  
(11d)  
(11e)  
(11f)  
(11g)  
(11h)  
(11i)  
(11j) 
Constraint (11b) guarantees that the time consumed by a total of updates does not exceed the total available training time of seconds. Constraint (11c) ensures that the total dataset comprising samples is utilized. Constraints (11d) - (11f) are simply non-negativity and integer constraints for the optimization variables, where , and/or all ’s being zero represent cases where DL is not possible in the MEL environment. Constraints (11g) and (11h) represent a bound on the learning rate, meaning it should be small enough to guarantee convergence; when (11h) holds, (11g) will always hold. Constraints (11i) and (11j) define a lower bound on the gap between the optimal loss and the auxiliary loss at interval and the global loss, respectively, where . The parameter represents the interval that minimizes the difference between the auxiliary loss and the global loss. The variables , , and appear in a single term in (11g) which represents a control parameter. Later on, this term will be represented by , but for now, we continue with the original terms to make the analysis relatable to the original variables.
It is assumed that (typically ) and . Furthermore, and can be set to small enough values such that , and the constraints in (11g)-(11j) are satisfied. For a smooth function, Bernoulli’s inequality will hold, implying that . Furthermore, once all the assumptions about the loss function constraints are satisfied, it can be shown that . Thus, the problem in (11) can be reformulated as:
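For reference, the form of Bernoulli's inequality invoked above is the standard one, stated here for completeness (the precise substitution used in the bound of [10] involves the elided symbols):

```latex
(1 + x)^{r} \;\ge\; 1 + r x, \qquad x \ge -1,\; r \ge 1 .
```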
(12a)  
s.t.  (12b)  
(12c)  
(12d)  
(12e)  
(12f) 
Note that the integer constraints on and have been relaxed in (12f) and (12e), which will help in proposing a solution.
IV. Proposed Solution
The idea of the proposed solution is to rewrite the objective as a function of by using the constraints on the total time consumption and the fact that the system must train the model on at least training samples.
IV-A Relating Bounds to
The orchestrator can ensure that the bounds in constraints (11h)(11j) are satisfied by choosing small enough values for and . In that case, if constraint (11g) holds, the denominator of the objective function will be positive. Furthermore, if we relax the integer constraint on , the optimal value for the total learning iterations can be given by:
(13) 
By using the equality constraint in (12c), rearranging (13) to make the subject, and defining two new variables and , we can write as a function of :
(14) 
The objective function denoted by can be rewritten as a function of in the following manner:
(15) 
Theorem 1
is strictly convex on the domain .
Proof: Please refer to Appendix A for the proof.
Because does not have a closed-form solution, the optimal can be obtained by solving the following problem:
(16) 
The value of can be difficult to obtain because is unbounded. However, we can limit the search space by and then use a brute-force approach to find the optimal . In fact, a binary search procedure has been proposed in [10] which has a complexity of . Once has been determined, can be obtained using (14) and reset to this new value. The values of for the next updates can be obtained using (13). Because the integer constraint on was relaxed in (12), they can be set by flooring the actual value. This process is repeated for each global cycle until the total training time is consumed, and is summarized in Algorithm 1.
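Because (13)–(15) are not reproduced in closed form here, the following sketch treats the per-cycle objective as a black-box strictly convex function g over the integers and shows the search-and-floor procedure described above. The function names and the binary-search-on-slope implementation are our own rendering of the logarithmic-complexity search cited from [10], not the paper's exact algorithm:

```python
def argmin_convex_int(g, tau_max):
    """Find the integer tau in [1, tau_max] minimizing a strictly convex
    objective g, by binary search on the discrete slope g(t+1) - g(t);
    needs only O(log tau_max) evaluations of g."""
    lo, hi = 1, tau_max
    while lo < hi:
        mid = (lo + hi) // 2
        if g(mid + 1) < g(mid):   # still descending: minimum lies right of mid
            lo = mid + 1
        else:                     # ascending or flat: minimum at mid or left
            hi = mid
    return lo

def plan_cycle(g, tau_max, total_iters_of):
    """One global cycle: pick the optimal tau, then derive the relaxed
    total-iteration count and floor it to an integer, per the relaxation
    in (12)."""
    tau = argmin_convex_int(g, tau_max)
    total = int(total_iters_of(tau))   # flooring the relaxed value
    return tau, total
```

In the actual scheme, this planning step would be re-run at every global cycle with the latest channel parameters and computational capacities, until the training-time budget is exhausted.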
V. Simulation Results
V-A Simulation Environment, Dataset, and Learning Model
The learners are assumed to be located in a cellular-type environment and to be a combination of smartphone and Raspberry Pi type microcontrollers. The channel parameters and device capabilities are listed in Table I. To test our proposed MEL paradigm, the commonly used MNIST [5] dataset is trained using a DNN with 3 hidden layers consisting of 300, 124, and 60 neurons, respectively. The details of the resulting model sizes and complexities are discussed in
[6].

TABLE I
Parameter | Value
Cell Attenuation Model | dB [7]
Node Bandwidth | 5 MHz
Device proximity | 500 m
Transmission Power | 23 dBm
Noise Power Density | -174 dBm/Hz
Computation Capabilities | GHz
MNIST Dataset size | 54,000 images
MNIST Dataset Features | 784 (28 × 28) pixels
For the simulation, we consider a set of learners and test for total training times of s. It was found that a value of for the learning rate works very well and setting in the range provided solutions that converge. is set to the case where only 3 global aggregations would be done on .
V-A1 Loss and Validation Accuracy
We plot the final loss value after training for time in Fig. 2 and the final accuracy in Fig. 3 for both approaches: the proposed HA approach and the HU approach of [10]. As expected, as the training time increases, the loss value decreases for all approaches. However, for the HA approach, there is only a slight increase in validation accuracy because it already achieves a high accuracy in minimum time. The main conclusion is that jointly optimizing and influences the possible number of global aggregations and the total iterations, which helps in converging to a lower loss and a higher final accuracy. For example, the loss of the HA approach is lower by 0.03-0.05, which represents gains in the range of 27% - 40%. Furthermore, the HA approach achieves 97% accuracy within 300 s of training, a value not achieved by the HU approach even in 600 s.
VI. Conclusion
This paper extends the efforts towards the MEL paradigm by jointly optimizing the task size allocated to each learner and the number of local ML iterations in a global cycle for distributed ML over the wireless edge. The problem uses existing bounds on the DL paradigm to relate the optimization variables to the loss function, which is shown to be convex. It is shown that the optimal value of the local updates minimizes the upper bound on the loss difference, and the total iterations and batch sizes are recomputed after every global step. A heuristic approach is proposed to carry out the global learning process. Through simulations, it is shown that our HA scheme performs much better in terms of the possible number of updates and learning accuracy compared to the HU scheme.
References
 [1] (2019-09) A Joint Learning and Communications Framework for Federated Learning over Wireless Networks. arXiv e-prints, pp. arXiv:1909.07972. External Links: 1909.07972, Link Cited by: §I.
 [2] (2016-12) Fog and IoT: An Overview of Research Opportunities. IEEE Internet of Things Journal 3 (6), pp. 854–864. External Links: Document, ISSN 2327-4662, Link Cited by: §I.
 [3] (2019) Demonstration of Federated Learning in a Resource-Constrained Networked Environment. In 2019 IEEE International Conference on Smart Computing (SMARTCOMP), External Links: Document, ISBN 978-1-7281-1689-1, Link Cited by: §I, §I.
 [4] (2020) Comprehensive Guide to IoT Statistics You Need to Know in 2020. External Links: Link Cited by: §I.
 [5] (1998) Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Link Cited by: §I, §V-A.
 [6] (2019-04) Adaptive Task Allocation for Mobile Edge Learning. In 2019 IEEE Wireless Communications and Networking Conference Workshop (WCNCW), pp. 1–6. External Links: Document, ISBN 978-1-7281-0922-0, Link Cited by: §I, §I, §V-A.
 [7] (2018-12) Multi-Objective Resource Optimization for Hierarchical Mobile Edge Computing. In 2018 IEEE Global Communications Conference: Mobile and Wireless Networks (Globecom2018 MWN), Abu Dhabi, United Arab Emirates, pp. 1–6. External Links: Document, Link Cited by: TABLE I.
 [8] (2015) Internet of Things Data To Top 1.6 Zettabytes by 2020. External Links: Link Cited by: §I.
 [9] (2017) Distributed Deep Neural Networks over the Cloud, the Edge and End Devices. Proceedings - International Conference on Distributed Computing Systems, pp. 328–339. External Links: Document, 1709.01921, ISBN 978-1-5386-1791-5, ISSN 1063-6927 Cited by: §I.
 [10] (2018) When Edge Meets Learning: Adaptive Control for Resource-Constrained Distributed Machine Learning. In INFOCOM, External Links: arXiv:1804.05271v1, Link Cited by: §I, §I, §I, §III-A, §III, §III, §IV-A, §V-A1.
 [11] (2019) Adaptive Federated Learning in Resource Constrained Edge Computing Systems. IEEE Journal on Selected Areas in Communications (Early Access), pp. 1–1. External Links: Document, ISSN 0733-8716, Link Cited by: §I, §I, §I.
 [12] (2019-02) Enabling Flexible Resource Allocation in Mobile Deep Learning Systems. IEEE Transactions on Parallel and Distributed Systems 30 (2), pp. 346–360. External Links: Document, ISSN 1045-9219, Link Cited by: §I, §I.
 [13] (2019-11) Energy Efficient Federated Learning Over Wireless Communication Networks. arXiv e-prints, pp. arXiv:1911.02417. External Links: 1911.02417, Link Cited by: §I.
Appendix
A. Proof of Theorem 1
The objective function is strictly convex under certain conditions because . Because is a positive integer, the optimal value is the argument that minimizes . The reciprocal of in (15) can be separated into two terms, and , as follows:
(17a)  
(17b) 
Therefore, the objective function can be rewritten as where and . Moreover, the term can be written as the reciprocal of where
(18) 
The constants , and . We can say that where is a control parameter that can be set empirically.
For brevity, we will represent as where may be , , , , or . We will also represent and as and , respectively. Using this new notation, can be given as follows:
(19) 
By definition, and because they are related to the time consumed by learner , and assuming that the constraint in (11g) holds. Hence, if we can show that each of the three terms in (19) is strictly positive, then will be strictly convex. We need to show that and . Furthermore, if we can show that and have the same sign, then .
The first derivatives of , , and can be given by:
(20a)  
(20b)  
(20c) 
The variables and , respectively, are both positive quantities. For , the first term outside the square brackets is always negative, whereas the term inside is a sum of positive quantities and is a sum of negative quantities. Therefore, and . Hence, we need to show that or find the domain on which .
The complete expressions of and can be given by:
(21a)  
(21b) 
It can be shown that and are strictly greater than zero because they are both sums of positive terms. Hence, the leftmost term in (19) is strictly positive. Thus, we need to show that or find the domain for which this is true.
Although the complete expression for is omitted for brevity, it can be shown that the necessary and sufficient condition to achieve and is to satisfy . From the Bernoulli inequality, we know that . Assuming the worst case where the equality holds, the expression can be written as . By expanding the expression, we can show that we need to check the following inequality:
(22) 
We know that as long as a feasible is found, we need to satisfy the second term enclosed by the square brackets. Writing as a function of and , we can see that the condition on is the following:
(23) 
Recall that , where is chosen such that , and . Hence, it follows that . If we plot against the domain of , we notice that for , and hence, for to be strictly negative and to be strictly positive, it is sufficient for . Hence, we have now proved that is strictly convex, because , as long as is a positive integer and the ML model variables are selected as defined.