I. Introduction
I-A. Motivation and Background
The accelerated migration towards the era of smart cities mandates the deployment of a large number of Internet-of-Things (IoT) devices, generating exponentially increasing amounts of data at the edge of the network. This data is currently sent to cloud servers for analytics and decision-making that improve the performance of a wide range of systems and services. However, it is expected that the rate and nature of this generated data will prohibit such a centralized processing and analytics option. Indeed, the size of the data will surpass the capabilities of current and even future wireless networks and the internet backbone to transfer it to cloud datacenters [1]. In addition, the nature of this data and the time-criticality of its processing/analytics will force 90% of the processing to be done locally at edge servers and/or at the edge (mostly mobile) nodes themselves (e.g., smart phones, laptops, monitoring cams, drones, connected and autonomous vehicles, etc.) [2].
The above options of edge processing are supported by recent advances in the area of mobile edge computing (MEC) [3] in general, especially collaborative MEC [4, 5, 6] and hierarchical MEC (HMEC) [7, 8]. While the former enables edge nodes to decide whether to perform their computing tasks locally or offload them to the edge servers, the latter involves offloading tasks among the edge nodes themselves, and possibly the edge servers. Such task offloading decisions are usually made while respecting the heterogeneous spare computing resources and communication capacities of the edge network's nodes and links, respectively, so as to minimize task completion delays and/or energy consumption.
While almost all works on MEC and HMEC focused on managing simple independent computing/processing tasks (i.e., tasks initiated separately by each of the edge nodes), HMEC can be easily extended to enable collaboration among edge nodes, and possibly with remote edge or even cloud servers, in performing distributed processing of the same tasks initiated by one or multiple orchestrating nodes. In this setting, the orchestrating node(s) will distribute computing tasks to the processing edge nodes for local processing and then collect their results to conclude the process. The decisions on the task size allocated to each node can again be made, given the heterogeneous communication capacities and spare computing resources in the edge network, so as to achieve delay guarantees and energy savings.
In addition, most of the previous works on MEC and HMEC were limited to basic processing tasks, which clearly contradicts the increasing trend of using machine learning (ML) tools for data analytics. ML includes a wide range of techniques, ranging from simple regression to deep neural networks. These techniques give highly accurate results but are computationally expensive and require large amounts of data. Sometimes, the models may even undergo retraining online to search for the optimal parameters. Hence, a lot of research has been done on how to parallelize these algorithms by distributing the data over multiple nodes. However, most of these works have considered this procedure at the cloud level and over wired distributed computing environments.
Extending the above distributed learning paradigm to resource-constrained edge servers and nodes was not explored until very recently [9, 10, 11]. The work of [9] explores the possibility of using connected mobile Android devices in a distributed manner to train on collected images in order to improve accuracy. The objective is to be able to train adaptive deep learning models based on the collected images. The authors of [10] propose an exit strategy model for a distributed deep learning network. Based on an accuracy measure, the learning is done either collaboratively between users, at the edge, or on the cloud. A new deep learning algorithm is designed in [11] to recognize food for dietary needs using a mobile device and a server, by splitting the processing between the device and the edge. As one can observe, the objective of these approaches is to train networks in a distributed manner in order to improve training accuracy. The work of [10] does aim to restrict transmissions to the edge or cloud by defining an acceptable level of accuracy. However, these works do not tackle distributing the machine learning task in wireless MECs by simultaneously optimizing resource utilization and maintaining an acceptable level of accuracy.
More recently, Tuor et al. [12] aimed to unify the number of local learning iterations in resource-constrained edge environments in order to maximize accuracy. The proposed approach jointly optimized the number of local learning and global update cycles at the learning edge servers/nodes (or learners for short) and the orchestrator, respectively. To deal with the problem of "deviating gradients", a newer approach is suggested in [13] to improve the frequency of global updates. However, all the above works investigating distributed ML on resource-constrained wireless devices assumed equal distribution of task sizes to all edge nodes/servers, thus ignoring the typical heterogeneity in computing and communication capacities of different nodes and links, respectively. The implications of such computing and communication heterogeneity on optimizing the task allocation to different learners, selecting learning models, improving learning accuracy, minimizing local and global cycle times, and/or minimizing energy consumption, are clearly game-changing, yet never investigated.
I-B. Contribution
To the best of the authors' knowledge, this work is the first attempt to develop realistic distributed learning algorithms, with performance guarantees, on cloudlet(s) of heterogeneous wireless edge nodes. We will refer to this new paradigm as "Mobile Edge Learning (MEL)". To achieve the above MEL end-goal, new research tracks need to be developed to explore the interplay and joint optimization of learning model selection, learning accuracy, task allocation, resource provisioning, node selection/arrangement, local/global cycle times, and energy consumption. This paper inaugurates this MEL research by considering the problem of dynamic task allocation for distributed learning over heterogeneous wireless edge learners (i.e., edge nodes with heterogeneous computing capabilities and heterogeneous wireless links to the orchestrator). This task allocation is conducted so as to maximize the learning accuracy, while guaranteeing that the total times of data distribution/aggregation over the heterogeneous channels, and of the local computing iterations at the heterogeneous nodes, are bounded by a duration preset by the orchestrator. The maximization of the learning accuracy is achieved by maximizing the number of local learning iterations per global update cycle [12].
To this end, the problem is first formulated as a quadratically-constrained integer linear problem. Being an NP-hard problem, the paper relaxes it to a non-convex problem over real variables. Analytical upper bounds on the optimal solution of this relaxed problem are derived using Lagrangian analysis and the KKT conditions. The proposed algorithm thus starts from these computed bounds, and then runs suggest-and-improve steps to reach a feasible integer solution. For a large number of learners, we also propose a heuristic solution based on implementing suggest-and-improve steps starting from equal batch allocation. The merits of these proposed algorithms are finally exhibited through extensive simulations, comparing their performances to both numerical solutions and the equal task allocation approach of
[12, 13].

II. System Model for MEL
II-A. Distributed Learning Background
Distributed learning is defined by the operation of running one machine learning task on one global dataset over a system of $K$ learners. Learner $k$, $k \in \{1, \dots, K\}$, trains its local learning model on a subset $\mathcal{B}_k$ of the global dataset. We will refer to each of these dataset subsets as a batch. The number of samples in batch $\mathcal{B}_k$ is denoted by $d_k$, and the size of the global dataset is denoted by $d$, such that $\sum_{k=1}^{K} d_k = d$.
In machine learning (ML), the loss function is typically defined by $f(\mathbf{w}, \mathbf{x}_i, y_i)$, where $i$ represents one sample of batch $\mathcal{B}_k$, $\mathbf{x}_i$ is the vector of features or observations, $y_i$ is the associated label or class to which sample $i$ belongs, and $\mathbf{w}$ is the parameter matrix of the employed learning approach. For most ML algorithms, the parameter matrix consists of weight vectors. For instance, it consists of the weights and biases in neural networks. Since the observation/feature vectors and labels are not variable, the loss function can be concisely denoted by $f_i(\mathbf{w})$. Given the above notation, the local loss function at each learner $k$ can be given by:

$F_k(\mathbf{w}) = \frac{1}{d_k} \sum_{i \in \mathcal{B}_k} f_i(\mathbf{w})$   (1)
The global loss function at the orchestrator is thus computed by aggregating all learners’ loss functions as follows:
$F(\mathbf{w}) = \sum_{k=1}^{K} \frac{d_k}{d}\, F_k(\mathbf{w})$   (2)
The objective of the orchestrator in any distributed learning setting is to minimize the global loss function over the parameter matrix, which can be expressed as:
$\mathbf{w}^* = \arg\min_{\mathbf{w}} F(\mathbf{w})$   (3)
The optimal solution for this type of problem is generally difficult to obtain analytically. Consequently, these problems are solved using gradient descent (GD) or stochastic gradient descent (SGD) methods, based on the employed batch distribution approach among learners. By applying the GD or SGD method at the $k$th learner, the local parameter matrix $\mathbf{w}_k$ is thus updated at the $l$th local iteration as follows:

$\mathbf{w}_k^{(l)} = \mathbf{w}_k^{(l-1)} - \eta \nabla F_k\big(\mathbf{w}_k^{(l-1)}\big)$   (4)

where $\eta$ is the learning rate.
Clearly, the local parameter matrices $\mathbf{w}_k$ differ from one another during the local update iterations, because the learners are processing different batches. After $\tau$ local iterations, the learners send these local parameter matrices to the orchestrator, which can then recompute an updated unique parameter matrix for the next global update cycle with some form of averaging, such as:
$\mathbf{w} = \sum_{k=1}^{K} \frac{d_k}{d}\, \mathbf{w}_k^{(\tau)}$   (5)
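To make the cycle concrete, the following sketch simulates the scheme in (1)-(5) on a toy least-squares task: each learner runs $\tau$ local gradient steps on its own batch, then the orchestrator averages the local parameters weighted by $d_k/d$. The quadratic loss, learning rate, and batch sizes are illustrative assumptions, not values from the text.

```python
import numpy as np

# Distributed GD with weighted parameter averaging, per (1)-(5).
rng = np.random.default_rng(0)
K, n_features, tau, eta = 4, 5, 10, 0.1
d_k = np.array([30, 50, 20, 40])     # heterogeneous batch sizes
d = d_k.sum()

w_true = rng.normal(size=n_features)
X = [rng.normal(size=(dk, n_features)) for dk in d_k]
y = [Xk @ w_true for Xk in X]

def local_loss(w, Xk, yk):           # F_k(w) as in (1), squared error
    return 0.5 * np.mean((Xk @ w - yk) ** 2)

def global_loss(w):                  # F(w) as in (2), weighted by d_k / d
    return sum(dk / d * local_loss(w, Xk, yk) for dk, Xk, yk in zip(d_k, X, y))

w = np.zeros(n_features)             # global parameter vector
for cycle in range(5):               # global update cycles
    local_ws = []
    for Xk, yk in zip(X, y):
        wk = w.copy()                # learner initializes w_k = w
        for _ in range(tau):         # tau local (deterministic) GD steps, (4)
            wk -= eta * Xk.T @ (Xk @ wk - yk) / len(yk)
        local_ws.append(wk)
    w = sum(dk / d * wk for dk, wk in zip(d_k, local_ws))   # averaging, (5)

print(global_loss(w))                # much smaller than the initial loss
```

Deterministic GD is used inside each learner only for simplicity of the sketch; swapping in SGD mini-batches changes nothing structural.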
II-B. Transition to MEL System Model
The modelling and formulation presented above [12] do not lend themselves well to wireless or heterogeneous edge learner environments, where learners have different computing capabilities and different channel qualities to the orchestrator. To migrate the general distributed learning model to the MEL paradigm, we will redefine the distributed learning model in an MEC/HMEC context.
Consider an edge orchestrator (e.g., an edge server or even one of the edge nodes) that wants to perform a distributed learning task on a specific dataset over $K$ heterogeneous wireless edge learners. In each global cycle, it thus sends to each learner $k$ a batch $\mathcal{B}_k$ of random samples from this dataset and the initial parameter matrix $\mathbf{w}$.^1

^1 This model assumes randomized batch allocations to the learners in each global cycle, and an SGD update approach to compute the gradients. This choice is justified by the generality of this approach (i.e., the deterministic GD approach does not need to send data in every cycle) and its proven superior learning accuracy performance [14].
The batch $\mathcal{B}_k$, allocated to learner $k$, is assumed to have a size of $B_k^{data}$ bits, which can be computed as follows:

$B_k^{data} = d_k F P_d$   (6)
where $F$ is the number of features in the dataset, and $P_d$ is the data bit precision. For example, the MNIST dataset has 60,000 images of size 28 × 28 stored as unsigned integers ($P_d = 8$), and thus $d F P_d \approx 376$ Mbits. On the other hand, the parameter matrix is assumed to be of size $B_k^{model}$ bits, which can be expressed for learner $k$ as:

$B_k^{model} = P_m \left( S_d d_k + S_m \right)$   (7)
where $P_m$ is the model bit precision (typically floating-point precision). As shown in the above equation, the parameter matrix size consists of two parts, one depending on the batch size (represented by the term $S_d d_k$, where $S_d$ is the number of model coefficients related to each sample of the batch), and the other related to the constant size of the employed ML model (denoted by $S_m$).
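As a quick numeric illustration of the two size expressions (6) and (7), the snippet below uses MNIST's 784 features stored as 8-bit unsigned integers and a 32-bit floating-point model; the per-sample and constant model-size terms are hypothetical values chosen only for the example.

```python
# Sizes of the transmitted batch and parameter matrix, per (6) and (7).
F, P_d, P_m = 784, 8, 32             # features, data precision, model precision

def data_bits(d_k):                  # (6): batch size in bits
    return d_k * F * P_d

def model_bits(d_k, S_d, S_m):       # (7): parameter-matrix size in bits
    return P_m * (S_d * d_k + S_m)

# The whole MNIST dataset: 60,000 * 784 * 8 bits = 376.32 Mbits.
print(data_bits(60_000) / 1e6)       # -> 376.32
# A model whose size does not grow with the batch (S_d = 0), e.g. a fixed
# network with 195,000 parameters, occupies 6,240,000 bits:
print(model_bits(1_000, S_d=0, S_m=195_000))   # -> 6240000
```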
As mentioned above, the orchestrator sends the concatenation of the aforementioned data and model bits with power $P$ over a wireless channel with bandwidth $W$ and complex channel power gain $h_k$. Once this information is received by learner $k$, it sets its local parameter matrix $\mathbf{w}_k = \mathbf{w}$. It then performs $\tau$ local update cycles on this local parameter matrix, using its allocated batch $\mathcal{B}_k$. For each iteration in typical ML, the algorithm sequentially goes over all features once for each data sample. Consequently, the number of computations $X_k$ required per iteration is equal to:

$X_k = C_m d_k$   (8)
which clearly depends on the number $d_k$ of data samples assigned to each node and on the computational complexity $C_m$ of the model. Although the data may be stored as other types, the operations themselves are typically floating-point operations.
The updated $\mathbf{w}_k$ at each learner is then transmitted back to the orchestrator with the same power on the same channel. The orchestrator will then update the global parameter matrix as described in (5). Once done, it sends back this updated matrix with a batch of new random samples from the dataset to each learner, and the process repeats.
To this end, define the time $T$ as the duration initiated when the orchestrator starts allocating and sending batches to the learners in each global cycle, until the orchestrator starts this cycle's global update operation. We will refer to the time $T$ as the global cycle clock (to differentiate it from the total global cycle duration, consisting of the global cycle clock and the global update processing time). Clearly, this duration should encompass the time needed by the three processes of the distributed processing in each global cycle:

The time $t_k^{S}$ needed to send the allocated batch $\mathcal{B}_k$ and global parameter matrix $\mathbf{w}$ to learner $k$. Defining the transmission rate between the orchestrator and learner $k$ by $R_k = W \log_2\!\left(1 + \frac{P h_k}{N_0 W}\right)$, the time $t_k^{S}$ can be expressed for learner $k$ as:

$t_k^{S} = \frac{d_k F P_d + P_m \left( S_d d_k + S_m \right)}{R_k}$   (9)

where $N_0$ is the noise power spectral density.

$\tau$ times the time $t_k^{C}$ needed to perform one local update iteration at learner $k$. Defining $f_k$ as learner $k$'s local processor frequency dedicated to updating the parameter matrix $\mathbf{w}_k$, the time $t_k^{C}$ can be expressed for learner $k$ as:

$t_k^{C} = \frac{C_m d_k}{f_k}$   (10)
The time $t_k^{R}$ needed to receive the updated local parameter matrix $\mathbf{w}_k$ from learner $k$. Assuming the channel between learner $k$ and the orchestrator is reciprocal and does not change during the duration of one global update cycle, the time $t_k^{R}$ can be computed for learner $k$ as:

$t_k^{R} = \frac{P_m \left( S_d d_k + S_m \right)}{R_k}$   (11)
Thus, the total time $t_k$ taken by learner $k$ to complete the above three processes is equal to:

$t_k = t_k^{S} + \tau\, t_k^{C} + t_k^{R}$   (12)
We will refer to the time $t_k$ of learner $k$ as its round-trip distributed processing duration. Clearly, $t_k$ must be smaller than or equal to the global cycle clock $T$ for the orchestrator to have all the needed information to perform its global cycle update processing in time.
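The round-trip timing model above can be sketched numerically for a single learner, assuming the Shannon-capacity transmission rate and the operations-per-sample computing model described in the text; all parameter values below are illustrative assumptions.

```python
import math

# Round-trip duration of one learner, per (9)-(12).
W   = 5e6                        # per-node bandwidth [Hz]
P   = 10 ** (23 / 10) * 1e-3     # 23 dBm transmit power [W]
N0  = 10 ** (-174 / 10) * 1e-3   # -174 dBm/Hz noise PSD [W/Hz]
h   = 1e-10                      # channel power gain (assumed, incl. path loss)
f_k = 2.4e9                      # processor frequency [Hz]
C_m = 781_208                    # operations per sample per iteration

F, P_d, P_m = 648, 8, 32         # features, data and model bit precisions
S_d, S_m = 0, 195_000            # model-size terms (assumed)
d_k, tau = 1_000, 20             # batch size and local iterations

R_k = W * math.log2(1 + P * h / (N0 * W))    # transmission rate, as in (9)

B_data  = d_k * F * P_d                      # batch bits, (6)
B_model = P_m * (S_d * d_k + S_m)            # model bits, (7)

t_S = (B_data + B_model) / R_k               # (9)  send batch + model
t_C = C_m * d_k / f_k                        # (10) one local iteration
t_R = B_model / R_k                          # (11) return updated model
t_k = t_S + tau * t_C + t_R                  # (12) round-trip duration

print(R_k / 1e6, t_k)                        # rate [Mbps], total time [s]
```

The orchestrator would check the resulting `t_k` against its global cycle clock `T` when allocating batches.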
III. Problem Formulation
As mentioned in Section I, the objective of this first paper on MEL is to optimize the task allocation (i.e., the distributed batch size $d_k$ for each learner $k$) so as to maximize the accuracy of the distributed learning process in each global cycle (and thus eventually the accuracy of the entire learning process), within a global cycle clock $T$ preset by the orchestrator. It is well established in the literature that the loss function in general ML, and the local and global loss functions in distributed learning (expressed in (1) and (2)), using GD or SGD, are minimized by increasing the number of learning iterations [15]. For distributed learning, this is equivalent to maximizing the number of local iterations $\tau$ in each global cycle [16]. Thus, maximizing the MEL accuracy is achieved by maximizing $\tau$.
Given the above model and facts, our objective can thus be reworded as optimizing the batch size assigned to each of the $K$ learners so as to maximize the number of local iterations per global update cycle, while bounding each round-trip distributed processing duration by the preset global cycle clock $T$. The optimization variables in this problem are thus $\tau$ and the $d_k$'s. We can thus rewrite the expression of $t_k$ in (12) as a function of the optimization variables as follows:
$t_k = C_k^{(2)} \tau d_k + C_k^{(1)} d_k + C_k^{(0)}$   (13)

where $C_k^{(2)}$, $C_k^{(1)}$, and $C_k^{(0)}$ represent the quadratic, linear, and constant coefficients of learner $k$ in terms of the optimization variables $d_k$ and $\tau$, expressed as:

$C_k^{(2)} = \frac{C_m}{f_k}$   (14)

$C_k^{(1)} = \frac{F P_d + 2 P_m S_d}{R_k}$   (15)

$C_k^{(0)} = \frac{2 P_m S_m}{R_k}$   (16)
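The following sanity check confirms that the compact quadratic form (13) with the coefficients (14)-(16) reproduces the direct sum of the three timing terms in (9)-(12); all numeric parameters are illustrative assumptions.

```python
import math

# Verify t_k in quadratic form (13) against the direct sum (9)-(12).
W, P, N0, h = 5e6, 0.2, 4e-18, 1e-10   # link parameters (assumed)
F, P_d, P_m = 648, 8, 32               # features and bit precisions
S_d, S_m    = 2, 10_000                # model-size terms (assumed)
C_m, f_k    = 781_208, 2.4e9           # model complexity, CPU frequency
d_k, tau    = 500, 15                  # batch size, local iterations

R_k = W * math.log2(1 + P * h / (N0 * W))

C2 = C_m / f_k                         # (14) quadratic (tau * d_k) coefficient
C1 = (F * P_d + 2 * P_m * S_d) / R_k   # (15) linear (d_k) coefficient
C0 = 2 * P_m * S_m / R_k               # (16) constant coefficient

t_quadratic = C2 * tau * d_k + C1 * d_k + C0          # (13)

B_model = P_m * (S_d * d_k + S_m)
t_direct = ((d_k * F * P_d + B_model) / R_k           # send, (9)
            + tau * C_m * d_k / f_k                   # compute, (10)
            + B_model / R_k)                          # receive, (11)

print(t_quadratic, t_direct)           # identical up to rounding error
```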
Clearly, the relationship between $t_k$ and the optimization variables $d_k$ and $\tau$ is quadratic. Furthermore, the optimization variables $d_k$ and $\tau$ are all non-negative integers. Consequently, the problem of interest in this paper can be formulated as an integer linear program with quadratic and linear constraints as follows:
$\max_{\tau, \{d_k\}} \ \tau$   (17a)

s.t. $C_k^{(2)} \tau d_k + C_k^{(1)} d_k + C_k^{(0)} \le T, \quad k = 1, \dots, K$   (17b)

$\sum_{k=1}^{K} d_k = d$   (17c)

$\tau \in \mathbb{Z}, \ \tau \ge 0$   (17d)

$d_k \in \mathbb{Z}, \ d_k \ge 0, \quad \forall k$   (17e)
Constraint (17b) ensures that each learner's round-trip distributed processing time does not exceed $T$. Constraint (17c) guarantees that the sizes of the batches assigned to all learners sum to the entire dataset size $d$ that the orchestrator needs to analyze. Constraints (17d) and (17e) restrict the number of local iterations and the assigned batch sizes to non-negative integers.
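The combinatorial nature of the formulation can be seen on a toy instance: the brute-force enumeration below solves (17) exactly for two learners with assumed coefficients, an approach that clearly becomes intractable as the number of learners and the dataset size grow.

```python
# Tiny brute-force solver for the ILPQC in (17), K = 2 learners.
# Coefficient values are illustrative assumptions.
C2 = [0.002, 0.004]     # per-learner quadratic coefficients, as in (14)
C1 = [0.010, 0.020]     # linear coefficients, as in (15)
C0 = [0.5, 0.8]         # constant coefficients, as in (16)
d, T = 30, 5.0          # dataset size (samples) and global cycle clock [s]

def feasible(tau, dks):  # all round-trip times within the cycle clock, (17b)
    return all(c2 * tau * dk + c1 * dk + c0 <= T
               for c2, c1, c0, dk in zip(C2, C1, C0, dks))

best_tau, best_alloc = 0, None
for d1 in range(d + 1):              # enumerate d_1 + d_2 = d, (17c)
    dks = (d1, d - d1)
    for tau in range(1, 1000):       # grow tau until a constraint breaks
        if not feasible(tau, dks):
            break
        if tau > best_tau:
            best_tau, best_alloc = tau, dks

print(best_tau, best_alloc)          # -> 102 (21, 9)
```

Note that the optimal split (21, 9) is far from equal allocation, which caps out at far fewer local iterations on this instance.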
The above problem is an integer linear program with quadratic constraints (ILPQC), which is well-known to be NP-hard [17]. We will thus propose a simpler solution to it through relaxation of the integer constraints in the next section.
IV. Proposed Solution
IV-A. Problem Relaxation
As shown in the previous section, the problem of interest in this paper is NPhard due to its integer decision variables. We thus propose to simplify the problem by relaxing the integer constraints in (17d) and (17e), solving the relaxed problem, then rounding the obtained real results back into integers. The relaxed problem can be thus given by:
$\max_{\tau, \{d_k\}} \ \tau$   (18a)

s.t. $C_k^{(2)} \tau d_k + C_k^{(1)} d_k + C_k^{(0)} \le T, \quad k = 1, \dots, K$   (18b)

$\sum_{k=1}^{K} d_k = d$   (18c)

$\tau \ge 0$   (18d)

$d_k \ge 0, \quad \forall k$   (18e)
where the cases of $\tau$ and/or all the $d_k$'s being zero represent scenarios where MEL is not feasible, and the orchestrator must thus send the learning task to an edge or cloud server. The resulting program is a linear program with quadratic constraints over real variables. This problem can be solved using interior-point or ADMM methods, and there are efficient solvers (such as OPTI) that implement these approaches [18].
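As a simple alternative sketch to a general solver, note that for a fixed $\tau$ the time constraints cap each relaxed batch size at (clock minus constant coefficient) over (quadratic coefficient times $\tau$ plus linear coefficient); the relaxed problem is then feasible iff these caps sum to at least $d$, and since that sum is strictly decreasing in $\tau$, the relaxed optimum can be found by bisection. The coefficients below are illustrative assumptions.

```python
# Bisection on tau for the relaxed problem (18), using per-learner caps.
C2 = [0.002, 0.004, 0.003]    # quadratic coefficients (assumed)
C1 = [0.010, 0.020, 0.015]    # linear coefficients (assumed)
C0 = [0.5, 0.8, 0.6]          # constant coefficients (assumed)
d, T = 60, 5.0                # dataset size and global cycle clock

def total_capacity(tau):
    # sum of the per-learner batch caps implied by the time constraints
    return sum((T - c0) / (c2 * tau + c1)
               for c2, c1, c0 in zip(C2, C1, C0))

lo, hi = 0.0, 1.0
while total_capacity(hi) >= d:   # grow the bracket until tau is infeasible
    hi *= 2
for _ in range(60):              # bisect on the largest feasible tau
    mid = (lo + hi) / 2
    if total_capacity(mid) >= d:
        lo = mid
    else:
        hi = mid

tau_star = lo
d_star = [(T - c0) / (c2 * tau_star + c1)
          for c2, c1, c0 in zip(C2, C1, C0)]   # caps are tight at the optimum
print(tau_star, sum(d_star))                   # sum of batches ~= d
```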
On the other hand, the matrix associated with each of the quadratic constraints in the relaxed problem can be written in symmetric form. However, each of these matrices has only two non-zero entries, which are positive and equal to each other. Their eigenvalues thus sum to zero, which means these matrices are not positive semi-definite, and hence the relaxed problem is non-convex. Consequently, we cannot derive the optimal solution of this problem analytically. Yet, we can still derive upper bounds on the optimal variables and the optimal solution using Lagrangian relaxation and the KKT conditions.
The philosophy of our proposed solution is thus to compute these upper bounds on the optimal variables, and then implement suggest-and-improve steps until a feasible integer solution is reached. The next subsection derives these upper bounds by applying Lagrangian analysis and the KKT conditions to the relaxed non-convex problem.
IV-B. Upper Bounds using the KKT Conditions
The Lagrangian function of the relaxed problem is expressed as:
$\mathcal{L} = -\tau + \sum_{k=1}^{K} \lambda_k \left( C_k^{(2)} \tau d_k + C_k^{(1)} d_k + C_k^{(0)} - T \right) + (\mu_1 - \mu_2) \left( \sum_{k=1}^{K} d_k - d \right) - \nu_0 \tau - \sum_{k=1}^{K} \nu_k d_k$   (19)
where the $\lambda_k$'s, $\mu_1$/$\mu_2$, and $\nu_0$/$\nu_k$'s are the Lagrangian multipliers associated with the time constraints of the $K$ learners in (18b), the total batch size constraint in (18c), and the non-negativity constraints of all the optimization variables in (18d) and (18e), respectively. Note that the equality constraint in (18c) can be represented by the two inequality constraints $\sum_k d_k - d \le 0$ and $d - \sum_k d_k \le 0$, and is thus associated with the two Lagrangian multipliers $\mu_1$ and $\mu_2$. Using the well-known KKT conditions, the following theorem introduces upper bounds on the optimal variables of the relaxed problem.
Theorem 1
The optimal values $d_k^*$ of the batch sizes allocated to the different learners in the relaxed problem satisfy the following bound:

$d_k^* \le \frac{T - C_k^{(0)}}{C_k^{(2)} \tau^* + C_k^{(1)}}$   (20)
Moreover, the analytical upper bound on the optimization variable $\tau$ belongs to the solution set of the polynomial given by:

$d \prod_{k=1}^{K} \left( C_k^{(2)} \tau + C_k^{(1)} \right) - \sum_{k=1}^{K} \left( T - C_k^{(0)} \right) \prod_{l \neq k} \left( C_l^{(2)} \tau + C_l^{(1)} \right) = 0$   (21)
Proof: From the KKT optimality conditions, we have the following relations:

$C_k^{(2)} \tau d_k + C_k^{(1)} d_k + C_k^{(0)} - T \le 0, \quad \forall k$   (22)

$\sum_{k=1}^{K} d_k - d = 0$   (23)

$\lambda_k \left( C_k^{(2)} \tau d_k + C_k^{(1)} d_k + C_k^{(0)} - T \right) = 0, \quad \forall k$   (24)

$\nu_0 \tau = 0$   (25)

$\nu_k d_k = 0, \quad \forall k$   (26)

$\frac{\partial \mathcal{L}}{\partial d_k} = \lambda_k \left( C_k^{(2)} \tau + C_k^{(1)} \right) + \mu_1 - \mu_2 - \nu_k = 0, \quad \forall k$   (27)

$\frac{\partial \mathcal{L}}{\partial \tau} = -1 + \sum_{k=1}^{K} \lambda_k C_k^{(2)} d_k - \nu_0 = 0$   (28)
From the primal feasibility conditions in (22), we can see that the batch size at learner $k$ must satisfy (20). Moreover, it can be inferred from the complementary slackness conditions in (24) that the bound in (20) holds with equality whenever $\lambda_k > 0$.
In addition, it is clear from (25) that either $\tau^* = 0$ (which means that there is no feasible MEL solution and the orchestrator must offload the entire task to the edge/cloud servers) or $\nu_0 = 0$ (which we get if the problem is feasible). By rewriting the bound on $d_k$ in (20) as an equality and taking the sum over all $k$, we have the following relation:

$d = \sum_{k=1}^{K} \frac{T - C_k^{(0)}}{C_k^{(2)} \tau + C_k^{(1)}}$   (29)
The expression on the right-hand side has the form of a partial-fraction expansion of a rational polynomial function of $\tau$. Therefore, we can expand it in the following way:

$d = \sum_{k=1}^{K} \frac{\left( T - C_k^{(0)} \right) / C_k^{(2)}}{\tau + C_k^{(1)} / C_k^{(2)}}$   (30)
Finally, the expanded form can be recombined into a single rational function of $\tau$, which is equated to the total dataset size $d$:

$d = \frac{\sum_{k=1}^{K} \left( T - C_k^{(0)} \right) \prod_{l \neq k} \left( C_l^{(2)} \tau + C_l^{(1)} \right)}{\prod_{k=1}^{K} \left( C_k^{(2)} \tau + C_k^{(1)} \right)}$   (31)
Please note that the degrees of the numerator and denominator are $K-1$ and $K$, respectively. Furthermore, the poles of this rational function are $\tau = -C_k^{(1)}/C_k^{(2)}$ and, since all the coefficients are positive, these poles are all negative (i.e., the corresponding system is stable). Furthermore, $\tau = 0$ is not a feasible solution for the problem, because it is eliminated by the constraints. Therefore, we can rewrite (31) as shown in (21). By solving this polynomial, we obtain a set of solutions for $\tau$, one of which is feasible. The problem being non-convex, this feasible solution constitutes an upper bound on the optimal solution of the relaxed problem.
Though it was expected that the above bound for $d_k$ in (20) and the solution for $\tau$ from (21) would need to undergo suggest-and-improve steps to reach a feasible solution, we have found in our extensive simulations, presented in Section V, that these expressions were always already feasible. This means that no suggest-and-improve steps were needed, and that the expressions can be directly used for batch allocation to achieve the optimal $\tau$ for the relaxed problem.
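A Theorem-1-style allocation can be sketched as follows: form the polynomial in (21), take its positive real root as the bound on $\tau$, and then allocate batches through the bound (20) taken with equality. The coefficient values are illustrative assumptions.

```python
import numpy as np

# Batch allocation from the polynomial (21) and the bound (20).
C2 = np.array([0.002, 0.004, 0.003])   # quadratic coefficients (assumed)
C1 = np.array([0.010, 0.024, 0.012])   # linear coefficients (assumed)
C0 = np.array([0.5, 0.8, 0.6])         # constant coefficients (assumed)
d, T = 60, 5.0
K = len(C2)

# Build d * prod_k (C2_k tau + C1_k) - sum_k (T - C0_k) prod_{l!=k} (...).
poly = np.array([float(d)])
for c2, c1 in zip(C2, C1):
    poly = np.polymul(poly, [c2, c1])
for k in range(K):
    term = np.array([T - C0[k]])
    for l in range(K):
        if l != k:
            term = np.polymul(term, [C2[l], C1[l]])
    poly = np.polysub(poly, term)

roots = np.roots(poly)
# Exactly one root is positive here: the rational function is strictly
# decreasing for tau > 0, since all its poles -C1_k/C2_k are negative.
tau = max(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0)

d_star = (T - C0) / (C2 * tau + C1)    # (20) taken with equality
print(tau, d_star, d_star.sum())       # the batch sizes sum to d
```

In practice the real-valued `tau` and `d_star` would then be rounded, with suggest-and-improve steps restoring feasibility if rounding breaks a constraint.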
IV-C. Heuristic Method for Large $K$
We can see that obtaining the above solution for $\tau$ in (21) requires solving a $K$th-order polynomial, which may be computationally expensive for large $K$. An alternative for such large settings can be derived as follows. We can easily infer that the two extremes of batch allocation in the original problem are either:

$K-1$ nodes are allocated one data sample each, while the remaining node takes the other $d - K + 1$ samples. In this case, the sum of the reciprocals of the batch sizes will be $K - 1$ plus a negligible value. As the number of nodes increases, this sum will thus be unbounded.

Equal batch allocation, where each node processes $d/K$ samples. In this case, the sum of the reciprocals of the batch sizes will be $K^2/d$.
From the above cases, one option for large $K$ would thus be to start the solution from the equal batch allocation setting, and use the proposed suggest-and-improve steps from this starting point. By setting $d_k = d/K$, summing the reciprocals of $d_k$ in (20) (taken with equality), and solving for $\tau$, we can thus use the following starting point $\tau_0$:

$\tau_0 = \left\lfloor \frac{K^2/d - \sum_{k=1}^{K} C_k^{(1)} / \left( T - C_k^{(0)} \right)}{\sum_{k=1}^{K} C_k^{(2)} / \left( T - C_k^{(0)} \right)} \right\rfloor$   (32)
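The equal-allocation starting point can be sketched as below: with $d_k = d/K$, summing the reciprocals of the bound (20), taken with equality, over the $K$ learners yields a linear equation in $\tau$, giving (32). All numeric coefficients here are illustrative assumptions.

```python
import math

# Heuristic starting point tau_0 of (32) for a large number of learners.
K, d, T = 20, 10_000, 5.0
C2 = [0.0005 + 0.0001 * (k % 5) for k in range(K)]   # quadratic coefficients
C1 = [0.0010 + 0.0002 * (k % 4) for k in range(K)]   # linear coefficients
C0 = [0.20 + 0.05 * (k % 3) for k in range(K)]       # constant coefficients

num = K ** 2 / d - sum(c1 / (T - c0) for c1, c0 in zip(C1, C0))
den = sum(c2 / (T - c0) for c2, c0 in zip(C2, C0))
tau_0 = math.floor(num / den)                        # (32)

# tau_0 is a weighted average of the per-learner caps implied by (20) at
# d_k = d/K, so it lies between the smallest and largest individual cap;
# suggest-and-improve steps then repair any individually violated constraint.
d_k = d / K
caps = [(T - c0 - c1 * d_k) / (c2 * d_k) for c2, c1, c0 in zip(C2, C1, C0)]
print(tau_0, min(caps), max(caps))
```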
V. Simulation Results
In this section, we test our proposed adaptive task allocation solutions in MEL scenarios emulating realistic learning and edge node environments. More specifically, we test the OPTI-based solution to the relaxed version of the ILPQC formulation, the analytical solution from (20) and (21) (UB-Analytical), and the heuristic upper-bound/suggest-and-improve (UB-SAI) solution. We also show the merits of these solutions compared to the equal task allocation (ETA) scheme employed in [12, 13]. We will first introduce the simulation environment, and then present the testing results.
TABLE I: Simulation parameters.

| Parameter | Value |
|---|---|
| Attenuation model | Empirical 802.11 channel model [19] |
| System bandwidth | 100 MHz |
| Bandwidth per node | 5 MHz |
| Device proximity | 50 m |
| Transmission power | 23 dBm |
| Noise power spectral density | −174 dBm/Hz |
| Computation capability | 2.4 GHz and 700 MHz |
| Pedestrian dataset size ($d$) | 9,000 images |
| Pedestrian dataset features ($F$) | 648 (18 × 36) pixels |
| MNIST dataset size ($d$) | 60,000 images |
| MNIST dataset features ($F$) | 784 (28 × 28) pixels |
V-A. Simulation Environment
A typical MEC setting consists of a cloudlet of devices that are heterogeneous in both channel and computing capabilities. In our simulation, the edge nodes are assumed to be located in an area of 50 m radius. Half of the considered nodes emulate the capacity of a typical fixed/portable computing device (e.g., laptops, tablets, roadside units, etc.), and the other half emulate the capacity of commercial micro-controllers (e.g., Raspberry Pi) that can be attached to different indoor or outdoor systems (e.g., smart meters, traffic cameras). The setting thus emulates an edge environment that can be located either indoors or outdoors. The employed channel model between these devices is summarized in Table I, which emulates 802.11-type links between the edge nodes.
Two datasets are considered in our simulations, namely the pedestrian [20] and MNIST [21] datasets. The pedestrian dataset has 9,000 training images consisting of 648 features (18 × 36 pixels). The ML model used for this dataset is a single-layer neural network with 300 neurons in the hidden layer. For this model, the weight matrix is the concatenation of two sub-matrices, neither of which depends on the batch size (i.e., $S_d = 0$). Thus, the size of the model is 6,240,000 bits, which is fixed for all edge nodes. The forward and backward passes require 781,208 floating-point operations [22]. On the other hand, the MNIST dataset consists of 60,000 28 × 28 images (784 features). The employed ML model for this dataset is a 3-layer neural network.

V-B. Simulation Results for Pedestrian Dataset
Fig. 1 shows the number of local iterations $\tau$ achieved by all tested approaches versus the number of edge nodes, for two values of the global cycle clock $T$. We can first notice that $\tau$ increases as the number of edge nodes increases. This is intuitive, as smaller batches are allocated to each of the nodes as their number increases. We can also see that the performances of the OPTI-based, UB-Analytical, and UB-SAI solutions are identical for all simulated numbers of edge nodes and global cycle clocks. For both global cycle clock values, the figure shows that the adaptive task allocation approaches achieve many more local iterations than the ETA approach; for the smaller clock, our proposed solutions achieve a gain of 450% in $\tau$ over ETA. Another interesting result is that the performance of the ETA scheme for the larger clock is actually much lower than that of our proposed solutions for the smaller one. In other words, our scheme can achieve a better level of accuracy than the ETA scheme in half the time.
Fig. 2 illustrates the number of local iterations versus the global cycle clock $T$, for a fixed number of edge nodes. Clearly, increasing $T$ allows all learners more time to perform more local updates. Again, the OPTI-based, UB-Analytical, and UB-SAI approaches perform similarly in all simulated scenarios. Comparing this performance to ETA, the figure shows that, for the smallest considered clock, our solutions can perform around 28 local iterations on 20 edge learners, versus far fewer using ETA, a gain of 420%. As the global cycle clock increases, our scheme can reach up to 138 iterations, whereas the ETA scheme achieves only 30 updates, fewer than the number achieved by our scheme with a 20-second cycle clock.
V-C. Simulation Results for MNIST Dataset
Fig. 3 shows the results for training on the MNIST dataset with a deep neural network model. Similar to the case of the pedestrian dataset, the performances of the OPTI-based, UB-Analytical, and UB-SAI solutions are all identical, and all outperform the ETA scheme. In general, fewer updates are possible compared to the smaller pedestrian dataset and model. However, the optimized approach makes it possible to perform more than 30 updates for 20 nodes with a cycle time of 60 seconds. For the smallest considered setting, the OPTI-based approach for adaptive batch allocation achieves a 400% gain in updates over the 3 updates possible with ETA.
VI. Conclusion
This paper inaugurates the research efforts towards establishing the novel MEL paradigm, enabling the design of performance-guaranteed distributed learning solutions for wireless edge nodes with heterogeneous computing and communication capacities. As a first MEL problem of interest, the paper focused on exploring dynamic task allocation solutions that maximize the number of local learning iterations on the distributed learners (and thus improve the learning accuracy), while abiding by the global cycle clock of the orchestrator. The problem was formulated as an NP-hard ILPQC problem, which was then relaxed into a non-convex problem over real variables. Analytical upper bounds on the relaxed problem's optimal solution were then derived and were found to solve it optimally in all simulated scenarios. For large $K$, we proposed a heuristic solution based on implementing suggest-and-improve steps starting from equal batch allocation. Through extensive simulations using well-known datasets, the proposed solutions were shown to both achieve the same performance as the numerical solvers, and to significantly outperform the ETA approach.
References
 [1] M. Chiang and T. Zhang, “Fog and IoT: An Overview of Research Opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, dec 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7498684/
 [2] Rhea Kelly, "Internet of Things Data To Top 1.6 Zettabytes by 2020 – Campus Technology," 2015. [Online]. Available: https://campustechnology.com/articles/2015/04/15/internet-of-things-data-to-top-1-6-zettabytes-by-2020.aspx
 [3] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A Survey on Mobile Edge Computing: The Communication Perspective,” IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017. [Online]. Available: http://arxiv.org/abs/1701.01090
 [4] C. You and K. Huang, “Mobile Cooperative Computing: EnergyEfficient PeertoPeer Computation Offloading,” pp. 1–33, 2017. [Online]. Available: http://arxiv.org/abs/1704.04595
 [5] M. Liu and Y. Liu, “PriceBased Distributed Offloading for MobileEdge Computing with Computation Capacity Constraints,” IEEE Wireless Communications Letters, pp. 1–4, 2017.
 [6] Y. Li, L. Sun, and W. Wang, “Exploring devicetodevice communication for mobile cloud computing,” 2014 IEEE International Conference on Communications (ICC), pp. 2239 – 44, 2014. [Online]. Available: http://dx.doi.org/10.1109/ICC.2014.6883656
 [7] X. Cao, F. Wang, J. Xu, R. Zhang, and S. Cui, “Joint Computation and Communication Cooperation for Mobile Edge Computing,” in 16th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), 2018. [Online]. Available: http://arxiv.org/abs/1704.06777
 [8] U. Y. Mohammad and S. Sorour, “MultiObjective Resource Optimization for Hierarchical Mobile Edge Computing,” in 2018 IEEE Global Communications Conference: Mobile and Wireless Networks (Globecom2018 MWN), Abu Dhabi, United Arab Emirates, dec 2018.
 [9] D. Li, T. Salonidis, N. V. Desai, and M. C. Chuah, “DeepCham: Collaborative edgemediated adaptive deep learning for mobile object recognition,” Proceedings  1st IEEE/ACM Symposium on Edge Computing, SEC 2016, pp. 64–76, 2016.
 [10] S. Teerapittayanon, B. McDanel, and H. T. Kung, “Distributed Deep Neural Networks over the Cloud, the Edge and End Devices,” Proceedings  International Conference on Distributed Computing Systems, pp. 328–339, 2017.
 [11] C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane, M. Yunsheng, S. Chen, and P. Hou, “A New Deep LearningBased Food Recognition System for Dietary Assessment on An Edge Computing Service Infrastructure,” IEEE Transactions on Services Computing, vol. 11, no. 2, pp. 249–261, 2018.
 [12] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "When Edge Meets Learning: Adaptive Control for Resource-Constrained Distributed Machine Learning," in INFOCOM, 2018. [Online]. Available: https://researcher.watson.ibm.com/researcher/files/uswangshiq/SW_INFOCOM2018.pdf
 [13] T. Tuor, S. Wang, T. Salonidis, B. J. Ko, and K. K. Leung, “Demo abstract: Distributed machine learning at resourcelimited edge nodes,” INFOCOM 2018  IEEE Conference on Computer Communications Workshops, pp. 1–2, 2018.
 [14] L. Bottou and O. Bousquet, “The Tradeoffs of Large Scale Learning,” in Advances in Neural Information Processing Systems, J. C. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. NIPS Foundation (http://books.nips.cc), 2008, vol. 20, pp. 161–168. [Online]. Available: http://leon.bottou.org/papers/bottoubousquet2008
 [15] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. A. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large Scale Distributed Deep Networks," in Advances in Neural Information Processing Systems, 2012, pp. 1–11.

 [16] W. Zhang, S. Gupta, X. Lian, and J. Liu, "Staleness-aware async-SGD for distributed deep learning," in IJCAI International Joint Conference on Artificial Intelligence, 2016, pp. 2350–2356.
 [17] A. D. Pia, S. S. Dey, and M. Molinaro, "Mixed-integer Quadratic Programming is in NP," pp. 1–10, 2014.
 [18] J. Currie and D. I. Wilson, “OPTI: Lowering the Barrier Between Open Source Optimizers and the Industrial MATLAB User,” in Foundations of ComputerAided Process Operations, N. Sahinidis and J. Pinto, Eds., Savannah, Georgia, USA, 2012.
 [19] S. Cebula, A. Ahmad, J. M. Graham, C. V. Hinds, L. A. Wahsheh, A. T. Williams, and S. J. DeLoatch, "Empirical channel model for 2.4 GHz IEEE 802.11 WLAN," in Proceedings of the 2011 International Conference on Wireless Networks, 2011.
 [20] S. Munder and D. M. Gavrila, “An experimental study on pedestrian classification.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1863–1868, 2006. [Online]. Available: https://ieeexplore.ieee.org/document/1704841
 [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased Learning Applied to Document Recognition,” Proceedings of IEEE, vol. 86, no. 11, 1998.

 [22] Brendan Shillingford, "What is the time complexity of backpropagation algorithm for training artificial neural networks?" Quora, 2016. [Online]. Available: https://www.quora.com/What-is-the-time-complexity-of-backpropagation-algorithm-for-training-artificial-neural-networks