Adaptive Task Allocation for Mobile Edge Learning

11/09/2018 ∙ by Umair Mohammad, et al. ∙ University of Idaho

This paper aims to establish a new optimization paradigm for implementing realistic distributed learning algorithms, with performance guarantees, on wireless edge nodes with heterogeneous computing and communication capacities. We will refer to this new paradigm as "Mobile Edge Learning (MEL)". The problem of dynamic task allocation for MEL is considered in this paper with the aim of maximizing the learning accuracy, while guaranteeing that the total times of data distribution/aggregation over heterogeneous channels, and local computing iterations at the heterogeneous nodes, are bounded by a preset duration. The problem is first formulated as a quadratically-constrained integer linear program. Since this problem is NP-hard, the paper relaxes it into a non-convex problem over real variables. We then propose two solutions: the first derives analytical upper bounds on the optimal solution of this relaxed problem using Lagrangian analysis and KKT conditions, and the second applies suggest-and-improve steps starting from equal batch allocation. The merits of these proposed solutions are exhibited by comparing their performance to both numerical approaches and the equal task allocation approach.


I Introduction

I-A Motivation and Background

The accelerated migration towards the era of smart cities mandates the deployment of a large number of Internet-of-Things (IoT) devices, generating exponentially increasing amounts of data at the edge of the network. This data is currently sent to cloud servers for analytics and decision-making that improve the performance of a wide range of systems and services. However, it is expected that the rate and nature of this generated data will prohibit such centralized processing and analytics options. Indeed, the size of the data will surpass the capabilities of current and even future wireless networks and internet backbone to transfer it to cloud data-centers [1]. In addition, the nature of this data and the time-criticality of its processing/analytics will force 90% of the processing to be done locally at edge servers and/or at the edge (mostly mobile) nodes themselves (e.g., smart phones, laptops, monitoring cams, drones, connected and autonomous vehicles, etc.) [2].

The above options of edge processing are supported by recent advancements in the area of mobile edge computing (MEC) [3] in general, and especially collaborative MEC [4, 5, 6] and hierarchical MEC (H-MEC) [7, 8]. While the former enables edge nodes to decide whether to perform their computing tasks locally or offload them to the edge servers, the latter involves offloading tasks among the edge nodes themselves, and possibly the edge servers. Such task offloading decisions are usually made while respecting the heterogeneous spare computing resources and communication capacities of the edge network’s nodes and links, respectively, so as to minimize task completion delays and/or energy consumption.

While almost all works on MEC and H-MEC focused on managing simple independent computing/processing tasks (i.e., tasks initiated separately by each of the edge nodes), H-MEC can be easily extended to enable collaboration among edge nodes, and possibly with remote edge or even cloud servers, in performing distributed processing of the same tasks initiated by one or multiple orchestrating nodes. In this setting, the orchestrating node(s) will distribute computing tasks to the processing edge nodes for local processing and then collect their results to conclude the process. The decisions on the task size allocated to each node can again be made, given the heterogeneous communication capacities and spare computing resources in the edge network, so as to achieve delay guarantees and energy savings.

In addition, most of the previous works on MEC and H-MEC were limited to the level of basic processing technology, which is clearly at odds with the increasing trend of using machine learning (ML) tools for data analytics. ML includes a wide range of techniques, from simple regression to deep neural networks. These techniques give highly accurate results but are computationally expensive and require large amounts of data. Sometimes, the models may even be re-trained online to search for the optimum parameters. Hence, a lot of research has been done on how to parallelize these algorithms by distributing the data over multiple nodes. However, most of these works have considered this procedure at the cloud level and over wired distributed computing environments.

Extending the above distributed learning paradigm to resource-constrained edge servers and nodes was not explored until very recently [9, 10, 11]. The work of [9] explores the possibility of using connected mobile Android devices in a distributed manner to train on collected images in order to improve accuracy. The objective is to be able to train adaptive deep learning models based on the collected images. The authors of [10] propose an exit strategy model for a distributed deep learning network. Based on an accuracy measure, the learning is either done collaboratively between users, at the edge, or on the cloud. In [11], a new deep learning algorithm is designed to recognize food for dietary needs using a mobile device and a server, by splitting the processing between the device and the edge.

As one can observe, the objective of these approaches is to train networks in a distributed manner in order to improve training accuracy. The work of [10] does aim to restrict transmissions to the edge or cloud by defining an acceptable level of accuracy. However, these works do not tackle the distribution of the machine learning task in wireless MECs in a way that simultaneously optimizes resource utilization and maintains an acceptable level of accuracy.

More recently, Tuor et al. [12] aimed to unify the number of local learning iterations in resource-constrained edge environments in order to maximize accuracy. The proposed approach jointly optimized the number of local learning and global update cycles at the learning edge servers/nodes (or learners for short) and orchestrator, respectively. To deal with the problem of “deviating gradients”, a newer approach is suggested in [13] to improve the frequency of global updates. However, all the above works investigating distributed ML on resource-constrained wireless devices assumed equal distribution of task sizes to all edge nodes/servers, thus ignoring the typical heterogeneity in computing and communication capacities of different nodes and links, respectively. The implications of such computing and communication heterogeneity on optimizing the task allocation to different learners, selecting learning models, improving learning accuracy, minimizing local and global cycle times, and/or minimizing energy consumption are clearly game-changing, yet have never been investigated.

I-B Contribution

To the best of the authors’ knowledge, this work is the first attempt to develop realistic distributed learning algorithms, with performance guarantees, on cloudlet(s) of heterogeneous wireless edge nodes. We will refer to this new paradigm as “Mobile Edge Learning (MEL)”. To achieve the above MEL end-goal, new research tracks need to be developed to explore the interplay and joint optimization of learning model selection, learning accuracy, task allocation, resource provisioning, node selection/arrangements, local/global cycle times, and energy consumption. This paper will inaugurate this MEL research by considering the problem of dynamic task allocation for distributed learning over heterogeneous wireless edge learners (i.e., edge nodes with heterogeneous computing capabilities and heterogeneous wireless links to the orchestrator). This task allocation will be conducted so as to maximize the learning accuracy, while guaranteeing that the total times of data distribution/aggregation over heterogeneous channels, and local computing iterations at the heterogeneous nodes, are bounded by a duration preset by the orchestrator. The maximization of the learning accuracy is achieved by maximizing the number of local learning iterations per global update cycle [12].

To this end, the problem is first formulated as a quadratically-constrained integer linear program. Since this problem is NP-hard, the paper relaxes it to a non-convex problem over real variables. Analytical upper bounds on the optimal solution of this relaxed problem are derived using Lagrangian analysis and KKT conditions. The proposed algorithm thus starts from these computed bounds, and then runs suggest-and-improve steps to reach a feasible integer solution. For a large number of learners, we also propose a heuristic solution based on implementing suggest-and-improve steps starting from equal batch allocation. The merits of these proposed algorithms are finally exhibited through extensive simulations, comparing their performance to both numerical solutions and the equal task allocation approach of [12, 13].

II System Model for MEL

II-A Distributed Learning Background

Distributed learning is defined by the operation of running one machine learning task on one global dataset $\mathcal{D}$ over a system of $K$ learners. Learner $k$, $k \in \{1, \dots, K\}$, trains its local learning model on a subset $\mathcal{D}_k$ of the global dataset, where $\bigcup_{k=1}^{K} \mathcal{D}_k = \mathcal{D}$. We will refer to each of these dataset subsets as a batch. The number of samples in each batch is denoted by $d_k = |\mathcal{D}_k|$, and the size of the global dataset is denoted by $d = |\mathcal{D}|$.

In machine learning (ML), the loss function is typically defined by $F(\mathbf{x}_j, y_j, \mathbf{w})$, where $j$ represents one sample of batch $\mathcal{D}_k$, $\mathbf{x}_j$ is the vector of features or observations, $y_j$ is the associated label or class to which sample $j$ belongs, and $\mathbf{w}$ is the parameter matrix of the employed learning approach. For most ML algorithms, the parameter matrix consists of weight vectors. For instance, it consists of the weights and biases in neural networks. Since the observation/feature vectors and labels are not variable, the loss function can be concisely denoted by $F_j(\mathbf{w})$.

Given the above notation, the local loss function at each learner $k$ can be given by:

$$F_k(\mathbf{w}) = \frac{1}{d_k} \sum_{j \in \mathcal{D}_k} F_j(\mathbf{w}) \qquad (1)$$

The global loss function at the orchestrator is thus computed by aggregating all learners’ loss functions as follows:

$$F(\mathbf{w}) = \frac{1}{d} \sum_{k=1}^{K} d_k F_k(\mathbf{w}) \qquad (2)$$

The objective of the orchestrator in any distributed learning setting is to minimize the global loss function over the parameter matrix, which can be expressed as:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} F(\mathbf{w}) \qquad (3)$$

The optimal solution for this type of problem is generally difficult to obtain analytically. Consequently, these problems are solved using gradient descent (GD) or stochastic gradient descent (SGD) methods, depending on the employed batch distribution approach among learners. By applying the GD or SGD method at the $k$-th learner, the local parameter matrix $\mathbf{w}_k$ is thus updated at the $l$-th local iteration as follows:

$$\mathbf{w}_k[l] = \mathbf{w}_k[l-1] - \eta \nabla F_k(\mathbf{w}_k[l-1]) \qquad (4)$$

where $\eta$ is the learning rate (step size).

Clearly, the local parameter matrices differ from one another during local update iterations, because they are processing different batches. After $\tau$ local iterations, the learners send these local parameter matrices to the orchestrator, which can then re-compute an updated unique parameter matrix at the global update cycle with some kind of averaging such as:

$$\mathbf{w} = \frac{1}{d} \sum_{k=1}^{K} d_k \mathbf{w}_k[\tau] \qquad (5)$$
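The local-update and global-averaging loop described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: a scalar parameter, a squared-error loss, full-batch gradient descent on each local batch, and a batch-size-weighted average at the orchestrator. The function names and the toy data are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (feature x, label y); toy scalar model y ~ w * x


def local_gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the local loss F_k(w) = mean((w * x - y)^2) over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def local_updates(w: float, batch: List[Sample], tau: int, eta: float) -> float:
    """Run tau local gradient-descent iterations starting from the global w."""
    for _ in range(tau):
        w -= eta * local_gradient(w, batch)
    return w


def global_cycle(w: float, batches: List[List[Sample]], tau: int, eta: float) -> float:
    """One global cycle: tau local updates per learner, then a weighted average."""
    d = sum(len(batch) for batch in batches)
    local_ws = [local_updates(w, batch, tau, eta) for batch in batches]
    return sum(len(batch) * w_k for batch, w_k in zip(batches, local_ws)) / d


# Toy data generated by y = 3x, split unevenly over two learners (assumption).
batches = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5)]]
w = 0.0
for _ in range(50):
    w = global_cycle(w, batches, tau=5, eta=0.05)
```

With this toy data, `w` approaches the generating slope of 3. Increasing `tau` lets every learner make more progress between global updates, which is the quantity the rest of the paper seeks to maximize.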

II-B Transition to MEL System Model

The modelling and formulation presented above [12] do not lend themselves well to wireless or heterogeneous edge learner environments, where learners have different computing capabilities and various channel qualities to the orchestrator. To migrate the general distributed learning model to the MEL paradigm, we will redefine the distributed learning model in an MEC/H-MEC context.

Consider an edge orchestrator (e.g., an edge server or even one of the edge nodes) that wants to perform a distributed learning task on a specific dataset over $K$ heterogeneous wireless edge learners. In each global cycle, it thus sends to each learner $k$ a batch $\mathcal{D}_k$ of $d_k$ random samples from this dataset, together with the initial parameter matrix $\mathbf{w}$.

Footnote 1: This model assumes randomized batch allocations to learners in each global cycle, and an SGD update approach to compute the gradients. This choice is justified by the generality of this approach (i.e., the deterministic GD approach does not need to send data in every cycle) and its proven superior learning accuracy performance [14].

The batch $\mathcal{D}_k$, allocated to learner $k$, is assumed to have a size of $B_k^{data}$ bits, which can be computed as follows:

$$B_k^{data} = d_k F P_d \qquad (6)$$

where $F$ is the number of features in the dataset, and $P_d$ is the data bit precision. For example, the MNIST dataset has 60,000 images of size 28x28 stored as 8-bit unsigned integers, and thus $d F P_d = 60000 \times 784 \times 8 = 376.32$ Mbits. On the other hand, the parameter matrix is assumed to be of size $B_k^{model}$ bits, which can be expressed for learner $k$ as:

$$B_k^{model} = P_m (d_k S_d + S_m) \qquad (7)$$

where $P_m$ is the model bit precision (typically floating point precision). As shown in the above equation, the parameter matrix size consists of two parts, one depending on the batch size (represented by the term $d_k S_d$, where $S_d$ is the number of model coefficients related to each sample of the batch), and the other related to the constant size of the employed ML model (denoted by $S_m$).
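As a quick arithmetic check of the two size expressions (6) and (7), the following sketch computes the number of bits in a data batch and in the parameter matrix. The 8-bit data precision, the 32-bit model precision, and the count of 195,000 model coefficients are assumptions chosen for illustration; the last one is picked so that the result matches the 6,240,000-bit pedestrian model size quoted in the simulation section.

```python
def batch_bits(d_k: int, num_features: int, data_precision_bits: int) -> int:
    """Size in bits of a batch of d_k samples, as in (6)."""
    return d_k * num_features * data_precision_bits


def model_bits(d_k: int, coeffs_per_sample: int, model_coeffs: int,
               model_precision_bits: int) -> int:
    """Size in bits of the parameter matrix, as in (7)."""
    return model_precision_bits * (d_k * coeffs_per_sample + model_coeffs)


# Whole MNIST training set: 60,000 images of 28 x 28 pixels, 8 bits per pixel.
mnist_bits = batch_bits(60_000, 28 * 28, 8)

# A model whose size does not depend on the batch (coeffs_per_sample = 0).
# 195,000 coefficients at 32 bits each is an assumed breakdown.
fixed_model_bits = model_bits(d_k=1_000, coeffs_per_sample=0,
                              model_coeffs=195_000, model_precision_bits=32)
```

Because `coeffs_per_sample` is zero in the second call, the result is the same for any batch size, which is the situation described for the neural network models used later in the simulations.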

As mentioned above, the orchestrator sends the concatenation of the aforementioned data and model bits to learner $k$ with power $P_{ko}$ over a wireless channel with bandwidth $W$ and complex channel power gain $h_{ko}$. Once this information is received by each learner $k$, it sets its local parameter matrix $\mathbf{w}_k = \mathbf{w}$. It then performs $\tau$ local update cycles on this local parameter matrix, using its allocated batch $\mathcal{D}_k$. For each iteration in typical ML, the algorithm sequentially goes over all features once for each data sample. Consequently, the number of computations required per iteration is equal to:

$$X_k = d_k C_m \qquad (8)$$

which clearly depends on the number of data samples $d_k$ assigned to each node and the computational complexity $C_m$ of the model. Although the data may be stored as other types, the operations themselves are typically floating point operations.

The updated $\mathbf{w}_k$ at each learner is then transmitted back to the orchestrator with the same power on the same channel. The server will then update the global parameter matrix as described in (5). Once done, it sends back this updated matrix with a batch of new random samples from the dataset to each learner, and the process repeats.

To this end, define the time $T$ as the duration initiated when the orchestrator starts allocating and sending batches to the learners in each global cycle, until the orchestrator starts this cycle’s global update operation. We will refer to the time $T$ as the global cycle clock (to differentiate it from the total global cycle duration consisting of the global cycle clock and the global update processing time). Clearly, this duration should encompass the time needed by the three processes of the distributed processing in each global cycle:

  1. The time $t_k^S$ needed to send the allocated batch and global parameter matrix to learner $k$. Defining the transmission rate between the orchestrator and learner $k$ by $R_k = W \log_2 \left(1 + \frac{P_{ko} h_{ko}}{N_0}\right)$, the time $t_k^S$ can be expressed for learner $k$ as:

    $$t_k^S = \frac{d_k F P_d + P_m (d_k S_d + S_m)}{W \log_2 \left(1 + \frac{P_{ko} h_{ko}}{N_0}\right)} \qquad (9)$$

    where $N_0$ is the noise power spectral density, and $F$, $P_d$, $P_m$, $S_d$, and $S_m$ are as defined for (6) and (7).

  2. $\tau$ times the time $t_k^C$ needed to perform one local update iteration at learner $k$. Defining $f_k$ as learner $k$’s local processor frequency dedicated to updating the parameter matrix $\mathbf{w}_k$, the time $t_k^C$ can be expressed for learner $k$ as:

    $$t_k^C = \frac{X_k}{f_k} = \frac{d_k C_m}{f_k} \qquad (10)$$

  3. The time $t_k^R$ needed to receive the updated local parameter matrix from learner $k$. Assuming the channel between learner $k$ and the orchestrator is reciprocal and does not change during the duration of one global update cycle, the time $t_k^R$ can be computed for learner $k$ as:

    $$t_k^R = \frac{P_m (d_k S_d + S_m)}{W \log_2 \left(1 + \frac{P_{ko} h_{ko}}{N_0}\right)} \qquad (11)$$

Thus, the total time $t_k$ taken by learner $k$ to complete the above three processes is equal to:

$$t_k = t_k^S + \tau t_k^C + t_k^R = \frac{d_k F P_d + 2 P_m (d_k S_d + S_m)}{W \log_2 \left(1 + \frac{P_{ko} h_{ko}}{N_0}\right)} + \tau \frac{d_k C_m}{f_k} \qquad (12)$$

We will refer to the time $t_k$ of learner $k$ as its round-trip distributed processing duration. Clearly, $t_k$ must be smaller than or equal to the global cycle clock $T$ for the orchestrator to have all the needed information to perform its global cycle update processing in time.
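The three components of the round-trip distributed processing duration can be computed as in the sketch below. It takes the link rate as an input so that no particular signal-to-noise model is imposed; the helper `shannon_rate`, all parameter names, and the example values (an SNR of 1000, a batch of 450 samples, 10 local iterations, 8-bit data, 32-bit model coefficients, 195,000 coefficients) are assumptions made for illustration. Computation time is taken as the number of operations divided by the processor frequency.

```python
import math


def shannon_rate(bandwidth_hz: float, snr: float) -> float:
    """Achievable rate in bits per second for a linear (not dB) SNR."""
    return bandwidth_hz * math.log2(1.0 + snr)


def round_trip_time(d_k: int, tau: int, *, F: int, P_d: int, P_m: int,
                    S_d: int, S_m: int, C_m: float, f_k: float,
                    rate: float) -> float:
    """Send time + tau local iterations + receive time for one learner."""
    data_bits = d_k * F * P_d                  # batch size in bits
    model_bits = P_m * (d_k * S_d + S_m)       # parameter matrix size in bits
    t_send = (data_bits + model_bits) / rate   # orchestrator -> learner
    t_compute = d_k * C_m / f_k                # one local update iteration
    t_receive = model_bits / rate              # learner -> orchestrator
    return t_send + tau * t_compute + t_receive


# Hypothetical learner: 5 MHz link, 2.4 GHz processor, pedestrian-like model.
rate = shannon_rate(bandwidth_hz=5e6, snr=1000.0)
t_k = round_trip_time(d_k=450, tau=10, F=648, P_d=8, P_m=32, S_d=0,
                      S_m=195_000, C_m=781_208, f_k=2.4e9, rate=rate)
```

A learner's allocation is acceptable for a given cycle only if the returned value does not exceed the global cycle clock.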

III Problem Formulation

As mentioned in Section I, the objective of this first paper on MEL is to optimize the task allocation (i.e., the distributed batch size $d_k$ for each learner $k$) so as to maximize the accuracy of the distributed learning process in each global cycle (and thus eventually the accuracy of the entire learning process), within a global cycle clock $T$ preset by the orchestrator. It is well established in the literature that the loss function in general ML, and the local and global loss functions in distributed learning (expressed in (1) and (2)), using GD or SGD, are minimized by increasing the number of learning iterations [15]. For distributed learning, this is equivalent to maximizing the number of local iterations $\tau$ in each global cycle [16]. Thus, maximizing the MEL accuracy is achieved by maximizing $\tau$.

Given the above model and facts, our objective can thus be re-worded as optimizing the batch size $d_k$ assigned to each of the learners so as to maximize the number of local iterations $\tau$ per global update cycle, while bounding the round-trip distributed processing duration $t_k$ by the preset global cycle clock $T$. The optimization variables in this problem are thus $\tau$ and $d_k$, $k \in \{1, \dots, K\}$. We can thus re-write the expression of $t_k$ in (12) as a function of the optimization variables as follows:

$$t_k = C_k^2 \tau d_k + C_k^1 d_k + C_k^0 \qquad (13)$$

where $C_k^2$, $C_k^1$, and $C_k^0$ represent the quadratic, linear, and constant coefficients of learner $k$ in terms of the optimization variables $\tau$ and $d_k$, expressed as:

$$C_k^2 = \frac{C_m}{f_k} \qquad (14)$$
$$C_k^1 = \frac{F P_d + 2 P_m S_d}{R_k} \qquad (15)$$
$$C_k^0 = \frac{2 P_m S_m}{R_k} \qquad (16)$$

Here $R_k$ denotes the transmission rate between the orchestrator and learner $k$ that appears in the denominators of (9) and (11).
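The grouping of the round-trip time into quadratic, linear, and constant coefficients can be checked numerically. The sketch below is an illustration only: the function and parameter names are hypothetical, the link rate is passed in directly, and the example numbers are arbitrary.

```python
from typing import Tuple


def coefficients(*, F: int, P_d: int, P_m: int, S_d: int, S_m: int,
                 C_m: float, f_k: float, rate: float) -> Tuple[float, float, float]:
    """Return (c2, c1, c0) for one learner: quadratic, linear, constant terms."""
    c2 = C_m / f_k                          # seconds per sample per iteration
    c1 = (F * P_d + 2 * P_m * S_d) / rate   # transfer seconds per sample
    c0 = 2 * P_m * S_m / rate               # fixed model transfer seconds
    return c2, c1, c0


def time_from_coefficients(d_k: int, tau: int, c2: float, c1: float,
                           c0: float) -> float:
    """Round-trip duration written as c2 * tau * d_k + c1 * d_k + c0."""
    return c2 * tau * d_k + c1 * d_k + c0


# Arbitrary example values (assumptions).
c2, c1, c0 = coefficients(F=2, P_d=8, P_m=32, S_d=0, S_m=100,
                          C_m=1000.0, f_k=1e6, rate=1e6)
t_k = time_from_coefficients(d_k=10, tau=3, c2=c2, c1=c1, c0=c0)
```

The only term that couples the two optimization variables is the product of the number of local iterations and the batch size, which is what makes the time constraint quadratic.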

Clearly, the relationship between $t_k$ and the optimization variables $\tau$ and $d_k$ is quadratic. Furthermore, the optimization variables $\tau$ and $d_k$, $k \in \{1, \dots, K\}$, are all non-negative integers. Consequently, the problem of interest in this paper can be formulated as an integer linear program with quadratic and linear constraints as follows:

$$\max_{\tau, \, d_k \, \forall k} \; \tau \qquad (17a)$$
$$\text{s.t.} \quad C_k^2 \tau d_k + C_k^1 d_k + C_k^0 \le T, \quad k = 1, \dots, K \qquad (17b)$$
$$\sum_{k=1}^{K} d_k = d \qquad (17c)$$
$$\tau \in \mathbb{Z}_+ \qquad (17d)$$
$$d_k \in \mathbb{Z}_+, \quad k = 1, \dots, K \qquad (17e)$$

Constraint (17b) ensures that the round-trip distributed processing times do not exceed $T$. Constraint (17c) guarantees that the sum of the sizes of the batches assigned to all learners is equal to the entire dataset size that the orchestrator needs to analyze. Constraints (17d) and (17e) restrict the number of local iterations and the assigned batch sizes to non-negative integers.

The above problem is an integer linear program with quadratic constraints (ILPQC), which is well-known to be NP-hard [17]. We will thus propose a simpler solution to it through relaxation of the integer constraints in the next section.

IV Proposed Solution

IV-A Problem Relaxation

As shown in the previous section, the problem of interest in this paper is NP-hard due to its integer decision variables. We thus propose to simplify the problem by relaxing the integer constraints in (17d) and (17e), solving the relaxed problem, then rounding the obtained real results back into integers. The relaxed problem can thus be given by:

$$\max_{\tau, \, d_k \, \forall k} \; \tau \qquad (18a)$$
$$\text{s.t.} \quad C_k^2 \tau d_k + C_k^1 d_k + C_k^0 \le T, \quad k = 1, \dots, K \qquad (18b)$$
$$\sum_{k=1}^{K} d_k = d \qquad (18c)$$
$$\tau \ge 0 \qquad (18d)$$
$$d_k \ge 0, \quad k = 1, \dots, K \qquad (18e)$$

where the cases of $\tau$ and/or all $d_k$’s being zero represent scenarios where MEL will not be feasible, and thus the orchestrator must send the learning tasks to the edge or cloud server. The above resulting program becomes a linear program with quadratic constraints. This problem can be solved by using interior-point or ADMM methods, and there are efficient solvers (such as OPTI) that implement these approaches [18].

On the other hand, the associated matrices for each of the quadratic constraints in the relaxed problem can be written in a symmetric form. However, these matrices will have only two non-zero entries, which are positive, equal to each other, and located off the diagonal (each equal to $C_k^2 / 2$ for constraint $k$). Since the diagonal is zero, the eigenvalues sum to zero, so one of them is negative, which means these matrices are not positive semi-definite, and hence the relaxed problem is non-convex. Consequently, we cannot derive the optimal solution of this problem analytically. Yet, we can still derive upper bounds on the optimal variables and optimal solution using Lagrangian relaxation and KKT conditions.
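The non-convexity argument can be illustrated numerically. The sketch below restricts one time constraint to its two coupled variables (the number of local iterations and one batch size), computes the eigenvalues of the symmetric matrix of the bilinear term in closed form, and then exhibits two feasible points whose midpoint is infeasible. The coefficient values and the helper names are assumptions chosen to keep the arithmetic simple.

```python
import math
from typing import Tuple


def symmetric_2x2_eigenvalues(a: float, b: float, d: float) -> Tuple[float, float]:
    """Eigenvalues (smallest first) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


def constraint_value(tau: float, d_k: float, c2: float, c1: float, c0: float) -> float:
    """Left-hand side of the time constraint for one learner."""
    return c2 * tau * d_k + c1 * d_k + c0


# The bilinear term c2 * tau * d_k equals z^T Q z with z = (tau, d_k) and
# Q = [[0, c2 / 2], [c2 / 2, 0]].
c2 = 1e-3
low, high = symmetric_2x2_eigenvalues(0.0, c2 / 2.0, 0.0)

# Two feasible points for tau * d_k <= 1 (c2 = 1, c1 = c0 = 0, T = 1) ...
T = 1.0
p, q = (0.5, 2.0), (2.0, 0.5)
midpoint = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
p_feasible = constraint_value(p[0], p[1], 1.0, 0.0, 0.0) <= T
q_feasible = constraint_value(q[0], q[1], 1.0, 0.0, 0.0) <= T
# ... whose midpoint violates the same constraint.
midpoint_feasible = constraint_value(midpoint[0], midpoint[1], 1.0, 0.0, 0.0) <= T
```

One eigenvalue is negative and the other positive, so the matrix is indefinite, and the feasible set fails the midpoint test that any convex set must pass.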

The philosophy of our proposed solution is thus to calculate these upper bounds on the optimal variables, then implement suggest-and-improve steps until a feasible integer solution is reached. The next subsection derives these upper bounds by applying Lagrangian analysis and KKT conditions to the relaxed non-convex problem, and the one after it presents a heuristic starting point for a large number of learners.

IV-B Upper Bounds using the KKT conditions

The Lagrangian function of the relaxed problem is expressed as:

(19)

where the ’s , /, and / , are the Lagrangian multipliers associated with the time constraints of the learners in (18b), the total batch size constraint in (18c), and the non-negative constraints of all the optimization variables in (18d) and (18e), respectively. Note that the equality constraint in (18c) can be represented in the form of two inequality constraints and , and thus is associated with two Lagrangian multipliers and . Using the well-known KKT conditions, the following theorem introduces upper bounds on the optimal variables of the relaxed problem.

Theorem 1

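The optimal values of the allocated batch sizes to different users in the relaxed problem satisfy the following bound:

$$d_k^* \le \frac{T - C_k^0}{\tau^* C_k^2 + C_k^1} = \frac{a_k}{\tau^* + b_k}, \quad k = 1, \dots, K \qquad (20)$$

Moreover, the analytical upper bound on the optimization variable $\tau$ belongs to the solution set of the polynomial given by:

$$d \prod_{k=1}^{K} (\tau^* + b_k) - \sum_{k=1}^{K} a_k \prod_{l=1, \, l \ne k}^{K} (\tau^* + b_l) = 0 \qquad (21)$$

where $a_k = \frac{T - C_k^0}{C_k^2}$, and $b_k = \frac{C_k^1}{C_k^2}$, with $C_k^2$, $C_k^1$, and $C_k^0$ being the quadratic, linear, and constant coefficients of learner $k$ in the time constraint.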

Proof: From the KKT optimality conditions, we have the following relations given by the following equations:

(22)
(23)
(24)
(25)
(26)
(27)
(28)

From the conditions in (22), we can see that the batch size at user must satisfy (20). Moreover, it can be inferred from (24) that the bound in (20) holds with equality for having .

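In addition, it is clear from (25) that either $\tau^* = 0$ (which means that there is no feasible MEL solution and the orchestrator must offload the entire task to the edge/cloud servers) or the multiplier associated with the non-negativity constraint on $\tau$ is zero (which we get if the problem is feasible). By re-writing the bound on $d_k^*$ in (20) as an equality and taking the sum over all $k$, we have the following relation:

$$d = \sum_{k=1}^{K} d_k^* = \sum_{k=1}^{K} \frac{T - C_k^0}{\tau^* C_k^2 + C_k^1} = \sum_{k=1}^{K} \frac{a_k}{\tau^* + b_k} \qquad (29)$$

with $a_k = \frac{T - C_k^0}{C_k^2}$ and $b_k = \frac{C_k^1}{C_k^2}$. The expression on the right-most hand-side has the form of a partial fraction expansion of a rational polynomial function of $\tau^*$. Therefore, we can expand it in the following way:

$$\sum_{k=1}^{K} \frac{a_k}{\tau^* + b_k} = \frac{a_1}{\tau^* + b_1} + \frac{a_2}{\tau^* + b_2} + \dots + \frac{a_K}{\tau^* + b_K} \qquad (30)$$

Finally, the expanded form can be cleaned up into a rational function of $\tau^*$, which is equal to the total dataset size $d$:

$$\frac{\sum_{k=1}^{K} a_k \prod_{l=1, \, l \ne k}^{K} (\tau^* + b_l)}{\prod_{k=1}^{K} (\tau^* + b_k)} = d \qquad (31)$$

Please note that the degrees of the numerator and denominator will be $K-1$ and $K$, respectively. Furthermore, the poles of the system will be $-b_k$, and, since $b_k > 0$, the system will be stable. Furthermore, $\tau^* = -b_k$ is not a feasible solution for the problem, because it is eliminated by the $\tau \ge 0$ constraint. Therefore, we can re-write (31) as shown in (21). By solving this polynomial, we obtain a set of $K$ solutions for $\tau^*$, one of which is feasible. The problem being non-convex, this feasible solution will constitute the upper bound on the solution of the relaxed problem.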

Though it was expected that the above bounds for $d_k$ in (20) and the solution for $\tau$ from (21) would have to undergo suggest-and-improve steps to reach a feasible solution, we have found in our extensive simulations presented in Section V that these expressions were always already feasible. This means that no suggest-and-improve steps were needed, and that the expressions can be directly used for batch allocation to achieve the optimal $\tau$ for the relaxed problem.
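The analytical upper bound can be computed numerically without expanding the polynomial. Writing the bound in (20) with equality and summing over learners gives a total batch size that is strictly decreasing in the number of local iterations whenever the per-learner ratios are positive, so the non-negative root can be found by bisection. The sketch below swaps bisection in for polynomial root-finding; it is not the authors' procedure. The function name, the per-learner coefficient lists `c2`, `c1`, `c0`, and the toy values are assumptions.

```python
from typing import List, Tuple


def relaxed_upper_bound(c2: List[float], c1: List[float], c0: List[float],
                        T: float, d: float,
                        iterations: int = 200) -> Tuple[float, List[float]]:
    """Solve sum_k a_k / (tau + b_k) = d for tau >= 0 and return (tau, batches)."""
    a = [(T - c0k) / c2k for c0k, c2k in zip(c0, c2)]
    b = [c1k / c2k for c1k, c2k in zip(c1, c2)]
    if min(a) <= 0.0 or min(b) <= 0.0:
        raise ValueError("each learner needs T > c0 and a positive c1")

    def total_batch(tau: float) -> float:
        return sum(ak / (tau + bk) for ak, bk in zip(a, b))

    if total_batch(0.0) < d:
        raise ValueError("dataset cannot be processed within T even for tau = 0")

    low, high = 0.0, 1.0
    while total_batch(high) > d:        # grow the bracket until it contains the root
        high *= 2.0
    for _ in range(iterations):         # plain bisection on a decreasing function
        mid = 0.5 * (low + high)
        if total_batch(mid) > d:
            low = mid
        else:
            high = mid
    tau = 0.5 * (low + high)
    return tau, [ak / (tau + bk) for ak, bk in zip(a, b)]


# Toy example with two learners, the second twice as slow as the first.
tau_star, batches = relaxed_upper_bound([1.0, 2.0], [1.0, 2.0], [0.0, 0.0],
                                        T=12.0, d=5.0)
```

In this toy case both learners finish exactly at the cycle clock, and the faster learner receives the larger share of the dataset.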

IV-C Heuristic Method for Large $K$

We can see that obtaining the above solution for $\tau$ in (21) requires solving a $K$-th order polynomial, which may be computationally expensive for large $K$. An alternative for such large settings is as follows. We can easily infer that the two extremes of batch allocation in the original problem are either:

  1. $K-1$ nodes are offloaded one data sample each and the remaining node takes $d-K+1$ samples. In this case, the sum of the reciprocals of batch sizes will be $K-1$ plus a negligible value. As the number of nodes increases ($K$ will be large), $\sum_{k=1}^{K} \frac{1}{d_k}$ will be unbounded.

  2. Equal batch allocation where each node processes $d/K$ samples. In this case, the sum of the reciprocals of batch sizes will be $K^2/d$.

From the above cases, one option for large $K$ would thus be to start the solution from an equal batch allocation setting, and use the proposed suggest-and-improve steps from this starting point. By setting $\sum_{k=1}^{K} \frac{1}{d_k} = \frac{K^2}{d}$, summing the reciprocals of $d_k$ in (20) taken with equality, and solving for $\tau$, we can thus use the following starting point for $\tau$:

$$\tau = \frac{\frac{K^2}{d} - \sum_{k=1}^{K} \frac{C_k^1}{T - C_k^0}}{\sum_{k=1}^{K} \frac{C_k^2}{T - C_k^0}} \qquad (32)$$
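The heuristic can be sketched as a starting value followed by a simple repair loop. The paper does not spell out the suggest-and-improve steps, so the loop below is an assumed variant: it starts from the equal-allocation estimate, raises the number of local iterations while an integer allocation still fits within the cycle clock, lowers it while no allocation fits, and then trims the per-learner capacities down to the dataset size. All names and the toy values are hypothetical, and every learner is assumed to have a positive linear coefficient and a constant coefficient below the cycle clock.

```python
import math
from typing import List, Tuple


def equal_allocation_start(c2: List[float], c1: List[float], c0: List[float],
                           T: float, d: int) -> float:
    """Starting tau obtained by equating the sum of 1/d_k to K^2 / d."""
    K = len(c2)
    numerator = K * K / d - sum(c1k / (T - c0k) for c1k, c0k in zip(c1, c0))
    denominator = sum(c2k / (T - c0k) for c2k, c0k in zip(c2, c0))
    return max(numerator / denominator, 0.0)


def capacities(tau: int, c2: List[float], c1: List[float], c0: List[float],
               T: float) -> List[int]:
    """Largest integer batch each learner can finish within T for a given tau."""
    return [max(math.floor((T - c0k) / (c2k * tau + c1k)), 0)
            for c2k, c1k, c0k in zip(c2, c1, c0)]


def suggest_and_improve(c2: List[float], c1: List[float], c0: List[float],
                        T: float, d: int) -> Tuple[int, List[int]]:
    """Return an integer (tau, batches) that meets every time constraint."""
    tau = math.floor(equal_allocation_start(c2, c1, c0, T, d))
    while sum(capacities(tau + 1, c2, c1, c0, T)) >= d:   # improve
        tau += 1
    while tau > 0 and sum(capacities(tau, c2, c1, c0, T)) < d:  # repair
        tau -= 1
    batches = capacities(tau, c2, c1, c0, T)
    surplus = sum(batches) - d
    if surplus < 0:
        raise ValueError("no feasible allocation within the cycle clock")
    i = 0
    while surplus > 0:                  # trim capacities down to exactly d samples
        if batches[i % len(batches)] > 0:
            batches[i % len(batches)] -= 1
            surplus -= 1
        i += 1
    return tau, batches


tau, batches = suggest_and_improve([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], T=12.0, d=5)
```

For this toy input the starting estimate is already within one step of the final integer value, so the repair loops do very little work.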

V Simulation Results

In this section, we test our proposed adaptive task allocation solutions in MEL scenarios emulating realistic learning and edge node environments. More specifically, the OPTI-based solution to the relaxed version of the ILPQC formulation, the analytical results from (20) and (21) (UB-Analytical), and the heuristic upper-bound/suggest-and-improve (UB-SAI) solution are tested. We also show the merits of these solutions compared to the equal task allocation (ETA) scheme employed in [12, 13]. We will first introduce the simulation environment, and then present the testing results.
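For reference, the ETA baseline can be sketched under the same quadratic time model: every learner receives an equal share of the dataset, so the number of local iterations is limited by the slowest learner. The function name, the per-learner coefficient lists, and the toy values are assumptions for illustration.

```python
import math
from typing import List


def eta_iterations(c2: List[float], c1: List[float], c0: List[float],
                   T: float, d: int) -> int:
    """Largest integer tau such that every learner finishes d / K samples within T."""
    share = d / len(c2)
    per_learner = [(T - c0k - c1k * share) / (c2k * share)
                   for c2k, c1k, c0k in zip(c2, c1, c0)]
    return max(math.floor(min(per_learner)), 0)


# Two learners, the second twice as slow in both computing and communication.
tau_eta = eta_iterations([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], T=12.0, d=4)
```

In this toy case the fast learner alone could run five iterations on its share, but the common value is held to what the slow learner can manage.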

| Parameter | Value |
| --- | --- |
| Attenuation Model | dB [19] |
| System Bandwidth | 100 MHz |
| Node Bandwidth | 5 MHz |
| Device proximity | 50 m |
| Transmission Power | 23 dBm |
| Noise Power Density | -174 dBm/Hz |
| Computation Capability | 2.4 GHz and 700 MHz |
| Pedestrian Dataset size ($d$) | 9,000 images |
| Pedestrian Dataset Features | 648 (18 x 36) pixels |
| MNIST Dataset size ($d$) | 60,000 images |
| MNIST Dataset Features | 784 (28 x 28) pixels |

TABLE I: List of simulation parameters

V-A Simulation Environment

A typical MEC will consist of a cloudlet of devices that are heterogeneous in both channel quality and computing capability. In our simulation, the edge nodes are assumed to be located in an area with a radius of 50 m. Half of the considered nodes emulate the capacity of a typical fixed/portable computing device (e.g., laptops, tablets, road-side units, etc.) and the other half emulate the capacity of commercial micro-controllers (e.g., Raspberry Pi) that can be attached to different indoor or outdoor systems (e.g., smart meters, traffic cameras). The setting thus emulates an edge environment that can be located either indoors or outdoors. The employed channel model between these devices is summarized in Table I, which emulates 802.11-type links between the edge nodes.

Two datasets are considered in our simulations, namely the pedestrian [20] and MNIST [21] datasets. The pedestrian dataset has 9,000 training images consisting of 648 features (18 x 36 pixels). The ML model used for this dataset is a single-layer neural network with 300 neurons in the hidden layer. For this model, the weight matrix is the concatenation of two sub-matrices, neither of which depends on the batch size ($S_d = 0$). Thus, the size of the model is 6,240,000 bits, which is fixed for all edge nodes. The forward and backward passes will require 781,208 floating point operations [22]. On the other hand, the MNIST dataset consists of 60,000 images of 28x28 pixels (784 features). The employed ML model for this data is a 3-layer neural network.

V-B Simulation Results for Pedestrian Dataset

Fig. 1: Performance comparison of all schemes (number of local iterations $\tau$ versus number of edge nodes $K$) for two values of the global cycle clock $T$.

Fig. 1 shows the number of local iterations $\tau$ achieved by all tested approaches versus the number of edge nodes $K$, for two values of the global cycle clock $T$. We can first notice that $\tau$ increases as the number of edge nodes increases. This is expected, as smaller batches are allocated to each node as the number of nodes increases. We can also see that the performance of the OPTI-based, UB-Analytical, and UB-SAI solutions is identical for all simulated numbers of edge nodes and global cycle clocks. For both global cycle clock cases, the figure shows that the adaptive task allocation approaches result in many more local iterations than the ETA approach, with a gain of 450% in one of the simulated settings. Another interesting result is that the performance of the ETA scheme under the longer global cycle clock is actually much lower than the performance of our proposed solutions under the shorter one. In other words, our scheme can achieve a better level of accuracy than the ETA scheme in half the time.

Fig. 2 illustrates the number of local iterations versus the global cycle clock $T$, for several numbers of edge nodes $K$. Clearly, increasing $T$ gives all learners more time to perform local updates. Again, the OPTI-based, UB-Analytical, and UB-SAI approaches perform similarly for all simulated scenarios. Comparing this performance to ETA, the figure shows that, at one of the simulated cycle clocks, our solutions can perform around 28 local iterations on 20 edge learners, versus only 42 iterations using ETA, a gain of 420%. As the global cycle clock increases to its largest simulated value, our scheme can reach up to 138 iterations whereas the ETA scheme achieves only 30 updates, fewer than the number achieved by our scheme with a 20-second cycle clock.

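Fig. 2: Performance comparison of all schemes (number of local iterations $\tau$ versus global cycle clock $T$) for several numbers of edge nodes $K$.
Fig. 3: MNIST results. (a) Performance comparison of all schemes versus the number of edge nodes $K$, for two values of the global cycle clock $T$. (b) Performance comparison of all schemes versus the global cycle clock $T$, for several numbers of edge nodes $K$.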

V-C Simulation Results for MNIST Dataset

Fig. 3 shows the results for training using the MNIST dataset with a deep neural network model. Similar to the case of the pedestrian dataset, the performance of the OPTI-based, UB-Analytical, and UB-SAI solutions is identical, and better than that of the ETA scheme. In general, fewer updates are possible compared to the smaller pedestrian dataset and model. However, the optimized approach makes it possible to perform more than 30 updates for 20 nodes with a cycle time of 60 seconds. In one of the simulated settings, the OPTI-based approach for adaptive batch allocation achieves a gain of 400% over ETA, with which only 3 updates are possible.

VI Conclusion

This paper inaugurates the research efforts towards establishing the novel MEL paradigm, enabling the design of performance-guaranteed distributed learning solutions for wireless edge nodes with heterogeneous computing and communication capacities. As a first MEL problem of interest, the paper focused on exploring dynamic task allocation solutions that maximize the number of local learning iterations on distributed learners (and thus improve the learning accuracy), while abiding by the global cycle clock of the orchestrator. The problem was formulated as an NP-hard ILPQC problem, which was then relaxed into a non-convex problem over real variables. Analytical upper bounds on the relaxed problem’s optimal solution were then derived and were found to solve it optimally in all simulated scenarios. For large $K$, we proposed a heuristic solution based on implementing suggest-and-improve steps starting from equal batch allocation. Through extensive simulations using well-known datasets, the proposed solutions were shown both to achieve the same performance as the numerical solvers of the ILPQC problem and to significantly outperform the ETA approach.

References