I Introduction
As mobile and IoT devices have emerged in large numbers and generate massive amounts of data [31], Machine Learning (ML) has undergone high-speed development driven by big data and growing computing capacity, propelling Artificial Intelligence (AI) to revolutionize our lives [10]. The conventional ML framework focuses on centralized data processing, which requires widely distributed mobile devices to upload their local data to a remote cloud for global model training [17]. However, it is hard for the cloud server to exploit such a multitude of data from massive user devices, as it is vulnerable to external attacks and data leakage. Given these threats to data privacy, many device users are reluctant to upload their private raw data to the cloud server [7, 9, 1].
To tackle the data security issue in centralized training, a decentralized ML paradigm named Federated Learning (FL) is widely envisioned as an appealing approach [12]. It enables mobile devices to collaboratively build a shared model while keeping privacy-sensitive data local and shielded from direct external access. In a prevalent FL algorithm such as Federated Averaging (FedAvg), each mobile device trains a model locally with its own dataset and then transmits the model parameters to the cloud for global aggregation [2]. With its great potential to facilitate large-scale data collection, FL realizes model training in a distributed fashion.
Unfortunately, FL suffers from a bottleneck of communication and energy overhead before reaching a satisfactory model accuracy, owing to the long transmission latency over the wide area network (WAN) [24]. With devices' limited computing and communication capacity, a plethora of model transmission rounds is required, which degrades the learning performance under a training time budget. Moreover, the numerous computation and communication iterations consume substantial energy, which is challenging for low-battery devices. In addition, as many ML models are large, direct communication with the cloud over the WAN by a massive number of device users could worsen congestion in the backbone network, leading to significant communication latency.
To mitigate such issues, we leverage the power of Mobile Edge Computing (MEC), which is regarded as a promising distributed computing paradigm in the 5G era for supporting many emerging intelligent applications such as video streaming, smart city and augmented reality [23]. MEC allows delay-sensitive and computation-intensive tasks to be offloaded from distributed mobile devices to edge servers in proximity, which offers real-time response and high energy efficiency [6, 16, 28]. Along this line, we propose a novel Hierarchical Federated Edge Learning (HFEL) framework, in which edge servers, acting as intermediaries between mobile devices and the cloud, perform edge aggregations of local models from devices in proximity. When each of them achieves a given learning accuracy, the updated models at the edge are transmitted to the cloud for global aggregation. Intuitively, HFEL helps reduce the significant communication overhead of WAN transmissions between device users and the cloud via edge model aggregations. Moreover, through coordination by the edge servers in proximity, more efficient joint communication and computation resource allocation among the device users can be achieved, enabling effective reductions in training time and energy overhead.
Nevertheless, to realize the great benefits of HFEL, we still face the following challenges: 1) How to solve a joint computation and communication resource allocation for each device to achieve training acceleration and energy saving? The training time to converge to a predefined accuracy level is one of the most important performance metrics of FL, while energy minimization of battery-constrained devices is the main concern in MEC [28]. Both training time and energy minimization depend on mobile devices' computation capacities and the communication resources allocated by edge servers. As the resources of an edge server and its associated devices are generally limited, such optimization is non-trivial. 2) How to associate the proper set of device users with an edge server for efficient edge model aggregation? As shown in Fig. 1, densely distributed mobile devices are generally able to communicate with multiple edge servers. From the perspective of an edge server, it is better to communicate with as many mobile devices as possible for edge model aggregation to improve learning accuracy. Yet the more devices that choose to communicate with the same edge server, the less communication resource each device gets, which leads to longer communication delay. As a result, the computation and communication resource allocation for the devices and their edge association should be carefully addressed to accomplish cost-efficient learning performance in HFEL.
As a thrust at the grand challenges above, in this paper we formulate a joint computation and communication resource allocation and edge server association problem for global learning cost minimization in HFEL. Unfortunately, this optimization problem is NP-hard. Hence we decompose the original problem into two subproblems: 1) a resource allocation problem and 2) an edge association problem, and accordingly put forward an efficient integrated scheduling algorithm for HFEL. For resource allocation, given a set of devices scheduled to upload local models to the same edge server, we derive an optimal policy, i.e., the amount of computation capacity each device contributes and the bandwidth resource each device is allocated from the edge server. Moreover, for edge association, we work out a feasible set of devices (i.e., a training group) for each edge server through cost-reducing iterations based on the optimal resource allocation policy within the training group. The edge association iterations finally converge to a stable system point, where each edge server owns a stable set of model training devices that achieves global cost efficiency and no edge server will change its training group formation.
In a nutshell, our work makes the key contributions as follows:

We propose a hierarchical federated edge learning (HFEL) framework which offers great potential for low-latency and energy-efficient federated learning, and formulate a holistic joint computation and communication resource allocation and edge association model for global learning cost minimization.

We decompose the challenging global cost minimization problem into two subproblems: resource allocation and edge association, and accordingly devise an efficient HFEL resource scheduling algorithm. With the optimal policy of the convex resource allocation subproblem given a training group of a single edge server, a feasible edge association strategy can be solved for each edge server through cost-reducing iterations which are guaranteed to converge to a stable system point.

Extensive numerical experiments demonstrate that our HFEL resource scheduling algorithm achieves superior performance gains in global cost saving over the benchmark schemes and better training performance than conventional device-cloud based FL.
II System Model
Symbol  Definitions  Symbol  Definitions 

set of mobile devices  set of edge servers  
set of available mobile devices for edge server  device ’s training data set  
the th input sample of a device  a labeled output of of a device  
local training accuracy  a constant related to the number of local training iterations  
number of local iterations  index of local training iteration  
training model of device at th iteration  learning rate  
number of CPU cycles for device to process one sample data  the minimum and maximum computation capacity of device  
CPU frequency variable of device for local training  computation delay and energy respectively of local iterations of device  
effective capacitance coefficient of device ’s computing chipset  set of devices who choose to transmit their model parameters and gradients to edge server  
edge server ’s total bandwidth  ratio of bandwidth allocated to device from edge server  
achievable transmission rate of device  background noise  
transmission power of device  channel gain of device  
communication time and energy respectively for device to transmit local model to edge server  device ’s or edge server ’s update size of model parameters and gradients  
aggregated model by edge server  edge training accuracy  
edge iteration number  energy and delay respectively under edge server with the set of devices  
delay and energy respectively for edge model uploading by edge server to the cloud  edge server ’s transmission rate to the cloud  
transmission power of edge server per second  dataset under edge server with set of devices  
total dataset of the set of devices  global model aggregated by the cloud under one global iteration  
systemwide energy and delay respectively under one global iteration  weighting parameters of energy and delay for device training requirements, respectively 
In the HFEL framework, we assume a set of mobile devices , a set of edge servers and a cloud server . Let represent the set of available mobile devices that can communicate with edge server . In addition, each device owns its local data set , where denotes the th input sample and is the corresponding labeled output of for the federated learning tasks. The key notations used in this paper are summarized in Table I.
II-A Learning Process in HFEL
We consider our HFEL architecture as shown in Fig. 1, in which one training model goes through model aggregation at the edge layer and the cloud layer. Therefore, the model parameters shared by mobile devices in a global iteration involve both edge aggregation and cloud aggregation. To quantify the training overheads in the HFEL framework, we formulate the energy and delay overheads of edge aggregation and cloud aggregation within one global iteration.
II-A1 Edge Aggregation
This stage includes three steps: local model computation, local model transmission and edge model aggregation. That is, the local model is first trained by the mobile devices and then transmitted to their associated edge servers for edge aggregation, as elaborated in the following steps.
Step 1. Local model computation. In this step, each device needs to solve for the machine learning model parameter that characterizes the output value with the associated loss function. The loss function on the data set of device is defined as
(1)
To achieve a local accuracy that is common to all devices for the same model, device needs to run a number of local iterations, formulated as for a wide range of iterative algorithms [14]. The constant depends on the data size and the machine learning task. That is, each device 's task is to figure out its local update at the th local iteration as
(2) 
until the local accuracy is reached, where is the predefined learning rate [8].
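As an illustration of the local computation step, the sketch below runs plain gradient-descent iterations on a toy least-squares task; `grad_fn`, the step count, and the toy data are our own assumptions, and the exact update rule of (2) may include additional terms.

```python
import numpy as np

def local_update(w, X, y, grad_fn, eta=0.0001, num_iters=10):
    """Run `num_iters` local gradient-descent iterations on one device.

    Hypothetical sketch: `grad_fn(w, X, y)` returns the gradient of the
    device's local loss; `eta` is the learning rate (cf. Table II). The
    exact update rule in (2) may differ, e.g. by a correction term.
    """
    for _ in range(num_iters):
        w = w - eta * grad_fn(w, X, y)
    return w

# Toy usage: least-squares loss on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
grad = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)
w_new = local_update(np.zeros(4), X, y, grad, eta=0.01, num_iters=50)
```
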
Accordingly, the formulation of computation delay and energy overheads incurred by device can be given in the following. Let be the number of CPU cycles for device to process one sample data. Considering that each sample has the same size, the total number of CPU cycles to run one local iteration is . We denote the allocated CPU frequency of device for computation by with . Thus the total delay of local iterations of can be formulated as
(3) 
and the energy cost of the total local iterations incurred by device can be given as [3]
(4) 
where represents the effective capacitance coefficient of device ’s computing chipset.
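The delay and energy expressions of (3) and (4) follow the standard mobile-device computation model, which can be sketched as follows; all argument names are our own labels for the paper's symbols.

```python
def local_computation_cost(num_iters, cycles_per_sample, num_samples,
                           cpu_freq_hz, kappa):
    """Delay and energy of `num_iters` local iterations on one device.

    Per the model behind (3)-(4): one local iteration costs
    cycles_per_sample * num_samples CPU cycles; running at frequency f
    takes cycles / f seconds and spends kappa * cycles * f**2 joules,
    with kappa the effective capacitance coefficient.
    """
    total_cycles = num_iters * cycles_per_sample * num_samples
    delay = total_cycles / cpu_freq_hz               # seconds, cf. (3)
    energy = kappa * total_cycles * cpu_freq_hz**2   # joules,  cf. (4)
    return delay, energy

# Example: 10 iterations, 50 cycles per bit on a 5 MB (4e7-bit) dataset.
d1, e1 = local_computation_cost(10, 50, 5 * 8e6, 2e9, kappa=1e-28)
d2, e2 = local_computation_cost(10, 50, 5 * 8e6, 4e9, kappa=1e-28)  # double f
```

Doubling the CPU frequency halves the delay but quadruples the energy, which is exactly the tension the weighted objective in Section II-B trades off.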
Step 2. Local model transmission. After finishing its local iterations, each device transmits its trained model to a selected edge server, which inevitably incurs wireless transmission delay and energy. For an edge server , we denote the set of devices that choose to transmit their model parameters to by .
In this work, we consider an orthogonal frequency-division multiple access (OFDMA) protocol for the devices, in which edge server provides a total bandwidth . Define as the bandwidth allocation ratio for device , such that 's resulting allocated bandwidth is . Let denote the achievable transmission rate of device , which is defined as
(5) 
where is the background noise, is the transmission power, and is the channel gain of device (as referred to in [25]). Let denote the communication time for device to transmit its model parameters of data size to edge server . Thus can be characterized as
(6) 
Given the communication time and power of , the energy cost of to transmit is
(7) 
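Putting (5)-(7) together, the per-device communication cost can be sketched as below, assuming the usual Shannon-capacity form over the device's allocated sub-band; the argument names are our own.

```python
import math

def transmission_cost(bandwidth_ratio, total_bandwidth_hz, tx_power_w,
                      channel_gain, noise_w, model_size_bits):
    """Uplink rate (5), delay (6) and energy (7) for one device.

    Shannon-capacity model over the OFDMA sub-band the device is
    allocated; argument names are our labels for the paper's symbols.
    """
    rate = bandwidth_ratio * total_bandwidth_hz * math.log2(
        1 + tx_power_w * channel_gain / noise_w)    # bits/s, cf. (5)
    delay = model_size_bits / rate                  # s,      cf. (6)
    energy = tx_power_w * delay                     # J,      cf. (7)
    return rate, delay, energy

# Example: 20% of a 10 MHz band, 200 mW power, gain 1e-7, noise 1e-10 W.
r1, t1, e1 = transmission_cost(0.2, 10e6, 0.2, 1e-7, 1e-10, 25000 * 8)
r2, t2, e2 = transmission_cost(0.4, 10e6, 0.2, 1e-7, 1e-10, 25000 * 8)
```

Doubling a device's bandwidth share halves its upload delay and energy, which is why the bandwidth ratios are the key coupling variables among devices sharing one edge server.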
Step 3. Edge model aggregation. At this step, each edge server receives the updated model parameters from its connected devices and then averages them as
(8) 
where is aggregated data set under edge server .
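The edge aggregation of (8) follows the FedAvg pattern of weighting each device's model by its local data size, which can be sketched as:

```python
import numpy as np

def edge_aggregate(models, data_sizes):
    """Data-size-weighted average of device models, the (8) pattern.

    `models` holds the parameter vectors uploaded by the devices of one
    edge server's training group; weighting by local dataset size
    follows the FedAvg convention the surrounding text describes.
    """
    weights = np.asarray(data_sizes, dtype=float)
    weights = weights / weights.sum()
    models = np.asarray(models, dtype=float)
    return np.sum(weights[:, None] * models, axis=0)

# Two devices; the second holds 3x as much data, so it dominates.
agg = edge_aggregate([[1.0, 2.0], [3.0, 4.0]], [100, 300])
```
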
After that, edge server broadcasts to its devices in for the next round of local model computation (i.e., Step 1). In other words, Steps 1 to 3 of edge aggregation iterate until edge server reaches an edge accuracy , which is the same for all edge servers. We can observe that each edge server never accesses the local data of any device , thus preserving personal data privacy. To achieve the required model accuracy, for a general convex machine learning task, the number of edge iterations is shown to be [20]
(9) 
where is some constant that depends on the learning task. Note that our analysis framework can also be applied when the connection between the convergence iterations and model accuracy is known in the nonconvex learning tasks.
Since an edge server typically has strong computing capability and a stable energy supply, the edge model aggregation time and the energy overhead for broadcasting the aggregated model parameters are not considered in our optimization model. Also, since the time and energy cost for a device to receive the broadcast model parameters is small compared with model parameter uploading and remains almost constant during each iteration, we ignore this part in our model as well. Thus, after edge iterations, the total energy cost of the devices in edge server 's training group is given by
(10) 
Similarly, the delay including computation and communication for edge server to achieve an edge accuracy can be derived as
(11) 
From (11), we see that the computation delay bottleneck is determined by the last device to finish all its local iterations, while the communication delay bottleneck is determined by the device that spends the longest time on model transmission after local training.
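The per-group cost structure of (10) and (11), where energy is summed over all devices while delay is bottlenecked by the slowest trainer plus the slowest uploader, can be sketched as follows (argument names are ours):

```python
def edge_round_cost(edge_iters, comp_delays, comp_energies,
                    comm_delays, comm_energies):
    """Training-group energy, cf. (10), and delay, cf. (11).

    Energy accumulates every device's computation and transmission
    energy over `edge_iters` edge rounds; per-round delay is the
    slowest local trainer plus the slowest uploader. The per-device
    lists are indexed identically.
    """
    energy = edge_iters * (sum(comp_energies) + sum(comm_energies))
    delay = edge_iters * (max(comp_delays) + max(comm_delays))
    return energy, delay

# Two devices over 3 edge rounds; device 2 is the computation straggler.
energy, delay = edge_round_cost(3, [1.0, 2.0], [0.5, 0.5],
                                [0.2, 0.1], [0.05, 0.05])
```
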
II-A2 Cloud Aggregation
This stage consists of two steps: edge model uploading and cloud model aggregation. That is, each edge server uploads to the cloud for global aggregation after edge iterations.
Step 1. Edge model uploading. Let denote edge server 's transmission rate to the remote cloud for edge model uploading, its transmission power, and its update size. We then derive the delay and energy for edge model uploading by edge server respectively as
(12)  
(13) 
Step 2. Cloud model aggregation. At this final step, the remote cloud receives the updated models from all the edge servers and aggregates them as:
(14) 
where .
As a result, neglecting the aggregation time on the cloud which is much smaller than that on the mobile devices, we can obtain the systemwide energy and delay under one global iteration as
(15)  
(16) 
For a clearer description, we provide the procedure of one global aggregation iteration of HFEL in Algorithm 1. This global aggregation procedure can be repeated by pushing the global model parameters to all the devices via the edge servers, until the stopping condition (e.g., the model accuracy or total training time) is satisfied.
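The structure of one HFEL global iteration (edge aggregation rounds followed by cloud aggregation) can be sketched as below; the function names, the weighting choices, and the toy local-update rule are illustrative assumptions rather than a transcription of Algorithm 1.

```python
import numpy as np

def hfel_global_iteration(cloud_model, edge_groups, local_train, edge_iters):
    """One global iteration of HFEL in the spirit of Algorithm 1.

    `edge_groups` maps each edge server to a list of
    (device_data, data_size) pairs; `local_train(model, data)` stands
    in for the local computation step. Data-size weighting at both
    layers is an assumption borrowed from the FedAvg convention.
    """
    edge_models, edge_sizes = [], []
    for devices in edge_groups:
        sizes = np.array([s for _, s in devices], dtype=float)
        edge_model = np.array(cloud_model, dtype=float)
        for _ in range(edge_iters):   # edge aggregation rounds (>= 1)
            updates = [local_train(edge_model, d) for d, _ in devices]
            edge_model = np.average(updates, axis=0, weights=sizes)
        edge_models.append(edge_model)
        edge_sizes.append(sizes.sum())
    # cloud aggregation, weighted by the data volume under each edge server
    return np.average(edge_models, axis=0, weights=edge_sizes)

# Toy usage: local training nudges the model toward the device's data.
local_step = lambda m, d: m + 0.5 * (d - m)
groups = [[(np.array([1.0]), 10.0), (np.array([3.0]), 10.0)],
          [(np.array([5.0]), 20.0)]]
new_global = hfel_global_iteration(np.array([0.0]), groups, local_step,
                                   edge_iters=2)
```
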
II-B Problem Formulation
Given the system model above, we now consider the system performance optimization problem in terms of energy and delay overhead minimization within one global iteration. Let represent the importance weights of energy and delay for the training objectives, respectively. Then the HFEL optimization problem is formulated as follows:
where (17a) and (17c) respectively represent the uplink communication resource constraints and computation capacity constraints, (17d) and (17e) ensure all the devices in the system participate in the model training, and (17f) requires that each device is allowed to associate with only one edge server for model parameter uploading and aggregation for the sake of cost saving.
Unfortunately, this optimization problem is NP-hard in general due to its combinatorial nature in user partitioning for edge association and its coupling with the resource allocation issue. This implies that for large inputs it is impractical to obtain the global optimal solution in a real-time manner. Thus, an efficient low-complexity approximation algorithm is highly desirable, which motivates the HFEL scheduling algorithm design in the following.
II-C Overview of HFEL Scheduling Scheme
Since the optimization problem (17) is NP-hard in general, a common and intuitive solution is to design a feasible and computationally efficient approach to approximately minimize the system cost. Here we adopt the divide-and-conquer principle and decompose the HFEL scheduling algorithm design into two key subproblems: resource allocation within a single edge server and edge association across multiple edge servers.
As shown in Fig. 2, the basic procedures of our scheme are elaborated as follows:

We first carry out an initial edge association strategy (e.g., each device connects to its closest edge server). Given the initial edge association strategy, we then solve the optimal resource allocation for the devices within each edge server (given in Section III later on).

Then we define two possible adjustments each device can perform to improve the edge association scheme: transferring and exchanging (formally defined in Section IV later on). These adjustments are permitted if they improve the system-wide performance without damaging any edge server's utility.

When a device performs a permitted adjustment, it incurs a change of systematic edge association strategy. Thus we will work out the optimal resource allocation for each edge server with updated edge association.

All the devices iteratively perform possible adjustments until there exists no permitted adjustment, i.e., no change of systematic edge association strategy.
As shown in the following sections, the resource allocation subproblem can be efficiently solved in practice using convex optimization solvers, and the edge association process converges to a stable point within a small number of iterations. Hence the resource scheduling algorithm for HFEL converges quickly and is amenable to practical implementation.
III Optimal Resource Allocation Within a Single Edge Server
In this section, we focus on the optimal overhead minimization within a single edge server, i.e., the joint computation and communication resource allocation subproblem under edge server given a scheduled training group of devices .
To simplify the notations, we first introduce the following terms:
where and are constants related to device 's parameters and the system setting. Then, by refining and simplifying the aforementioned formulation (17) in the single-edge-server scenario, we can derive a subproblem formulation of edge server 's overhead minimization under one global iteration as follows:
For the optimization problem in (18), we can show it is a convex optimization problem as stated in the following.
Theorem 1
The resource allocation subproblem (18) is convex.
Proof. The subformulas of consist of the following three parts: 1) , 2) and 3) , each of which is convex in its domain, and all constraints are affine, so that problem (18) is convex.
Moreover, by exploring the Karush-Kuhn-Tucker (KKT) conditions of problem (18), we can obtain the following structural result.
Theorem 2
The optimal solutions to device ’s bandwidth and computation capacity allocations and under edge server of (18) satisfy
(19) 
Proof. First, to make (18) more tractable, let and . Then problem (18) can be further transformed to
Given , problem (20) is convex such that it can be solved by the Lagrange multiplier method. The partial Lagrange formula can be characterized as
where and are the Lagrange multipliers related to constraints (20a) and (20d). Applying KKT conditions, we can derive the necessary and sufficient conditions in the following.
(21)  
(22)  
(23)  
(24)  
(25)  
(26) 
From (21) and (22), we can derive the following relations:
(27)  
(28)  
(29) 
based on which, another relation expression can be obtained according to (24) as follows.
(30) 
Hence, we can easily have
(31) 
Finally, replacing with (29) in expression (31), the optimal bandwidth ratio can be easily figured out as in (19).
Given the results in Theorems 1 and 2, we can efficiently solve the resource allocation problem (18) with the detailed procedure in Algorithm 2. Specifically, by replacing with (19), we can transform problem (18) into an equivalent convex optimization problem as follows.
(32)  
(33) 
Since the original problem (18) is convex and is convex with respect to , the transformed problem (32) above is also convex and can be solved by standard convex optimization solvers (e.g., CVX and IPOPT) to obtain the optimal solution . After that, the optimal solution can be derived based on (19) given . Note that through such problem transformation, we greatly reduce the number of decision variables in the original problem (18), which helps to significantly reduce the solution computing time in practice.
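Since the closed-form expression (19) relies on symbols not reproduced above, the sketch below solves a simplified, separable stand-in for problem (18) instead: a cube-root rule for each CPU frequency and a Cauchy-Schwarz bandwidth split. It illustrates the structure of the solution (closed-form bandwidth ratios plus bounded CPU frequencies), not the paper's exact rule; the objective and all names are our own assumptions.

```python
import numpy as np

def allocate_resources(bits, rates_per_hz, f_min, f_max,
                       kappa=1e-28, alpha=0.5, total_bw=10e6):
    """Closed-form allocation for a simplified stand-in of problem (18).

    CPU frequency: each device minimizes
    alpha*kappa*c*f**2 + (1-alpha)*c/f over f (weighted energy plus
    delay for cycle count c), giving
    f* = ((1 - alpha) / (2 * alpha * kappa)) ** (1/3),
    clipped to the device's bounds (c cancels out).
    Bandwidth: minimizing sum_n k_n / beta_n subject to sum_n beta_n = 1
    yields beta_n proportional to sqrt(k_n) by Cauchy-Schwarz, where
    k_n is device n's transmission-delay coefficient.
    """
    bits = np.asarray(bits, dtype=float)
    spec = np.asarray(rates_per_hz, dtype=float)  # bits/s per Hz, device n
    f_star = ((1 - alpha) / (2 * alpha * kappa)) ** (1 / 3)
    freqs = np.clip(np.full(len(bits), f_star), f_min, f_max)
    k = (1 - alpha) * bits / (total_bw * spec)    # delay coefficient of beta_n
    betas = np.sqrt(k) / np.sqrt(k).sum()
    return freqs, betas

# Two devices; the second uploads 4x the traffic -> twice the bandwidth share.
freqs, betas = allocate_resources([2e5, 8e5], [3.0, 3.0], 1e9, 10e9)
```
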
IV Edge Association for Multiple Edge Servers
We then consider the edge association subproblem for multiple edge servers. Given the optimal resource allocation of the scheduled devices under a single edge server, the key idea for minimizing the systematic overhead is to efficiently assign a set of devices to each edge server for edge model aggregation. In the following, we design an efficient edge association scheme for all the edge servers, in order to iteratively improve the overall system performance.
First we introduce some critical concepts and definitions about edge association by each edge server in the following.
Definition 1
In our system, a local training group is a subset of whose devices choose to upload their local models to edge server for edge aggregation. Correspondingly, the utility of is derived as , i.e., the negative of the minimum cost obtained by solving the resource allocation subproblem for edge server .
Definition 2
An edge association strategy is defined as the set of local training groups of all the edge servers, where , such that the systemwide utility given scheduled can be denoted as .
Which edge association strategy the whole system prefers depends on , due to the global overhead minimization objective. Hence, to compare different edge association strategies, we define a Pareto order based on , which reflects the preference of all the edge servers on forming local training groups.
Definition 3
Given two different edge association strategies and , we define a Pareto order as if and only if and for and with of each edge server , we have , indicating that edge association strategy is preferred over to achieve lower overhead by all the edge servers.
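The comparison in Definition 3 can be expressed as a predicate over per-server costs; the standard Pareto-dominance condition (no server worse off, at least one strictly better) is assumed here, since the exact inequalities did not survive extraction.

```python
def pareto_preferred(costs_new, costs_old):
    """Return True if the new edge association strategy Pareto-dominates
    the old one: no edge server's cost increases and at least one cost
    strictly decreases. (Costs are the negatives of the group utilities;
    the strict/weak split is our reading of Definition 3.)"""
    pairs = list(zip(costs_new, costs_old))
    return (all(n <= o for n, o in pairs)
            and any(n < o for n, o in pairs))

ok1 = pareto_preferred([3.0, 5.0], [3.0, 6.0])   # one improves, none worsens
ok2 = pareto_preferred([2.0, 7.0], [3.0, 6.0])   # trade-off: not preferred
```
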
Next, we can solve the overhead minimization problem by continually adjusting the edge association strategy , i.e., each edge server's training group formation, to gain lower overhead in accordance with the Pareto order . The edge association adjustment eventually terminates at a stable , from which no edge server in the system will deviate its local training group.
Obviously, the adjustment of the edge association strategy results from changes in each edge server's local training group formation. In our system, edge association adjustments with utility improvement are permitted, which we define as follows.
Definition 4
A device transferring adjustment by means that device with leaves its current training group and joins another training group . Causing a change from to , the device transferring adjustment is permitted if and only if .
Definition 5
A device exchanging adjustment between edge servers and means that device and device are switched to each other’s local training group. Causing a change from to , the device exchanging adjustment is permitted if and only if .
Based on the wireless communication between devices and edge servers, each device reports all its detailed information (including computing and communication parameters) to its available edge servers. Then each edge server will calculate its own utility and manage the edge association adjustments through cellular communication with other edge servers.
With each permitted adjustment bringing a systematic overhead decrease of , the edge association adjustment process terminates at a stable point where no edge server will deviate from the current edge association strategy.
Definition 6
An edge association strategy is at a stable system point if no edge server will change to obtain lower global training overhead with unchanged.
That is, at a stable system point , no edge server will deviate its local training group formation from to achieve lower global FL overhead given optimal resource allocation within .
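The adjustment process described above amounts to a local search that accepts only utility-improving moves. The sketch below implements the transferring adjustment of Definition 4 (exchanging follows the same accept-if-utility-improves test); `group_utility` stands in for solving the resource allocation subproblem of Section III, and the round-robin initialization is our own choice.

```python
import itertools

def edge_association(devices, servers, group_utility):
    """Greedy transfer-based local search for the edge association
    subproblem (the pattern of Algorithm 3, transferring moves only).

    `group_utility(server, group)` returns the group utility of
    Definition 1, i.e. the negative minimum cost of the resource
    allocation subproblem for that group; here it is an opaque callable.
    """
    # initial association: assign devices round-robin
    groups = {s: set() for s in servers}
    for i, d in enumerate(devices):
        groups[servers[i % len(servers)]].add(d)

    improved = True
    while improved:                       # iterate until a stable point
        improved = False
        for d, (src, dst) in itertools.product(
                devices, itertools.permutations(servers, 2)):
            if d not in groups[src]:
                continue
            before = (group_utility(src, groups[src])
                      + group_utility(dst, groups[dst]))
            after = (group_utility(src, groups[src] - {d})
                     + group_utility(dst, groups[dst] | {d}))
            if after > before:            # permitted adjustment
                groups[src].remove(d)
                groups[dst].add(d)
                improved = True
    return groups

# Toy utility: reward devices placed on their preferred server,
# penalize crowding. Devices 0,1 prefer 'A'; devices 2,3 prefer 'B'.
pref = {0: 'A', 1: 'A', 2: 'B', 3: 'B'}
utility = lambda s, g: sum(1 for d in g if pref[d] == s) - 0.1 * len(g) ** 2
final = edge_association([0, 1, 2, 3], ['A', 'B'], utility)
```

Because every accepted move strictly increases the total utility and the number of group formations is finite, the loop terminates, mirroring the convergence argument of Theorem 3.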
IV-A Edge Association Algorithm
Next, we devise an edge association algorithm to achieve cost efficiency in HFEL for all the edge servers and seek feasible computation and communication resource allocation for their training groups. Note that in our scenario, the edge server has perfect knowledge of the channel gains and computation capacities of its local training group which can be obtained by feedback, and can connect with each other through cellular links. Thus, our decentralized edge association process is implemented by all the edge servers, which consists of two steps: initialized allocation and edge association as described in Algorithm 3.
In the first stage, the initialization procedure is as follows.

First for each edge server , local training group is randomly formed.

Then, given , edge server solves the resource allocation subproblem, obtaining and , and deriving .

After the initial edge association of all the edge servers completes, an initial edge association strategy can be achieved.
In the second stage, each device performs all possible permitted edge association adjustments until no local training group changes. Specifically, a historical group set is maintained for each edge server to record the group compositions it has formed before, along with the corresponding utility values, so that repeated calculations can be avoided.
After the edge association algorithm converges, all the involved mobile devices will execute local training with the optimal resource allocation strategy and that each edge server broadcasts to its local training group.
Theorem 3
The proposed edge association process in Algorithm 3 will converge to a stable system point.
Proof. We prove the convergence of our scheme to a stable system point by contradiction. Suppose that Algorithm 3 outputs an edge association strategy that has not converged to a stable system point. By Definition 6, this implies that some edge servers can still perform permitted adjustments to reach another with . However, according to the termination condition of Algorithm 3, our scheme ends only when no permitted adjustment remains for any edge server, which contradicts the existence of such a . As a consequence, the solution derived by our edge association algorithm satisfies the property of a stable system point, where no edge server will deviate its local training group formation from and all the devices obtain a feasible resource allocation to execute cost-efficient hierarchical federated learning.
Extensive performance evaluation in Section V shows that the proposed edge association algorithm can converge in a fast manner, with an almost linear convergence speed.
V Performance Evaluation
In this section, we carry out simulations to evaluate: 1) the global cost saving performance of the proposed resource scheduling algorithm and 2) HFEL performance in terms of test accuracy, training accuracy and training loss. All the devices and edge servers are distributed randomly within the entire area.
Parameter  Value 

Maximum Bandwidth of Edge Servers  10 MHz 
Device Transmission Power  200 mW 
Device CPU Freq.  [1, 10] GHz 
Device CPU Power  600 mW 
Processing Density of Learning Tasks  [30, 100] cycle/bit 
Background Noise  W 
Device Training Size  [5, 10] MB 
Updated Model Size  25000 nats 
Capacitance Coefficient  
Learning rate  0.0001 
V-A Performance Gain in Cost Reduction
Typical parameters of devices and edge servers are provided in Table II, with image classification learning tasks on the MNIST dataset [15]. To characterize mobile device heterogeneity for the MNIST dataset, each device maintains only two labels out of the total of labels, and the devices have different sample sizes following the power law in [18]. Furthermore, each device trains with full batch size. Varying the mobile device number from to and the edge server number from to , we compare our algorithm to the following schemes to present the performance gain in cost reduction:

Random edge association: each edge server selects its set of mobile devices in a random way and then solves the optimal resource allocation for . That is, it only optimizes the resource allocation subproblem given a random set of devices.

Greedy edge association: each device selects its connected edge server sequentially based on the geographical distance to each edge server in ascending order. After that, each edge server solves the optimal resource allocation with . It also only optimizes the resource allocation subproblem without edge association, similar to random edge association.

Computation optimization: in this scheme, the resource allocation subproblem for each solves the optimal computation capacity given an even distribution of the bandwidth ratio .

Communication optimization: in this scheme, the resource allocation subproblem for each solves the optimal bandwidth ratio allocation with a random computation capacity decision .

Uniform resource allocation: in this scheme, we leverage the same edge association strategy as our proposed algorithm, while in the resource allocation subproblem the bandwidth of each edge server is evenly distributed to the mobile devices in and the computation capacity of is randomly determined between and . That is, the edge association subproblem is solved without resource allocation optimization.

Proportional resource allocation: for all the edge servers, we likewise adopt the edge association strategy to improve . In the resource allocation subproblem, however, the bandwidth of each edge server is distributed to each inversely proportional to its distance, so that the communication bottleneck can be mitigated. Similarly, the computation capacity of is randomly decided as . As with uniform resource allocation, only the edge association subproblem is solved.
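The two-label non-IID partition used for MNIST earlier in this subsection can be sketched with a shard-based split; the paper sizes devices by a power law [18], which is replaced here by equal shards for simplicity:

```python
import numpy as np

def two_label_partition(labels, num_devices, num_classes=10, seed=0):
    """Non-IID partition: each device holds samples of at most two labels.

    Sketch of the heterogeneity setup of Section V-A (two labels per
    device); splitting each class into two shards and handing each
    device two shards is our own simplification of the paper's
    power-law sizing.
    """
    rng = np.random.default_rng(seed)
    shards = []
    for c in range(num_classes):
        idx = list(np.flatnonzero(labels == c))
        rng.shuffle(idx)
        half = len(idx) // 2
        shards += [idx[:half], idx[half:]]   # 2 single-class shards per class
    rng.shuffle(shards)
    # two shards per device -> at most two distinct labels each
    per_dev = len(shards) // num_devices
    return [sum(shards[i * per_dev:(i + 1) * per_dev], [])
            for i in range(num_devices)]

labels = np.repeat(np.arange(10), 100)       # toy: 100 samples per class
parts = two_label_partition(labels, num_devices=10)
```
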
As presented in both Fig. 6 and Fig. 6, in which uniform resource allocation is regarded as the benchmark, our HFEL algorithm achieves the lowest global cost ratio among all the compared schemes.
First, we explore the impact of different device numbers on the performance gain in global cost reduction by fixing the edge server number as . As described in Fig. 6, the HFEL algorithm accomplishes a satisfying global cost ratio compared to uniform resource allocation as the device number grows. Compared to the computation optimization, greedy edge association, random edge association, communication optimization and proportional resource allocation schemes, our algorithm is more efficient and achieves up to and performance gains in global cost reduction, respectively.
Then, with the number of devices fixed, Fig. 6 exhibits that our HFEL algorithm still outperforms the other comparison schemes: it maintains the lowest global cost ratio relative to the uniform resource allocation scheme, and achieves notable global cost reduction over the computation optimization, greedy device allocation, random device allocation, communication optimization and proportional resource allocation schemes.
Note that the greedy device allocation and random device allocation schemes only optimize the resource allocation subproblem without edge association, while the proportional and uniform resource allocation strategies solve edge association without resource allocation optimization. It can be observed that the performance gain from resource allocation optimization greatly dominates that from the edge association solution in global cost reduction.
Further, we show the average number of iterations of our algorithm in Fig. 6, both as the number of devices grows and as the number of edge servers grows. The results show that the proposed edge association strategy converges quickly and that its iteration count grows (almost) linearly with the numbers of mobile devices and edge servers, which demonstrates the computational efficiency of the edge association algorithm.
V-B Performance gain in training loss and accuracy
In this subsection, the performance of HFEL is validated on the MNIST [15] and FEMNIST [5] datasets (FEMNIST is an extended MNIST dataset partitioned by the writer of each digit or character) and compared against the classic FedAvg algorithm [2]. Moreover, each device trains with full batch size on both MNIST and FEMNIST.
We consider a set of edge servers and devices participating in the training process for the experiments. Each dataset is randomly split into a training portion and a testing portion. In the training process, a number of global iterations are executed, during each of which all devices go through the same number of local iterations in both the HFEL and FedAvg schemes.
Fig. 12 demonstrates the test accuracy, training accuracy and training loss on the MNIST dataset as the number of global iterations grows. The loss function is defined according to [8]. As shown, our HFEL algorithm achieves higher test accuracy and training accuracy than FedAvg, as well as lower training loss. This is because, given the same number of local iterations during one global iteration, devices in HFEL additionally undergo several rounds of model aggregation at the edge servers and thus benefit from edge-level model updates, whereas devices in FedAvg only train on their local datasets without receiving information from the external network during a global iteration.
Fig. 12 presents the training performance on the FEMNIST dataset. Compared to FedAvg, HFEL again achieves higher test accuracy and training accuracy on FEMNIST, together with a lower training loss.
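The two-tier aggregation that gives HFEL its edge over FedAvg can be sketched as follows. This is a simplified sketch, assuming models are plain parameter vectors; the `local_update` callback and the iteration counts `kappa1`/`kappa2` are illustrative placeholders, not the paper's exact training procedure.

```python
def average(models):
    """Element-wise average of a list of parameter vectors."""
    return [sum(vals) / len(models) for vals in zip(*models)]

def hfel_round(global_model, edge_groups, local_update, kappa1, kappa2):
    """One global (cloud) iteration of hierarchical aggregation:
    every device runs kappa1 local training steps, each edge server
    averages its devices' models kappa2 times, and the cloud finally
    averages the edge models."""
    edge_models = []
    for devices in edge_groups:          # devices served by one edge
        model = list(global_model)
        for _ in range(kappa2):          # edge aggregation rounds
            updated = [local_update(model, d, kappa1) for d in devices]
            model = average(updated)     # partial aggregation at edge
        edge_models.append(model)
    return average(edge_models)          # cloud aggregation
```

In this sketch, a FedAvg global iteration corresponds to the special case of a single (virtual) edge server holding all devices with `kappa2 = 1`; the extra edge rounds are what let HFEL devices receive aggregated updates between cloud synchronizations.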
Moreover, the effect of different numbers of local iterations on convergence speed is exhibited in Fig. 16. As we can see, with the number of edge iterations fixed and an increasing number of local iterations, the convergence speed shows an obvious acceleration on both the MNIST and FEMNIST datasets, which implies that growing the number of local iterations has a positive impact on convergence time. In Fig. 16, we further conduct experiments with the product of the local and edge iteration numbers held fixed while the number of edge iterations grows. The results show that decreasing the number of local iterations while increasing the number of edge iterations reduces the number of communication rounds with the cloud needed to reach the target test accuracy on the MNIST and FEMNIST datasets, respectively. Hence, properly increasing the number of edge iteration rounds can help reduce propagation delay and improve convergence speed in HFEL.
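This trade-off can be illustrated with a minimal delay model. The latency constants and round counts below are hypothetical, and the assumption that a larger number of edge iterations allows fewer cloud rounds reflects the experimental observation above rather than a derived guarantee.

```python
def wall_clock_time(cloud_rounds, kappa1, kappa2, t_local, t_edge, t_cloud):
    """Simplified delay model: each cloud round contains kappa2 edge
    aggregation rounds, each of which contains kappa1 local steps.
    Edge aggregation latency t_edge is assumed far below the WAN
    latency t_cloud of a cloud synchronization."""
    per_round = kappa2 * (kappa1 * t_local + t_edge) + t_cloud
    return cloud_rounds * per_round
```

Under the same per-cloud-round compute budget (kappa1 * kappa2 = 60) and with hypothetical latencies t_local = 0.01, t_edge = 0.1, t_cloud = 2.0, a schedule of kappa1 = 6, kappa2 = 10 that reaches the target accuracy in fewer cloud rounds yields a lower total delay than kappa1 = 60, kappa2 = 1, because most of the cost sits in the cloud synchronizations.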
VI Related work
To date, federated learning (FL) has been envisioned as a promising approach to guaranteeing personal data security compared to conventional centralized training at the cloud. It only requires the local models trained by mobile devices on their local datasets to be aggregated by the cloud, such that the global model can be updated iteratively until the training process converges.
Nevertheless, faced with long propagation delays in wide-area networks (WANs), FL suffers from a communication overhead bottleneck due to the thousands of communication rounds required between mobile devices and the cloud. Hence, a majority of studies have focused on reducing communication cost in FL [13, 4, 26, 11]. The authors in [13] proposed structured and sketched local updates to reduce the model size transmitted from mobile devices to the cloud, while the authors in [4] extended the work in [13] with lossy compression and federated dropout to reduce cloud-to-device communication cost. [26] proposed a communication-mitigated federated learning (CMFL) algorithm in which devices only upload local updates with high relevance scores to the cloud. Further, considering that communication overhead often dominates computation overhead [2], the authors in [11] increased the computation on each device during a local training round by modifying the classic federated averaging algorithm in [2] into LoAdaBoost FedAvg. In our work, by contrast, thanks to the emergence of mobile edge computing (MEC), which migrates computing tasks from the network core to the network edge, we propose a hierarchical Federated Edge Learning (HFEL) framework. In HFEL, mobile devices first upload local models to proximate edge servers for partial model aggregation, which offers faster response rates and relieves core network congestion.
Similarly, some existing literature has also proposed hierarchical federated learning in MEC, such as [19], which demonstrated faster convergence than the FedAvg algorithm. Although a basic hierarchical federated learning architecture was built in [19], the heterogeneity of the mobile devices involved in FL was not considered. When large-scale devices with different dataset qualities, computation capacities and battery states participate in FL, resource allocation needs to be optimized to achieve cost-efficient training.
There has been a range of existing research on resource allocation optimization of mobile devices for different efficiency maximization objectives in edge-assisted FL [29, 21, 27, 22, 30, 8, 25]. Yu et al. worked on federated learning based proactive content caching (FPCC) [29], while Nishio et al. proposed an FL protocol called FedCS to maximize the number of participating devices within a predefined deadline based on their wireless channel states and computing capacities [21]. The authors further extended their study of FedCS in [27], where data distribution differences are considered and addressed by constructing an independent and identically distributed (IID) dataset. In [22], the authors aimed at accelerating the training process by optimizing batch-size selection and communication resource allocation in a federated edge learning (FEEL) framework. [30] explored energy-efficient radio resource management in FL and proposed energy-efficient strategies for bandwidth allocation and edge association. Dinh et al. worked on a resource allocation problem that captures the trade-off between convergence time and energy cost in FL [8], while in [25], local accuracy, transmit power, data rate and the devices' computing capacities were jointly optimized for FL training time minimization.
In our HFEL framework, we target training cost minimization in terms of energy and delay by jointly considering 1) computation and bandwidth resource allocation for each device and 2) edge association for each edge server, under a scenario where more than one edge server is involved in HFEL and each device is able to communicate with multiple edge servers. In contrast, the literature [22, 30, 8, 25] takes only one edge server into account for resource allocation.
VII Conclusion
Federated Learning (FL) has been proposed as an appealing approach to handle the data security issue of mobile devices, compared to conventional machine learning at the remote cloud with raw data. To unlock its great potential for low-latency and energy-efficient FL, we introduce the hierarchical Federated Edge Learning (HFEL) framework, in which model aggregation is partially migrated from the cloud to edge servers. Furthermore, a joint computation and communication resource scheduling model under the HFEL framework is formulated to achieve global cost minimization. Since the minimization problem proves to have extremely high time complexity, we devise an efficient resource scheduling algorithm by decomposing it into two subproblems: resource allocation given a scheduled set of devices for each edge server, and edge association for all the edge servers. Through cost-reducing iterations of solving resource allocation and edge association, our proposed HFEL algorithm terminates at a stable system point, where it achieves substantial performance gains in cost reduction compared with the benchmarks.
Finally, compared to conventional federated learning without edge servers as intermediaries [2], the HFEL framework achieves higher training and test accuracy and lower training loss, as our simulation results show.
References
 [1] (2013) Consumer data privacy in a networked world: a framework for protecting privacy and promoting innovation in the global digital economy. Journal of Privacy and Confidentiality 4 (2).
 [2] (2017) Communication-efficient learning of deep networks from decentralized data. Artificial Intelligence and Statistics, pp. 1273–1282.
 [3] (1996) Processor design for portable systems. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology 13 (2-3), pp. 203–221.
 [4] (2018) Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210.
 [5] (2018) LEAF: a benchmark for federated settings. arXiv preprint arXiv:1812.01097.
 [6] (2015) Efficient multi-user computation offloading for mobile-edge cloud computing. IEEE/ACM Transactions on Networking 24 (5), pp. 2795–2808.
 [7] (2019) EU personal data protection in policy and practice. Springer.
 [8] (2019) Federated learning over wireless networks: convergence analysis and resource allocation. arXiv preprint arXiv:1910.13067.
 [9] (2014) Privacy and big data. Computer 47 (6), pp. 7–9.
 [10] (2016) Deep learning. MIT Press. http://www.deeplearningbook.org
 [11] (2018) LoAdaBoost: loss-based AdaBoost federated machine learning on medical data. arXiv preprint arXiv:1811.12629.
 [12] (2016) Federated optimization: distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527.
 [13] (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
 [14] (2017) Semi-stochastic coordinate descent. Optimization Methods and Software 32 (5), pp. 993–1005.
 [15] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, pp. 2278–2324.
 [16] (2020) Edge AI: on-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications 19 (1), pp. 447–457.
 [17] (2017) Multi-key privacy-preserving deep learning in cloud computing. Future Generation Computer Systems 74, pp. 76–85.
 [18] (2018) Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127.
 [19] (2019) Edge-assisted hierarchical federated learning with non-IID data. arXiv preprint arXiv:1905.06641.
 [20] (2017) Distributed optimization with arbitrary local solvers. Optimization Methods and Software 32 (4), pp. 813–848.
 [21] (2019) Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pp. 1–7.
 [22] (2019) Accelerating DNN training in wireless federated edge learning system. arXiv preprint arXiv:1905.09712.
 [23] (2016) Edge computing: vision and challenges. IEEE Internet of Things Journal 3 (5), pp. 637–646.
 [24] (2004) A note on maximizing a submodular set function subject to a knapsack constraint. Operations Research Letters 32 (1), pp. 41–43.
 [25] (2019) Cell-free massive MIMO for wireless federated learning. arXiv preprint arXiv:1909.12567.
 [26] (2019) CMFL: mitigating communication overhead for federated learning. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 954–964.
 [27] (2019) Hybrid-FL for wireless networks: cooperative learning mechanism using non-IID data. arXiv preprint arXiv:1905.07210.
 [28] (2017) Energy-efficient resource allocation for mobile-edge computation offloading. IEEE Transactions on Wireless Communications 16 (3), pp. 1397–1411.
 [29] (2018) Federated learning based proactive content caching in edge computing. In 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1–6.
 [30] (2019) Energy-efficient radio resource allocation for federated edge learning. arXiv preprint arXiv:1907.06040.
 [31] (2019) Edge intelligence: paving the last mile of artificial intelligence with edge computing. Proceedings of the IEEE 107 (8), pp. 1738–1762.