Vast amounts of data are generated today by mobile devices, from smart phones to autonomous vehicles, drones, and various Internet-of-things (IoT) devices, such as wearable sensors, smart meters, and surveillance cameras. Machine learning (ML) is key to exploiting these massive datasets to make intelligent inferences and predictions. Most ML solutions are centralized; that is, they assume that all the data collected from numerous devices in a distributed manner is available at a central server, where a powerful model is trained on the data. However, offloading these huge datasets to an edge or cloud server over wireless links is often not feasible due to latency and bandwidth constraints. Moreover, in many applications datasets reveal sensitive personal information about their owners, which adds privacy as another concern against offloading data to a centralized server. A recently proposed alternative approach is federated edge learning (FEEL) [ML_overair2, ML_overair3, ML_overair4, int_edge1], which enables ML at the network edge without offloading any data.
Federated learning (FL) is a collaborative ML framework [fedlearn1, fedlearn2], where random subsets of devices are selected in an offline manner to update model parameters based on locally available data. Local models are periodically averaged and exchanged among participating devices. This can either be done with the help of a parameter server, which collects the local model updates and shares the updated global model with the devices; or, in a fully distributed manner, where the devices taking part in the collaborative training process seek consensus on the global model through device-to-device communications.
Although the communication bottleneck of FL has been acknowledged in the ML literature, and various communication-efficient distributed learning techniques have been introduced, the implementation of these techniques on wireless networks, particularly in heterogeneous cellular networks (HCNs), and the successful orchestration of the large-scale learning problem have not been fully addressed. To this end, some very recent works study distributed machine learning with a particular focus on wireless communications [ML_overair1, ML_overair2, ML_overair3, ML_overair4, ML_overair5, ML_overair6, ML_overair7, ML_overair8, FL_wireless1, FL_wireless2]. Most of these works propose a new communication-efficient learning strategy specific to wireless networks, called over-the-air aggregation [ML_overair1, ML_overair2, ML_overair3, ML_overair4, ML_overair5, ML_overair6, ML_overair7, ML_overair8]. In this approach, mobile computing nodes are synchronised for concurrent transmission of their local gradient computations or model updates over the wireless channel, and the parameter server receives a noisy version of the gradient sum by exploiting the superposition property of the wireless channel. Although over-the-air aggregation is a promising solution to mitigate the communication bottleneck in future communication networks, it imposes stringent synchronization requirements and very accurate power alignment, or, alternatively, the use of a very large number of antennas [ML_overair8].
In this paper, we focus on FEEL across HCNs, and introduce a communication-efficient hierarchical ML framework. In this framework mobile users (MUs) with local datasets are clustered around small-cell base stations (SBSs) to perform distributed stochastic gradient descent (SGD) with decentralized datasets, and these SBSs communicate with a macro-cell base station (MBS) periodically to seek consensus on the shared model of the corresponding ML problem. In order to further reduce the communication latency of this hierarchical framework, we utilize gradient sparsification, and introduce an optimal resource allocation scheme for synchronous gradient updates.
A distributed hierarchical SGD framework has recently been studied in [sgd_local4, sgd_local5], and hierarchical FL is considered in [fedlearn8]. However, only a periodic averaging strategy is employed in these works, and the wireless nature of the communication medium is not taken into account. Our contributions in this paper can be summarized as follows:
We introduce a hierarchical FL framework for HCNs and provide a holistic approach for the communication latency with a rigorous end-to-end latency analysis.
We employ communication-efficient distributed learning techniques, in particular sparsification and periodic averaging, jointly, and design a resource allocation strategy to minimize the end-to-end latency.
Finally, focusing on the distributed image classification problem using the popular CIFAR-10 dataset, we demonstrate that, with the proposed approach, the communication latency in a large-scale FEEL framework implemented over HCNs can be reduced dramatically without significantly sacrificing accuracy.
II System Model
Consider a HCN with a single MBS and SBSs. In this cellular setting, MUs collaborate to jointly solve an optimization problem of the form
is a vector of size denoting the parameters of the model to be learned and is the training loss associated with the th training data sample. We assume that MU has a training data set that is not shared with any other MUs due to bandwidth and privacy constraints. Note that calculating the loss over the whole dataset is time consuming and in some cases not feasible, since the dataset may not fit in memory. Thus, we employ minibatch SGD, in which MU uses a subset of its dataset to calculate the loss. We assume that the batch size for all . In distributed learning, each MU calculates the loss gradients with respect to its own data set. Then, the gradients are shared with other MUs through either peer-to-peer links, or using a central entity (MBS in this work). The MBS collects the gradients, aggregates them by taking the average, and eventually transmits the average gradient to all the MUs (footnote 1: Note that the MBS can update the model itself and transmit the updated model instead of the average gradients, thus avoiding replicating the update at MUs. However, it is possible to apply sparsification on the average gradient to improve latency in this manner.). Each MU upon receiving the average gradient applies a gradient descent step as follows:
where is the mini-batch randomly chosen from the data samples of MU , and is the learning rate. The generic federated learning (FL) algorithm is described in Algorithm 1. Algorithm 1 is synchronous in the sense that the MBS waits until the gradients from all the MUs are received.
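The synchronous update described above can be sketched as follows. This is a minimal illustration and not the paper's Algorithm 1: the least-squares loss, the synthetic data, and the names `local_gradient` and `federated_round` are assumptions introduced only for this example.

```python
import random

def local_gradient(theta, batch):
    """Gradient of the mean of 0.5 * (x . theta - y)^2 over one mini-batch."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xi in enumerate(x):
            grad[j] += err * xi / len(batch)
    return grad

def federated_round(theta, datasets, batch_size, lr, rng):
    """One synchronous round: every MU sends a gradient, the server averages them."""
    grads = []
    for data in datasets:                     # one private dataset per MU
        batch = rng.sample(data, batch_size)  # random mini-batch
        grads.append(local_gradient(theta, batch))
    avg = [sum(g[j] for g in grads) / len(grads) for j in range(len(theta))]
    return [t - lr * a for t, a in zip(theta, avg)]  # gradient descent step

# Hypothetical setup: 4 MUs, noiseless linear data generated from true_theta.
rng = random.Random(0)
true_theta = [2.0, -1.0]
datasets = []
for _ in range(4):
    data = []
    for _ in range(50):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, sum(t * xi for t, xi in zip(true_theta, x))))
    datasets.append(data)

theta = [0.0, 0.0]
for _ in range(500):
    theta = federated_round(theta, datasets, batch_size=10, lr=0.5, rng=rng)
```

In this toy setting the shared model approaches `true_theta`, while no MU ever reveals its raw samples to the server.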
) introduce latency to the training time, especially considering deep neural networks with tens of millions of parameters. Hence, an efficient communication protocol is required for this purpose, considering the synchronous nature of Algorithm 1. We assume that the bandwidth available for communication is Hz. We employ an orthogonal access scheme with OFDM, and assign distinct sub-carriers to MUs . Denote by the number of sub-carriers, where is the sub-carrier spacing. We denote the channel gain between MU and MBS on sub-carrier by , where is the complex channel coefficient. The distance of MU to MBS is denoted by , and the path loss exponent by .
II-A Uplink Latency
For the latency analysis, we consider the fixed-rate transmission policy with sub-optimal power allocation introduced in [goldsmith], which is simple to implement and performs close to the optimal water-filling power allocation algorithm. The power allocation policy is truncated channel inversion, which allocates power to a sub-carrier only if its channel gain is above a threshold, and otherwise does not use that sub-carrier.
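As a rough illustration of truncated channel inversion, the sketch below estimates the scaling constant from channel samples and searches a grid for a rate-maximizing threshold. It is a sketch under stated assumptions rather than the analysis of this section: the Rayleigh fading model (exponentially distributed power gain), the sample-based expectation, the standard M-QAM approximation constant -1.5/ln(5 BER), the numerical values, and all function names are assumptions made for this example.

```python
import math
import random

def truncated_inversion(gains, threshold, avg_power):
    """Invert the channel on samples at or above the threshold, stay silent otherwise.

    gains: samples of the normalized channel gain (gain divided by noise power).
    Returns (powers, target_snr), where target_snr is the constant received SNR.
    """
    inv_mean = sum(1.0 / g for g in gains if g >= threshold) / len(gains)
    if inv_mean == 0.0:                   # threshold lies above every sample
        return [0.0] * len(gains), 0.0
    target_snr = avg_power / inv_mean     # makes the mean power equal avg_power
    powers = [target_snr / g if g >= threshold else 0.0 for g in gains]
    return powers, target_snr

def expected_rate(gains, threshold, avg_power, ber, spacing_hz):
    """Expected M-QAM rate on one sub-carrier for a target bit error rate."""
    _, target_snr = truncated_inversion(gains, threshold, avg_power)
    k = -1.5 / math.log(5.0 * ber)        # M-QAM approximation constant
    p_on = sum(1 for g in gains if g >= threshold) / len(gains)
    return spacing_hz * math.log2(1.0 + k * target_snr) * p_on

# Illustrative values only (not the simulation parameters of this paper).
rng = random.Random(1)
gains = [rng.expovariate(1.0) for _ in range(10000)]
powers, target_snr = truncated_inversion(gains, threshold=0.2, avg_power=1.0)
grid = [0.05 * i for i in range(1, 41)]
best_threshold = max(grid, key=lambda t: expected_rate(gains, t, 1.0, 1e-3, 30e3))
```

Raising the threshold increases the received SNR when the sub-carrier is used but lowers the probability of using it, which is why the threshold is worth optimizing.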
Let denote the power allocated to sub-carrier by MU based on the observed channel gain , and let be the set of uplink (UL) sub-carriers assigned to MU . We should satisfy the average power constraint:
where the expectation is with respect to the probability density function (pdf) of the channel gain. Since the channel gain is i.i.d. over sub-carriers, the power constraint becomes
According to the truncated channel inversion policy, the allocated power by MU on sub-carrier becomes
where ensures that the power constraint in (4) is met and,
is the normalized channel gain and is the AWGN noise power on a single sub-carrier. The average power constraint in (4) results in [goldsmith]:
Rather than Shannon capacity, we consider a practical approach where the bits are modulated using -ary QAM. For a given target bit error rate (BER) requirement, the instantaneous UL rate of MU to the MBS on sub-carrier becomes [goldsmith]:
where if the argument is true and otherwise. For the maximum expected transmission rate, we have
The average UL rate of MU for gradient aggregation:
Each MU uses bits to quantize each element of its gradient vector. Since the model has parameters, each MU needs to send bits in total to the MBS at each iteration. To minimize the latency of uploading the gradients to the MBS, we should allocate the sub-carriers so that the minimum average UL rate among MUs is maximized. Hence, we solve the following optimization problem:
For the given solution of (II-A), i.e., the optimal sub-carrier allocation, , the uplink latency of MU on average is
where . Accordingly, the uplink latency in aggregating the gradients of MUs is equal to
II-B Downlink Latency
After all the MUs transmit their gradients to the MBS, the average gradient is calculated and the MBS is required to transmit the average value back to the MUs. Since all workers receive a common message, we employ a broadcast policy for this case. We assume that the MBS employs a rateless coding scheme that is adapted to the worst instantaneous signal-to-noise ratio (SNR) on each sub-carrier. We assume that the MBS allocates its available power uniformly over all sub-carriers. Specifically, let be the SNR of worker on sub-carrier . Then, the instantaneous broadcast rate on sub-carrier becomes:
The broadcast will end when all parameters are received by the workers. The broadcast latency, , can be computed as follows:
where the expectation is with respect to the PDF of the channel gain. Per iteration, the end-to-end latency of the FL protocol is given by
II-C Sub-carrier Allocation Policy
The solution to the sub-carrier allocation problem in (II-A) is presented in Algorithm 2. It starts by assigning a single sub-carrier to each MU. Then, with a single sub-carrier, each MU optimizes the threshold in (11). Then the algorithm looks for an MU with the minimum average UL rate, i.e., , and allocates a single sub-carrier to that MU. Then MU updates its threshold and value. This procedure continues until all available sub-carriers are allocated. The following theorem establishes the optimality of the proposed policy.
The sub-carrier allocation policy in Algorithm 2 is optimal.
The proof is by induction. Let the number of sub-carriers be . Then, a single sub-carrier is allocated to each MU first (line 2), since otherwise there will be a single MU with a rate of zero. The optimal choice is to allocate the remaining single sub-carrier to . To see why, let denote the rate of MU when sub-carriers are allocated. Now consider an alternative policy that allocates the remaining sub-carrier to a different MU . Denote the rates achieved under this alternative policy by . It is obvious that . Thus, the optimal policy allocates the remaining sub-carrier to MU . Now, assume that Algorithm 2 allocates sub-carriers optimally. We need to prove the optimality of the algorithm for . Consider that sub-carriers are allocated and we need to allocate the last sub-carrier. The last sub-carrier is allocated to so that its number of sub-carriers becomes . The alternative policy allocates the last sub-carrier to another MU . It is clear that . Hence, the alternative policy is sub-optimal. This concludes the proof. ∎
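The greedy policy analysed above can be sketched as follows. This is an illustration in the spirit of Algorithm 2, not a transcription of it: the rate model in `make_rate` (a fixed power budget split evenly over the allocated sub-carriers) is a hypothetical stand-in for the average UL rate obtained after threshold optimization, and the function names and numbers are assumptions.

```python
import math

def greedy_allocation(num_subcarriers, rate_fns):
    """Give each MU one sub-carrier, then hand each remaining one to the worst MU.

    rate_fns[k](n) returns the average UL rate of MU k with n sub-carriers and
    is assumed to be non-decreasing in n.
    """
    num_mus = len(rate_fns)
    if num_subcarriers < num_mus:
        raise ValueError("need at least one sub-carrier per MU")
    alloc = [1] * num_mus
    for _ in range(num_subcarriers - num_mus):
        rates = [fn(n) for fn, n in zip(rate_fns, alloc)]
        worst = rates.index(min(rates))   # MU with the minimum average UL rate
        alloc[worst] += 1
    return alloc

def make_rate(snr):
    # Hypothetical rate model: total power split evenly over n sub-carriers.
    return lambda n: n * math.log2(1.0 + snr / n)

rate_fns = [make_rate(s) for s in (1.0, 4.0, 16.0)]
alloc = greedy_allocation(12, rate_fns)
min_rate = min(fn(n) for fn, n in zip(rate_fns, alloc))
```

For small instances such as this one, the resulting minimum rate can be checked against an exhaustive search over all allocations, which is a convenient sanity test of the max-min property.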
III Distributed Hierarchical Federated Learning
In centralized FL [fedlearn6], MUs transmit their computation results (local gradient estimates) to the parameter server (MBS) for aggregation at each iteration. However, in large-scale networks, this centralized framework may result in high communication latency and thus increase the convergence time. To this end, we introduce hierarchical federated learning, in which multiple parameter servers are employed to reduce the communication latency. In the proposed hierarchical framework, MUs are clustered according to their locations. In each cluster a small cell base station (SBS) is tasked with being the parameter server. At each iteration, MUs in a cluster send their local gradient estimates to the SBS for aggregation, instead of the MBS. Then, the SBSs compute the average gradient estimates and transmit the results back to their associated MUs to update their model accordingly.
In this framework, gradient communication is limited to clusters, particularly between the MUs and the corresponding SBSs. This not only reduces the communication distance, and hence the communication latency, but also allows the spatial reuse of the available communication resources. On the other hand, limiting the gradient communications within clusters may prevent convergence to a single parameter model (i.e., global consensus).
To this end, we combine the aforementioned intra-cluster gradient aggregation method with an inter-cluster model averaging strategy, such that after every consecutive intra-cluster iterations, SBSs send their local model updates to the MBS to establish a global consensus. The overall procedure is illustrated in Figure 1.
Denote by the set of MUs belonging to cluster , with being the number of clusters. During each consecutive intra-cluster iterations, the local gradient estimates of the MUs are aggregated within the clusters. For example, at iteration each MU , for computes the local gradient estimate, denoted by , and transmits it to the SBS in cluster . Then, the SBS aggregates the gradients simply by taking the average,
This average is then sent back by the SBS to the MUs in its cluster, and the model at cluster is updated as
After iterations, all SBSs transmit their models to the MBS through UL fronthaul links. The MBS calculates the model average , and transmits it to the SBSs over the DL fronthaul links. Upon receiving the model update, the SBSs share it with the MUs in their cluster. Hence, after iterations all the MUs share common model parameters, globally. The HFL algorithm is presented in Algorithm 3.
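The two-level procedure can be illustrated with a toy example. The sketch below is not Algorithm 3: it uses a scalar model, full local gradients of the hypothetical loss 0.5 * (theta - c)^2 at each MU, and assumed names (`hfl_train`, `period`, `cluster_targets`) chosen only for this illustration.

```python
def hfl_train(cluster_targets, period, num_periods, lr):
    """Toy hierarchical FL with one scalar model per cluster.

    cluster_targets[n][k] is the minimizer c of the local loss
    0.5 * (theta - c)^2 held by MU k in cluster n.
    """
    models = [0.0] * len(cluster_targets)          # one model per cluster (SBS)
    for _ in range(num_periods):
        for _ in range(period):                    # intra-cluster iterations
            for n, targets in enumerate(cluster_targets):
                grads = [models[n] - c for c in targets]  # local gradients at MUs
                avg_grad = sum(grads) / len(grads)        # aggregation at the SBS
                models[n] -= lr * avg_grad                # cluster model update
        global_model = sum(models) / len(models)   # model averaging at the MBS
        models = [global_model] * len(models)      # sent back to every cluster
    return models

# Three clusters with two MUs each; the mean of all targets is 3.0.
cluster_targets = [[1.0, 3.0], [5.0, 7.0], [-2.0, 4.0]]
models = hfl_train(cluster_targets, period=4, num_periods=50, lr=0.1)
```

Between two averaging steps the cluster models drift toward their own local optima; the periodic averaging at the MBS is what restores a single shared model.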
III-A Communication Latency Analysis
In the hierarchical scheme, after clustering the MUs, clusters are colored so that any two clusters with the same color are separated by at least distance to minimize interference between clusters. For simplicity, we assume that there is zero interference on receivers located beyond . If colors are used in total, the available OFDM sub-carriers are divided into groups, and the sub-carriers in each group are allocated to clusters with a particular color. Consequently, in each cluster the number of available OFDM sub-carriers is proportional to . Before the delay analysis, we have the following assumptions regarding the location of the MUs.
The MUs are uniformly distributed and each cluster contains equal number of MUs.
The SBSs are located at the origin of the corresponding clusters.
In the local gradient update step of HFL (see Figure 1(a)), the communication latency analysis is similar to that of FL. The only difference is the number of sub-carriers inside the clusters, which is . Moreover, the MUs transmit to the SBSs instead of the MBS. Denote by the maximum average UL rate of MU . The UL latency of gradient aggregation in cluster is denoted by , . Similarly, let be the maximum average DL rate of MU . The DL latency of gradient aggregation in cluster is denoted by , .
After iterations, SBSs send the model to the MBS for the purpose of averaging the cluster models. Let , be the UL, DL rate of SBSs to the MBS, respectively. The UL, DL latency at each period of iterations become and , respectively. There is also the latency of transmitting the average model by SBSs to their associated MUs. The average latency associated with a period of hierarchical distributed SGD becomes
where and is the latency of -th iteration of UL and DL aggregation in cluster , respectively. The average per iteration latency of HFL becomes .
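The latency bookkeeping for one period can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the way clusters are combined (clusters run in parallel, so the slowest cluster determines the duration of the intra-cluster phase and of the final model delivery) is an assumption made here rather than a restatement of the exact expression above.

```python
def hfl_period_latency(ul, dl, fronthaul_ul, fronthaul_dl, model_dl):
    """Latency of one period: intra-cluster iterations plus one global averaging.

    ul[n][i], dl[n][i]: UL and DL latency of iteration i in cluster n (seconds).
    model_dl[n]: latency of the SBS in cluster n delivering the averaged model.
    """
    intra = max(
        sum(u + d for u, d in zip(ul_n, dl_n)) for ul_n, dl_n in zip(ul, dl)
    )
    return intra + fronthaul_ul + fronthaul_dl + max(model_dl)

def hfl_per_iteration_latency(ul, dl, fronthaul_ul, fronthaul_dl, model_dl):
    """Average latency per iteration: period latency divided by the period length."""
    period = len(ul[0])
    total = hfl_period_latency(ul, dl, fronthaul_ul, fronthaul_dl, model_dl)
    return total / period

# Two clusters, two iterations per period, made-up latencies in seconds.
ul = [[1.0, 1.0], [2.0, 1.0]]
dl = [[0.5, 0.5], [0.5, 0.5]]
period_latency = hfl_period_latency(ul, dl, 0.1, 0.1, [0.5, 0.25])
per_iteration = hfl_per_iteration_latency(ul, dl, 0.1, 0.1, [0.5, 0.25])
```

The fronthaul and model delivery terms are paid once per period, so their share of the per-iteration latency shrinks as the period grows.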
IV Sparse Communications
The trend toward deeper neural networks (NNs) for higher accuracy has resulted in NNs with tens of millions of parameters. The amount of data that needs to be communicated is challenging even for cable connections, let alone wireless links. On top of periodic parameter averaging, sparse communication can be used to significantly improve the latency. Sparsification exploits the fact that the gradient vector is sparse to transmit only a fraction (i.e., ) of the parameters and thereby considerably reduce latency.
To make sure that all gradients are transmitted eventually, a separate parameter vector, is used to accumulate the error associated with the gradients that are not transmitted. The gradients that are not transmitted will grow in time, as recorded by , and eventually will be transmitted. More specifically the error buffer is calculated as
Now each MU , instead of transmitting transmits to be aggregated at the MBS (or SBSs in the clusters).
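The interplay between sparsification and the error buffer can be sketched as follows. The sketch assumes magnitude-based selection of the largest entries, which may differ from the exact selection rule intended here, and the names `top_k_sparsify` and `sparse_step` are hypothetical.

```python
def top_k_sparsify(vec, fraction):
    """Keep the largest-magnitude `fraction` of the entries and zero the rest."""
    k = max(1, int(round(fraction * len(vec))))
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

def sparse_step(grad, error, fraction):
    """Add the stored error to the new gradient, send the top entries, keep the rest."""
    acc = [g + e for g, e in zip(grad, error)]
    sent = top_k_sparsify(acc, fraction)
    new_error = [a - s for a, s in zip(acc, sent)]  # what was not transmitted
    return sent, new_error

grad = [0.5, -0.1, 0.05, 2.0]
error = [0.0] * 4
sent1, error = sparse_step(grad, error, fraction=0.5)
sent2, error = sparse_step(grad, error, fraction=0.5)
```

Nothing is lost by this scheme: at any point, the sum of everything sent plus the current error buffer equals the sum of all gradients computed so far, so small entries keep growing in the buffer until they are selected.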
Note that the vanilla SGD used in Algorithms 1 and 3 is the simplest form of an optimizer, and its performance is quite poor in large-scale optimization problems. An efficient way of accelerating vanilla SGD is to apply the momentum method. In the momentum method, the parameters are updated as follows:
where is the momentum and is the aggregated gradient.
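A common form of the momentum update is sketched below for reference. The exact form of the update above may differ (for instance in where the learning rate enters), so this should be read as the standard heavy-ball variant under that assumption; `momentum_step` and its argument names are hypothetical.

```python
def momentum_step(theta, velocity, grad, lr, momentum):
    """One momentum SGD step: update the velocity, then move the parameters."""
    velocity = [momentum * v + g for v, g in zip(velocity, grad)]
    theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta, velocity

theta, velocity = [1.0], [0.0]
theta, velocity = momentum_step(theta, velocity, grad=[2.0], lr=0.1, momentum=0.9)
theta, velocity = momentum_step(theta, velocity, grad=[2.0], lr=0.1, momentum=0.9)
```

With a constant gradient the velocity keeps growing across steps, which is the acceleration effect that momentum provides over vanilla SGD.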
Directly applying momentum to the sparsified gradients results in poor performance, and momentum correction is required. Here, we directly employ the method proposed in [SGD_sparse1].
Sparsification delays transmitting gradients that are too small. When they are finally transmitted they become stale and slow down the convergence. To combat the staleness [SGD_sparse1], we apply the inverted sparsification to both accumulated gradients and momentum factor as follows:
The mask simply prevents stale momentum from being applied. The detailed algorithm for sparse federated SGD is presented in Algorithm 4.
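Momentum correction with masking can be sketched as below. The structure follows the general idea of accumulating the momentum-corrected velocity and clearing both buffers at the transmitted positions; the magnitude-based selection, the function name `masked_momentum_step`, and the buffer names are assumptions for this illustration and are not taken from Algorithm 4.

```python
def masked_momentum_step(grad, momentum_buf, accum_buf, momentum, fraction):
    """Local step with momentum correction and inverted-mask clearing."""
    # Momentum correction: apply momentum locally, before sparsification.
    u = [momentum * m + g for m, g in zip(momentum_buf, grad)]
    # Accumulate the velocity instead of the raw gradient.
    v = [a + ui for a, ui in zip(accum_buf, u)]
    k = max(1, int(round(fraction * len(v))))
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    top = set(order[:k])
    sent = [vi if i in top else 0.0 for i, vi in enumerate(v)]
    # Inverted mask: clear both buffers wherever a value was transmitted.
    u = [0.0 if i in top else ui for i, ui in enumerate(u)]
    v = [0.0 if i in top else vi for i, vi in enumerate(v)]
    return sent, u, v

grad = [1.0, 0.1, -3.0, 0.2]
zeros = [0.0] * 4
sent, momentum_buf, accum_buf = masked_momentum_step(grad, zeros, zeros, 0.9, 0.25)
```

Clearing the momentum buffer at the transmitted positions is what keeps old momentum from steering a coordinate that has just been updated.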
|, , ,||Sparsification parameters for uplink from MU to SBS, downlink from SBS to MU, uplink from SBS to MBS and downlink from MBS to SBS.|
|, ,||Model errors due to sparsification before downlink from MBS to SBSs, downlink from to MU and uplink from to MBS respectively.|
|, and||Parameter model at th MU, th SBS and MBS respectively.|
|, and||Model difference sent to SBSs from MBS, to MBS from and to MUs from respectively.|
IV-A Sparse Communication and Error Accumulation
Our proposed HFL framework consists of 4 communication steps: uplink from MU to SBS, downlink from SBS to MU, uplink from SBS to MBS and downlink from MBS to SBS. For each communication step, we employ different sparsification parameters, , , and respectively, to speed up the communication. We introduce the function , which maps a dimensional vector to its sparse form where only portion of the indices have non-zero values.
The sparsification procedure in each step leads to an error in the parameter model and thus slows down the convergence. To overcome this issue we employ the discounted error accumulation technique, similar to [SGD_q4, fedlearn7], which uses the discounted version of the error for the next model update. Before giving the details of the error accumulation strategy, we introduce the parameters and , which denote the parameter models at the th SBS and the MBS, respectively. We note that to employ sparsification for model averaging it is more convenient to transmit the model difference rather than the model. To this end, we introduce the reference models and for SBSs and MBS respectively so that each SBS sends the model difference based on to the MBS and based on to the corresponding MUs. In the proposed HFL framework, we use , , and to keep the model errors due to sparsification before downlink from MBS to SBSs, downlink from to MU and uplink from to MBS respectively. The overall procedure for the HFL framework is illustrated in Algorithm 5, where the error accumulation strategy is employed at lines 21, 28, 34 and parameters and are the discount factors for the error accumulation.
V Numerical Results
|Number of sub-carriers|
|MBS Tx power||W|
|SBS Tx Power||W|
|MU Tx power||W|
V-A Network topology
We consider a circular area with radius meters, inside which users are generated uniformly at random. We consider hexagonal clusters where the diameter of the inscribed circle is meters. The SBSs reside exactly at the centers of the hexagons. To mitigate the interference between the clusters, we use a simple reuse pattern [reuse] as shown in Figure 2. We assume that the fronthaul link is times faster than the UL, DL between MUs and SBSs (footnote 2: For a MIMO the fronthaul rate estimate is Gbps with 3GPP split option 2 and Gbps with 3GPP split option 7 [5gfronthaul].). The total number of clusters is .
There are sub-carriers with a sub-carrier spacing of kHz. The maximum transmit powers of the MBS and SBSs are and W, respectively, and the maximum transmit power of the MUs is W [earth].
V-B Implementation guideline
In our numerical analysis, we consider the CIFAR-10 dataset for the image classification problem with 10 different image classes [dataset2] and train the ResNet18 architecture [training2]. For further details regarding the trained NN structures please refer to [nn].
For the simulations, we also utilize some large batch training tricks such as scaling the learning rate and employing a warm-up phase [training1]. In all simulations, data sets are divided among the MUs without any shuffling, and through the iterations MUs train on the same subset of the dataset, as in the FL framework; we set the batch size for training to . In general, batch size is accepted as the baseline batch size with the corresponding learning rate and the initial learning rate is adjusted according to the cumulative batch size [training1, training2]. Hence, we set the initial learning rate to . We also consider the first epoch as the gradual warm-up phase, where training starts with a small learning rate that is then increased linearly at each iteration until it reaches the given initial learning rate. For the network training, we follow the guideline in [sgd_local4]: we train the network for 300 epochs, and at the end of the 150th epoch we drop the learning rate by a factor of 10, and similarly at the end of the 225th epoch we drop the learning rate by a factor of 10 again. Further, for the weight decay factor (footnote 3: We do not apply weight decay to batch normalization parameters.) and the momentum term we use and , respectively, in all simulations. Finally, for the discounted error accumulation we set and .
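The learning rate schedule described above (linear warm-up over the first epoch, then drops by a factor of 10 after epochs 150 and 225) can be sketched as follows. The function name, the zero-based iteration counter, and the placeholder values `base_lr=1.0` and `iters_per_epoch=100` are assumptions for illustration and are not the values used in the simulations.

```python
def learning_rate(iteration, iters_per_epoch, base_lr, warmup_start=0.0):
    """Learning rate at a given zero-based iteration index."""
    epoch = iteration // iters_per_epoch
    if epoch == 0:
        # Gradual warm-up: increase linearly and reach base_lr at the last
        # iteration of the first epoch.
        step = (base_lr - warmup_start) * (iteration + 1) / iters_per_epoch
        return warmup_start + step
    if epoch < 150:
        return base_lr
    if epoch < 225:
        return base_lr / 10      # after the end of the 150th epoch
    return base_lr / 100         # after the end of the 225th epoch

first = learning_rate(0, iters_per_epoch=100, base_lr=1.0)
end_of_warmup = learning_rate(99, iters_per_epoch=100, base_lr=1.0)
after_first_drop = learning_rate(150 * 100, iters_per_epoch=100, base_lr=1.0)
after_second_drop = learning_rate(225 * 100, iters_per_epoch=100, base_lr=1.0)
```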
We first study the amount of speed up in latency achieved by HFL versus FL. We measure the speed by comparing the per iteration latency of HFL i.e., and FL, . More specifically, speed up . By varying the number of MUs in each cluster, and for different periods of , we plot the speed up achieved by HFL, when sparsity parameters , , , are used, in Figure 3. We observe that HFL achieves good latency speed up with respect to FL and it improves when the period increases.
Clustering reduces the communication distance and as a result improves the SNR. The amount of improvement depends on the amount of reduction in path-loss. In Figure 4, we illustrate the amount of latency speed up due to clustering as a function of the path-loss exponent, . When the path-loss exponent is increased, the SNR in centralized scheme is punished more severely than clustering scheme due to longer communication paths. Thus, the latency speed up improves when the path-loss is more severe.
To see the importance of sparsification, we compare HFL and FL with sparse HFL and sparse FL in Figure 5(a) and 5(b), respectively. For both FL and HFL, sparsification provides a significant improvement. However, the latency improvement in HFL is more robust with respect to increasing the number of MUs. This is due to the fact that the MBS serves more MUs than a single SBS, and hence the scarcity of resources in the macro cell has more impact on the latency than in small cells.
The Top-1 accuracy achieved by the FL and HFL algorithms is illustrated in Figure 6 for the CIFAR-10 data set trained with ResNet18. We observe that the latency speed up delivered by HFL over FL schemes does not compromise the accuracy of the ML model. In fact, a closer look at the accuracy (average over runs) presented in Table III shows that HFL is able to achieve a better accuracy than FL in all situations. The results for the last epoch are reported in Table III, where the Baseline result is obtained by training a single MU on the whole training set. We observe a small degradation in the accuracy of the proposed hierarchical distributed learning strategy relative to the Baseline. We conjecture that this degradation is mainly due to the use of momentum SGD at each MU instead of a global momentum, and due to sparsification. We also observe that FL, based on [SGD_sparse1], performs poorly compared to HFL. We believe that this poor performance is mainly due to the downlink sparsification that we consider.
We believe that the accuracy of HFL can be improved further by using an additional global momentum term [sgd_local4] or a momentum averaging strategy [sgd_local3].
|CIFAR-10 [dataset2] / ResNet18 [training2]||Baseline||-|
|CIFAR-10 [dataset2] / ResNet18 [training2]||FL||28 MUs|
|CIFAR-10 [dataset2] / ResNet18 [training2]||HFL,||clusters MUs|
|CIFAR-10 [dataset2] / ResNet18 [training2]||HFL,||clusters MUs|
|CIFAR-10 [dataset2] / ResNet18 [training2]||HFL,||clusters MUs|
As mentioned above, a momentum correction strategy for the clusters can help to improve the accuracy as well as the convergence speed. We will consider this as future work.
In addition, one of the recent research areas regarding the federated learning framework is the non-IID data distribution among the MUs [fedlearn5, fedlearn7, fedlearn8, fedlearn9]; hence, we are planning to extend our study to the non-IID distribution scenario as well. Finally, another research direction that we are planning to investigate is the optimal batch size for MUs. In [training3], it has been shown that the learning speed increases with the batch size up to a certain point; hence, we believe that a good strategy for federated learning is to adjust the batch size according to the number of users. We are planning to extend our analysis in this direction and to carry out an extended analysis of the training time, including the computation time of the MUs.