I Introduction
With the rapid development of machine learning (ML) techniques, emerging applications, including virtual and augmented reality, the Internet of things, autonomous driving and e-health services, are penetrating into human lives [1]. ML models for these applications are typically trained in central clouds. However, centralized training leads to high communication costs, and causes privacy concerns in applications that involve sensitive personal data. Meanwhile, end devices such as smartphones, vehicles and sensors, as well as infrastructures like base stations (BSs) and roadside units, are being equipped with more computing resources, enabling intensive computations at the network edge, namely multi-access edge computing [2, 3, 4]. With the help of edge intelligence, and to address the privacy concerns, a distributed ML framework called federated learning (FL) has been proposed recently [5, 6, 7], where end devices, called workers, learn a shared ML model collaboratively using their local data, with the help of a central parameter server (PS) which aggregates the global model and coordinates the training process. Since the PS acquires the model update from each worker rather than their data, privacy is preserved.
High communication costs and non-independent and identically distributed (non-i.i.d.) data are the two major bottlenecks in FL [7]. According to [8], when using highly non-i.i.d. data for FL, the accuracy drops substantially for both MNIST and CIFAR-10 compared to using i.i.d. data. The authors show that the non-i.i.d. level, i.e., the difference between the local and global data distributions, is the root cause of the performance degradation. This problem is tackled by sharing publicly available i.i.d. data with the workers in [8], or by workers sharing a limited portion of their data with the PS in [9].
The communication burden of FL mainly comes from the global model aggregation, which can be reduced by efficient scheduling and resource allocation [10, 11, 12, 13, 14], gradient quantization and sparsification [14, 15, 16], or via analog aggregation [15, 16, 17]. An analytical study on the convergence rate achieved by random, round-robin and proportional-fair scheduling policies is carried out in [10]. An energy-efficient bandwidth allocation and worker scheduling scheme is proposed in [11], minimizing the energy consumption while maximizing the fraction of workers scheduled. A more general resource constraint, including both communication and computing resources, is considered in [12, 13]. In [14], a hierarchical FL architecture is proposed, and the end-to-end latency is minimized by jointly considering model sparsification and the two-tier update interval. Quantization and error accumulation techniques are further considered in [15, 16] to reduce the communication cost.
Most papers on FL consider digital transmission for global model aggregation. However, the communication latency then scales with the number of workers [17]. Observing that the PS is interested only in the average of the local models rather than their individual values, a promising solution is analog aggregation, which exploits the waveform-superposition property of a wireless multiple access channel (MAC) [15, 16, 17]. If the workers synchronize with each other and align their transmit power, the summation of local models can be carried out over-the-air. The tradeoff between signal-to-noise ratio (SNR) and the amount of exploited data is analyzed in [17], while gradient compression and error accumulation are considered in [15, 16] to further improve the bandwidth efficiency of FL. While over-the-air aggregation requires channel state information at the workers, it is shown in [18] that this requirement can be relaxed if the PS has multiple antennas.
Existing papers on analog aggregation mainly consider power allocation under specific channel models, and have not addressed non-i.i.d. data. In this work, we consider analog aggregation for FL, where each worker has a long-term energy budget, and data redundancy is introduced to the system via data exchange or overlapped data collection. We propose an energy-aware dynamic worker scheduling policy, which maximizes the average weighted fraction of scheduled workers without assuming specific channel models or requiring any future information, and analyze its performance based on Lyapunov optimization. Experiments using the MNIST dataset show that data redundancy can bring significant accuracy improvement when data is non-i.i.d., while the proposed policy can smartly utilize the available energy to achieve close-to-optimal accuracy.
The rest of the paper is organized as follows. In Sec. II, we introduce the system model and problem formulation. The worker scheduling policy is proposed in Sec. III, along with its performance analysis. Experimental results are presented in Sec. IV, and the paper is concluded in Sec. V.
II System Model and Problem Formulation
II-A Federated Learning Architecture
As shown in Fig. 1(a), we consider an FL system with a single PS and a set of homogeneous workers. To tackle non-i.i.d. data, we consider introducing data redundancy to the system via 1) data exchange, i.e., workers exchange their local data with neighboring workers they trust, or 2) overlapped collection, i.e., in IoT networks, the sensing areas of sensors (workers) overlap with each other. For example, in Fig. 1(b), there are 4 data collection points, and each worker collects data from 2 points. As opposed to [8] and [9], this does not require sharing any data samples with the PS. We assume that there are original datasets generated by the workers (in the data exchange case) or by the data collection points (in the overlapped collection case), each with the same number of data samples, and the global dataset is the union of the original datasets. For simplicity and without loss of generality, we assume in the following that the number of original datasets equals the number of workers.
The data redundancy of the system indicates that each original dataset is stored at multiple different workers. One example of obtaining redundancy is to collect or exchange the original datasets in a cyclic manner: each worker stores the original datasets whose indices follow the cyclic rule
(1) 
Fig. 1(b) is an example of obtaining redundancy in the data collection scenario.
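To make the cyclic placement concrete, the following sketch implements one plausible form of the rule in (1); the index formula, the function name `cyclic_assignment`, and the use of 1-based dataset indices are illustrative assumptions, not the paper's exact notation.

```python
# Cyclic data-placement sketch (assumed form of rule (1)):
# K workers, K original datasets, redundancy r, i.e., each original
# dataset ends up stored at exactly r different workers.

def cyclic_assignment(K: int, r: int) -> dict[int, list[int]]:
    """Return, for each worker k in 1..K, the indices of the r datasets it stores."""
    return {k: [((k - 1 + j) % K) + 1 for j in range(r)] for k in range(1, K + 1)}

# Example matching Fig. 1(b): 4 collection points, each worker keeps 2 of them.
placement = cyclic_assignment(4, 2)
```

Under this rule, every dataset index appears at exactly `r` workers, which is the redundancy property the paper relies on.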
The goal of FL is to minimize the global loss
(2) 
where
is a loss function designed for the FL task, and
is the parameter vector to be optimized.
In each training round, the PS broadcasts the global parameter vector obtained in the last round to all the workers. We assume that the PS is a more capable node with sufficient energy resources (e.g., a BS); therefore the broadcast of the global parameter vector is error-free. Each worker randomly picks a fraction of its local data samples to evaluate its local gradient estimate as: (3)
The sampling fraction can be chosen such that data redundancy does not bring additional computing workload to the workers for training, but only increases the storage cost.
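The local gradient step (3) can be sketched as follows, using a linear least-squares loss as a stand-in for the paper's generic FL loss; the names `local_gradient` and `nu` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_gradient(w, X, y, nu=0.5):
    """Minibatch gradient estimate: sample a fraction nu of the local data,
    then return the gradient of 0.5 * ||X_b w - y_b||^2 / batch_size."""
    n = len(y)
    idx = rng.choice(n, size=max(1, int(nu * n)), replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)
```

With `nu < 1`, the per-round computation stays fixed regardless of how much data redundancy adds to the worker's local dataset, mirroring the remark above.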
We define an indicator function for each worker, equal to one if the worker is scheduled to upload its gradient in the current round, and zero otherwise. Further define the set of workers scheduled in each round. The global parameter vector is updated according to
(4) 
where is the learning rate.
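A minimal sketch of the global update (4), assuming the PS averages the received gradients of the scheduled workers and takes a descent step; the names `eta`, `grads`, and `scheduled` are illustrative.

```python
import numpy as np

def global_update(w, grads, scheduled, eta=0.05):
    """One round of (4): w <- w - eta * mean of gradients from scheduled workers.
    If no worker is scheduled, the global model is left unchanged."""
    active = [g for g, s in zip(grads, scheduled) if s]
    if not active:
        return w
    return w - eta * np.mean(active, axis=0)
```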
II-B Analog Aggregation
For the aggregation of the local gradients, we consider analog transmission via a wireless MAC with multiple subchannels. If a worker is scheduled, its local gradient is evenly partitioned into segments of (approximately) equal length, each transmitted via a separate subchannel.
In order to carry out the summation of the local gradients over-the-air, all the scheduled workers need to be synchronized and align their transmit power. Specifically, in each round, the power allocated to a worker within each subchannel satisfies
(5) 
where the first quantity is the channel gain between the worker and the PS in the given subchannel, and the second is a power scalar that determines the received SNR. We assume that the channel gain remains constant within each round, but we do not limit it to any specific distribution. We also assume that each worker has perfect knowledge of its current channel gains. Within each subchannel, each scheduled worker transmits its power-scaled gradient segment to the PS. The total communication latency in each round is proportional to the segment length in symbols, regardless of the number of workers scheduled. We remark that the consideration of subchannels enables us to implement analog aggregation in current digital transmission systems, such as orthogonal frequency division multiplexing (OFDM), with minor changes [17]. However, we consider worker-level scheduling rather than per-subchannel scheduling in this work, so that the PS receives the entire gradient of each scheduled worker, as shown in (4).
At the PS side, the received signal over each subchannel can be written as
(6) 
where
is an i.i.d. additive white Gaussian noise (AWGN) vector, with each entry following the standard normal distribution. The corresponding segment of the global parameter vector is updated according to (7)
Finally, the updated segments are concatenated to form the global parameter vector.
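The aggregation chain of (5)-(7) can be sketched, in a real-valued simplification, as channel-inversion pre-scaling at the workers, superposition over the MAC with AWGN, and rescaling at the PS; the function `ota_aggregate` and its arguments are illustrative assumptions, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(1)

def ota_aggregate(segments, h, gamma=1.0, noise_std=1.0):
    """Over-the-air average of one gradient segment per worker:
    each worker k pre-scales its segment by sqrt(gamma)/h_k (channel inversion,
    cf. (5)), the channel applies h_k and the signals superpose (cf. (6)),
    AWGN is added at the PS, which divides by K*sqrt(gamma) to recover the
    (noisy) average (cf. (7))."""
    K, d = len(segments), len(segments[0])
    y = np.zeros(d)
    for g_k, h_k in zip(segments, h):
        y += h_k * (np.sqrt(gamma) / h_k) * g_k  # aligned received power
    y += noise_std * rng.normal(size=d)          # receiver AWGN
    return y / (K * np.sqrt(gamma))
```

Note that the latency of this step is independent of the number of workers, which is the key advantage of analog aggregation highlighted in the introduction.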
II-C Problem Formulation
Our objective is to minimize the global loss after a given number of training rounds, by optimizing the worker schedule and the power allocation. Meanwhile, we also want to explore how data redundancy impacts the performance. The problem is formulated as:
(8a)  
(8b)  
(8c)  
(8d) 
The first constraint (8b) states that the average energy consumed by each worker in each training round cannot exceed its budget, due to the battery limitation of wireless devices.¹ The second constraint (8c) states that all the scheduled workers align their power to enable over-the-air computation.

¹We unify each round to a unit time length, and use power and energy interchangeably in this paper without any ambiguity.
Since the loss function usually differs across machine learning tasks, and the evolution of the parameters during the training process is very complex, it is hard to express the objective explicitly. Meanwhile, the convergence rate of distributed SGD is found to be positively correlated with the number of workers scheduled, as shown in [11] and references therein. Therefore, we consider an alternative optimization problem that maximizes the average weighted fraction of scheduled workers:
(9) 
where the weight characterizes the importance of scheduling more workers in each round.
We further fix the power scalar to a predefined value. This assumption is based on the fact that, with analog aggregation, the convergence speed of the FL task is not very sensitive to the SNR or the average transmit power, according to [16]. The transmit power can then be obtained from (8c) after deciding whether or not to schedule each worker. Finally, the energy consumption is given by
(10) 
The alternative problem can be formulated as
(11a)  
(11b)  
(11c) 
III Energy-Aware Worker Scheduling
The key challenge in solving the problem is that constraint (11b) imposes a long-term energy budget. In practice, the channel gains and the power of the gradients cannot be acquired in advance, and they may not be i.i.d. over time. Therefore, we design online worker scheduling policies in this section, and carry out performance analysis without assuming any specific distributions for the channel.
For any worker, it is easy to see that its energy constraint (11b) and its scheduling decision are independent of the other workers. The problem can then be equivalently decoupled into per-worker subproblems
(12a)  
(12b)  
(12c) 
Combining the optimal solutions of the per-worker subproblems for all workers yields the optimal solution of the original problem, and by solving its own subproblem, each worker can decide individually whether or not to upload its gradient. In what follows, we design online solutions to these subproblems.
III-A Myopic Scheduling under a Short-Term Fixed Energy Constraint
(13) 
A simple way to handle the average energy constraint (12b) is to remove the long-term summation and transform it into a short-term fixed energy constraint that must hold in every round. The myopic scheduling policy is then given by
(14) 
In each round, a worker acquires its current channel gains and the power of its gradient, and calculates the energy required to send its gradient estimate. If the required energy does not exceed the per-round budget, the worker is scheduled.
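A sketch of the myopic rule in (13)-(14), assuming the required energy takes the channel-inversion form sketched earlier; both the function names and the exact energy expression standing in for (10) are assumptions.

```python
import numpy as np

def required_energy(g_segments, h, gamma=1.0):
    """Energy needed to transmit all gradient segments with channel-inversion
    power control; one plausible form of (10): gamma * sum_m ||g_m||^2 / |h_m|^2."""
    return gamma * sum(np.sum(g**2) / abs(hm)**2 for g, hm in zip(g_segments, h))

def myopic_schedule(E_kt, E_bar):
    """Myopic decision: transmit (1) only if this round's energy fits the
    per-round budget E_bar, otherwise stay silent (0)."""
    return 1 if E_kt <= E_bar else 0
```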
III-B Energy-Aware Dynamic Scheduling under a Long-Term Average Energy Constraint
Although the myopic scheduling policy is simple and satisfies the original long-term average energy budget, it actually imposes a much tighter energy constraint. In other words, the worker uses its energy more conservatively, and is less likely to be scheduled than the original energy budget (12b) would allow.
To schedule workers more efficiently, we propose an energy-aware dynamic scheduling policy, as shown in Algorithm 1. We construct a virtual energy-deficit queue for each worker, initialized to zero, whose evolution is given by
(15) 
The value of the virtual queue indicates the deficit between the current energy consumption and the budget.
As shown in Lines 3-4 of Algorithm 1, in each round, the PS first broadcasts the up-to-date global parameter vector, and each worker runs a local gradient estimation step in parallel based on its local data. In Lines 5-6, each worker makes its scheduling decision by comparing the queue-weighted energy with the weighted utility, where the weight is an adjustable parameter. The virtual queue transforms the long-term energy budget into an instantaneous energy constraint: if the queue is large, the weighted energy is more likely to dominate, so the worker tends not to upload its gradient in order to save energy, and vice versa. Moreover, the scheduling decision can be made without any future information, or other workers' states and decisions. Therefore, the proposed algorithm is energy-aware, online and distributed. As shown in Lines 7-9, all the workers then update their virtual queues, and the scheduled workers transmit their gradients to the PS synchronously. The global parameter vector is finally aggregated by the PS.
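A per-worker round of Algorithm 1 can be sketched as follows; the threshold comparison and the queue update (15) are reconstructed in the standard Lyapunov drift-plus-penalty form, and should be read as an assumption rather than the paper's verbatim rule.

```python
def dynamic_step(q, E_t, E_bar, alpha_t, V):
    """One round of the energy-aware dynamic policy for one worker.
    q       : current virtual-queue (energy-deficit) value
    E_t     : energy this round's transmission would cost
    E_bar   : long-term per-round energy budget
    alpha_t : utility weight of scheduling this worker in round t
    V       : adjustable utility/energy trade-off weight
    Transmit when the weighted utility V*alpha_t outweighs the
    queue-weighted energy q*E_t; then accumulate the deficit as in (15)."""
    x = 1 if V * alpha_t >= q * E_t else 0
    q_next = max(q + x * E_t - E_bar, 0.0)
    return x, q_next
```

A large queue makes `q * E_t` dominate, suppressing transmission until the deficit drains at rate `E_bar` per round, which is exactly the self-regulating behavior described above.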
III-C Performance Analysis
To analyze the performance of the proposed algorithm, we resort to the Lyapunov optimization technique [19] in this subsection. We consider a non-ergodic version of Lyapunov optimization, i.e., all the random variables can be non-i.i.d. across time; that is, 1) the distribution of the power of the gradient is unknown; 2) the channel and user mobility are not limited to specific models; 3) the total number of rounds is finite. Define the optimal utility achieved by the offline genie-aided solution, the average utility achieved by Algorithm 1, and the optimal utility of the decoupled per-worker subproblem. By applying the energy-aware dynamic scheduling policy, we obtain the following theorem.
Theorem 1.
Under the stated assumptions, the average weighted fraction of scheduled workers achieved by Algorithm 1 satisfies:
(16) 
and the total energy consumption of worker is bounded by:
(17) 
where the constant depends on the maximum per-round energy deficit.
Proof.
See Appendix A. ∎
Theorem 1 shows that the average utility and the total energy consumption achieved by the proposed energy-aware dynamic scheduling policy deviate by bounded amounts from the optimal genie-aided policy and the energy budget, respectively. Both deviations are positively correlated with the maximum energy deficit, and can be traded off via the weight parameter.
III-D Discussion
We remark that, according to (13), when the virtual queue is empty, the scheduling indicator always equals one. However, the energy cost in such a round may be very high, leading to a large upper bound on the energy deficit, and thus a large deviation from the optimal genie-aided solution. Moreover, the worker may then not be scheduled for many rounds afterwards, which in turn reduces the average utility. Therefore, although the classical Lyapunov optimization setting guarantees the worst-case utility and energy consumption, in the experiments we adopt a different parameter setting, which is more reasonable.
In Algorithm 1, the weight parameter also balances utility against energy consumption. In practice, we can set it to a relatively large value at the beginning of the training process and decrease it over time, since: 1) scheduling more workers in the initial training rounds helps the learning process converge faster; 2) the power of the gradients decreases over time, according to the experiments below; 3) the battery level decreases as time goes by, so energy should be used more and more conservatively.
IV Experiments
In this section, we evaluate the gain of data redundancy and the performance of the proposed worker scheduling policy on a digit recognition FL task, using the MNIST dataset² with its training and test samples. The numbers of workers, subchannels and training rounds are fixed throughout. We divide the training set in either an i.i.d. or a non-i.i.d. fashion. For the i.i.d. data partition, the data samples are randomly partitioned into equal-sized datasets. For the non-i.i.d. case, the data samples are first sorted by digit, and each dataset is formed by data samples from a single digit. Each worker stores a subset of these datasets in a cyclic manner according to (1). During each round, a fraction of data samples is randomly chosen from the data samples the worker stores for training, so that the amount of computation for training is the same under different data redundancies.

²http://yann.lecun.com/exdb/mnist/
We train a multilayer perceptron (MLP) with a 784-neuron input layer, a 64-neuron hidden layer, and a 10-neuron softmax output layer. The total number of parameters is 50890. Cross entropy is adopted as the loss function, and rectified linear unit (ReLU) activation is used. We set the learning rate to 0.05, the dropout probability to 0.5, and use a momentum of 0.5. We consider Rayleigh fading channels from the workers to the PS, where the channel gain follows the standard complex normal distribution, and each entry of the additive noise vector follows the standard normal distribution. The power scalar and the parameters of the energy-aware dynamic scheduling policy are set to fixed values.

In Fig. 2, we first explore the impact of data redundancy by evaluating the accuracy of the MLP using the myopic scheduling policy under both i.i.d. and non-i.i.d. data. The energy budget is set to the same value for all workers. Overall, the accuracy with i.i.d. data outperforms that with non-i.i.d. data. When data is i.i.d., redundancy hardly brings any benefit to the system, as the workers can already use images of different digits to train the model even without redundancy. However, data redundancy can significantly improve the accuracy of the MLP under non-i.i.d. data. Specifically: 1) Given the energy budget, as the data redundancy increases, the accuracy of the model increases with diminishing marginal gain. 2) Increasing the energy budget helps to improve the performance of the MLP, since more workers can be scheduled in each round, while the system is less sensitive to the energy constraint in the case of i.i.d. data.
In Fig. 3, we plot the power of the gradients to guide the parameter design of the energy-aware dynamic scheduling policy. Each scatter point represents the power of the gradient of one worker in a given round, and the lines are obtained by averaging across workers. We find that the power of the gradients drops dramatically in the first 10-15 training rounds, and is quite stationary afterwards. Motivated by this observation, the weight parameter is set as:
(18) 
The performance of the proposed energy-aware dynamic scheduling policy is presented in Fig. 4. As shown in Fig. 4(a), we compare the accuracy of the dynamic scheduling policy with the myopic policy and an upper bound, which is achieved by relaxing the energy constraint so that all the workers can be scheduled in each round. We can see that, under both i.i.d. and non-i.i.d. data distributions, the proposed algorithm achieves close-to-optimal accuracy, while outperforming the myopic policy. Note that under the i.i.d. data distribution, we further reduce the energy budget. Since the power of the gradient is high at the beginning of training, no worker can satisfy the energy constraint under the myopic policy. The dynamic scheduling policy, however, enables workers to borrow energy from the future, so that workers can still be scheduled. Fig. 4(b) and Fig. 4(c) present the fraction of workers scheduled in each round, and the maximum cumulative energy consumption over workers, respectively. The dynamic scheduling policy schedules more workers than the myopic policy and fully utilizes the available energy. In particular, with non-i.i.d. data, the dynamic policy schedules notably more workers per round on average than the myopic policy. Moreover, the dynamic scheduling policy schedules more workers at the beginning, while energy is plentiful, so that the convergence of the MLP is accelerated.
V Conclusions
We have considered analog aggregation for FL over wireless channels, and introduced data redundancy to the system to deal with non-i.i.d. data. We have proposed an energy-aware worker scheduling policy that maximizes the weighted fraction of scheduled workers, and works in an online, distributed manner with a performance guarantee. Experiments on the MNIST dataset show that, for non-i.i.d. data, increasing data redundancy can substantially improve the accuracy under a stringent energy budget, with diminishing improvements as redundancy grows further. We have also shown that the proposed energy-aware dynamic scheduling policy achieves close-to-optimal performance without violating the energy budget, and on average schedules more workers than the heuristic myopic policy. In the future, we plan to quantify the impact of data redundancy, and to consider gradient compression and error accumulation to further reduce the communication cost.
Acknowledgment
This work is sponsored in part by the European Research Council (ERC) under Starting Grant BEACON (grant No. 677854), the Natural Science Foundation of China (No. 61871254, No. 91638204, No. 61571265, No. 61861136003, No. 61621091), National Key R&D Program of China 2018YFB0105005, and the Intel Collaborative Research Institute for Intelligent and Automated Connected Vehicles.
Appendix A Proof of Theorem 1
From the queue evolution in (15), the per-round change of the virtual queue is bounded by the maximum energy deficit, and thus
(19) 
Define the Lyapunov function and the one-slot Lyapunov drift as:
(20) 
where the constant depends only on the maximum energy deficit. The one-slot drift-plus-penalty function is given by:
(21) 
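For concreteness, (20)-(21) can be written in the standard quadratic-Lyapunov form used in [19]; the notation below ($q(t)$ for the virtual queue, $E(t)$ for the round-$t$ energy, $E_{\max}$ for its maximum, $V$ for the weight) is assumed and may differ from the paper's symbols.

```latex
% Quadratic Lyapunov function and one-slot drift (reconstruction):
L\bigl(q(t)\bigr) \triangleq \tfrac{1}{2}\,q(t)^2, \qquad
\Delta(t) \triangleq L\bigl(q(t+1)\bigr) - L\bigl(q(t)\bigr).
% Squaring the queue update q(t+1) = \max\{q(t) + E(t) - \bar{E},\, 0\} gives
\Delta(t) \le B + q(t)\bigl(E(t) - \bar{E}\bigr), \qquad
B \triangleq \tfrac{1}{2}\max\bigl\{(E_{\max} - \bar{E})^2,\ \bar{E}^2\bigr\},
% and the per-round objective minimized by the policy is the drift-plus-penalty
\Delta(t) - V\,\alpha_t\, x(t).
```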
The main idea of the proposed dynamic scheduling policy is to minimize the right-hand side of (21). We have:
(22) 
The per-round optimal solution of (22) is:
(23) 
Summing the one-slot drift over all rounds, the total drift-plus-penalty function can be bounded by:
(24) 
where the bound involves the optimal utility obtained by the genie-aided policy, together with the corresponding queue value and energy deficit. The inequality holds because the proposed algorithm minimizes the drift-plus-penalty in each round.
References
 [1] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” [Online] Available: https://arxiv.org/abs/1812.02858, Dec. 2018.
 [2] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322-2358, 2017.

 [3] H. Li, K. Ota, and M. Dong, “Learning IoT in edge: Deep learning for the Internet of things with edge computing,” IEEE Network, vol. 32, no. 1, pp. 96-101, Jan.-Feb. 2018.
 [4] S. Zhou, Y. Sun, Z. Jiang, and Z. Niu, “Exploiting moving intelligence: Delay-optimized computation offloading in vehicular fog networks,” IEEE Commun. Mag., vol. 57, no. 5, pp. 49-55, May 2019.
 [5] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” NIPS Workshop on Private MultiParty Machine Learning, [Online] Available: https://arxiv.org/abs/1610.05492, Oct. 2016.
 [6] K. Bonawitz, et al. “Towards federated learning at scale: System design,” In Conference on Systems and Machine Learning, Stanford, CA, USA, Apr. 2019.
 [7] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” [Online] Available: https://arxiv.org/abs/1908.07873, Aug. 2019.
 [8] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with noniid data,” [Online] Available: https://arxiv.org/abs/1806.00582, Jun. 2018.
 [9] N. Yoshida, T. Nishio, M. Morikura, K. Yamamoto, and R. Yonetani, “Hybrid-FL: Cooperative learning mechanism using non-iid data in wireless networks,” [Online] Available: https://arxiv.org/abs/1905.07210, May 2019.
 [10] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” [Online] Available: https://arxiv.org/abs/1908.06287, Aug. 2019.
 [11] Q. Zeng, Y. Du, K. K. Leung, and K. Huang, “Energyefficient radio resource allocation for federated edge learning,” [Online] Available: https://arxiv.org/abs/1907.06040, Jul. 2019.
 [12] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205-1221, Jun. 2019.
 [13] N. H. Tran, W. Bao, A. Zomaya, Minh N.H. Nguyen, and C. S. Hong, “Federated learning over wireless networks: Optimization model design and analysis,” IEEE Conf. on Computer Commun. (INFOCOM), Paris, France, May 2019.
 [14] M. S. H. Abad, E. Ozfatura, D. Gündüz, and O. Ercetin, “Hierarchical federated learning across heterogeneous cellular networks,” [Online] Available: https://arxiv.org/abs/1909.02362, Sept. 2019.

 [15] M. Mohammadi Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Int. Symp. on Inform. Theory (ISIT), Paris, France, Jul. 2019.
 [16] M. Mohammadi Amiri and D. Gündüz, “Federated learning over wireless fading channels,” [Online] Available: https://arxiv.org/abs/1907.09769, Jul. 2019.
 [17] G. Zhu, Y. Wang, and K. Huang, “Lowlatency broadband analog aggregation for federated edge learning,” [Online] Available: https://arxiv.org/abs/1812.11494, Jan. 2019.
 [18] M. Mohammadi Amiri, T. M. Duman, and D. Gündüz, “Collaborative machine learning at the wireless edge with blind transmitters,” IEEE Global Conference on Signal and Information Processing (GlobalSIP), Ottawa, Canada, Nov. 2019.
 [19] M. J. Neely, Stochastic Network Optimization With Application to Communication and Queueing Systems. San Rafael, CA, USA: Morgan & Claypool, 2010.