Energy-Aware Analog Aggregation for Federated Learning with Redundant Data

11/01/2019
by   Yuxuan Sun, et al.

Federated learning (FL) enables workers to learn a model collaboratively by using their local data, with the help of a parameter server (PS) for global model aggregation. The high communication cost for periodic model updates and the non-independent and identically distributed (i.i.d.) data become major bottlenecks for FL. In this work, we consider analog aggregation to reduce the communication cost with respect to the number of workers, and introduce data redundancy to the system to deal with non-i.i.d. data. We propose an online energy-aware dynamic worker scheduling policy, which maximizes the average number of workers scheduled for gradient update at each iteration under a long-term energy constraint, and analyze its performance based on Lyapunov optimization. Experiments using the MNIST dataset show that, for non-i.i.d. data, doubling data storage can improve the accuracy by 9.8% under a stringent energy budget, while the proposed policy can achieve close-to-optimal accuracy without violating the energy constraint.




I Introduction

With the rapid development of machine learning (ML) techniques, emerging applications, including virtual and augmented reality, Internet of things, autonomous driving and e-health services, are penetrating into human lives [1]. ML models for these applications are typically trained in central clouds. However, centralized training leads to high communication costs, and causes privacy concerns in applications that involve sensitive personal data.

Meanwhile, end devices such as smartphones, vehicles and sensors, as well as infrastructure like base stations (BSs) and road side units, are being equipped with more computing resources, enabling intensive computations at the network edge, namely multi-access edge computing [2, 3, 4]. With the help of edge intelligence and to address the privacy concerns, a distributed ML framework called federated learning (FL) has been proposed recently [5, 6, 7], where end devices, called workers, learn a shared ML model collaboratively using their local data, with the help of a central parameter server (PS) which aggregates the global model and coordinates the training process. Since the PS acquires the model update from each worker rather than their data, privacy is preserved.

The high communication costs and non-independent and identically distributed (i.i.d.) data are the two major bottlenecks in FL [7]. According to [8], when using highly non-i.i.d. data for FL, the accuracy drops substantially for both MNIST and CIFAR-10 as compared to using i.i.d. data. They prove that the non-i.i.d. level, i.e., the difference between the local and global data distributions, is the root cause of the performance degradation. This problem is tackled by sharing publicly available i.i.d. data with the workers in [8], or by workers sharing a limited portion of their data with the PS in [9].

The communication burden of FL mainly comes from the global model aggregation, which can be reduced by efficient scheduling and resource allocation [10, 11, 12, 13, 14], gradient quantization and sparsification [14, 15, 16], or via analog aggregation [15, 16, 17]. An analytical study on the convergence rate achieved by random, round robin and proportional fair scheduling policies is carried out in [10]. An energy-efficient bandwidth allocation and worker scheduling scheme is proposed in [11], minimizing the energy consumption while maximizing the fraction of workers scheduled. A more general resource constraint, including both communication and computing resources, is considered in [12, 13]. In [14], a hierarchical FL architecture is proposed, and the end-to-end latency is minimized by jointly considering model sparsification and the two-tier update interval. Quantization and error accumulation techniques are further considered in [15, 16] to reduce the communication cost.

Most papers on FL consider digital transmission for global model aggregation. However, the communication latency scales with the number of workers [17]. Observing that the PS is interested only in the average of local models rather than their individual values, a promising solution is to use analog aggregation, which exploits the waveform-superposition property of a wireless multiple access channel (MAC) [15, 16, 17]. If the workers synchronize with each other and align their transmit power, the summation of local models can be carried out over-the-air. The tradeoff between signal-to-noise ratio (SNR) and the amount of exploited data is analyzed in [17], while gradient compression and error accumulation are considered in [15, 16] to further improve the bandwidth efficiency of FL. While over-the-air aggregation requires channel state information at the workers, it is shown in [18] that this requirement can be relaxed if the PS has multiple antennas.

Existing papers on analog aggregation mainly consider power allocation under specific channel models, and have not addressed non-i.i.d. data. In this work, we consider analog aggregation for FL, where each worker has a long-term energy budget, and data redundancy is introduced to the system via data exchange or overlapped data collection. We propose an energy-aware dynamic worker scheduling policy, which maximizes the average weighted fraction of scheduled workers without assuming specific channel models or requiring any future information, and analyze its performance based on Lyapunov optimization. Experiments using the MNIST dataset show that data redundancy can bring significant accuracy improvements when data is non-i.i.d., while the proposed policy smartly utilizes the available energy to achieve close-to-optimal accuracy.

The rest of the paper is organized as follows. In Sec. II, we introduce the system model and problem formulation. The worker scheduling policy is proposed in Sec. III, along with its performance analysis. Experimental results are presented in Sec. IV, and the paper is concluded in Sec. V.

II System Model and Problem Formulation

II-A Federated Learning Architecture

(a) Architecture of FL, where redundancy comes from data exchange between workers.
(b) Overlapped data collection.
Fig. 1: Illustration of an FL system and the acquisition of data redundancy.

As shown in Fig. 1(a), we consider an FL system with a single PS and homogeneous workers. To tackle non-i.i.d. data, we introduce data redundancy to the system via 1) data exchange, i.e., workers exchange their local data with neighboring workers they trust, or 2) overlapped collection, i.e., in IoT networks, the sensing areas of the sensors (workers) overlap with each other. For example, in Fig. 1(b), there are 4 data collection points, and each worker collects data from 2 points. As opposed to [8] and [9], this does not require sharing any data samples with the PS. We assume that there are original datasets generated by workers (in the data exchange case) or data collection points (in the overlapped collection case), each with the same number of data samples, denoted by . The global dataset is defined as , with data samples . For simplicity and without loss of generality, we assume in the following.

The data redundancy of the system is denoted by , indicating that each original dataset is stored at different workers. The local dataset owned by worker is denoted by , with . One example to obtain redundancy is to collect or exchange the original datasets in a cyclic manner: let worker store , where the set of indices is , with

(1)

Fig. 1(b) is an example of obtaining redundancy in the data collection scenario, with .
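As a concrete illustration of the cyclic assignment in (1), the following sketch maps each worker to the indices of the original datasets it stores. The names (`assign_datasets`, `num_workers`, `redundancy`) are our own illustrative choices, not the paper's notation:

```python
# Hypothetical sketch of the cyclic data-redundancy assignment: with K
# original datasets and K workers, worker k stores datasets
# k, k+1, ..., k+r-1 (indices modulo K), so each dataset is replicated
# at exactly r workers.

def assign_datasets(num_workers: int, redundancy: int) -> list[list[int]]:
    """Return, for each worker, the list of original dataset indices it stores."""
    return [
        [(k + j) % num_workers for j in range(redundancy)]
        for k in range(num_workers)
    ]

# Example: 4 workers and redundancy r = 2, as in Fig. 1(b).
print(assign_datasets(4, 2))  # [[0, 1], [1, 2], [2, 3], [3, 0]]
```

With this rule, increasing the redundancy lets each worker see data from more original sources, which is what mitigates the non-i.i.d. partition.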

The goal of FL is to minimize the global loss

(2)

where

is a loss function designed for the FL task, and

is the parameter vector to be optimized.

In the -th training round, the PS broadcasts the global parameter vector obtained in the last round to all the workers. We assume that the PS is a more capable node with sufficient energy resources (e.g., a BS); therefore the broadcast of the global parameter vector is error-free. Each worker randomly picks a fraction of data samples , with , to evaluate its local gradient estimate as:

(3)

Here we can let , so that data redundancy does not bring additional computing workloads to the workers for training, but only increases the storage cost.

We define an indicator function , where if worker is scheduled to upload gradient in the -th round, and otherwise. Further define the set of workers scheduled in round as . The global parameter vector is updated according to

(4)

where is the learning rate.
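The update rule (4) can be sketched as follows. This is a minimal illustration with our own variable names (`w`, `grads`, `scheduled`, `eta`), assuming the PS averages the gradient estimates of the scheduled workers and takes one SGD step:

```python
import numpy as np

def global_update(w, grads, scheduled, eta=0.05):
    """One global round, as in (4).

    w: global parameter vector (numpy array)
    grads: list of local gradient estimates, one per worker
    scheduled: 0/1 indicators of which workers uploaded this round
    eta: learning rate
    """
    idx = [i for i, s in enumerate(scheduled) if s == 1]
    if not idx:
        return w  # no worker scheduled this round: model unchanged
    avg_grad = sum(grads[i] for i in idx) / len(idx)
    return w - eta * avg_grad
```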

II-B Analog Aggregation

For the aggregation of the local gradients, we consider analog transmission via a wireless MAC with sub-channels. If worker is scheduled, its local gradient is evenly partitioned into segments , each of which is a vector with either or entries and is transmitted via sub-channel .

In order to carry out the summation of the local gradients over-the-air, all the scheduled workers need to be synchronized and align their transmit power. Specifically, in round , denote the power allocated to worker within sub-channel by , which satisfies

(5)

where is the channel gain between worker and the PS in sub-channel , and is a power scalar that determines the received SNR. We assume that remains constant within each round, but we do not limit it to any specific distribution. We also assume that each worker has perfect knowledge of its current channel gains , . Within sub-channel , each scheduled worker transmits to the PS. The total communication latency in each round is times the symbol duration, regardless of the number of workers scheduled. We remark that the consideration of sub-channels enables us to implement analog aggregation in current digital transmission systems, such as orthogonal frequency division multiplexing (OFDM), with minor changes [17]. However, we consider a worker-level schedule rather than a per-sub-channel, per-worker schedule in this work, so that the PS receives the complete gradients of the scheduled workers, as shown in (4).

At the PS side, the received signal over sub-channel can be written as

(6)

where is an i.i.d. additive white Gaussian noise (AWGN) vector, with each entry following the standard normal distribution. The -th segment of the global parameter vector, , is updated according to

(7)

Finally, the global parameters are concatenated into the vector .
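The aggregation chain (5)-(7) can be simulated as below. This is a hedged sketch under simplifying assumptions: each scheduled worker inverts its channel gain so that the transmitted signals add coherently over the air, and the PS rescales the noisy sum. All names (`over_the_air_sum`, `rho`) are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def over_the_air_sum(segments, gains, rho=1.0):
    """Noisy over-the-air sum of gradient segments, one per scheduled worker.

    segments: list of equal-length (real) gradient segments
    gains: corresponding complex channel gains
    rho: power scalar controlling the received SNR
    """
    y = np.zeros(len(segments[0]), dtype=complex)
    for g, h in zip(segments, gains):
        p = np.sqrt(rho) / h          # power alignment, in the spirit of (5)
        y += h * p * g                # the channel multiplies the transmit signal
    y += rng.standard_normal(len(y))  # AWGN at the PS, as in (6)
    return y.real / np.sqrt(rho)      # rescale before the update in (7)
```

At high `rho` the output approaches the exact sum of the segments, which is the SNR-versus-energy tradeoff the scheduling policy has to manage.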

II-C Problem Formulation

Our objective is to minimize the global loss after training rounds, by optimizing the worker schedule and power allocation . Meanwhile, we also want to explore how data redundancy impacts the performance. The problem is formulated as:

(8a)
(8b)
(8c)
(8d)

The first constraint (8b) states that the average energy consumed by each worker in each training round cannot exceed the budget , due to the battery limitation of wireless devices. (We unify each round to a unit time length, and use power and energy interchangeably in this paper without ambiguity.) The second constraint (8c) states that all the scheduled workers align their power to enable over-the-air computation.

Since the loss function usually differs across machine learning tasks, and the evolution of the parameters during training is very complex, it is hard to express explicitly. Meanwhile, the convergence rate of distributed SGD is found to be positively correlated with the number of workers scheduled, as shown in [11] and references therein. Therefore, we consider an alternative optimization problem that maximizes the average weighted fraction of scheduled workers:

(9)

where characterizes the importance of scheduling more workers in the -th round.

We further fix to a predefined value for . This assumption is based on the fact that, with analog aggregation, the convergence speed of the FL task is not very sensitive to the SNR or the average transmit power, according to [16]. Then can be obtained from (8c) after deciding whether to schedule worker or not. Finally, the energy consumption is given by

(10)

The alternative problem can be formulated as

(11a)
(11b)
(11c)

III Energy-Aware Worker Scheduling

The key challenge is that constraint (11b) imposes a long-term energy budget, while in practice, the channel gains and the power of the gradients cannot be acquired before the -th round, and they may not be i.i.d. over time. Therefore, we design online worker scheduling policies in this section, and carry out performance analysis without assuming any specific distributions for the channel.

For any worker , it is easy to see that its energy constraint (11b) and scheduling decision are independent of the other workers. Then can be equivalently decoupled into individual problems

(12a)
(12b)
(12c)

The combination of the optimal solutions of for all workers is the optimal solution of . By solving , each worker can decide individually whether or not to upload its gradient. In what follows, we design online solutions to .

III-A Myopic Scheduling for a Short-Term Fixed Energy Constraint

1: Initialization: initialize global model , input , , , , , and let , .
2: for  do
3:     Broadcast to all the workers.  ▷ PS
4:     Update from (3).  ▷ Each worker, in parallel
5:     Acquire channel gains for , and calculate energy consumption according to (10).
6:     Make scheduling decision:
(13)
7:     Update virtual queue according to (15).
8:     Transmit in sub-channel , .
9:     Aggregate received signal according to (6), and update global model according to (7).  ▷ PS
10: end for
Algorithm 1: Energy-Aware Dynamic Scheduling Policy for FL via Analog Aggregation

A simple way to handle the average energy constraint (12b) is to remove the long-term summation, and transform it to a short-term fixed energy constraint , for . Then the myopic scheduling policy can be given by

(14)

In the -th round, worker acquires the current channel gains and the powers of the gradient segments for , and calculates the energy required to send its gradient estimate. If the required energy is no more than the budget , worker is scheduled.
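The myopic rule (14) can be sketched as below, together with an illustrative per-round energy model in the spirit of (10): under channel-inversion power control with power scalar `rho`, sending a segment over a sub-channel with gain `h` costs roughly `rho / |h|^2` times the segment's power. Function and variable names are ours, not the paper's:

```python
import numpy as np

def round_energy(segments, gains, rho=1.0):
    """Illustrative energy cost of transmitting all gradient segments
    this round, summed over sub-channels (channel-inversion assumption)."""
    return sum(rho / abs(h) ** 2 * float(np.sum(np.abs(g) ** 2))
               for g, h in zip(segments, gains))

def myopic_decision(segments, gains, budget, rho=1.0):
    """Myopic policy: schedule (return 1) iff this round's energy cost
    fits within the fixed per-round budget."""
    return 1 if round_energy(segments, gains, rho) <= budget else 0
```

Because the decision never looks beyond the current round, deep fades simply cause the worker to sit out, which is exactly the conservatism the next subsection addresses.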

III-B Energy-Aware Dynamic Scheduling for a Long-Term Average Energy Constraint

Although the myopic scheduling policy is simple and can satisfy the original long-term average energy budget, it actually introduces a much tighter energy constraint. In other words, the worker uses its energy in a more conservative manner, and is less likely to be scheduled compared with that allowed by the original energy budget (12b).

To schedule workers more efficiently, we propose an energy-aware dynamic scheduling policy, as shown in Algorithm 1. We construct a virtual queue with , and its evolution is given by

(15)

The value of the virtual queue indicates the deficit between the current energy consumption and the budget.

As shown in Lines 3-4 in Algorithm 1, in each round, the PS first broadcasts the up-to-date global parameter vector, and each worker runs a local gradient estimation step in parallel based on its local data. In Lines 5-6, each worker makes the scheduling decision by comparing the weighted energy and the weighted utility , where is an adjustable weight parameter. The virtual queue transforms the long-term energy budget into an instantaneous energy constraint: if is large, it is more likely that , so the worker tends not to upload its gradient in order to save energy, and vice versa. Also, by introducing , can be obtained without any future information or other workers’ states and decisions. Therefore, the proposed algorithm is energy-aware, online and distributed. As shown in Lines 7-9, all the workers then update their virtual queues, and the scheduled workers transmit their gradients to the PS synchronously. The global parameter vector is finally aggregated by the PS.
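The per-worker decision in Lines 5-7 can be sketched as below, assuming the comparison form described above (schedule when the weighted utility outweighs the queue-weighted energy) and the virtual-queue evolution (15). All names (`dynamic_step`, `q`, `beta`, `V`) are illustrative stand-ins for the paper's elided symbols:

```python
def dynamic_step(q, energy, budget, beta, V):
    """One round of the energy-aware dynamic policy for a single worker.

    q: current virtual-queue backlog (energy deficit)
    energy: energy this round's transmission would cost
    budget: long-term per-round energy budget
    beta: utility weight of scheduling this worker in this round
    V: adjustable weight trading off utility against energy deficit
    Returns (decision, new_queue).
    """
    # Schedule iff the weighted utility outweighs the queue-weighted energy.
    schedule = 1 if V * beta >= q * energy else 0
    # Virtual-queue update in the spirit of (15): accumulate the deficit
    # between consumed energy and the budget, floored at zero.
    q_next = max(q + schedule * energy - budget, 0.0)
    return schedule, q_next
```

When the backlog `q` is empty the worker always schedules, which matches the caveat discussed in Sec. III-D about very costly rounds at `q = 0`.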

III-C Performance Analysis

To analyze the performance of the proposed algorithm, we assume that in this subsection, and refer to the Lyapunov optimization technique [19]. We consider a non-ergodic version of Lyapunov optimization, i.e., all the random variables can be non-i.i.d. across time; that is, 1) the distribution of the power of the gradient is unknown; 2) the channel and user mobility are not limited to specific models; 3) the total number of rounds is finite.

Define as the optimal utility of achieved by the offline genie-aided solution, and the average utility achieved by Algorithm 1. Let be the optimal utility of , and . By applying the energy-aware dynamic scheduling policy, we get the following theorem.

Theorem 1.

When , the average weighted fraction of scheduled workers achieved by Algorithm 1 satisfies:

(16)

and the total energy consumption of worker is bounded by:

(17)

where .

Proof.

See Appendix A. ∎

Theorem 1 shows that the average utility and the total energy consumption achieved by the proposed energy-aware dynamic scheduling policy have bounded deviations from the optimal genie-aided policy and the energy budget, respectively. Both deviations are positively correlated with the maximum energy deficit , and can be traded off via the weight parameter .

III-D Discussion

We remark that, according to (13), when , the scheduling indicator if . However, the energy cost may then be very high, leading to a large upper bound on the energy deficit , and thus a large deviation from the optimal genie-aided solution. Moreover, the worker cannot be scheduled for many rounds afterwards, which in turn reduces the average utility. Therefore, although this choice follows classical Lyapunov optimization and guarantees the worst-case utility and energy consumption, in the experiments we set , which is more reasonable.

In Algorithm 1, the weight parameter is also involved in balancing the utility and energy consumption. In practice, we can set as a relatively large value at the beginning of the training process, and decrease it across time, since 1) Scheduling more workers at the initial training rounds helps the learning process converge faster; 2) The power of the gradients reduces across time, according to the experiments in the following; 3) The battery level decreases as time goes by, so that we should use energy more and more conservatively.

IV Experiments

In this section, we evaluate the gain of data redundancy and the performance of the proposed worker scheduling policy for a digit recognition FL task, using the MNIST dataset (http://yann.lecun.com/exdb/mnist/) with training samples and test samples. We consider workers, sub-channels and training rounds. We divide the dataset in either an i.i.d. or a non-i.i.d. fashion. For the i.i.d. data partition, the data samples are randomly partitioned into datasets . For the non-i.i.d. case, the data samples are first sorted by digit, and each dataset is formed by data samples from a single digit. Each worker stores of these datasets in a cyclic manner according to (1), a total of samples. During each round, a fraction of data samples is randomly chosen from the data samples worker stores for training, so that the amount of computation for training is the same under different data redundancies.

Fig. 2: The accuracy of the MLP under different data redundancies using the myopic scheduling policy.
Fig. 3: Variation of the power of the gradients with training rounds.

We train a multilayer perceptron (MLP) with a 784-neuron input layer, a 64-neuron hidden layer, and a 10-neuron softmax output layer. The total number of parameters is 50890. Cross entropy is adopted as the loss function, and rectified linear unit (ReLU) activation is used. We set the learning rate as 0.05, the dropout probability as 0.5, and use a momentum of 0.5. We consider Rayleigh fading channels from the workers to the PS, where the channel gain follows the standard complex normal distribution, and each entry of the additive noise vector follows the standard normal distribution. Also, the power scalar is set as . For the energy-aware dynamic scheduling policy, we let , and .

In Fig. 2, we first explore the impact of data redundancy by evaluating the accuracy of the MLP using the myopic scheduling policy under both i.i.d. and non-i.i.d. data. The energy budget of each worker is set the same, with or . Overall, the accuracy with i.i.d. data outperforms that with non-i.i.d. data. When data is i.i.d., redundancy hardly brings any benefit to the system, since the workers can already use images of different digits to train the model even if . However, data redundancy can significantly improve the accuracy of the MLP under non-i.i.d. data. Specifically: 1) Given the energy budget, as the data redundancy increases, the accuracy of the model increases with diminishing marginal gain. When , an improvement of is achieved by increasing the redundancy from to , and from to . 2) Increasing the energy budget helps to improve the performance of the MLP since more workers can be scheduled in each round, while the system is less sensitive to the energy constraint in the case of i.i.d. data. Compared with , the accuracy improves by when , by when , but by less than when .

In Fig. 3, we plot the power of the gradients to guide the parameter design of the energy-aware dynamic scheduling policy. Each scatter point represents the power of the gradient of a worker in that round, and the lines are obtained by averaging these values across workers, i.e., . We find that the power of the gradients drops dramatically in the first 10-15 training rounds and is quite stationary afterwards. Motivated by this observation, the weight parameter is set as:

(18)
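A plausible shape for this schedule is a simple step function that keeps the weight large while the gradient power is high and then drops it. The breakpoint and the two values below are illustrative placeholders, since the published numbers in (18) are elided in this copy:

```python
def weight_schedule(t: int, v_high: float = 10.0, v_low: float = 1.0,
                    t_switch: int = 15) -> float:
    """Illustrative step schedule for the weight parameter: a large value
    early in training (schedule workers aggressively while gradient power
    is high), and a smaller value once the power stabilizes."""
    return v_high if t < t_switch else v_low
```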
(a) Accuracy.
(b) Fraction of workers scheduled.
(c) Cumulative energy consumption.
Fig. 4: The performance of the proposed energy-aware dynamic scheduling policy, compared with the myopic scheduling policy and the upperbound.

The performance of the proposed energy-aware dynamic scheduling policy is then presented in Fig. 4. As shown in Fig. 4(a), we compare the accuracy of the dynamic scheduling policy with the myopic policy and an upper bound, which is achieved by setting , so that all the workers can be scheduled in each round. We can see that, under both i.i.d. and non-i.i.d. data distributions, the proposed algorithm achieves close-to-optimal accuracy while outperforming the myopic policy. Note that under the i.i.d. data distribution, we further reduce the energy budget to . Since the power of the gradient is high at the beginning of training, no worker can satisfy the energy constraint under the myopic policy. However, the dynamic scheduling policy enables workers to borrow energy from the future, so that workers can be scheduled. Fig. 4(b) and Fig. 4(c) present the fraction of workers scheduled in each round, and the maximum cumulative energy consumption over the workers, respectively. By using the dynamic scheduling policy, more workers can be scheduled compared with the myopic policy, and the energy is fully utilized. Specifically, with non-i.i.d. data, when and the data redundancy is , the dynamic policy schedules an average of workers in each round, while the myopic policy can only schedule workers. Moreover, the dynamic scheduling policy schedules more workers at the beginning, while the energy is sufficient, so that the convergence of the MLP is accelerated.

V Conclusions

We have considered analog aggregation for FL over wireless channels and introduced data redundancy to the system to deal with non-i.i.d. data. We have proposed an energy-aware worker scheduling policy to maximize the weighted fraction of scheduled workers, which works in an online, distributed manner with a performance guarantee. Experiments on the MNIST dataset show that, for non-i.i.d. data, increasing data redundancy to can improve the accuracy by under a stringent energy budget, while further increases in redundancy lead to diminishing improvements in accuracy. We have also shown that the proposed energy-aware dynamic scheduling policy can achieve close-to-optimal performance without violating the energy budget, and on average schedules more workers than a heuristic myopic policy. In the future, we plan to quantify the impact of data redundancy, and to consider gradient compression and error accumulation to further reduce the communication cost.

Acknowledgment

This work is sponsored in part by the European Research Council (ERC) under Starting Grant BEACON (grant No. 677854), the National Natural Science Foundation of China (No. 61871254, No. 91638204, No. 61571265, No. 61861136003, No. 61621091), the National Key R&D Program of China 2018YFB0105005, and the Intel Collaborative Research Institute for Intelligent and Automated Connected Vehicles.

Appendix A Proof of Theorem 1

Let . From (15), we have , , , and . When , , and thus

(19)

Define the Lyapunov function as , and the one-slot Lyapunov drift as:

(20)

where . The one-slot drift-plus-penalty function is given by:

(21)

The main idea of the proposed dynamic scheduling policy is to minimize the right-hand side of (21). We have:

(22)

Since , the optimal solution of (22) is:

(23)

Define the -slot drift as . Then the -slot drift-plus-penalty function can be bounded by:

(24)

where , and are the optimal utility of obtained by the optimal genie-aided policy, and the corresponding values of the queue and the energy deficit. The inequality holds since the proposed algorithm minimizes in each round.

Since and , from (19) and (A), we have

(25)

By summing (25) over , we prove Theorem 1.

References