Client Selection and Bandwidth Allocation in Wireless Federated Learning Networks: A Long-Term Perspective

04/09/2020
by   Jie Xu, et al.

This paper studies federated learning (FL) in a classic wireless network, where learning clients share a common wireless link to a coordinating server to perform federated model training using their local data. In such wireless federated learning networks (WFLNs), optimizing the learning performance depends crucially on how clients are selected and how bandwidth is allocated among the selected clients in every learning round, as both radio and client energy resources are limited. While existing works have made some attempts to allocate the limited wireless resources to optimize FL, they focus on the problem in individual learning rounds, overlooking an inherent yet critical feature of federated learning. This paper brings a new long-term perspective to resource allocation in WFLNs, recognizing that learning rounds are not only temporally interdependent but also have varying significance for the final learning outcome. To this end, we first design data-driven experiments to show that different temporal client selection patterns lead to considerably different learning performance. With the obtained insights, we formulate a stochastic optimization problem for joint client selection and bandwidth allocation under long-term client energy constraints, and develop a new algorithm that uses only currently available wireless channel information yet achieves a long-term performance guarantee. Further experiments show that our algorithm produces the desired temporal client selection pattern, adapts to changing network environments, and far outperforms benchmarks that ignore the long-term effect of FL.


I Introduction

Mobile devices nowadays generate a massive amount of data each day. This rich data has the potential to power a wide range of machine learning (ML)-based applications, such as learning the activities of smart phone users, predicting health events from wearable devices or adapting to pedestrian behavior in autonomous vehicles. Due to the growing storage and computational power of mobile devices as well as privacy concerns associated with uploading personal data, it is increasingly attractive to store and process data directly on each mobile device. The aim of “federated learning” (FL) [1] is to enable mobile devices to collaboratively learn a shared ML model with the coordination of a central server while keeping all the training data on device, thereby decoupling the ability to do ML from the need to upload/store the data in the cloud.

This paper focuses on FL in a classic wireless network setting where the clients, e.g., mobile devices, share a common wireless link to the server. We call this system a wireless federated learning network (WFLN). The network operates for a number of learning rounds as follows: in each round, the clients download the current ML model from the server, improve it by learning from their local data, and then upload the individual model updates to the server via the wireless link; the server then aggregates the local updates to improve the shared model. Similar to a traditional throughput-oriented wireless network, the limited wireless network resources require the WFLN to determine in each round which clients access the wireless channel to upload the model updates and how much bandwidth is allocated to each client. However, due to the specific application in consideration, namely FL, the resource allocation objective and consequently the outcome can be very different from, e.g., throughput maximization.

Optimizing WFLNs poses unique challenges compared to optimizing either FL or traditional wireless networks. On the one hand, the wireless network sets resource constraints on performing FL: the finite wireless bandwidth limits the number of clients that can be selected in each round, and the selection must adapt to highly variable wireless channel conditions. On the other hand, FL is likely to change the way wireless networks should be optimized, as model training is a complex long-term process where decisions across rounds are interdependent and collectively determine the final training performance. Further, since mobile devices often have finite energy budgets due to, e.g., a finite battery, the number of rounds in which each individual mobile device can participate during the entire course of FL is also limited. A crucial yet largely overlooked question is: does learning in different rounds contribute equally or differently to the final learning outcome, and hence should the wireless resources be allocated differently across rounds? Without a good understanding of the answer, conventional wireless network optimization approaches that treat each time slot independently and equally may lead to considerably suboptimal FL performance.

This paper aims to formalize this fundamental problem of client selection and bandwidth allocation in WFLNs and derive critical knowledge to enable the efficient operation of these networks. We study how resources (i.e., bandwidth and energy) should be allocated among clients in each learning round as well as across rounds given finite client energy budgets in a volatile network environment. Our main contributions are summarized as follows.

(1) While existing works [2, 3] have shown that including more clients in FL generally improves the learning performance, there is little understanding of how this improvement depends on the learning rounds. For a fixed total number of selected clients during the entire course of FL, should client selection be uniform across rounds or biased toward the early or later FL rounds? Although analytical characterization seems extremely difficult, we show in two representative ML tasks, i.e., image classification and text generation, that selecting more clients in the later FL rounds not only achieves higher accuracy and lower training loss but also is more robust than selecting more clients in the early FL rounds. This finding, to the best of our knowledge, is the first that relates the temporal client selection pattern to the final FL performance.

(2) With the understanding of a desired temporal client selection pattern, we formulate a long-term client selection and bandwidth allocation problem for a finite number of FL rounds under finite energy constraints of individual clients. Because wireless channel conditions vary over time but future conditions are unpredictable, we leverage the Lyapunov technique [4] to convert the long-term problem into a sequence of per-round problems via a virtual energy deficit queue for each client. A new online optimization algorithm called OCEAN is proposed, which in each FL round solves a finite number of convex optimization problems using only currently available wireless information, and hence the algorithm is practical and has low complexity.

(3) We prove that OCEAN achieves the FL performance of the desired client selection pattern within a bounded gap while approximately satisfying the energy constraints of the clients. Specifically, OCEAN exhibits a learning-energy tradeoff that is controlled by an algorithm parameter. In addition, we investigate the structure of the client selection and bandwidth allocation outcome. Our findings are two-fold. First, in each round, clients are selected according to a priority metric, which is the ratio of the client’s current energy deficit queue length to its current wireless channel state. Second, among the selected clients, more bandwidth is allocated to those with a lower priority (i.e., a worse channel and a larger energy deficit queue length). This is in stark contrast to a traditional throughput-oriented wireless network, where more bandwidth is allocated to clients with a better channel condition in order to maximize throughput.

II Related Work

Since the proposal of FL [1, 5], a lot of research effort has been devoted to tackling various challenges in this new distributed machine learning framework, including developing new optimization and model aggregation algorithms [6, 7, 8], handling non-i.i.d. and unbalanced datasets [9, 10, 11], and preserving model privacy [12, 13, 14, 15, 16]. Among these challenges, improving the communication efficiency of FL has been a key one due to the tension between uploading a large amount of data for model aggregation and the limited network resources available to support this transmission. In this regard, a strand of literature focuses on modifying the FL algorithm itself to reduce the communication burden on the network, e.g., updating clients with significant training improvement [17], compressing the gradient vectors via quantization [18], or accelerating training using sparse or structured updates [19, 1]. Hierarchical FL networks [20] have also been proposed where multiple edge servers perform partial model aggregation first, whose outputs are further aggregated by a cloud server. Recognizing the unique physical property of wireless transmission, [2, 21] propose analog model aggregation over the air, provided that very stringent synchronization is available.

As wireless networks are the envisioned main deployment scenario of FL, how to optimally allocate the limited bandwidth and energy resources for FL has also received much attention. Many existing works [22, 23, 24, 25] study the inherent trade-off between local model update and global model aggregation, e.g., to adapt the frequency of global aggregation [22] or to optimize uplink transmission power/rate and the local update CPU frequency [24, 25]. In all these works, all clients participate in every FL round. Although both empirical studies [2, 3] and theoretical analysis [26] show that including more clients improves the FL convergence speed, the limited bandwidth of wireless networks cannot support many clients to upload their local updates at the same time. For FL at scale, client scheduling policies, which select only a subset of clients in every round, are necessary. In [27], the convergence performance of FL under three basic scheduling policies, namely random, round-robin and proportional fair, is analyzed. Different types of joint bandwidth allocation and client scheduling policies, e.g., [3, 28, 29, 30, 31, 32], have been proposed to either minimize the learning loss or the training time. However, their optimization problems are formulated by considering individual FL rounds separately or treating every FL round equally, and hence the same network resources are allocated across learning rounds. Our paper differs from these works in that we explicitly consider the varying significance of FL rounds and study a long-term bandwidth allocation and client selection problem under long-term energy constraints and with uncertain wireless channel information.

III Impact of Temporal Client Selection Pattern

Existing works [2, 3] have shown that the FL performance (in terms of training loss and prediction accuracy) can be improved by selecting more clients in each round. However, selecting more clients is not always possible if each client is subject to a long-term energy constraint due to, e.g., a finite battery: selecting more clients in early learning rounds depletes the battery of the clients and hence fewer clients can be selected in later learning rounds. Hence, even with the same average number of selected clients, the temporal pattern can be considerably different, yet there is little understanding of how the temporal pattern affects the final FL outcome. In this section, we design two experiments to show that the temporal client pattern indeed has a considerable impact on the final FL performance.

Fig. 1: Training Loss (MNIST Dataset)
Fig. 2: Accuracy (MNIST Dataset)
Fig. 3: Training Loss (Shakespeare Dataset)
Fig. 4: Accuracy (Shakespeare Dataset)

III-A Image Classification on the MNIST Dataset

Our first experiment is conducted using the TensorFlow Federated (TFF) framework [33] on the MNIST dataset for image classification. A deep neural network (DNN) classifier is trained on 10 clients (indexed 1 through 10) using the FedAvg algorithm [1] over a total of 300 rounds. Three temporal selection patterns are investigated: Uniform – in each round, 5 clients are randomly selected to upload their model parameters; Ascend – the number of selected clients gradually increases from 1 to 10 over 300 rounds with an average number of 5 clients selected per round; Descend – the number of selected clients gradually decreases from 10 to 1 over 300 rounds with an average number of 5 clients selected per round.
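The three patterns can be sketched as per-round client counts. This is only an illustration: the source does not give the exact ramp, so the linear schedule, the function names `clients_per_round` and `sample_clients`, and their defaults are assumptions. Note that a plain linear ramp from 1 to 10 averages about 5.5 clients per round, slightly above the average of 5 stated in the text.

```python
import random

def clients_per_round(pattern, rounds=300, n_clients=10, uniform_k=5):
    """Number of clients to select in each round for one temporal pattern.

    Assumed schedule: a linear ramp between 1 and n_clients for the
    ascending and descending patterns (the exact ramp is not in the source).
    """
    counts = []
    for t in range(rounds):
        frac = t / (rounds - 1)              # 0.0 in the first round, 1.0 in the last
        if pattern == "uniform":
            k = uniform_k
        elif pattern == "ascend":
            k = 1 + round(frac * (n_clients - 1))
        elif pattern == "descend":
            k = n_clients - round(frac * (n_clients - 1))
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        counts.append(k)
    return counts

def sample_clients(k, n_clients=10, rng=random):
    """Pick k distinct client indices from 1..n_clients uniformly at random."""
    return sorted(rng.sample(range(1, n_clients + 1), k))

ascend = clients_per_round("ascend")
descend = clients_per_round("descend")
uniform = clients_per_round("uniform")
```

Each round would then call `sample_clients(counts[t])` to decide which clients upload their updates.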

Figures 1 and 2 illustrate the training loss and the prediction accuracy, respectively, over 300 rounds for the three temporal patterns. Each curve is generated by averaging over 60 runs, and the standard deviation of these curves is also shown in the figures. As can be seen, although the average number of selected clients is the same, different temporal patterns result in different training loss and prediction accuracy by the end of the 300 rounds. In particular, Ascend results in the best performance compared to Uniform and Descend. There is a good reason behind this result: early learning rounds are “easy” rounds where the learning performance is less sensitive to the number of selected clients. Hence, even if Ascend selects fewer clients in the early rounds, learning speed is minimally affected. However, the later learning rounds are the more “difficult” rounds, and pushing accuracy even higher requires more clients to update the shared model using their data. In fact, not only does Ascend win in training loss and accuracy, but it is also much more robust, as its standard deviation is much smaller. This is again because more clients participate in model updating towards the end of learning, which can smooth out abrupt changes in the learned model of individual clients.

III-B Text Generation on the Shakespeare Dataset

To verify that the above findings are generalizable, we conduct a similar experiment on a text generation task. We utilize the decentralized text generation dataset based on The Complete Works of Shakespeare provided in the TensorFlow Federated tutorial [33]. A Recurrent Neural Network with eager execution is pre-trained on the text from Charles Dickens’ A Tale of Two Cities and A Christmas Carol as the initial model, and FL is used to fine-tune this model for the Shakespeare dataset. As can be seen in Figures 3 and 4, although the task and the dataset are very different, similar observations can be made as before: the Ascend selection pattern significantly outperforms Descend and Uniform in terms of training loss, accuracy and robustness.

We note that the exact optimal temporal selection pattern seems impossible to analytically characterize, which would also change across different learning tasks, models, datasets and algorithms. However, as the general ascending trend leads to considerable performance improvement, it offers valuable guidance for designing client selection schemes across rounds.

IV Wireless Federated Learning Network Model

With the insights obtained in Section III, we now move on to optimize the WFLN. Consider a WFLN with one server and a set of clients. Each participating client has a local dataset. In the supervised learning case, this dataset is a collection of data samples given as input-output pairs, where each input is a feature vector and each output is the ground-truth label. This data can be generated through the usage of the client via mobile applications and can be employed for various ML tasks, e.g., user activity prediction or health event prediction.

FL iterates between two steps: 1) the server updates the global model by aggregating local models transmitted over a multi-access channel by the clients; 2) the clients update their local models using the global model broadcast by the server. We call each iteration a learning round. The WFLN has to decide in each round which clients upload their local model updates, depending on their wireless channel conditions and remaining battery, to maximize the learning performance. A binary variable denotes whether or not a given client is selected in a given round, and these variables are collected into the overall client selection decision.
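The server-side aggregation step can be illustrated with the sample-weighted average used by FedAvg [1]. This is a minimal sketch rather than the TFF implementation: models are represented as flat lists of floats, and the function name `fedavg` and the `(num_samples, weights)` tuple layout are assumptions made for the example.

```python
def fedavg(updates):
    """Aggregate local models from the clients selected in one round.

    updates: list of (num_samples, weights) pairs, one per selected client,
    where weights is a flat list of floats (an assumed representation).
    Returns the average of the weights, weighted by local dataset size.
    """
    total = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    return [sum(n * w[i] for n, w in updates) / total for i in range(dim)]

# Two selected clients holding 100 and 300 samples respectively.
global_model = fedavg([(100, [1.0, 0.0]), (300, [0.0, 4.0])])
```

Only the clients selected in the current round appear in `updates`, which is why the selection decision directly shapes the aggregated model.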

IV-A Client Energy Consumption

For a selected client in round (i.e., ), it incurs energy consumption due to uploading the local updates to the edge server via the wireless channel. We consider a specific wireless multi-access scheme, i.e., orthogonal frequency-division multiple access (OFDMA) for local model uploading with a total bandwidth . Let be the bandwidth allocation ratio for client in round , and hence its allocated bandwidth is . Let . Bandwidth allocation must satisfy . Clearly, if , namely client is not selected in round , then it is the best not to allocate any bandwidth to this client, i.e., . On the other hand, if , then we require that at least a minimum bandwidth is allocated to client , i.e., . This is because practical systems cannot assign an arbitrarily small bandwidth to an individual client. In addition, a close-to-zero bandwidth allocation will require an extremely high transmit power and hence result in an extremely high energy consumption to achieve a target transmission rate. To make the problem feasible, we assume .

Let denote the transmission power (in Watt/Hz) of client in round . The achievable rate (in bit/s), denoted by , can be written according to Shannon’s formula as

(1)

where

is the variance of the complex white Gaussian channel noise and

is the channel state of client in round . Let denote the data size of the adopted machine learning model (in bit), then the time needed to upload the local model update to the edge server is . For a target upload time deadline , the required transmission power can be derived using (1) and hence the transmission energy consumption of client is

(2)
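The chain from rate to energy described above can be sketched numerically. The symbols in equations (1) and (2) were lost in this copy, so everything below is an assumed reading of the text: `b` is the bandwidth ratio, `B` the total bandwidth, `gain` the channel power gain, `N0` the noise density, `L` the model size in bits, and `tau` the upload deadline. The power is chosen so that the Shannon rate delivers `L` bits in exactly `tau` seconds, and the energy is power density times allocated bandwidth times upload time.

```python
import math

def achievable_rate(b, p, gain, B, N0):
    """Shannon rate (bit/s) over the allocated band b * B at power density p (W/Hz)."""
    return b * B * math.log2(1.0 + p * gain / N0)

def required_power(b, gain, B, L, tau, N0):
    """Power density (W/Hz) so that L bits are uploaded in exactly tau seconds."""
    spectral_eff = (L / tau) / (b * B)       # required bit/s per Hz of allocated band
    return (N0 / gain) * (2.0 ** spectral_eff - 1.0)

def upload_energy(b, gain, B, L, tau, N0):
    """Transmission energy (J): power density x allocated bandwidth x upload time."""
    return required_power(b, gain, B, L, tau, N0) * b * B * tau

# Hypothetical numbers chosen only to make the arithmetic easy to follow.
params = dict(gain=1e-4, B=1e6, L=1e5, tau=0.5, N0=1e-9)
e_small = upload_energy(0.1, **params)   # 10% of the band
e_large = upload_energy(0.2, **params)   # 20% of the band
```

With these numbers, doubling the bandwidth share cuts the upload energy, which matches the remark in the text that a close-to-zero allocation would require very high transmit power.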

IV-B System Learning Performance

Existing works and our empirical study show that in order to accelerate learning, it is desirable for the WFLN to select as many clients as possible in each round. However, client selection is constrained by finite radio and battery resources. Therefore, the WFLN must judiciously select clients to perform federated learning in each round, without quickly draining clients’ battery and causing insufficient model updates in later communication rounds. To this end, we introduce the following metric to describe the FL performance in round :

(3)

where is a temporal weight to capture the varying significance of selecting more clients in different learning rounds. As suggested by Section III, an increasing sequence of often results in better FL performance as more clients are likely to be selected in later rounds of learning.
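A small numeric illustration of the weighted metric, assuming it takes the form "temporal weight times number of selected clients" in each round, as the surrounding text suggests. The function name and the four-round toy weights are hypothetical.

```python
def weighted_participation(counts, weights):
    """Sum over rounds of (temporal weight) x (number of selected clients)."""
    return sum(w * n for w, n in zip(weights, counts))

weights = [1, 2, 3, 4]            # increasing weights favour later rounds
ascend = [1, 2, 3, 4]             # 10 selections in total, back-loaded
descend = [4, 3, 2, 1]            # 10 selections in total, front-loaded

score_ascend = weighted_participation(ascend, weights)
score_descend = weighted_participation(descend, weights)
```

Both schedules select the same total number of clients, yet the back-loaded one scores higher under increasing weights, which is the behaviour the metric is meant to reward.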

We note, however, although the above metric will facilitate our subsequent resource allocation, it does not exactly characterize the FL speed or accuracy, which is extremely difficult, if not impossible, to model due to the complex and non-convex nature of many ML algorithms.

IV-C Problem Formulation

As we emphasize the long-term performance and final outcome of FL, the goal is to maximize the weighted sum of selected clients defined in (3) for a total number of learning rounds while satisfying the long-term energy budget constraints of individual clients, through joint client selection and bandwidth allocation in every round . Although the performance metric defined in (3) is artificial, we will relate it to the actual FL performance (i.e., training loss and accuracy) in experiments. Formally, the problem that we aim to solve is

P1 (4)
s.t. (5)
(6)
(7)

Constraint (5) requires that the total energy consumption over the rounds for each client does not exceed an energy budget (e.g., battery capacity or energy limit set by the client). Constraint (6) is the feasibility condition on the bandwidth allocation. Constraint (7) is the feasibility condition on the client selection.
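Because the formulas of P1 did not survive in this copy, the following is one plausible reading assembled from the verbal description of the objective and of constraints (5) to (7). All symbols are assumptions introduced here for illustration: $a_k^t \in \{0,1\}$ is the selection indicator, $b_k^t$ the bandwidth ratio, $\eta^t$ the temporal weight, $E_k^t$ the upload energy, $H_k$ the energy budget, $T$ the number of rounds, and $b_{\min}$ the minimum ratio.

```latex
\begin{align*}
\max_{\{a^t,\, b^t\}_{t=1}^{T}} \quad & \sum_{t=1}^{T} \eta^t \sum_{k} a_k^t \\
\text{s.t.} \quad & \sum_{t=1}^{T} E_k^t \le H_k, \quad \forall k
  && \text{(long-term energy budget)} \\
& \sum_{k} b_k^t \le 1, \quad b_k^t \ge b_{\min}\, a_k^t, \quad \forall k, t
  && \text{(bandwidth feasibility)} \\
& a_k^t \in \{0, 1\}, \quad \forall k, t
  && \text{(selection feasibility)}
\end{align*}
```

The exact form of the bandwidth constraint in the original may differ; the version above only encodes the two requirements stated in Section IV-A, namely that the ratios fit in the band and that a selected client receives at least the minimum share.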

So far we have formulated a long-term optimization problem for client selection and bandwidth allocation in WFLNs. However, several challenges impede the derivation of the optimal solution to P1. The first is the lack of future information: optimally solving P1 requires complete offline information (i.e., channel conditions) over the entire FL period (i.e., learning rounds) that is very difficult to accurately predict in advance. Furthermore, P1 belongs to mixed-integer nonlinear programming and is difficult to solve, even if the long-term future information is accurately known a priori. Thus, these challenges demand an online approach that can efficiently make joint client selection and bandwidth allocation decisions without foreseeing the far future.

IV-D Offline Benchmark: -Round Lookahead Algorithm

Before we propose the online algorithm, we first introduce an offline algorithm with -round lookahead information (i.e., the channel information in the next learning rounds are assumed to be known) as a benchmark. Specifically, we divide the entire FL period into frames, each having learning rounds such that , and present the following problem formulation:

(8)
s.t. (9)

Essentially, P2 defines a family of offline algorithms parameterized by the lookahead window size . Clearly, there exists at least one sequence of joint client selection and bandwidth allocation decisions that satisfies all constraints of P2 (e.g., no client is selected in any round in each frame). We denote the optimal learning performance for the -th frame by , for , considering all the decisions that satisfy the constraints and have perfect information over the frame. Thus, the optimal long-term learning performance achieved by the oracle’s optimal -round lookahead algorithm is given by .

We note that because of the assumed lookahead information, the -round lookahead algorithms are impractical (unless ). The purpose of introducing these algorithms is only to use them as a benchmark for our practical online algorithm to be proposed in the next section.

V Online Client Selection and Bandwidth Allocation

In this section, we develop the Online Client sElection and bAndwidth allocatioN algorithm, called OCEAN, and then characterize its structural properties. We also prove that it is efficient compared to the optimal offline algorithm with -round lookahead information.

V-A The OCEAN Algorithm

A major challenge of directly solving P1 is that the long-term energy constraint of the clients couples the client selection and bandwidth allocation decisions across different learning rounds: selecting more clients in the current round reduces the bandwidth allocated to each individual client, thereby increasing the energy consumption of these clients; furthermore, more energy consumption in the current round potentially reduces the energy budget available for future FL rounds, and yet the decisions have to be made without foreseeing the future. To address this challenge, we leverage the Lyapunov technique and construct a virtual energy deficit queue for each client to guide the client selection and bandwidth allocation decisions to follow the long-term energy constraint. The virtual energy queue of client starts with , and is updated at the end of each round as follows

(10)

where . Hence, is the queue length indicating the deviation of the current energy consumption of client from its long-term energy constraint . Let collect the energy deficit queues for all clients.
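The queue update can be sketched as follows. Equation (10) is missing from this copy, so the code uses the standard Lyapunov virtual-queue form as an assumption: the queue grows by the energy spent in the round, shrinks by a per-round allowance equal to the budget divided by the number of rounds, and is clipped at zero. The names and numbers are hypothetical.

```python
def update_queue(q, energy_used, budget, rounds):
    """One step of a virtual energy deficit queue (assumed form of the update).

    q grows when a round uses more than the per-round allowance budget / rounds
    and drains otherwise; it never goes below zero.
    """
    return max(q + energy_used - budget / rounds, 0.0)

# A client with a 3 J budget over 300 rounds (0.01 J allowance per round)
# spends 0.05 J in the first round and then stays idle for five rounds.
q = 0.0
history = []
for spent in [0.05, 0.0, 0.0, 0.0, 0.0, 0.0]:
    q = update_queue(q, spent, budget=3.0, rounds=300)
    history.append(q)
```

The deficit created by one expensive round is paid back over the following idle rounds, which is how the queue steers later decisions toward the long-term budget.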

Input: and
for  do
     if  then
          and
     end if
     Observe the current channel state
     Solve P3
     Update energy queue according to (10)
end for
Algorithm 1 OCEAN

We now present OCEAN in Algorithm 1. OCEAN is purely online and requires only the currently available channel state information as inputs (i.e. ). We use to denote a sequence of positive control parameters to dynamically adjust the tradeoff between maximizing the number of selected clients and minimizing energy consumption over the frames, each having communication rounds. The importance of the control parameters will be revisited in Section V.C. In every round , we aim to solve the following per-round problem:

P3 (11)
s.t. (12)

By considering the additional term , the system takes into account the energy deficit of the clients during the current round’s client selection and bandwidth allocation. As a consequence, when is larger, minimizing the energy deficit is more critical. Thus, our algorithm works following the philosophy of “if violate the energy constraint, then use less energy”, and the energy deficit queue maintained without foreseeing the future guides the system towards meeting the energy constraints of the clients. OCEAN decomposes the long-term optimization problem into a series of per-round problems P3. For a more rigorous derivation of this decomposition, please refer to the proof of Theorem 1. Now, to complete OCEAN, it remains to solve P3, which however is still very difficult.

V-B Solving the Per-Round Problem

The per-round problem P3 is a difficult mixed-integer problem. To see more clearly how the objective function depends on and , we write out and rearrange it as follows

(13)

Notice that is a binary integer variable and is a continuous variable in . In general, mixed-integer problems are difficult to solve and often there is no polynomial-time optimal algorithm. Fortunately, our problem P3 exhibits a special structure and we are able to exploit this structure to develop an algorithm that returns the optimal solution by solving at most convex optimization problems. To simplify the notations, we drop the index in this subsection.

Our algorithm to solve P3, called OCEAN-P, incrementally adds clients into the selection set based on a metric , which we call the selection priority (the lower value, the higher priority). Initially, all clients with (which also means as is always positive) are added into . We denote this initial set by . Then, clients with are added into one by one in the ascending order of , and for each possible selection set, the corresponding bandwidth allocation is computed by solving the following optimization problem

(14)
s.t. (15)
(16)

Let be the optimal bandwidth allocation for a given selection set , and be the optimal value. Clearly, for the initial set , as . The number of selection sets that can possibly emerge following the above set expanding rule, which are collected in , is at most . Finally, the implemented optimal selection is and the implemented optimal bandwidth allocation is .

Because there are clients and in every iteration, one more client is added into , we ensure that the algorithm only needs to solve at most optimization problems P4 to return the optimal solution. In fact, we can reduce the number of times for solving P4 by adding a termination condition: if for some , its optimal bandwidth allocation results in for the last added client , then the algorithm stops adding more clients into the selection set. This termination condition can significantly reduce the number of convex optimization problems to be solved when is large. The pseudocode of OCEAN-P is given in Algorithm 2. Next, we first prove that P4 is a convex optimization problem and then prove the optimality of OCEAN-P.

Input:
Rank the clients according to . Hence we have
Set , , and .
for  do
     Update
     Solve P4 and obtain and
     if  then
         Stop iteration
     else
         Add to , i.e.
     end if
end for
Find
Return where and
Algorithm 2 OCEAN-P
Lemma 1.

The function where is decreasing and convex in and is increasing and concave in .

Proof.

See Appendix A. ∎

Lemma 1 readily proves that P4 is a convex optimization problem as . Convex optimization problems are extensively studied in the literature and many efficient algorithms [34] and mature software tools (such as CVX [35] and SciPy [36]) exist. Next, we prove that our algorithm returns the optimal solution by solving P4 at most times.

Theorem 1.

OCEAN-P returns the optimal solution to the per-round problem P3 by solving at most convex optimization problems.

Proof.

See Appendix B. ∎
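The set-expanding procedure can be sketched in plain Python under several assumptions, since the formulas of P3 and P4 are missing here. The sketch assumes the per-round objective is "V times temporal weight times number of selected clients, minus the queue-weighted energy", that the priority is queue length divided by channel power gain, and that energy follows the Shannon-based expression of Section IV-A. It solves each bandwidth subproblem by bisection on a Lagrange multiplier instead of calling a convex solver, and it omits the early-termination test. All names and numeric constants are hypothetical.

```python
import math

# Hypothetical link parameters: band (Hz), model size (bit), deadline (s), noise (W/Hz).
B, L, TAU, N0 = 1e6, 1e5, 0.5, 1e-9
LN2 = math.log(2.0)

def energy(b, gain):
    """Upload energy (J) for bandwidth ratio b; decreasing and convex in b."""
    return (N0 * TAU * b * B / gain) * (2.0 ** (L / (TAU * b * B)) - 1.0)

def d_energy(b, gain):
    """Derivative of energy() in b; negative and increasing toward zero."""
    x = L / (TAU * b * B)
    return (N0 * TAU * B / gain) * (2.0 ** x * (1.0 - x * LN2) - 1.0)

def best_response(weight, gain, lam, b_min):
    """Ratio in [b_min, 1] where weight * d_energy equals -lam, clipped to the bounds."""
    if weight * d_energy(b_min, gain) >= -lam:
        return b_min
    if weight * d_energy(1.0, gain) <= -lam:
        return 1.0
    lo, hi = b_min, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if weight * d_energy(mid, gain) < -lam:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def allocate(weights, gains, b_min):
    """Minimise sum of weight * energy subject to ratios >= b_min and sum <= 1."""
    assert len(weights) * b_min <= 1.0, "minimum shares must fit in the band"
    pairs = list(zip(weights, gains))
    lo, hi = 0.0, max(w * -d_energy(b_min, g) for w, g in pairs)
    for _ in range(100):                     # bisection on the multiplier
        lam = 0.5 * (lo + hi)
        if sum(best_response(w, g, lam, b_min) for w, g in pairs) > 1.0:
            lo = lam
        else:
            hi = lam                         # hi always yields a feasible allocation
    return [best_response(w, g, hi, b_min) for w, g in pairs]

def ocean_p(queues, gains, V, eta, b_min):
    """Try every prefix of the priority ranking and keep the best-valued one."""
    order = sorted(range(len(queues)), key=lambda k: queues[k] / gains[k])
    best_value, best_set, best_alloc = 0.0, [], []
    for n in range(1, len(order) + 1):
        chosen = order[:n]
        alloc = allocate([queues[k] for k in chosen], [gains[k] for k in chosen], b_min)
        cost = sum(queues[k] * energy(b, gains[k]) for k, b in zip(chosen, alloc))
        value = V * eta * n - cost
        if value > best_value:
            best_value, best_set, best_alloc = value, chosen, alloc
    return best_set, best_alloc

selected, shares = ocean_p([0.0, 2.0, 50.0], [1e-4, 1e-4, 1e-5], V=5.0, eta=1.0, b_min=0.05)
```

In this toy run the client with an empty queue receives only the minimum share, the second client takes the rest of the band, and the third client (large deficit, weak channel) is left out because its weighted energy would outweigh the gain from selecting it.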

V-C Structural Results and Performance Analysis

In this subsection, we first investigate the structure of the optimal solution produced by OCEAN-P in every round, and then characterize the performance of OCEAN.

In Theorem 1, we have already proven a thresholding result on the client selection, namely only clients whose selection priority is below a threshold are selected to participate in a FL round. Proposition 1 characterizes how bandwidth is allocated among the selected clients and their incurred energy consumption.

Proposition 1.

In any learning round , the allocated bandwidth of a selected client and its weighted energy consumption are non-decreasing with .

Proof.

See Appendix C. ∎

Theorem 1 and Proposition 1 together show that a client with a smaller energy deficit and a better channel condition (and hence a smaller ) is more likely to be selected to participate in the current FL round; however, among the selected clients, a client with a smaller is allocated with less bandwidth. This is because although allocating more bandwidth to client with a smaller reduces the energy consumption and deficit of this client, it reduces the bandwidth that can be allocated to clients with larger , which leads to even higher increased energy consumption and deficit of those clients. Moreover, in the optimal solution, the overall effect of energy deficit and consumption, namely , is still increasing in .

With the optimality of OCEAN-P, we then prove the performance guarantee of OCEAN.

Theorem 2.

For any and such that , when comparing OCEAN with the -round lookahead algorithm, the following statements hold:

(a) The energy constraint of every client is approximately satisfied with a bounded deviation:

(17)

where .

(b) The federated learning performance satisfies:

(18)

where and is the optimal value achieved by the -round lookahead algorithm in frame .

Proof.

See Appendix D. ∎

Theorem 2 shows that, given a fixed value of and , OCEAN is -optimal with respect to the FL performance against the optimal -lookahead policy, while the energy consumption is guaranteed to be approximately satisfied with a bounded factor . Thus, OCEAN demonstrates an learning-energy tradeoff. Note that when , the -lookahead benchmark has complete future information of the entire rounds. Even in this case, the tradeoff still holds.

VI Simulation Results

In this section, we simulate a WFLN to evaluate the performance of OCEAN.

Federated Dataset. To simulate FL, we leverage the TensorFlow Federated (TFF) framework and the MNIST dataset for hand-written digit classification. Each client’s local dataset is keyed by the original writer of the digits. Since each writer has a unique style, this dataset exhibits the kind of non-i.i.d. behavior expected of federated datasets. We use the first 10 clients in the MNIST dataset to conduct our simulation, with each client having 100 training data samples. Since hand-written digit classification is a relatively easy image classification task, we follow TFF’s tutorial to construct a simple three-layer neural network with the first layer being the input, the second containing 10 neurons and the third performing the softmax operation. This neural network’s model size is bits. FedAvg [1] is used as the learning algorithm.

Wireless Network. To simulate the wireless network, we consider an OFDMA system where the total bandwidth MHz. Each client’s wireless channel gain is modelled as independent free-space fading with average path loss 36dB. The variance of the complex white Gaussian channel noise is set as W. To ensure timely model update, we set the target uploading time in each round to be ms. The minimal bandwidth is set as Hz. For each client , the energy budget is set as J. The network runs for rounds.

VI-A Benchmarks

We compare the performance of OCEAN with the following three benchmark algorithms.

  • Select-All: All 10 clients are selected in every learning round. Bandwidth is allocated to minimize the total energy consumption while satisfying the upload deadline requirement.

  • Static Myopic Optimal (SMO): In every learning round, SMO uses only currently available information independently across rounds (which is equivalent to the 1-Round Lookahead algorithm) to solve

    (19)
    s.t. (20)

    This problem is easy to solve: for each client , first compute the required bandwidth so that using energy can meet the upload time target ; then rank in ascending order and select clients until the total required bandwidth exceeds . SMO mimics existing approaches (e.g., [3]) that solve bandwidth allocation and client selection independently across learning rounds.

  • Adaptive Myopic Optimal (AMO): SMO has a clear deficiency which can result in energy under-utilization: when a client is not selected in a round, its energy is wasted and will not be used in future rounds. To address this issue, we also consider a modified version of SMO, which recycles previously unused energy budget for future rounds. In particular, the energy budget for client in round is modified to .
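The SMO selection rule described above can be sketched as follows. This is only an illustrative sketch, not the implementation used in the experiments: it assumes a Shannon-capacity upload rate with the per-round energy spent evenly over the upload window, and the function names (`upload_rate`, `required_bandwidth`, `smo_select`) and all numerical values are hypothetical. The required bandwidth is found by bisection, which is valid here because the assumed rate is increasing in the bandwidth.

```python
import math

def upload_rate(b, energy, gain, tau, noise_psd):
    """Assumed Shannon rate (bits/s) on bandwidth b when `energy` joules are spent over tau seconds."""
    power = energy / tau
    return b * math.log2(1.0 + power * gain / (noise_psd * b))

def required_bandwidth(energy, gain, model_bits, tau, noise_psd, b_max, iters=60):
    """Smallest bandwidth in (0, b_max] that uploads model_bits within tau, or None if infeasible."""
    target = model_bits / tau
    if upload_rate(b_max, energy, gain, tau, noise_psd) < target:
        return None  # the deadline cannot be met even with the whole band
    lo, hi = 0.0, b_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if upload_rate(mid, energy, gain, tau, noise_psd) >= target:
            hi = mid  # mid is sufficient, try less bandwidth
        else:
            lo = mid  # mid is insufficient, need more bandwidth
    return hi  # hi always satisfies the rate target

def smo_select(energies, gains, model_bits, tau, noise_psd, total_bw):
    """Rank feasible clients by required bandwidth (ascending) and admit them while the band suffices."""
    needs = []
    for k, (e, h) in enumerate(zip(energies, gains)):
        b = required_bandwidth(e, h, model_bits, tau, noise_psd, total_bw)
        if b is not None:
            needs.append((b, k))
    needs.sort()
    selected, used = {}, 0.0
    for b, k in needs:
        if used + b > total_bw:
            break  # the next client would exceed the total bandwidth
        selected[k] = b
        used += b
    return selected

# Toy numbers chosen for readability, not taken from the simulation setup.
allocation = smo_select(
    energies=[1.0] * 5, gains=[100.0, 1.0, 30.0, 20.0, 12.0],
    model_bits=10.0, tau=1.0, noise_psd=1.0, total_bw=10.0,
)
```

In this toy instance the client with the weakest channel is infeasible and is never considered, and the greedy pass stops as soon as the next-ranked client no longer fits in the remaining band.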

For OCEAN, we let and hence the sequence becomes a single scalar . Moreover, we implement three variants using different temporal importance sequences : Ascending (OCEAN-a); Descending (OCEAN-d); and Uniform (OCEAN-u).

VI-B Performance Comparison

Figure 5 shows the number of selected clients in every round for different approaches, which is obtained by averaging over 10 runs. As the name suggests, Select-All selects all 10 clients in every round, resulting in the ideal client selection for FL. SMO selects far fewer clients due to the hard energy budget allocation in every round: many clients do not get to upload their local model updates because of the bad channel state that they are experiencing. AMO starts by selecting few clients for the same reason as SMO. However, as time goes on, the energy budget not used in previous rounds accumulates. This allows a client to transmit at the desired rate using a higher transmission power in later rounds, especially in those towards the very end, thereby countering the effects of bad channel states. As a (fortunate) by-product, AMO also achieves an ascending pattern of client selection. Our proposed algorithm, OCEAN-a, is able to select many more clients than SMO because it uses energy as needed without imposing a hard per-round energy constraint. Compared to AMO, it is able to fine-tune the temporal pattern of client selection by using different sequences of temporal weights . As can be seen in Figure 6, OCEAN-a results in an increasing number of selected clients, OCEAN-d results in a decreasing number of selected clients, while OCEAN-u keeps the number of selected clients almost the same across rounds.

Fig. 5: Temporal Client Selection Patterns of OCEAN and Benchmarks
Fig. 6: Temporal Client Selection Patterns of OCEAN Variants

Figure 7 shows the actual energy consumption of individual clients by the end of 300 learning rounds for different approaches in a particular run. Because Select-All completely ignores the energy budgets of the clients, it results in a very large energy consumption, far exceeding the energy budgets. On the other hand, SMO does not fully utilize the client’s energy budget because in many learning rounds the client is not selected. Both AMO and OCEAN-a incur a total energy consumption close to the given energy budget (i.e. 0.15) for individual clients.

Fig. 7: Per-Client Energy Consumption Comparison

As our ultimate goal is to improve the FL performance, we show the training loss and accuracy for different approaches in Figures 8 and 9. Select-All, as expected, results in the best FL performance, with the smallest training loss, the highest accuracy and the fastest convergence among all approaches. Due to the insufficient selection of clients in the course of learning, SMO’s learning performance is considerably inferior to all other approaches. Thanks to its fortunate by-product, AMO’s FL performance is comparable to that of OCEAN-a in this specific setting, which is close to the ideal case Select-All. However, we will show in the next set of experiments that AMO’s “luck” does not extend to other more complex network environments.

Fig. 8: Training Loss of OCEAN-a and Benchmarks
Fig. 9: Accuracy of OCEAN-a and Benchmarks

VI-C Adaptability to Varying Network Condition

Although the performance of AMO seems comparable to OCEAN-a in the last experiment, it is achieved in a relatively easy network, where the wireless channel is relatively stable. In this set of experiments, we simulate more challenging network environments where the wireless channel can vary considerably due to, e.g., client mobility. In particular, we simulate two scenarios. In Scenario 1, the average path loss gradually increases from 32 dB to 45 dB, mimicking a scenario where clients move away from the server over time. In Scenario 2, the average path loss gradually decreases from 45 dB to 32 dB, mimicking a scenario where clients move towards the server over time.
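The two mobility scenarios can be mimicked with a simple path-loss schedule. The sketch below is only one possible instantiation: the text says the average path loss changes "gradually", and a linear ramp in dB over the 300 rounds is an assumption made here, as is the unit-mean exponential (Rayleigh-type) power fading used to draw per-round channel gains. The helper names `path_loss_schedule` and `sample_gain` are hypothetical.

```python
import random

def path_loss_schedule(start_db, end_db, rounds):
    """Assumed linear ramp of the average path loss (in dB) across learning rounds."""
    if rounds == 1:
        return [start_db]
    step = (end_db - start_db) / (rounds - 1)
    return [start_db + step * t for t in range(rounds)]

def sample_gain(path_loss_db, rng):
    """Linear channel power gain: mean 10^(-PL/10), scaled by unit-mean exponential fading (assumption)."""
    mean_gain = 10.0 ** (-path_loss_db / 10.0)
    return mean_gain * rng.expovariate(1.0)

rng = random.Random(0)
scenario1 = path_loss_schedule(32.0, 45.0, 300)  # clients move away from the server
scenario2 = path_loss_schedule(45.0, 32.0, 300)  # clients move towards the server

# One independent gain per client (10 clients) and per round for Scenario 1.
gains = [[sample_gain(pl, rng) for _ in range(10)] for pl in scenario1]
```

Feeding `gains` round by round into a selection algorithm reproduces the qualitative setting of Scenario 1; swapping in `scenario2` gives Scenario 2.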

Scenario 1. Figure 10 shows the number of selected clients over 300 rounds for OCEAN-a and AMO, and Figure 11 shows their FL accuracy. In the early rounds when the wireless channel is good, AMO selects some clients. However, as the channel gain degrades, AMO is not able to adapt to this change, as the pre-allocated energy budget (even if the unused budget from the previous rounds is incorporated) cannot support even a single client to finish uploading the local model before the deadline . Only in the rounds towards the very end does the energy budget become sufficient and hence, some clients are again selected to upload their local model updates. Because of the long idle period in the middle when no clients are selected, the learning performance of AMO is significantly worse than that of OCEAN.

Scenario 2. Figure 12 shows the number of selected clients over 300 rounds for OCEAN-a and AMO, and Figure 13 shows their federated learning accuracy. In this scenario, the channel state in the early rounds is bad and hence, hardly any client can be selected to upload its local model update due to the insufficient energy budget in AMO. As the channel state improves, AMO starts to select some clients, but by then it is too late.

In both scenarios, OCEAN is able to adapt its client selection decision because of its soft per-round energy budget allocation, yet the total consumed energy is still made close to the total energy budget. The per-client total energy consumption of OCEAN-a is shown in Figure 14 for the two considered scenarios.

Fig. 10: Client Selection of Scenario 1
Fig. 11: Accuracy of Scenario 1
Fig. 12: Client Selection of Scenario 2
Fig. 13: Accuracy of Scenario 2
Fig. 14: Energy Consumption of OCEAN-a for the Two Scenarios
Fig. 15: Client Selection and Bandwidth Allocation Outcomes
Fig. 16: Tradeoff Between Learning and Energy

VI-D Features of OCEAN

VI-D1 Client Selection and Bandwidth Allocation Outcomes

To gain a deeper understanding of how OCEAN works, we illustrate, in one specific round, how clients are selected and bandwidth is allocated depending on the clients’ channel condition and energy deficit queue in that round. In Figure 15, the upper subplot shows the current channel condition and energy deficit queue for each client. The middle subplot shows the computed selection priority , with shaded bars indicating the selected clients. The bottom subplot shows the bandwidth allocation among the selected clients. As can be seen, a better channel condition and a smaller deficit queue result in a higher priority (i.e., a smaller value of ), consistent with Theorem 1. However, among the selected clients, more bandwidth is allocated to clients of a lower priority (i.e., a larger value of ).

VI-D2 Learning - Energy Tradeoff

Finally, we show the impact of the algorithm parameter on the learning vs. energy tradeoff achieved by OCEAN. Figure 16 shows the number of selected clients, the learning accuracy and the per-client energy consumption violation as a function of . As can be seen, a larger places more emphasis on the learning performance, resulting in more selected clients and higher accuracy. On the other hand, a smaller places more emphasis on the energy consumption, resulting in a smaller violation (if any) of the total energy budget.

VII Conclusion

Resource allocation in wireless networks is an old topic, but it also faces constantly changing new challenges as new applications emerge. With FL being the trending new wireless network application, the old mindset of resource allocation for traditional applications such as file downloading or video streaming must be changed. This paper identifies a key property of FL, namely the temporal dependency and varying significance of learning rounds, that may significantly reshape how wireless resources should be allocated for optimized network and learning performance, yet is largely overlooked in the literature. While our formulation and algorithm have shown superior performance in real-world FL experiments, there are several future research directions that may extend the impact of this work. For example, we showed that an ascending client selection pattern is generally desired, but it is still not clear what the optimal pattern is. Moreover, client heterogeneity in terms of the computing power and local data size/distribution can be incorporated into the model to further enhance the understanding of resource allocation in more complex WFLNs.

Appendix A Proof of Lemma 1

To prove the monotonicity and convexity of , we investigate the first and second order derivatives, respectively. The first-order derivative is

(21)

The second-order derivative is

(22)

Therefore, for , and hence is an increasing function. We also know that and thus, for . This proves that is decreasing on . Similarly, for , and hence is a decreasing function. Because , for . This proves that is increasing on . Moreover, is convex on as ; it is concave on as .

Appendix B Proof of Theorem 1

The key is to prove that the optimal solution must have the following thresholding structure: there exists so that and . To prove this, suppose that in the optimal solution , there exist so that and . Let us consider a different solution which is obtained by swapping the decisions for and in . Specifically,

(23)
(24)

Since the decisions for other clients remain the same, the difference in the objective function value is

(25)
(26)

where . This is a contradiction to the optimality of . Therefore, the optimal solution must have the aforementioned thresholding structure.

Next, we prove that the termination condition is correct. Let be the first client with where we use to denote the optimal bandwidth allocation for the selection set . Clearly, when only clients are selected, we obtain a higher utility because

(27)
(28)
(29)

Therefore, client must not be in the optimal selection set. Because of the thresholding structure, we know that clients also must not be in the optimal selection set. This proves the correctness of the termination condition.
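The thresholding structure and the termination condition proven above suggest a simple search procedure, sketched below under stated assumptions. The callback `utility` is a hypothetical stand-in for evaluating the per-round objective of a selection set under its optimal bandwidth allocation; that objective is not reproduced here, and the toy utility in the example is invented purely to exercise the loop. Stopping when the utility fails to improve (rather than strictly decreases) is also a choice made for this sketch.

```python
def threshold_select(priorities, utility):
    """Add clients in ascending order of priority value; stop at the first one that does not help.

    priorities: list of per-client priority values (smaller means selected earlier).
    utility: callable mapping a list of selected client indices to a real number
             (hypothetical stand-in for the per-round objective).
    """
    order = sorted(range(len(priorities)), key=lambda k: priorities[k])
    selected = []
    best = utility(selected)
    for k in order:
        candidate = utility(selected + [k])
        if candidate <= best:
            # Termination: by the thresholding structure, no later client can be optimal either.
            break
        selected.append(k)
        best = candidate
    return selected, best

# Toy utility (an assumption for illustration): each selected client contributes 1,
# while the shared cost grows with both the number of clients and their priority values.
prio = [0.1, 0.5, 0.2, 0.9]

def toy_utility(sel):
    return len(sel) - len(sel) * sum(prio[k] for k in sel)

chosen, value = threshold_select(prio, toy_utility)
```

Because only prefixes of the sorted client list are ever evaluated, the search needs at most one utility evaluation per client plus one for the empty set.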

Appendix C Proof of Proposition 1

It suffices to consider the following optimization problem with two clients

(30)
s.t. (31)

where is any constant in . Let . Suppose the optimal bandwidth allocation satisfies , then by Lemma 1, we know . Let us construct a different bandwidth allocation solution where and . In other words, the bandwidth allocation decisions are swapped. This solution also satisfies all constraints. We compare the respective objective values and have

(32)

This contradicts the optimality of . Therefore, we must have .

To prove , let and ignore the constraint that for now. The first-order condition requires

(33)

This leads to

(34)

Because , we can instead prove . Let us define . Since we have proven in the above, we only need to prove that is a non-increasing function in . To this end, consider the first order derivative of ,

(35)

Let . We have to prove for .

(36)
(37)

To simplify notations, we use a change of variable by letting . Then proving for is equivalent to proving for . We rewrite below:

(38)

Clearly, . In order to prove , we prove is decreasing in .

(39)

It is easy to verify that is increasing in and , which means for . Hence, for . This concludes the proof for by ignoring the constraint .

When the constraint is considered, there are two cases. In the first case, . In this case, the constraint is automatically satisfied and hence, our above conclusion holds. In the second case, . In this case, the optimal allocation is modified to and . Since is a decreasing function, . This completes the proof.

Appendix D Proof of Theorem 2

We define the quadratic Lyapunov function . Let be the 1-round Lyapunov drift yielded by some control decisions over one round: . Similarly, let be the -round Lyapunov drift: . Based on the queue dynamics, we have

(40)

Then, it can easily be shown that

(41)

where is a constant satisfying , which is finite due to the boundedness of the channel condition and the minimum bandwidth allocation requirement . Next, it is straightforward that and

(42)
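The one-round drift bound used in this argument can be checked numerically. The sketch below rests on two assumptions that are not confirmed by the text as extracted: the Lyapunov function is taken to be half the sum of squared queue lengths, and the energy deficit queue is assumed to follow the standard virtual-queue update, growing by the per-round overspend and floored at zero. Under these assumptions, the drift is upper bounded by the sum over clients of a squared overspend term (which a constant can bound) plus the queue length times the overspend. All function names and numbers are hypothetical.

```python
import random

def queue_update(q, energy_used, per_round_budget):
    """Assumed virtual-queue dynamics: add the overspend, floor at zero."""
    return max(q + energy_used - per_round_budget, 0.0)

def lyapunov(queues):
    """Quadratic Lyapunov function, assumed to be 0.5 * sum of squared queue lengths."""
    return 0.5 * sum(q * q for q in queues)

def drift_and_bound(queues, energies, budgets):
    """Return the exact 1-round drift and the bound sum_k [0.5*x_k^2 + q_k*x_k], x_k = e_k - budget_k."""
    nxt = [queue_update(q, e, b) for q, e, b in zip(queues, energies, budgets)]
    drift = lyapunov(nxt) - lyapunov(queues)
    bound = sum(0.5 * (e - b) ** 2 + q * (e - b)
                for q, e, b in zip(queues, energies, budgets))
    return drift, bound

# Random spot check over 1000 draws with 10 clients (values are arbitrary).
rng = random.Random(1)
checks = []
for _ in range(1000):
    queues = [rng.uniform(0.0, 2.0) for _ in range(10)]
    energies = [rng.uniform(0.0, 1.0) for _ in range(10)]
    budgets = [0.5] * 10
    checks.append(drift_and_bound(queues, energies, budgets))
```

The bound holds because flooring at zero can only shrink the squared queue length, so the squared updated queue is at most the square of the unfloored update.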

As we can see, by solving P3, OCEAN-P actually maximizes a lower bound of . Let , , be the sequence of decisions derived by the online algorithm.

(a) Consider a specific sequence of decisions where . Clearly, in this case, and . Because maximizes the right-hand side of (42), we have