Federated Learning (FL) [1, 2, 3, 4] has become an attractive ML framework to address the growing concerns of transmitting private data from distributed clients (e.g., mobile devices) to a central cloud server by leveraging the ever-increasing storage and computing capabilities of the client devices. In each FL round, clients train local models using their local data and the cloud server aggregates local model updates to form a global model. Because only local model information is exchanged in FL rather than the local data, FL preserves the data privacy of the clients and hence has found applications in a wide range of problems, such as next-word prediction and image classification.
A main bottleneck that limits the performance of FL is the delay variability among individual clients due to their local training and model data transfer via the wireless network. In standard FL, the cloud server has to wait until it receives the training updates from all the clients before proceeding to the next step. Therefore, straggler clients who have unfavorable wireless links or low computation capabilities may dramatically slow down the whole FL process [7, 8]. This is the so-called “straggler effect”. Various approaches have been proposed to mitigate the “straggler effect”. For example, model quantization [9, 10] and gradient sparsification schemes aim to directly reduce the transferred data size and the model training complexity, thereby reducing all clients’ training and transmission delay. Asynchronous FL [12, 13]
allows clients to train and upload model updates in an asynchronous manner, and hence the cloud server does not have to wait for the slow clients before proceeding to the next step. Another mainstream and proven effective approach to address the straggler problem is client selection, which reduces the probability of straggler clients participating in FL by judiciously selecting clients in every FL round. In this paper, we aim to improve the FL performance along the lines of client selection in the context of hierarchical FL (HFL), a hierarchical architecture of FL that significantly reduces the communication overhead between the cloud server and the clients.
The key idea of HFL is the introduction of multiple edge servers that reside between the single cloud server and the large number of clients. Instead of communicating directly with the cloud server, the clients only send their local training updates to nearby edge servers, each of which performs an intermediate model aggregation. These aggregated models are further sent to the cloud server for the global aggregation. HFL significantly reduces the communication burden on the network and has been shown to achieve a faster convergence speed than the conventional FL architecture both theoretically [14, 15] and empirically. Although several learning algorithms have been designed for HFL, they make the simplifying assumption that all clients participate in each round of model parameter aggregation, and hence the “straggler effect” in HFL has not been specifically addressed. However, it is not straightforward to apply existing client selection solutions to HFL due to several unique challenges that HFL faces. Firstly, since the service area of an Edge Server (ES) is much smaller than that of the Cloud Server (CS) and service areas may overlap, the set of clients accessible to each ES is time-varying. This time-varying characteristic makes the opportunistic communication behavior of clients more complicated, and the Network Operator (NO) must carefully assign clients in overlapping areas to the appropriate ES. Secondly, since a key advantage of HFL lies in mitigating the straggler problem, designing an efficient client selection policy is even more important than in traditional FL. Thirdly, the client selection decision needs to be determined under many uncertainties in the HFL network conditions, e.g., the traffic pattern of each client-ES pair and the available computation resources of clients, which affect training performance in previously unknown ways. Therefore, a learning-based client selection policy is preferred to a purely optimization-based policy.
In this paper, we investigate the client selection problem for HFL and propose a new learning-based policy, called Context-aware Online Client Selection (COCS). COCS is developed based on a novel Multi-Armed Bandit (MAB) framework called Contextual Combinatorial MAB (CC-MAB) [17, 18, 19]. COCS is contextual because it allows NO to use the clients’ computational information, e.g., available computation resources, and the client-ES pair transmission information, e.g., bandwidth and distance. COCS is combinatorial because NO selects a subset of client-ES pairs and attempts to maximize the training utility (i.e., to select as many clients as possible in each round) by optimizing the client selection decision. To the best of our knowledge, COCS is the first client selection policy for HFL. In summary, we highlight the contributions of this paper as follows:
1) We formulate a client selection problem for HFL, where NO needs to assign clients to ESs for local training so that as many clients’ updates as possible are received by the ESs before the deadline under a limited budget. To improve the convergence speed of HFL, the client selection decision involves a three-fold problem: (i) estimating the local model updates successfully received by ESs under cold-start conditions, (ii) deciding whether a client should be assigned to a certain ES given time-varying connection conditions, and (iii) optimizing how to pay for computation resources on clients to maximize the utility under limited budgets.
2) Due to the a priori uncertain knowledge of participating clients, the client selection problem is formulated as a CC-MAB problem. An online learning policy called Context-aware Online Client Selection (COCS) is developed, which leverages contextual information such as the downloading channel state and the local computing time in each aggregation round to make a decision. For strongly convex HFL, we analyze the utility loss of COCS, termed regret, compared to the Oracle solution that knows the exact information of participating clients. A sublinear regret bound is derived for the proposed COCS policy, which implies that COCS produces asymptotically optimal client selection decisions for HFL.
3) For non-convex HFL, the utility function of the convergence speed is quadratically related to the number of participating clients. By assuming that the information of each client-ES pair is perfectly known by NO, we show that the client selection problem is a submodular maximization problem with knapsack constraints (one per ES) and one matroid constraint. We use the Fast Lazy Greedy (FLGreedy) algorithm to approximate the optimal solution with a performance guarantee. The analysis shows that the COCS policy also achieves a sublinear regret in this setting.
The rest of this paper is organized as follows: Section II overviews the related work. The system model and client selection problem of HFL are presented in Section III. We design the Context-aware Online Client Selection (COCS) policy for strongly convex HFL and provide an analytical performance guarantee in Section IV. Section V presents the COCS policy for non-convex HFL, which relies on approximate Oracle solutions. Simulation results are shown in Section VI, followed by the conclusion in Section VII.
II Related Work
We provide a brief literature review that includes three main lines of work: client selection for FL, HFL, and the MAB problems for FL.
Due to the heterogeneous clients and limited resources of wireless FL networks, client selection can significantly improve the performance of FL in terms of convergence speed and training latency. For example, 
designs a deep reinforcement learning algorithm (the local model updates and the global model are considered as states) to select clients.  uses gradient information to select clients: if the inner product between a client’s local gradient and the global gradient is negative, the client is excluded. In , they develop a system model to estimate the total number of aggregation rounds and design a greedy algorithm to jointly optimize client selection and bandwidth allocation throughout the training process, thereby minimizing the training latency.  designs a dynamic client sampling method, selecting fewer clients in the early aggregation rounds and more clients in later rounds. This method has been proven to improve training loss and accuracy as well as to decrease the overall energy consumption. These client selection methods focus on conventional FL, which differs from our consideration of HFL.
HFL has been considered a more practical FL framework for current MEC systems, since the hierarchical architecture makes FL communication more efficient and significantly reduces latency. Later studies improve the performance of HFL from different perspectives or apply it to other domains. For example, [16, 15] propose a detailed convergence analysis of HFL, showing that HFL achieves a linear speedup over conventional FL. Recently, FL has attracted more interest, especially with the rapid development of ML applications on IoT devices.  designs a hierarchical blockchain framework for knowledge sharing on smart vehicles, which learn from environmental data through ML methods and share the learned knowledge with others.  uses HFL to better adapt to personalized modeling tasks and protect private information.
The MAB problem has been extensively studied to address the key trade-off between exploration and exploitation under an uncertain environment, and it has been used in FL for designing client scheduling or selection [28, 29, 30].  designs a client scheduling problem and provides a MAB-based framework for FL training without knowing the wireless channel state information and the dynamic usage of local computing resources. In order to minimize the latency,  models fairness-guaranteed client selection as a Lyapunov optimization problem and presents a policy based on CC-MAB to estimate the model transmission time. A multi-agent MAB algorithm is developed in  to minimize the FL training latency over wireless channels, constrained by training performance as well as each client’s differential privacy requirement. In this paper, the COCS policy is proposed to select clients for HFL. Due to the dynamic connection conditions of client-ES pairs and the limited available computing capacities of clients in each edge aggregation round, the COCS policy faces more challenges than those addressed in these studies.
III System Model and Problem Formulation
III-A Preliminaries of Hierarchical Federated Learning
The Network Operator (NO) leverages a typical edge-cloud architecture to offer the Federated Learning (FL) service, known as Hierarchical FL (HFL) [14, 16, 15], as shown in Fig. 1. Unlike conventional FL [2, 1, 3, 4], which involves only clients and a Cloud Server (CS), HFL consists of a set of mobile devices/clients, indexed by , a set of Edge Servers (ESs), indexed by , and a Cloud Server (CS). Let denote the set of clients that can communicate with ES in the edge aggregation in round . Note that the communication areas of different edge servers may overlap (i.e., ). Each client is able to communicate with a subset of ESs in round . In particular, we assume that each client is equipped with a single antenna, so that it can communicate with only one ES at a time even if it is located in the overlapping area of multiple ESs in a round. Let denote the parameters of the global model. The goal of the FL service is to find the optimal parameters of the global model
, which minimizes the average loss function under the HFL network as follows:
where is the selected client set by the ES with the number in each round, is the loss function associated with the local dataset on client , and is the loss of data sample . The objective of the loss function
(i) At the beginning of round , each ES randomly selects a subset of clients in its coverage area. Even if a client is in an overlapping area, it is only allowed to communicate with one ES in one round. We assume that the HFL network contains a backhaul link over which the ESs exchange their selections, so that no client is selected by multiple ESs. Each client selected by ES downloads the edge model and sets it to be the local model , .
where , is the learning rate, and is the stochastic gradient of (i.e., ).
(iii) After local training epochs, client uploads the local model updates to ES . Instead of aggregating all local models at the CS at the end of each round as in conventional FL [2, 3, 4], local model updates are averaged within ES to form the edge model , called edge aggregation, which is given as follows:
(iv) Every rounds of edge aggregation, the global model is computed by , from all ESs, called global aggregation. Then, each ES downloads the global model and sets it to be its edge model .
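As a minimal numerical sketch of steps (iii) and (iv), the two aggregation levels can be written as follows; the uniform edge average and the client-count weighting at the CS are common choices assumed here, since the paper's exact weighting coefficients are given only symbolically.

```python
import numpy as np

def edge_aggregate(local_updates):
    """Average the local model updates received by one ES (edge aggregation)."""
    return np.mean(local_updates, axis=0)

def global_aggregate(edge_models, client_counts):
    """Weighted average of edge models at the CS (global aggregation),
    weighted here by the number of clients each ES aggregated."""
    weights = np.asarray(client_counts, dtype=float)
    weights /= weights.sum()
    return np.average(edge_models, axis=0, weights=weights)

# Two ESs, each averaging its own clients' local models.
edge_a = edge_aggregate([np.array([1.0, 2.0]), np.array([3.0, 4.0])])  # -> [2., 3.]
edge_b = edge_aggregate([np.array([5.0, 6.0])])                        # -> [5., 6.]
global_model = global_aggregate([edge_a, edge_b], client_counts=[2, 1])
print(global_model)  # -> [3. 4.]
```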
III-B Cost of Client Selection
Since clients usually do not belong to NO, clients charge NO for the amount of requested computation resources for collecting the dataset and performing the local training to achieve the learning goal. At the beginning of each edge aggregation round, each client reveals its available computation resources , including CPU frequency, RAM and storage for the current round . NO pays to the client depending on the available computation resources, where is a non-decreasing function. Due to the limited budget of NO, in any edge aggregation round , the client selection decision of NO must satisfy the budget constraint .
III-C Deadline-Based HFL
In summary, an edge aggregation consists of four stages: Download Transmission (DT), Local Computation (LC), Upload Transmission (UT) and Edge Computation (EC).
In the DT stage, the selected client downloads the current edge model from ES . Following Shannon’s capacity formula, the channel state of DT is calculated by:
where is the transmission power, is the downlink wireless channel gain and is the noise power. Let denote the downloading data size (i.e., the size of edge model ) and the allocated bandwidth is in edge aggregation round . Therefore, the DT time for client is . Once the client receives , training enters the LC stage (i.e., it updates the local model using its own dataset according to Eq. (2)). The LC time of each client is determined by the local computation resources in the current round . Given the computation resources , the LC time can be obtained as , where is the computation workload, which is based on the complexity of the learning model and data. When the LC is finished, client uploads its local model updates to ES . Similar to the channel state definition of DT in Eq. (4), the channel state of UT is and the UT time is , where is the uplink channel gain and is the uploading data size (i.e., the size of ). Finally, if the local model updates of all selected clients are received by the ESs, the edge models are computed according to Eq. (3). The EC time is , where is the edge model workload and is the processing capacity of ES . Since is the same for all the selected clients at ES , the training time of client is defined as follows:
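The per-stage delays above can be combined as in the following sketch; all parameter names are illustrative stand-ins for the paper's symbols, and the uplink and downlink channel gains are taken to be equal purely for brevity.

```python
import math

def shannon_rate(bandwidth, p_tx, gain, noise):
    """Achievable rate r = B * log2(1 + p*g / sigma^2) (Shannon capacity)."""
    return bandwidth * math.log2(1.0 + p_tx * gain / noise)

def training_time(bandwidth, p_tx, gain, noise, dl_bits, ul_bits,
                  cpu_freq, workload, edge_workload, edge_capacity):
    """Total per-round time of one client: DT + LC + UT + EC."""
    rate = shannon_rate(bandwidth, p_tx, gain, noise)
    t_dt = dl_bits / rate                 # Download Transmission
    t_lc = workload / cpu_freq            # Local Computation (all epochs folded in)
    t_ut = ul_bits / rate                 # Upload Transmission
    t_ec = edge_workload / edge_capacity  # Edge Computation
    return t_dt + t_lc + t_ut + t_ec

t = training_time(bandwidth=1e6, p_tx=1.0, gain=3.0, noise=1.0,
                  dl_bits=2e6, ul_bits=2e6, cpu_freq=1e9, workload=1e9,
                  edge_workload=1e9, edge_capacity=1e9)
print(t)  # -> 4.0 (1.0 s per stage with these illustrative numbers)
```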
Due to some physical limitations, e.g., low computation capability and unstable communication, some clients may incur huge training latency in one edge aggregation round. Therefore, deadline-based FL [31, 32, 33] is more realistic for dealing with straggler clients. Specifically, ESs drop the clients whose local model updates cannot be received before the deadline (i.e., any client such that ). In this paper, we consider deadline-based HFL. Therefore, the edge aggregation can be reformulated as:
is a binary random variable representing whether a client’s model update can be received before the deadline (i.e., if , ; otherwise, ), and is the training time of the -th fastest client. In order to guarantee a minimum level of training performance, we require that at least local model updates be used for edge aggregation. Therefore, in case fewer than clients’ updates are received before the deadline, the system has to wait for some additional time . For practical values of , the probability of having fewer than client updates received before the deadline is small. For analytical convenience, we assume that at least client updates can be received before the deadline in every edge aggregation round. In addition, we assume that the deadlines of all ESs are set to be the same, . The extension to heterogeneous deadlines is straightforward.
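The deadline rule and the fallback to the minimum number of required updates can be sketched as follows, with illustrative variable names (the returned indicator list corresponds to the binary variables above):

```python
def deadline_edge_aggregate(training_times, deadline, min_updates):
    """Return an indicator per selected client: 1 if its update arrives
    before the deadline, else 0. If fewer than `min_updates` arrive,
    extend the deadline to the finish time of the min_updates-th fastest
    client (the additional waiting time in the text)."""
    received = [1 if t <= deadline else 0 for t in training_times]
    if sum(received) < min_updates:
        extended = sorted(training_times)[min_updates - 1]
        received = [1 if t <= extended else 0 for t in training_times]
    return received

# Only 2 of 4 updates arrive by t=1.0, so the deadline stretches to the
# 3rd-fastest client (t=1.2) to meet min_updates=3.
print(deadline_edge_aggregate([0.5, 1.2, 3.0, 0.8], deadline=1.0, min_updates=3))
# -> [1, 1, 0, 1]
```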
III-D Utility Function of Client Selection in HFL
Some existing HFL studies [15, 16, 34] have demonstrated that the convergence speed depends on the number of participating clients in each edge aggregation round for both strongly convex and non-convex HFL (i.e., the more clients participate, the faster the convergence speed). To corroborate these theoretical results, we show the training performance on our simulated HFL network with and in Fig. 2, where it is observed that more participating clients on the ESs improve the performance in both the strongly convex and non-convex HFL settings.
For now, we consider strongly convex HFL, where the convergence speed is linearly dependent on the number of participating clients. The client selection policy for non-convex HFL will be developed in Section V. As in [31, 32, 33], not all the selected clients in may reach the EC stage (i.e., ) due to straggler drop-out. To achieve a targeted convergence criterion, NO thus needs to run more FL rounds, thereby incurring a higher training cost. Therefore, it is necessary to develop an efficient client selection policy that improves the convergence speed of HFL by letting more clients participate in every round without dropping out. Let ; then the utility of the client selection decision on ES is defined as:
Further, let denote the client selection decision of the overall system and . Therefore, the utility function of the whole HFL network is defined as:
III-E Client Selection Problem Formulation
The client selection problem for NO is a sequential decision-making problem. The goal of NO is to make selection decisions that maximize the cumulative utility over a total of aggregation rounds. If an ES selects very few clients, its training performance may be degraded and the computation resources of the ESs may be wasted. To avoid bottlenecks in HFL, we consider that NO equally divides the budget among the ESs (i.e., for each , its budget is ). Assuming that NO knows a priori whether a selected client can return its model updates to the corresponding edge server in time, namely , the client selection problem is formulated as:
The following challenges should be addressed to solve the client selection problem in HFL networks: (i) To maximize the expected training utility of HFL, it is necessary to precisely estimate which selected clients will successfully participate in each edge aggregation round. In addition, since NO does not have enough experience to determine the selected clients in the first several rounds (i.e., cold start), collecting historical data for estimation is important for this policy. (ii) Given the estimates of successfully participating clients, how to optimize the selection decision on each ES under the limited budget should be carefully considered, because a high variance in the number of participating clients across ESs degrades the training performance. Therefore, we equally divide the total budget among the ESs (constraint (9b)). (iii) Due to the movement of each client, the set of available connecting ESs is time-varying (constraint (9c)), which makes an efficient client selection decision more difficult. Note that constraint (9d) guarantees that each client can be selected to communicate with at most one ES. (iv) Since the selection decisions are based on the estimated participating clients , the accuracy of this estimation directly influences the training utility of NO. The following section proposes a policy based on Multi-Armed Bandits (MAB) to address these challenges.
IV Context-Aware Online Client Selection Policy for Strongly Convex HFL
In this section, we formulate our client selection problem of HFL as a Contextual Combinatorial Multi-Armed Bandit (CC-MAB). The combinatorial property arises because NO pays for computation resources from multiple clients to maximize the training utility. The contextual property arises because NO leverages contexts associated with clients to infer their participation probabilities. In this paper, whether a client successfully participates in the corresponding ES depends on many side factors, which are collectively referred to as context. We use contextual information to help infer the number of participating clients.
In CC-MAB, NO observes the context of clients at the beginning of each edge aggregation round before making the client selection decision. Recall that the participation probability of a client-ES pair depends on and in Eq. (5). Although it is difficult, if not impossible, for ESs to know the channel state of clients, the ESs can easily capture their own channel state to connected clients [35, 36]. Based on the channel state of DT and bandwidth , ESs can compute the DT rate . In addition, since the movement speed of clients is much slower than the transmission speed of wireless signals, although NO cannot know the UT rate , it can be inferred from (assuming that clients are not located in the same area in every edge aggregation round). Note that the computation resources of any client are revealed at the beginning of each round. Therefore, we set and as the context and use this information to help infer the participation probabilities. Let denote the context of client-ES pair in edge aggregation round . Without loss of generality, we normalize in a bounded space using min-max feature scaling. The contexts of all clients on all ESs are collected in . Whether client successfully participates on ES is a random variable parameterized by the context . We slightly simplify the notation of selected clients and define the context-aware . Specifically, is a mapping function for each client-ES pair , since the training time of clients is usually location-dependent, e.g., on the distance between the client and the ES, the communication environment, and other processing tasks on a client. We further define as the expected value (i.e., the participation probability ) of .
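The min-max feature scaling of the context vector can be sketched as follows; the three features and their ranges are hypothetical examples, since the exact context dimensions are only given symbolically in the text.

```python
def normalize_context(raw, lo, hi):
    """Min-max scale each raw context feature (e.g., DT rate, inferred UT
    rate, revealed computation resources) into the unit interval [0, 1]."""
    return [(v - l) / (h - l) if h > l else 0.0
            for v, l, h in zip(raw, lo, hi)]

# Feature names and ranges are illustrative, not the paper's notation.
ctx = normalize_context(raw=[5.0, 2.0, 1.5],   # [dt_rate, ut_rate, cpu_freq]
                        lo=[0.0, 0.0, 1.0],
                        hi=[10.0, 4.0, 3.0])
print(ctx)  # -> [0.5, 0.5, 0.25]
```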
IV-A Oracle Solution and Regret
Similar to the existing CC-MAB studies [17, 18, 19], before presenting our policy design, we first give an Oracle benchmark solution to the client selection problem of HFL by assuming that NO knows the context-aware successful participation probability , . In this ideal setting, the utility function is perfectly known by NO, and thus we can obtain the optimal value of the client selection problem. The long-term selection problem P1 can be decomposed into independent subproblems in each round:
is a combinatorial optimization problem with knapsack and matroid constraints. The combinatorial property arises because NO must choose a proper client selection decision to optimize the participation probabilities on all ESs in order to achieve a higher convergence speed. The knapsack constraints come from constraint (10b), which bounds the computation resource payment on each ES.
To prove that (10c) is a matroid constraint, we first state the definition of a matroid. A matroid is a system of independent sets, in which is a finite set (named the ground set) and represents the set of independent subsets of . It has the following three properties: (1) and contains at least one subset of ; (2) for each , if , then ; (3) if , and , then there exists such that .
In the subproblem P2, let denote the ground set of the matroid , and consists of subsets of (i.e., , , ), where each includes at most one client from for each . We can write as , s.t. , . In this paper, is the set of all feasible client selection decisions. Therefore, it can be verified that Eq. (10c) is a matroid constraint [19, 37].
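Under the (assumed) representation of a selection decision as one client set per ES, the knapsack constraint (10b) and the partition-matroid constraint (10c) can be checked with a few lines:

```python
def is_feasible(selection, payments, budget_per_es):
    """Check a client selection decision against the constraints of P2:
    - knapsack: the total payment at each ES stays within its budget share;
    - partition matroid: each client is assigned to at most one ES.
    `selection` maps es_id -> set of client ids; `payments` maps client -> cost."""
    seen = set()
    for es, clients in selection.items():
        if sum(payments[c] for c in clients) > budget_per_es:
            return False          # knapsack constraint violated at this ES
        if seen & clients:
            return False          # a client assigned to two ESs (matroid violated)
        seen |= clients
    return True

ok = is_feasible({0: {1, 2}, 1: {3}},
                 payments={1: 1.0, 2: 1.5, 3: 2.0},
                 budget_per_es=3.0)
print(ok)  # -> True
```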
Based on our analysis, it is easy to observe that P2 is NP-hard; nevertheless, it can be solved by brute force if the size of the HFL network is moderate. If the HFL network is too large, NO can use commercial software, e.g., CPLEX, to obtain the optimal solution. For simplicity, we denote the optimal Oracle solution of each P2 in edge aggregation round by . However, in practice, prior knowledge of the participating clients is unavailable, and thus NO has to make a selection decision based on the estimated participating clients in each edge aggregation round. Intuitively, NO should design an online client selection policy that chooses based on the estimation . The performance of an online client selection policy is measured by its utility loss compared with the Oracle solution, called regret. For a selection sequence given by a policy, the expected regret is:
The expectation is taken with respect to the decisions made by the client selection policy and the clients’ participation over contexts.
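Given the per-round utilities of the Oracle and of an online policy, the empirical counterpart of the regret is simply the cumulative utility gap; a minimal sketch:

```python
def empirical_regret(oracle_utilities, policy_utilities):
    """Cumulative utility loss of an online policy relative to the Oracle
    that knows the true participation probabilities in every round."""
    return sum(o - p for o, p in zip(oracle_utilities, policy_utilities))

# A policy that catches up with the Oracle accumulates less and less regret.
print(empirical_regret([4.0, 4.0, 4.0], [2.0, 3.0, 4.0]))  # -> 3.0
```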
IV-B Context-Aware Online Client Selection Policy
We now present our online client selection policy, named Context-aware Online Client Selection (COCS). The COCS policy is designed based on CC-MAB. In edge aggregation round , the COCS process of NO operates sequentially as follows: (i) NO observes the contexts of all client-ES pairs , . (ii) NO determines its selection decision based on the observed context information in the current round and the knowledge learned from previous rounds. (iii) The selection decision is applied. If , , the clients located in the coverage area of ES can be selected by ES for training in round . (iv) At the end of the round, all ESs observe which clients’ local model updates have been received, and this information is then used to update the participation estimates for the observed contexts of the client-ES pairs .
The pseudocode of COCS policy is presented in Algorithm 1. It has two parameters and to be designed, where is a deterministic and monotonically increasing function used to identify the under-explored context, and decides how we partition the context space. The COCS policy is stated as follows:
Initialization Phase: Given parameter , the proposed policy first creates a partition denoted by for the context space , which splits into sets. Each set is a -dimensional hypercube with size . Note that is an important input parameter to guarantee the policy performance. For each hypercube , NO keeps a counter for each client and each ES . For the tuple of a counter , we define a selection event that represents a selection decision satisfying the three following conditions: 1) client is selected by an ES ; 2) the ES successfully receives the client’s update before (i.e., ); 3) the context of client-ES pair belongs to (i.e., ). The counter stores the number of times that the event has occurred up to edge aggregation round . Each ES also saves an experience for each client and each hypercube , which contains the observed participation indicators when a selection event occurs. Based on the observed participation indicators in , the estimated participation probability for a selection event is computed by:
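The counters and the empirical estimate of the initialization phase can be maintained as follows; the class and key layout are implementation assumptions, not the paper's notation:

```python
from collections import defaultdict

class ParticipationEstimator:
    """Per (client, ES, hypercube) tuple, keep the counter of selection
    events and the empirical mean of the observed participation indicators."""
    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def update(self, client, es, cube, participated):
        key = (client, es, cube)
        self.count[key] += 1
        self.total[key] += float(participated)  # participated is 0 or 1

    def estimate(self, client, es, cube):
        key = (client, es, cube)
        if self.count[key] == 0:
            return 0.0  # cold start: no observations for this tuple yet
        return self.total[key] / self.count[key]

est = ParticipationEstimator()
for outcome in (1, 1, 0, 1):
    est.update(client=5, es=0, cube=(2, 1), participated=outcome)
print(est.estimate(5, 0, (2, 1)))  # -> 0.75
```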
In each edge aggregation round , the COCS policy has the following phases:
Hypercube Identification Phase: If the local model updates of client can be successfully received by an ES in edge aggregation round , we obtain the hypercube for the context , and the estimated participation probability of client on ES is . Let denote the collection of all estimated participation probabilities. To make a client selection decision, the COCS policy needs to check whether these hypercubes have been explored sufficiently to ensure adequate accuracy of the estimated participation probability for each client-ES pair . Therefore, we define the under-explored hypercubes for ES in edge aggregation round as follows:
Also, let denote the collection of under-explored clients for each ES . The challenge for the COCS policy is to decide, in each edge aggregation round, whether the current estimates of participating clients are accurate enough to guide the client selection decision, which is referred to as exploitation, or whether more training results need to be collected for a certain hypercube, which is referred to as exploration. The COCS policy aims to balance the exploration and exploitation phases in order to maximize the utility of NO up to a finite round . Based on , COCS then enters either an exploration phase or an exploitation phase.
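Identifying under-explored hypercubes amounts to comparing each counter against the control function; in this sketch the threshold t^(1/3)·log t is only a placeholder for the paper's control function, whose exact form is fixed by the design parameters:

```python
import math

def under_explored(counters, cubes, t):
    """Return the hypercubes whose counters fall below the control
    threshold K(t). The form K(t) = t^(1/3) * log(t) used here is an
    illustrative placeholder; any monotonically increasing K(t) fits."""
    threshold = (t ** (1.0 / 3.0)) * math.log(t)
    return {c for c in cubes if counters.get(c, 0) < threshold}

# At t=100 the threshold is about 21.4, so "cube_a" is still under-explored.
counters = {"cube_a": 2, "cube_b": 50}
print(under_explored(counters, {"cube_a", "cube_b"}, t=100))  # -> {'cube_a'}
```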
Exploration Phase: First, let denote that an ES has under-explored clients, and that it does not. If ES has a non-empty , then COCS enters the exploration phase. Two cases may arise in the exploration phase:
(i) All clients have under-explored ESs. Intuitively, NO hopes to receive more local training updates . Therefore, the COCS policy aims to select as many clients with under-explored ESs as possible, which is sequentially solved by the following optimization:
where is the size of the collection .
(ii) Only some of the ESs have under-explored clients . We divide this case into two stages: NO first selects the ESs that have under-explored clients by solving the following optimization:
where is the client selection decision on an ES that has under-explored clients and is the size of the collection . Second, the ESs aim to select the explored clients . Here, we assume that there exist ESs such that , where . Therefore, these ESs can select clients under the following constraint:
Otherwise, NO does not need to select clients in since no budget is left. Under this condition, the client selection decisions are jointly optimized by solving the following problem:
Exploitation Phase: If the set of under-explored clients is empty (i.e., ), the COCS policy enters the exploitation phase. The optimal client selection decision is derived by solving P2 with the current estimated participating clients :
Update Phase: After selecting the client-ES pairs in round , the proposed COCS policy observes whether the local model updates of the selected clients are received before the deadline ; it then updates and for each hypercube .
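Putting the phases together, the per-round control flow of COCS can be sketched as below; the greedy exploitation rule replaces the exact solution of P2 purely for illustration, and all data structures are assumptions of this sketch:

```python
def cocs_round(t, counters, estimates, contexts, threshold, budget):
    """One COCS round (toy control flow): explore pairs whose hypercube
    counter is below the control threshold K(t); otherwise exploit by
    greedily taking the pairs with the highest estimated participation
    probability, subject to a simple cardinality budget."""
    ue = [pair for pair, cube in contexts.items()
          if counters.get(cube, 0) < threshold]
    if ue:                                   # exploration phase
        return ue[:budget]
    ranked = sorted(contexts,                # exploitation phase
                    key=lambda pair: -estimates.get(contexts[pair], 0.0))
    return ranked[:budget]

# Both hypercubes are sufficiently explored, so the pair with the higher
# estimated participation probability is exploited.
contexts = {("c1", "e1"): "cube_a", ("c2", "e1"): "cube_b"}
selection = cocs_round(t=10, counters={"cube_a": 5, "cube_b": 5},
                       estimates={"cube_a": 0.9, "cube_b": 0.4},
                       contexts=contexts, threshold=3, budget=1)
print(selection)  # -> [('c1', 'e1')]
```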
IV-C Performance Analysis
To present an upper performance bound of the COCS policy in terms of regret, we make the assumption that clients’ participation behaviors are similar when their contexts are similar. This natural assumption is formalized by the following Hölder condition [17, 18, 19]:
(Hölder Condition). A real function on a -dimensional Euclidean space satisfies the Hölder condition if there exist , such that for any , it holds that for an arbitrary client , where is the Euclidean norm.
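In symbols (using μ for the expected participation probability as a function of the context and L, α for the Hölder constants, since the paper's own symbols are not reproduced in this text), the condition reads:

```latex
% Hölder condition: \mu is the expected participation probability as a
% function of the context; L > 0 and \alpha > 0 are the Hölder constants.
\left| \mu(x) - \mu(x') \right| \;\le\; L \,\bigl\lVert x - x' \bigr\rVert^{\alpha},
\qquad \forall\, x, x' \in \mathcal{X}
```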
By properly designing the input parameters and , we show that the COCS policy achieves a sublinear regret with , which guarantees that COCS has an asymptotically optimal performance. This means that the online client selection decisions made by the COCS policy converge to the Oracle solution. Because any edge aggregation round is either in the exploration or the exploitation phase, the regret can be divided into two parts , where and are the regrets due to the exploration and exploitation phases, respectively. The total regret bound is obtained by separately bounding these two parts. Therefore, we present the following two lemmas bounding the exploration and exploitation regrets.
(Bound of .) Given the input parameters and , where and , the regret is bounded by:
Proof. See Appendix A in the supplemental file.
Lemma 1 shows that the order of is determined by the control function and the number of hypercubes in partition .
(Bound of .) Given and , where and , if the Hölder condition holds true and the additional condition is satisfied with , , , for all , then is bounded by:
Proof. See Appendix B in the supplemental file.
Lemma 2 indicates that the regret of exploitation depends on the choice of and , with an additional condition being satisfied. Based on the above two lemmas, we have the following theorem for the upper bound of the regret .
(Bound of .) Given the input parameters and , where and , if the Hölder condition holds true and the additional condition is satisfied with , , , for all , then the regret can be bounded by:
Proof. See Appendix C in the supplemental file.
The regret upper bound in Theorem 1 is given for properly chosen input parameters and . However, the values of and are not yet determined. Next, we show the regret upper bound of COCS under a concrete design of these parameters.
(Regret upper bound). If we select , , , , and COCS algorithm runs with these parameters, the regret can be bounded by:
where . The dominant order of the regret is