
Federated X-Armed Bandit

by   Wenjie Li, et al.

This work establishes the first framework of federated 𝒳-armed bandit, where different clients face heterogeneous local objective functions defined on the same domain and are required to collaboratively figure out the global optimum. We propose the first federated algorithm for such problems, named Fed-PNE. By utilizing the topological structure of the global objective inside the hierarchical partitioning and the weak smoothness property, our algorithm achieves sublinear cumulative regret with respect to both the number of clients and the evaluation budget. Meanwhile, it only requires logarithmic communications between the central server and clients, protecting the client privacy. Experimental results on synthetic functions and real datasets validate the advantages of Fed-PNE over single-client algorithms and federated multi-armed bandit algorithms.





1 Introduction

Federated bandit is a newly-developed bandit problem that incorporates federated learning with sequential decision making (McMahan et al., 2017; Shi and Shen, 2021a). Unlike the traditional bandit models where the exploration-exploitation tradeoff is the only major concern, the federated bandit problem also takes into account the modern concerns of data heterogeneity and privacy protection for trustworthy machine learning. In particular, in the federated learning paradigm, the data available to each client could be drawn from non-i.i.d. distributions, making collaboration between the clients necessary to make valid inferences for the aggregated global model. However, due to user privacy concerns and the large communication cost, such collaboration across the clients must be restricted and must avoid direct transmission of the local data. To make correct decisions in the future, the clients have to utilize the limited communications with each other and coordinate exploration and exploitation correspondingly.

To the best of our knowledge, existing results of federated bandits, such as Dubey and Pentland (2020); Huang et al. (2021); Shi and Shen (2021a); Shi et al. (2021b), focus on either the case where the number of arms is finite (multi-armed bandit), or the case where the expected reward is a linear function of the chosen arm (linear contextual bandit). However, for problems such as dynamic pricing (Chen and Gallego, 2022) and hyper-parameter optimization (Shang et al., 2019), the available arms are often defined on a domain $\mathcal{X}$ with infinite or even uncountable cardinality, and the reward function is usually non-linear with respect to the metric employed by the domain $\mathcal{X}$. These problems challenge the application of existing federated bandit algorithms to more complicated real-world problems. Two applications that motivate our study of the federated $\mathcal{X}$-armed bandit are given below.

  • Federated medicine dosage recommendation. For the dosage recommendation of a newly-invented medicine/vaccine (in terms of volume or weight), the clinical trials could be conducted at multiple hospitals (clients). To protect patients' privacy, hospitals cannot directly share the treatment result of each trial (reward). Moreover, because of the demographic differences among the patient groups, the best dosage obtained at each hospital (i.e., the optimum of the local objective) could be different from the optimal recommended dosage for the entire population of the state (i.e., the optimum of the global objective). Researchers need to collaboratively find the globally optimal dosage by exploring and exploiting the local data.

  • Federated hyper-parameter optimization. An important application of automating machine learning workflows with minimal human intervention is hyper-parameter optimization for ML models, e.g., learning rate, neural network architecture, etc. Much modern data is collected by mobile devices (clients). The model performance (reward) of each hyper-parameter setting could be different for each mobile device (i.e., local objectives) due to user heterogeneity. To fully utilize the whole dataset (i.e., global objective) for hyper-parameter optimization such that the obtained auto-ML model can work seamlessly in diverse scenarios, the central server needs to coordinate the local search properly without violating regulations on consumer data privacy.

Figure 1: Real-life applications that motivate the federated $\mathcal{X}$-armed bandit problem.

In the aforementioned examples, the reward objectives are defined on a domain $\mathcal{X}$, which can often be formulated as a region of a Euclidean space and has infinite cardinality. Moreover, the objective (both the local and the global ones) is highly nonlinear with respect to the chosen arm due to the complex nature of the problem. Therefore, the basic assumptions of federated multi-armed bandit or federated linear contextual bandit algorithms are violated, and thus the existing federated bandit algorithms cannot be applied to, or do not perform well on, such problems.

Under the classical setting where centralized data are immediately available, $\mathcal{X}$-armed bandit algorithms such as HOO and HCT have been proposed to find the optimal arm inside the domain $\mathcal{X}$ (Bubeck et al., 2011; Azar et al., 2014). But these algorithms cannot be trivially adapted to the task of finding the global optimum when there are multiple clients and limited communication. The local objectives could have very different landscapes across clients due to the non-i.i.d. local datasets, and no efficient communication method has been established between $\mathcal{X}$-armed bandit algorithms that run on the local datasets. In this work, we propose a new federated algorithm in which all the clients collaboratively learn the best solution to the global $\mathcal{X}$-armed bandit model on average, while only a few communications (in terms of both amount and frequency) are required so that the privacy of each client is preserved.

We highlight our major contributions in this paper as follows.

  • Federated $\mathcal{X}$-armed bandit. We establish the first framework of the federated $\mathcal{X}$-armed bandit problem (Sec. 2), which naturally connects the $\mathcal{X}$-armed bandit problem with the characteristics of federated learning. The new framework introduces many new challenges to $\mathcal{X}$-armed bandit, including (1) potentially severe heterogeneity among the local objectives due to non-i.i.d. local datasets, (2) the inaccessibility of the global objective for all local clients and the central server, and (3) the restriction of communications between the server and the clients.

  • New algorithm with desirable regret. We propose a new algorithm for the federated $\mathcal{X}$-armed bandit problem named Fed-PNE. Inspired by the heuristic of arm elimination in multi-armed bandits (Lattimore and Szepesvári, 2020), the new algorithm performs hierarchical node elimination in the domain $\mathcal{X}$. More importantly, it incorporates efficient communications between the server and the clients to transmit information while protecting client privacy. We establish a sublinear cumulative regret upper bound for the proposed algorithm as well as a bound on its communication cost. Theoretically, we prove that Fed-PNE utilizes the advantage of federation and at the same time has high communication efficiency (Sec. 3). Theoretical comparisons of our regret bounds with existing bounds are provided in Table 1.

  • Empirical results. By examining the empirical performance of our Fed-PNE algorithm on both synthetic functions and real datasets (Sec. 4), we verify the correctness of our theoretical results. We show the advantages of Fed-PNE over single-client $\mathcal{X}$-armed bandit algorithms and federated multi-armed bandit algorithms. The empirical results exhibit the usefulness of our algorithm in real-life applications.

Bandit algorithms Average Regret Upper Bound Communication cost Privacy
HCT (Azar et al. (2014)) N.A.
Fed-PNE (This work)
Table 1: Comparison of the (client-wise) regret upper bound and the communication cost on the $\mathcal{X}$-armed bandit problem for sufficiently large $T$. Centralized results are adapted from centralized algorithms such as HOO (Bubeck et al., 2011) and HCT (Azar et al., 2014) by assuming that the server makes all the decisions with access to all client-wise information. Notations: $M$ denotes the number of clients; $T$ denotes the budget (time horizon) and $d$ denotes the near-optimality dimension in Assumption 3.

2 Preliminaries

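We first introduce the preliminary concepts and notations used in this paper. For a real number $a \in \mathbb{R}$, we use $\lceil a \rceil$ to represent the smallest integer no smaller than $a$. For an integer $N \in \mathbb{N}$, we use $[N]$ to represent the set of integers $\{1, 2, \dots, N\}$. We use $\widetilde{O}(\cdot)$ to hide the logarithmic terms in big-$O$ notations, i.e., for two functions $a(n), b(n)$, $a(n) = \widetilde{O}(b(n))$ represents that $a(n)/b(n) \le \log^k(n)$ for some $k > 0$.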

2.1 Problem Formulation and Performance Measure

Let $\mathcal{X}$ be a measurable space of arms. We model the problem as a federated $\mathcal{X}$-armed bandit setting where a total of $M$ clients respectively have access to $M$ different local objectives $f_m: \mathcal{X} \to \mathbb{R}$, $m \in [M]$, which could be non-convex, non-differentiable and even non-continuous. Given a limited number of rounds $T$, each client $m$ chooses a point $x_{m,t} \in \mathcal{X}$ at each round $t \in [T]$ and observes a noisy feedback $r_{m,t}$ defined as $r_{m,t} := f_m(x_{m,t}) + \epsilon_{m,t}$, where $\epsilon_{m,t}$ is a zero-mean and bounded random noise independent of previous observations and of other clients' observations. The goal of the clients is to find the point that maximizes the global objective $f$, which is defined to be the average of the local objectives, i.e.,

$$f(x) := \frac{1}{M} \sum_{m=1}^{M} f_m(x).$$

However, the global objective is not accessible by any client. The only information that a client has access to is: (1) noisy evaluations of its own local objective function $f_m$, and (2) communications between itself and the central server. For the global objective, we assume that there is at least one global maximizer $x^* \in \mathcal{X}$ such that $f(x^*) = f^* := \sup_{x \in \mathcal{X}} f(x)$. Given the sequence of points chosen by the clients $\{x_{m,t}\}_{m \in [M], t \in [T]}$, the performance of the clients is measured by the expectation of the cumulative regret, defined as

$$\mathbb{E}[R(T)] := \mathbb{E}\left[\sum_{m=1}^{M} \sum_{t=1}^{T} \left(f^* - f(x_{m,t})\right)\right].$$

Another possible measure of algorithm performance is the so-called simple regret, which only evaluates the quality of the points chosen in the final round. This paper aligns with the standard federated bandit analysis framework and focuses on cumulative regret only. Moreover, as mentioned by Bubeck et al. (2011), the simple regret can always be bounded in terms of the average cumulative regret if we select the final point via a cumulative regret-based policy.
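The definitions of the global objective and the cumulative regret can be made concrete with a small numerical sketch. The two quadratic local objectives below are hypothetical and chosen only for illustration; they are not taken from the paper. They show that a point that is optimal for one client can still incur regret on the global objective.

```python
import math
from typing import Callable, Sequence

Objective = Callable[[float], float]


def make_global_objective(local_objectives: Sequence[Objective]) -> Objective:
    """Return f(x) = (1/M) * sum_m f_m(x), the average of the local objectives."""
    def f(x: float) -> float:
        return sum(f_m(x) for f_m in local_objectives) / len(local_objectives)
    return f


def cumulative_regret(
    f: Objective, f_star: float, chosen_points: Sequence[Sequence[float]]
) -> float:
    """Sum of f* - f(x_{m,t}) over all clients m and rounds t.

    chosen_points[m][t] is the point chosen by client m at round t.
    """
    return sum(f_star - f(x) for points in chosen_points for x in points)


# Hypothetical local objectives on [0, 1] with local maximizers 0.2 and 0.8.
local_objectives = [
    lambda x: 1.0 - (x - 0.2) ** 2,
    lambda x: 1.0 - (x - 0.8) ** 2,
]
f = make_global_objective(local_objectives)
f_star = f(0.5)  # the average of the two quadratics is maximized at x = 0.5

# Each client first plays its own local maximizer, then the global maximizer.
regret = cumulative_regret(f, f_star, [[0.2, 0.5], [0.8, 0.5]])
```

In this toy example each local maximizer is suboptimal for the global objective, so a client that only optimizes its own objective keeps accumulating regret.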

2.2 Hierarchical Partitioning of the Parameter Space

Following the recent progress in centralized $\mathcal{X}$-armed bandit (e.g., Azar et al., 2014; Shang et al., 2019; Bartlett et al., 2019), we utilize a pre-defined infinitely-deep hierarchical partitioning $\mathcal{P} := \{\mathcal{P}_{h,i}\}$ of the parameter space to optimize the objective functions. The hierarchical partition discretizes the space $\mathcal{X}$ by recursively defining the following relationship:

$$\mathcal{P}_{0,1} := \mathcal{X}, \qquad \mathcal{P}_{h,i} := \bigcup_{j=0}^{k-1} \mathcal{P}_{h+1,\, ki-j},$$

where $k$ is the (maximum) number of disjoint children for one node, and for every node $\mathcal{P}_{h,i}$, $h$ and $i$ denote the depth and the index of the node inside the partition. Each node on depth $h$ is partitioned into $k$ children on depth $h+1$, while the union of all the nodes on each depth equals the parameter set $\mathcal{X}$. The partition is chosen before the optimization process, and the same partition of the space is shared and used by all the clients, as the partition itself reveals no information about the reward distributions of the local objectives. A simple and intuitive example is a binary equal-sized partition on the domain $[0, 1]$, where each node on depth $h$ has length $2^{-h}$.
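As a minimal sketch of the equal-sized partition example, the helpers below map a node, identified by its depth and index, to the interval it covers and to its children. The 1-based indexing within each depth and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (depth h, index i), with i in {1, ..., k**h}


def node_interval(
    h: int, i: int, k: int = 2, lo: float = 0.0, hi: float = 1.0
) -> Tuple[float, float]:
    """Interval covered by node (h, i) in an equal-sized k-ary partition of [lo, hi]."""
    if not 1 <= i <= k ** h:
        raise ValueError("index i must lie in {1, ..., k**h}")
    width = (hi - lo) / k ** h
    return lo + (i - 1) * width, lo + i * width


def children(h: int, i: int, k: int = 2) -> List[Node]:
    """The k children of node (h, i), all on depth h + 1, listed left to right."""
    return [(h + 1, k * i - j) for j in range(k - 1, -1, -1)]


root_children = children(0, 1)        # the two halves of [0, 1]
right_half = node_interval(1, 2)      # the interval of the second node on depth 1
```

For the binary case, the interval of a node on depth h has length 2 ** -h, and the intervals of its two children tile the parent interval exactly.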

2.3 Communication Model and Privacy Concerns

Similar to the setting of federated multi-armed bandit (Shi and Shen, 2021a; Huang et al., 2021), we assume that there exists a central server that coordinates the behaviors of all the different clients. The server has access to the same partition of the parameter space used by all the clients, and is able to communicate with the clients. Due to privacy concerns, the client-side algorithm should keep the reward of each evaluation confidential, and only local statistical summaries of the rewards can be transmitted to the server. The clients are not allowed to communicate with each other. In accordance with McMahan et al. (2017); Shi and Shen (2021a), we assume that the server and the clients are fully synchronized. Although the clients can communicate with the server, the number of clients could be very large and thus the communication would be very costly. We take such communication cost into account in our algorithm design and theoretical analysis.

2.4 Definitions and Assumptions

We use the following set of definitions and assumptions, following the prior works on $\mathcal{X}$-armed bandit such as Bubeck et al. (2011) and Azar et al. (2014).

Assumption 1.

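(Dissimilarity) The space $\mathcal{X}$ is equipped with a dissimilarity function $\ell: \mathcal{X}^2 \to \mathbb{R}$ such that $\ell(x, y) \ge 0$ for all $(x, y) \in \mathcal{X}^2$ and $\ell(x, x) = 0$.

Given the dissimilarity function $\ell$ in Assumption 1, the diameter of a set $\mathcal{A} \subseteq \mathcal{X}$ is defined as $\mathrm{diam}(\mathcal{A}) := \sup_{x, y \in \mathcal{A}} \ell(x, y)$. The open ball of radius $r > 0$ and with center $c \in \mathcal{X}$ is defined as $\mathcal{B}(c, r) := \{x \in \mathcal{X} : \ell(x, c) < r\}$. We now introduce the local smoothness assumptions.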

Assumption 2.

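(Local Smoothness) We assume that there exist constants $\nu_1, \nu_2 > 0$ and $0 < \rho < 1$ such that for all nodes $\mathcal{P}_{h,i}, \mathcal{P}_{h,j} \in \mathcal{P}$ on layer $h$,

  • $\mathrm{diam}(\mathcal{P}_{h,i}) \le \nu_1 \rho^h$.

  • $\exists\, x^{\circ}_{h,i} \in \mathcal{P}_{h,i}$ s.t. $\mathcal{B}_{h,i} := \mathcal{B}(x^{\circ}_{h,i}, \nu_2 \rho^h) \subset \mathcal{P}_{h,i}$.

  • $\mathcal{B}_{h,i} \cap \mathcal{B}_{h,j} = \emptyset$ for all $i \ne j$.

  • The global objective function $f$ satisfies that for all $x, y \in \mathcal{X}$, we have $f^* - f(y) \le f^* - f(x) + \max\{f^* - f(x), \ell(x, y)\}$.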

Remark 2.1.

Similar to the existing works on the $\mathcal{X}$-armed bandit problem, the dissimilarity function $\ell$ is not an explicit input required by our Fed-PNE algorithm, and only the smoothness constants $\nu_1, \rho$ are accessed (Bubeck et al., 2011; Azar et al., 2014). As mentioned by Bubeck et al. (2011); Grill et al. (2015), there are many functions that satisfy Assumption 2 on the standard equal-sized partition with accessible $\nu_1$ and $\rho$.

Finally, we introduce the definition of the near-optimality dimension as in Azar et al. (2014), which measures the number of near-optimal regions and thus the difficulty of the problem.

Assumption 3.

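(Near-optimality dimension) Let $\epsilon = 3\nu_1\rho^h$ and $\epsilon' = \nu_2\rho^h < \epsilon$. For any subset of $\epsilon$-optimal nodes $\mathcal{X}_{\epsilon} = \{x \in \mathcal{X} : f^* - f(x) \le \epsilon\}$, there exists a constant $C$ such that $\mathcal{N}(\mathcal{X}_{\epsilon}, \ell, \epsilon') \le C (\epsilon')^{-d}$, where $d$ is the near-optimality dimension of function $f$ and $\mathcal{N}(\mathcal{X}_{\epsilon}, \ell, \epsilon')$ is the $\epsilon'$-cover number of the set $\mathcal{X}_{\epsilon}$ w.r.t. the dissimilarity $\ell$.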

Remark 2.2.

Some recent works on centralized $\mathcal{X}$-armed bandit, such as Grill et al. (2015); Shang et al. (2019); Bartlett et al. (2019), have proposed an even weaker version of Assumption 2, i.e., local smoothness without a metric assumption. Correspondingly, they define a complexity measure named the near-optimality dimension w.r.t. the partition $\mathcal{P}$. However, it is not trivial to directly adopt the weak local smoothness assumption without a metric for the federated version of black-box optimization. Due to the limited communication, it would lead to continued sampling in the suboptimal regions, yielding large cumulative regret. As a pioneering work in federated $\mathcal{X}$-armed bandit, we choose to use the slightly stronger assumptions in Bubeck et al. (2011) so that theoretical guarantees for our Fed-PNE algorithm can be established. Weakening our set of assumptions while keeping the regret bound guarantee is a very interesting direction for future work.

3 Algorithm and Analysis

In this section, we introduce the new algorithm for the federated $\mathcal{X}$-armed bandit problem, show its uniqueness compared with multi-armed bandit algorithms, and provide its theoretical analysis.

3.1 The Fed-PNE Algorithm

We propose the Federated Phased Node Elimination (Fed-PNE) algorithm, which consists of one client-side algorithm (Algorithm 1) and one server-side algorithm (Algorithm 2). The Fed-PNE algorithm runs in dynamic phases and utilizes the hierarchical partition to gradually find the optimum by eliminating different regions of the domain. For a node $\mathcal{P}_{h,i}$, since its depth $h$ and index $i$ uniquely identify the node, we will use $(h, i)$ to index the nodes in the elimination and expansion process. We use $\mathcal{K}^p$ to denote the indices of active nodes that need to be sampled in phase $p$ and $\mathcal{E}^p$ for the indices of nodes that need to be eliminated. To obtain a reward over a node (i.e., pull a node), the client evaluates the local objective at some point $x$ in the node, where $x$ is either uniformly sampled from the node as in Bubeck et al. (2011) or some pre-defined point in the node as in Azar et al. (2014). The regret analysis differs only slightly between the two strategies because of the smoothness assumption. In Section 3.2, we use the latter strategy to derive our regret bound.

1:  Input: -nary partition
2:  Initialize
3:  while not reaching the time horizon  do
4:     Update
5:     Receive nodes and visited times from the server
6:     for  with  do
7:        Pull the node for times, receive rewards
8:        Calculate the local mean estimate

9:     end for
10:     Send the local estimates to the server
11:  end while
Algorithm 1 Fed-PNE: $m$-th client
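The client-side phase of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the domain is taken to be [0, 1] with a binary partition, each node is pulled at its midpoint (the "pre-defined point" strategy), and the noise is uniform on a bounded interval. The names `node_center`, `client_phase` and `noise_bound` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Node = Tuple[int, int]  # (depth h, index i), with i in {1, ..., 2**h}


def node_center(node: Node) -> float:
    """Midpoint of node (h, i) in a binary partition of [0, 1] (assumed domain)."""
    h, i = node
    return (i - 0.5) / 2 ** h


def client_phase(
    local_objective: Callable[[float], float],
    nodes: List[Node],
    pulls_per_node: int,
    noise_bound: float = 0.05,
    rng: Optional[random.Random] = None,
) -> Dict[Node, float]:
    """Pull every received node pulls_per_node times and return only local means.

    Individual rewards never leave this function, which mirrors the requirement
    that only statistical summaries are sent to the server.
    """
    rng = rng or random.Random(0)
    estimates: Dict[Node, float] = {}
    for node in nodes:
        x = node_center(node)
        # Zero-mean, bounded noise (assumed uniform for this sketch).
        rewards = [
            local_objective(x) + rng.uniform(-noise_bound, noise_bound)
            for _ in range(pulls_per_node)
        ]
        estimates[node] = sum(rewards) / pulls_per_node
    return estimates


# Noise-free usage with a hypothetical local objective.
local_estimates = client_phase(
    lambda x: 1.0 - (x - 0.2) ** 2, [(1, 1), (1, 2)], pulls_per_node=3, noise_bound=0.0
)
```

Returning a dictionary of per-node means keeps the message size proportional to the number of active nodes rather than to the number of pulls.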

Algorithm Explanation: At initialization, the server starts from the root of the partition . At the beginning of each phase , the server expands the exploration tree as described in Algorithm 2 and the set until the criterion is satisfied, where denote the number of elements in a set, and the threshold number is the minimum required number of times each node on depth needs to be pulled, defined as

where are two absolute constants, and is the confidence (details in Lemma A.2). The number of times each node has to be sampled by each client and the phase length are then computed. This unique expansion criterion and sampling scheme guarantee four important things at the same time: (1) every client samples every node at least once so that the global objective is explored; (2) the empirical averages in line 12 of Algorithm 2 are unbiased estimators of the global function values for every node; (3) every node is sampled a sufficient number of times (more than the threshold); (4) the waste of budget due to the limitation on communication is minimized. After the broadcast in line 9, every client receives the active nodes and the required number of pulls from the server.

Next, the clients perform the exploration and send only the empirical reward averages back to the server, as in Algorithm 1. The server computes the best node, denoted by , and decides the elimination set by the following selection criteria.


where and . In other words, for any node such that

, the function value of the global objective inside the node is much worse than the function value in the best node with high probability, and thus can be safely eliminated. The server then eliminate the bad nodes and proceed to the next phase with the new set

, which consists of nodes that are children of un-eliminated nodes in the previous phase, as shown in line 15-16 in Algorithm 2.

1:  Input: -nary partition , smoothness parameters
2:  Initialize
3:  while not reaching the time horizon  do
5:     while  or  do
6:        Renew
7:     end while
8:     Compute the number and the phase length
9:     Broadcast nodes to every client
10:     Receive local estimates from the clients
11:     for  every  do
12:        Calculate the global mean estimate
13:     end for
14:     Compute
15:     Compute
16:     Compute
17:  end while
Algorithm 2 Fed-PNE: server
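To make the interplay of expansion, federated averaging and elimination concrete, here is a simplified end-to-end sketch on the domain [0, 1] with a binary partition. It is not the authors' algorithm: the pull threshold `tau`, the Hoeffding-style confidence `radius`, the rule of expanding exactly one depth per phase, and all function and parameter names are assumptions made for illustration. Only the overall shape follows the description above: clients send local means, the server averages them, and a node is dropped when its upper bound plus a smoothness term falls below the lower bound of the best node.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Node = Tuple[int, int]  # (depth h, index i)


def center(node: Node) -> float:
    """Midpoint of node (h, i) in a binary partition of [0, 1]."""
    h, i = node
    return (i - 0.5) / 2 ** h


def fed_pne_sketch(
    local_objectives: Sequence[Callable[[float], float]],
    phases: int = 5,
    nu1: float = 1.0,
    rho: float = 0.5,
    delta: float = 0.1,
    noise_bound: float = 0.05,
    seed: int = 0,
) -> List[Node]:
    """Return the nodes that survive after the given number of phases."""
    rng = random.Random(seed)
    num_clients = len(local_objectives)
    active: List[Node] = [(0, 1)]
    for _ in range(phases):
        # Expansion: replace every surviving node by its two children.
        active = [(h + 1, 2 * i - j) for (h, i) in active for j in (1, 0)]
        depth = active[0][0]
        smooth = nu1 * rho ** depth  # bound on the variation inside a node
        # ASSUMED threshold: enough pulls for the radius to match the node size.
        tau = math.ceil(math.log(1 / delta) / smooth ** 2)
        pulls = math.ceil(tau / num_clients)  # per client and node, at least 1
        # Client side: every client reports only its local mean per node.
        local_means: List[Dict[Node, float]] = []
        for f_m in local_objectives:
            means: Dict[Node, float] = {}
            for node in active:
                x = center(node)
                total = sum(
                    f_m(x) + rng.uniform(-noise_bound, noise_bound)
                    for _ in range(pulls)
                )
                means[node] = total / pulls
            local_means.append(means)
        # Server side: average the local means to estimate the global objective.
        global_mean = {
            n: sum(m[n] for m in local_means) / num_clients for n in active
        }
        # ASSUMED Hoeffding-style confidence radius over all pulls of a node.
        radius = math.sqrt(math.log(1 / delta) / (2 * pulls * num_clients))
        best = max(global_mean.values())
        # Elimination: keep nodes that could still contain the optimum.
        active = [
            n for n in active if global_mean[n] + radius + smooth >= best - radius
        ]
    return active
```

A usage example would pass a list of local objectives, for instance two quadratics with different maximizers, and inspect which depth-5 nodes remain. Because the server only ever sees per-node means, the number of messages per phase is independent of the number of pulls.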
Remark 3.1.

Apart from the obvious uniqueness in the algorithm design, such as lines 5-8 and 15-16 in Algorithm 2, Fed-PNE also introduces the idea of "node elimination", which is based on the hierarchical partitioning of the parameter space. If we treat nodes inside the partition as the "arms" in multi-armed bandit, then the Fed-PNE algorithm differs from the traditional Phased-Elimination (PE) algorithm (Lattimore and Szepesvári, 2020) in three main aspects: (1) The nodes to be eliminated at different phases in Fed-PNE represent areas of the domain $\mathcal{X}$ with different sizes because of the hierarchical partition, while the arms to be eliminated by PE at different phases have equal roles; (2) While trying to eliminate areas, Fed-PNE also explores deeper in the partition and expands one node into multiple nodes, which means that the number of nodes to be sampled may increase instead of decrease as the phase index increases. In contrast, the number of remaining arms only decreases in PE; (3) The elimination criterion in Eqn. (1) includes not only the Upper-Confidence Bound (UCB) terms for statistical uncertainty, but also a smoothness term, which accounts for the variation of the objective function inside one node.

Remark 3.2.

It is worth mentioning that our algorithm requires the smoothness parameters as part of the input, which measure how fast the diameter of a node shrinks in the partition. These parameters are important because they characterize the smoothness of the global objective, and we need them to determine the pull threshold and the elimination set. This information is crucial to ensure the validity of cumulative regret analyses even for centralized $\mathcal{X}$-armed bandit problems. Most existing $\mathcal{X}$-armed bandit algorithms, such as HOO (Bubeck et al., 2011), HCT (Azar et al., 2014) and VHCT (Li et al., 2021), require these parameters.

3.2 Theoretical Analysis

We provide the upper bound on the expected cumulative regret of the proposed Fed-PNE algorithm as follows, which exhibits our theoretical advantage over non-federated algorithms.

Theorem 3.1.

Suppose that satisfies Assumption 2, and is the near-optimality dimension of the global objective as defined in Assumption 3. Setting , we have the following upper bound on the expected cumulative regret of the Fed-PNE algorithm.

where and are two absolute constants that do not depend on and . Moreover, the communication cost of Fed-PNE scales as

Remark 3.3.

The proof of the above theorem and the exact values of the two constants are relegated to Appendix B. Theorem 3.1 displays a desirable regret upper bound for the Fed-PNE algorithm because the first term on the right-hand side only depends on $M$ and is a cost due to federation across all the clients. When $T$ is sufficiently large compared with $M$, the second term dominates the bound and it depends sub-linearly on both the number of rounds $T$ and the number of clients $M$, which means that the algorithm converges to the optimum of the global objective. Moreover, the bound on the average cumulative regret of each client decreases with the number of clients, which means that increasing the number of clients helps reduce the regret of each client, and thus validates the effectiveness of federation. Compared with the regret of traditional $\mathcal{X}$-armed bandit algorithms such as HCT and HOO (Bubeck et al., 2011; Azar et al., 2014), i.e., $\widetilde{O}(T^{\frac{d+1}{d+2}})$, the average regret bound of our algorithm is smaller when $M$ is large, which means that our algorithm converges faster.

Remark 3.4.

When $T$ is relatively small, the first term dominates the regret bound, yielding a super-linear rate w.r.t. $M$. Such an undesirable rate is mainly due to the inefficient use of all clients. For example, since every client pulls every active node at least once, the total number of pulls of a node across all clients is much larger than the minimum required number of pulls when we explore the shallow layers (i.e., when $h$ is small) in the partition at the beginning stage of the search.

Remark 3.5.

Moreover, the communication cost in Theorem 3.1 only depends logarithmically on the time horizon $T$, showing that there are no frequent communications between the server and the clients during the federated learning process. Therefore, our algorithm protects user privacy to a certain extent and saves communication cost at the same time. The dependence of the communication cost on the number of clients is unavoidable since our algorithm assumes that all clients communicate synchronously, and thus the server exchanges information with all the clients in each phase. Such dependence is also observed in federated multi-armed bandit algorithms such as Fed1-UCB (Shi and Shen, 2021a) and Fed-PE (Huang et al., 2021).

4 Experiments

Figure 2: Average cumulative regret of Fed-PNE, Fed1-UCB with different numbers of arms, and HCT on the synthetic functions and the real-life datasets. Panels: (a) Garland, (b) DoubleSine, (c) Landmine. $M$ is the same for Fed-PNE and Fed1-UCB.

We empirically evaluate the proposed Fed-PNE algorithm on both synthetic functions and real-world datasets. For the federated algorithms, we plot the average cumulative regret per client against the rounds. For the centralized algorithm (HCT), we plot its cumulative regret on the global objective of each task against the number of evaluations. Such a comparison is fair in terms of overall computational resources, since we can view one evaluation of the global objective as the result of instant public communication of all local objective evaluations in one round. All the curves presented in this section (and the additional numerical results in the appendix) are averaged over 10 independent runs, with the shaded area representing one standard deviation. Additional experiments on other datasets and the experimental details are provided in Appendix C. We compare Fed-PNE with two competitive baselines.

  • HCT (Azar et al., 2014) is a strong $\mathcal{X}$-armed bandit algorithm baseline. As mentioned above, we run HCT on the global objective directly, which is a big advantage for HCT.

  • Fed1-UCB (Shi and Shen, 2021a) is a federated multi-armed bandit algorithm. We use a uniform meshgrid on the parameter space to generate the "arms" and adapt the algorithm to the $\mathcal{X}$-armed bandit problem (see Appendix C for more details).

Synthetic Dataset. We evaluate the algorithms on two synthetic functions that are commonly used in the $\mathcal{X}$-armed bandit problem, namely the Garland function and the DoubleSine function, both defined on $[0, 1]$. These two functions are well-known for their non-smoothness and large number of local optima. The perturbed versions of these two functions are used as the local objectives while the original functions are used as the global objective. The average cumulative regret of the different algorithms is provided in Figure 2(a) and 2(b). As can be observed in the figures, Fed1-UCB suffers from linear regret due to the difference between the reward of the best point in the meshgrid and that of the global optimum. Our algorithm, however, is able to achieve even better performance than HCT running on the global objective.
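The linear regret of a meshgrid-based baseline can be illustrated with a short sketch: if the best arm on the grid falls short of the global optimum by a fixed gap, then every round costs at least that gap, so the cumulative regret grows at least linearly in the number of rounds. The code below measures this gap for several grid sizes. The closed form used for the Garland function is the commonly cited one and is an assumption here, as are the helper name `discretization_gap` and the use of a fine reference grid in place of the exact maximum.

```python
import math
from typing import Callable, List


def garland(x: float) -> float:
    """Commonly used form of the Garland function on [0, 1] (assumed here)."""
    return x * (1.0 - x) * (4.0 - math.sqrt(abs(math.sin(60.0 * x))))


def discretization_gap(
    f: Callable[[float], float], num_arms: int, fine_resolution: int = 100_000
) -> float:
    """Best value on a fine reference grid minus the best value among the arms.

    The arms are num_arms uniformly spaced points on [0, 1]. To keep the arms on
    the reference grid, num_arms - 1 must divide fine_resolution.
    """
    step, remainder = divmod(fine_resolution, num_arms - 1)
    if remainder != 0:
        raise ValueError("num_arms - 1 must divide fine_resolution")
    values: List[float] = [f(j / fine_resolution) for j in range(fine_resolution + 1)]
    return max(values) - max(values[::step])


# Per-round regret floor of any algorithm restricted to 11, 101 or 1001 arms.
gaps = [discretization_gap(garland, num_arms) for num_arms in (11, 101, 1001)]
```

Because the three grids are nested, the gap cannot increase as the grid is refined, but it only vanishes if the grid happens to contain a maximizer, and a finer grid means more arms to explore.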

Landmine Detection. We tune, in a federated manner, the hyper-parameters of machine learning models fitted on the Landmine dataset (Liu et al., 2007), where the features of different locations on multiple landmine fields extracted from radar images are used to detect the landmines. Following the setting of Dai et al. (2020), each client only has access to the data of one random field, and trains a support vector machine with the RBF kernel parameter chosen from [0.01, 10] and a regularization parameter chosen from a second fixed range. The local objective and the global objective are the AUC-ROC scores on the local landmine field and on all the landmine fields, respectively. The average cumulative regret of the different algorithms is provided in Figure 2(c). As can be observed in the figure, our algorithm achieves the smallest cumulative regret and thus the best performance.

COVID-19 Vaccine Dosage Optimization. To combat the pandemic, we optimize the vaccine dosage in epidemiological models of COVID-19 to find the best fractional dosage for the overall population, following Wiecek et al. (2022). Using fractional doses of the vaccines makes them less effective, but at the same time more people get the chance to be vaccinated, which can possibly accelerate the process of reaching herd immunity. In our experimental setting, the local objectives are the final infection rates of different countries/regions. Different countries have different parameters, such as population size and the number of ICU units, which makes the objectives heterogeneous. The results are shown in Figure 2(d). Our algorithm also achieves the fastest convergence.

Benefits of a large number of clients. Finally, we validate the theoretical benefit of having a large number of clients when running Fed-PNE. We plot the average cumulative regret of our algorithm on the four objectives with different numbers of clients in Figure 3. (For the COVID dataset, there are only around 200 countries/regions in the world that have data available, and thus we use a smaller number of clients than for the other objectives.) As shown in the four figures, a larger number of clients improves the average cumulative regret of Fed-PNE, which is consistent with our conclusions in Theorem 3.1.

Figure 3: Average cumulative regret of Fed-PNE on the synthetic functions and the real-life datasets with different numbers of clients. Panels: (a) Garland, (b) DoubleSine, (c) Landmine. $M$ is the number of clients.

5 Related Works

Federated bandits. Most recent works on federated bandits focus on the case where the number of arms is finite or the reward function is linear. For example, Shi and Shen (2021a) and Shi et al. (2021b) propose Fed-UCB-type algorithms for the multi-armed bandit (with personalization) problem to construct efficient client-server communications. Zhu et al. (2021); Li et al. (2020) propose to make use of differential privacy to protect the user information of each client in federated bandit. For federated linear contextual bandits, different algorithms such as Dubey and Pentland (2020); Huang et al. (2021) utilize distinct ways to reconstruct the global contextual parameter and achieve sublinear regret with little communication cost. Very recently, Wang et al. (2022) extended the LASSO bandit algorithm (Bastani and Bayati, 2020) to federated high-dimensional bandits. As far as we are aware, our work is the first to discuss federated $\mathcal{X}$-armed bandit or continuum-armed bandit.

$\mathcal{X}$-armed bandits. Since the creation of the Zooming algorithm (Kleinberg et al., 2008), $\mathcal{X}$-armed bandit has become an active line of research. Algorithms such as HOO, HCT and VHCT provide cumulative regret bounds for the stochastic reward feedback setting (Bubeck et al., 2011; Azar et al., 2014; Li et al., 2021). Apart from these works, some algorithms address the case where the reward has no noise, such as DiRect (Jones et al., 1993), DOO (Munos, 2011), SOO (Munos, 2011), and SequOOL (Bartlett et al., 2019). Another set of research works focuses on minimizing the simple regret of the optimization algorithms at the final round, and proposes algorithms such as POO (Grill et al., 2015), GPO (Shang et al., 2019), and StroquOOL (Bartlett et al., 2019). The theoretical results of these algorithms are not directly comparable to our work since we focus on cumulative regret.

Federated hyper-parameter optimization. Recently, a few works have analyzed the problem of federated hyper-parameter optimization using different approaches. Dai et al. (2020, 2021) have proposed modified versions of Thompson sampling and analyzed the problem using Bayesian optimization algorithms. However, these algorithms require strong assumptions, and their regret bounds do not show a clear dependence on the number of clients (Dai et al., 2020, 2021). Khodak et al. (2021) have proposed FedEx for optimizing hyper-parameters of algorithms such as Fed-Avg and analyzed its regret in convex problems. Compared with these works, our work is much more general in three senses: (1) We require very weak assumptions on the objectives for the Fed-PNE algorithm to work; (2) Our work can be applied to problems other than hyper-parameter tuning, such as medicine dosage prediction, which do not involve algorithms such as Fed-Avg; (3) We provide the first regret bound with a clear dependence on the number of clients under the weak assumptions.

6 Discussion and Conclusion

In this work, we establish the framework of the federated 𝒳-armed bandit problem and propose the first algorithm for such problems. The proposed Fed-PNE algorithm utilizes the intrinsic structure of the global objective inside the hierarchical partitioning and achieves desirable regret bounds in terms of both the number of clients and the evaluation budget. Meanwhile, it requires only logarithmic communications between the server and the clients, protecting the privacy of the clients. Both the theoretical analysis and the experimental results show the advantage of Fed-PNE over centralized algorithms and prior federated multi-armed bandit algorithms. Many interesting future directions can be explored based on the framework proposed in this work. For example, other summary statistics of the client-wise data, such as the empirical variance used in Li et al. (2021), can potentially accelerate the proposed algorithm. Moreover, the current algorithm still needs the weak Lipschitzness assumption. Whether the weakest assumption in the 𝒳-armed bandit literature, i.e., the local smoothness without a metric assumption proposed by Grill et al. (2015), can be used to prove similar regret guarantees remains a challenging open problem.


  • Azar et al. [2014] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Online stochastic optimization under correlated bandit feedback. In International Conference on Machine Learning, pages 1557–1565. PMLR, 2014.
  • Bartlett et al. [2019] Peter L. Bartlett, Victor Gabillon, and Michal Valko. A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption. In 30th International Conference on Algorithmic Learning Theory, 2019.
  • Bastani and Bayati [2020] Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
  • Bubeck et al. [2011] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. 𝒳-armed bandits. Journal of Machine Learning Research, 12(46):1655–1695, 2011.
  • Chen and Gallego [2022] Ningyuan Chen and Guillermo Gallego. A primal–dual learning algorithm for personalized dynamic pricing with an inventory constraint. Mathematics of Operations Research, 2022. doi: 10.1287/moor.2021.1220.
  • Dai et al. [2020] Zhongxiang Dai, Bryan Kian Hsiang Low, and Patrick Jaillet. Federated Bayesian optimization via Thompson sampling. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9687–9699. Curran Associates, Inc., 2020.
  • Dai et al. [2021] Zhongxiang Dai, Bryan Kian Hsiang Low, and Patrick Jaillet. Differentially private federated Bayesian optimization with distributed exploration. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 9125–9139. Curran Associates, Inc., 2021.
  • Dubey and Pentland [2020] Abhimanyu Dubey and Alex Sandy Pentland. Differentially-private federated linear bandits. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6003–6014. Curran Associates, Inc., 2020.
  • Grill et al. [2015] Jean-Bastien Grill, Michal Valko, and Rémi Munos. Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2015.
  • Huang et al. [2021] Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • Jones et al. [1993] David Jones, Cary Perttunen, and Bruce Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, Oct 1993.
  • Khodak et al. [2021] Mikhail Khodak, Renbo Tu, Tian Li, Liam Li, Maria-Florina F Balcan, Virginia Smith, and Ameet Talwalkar. Federated hyperparameter tuning: Challenges, baselines, and connections to weight-sharing. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 19184–19197. Curran Associates, Inc., 2021.
  • Kleinberg et al. [2008] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08, pages 681–690, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605580470. doi: 10.1145/1374376.1374475.
  • Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Li et al. [2020] Tan Li, Linqi Song, and Christina Fragouli. Federated recommendation system via differential privacy. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2592–2597. IEEE, 2020.
  • Li et al. [2021] Wenjie Li, Chi-Hua Wang, Qifan Song, and Guang Cheng. Optimum-statistical collaboration towards general and efficient black-box optimization, 2021.
  • Liu et al. [2007] Qiuhua Liu, Xuejun Liao, and Lawrence Carin. Semi-supervised multitask learning. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • Munos [2011] Rémi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
  • Shang et al. [2019] Xuedong Shang, Emilie Kaufmann, and Michal Valko. General parallel optimization without a metric. In Algorithmic Learning Theory, pages 762–788, 2019.
  • Shi and Shen [2021a] Chengshuai Shi and Cong Shen. Federated multi-armed bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11):9603–9611, May 2021a.
  • Shi et al. [2021b] Chengshuai Shi, Cong Shen, and Jing Yang. Federated multi-armed bandits with personalization. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2917–2925. PMLR, 13–15 Apr 2021b.
  • Wang et al. [2022] Chi-Hua Wang, Wenjie Li, Guang Cheng, and Guang Lin. Federated online sparse decision making, 2022.
  • Wiecek et al. [2022] Witold Wiecek, Amrita Ahuja, Esha Chaudhuri, Michael Kremer, Alexandre Simoes Gomes, Christopher M. Snyder, Alex Tabarrok, and Brandon Joel Tan. Testing fractional doses of covid-19 vaccines. Proceedings of the National Academy of Sciences, 119(8):e2116932119, 2022. doi: 10.1073/pnas.2116932119.
  • Zhu et al. [2021] Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu. Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pages 3–4, 2021.

Appendix A Notations and Useful Lemmas

A.1 Notations

Here we list all the notations used in the proof of our cumulative regret bound:

  • : The set of phases that are completed up to time .

  • : The time steps of phase .

  • : The set of un-eliminated/pre-eliminated nodes in phase .

  • : The set of nodes to be eliminated at the end of phase .

  • : The set of post-eliminated nodes in phase , i.e., .

  • : the indices of the node chosen by the server at phase from .

  • : the node that contains the (one of the) global maximum on depth .

  • : the number of times the node is sampled from client , which is chosen to be in the Fed-PNE algorithm.

  • : the minimum required number of samples needed for a node on depth , defined below.

  • : the number of times the node is sampled, which is chosen to be in the Fed-PNE algorithm.

The threshold for every depth. The number of samples needed for the statistical error (the UCB term) of every node on depth to fall below the optimization error is the solution to


which is equivalent to the following choice of the threshold


Notably, this choice of the threshold is the same as the threshold value in the HCT algorithm [Azar et al., 2014]. In other words, we design our algorithm so that the samples are drawn uniformly from the different clients, which makes the estimators unbiased, and at the same time we minimize the budget left unspent by this distribution of samples. There is still some (manageable) unspent budget due to the floor operation in the computation of . However, because of the expansion criterion (lines 5-6) in Fed-PNE, the algorithm reaches deep layers of the partition quickly when there are many clients, and thus Fed-PNE is faster than single-client 𝒳-armed bandit algorithms.
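To make the threshold idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes an HCT-style confidence width of the form c * sqrt(log(1/delta) / tau) and an optimization error of nu1 * rho**h at depth h; the names `depth_threshold`, `per_client_samples`, `nu1`, `rho`, `c`, and `delta` are hypothetical, and rounding the per-client count up is also an assumption made only for illustration.

```python
import math


def depth_threshold(h, nu1, rho, c, delta):
    """Smallest integer tau with c * sqrt(log(1/delta) / tau) <= nu1 * rho**h.

    Assumed (HCT-style) forms: the statistical error is
    c * sqrt(log(1/delta) / tau) and the optimization error is nu1 * rho**h.
    """
    return math.ceil(c ** 2 * math.log(1.0 / delta) / nu1 ** 2 * rho ** (-2 * h))


def per_client_samples(tau, num_clients):
    """Split tau samples evenly across clients (rounding up is an assumption)."""
    return math.ceil(tau / num_clients)


# Thresholds grow geometrically with depth, while the per-client share
# shrinks as the number of clients grows.
for depth in range(4):
    tau = depth_threshold(depth, nu1=1.0, rho=0.5, c=1.0, delta=0.01)
    print(depth, tau, per_client_samples(tau, num_clients=4))
```

Under these assumptions, quadrupling the number of clients cuts the per-client sampling requirement at a given depth by roughly a factor of four, which is the intuition behind the speed-up over single-client algorithms described above.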

A.2 Supporting Lemmas

Lemma A.1.

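(Hoeffding’s Inequality) Let $X_1, \dots, X_n$ be independent random variables such that $a_i \le X_i \le b_i$ almost surely. Consider the sum of these random variables, $S_n = X_1 + \cdots + X_n$. Then for all $t > 0$, we have

$$\mathbb{P}\left(\left|S_n - \mathbb{E}[S_n]\right| \ge t\right) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right).$$

Here $\mathbb{E}[S_n]$ is the expected value of $S_n$.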
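As a quick sanity check, the sketch below compares the standard two-sided Hoeffding bound, 2 exp(-2 t^2 / (n (b - a)^2)) for n variables bounded in [a, b], against a Monte Carlo estimate of the tail probability. The function names and the choice of uniform [0, 1] variables are assumptions made for illustration only.

```python
import math
import random


def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Two-sided Hoeffding bound on P(|S_n - E[S_n]| >= t) for n variables in [a, b]."""
    return min(1.0, 2.0 * math.exp(-2.0 * t ** 2 / (n * (b - a) ** 2)))


def empirical_tail(n, t, trials=2000, seed=0):
    """Monte Carlo estimate of P(|S_n - E[S_n]| >= t) for i.i.d. Uniform[0, 1] draws."""
    rng = random.Random(seed)
    expected_sum = n * 0.5  # mean of Uniform[0, 1] is 1/2
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        if abs(s - expected_sum) >= t:
            hits += 1
    return hits / trials


# The empirical frequency should not exceed the bound (up to simulation noise).
print(empirical_tail(50, 5.0), hoeffding_bound(50, 5.0))
```

The bound is typically loose for well-behaved distributions such as the uniform one used here, since it depends only on the range of the variables and not on their variance.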

Lemma A.2.

(High Probability Event) Define the “good” event as


where the right hand side is exactly the confidence bound for the node and are two constants. Then for any fixed round , we have

Proof. For every and every , note that the node is sampled the same number of times independently from every local objective . Therefore, by Hoeffding’s inequality (Lemma A.1), for any , we have


If we take the average over the clients and the samples in the two summation terms inside the probability expression, we get that for any


Therefore by the union bound, the probability of the event can be bounded as


where the third inequality follows from the fact that the number of nodes is always smaller than since every client visits every node at least once.
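The role of equal per-client sampling in this proof can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's procedure: it assumes the global objective at a node is the average of the clients' local mean rewards, that rewards are Bernoulli and hence bounded in [0, 1], and it uses the hypothetical helpers `pooled_estimate` and `confidence_radius`.

```python
import math
import random


def pooled_estimate(local_means, samples_per_client, rng):
    """Average of Bernoulli rewards, drawing the same number of samples per client."""
    total = 0.0
    for mu in local_means:
        for _ in range(samples_per_client):
            total += 1.0 if rng.random() < mu else 0.0
    return total / (len(local_means) * samples_per_client)


def confidence_radius(num_clients, samples_per_client, delta):
    """Hoeffding radius for the mean of num_clients * samples_per_client rewards in [0, 1]."""
    n_total = num_clients * samples_per_client
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_total))


# Heterogeneous clients: the local means differ, but their average is 0.5.
local_means = [0.2, 0.5, 0.8]
rng = random.Random(1)
estimate = pooled_estimate(local_means, samples_per_client=2000, rng=rng)
radius = confidence_radius(len(local_means), 2000, delta=0.05)
print(estimate, radius)
```

Because every client contributes the same number of samples, the pooled average is an unbiased estimate of the average of the local means even though no single client's rewards are. The radius shrinks with the total number of samples across all clients rather than with the per-client count.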

Lemma A.3.

(Optimal Node is Never Eliminated). Under the high probability event at time , the node that contains the global optimum of at depth level , denoted by , is never eliminated in Algorithm 2, i.e., .

Proof. Under the high probability event , we have the following inequality for every node


Therefore we have the following set of inequalities at the end of each phase for .


where the second inequality follows from the local smoothness property of the objective. Therefore we conclude that and the optimal node is never eliminated.

Lemma A.4.

(Remaining Nodes Are Near-Optimal). Under the high probability event at time , the representative point of every un-eliminated node in phase , i.e., , is at least -optimal, that is


Proof. The proof follows the same logic as that of Lemma A.3. Under the high probability event , we have the following inequality for every node such that


Therefore the following set of inequalities hold


where the second inequality holds because is not eliminated. The third inequality holds because of the optimality of the node in Algorithm 2, and the last one follows from the weak Lipschitzness assumption (Assumption 2). In conclusion, we have the following upper bound on the regret


where the last inequality holds because in the phase , we sample each node sufficiently many times (more than the threshold ) so that and thus are all smaller than .

Lemma A.5.

(Lemma 3 in Bubeck et al. [2011]) For a node , define to be the maximum of the function on that region. Suppose that for some , then all in are -optimal.

Proof. The proof is provided here for completeness. For any real positive number , we denote by such that


By the weak Lipschitz property, we have that for all ,


Letting and substituting the bounds on the suboptimality and on the diameter of (Assumption 2) concludes the proof.

Lemma A.6.

(Upper Bounds on Length of Phases) Given the choice of the threshold and the design in Algorithm 2, there exists a constant such that, for any fixed phase , with and denoting the depths sampled by the algorithm at phases and respectively, the following inequalities hold almost surely


Proof. By the design in our algorithm, the length of each (dynamic) phase should be


where denotes the -th power of and similar notations are used for the other powers. The second equality holds because nodes in are descendants of the un-eliminated nodes in , and the size of is bounded by by the definition of near-optimality dimension (Assumption 3) and Lemma A.4. Define to be the depth where , then


Then when , because and , which means that and thus the algorithm will only increase the depth by 1. Therefore


In the other case when , we have and thus because this number is already larger than . Now since , we have the following bound on


which completes the proof.

Appendix B Main Proof of Theorem 3.1

In this section, we provide the proof of the main theorem of this paper (Theorem 3.1).

Proof. Let denote whether the event is true, i.e., if is true and 0 otherwise. We first decompose the regret into two terms


For the second term, note that we can bound its expectation as follows


where the second inequality follows from Lemma A.2. Now we bound the first term in the decomposition under the event . In every round , for the non-eliminated nodes in the previous phase, i.e., such that (and thus ), by Lemma A.4 we have