1 Introduction
Federated bandit is a newly-developed bandit problem that incorporates federated learning into sequential decision making (McMahan et al., 2017; Shi and Shen, 2021a). Unlike traditional bandit models, where the exploration-exploitation trade-off is the only major concern, the federated bandit problem also takes into account the modern concerns of data heterogeneity and privacy protection towards trustworthy machine learning. In particular, in the federated learning paradigm, the data available to each client could be drawn from non-i.i.d. distributions, making collaboration between the clients necessary for valid inference on the aggregated global model. However, due to user privacy concerns and the large communication cost, such collaboration across the clients must be restricted and must avoid direct transmission of the local data. To make correct decisions in the future, the clients have to utilize the limited communications from each other and coordinate exploration and exploitation accordingly.
To the best of our knowledge, existing results on federated bandits, such as Dubey and Pentland (2020); Huang et al. (2021); Shi and Shen (2021a); Shi et al. (2021b), focus on either the case where the number of arms is finite (multi-armed bandit) or the case where the expected reward is a linear function of the chosen arm (linear contextual bandit). However, for problems such as dynamic pricing (Chen and Gallego, 2022) and hyperparameter optimization (Shang et al., 2019), the available arms are often defined on a domain with infinite or even uncountable cardinality, and the reward function is usually nonlinear with respect to the metric employed by the domain. These properties prevent the application of existing federated bandit algorithms to such more complicated real-world problems. Two applications that motivate our study of the federated $\mathcal{X}$-armed bandit are given below.

Federated medicine dosage recommendation. For the dosage recommendation of a newly-invented medicine/vaccine (in terms of volume or weight), the clinical trials could be conducted at multiple hospitals (clients). To protect patients' privacy, hospitals cannot directly share the treatment result of each trial (reward). Moreover, because of demographic differences among the patient groups, the best dosage obtained at each hospital (i.e., the optimum of a local objective) could be different from the optimal recommended dosage for the entire population of the state (i.e., the optimum of the global objective). Researchers need to collaboratively find the globally optimal dosage by exploring and exploiting the local data.

Federated hyperparameter optimization.
An important application of automating machine learning workflows with minimal human intervention is hyperparameter optimization for ML models, e.g., the learning rate, the neural network architecture, etc. Many modern datasets are collected by mobile devices (clients). The model performance (reward) of each hyperparameter setting could be different for each mobile device (i.e., the local objectives) due to user heterogeneity. To fully utilize the whole dataset (i.e., the global objective) for hyperparameter optimization, such that the obtained autoML model works seamlessly in diverse scenarios, the central server needs to coordinate the local searches properly without violating regulations on consumer data privacy.
In the aforementioned examples, the reward objectives are defined on a domain $\mathcal{X}$, which can often be formulated as a region of $\mathbb{R}^d$ and has infinite cardinality. Moreover, the objectives (both local and global) are highly nonlinear with respect to the chosen arm due to the complex nature of the problem. Therefore, the basic assumptions of federated multi-armed bandit and federated linear contextual bandit algorithms are violated, and the existing federated bandit algorithms either do not apply or perform poorly on such problems.
Under the classical setting where centralized data are immediately available, $\mathcal{X}$-armed bandit algorithms such as HOO and HCT have been proposed to find the optimal arm inside the domain (Bubeck et al., 2011; Azar et al., 2014). However, these algorithms cannot be trivially adapted to the task of finding the global optimum when there are multiple clients and communication limits. The local objectives could have very different landscapes across clients due to the non-i.i.d. local datasets, and no efficient communication scheme has been established between $\mathcal{X}$-armed bandit algorithms running on the local datasets. In this work, we propose a new federated algorithm where all the clients collaboratively learn the best solution to the global $\mathcal{X}$-armed bandit model on average, while only few communications (in terms of both amount and frequency) are required, so that the privacy of each client is preserved.
We highlight our major contributions in this paper as follows.

Federated $\mathcal{X}$-armed bandit. We establish the first framework for the federated $\mathcal{X}$-armed bandit problem (Sec. 2), which naturally connects the $\mathcal{X}$-armed bandit problem with the characteristics of federated learning. The new framework introduces many new challenges to $\mathcal{X}$-armed bandits, including (1) potentially severe heterogeneity among the local objectives due to non-i.i.d. local datasets, (2) the inaccessibility of the global objective to both the local clients and the central server, and (3) the restriction on communications between the server and the clients.

New algorithm with desirable regret. We propose a new algorithm for the federated $\mathcal{X}$-armed bandit problem named FedPNE. Inspired by the heuristic of arm elimination in multi-armed bandits (Lattimore and Szepesvári, 2020), the new algorithm performs hierarchical node elimination in the domain $\mathcal{X}$. More importantly, it incorporates efficient communications between the server and the clients to transmit information while protecting client privacy. We establish a sublinear cumulative regret upper bound for the proposed algorithm as well as a bound on the communication cost. Theoretically, we prove that FedPNE exploits the advantage of federation and at the same time has high communication efficiency (Sec. 3). Theoretical comparisons of our regret bounds with existing bounds are provided in Table 1.
Empirical results. By examining the empirical performance of our FedPNE algorithm on both synthetic functions and real datasets (Sec. 4), we verify the correctness of our theoretical results. We show the advantages of FedPNE over single-client $\mathcal{X}$-armed bandit algorithms and federated multi-armed bandit algorithms. The empirical results exhibit the usefulness of our algorithm in real-life applications.
| Bandit algorithms | Average regret upper bound | Communication cost | Privacy |
| --- | --- | --- | --- |
| HCT (Azar et al., 2014) | | N.A. | ✓ |
| Centralized | | | ✗ |
| FedPNE (this work) | | | ✓ |
2 Preliminaries
We first introduce the preliminary concepts and notations used in this paper. For a real number $x$, we use $\lceil x \rceil$ to represent the smallest integer larger than $x$. For an integer $n$, we use $[n]$ to represent the set of integers $\{1, 2, \ldots, n\}$. We use $\widetilde{\mathcal{O}}$ to hide logarithmic terms in big-$\mathcal{O}$ notations, i.e., for two functions $f(n), g(n)$, $f(n) = \widetilde{\mathcal{O}}(g(n))$ represents that $f(n) \leq C g(n) \log^{k} n$ for some constants $C, k > 0$.
2.1 Problem Formulation and Performance Measure
Let $\mathcal{X}$ be a measurable space of arms. We model the problem as a federated $\mathcal{X}$-armed bandit setting where a total of $N$ clients respectively have access to $N$ different local objectives $f_1, f_2, \ldots, f_N: \mathcal{X} \to \mathbb{R}$, which could be non-convex, non-differentiable and even non-continuous. Given a limited number of rounds $T$, each client $i \in [N]$ chooses a point $x_{i,t} \in \mathcal{X}$ at each round $t$ and observes a noisy feedback defined as $r_{i,t} = f_i(x_{i,t}) + \epsilon_{i,t}$, where $\epsilon_{i,t}$ is a zero-mean and bounded random noise, independent from previous observations and other clients' observations. The goal of the clients is to find the point that maximizes the global objective $f$, which is defined to be the average of the local objectives, i.e.,
$$f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x).$$
However, the global objective $f$ is not accessible to any client. The only information that each client has access to is: (1) noisy evaluations of its own local objective function $f_i$, and (2) communications between itself and the central server. For the global objective, we assume that there is at least one global maximizer $x^* \in \mathcal{X}$ such that $f(x^*) = \sup_{x \in \mathcal{X}} f(x)$. Given the sequence of points $\{x_{i,t}\}_{i \in [N], t \in [T]}$ chosen by the clients, the performance of the clients is measured by the expectation of the cumulative regret, defined as
$$R_T := \sum_{t=1}^{T} \sum_{i=1}^{N} \big( f(x^*) - f(x_{i,t}) \big).$$
Another possible measure of algorithm performance is the so-called simple regret, which only evaluates the goodness of the final-round output $x_{\mathrm{out}}$, i.e., $S_T := f(x^*) - f(x_{\mathrm{out}})$. This paper aligns with the standard federated bandit analysis framework and focuses on cumulative regret only. Moreover, as mentioned by Bubeck et al. (2011), we always have $\mathbb{E}[S_T] \leq \mathbb{E}[R_T]/(NT)$ if we select the output via a cumulative-regret-based policy.
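As a toy illustration of the cumulative regret above (the local objectives, the maximizer, and the evaluation points below are hypothetical and only serve to make the definition concrete), the regret can be computed directly from the trace of chosen points:

```python
# Toy illustration of cumulative regret in a two-client federated setting.
# The local objectives f1, f2 and the chosen points are hypothetical.

def f1(x):
    return 1.0 - (x - 0.3) ** 2   # local objective of client 1

def f2(x):
    return 1.0 - (x - 0.7) ** 2   # local objective of client 2

def f_global(x):
    return 0.5 * (f1(x) + f2(x))  # global objective: average of local ones

x_star = 0.5  # maximizer of the global objective (by symmetry)

def cumulative_regret(paths):
    """paths[i][t] is the point chosen by client i at round t."""
    return sum(f_global(x_star) - f_global(x)
               for path in paths for x in path)

# Each client starts near its own local optimum and drifts toward x*.
paths = [[0.3, 0.4, 0.5], [0.7, 0.6, 0.5]]
print(round(cumulative_regret(paths), 4))  # prints 0.1
```

Note that each client's local maximizer (0.3 and 0.7, respectively) differs from the global maximizer $x^* = 0.5$, mirroring the dosage example in the introduction: playing one's local optimum forever still incurs linear global regret.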
2.2 Hierarchical Partitioning of the Parameter Space
Following the recent progress in centralized $\mathcal{X}$-armed bandits (e.g., Azar et al., 2014; Shang et al., 2019; Bartlett et al., 2019), we utilize a pre-defined, infinitely-deep hierarchical partitioning $\mathcal{P} = \{\mathcal{P}_{h,i}\}$ of the parameter space to optimize the objective functions. The hierarchical partition discretizes the space $\mathcal{X}$ by recursively defining the following relationship:
$$\mathcal{P}_{0,1} := \mathcal{X}, \qquad \mathcal{P}_{h,i} := \bigcup_{j=1}^{k} \mathcal{P}_{h+1,\, k(i-1)+j},$$
where $k$ is the (maximum) number of disjoint children of one node, and for every node $\mathcal{P}_{h,i}$, $h$ denotes the depth and $i$ the index of the node inside the partition. Each node on depth $h$ is partitioned into $k$ children on depth $h+1$, while the union of all the nodes on each depth equals the parameter space $\mathcal{X}$. The partition is chosen before the optimization process, and the same partition of the space is shared and used by all the clients, as the partition itself reveals no information about the reward distributions of the local objectives. A simple and intuitive example is the binary ($k = 2$) equal-sized partition of the domain $\mathcal{X} = [0, 1]$, where each node on depth $h$ is an interval of length $2^{-h}$.
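For the binary equal-sized partition of $[0,1]$ just described, the node-index arithmetic can be sketched as follows (a minimal illustration; the helper names are ours):

```python
# Binary (k = 2) equal-sized partition of [0, 1]: node (h, i) covers the
# interval [(i - 1) * 2**-h, i * 2**-h] for i = 1, ..., 2**h.

def node_interval(h, i):
    width = 2.0 ** (-h)
    return ((i - 1) * width, i * width)

def children(h, i):
    # Node (h, i) is split into its k = 2 disjoint children on depth h + 1.
    return [(h + 1, 2 * i - 1), (h + 1, 2 * i)]

# The union of the children recovers the parent cell exactly.
lo, hi = node_interval(2, 3)                 # node covering [0.5, 0.75]
(c1_lo, c1_hi), (c2_lo, c2_hi) = [node_interval(*c) for c in children(2, 3)]
assert (c1_lo, c2_hi) == (lo, hi) and c1_hi == c2_lo
```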
2.3 Communication Model and Privacy Concerns.
Similar to the setting of federated multi-armed bandits (Shi and Shen, 2021a; Huang et al., 2021), we assume that there exists a central server that coordinates the behaviors of all the clients. The server has access to the same partition of the parameter space used by all the clients and is able to communicate with them. Due to privacy concerns, the client-side algorithm should keep the reward of each evaluation confidential, and the only things that can be transmitted to the server are local statistical summaries of the rewards. The clients are not allowed to communicate with each other. In accordance with McMahan et al. (2017); Shi and Shen (2021a), we assume that the server and the clients are fully synchronized. Although the clients can communicate with the server, the number of clients could be very large, and thus communication can be very costly. We take such communication cost into account in our algorithm design and theoretical analysis.
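The restriction that only statistical summaries leave a client can be sketched as follows. This is a minimal sketch: the message format and function names are our assumptions, not the exact protocol of Algorithms 1-2.

```python
# Sketch of the client-server exchange: clients transmit only per-node
# summaries (empirical mean, pull count), never the raw rewards.
# The message format here is our assumption, not the paper's exact protocol.

def client_summary(local_rewards):
    """local_rewards: {node: [rewards this client observed on that node]}"""
    return {node: (sum(rs) / len(rs), len(rs))
            for node, rs in local_rewards.items()}

def server_aggregate(summaries):
    # An unweighted average of the client means is an unbiased estimate of
    # the *global* objective when every client pulls each node equally often.
    merged = {}
    for summary in summaries:
        for node, (mean, _count) in summary.items():
            merged.setdefault(node, []).append(mean)
    return {node: sum(ms) / len(ms) for node, ms in merged.items()}

s1 = client_summary({(1, 1): [0.2, 0.4]})   # client 1: local mean 0.3
s2 = client_summary({(1, 1): [0.8, 0.6]})   # client 2: local mean 0.7
estimate = server_aggregate([s1, s2])[(1, 1)]
assert abs(estimate - 0.5) < 1e-12
```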
2.4 Definitions and Assumptions
We use the following set of definitions and assumptions, following prior works on $\mathcal{X}$-armed bandits such as Bubeck et al. (2011) and Azar et al. (2014).
Assumption 1.
(Dissimilarity) The space $\mathcal{X}$ is equipped with a dissimilarity function $\ell: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $\ell(x, y) \geq 0$ for all $x, y \in \mathcal{X}$ and $\ell(x, x) = 0$.
Given the dissimilarity function $\ell$ in Assumption 1, the diameter of a set $A \subseteq \mathcal{X}$ is defined as $\mathrm{diam}(A) := \sup_{x, y \in A} \ell(x, y)$. The open ball of radius $\epsilon$ with center $x$ is defined as $\mathcal{B}(x, \epsilon) := \{y \in \mathcal{X} \mid \ell(x, y) < \epsilon\}$. We now introduce the local smoothness assumptions.
Assumption 2.
(Local Smoothness) We assume that there exist constants $\nu_1, \nu_2 > 0$ and $0 < \rho < 1$ such that for all nodes $\mathcal{P}_{h,i}$ on layer $h$,

(a) $\mathrm{diam}(\mathcal{P}_{h,i}) \leq \nu_1 \rho^{h}$;

(b) there exists a point $x_{h,i}^{\circ} \in \mathcal{P}_{h,i}$ s.t. the ball $\mathcal{B}_{h,i} := \mathcal{B}(x_{h,i}^{\circ}, \nu_2 \rho^{h})$ satisfies $\mathcal{B}_{h,i} \subseteq \mathcal{P}_{h,i}$;

(c) $\mathcal{B}_{h,i} \cap \mathcal{B}_{h,j} = \emptyset$ for all $i \neq j$.

Moreover, the global objective function $f$ satisfies the weak Lipschitz condition: for all $x, y \in \mathcal{X}$, we have
$$f(x^*) - f(y) \leq f(x^*) - f(x) + \max\big\{ f(x^*) - f(x),\; \ell(x, y) \big\}.$$
Remark 2.1.
Similar to existing works on the $\mathcal{X}$-armed bandit problem, the dissimilarity function $\ell$ is not an explicit input required by our FedPNE algorithm; only the smoothness constants $\nu_1$ and $\rho$ are accessed (Bubeck et al., 2011; Azar et al., 2014). As mentioned by Bubeck et al. (2011); Grill et al. (2015), many functions satisfy Assumption 2 on the standard equal-sized partition with accessible $\nu_1$ and $\rho$.
Finally, we introduce the definition of the near-optimality dimension as in Azar et al. (2014), which measures the size of the near-optimal regions and thus the difficulty of the problem.
Assumption 3.
(Near-optimality dimension) Let $\epsilon = 3\nu_1 \rho^{h}$ and $\epsilon' = \nu_2 \rho^{h}$. For the set of near-optimal points $\mathcal{X}_\epsilon := \{x \in \mathcal{X} : f(x^*) - f(x) \leq \epsilon\}$, there exists a constant $C$ such that $\mathcal{N}\big(\mathcal{X}_\epsilon, \ell, \epsilon'\big) \leq C (\epsilon')^{-d}$, where $d$ is the near-optimality dimension of the function $f$ and $\mathcal{N}(A, \ell, \epsilon')$ is the $\epsilon'$-cover number of the set $A$ w.r.t. the dissimilarity $\ell$.
Remark 2.2.
Some recent progress in centralized $\mathcal{X}$-armed bandits, such as Grill et al. (2015); Shang et al. (2019); Bartlett et al. (2019), has proposed an even weaker version of Assumption 2, i.e., local smoothness without a metric assumption. Correspondingly, they define a complexity measure named the near-optimality dimension w.r.t. the partition $\mathcal{P}$. However, it is not trivial to directly adopt the weak local smoothness assumption without a metric in the federated version of black-box optimization: due to the limited communication, it would lead to continued sampling in sub-optimal regions, yielding large cumulative regret. As a pioneering work on the federated $\mathcal{X}$-armed bandit, we choose to use the slightly stronger assumptions in Bubeck et al. (2011) so that the theoretical guarantees of our FedPNE algorithm can be successfully established. Weakening our set of assumptions while keeping the regret bound guarantees is a very interesting future direction.
3 Algorithm and Analysis
In this section, we introduce the new algorithm for the federated $\mathcal{X}$-armed bandit problem, show its uniqueness compared with multi-armed bandit algorithms, and provide its theoretical analysis.
3.1 The FedPNE Algorithm
We propose the Federated Phased Node Elimination (FedPNE) algorithm, which consists of a client-side algorithm (Algorithm 1) and a server-side algorithm (Algorithm 2). FedPNE runs in dynamic phases, and it utilizes the hierarchical partition to gradually find the optimum by eliminating different regions of the domain. For a node $\mathcal{P}_{h,i}$, since its depth $h$ and index $i$ uniquely identify the node, we use the pair $(h, i)$ to index the nodes in the elimination and expansion process. We use $\mathcal{A}_p$ to denote the indices of the active nodes that need to be sampled in phase $p$, and $\mathcal{E}_p$ for the indices of the nodes that need to be eliminated. To obtain a reward over a node (i.e., to pull a node), the client evaluates the local objective at some $x \in \mathcal{P}_{h,i}$, where $x$ is either uniformly sampled from the node, as in Bubeck et al. (2011), or a pre-defined point in the node, as in Azar et al. (2014). The regret analysis is only slightly different between the two choices because of the smoothness assumption; in Section 3.2, we use the latter strategy to derive our regret bound.
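The node-pulling step just described can be sketched as follows, using the uniform-sampling strategy on the binary partition of $[0, 1]$ (the local objective and noise model are hypothetical, not the paper's code):

```python
import random

# Sketch of "pulling" a node (h, i): one noisy evaluation of the local
# objective inside the node's cell. The objective and noise are hypothetical.

def f_local(x):
    return 1.0 - (x - 0.4) ** 2        # a toy local objective on [0, 1]

def pull_node(h, i, rng, sigma=0.1):
    width = 2.0 ** (-h)
    lo = (i - 1) * width               # cell of (h, i) in the binary partition
    # Strategy of Bubeck et al. (2011): sample x uniformly inside the cell;
    # Azar et al. (2014) would instead use a fixed representative point.
    x = lo + rng.random() * width
    noise = rng.uniform(-sigma, sigma)  # bounded, zero-mean, independent
    return x, f_local(x) + noise

rng = random.Random(0)
x, r = pull_node(2, 3, rng)            # node covering [0.5, 0.75]
assert 0.5 <= x <= 0.75
assert abs(r - f_local(x)) <= 0.1
```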
Algorithm Explanation: At initialization, the server starts from the root of the partition $\mathcal{P}_{0,1}$. At the beginning of each phase $p$, the server expands the exploration tree as described in Algorithm 2 and enlarges the set $\mathcal{A}_p$ until a cardinality criterion on $|\mathcal{A}_p|$ is satisfied, where $|\cdot|$ denotes the number of elements of a set, and the threshold number $\tau_h$ is the minimum required number of times each node on depth $h$ needs to be pulled, defined as
$$\tau_h := \left\lceil c_1 \rho^{-2h} \log(c_2 / \delta) \right\rceil,$$
where $c_1, c_2$ are two absolute constants and $\delta$ is the confidence parameter (details in Lemma A.2). The number of times each node has to be sampled by each client and the phase length are then computed. This unique expansion criterion and sampling scheme guarantee four important things at the same time: (1) every client samples every node at least once, so that the global objective is explored; (2) the empirical averages in line 12 of Algorithm 2 are unbiased estimators of the global function values for every node; (3) every node is sampled a sufficient number of times (larger than $\tau_h$); (4) the waste of budget due to the limitation on communication is minimized. After the broadcast in line 9, every client receives the set of active nodes and the required numbers of samples from the server. Next, the clients perform the exploration and send only the empirical reward averages back to the server, as in Algorithm 1. The server then computes the best node, denoted by $(h^*, i^*)$, and decides the elimination set $\mathcal{E}_p$ by the following selection criterion.
$$\mathcal{E}_p := \left\{ (h, i) \in \mathcal{A}_p \;:\; \hat{\mu}_{h,i} + 2 B_p + \nu_1 \rho^{h} \leq \hat{\mu}_{h^*, i^*} \right\}, \qquad (1)$$
where $\hat{\mu}_{h,i}$ denotes the aggregated empirical average reward of node $(h, i)$ and $B_p$ is the confidence width of phase $p$. In other words, for any node $(h, i) \in \mathcal{E}_p$, the function value of the global objective inside the node is, with high probability, much worse than the function value inside the best node, and thus the node can be safely eliminated. The server then eliminates these bad nodes and proceeds to the next phase with the new set $\mathcal{A}_{p+1}$, which consists of the children of the uneliminated nodes from the previous phase, as shown in lines 15-16 of Algorithm 2.

Remark 3.1.
Apart from the obvious uniqueness in the algorithm design, such as lines 5-8 and 15-16 of Algorithm 2, FedPNE also introduces the idea of "node elimination", which is based on the hierarchical partitioning of the parameter space. If we treat the nodes of the partition as the "arms" in a multi-armed bandit, then FedPNE differs from the traditional Phased Elimination (PE) algorithm (Lattimore and Szepesvári, 2020) in mainly three aspects: (1) the nodes eliminated at different phases in FedPNE represent areas of the domain with different sizes because of the hierarchical partition, while the arms eliminated by PE at different phases play equal roles; (2) while trying to eliminate areas, FedPNE also explores deeper in the partition and expands one node into multiple nodes, which means that the number of nodes to be sampled may increase instead of decrease as $p$ increases, whereas the number of remaining arms only decreases in PE; (3) the elimination criterion in Eqn. (1) includes not only the Upper Confidence Bound (UCB) terms for statistical uncertainty, but also the smoothness term $\nu_1 \rho^{h}$, which accounts for the variation of the objective function inside one node.
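To make the contrast with classical phased elimination concrete, one phase of hierarchical node elimination can be sketched as below. This is a schematic with illustrative constants, not the exact FedPNE criterion: the confidence width `B` and the smoothness term `nu1 * rho**h` only mirror the generic form of Eqn. (1).

```python
# Schematic of one phase: eliminate nodes whose value is provably worse than
# the best node's, then expand the survivors into their children.
# Constants (B, nu1, rho) are illustrative, not FedPNE's exact choices.

def eliminate(means, h, best, B=0.05, nu1=1.0, rho=0.5):
    """means: {(h, i): aggregated empirical mean}; best: index of best node."""
    smooth = nu1 * rho ** h            # variation of f inside one node
    return {node for node, m in means.items()
            if m + 2 * B + smooth < means[best]}

def expand(active, eliminated):
    # Unlike classical PE, survivors are *split*, so the number of nodes
    # to sample in the next phase may grow instead of shrink.
    survivors = [n for n in active if n not in eliminated]
    return sorted((h + 1, 2 * i - 1 + j) for h, i in survivors for j in (0, 1))

means = {(2, 1): 0.10, (2, 2): 0.90, (2, 3): 0.55, (2, 4): 0.30}
bad = eliminate(means, h=2, best=(2, 2))
assert bad == {(2, 1), (2, 4)}        # (2, 3) survives: gap not large enough
assert expand(list(means), bad) == [(3, 3), (3, 4), (3, 5), (3, 6)]
```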
Remark 3.2.
It is worth mentioning that our algorithm requires the smoothness parameters $\nu_1$ and $\rho$ as part of the input; they measure how fast the diameter of a node shrinks in the partition. These parameters are important because they characterize the smoothness of the global objective, and we need them to determine the threshold $\tau_h$ and the elimination set $\mathcal{E}_p$. This information is crucial to ensure the validity of the cumulative regret analysis, even for centralized $\mathcal{X}$-armed bandit problems. Most existing $\mathcal{X}$-armed bandit algorithms, such as HOO (Bubeck et al., 2011), HCT (Azar et al., 2014) and VHCT (Li et al., 2021), require these parameters.
3.2 Theoretical Analysis
We provide the upper bound on the expected cumulative regret of the proposed FedPNE algorithm as follows, which exhibits our theoretical advantage over non-federated algorithms.
Theorem 3.1.
Suppose that the global objective $f$ satisfies Assumption 2, and let $d$ be the near-optimality dimension of $f$ as defined in Assumption 3. With a suitable choice of the confidence parameter $\delta$, we have the following upper bound on the expected cumulative regret of the FedPNE algorithm,
where $C_1$ and $C_2$ are two absolute constants that do not depend on $N$ and $T$. Moreover, the communication cost of FedPNE scales linearly in the number of clients $N$ and only logarithmically in the time horizon $T$.
Remark 3.3.
The proof of the above theorem and the exact values of the two constants are relegated to Appendix B. Theorem 3.1 displays a desirable regret upper bound for FedPNE because the first term on the right-hand side only depends on $N$; it is a cost due to federation across all the clients. When $T$ is sufficiently large compared with $N$, the second term dominates the bound, and it depends sublinearly on both the number of rounds $T$ and the number of clients $N$, which means that the algorithm converges to the optimum of the global objective. Moreover, the average cumulative regret of each client decreases as $N$ grows, which shows that increasing the number of clients helps reduce the regret of each client and thus validates the effectiveness of federation. Compared with the regret of traditional $\mathcal{X}$-armed bandit algorithms such as HCT and HOO (Bubeck et al., 2011; Azar et al., 2014), i.e., $\widetilde{\mathcal{O}}\big(T^{\frac{d+1}{d+2}}\big)$, the average regret bound of our algorithm is smaller when $N$ is large, which means that our algorithm is faster.
Remark 3.4.
When $T$ is relatively small, the first term dominates the regret bound, yielding a super-linear rate w.r.t. $N$. Such an undesirable rate is mainly due to the inefficient use of all clients: for example, when we explore the shallow layers of the partition (i.e., when the depth $h$ is small) at the beginning of the search, the total number of pulls of a node across all $N$ clients is much larger than the threshold $\tau_h$ actually required.
Remark 3.5.
Moreover, the communication cost in Theorem 3.1 depends only logarithmically on the time horizon $T$, showing that there are no frequent communications between the server and the clients during the federated learning process. Therefore, our algorithm simultaneously protects user privacy to a certain extent and saves communication cost. The dependence of the communication cost on the number of clients $N$ is unavoidable, since our algorithm assumes that all clients communicate synchronously, and thus the server exchanges information with every client in each phase. Such dependence is also observed in federated multi-armed bandit algorithms such as Fed1-UCB (Shi and Shen, 2021a) and Fed-PE (Huang et al., 2021).
4 Experiments
We empirically evaluate the proposed FedPNE algorithm on both synthetic functions and real-world datasets. For the federated algorithms, we plot the average cumulative regret per client against the number of rounds. For the centralized algorithm (HCT), we plot its cumulative regret on the global objective of each task against the number of evaluations. Such a comparison is fair in terms of overall computational resources, since one evaluation of the global objective can be viewed as the result of instant public communication of all local objective evaluations in one round. All the curves presented in this section (and the additional numerical results in the appendix) are averaged over 10 independent runs, with the shaded areas standing for one standard deviation. Additional experiments on other datasets and the experimental details are provided in Appendix C. We compare FedPNE with two competitive baselines.

HCT (Azar et al., 2014) is a strong $\mathcal{X}$-armed bandit baseline. As mentioned above, we run HCT on the global objective directly, which is a big advantage for HCT.
Synthetic Dataset. We evaluate the algorithms on two synthetic functions that are commonly used in the $\mathcal{X}$-armed bandit literature, the Garland function and the DoubleSine function, both defined on $[0, 1]$. These two functions are well known for their non-smoothness and large numbers of local optima. Perturbed versions of the two functions are used as the local objectives, while the original functions are used as the global objectives. The average cumulative regrets of the different algorithms are provided in Figures 2(a) and 2(b). As can be observed in the figures, Fed1-UCB suffers from linear regret due to the gap between the reward of the best point in its mesh grid and the global optimum. Our algorithm, however, is able to achieve even better performance than HCT running directly on the global objective.
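The local/global construction used for the synthetic experiments can be illustrated generically as follows. The base function and the perturbation scheme below are ours for illustration, not the paper's exact functions; the point is that the client-wise perturbations average out, so the global objective coincides with the original function.

```python
import math

# Generic illustration of the synthetic setup: each client's local objective
# is a perturbed copy of a multimodal base function, and the perturbations
# cancel on average, so the global objective equals the base function.
# Base function and perturbations are ours, not the paper's exact choices.

def base(x):
    return 0.5 * (math.sin(13 * x) * math.sin(27 * x) + 1.0)  # multimodal toy

def local(i, N):
    shift = 0.1 * math.sin(2 * math.pi * i / N)  # zero mean over i = 0..N-1
    return lambda x: base(x) + shift

def global_obj(x, N=8):
    return sum(local(i, N)(x) for i in range(N)) / N

x = 0.3
assert abs(global_obj(x) - base(x)) < 1e-12    # perturbations averaged out
```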
Landmine Detection. We tune, in a federated manner, the hyperparameters of machine learning models fitted on the Landmine dataset (Liu et al., 2007), where features of different locations on multiple landmine fields, extracted from radar images, are used to detect landmines. Following the setting of Dai et al. (2020), each client only has access to the data of one random field, and trains a support vector machine with the RBF kernel parameter chosen from [0.01, 10] and the regularization parameter chosen from a fixed range. The local and global objectives are the AUC-ROC scores on the local landmine field and on all the landmine fields, respectively. The average cumulative regrets of the different algorithms are provided in Figure 2(c). As can be observed in the figure, our algorithm achieves the smallest cumulative regret and thus the best performance.

COVID-19 Vaccine Dosage Optimization. To combat the pandemic, we optimize the vaccine dosage in epidemiological models of COVID-19 to find the best fractional dosage for the overall population, following Wiecek et al. (2022). Using fractional doses of the vaccines makes them less effective, but at the same time more people get the chance of vaccination, which can possibly accelerate the process of herd immunity. In our experimental setting, the local objectives are the final infection rates of different countries/regions. Different countries have different parameters, such as population size and the number of ICU units, which makes the objectives heterogeneous. The results are shown in Figure 2(d); our algorithm again achieves the fastest convergence.
Benefits of a large number of clients. Finally, we validate the theoretical benefit of having a large number of clients when running FedPNE. We plot the average cumulative regret of our algorithm on the four objectives with different numbers of clients in Figure 3. (For the COVID-19 dataset, there are only around 200 countries/regions in the world with available data, so we cap the largest number of clients accordingly.) As shown in the four figures, a larger number of clients improves the average cumulative regret of FedPNE, which corroborates our conclusions in Theorem 3.1.
5 Related Works
Federated bandits. Most recent works on federated bandits focus on the case where the number of arms is finite or the reward function is linear. For example, Shi and Shen (2021a) and Shi et al. (2021b) propose Fed-UCB-type algorithms for the multi-armed bandit problem (with personalization) and construct efficient client-server communications. Zhu et al. (2021); Li et al. (2020) propose to make use of differential privacy to protect the user information of each client in federated bandits. For federated linear contextual bandits, algorithms such as those of Dubey and Pentland (2020); Huang et al. (2021) utilize distinct ways to reconstruct the global contextual parameter and achieve sublinear regret with little communication cost. Very recently, Wang et al. (2022) extended the LASSO bandit algorithm (Bastani and Bayati, 2020) to federated high-dimensional bandits. To the best of our knowledge, our work is the first to discuss federated $\mathcal{X}$-armed bandits or continuum-armed bandits.
$\mathcal{X}$-armed bandits. Since the creation of the Zooming algorithm (Kleinberg et al., 2008), the $\mathcal{X}$-armed bandit has become an active line of research. Algorithms such as HOO, HCT, and VHCT provide cumulative regret bounds for the stochastic reward feedback setting (Bubeck et al., 2011; Azar et al., 2014; Li et al., 2021). Apart from these works, some algorithms address the case where the reward has no noise, such as DIRECT (Jones et al., 1993), DOO (Munos, 2011), SOO (Munos, 2011), and SequOOL (Bartlett et al., 2019). Another line of research focuses on minimizing the simple regret at the final round, and proposes algorithms such as POO (Grill et al., 2015), GPO (Shang et al., 2019), and StroquOOL (Bartlett et al., 2019). The theoretical results of these algorithms are not directly comparable to ours since we focus on cumulative regret.
Federated hyperparameter optimization. Recently, a few works have analyzed the problem of federated hyperparameter optimization using different approaches. Dai et al. (2020, 2021) have proposed modified versions of Thompson sampling and analyzed the problem using Bayesian optimization algorithms. However, these algorithms require strong assumptions, and their regret bounds do not show a clear dependence on the number of clients (Dai et al., 2020, 2021). Khodak et al. (2021) have proposed FedEx for optimizing the hyperparameters of algorithms such as FedAvg, and analyzed its regret in convex problems. Compared with these works, our work is more general in three senses: (1) we require very weak assumptions on the objectives for the FedPNE algorithm to work; (2) our work can be applied to non-hyperparameter-tuning problems, such as medicine dosage recommendation, which do not involve algorithms such as FedAvg; (3) we provide the first regret bound with a clear dependence on $N$ under these weak assumptions.

6 Discussion and Conclusion
In this work, we establish the framework of the federated $\mathcal{X}$-armed bandit problem and propose the first algorithm for such problems. The proposed FedPNE algorithm utilizes the intrinsic structure of the global objective inside the hierarchical partition and achieves desirable regret bounds in terms of both the number of clients and the evaluation budget. Meanwhile, it requires only logarithmically many communications between the server and the clients, protecting the privacy of the clients. Both the theoretical analysis and the experimental results show the advantage of FedPNE over centralized algorithms and prior federated multi-armed bandit algorithms. Many interesting future directions can be explored based on the framework proposed in this work. For example, other summary statistics of the client-wise data, such as the empirical variance used in Li et al. (2021), can potentially accelerate the proposed algorithm. Moreover, the current algorithm still needs the weak Lipschitzness assumption; whether the weakest assumption in the $\mathcal{X}$-armed bandit literature, i.e., the local smoothness without a metric proposed by Grill et al. (2015), can be used to prove similar regret guarantees remains a challenging open problem.

References
 Azar et al. [2014] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Online stochastic optimization under correlated bandit feedback. In International Conference on Machine Learning, pages 1557–1565. PMLR, 2014.
 Bartlett et al. [2019] Peter L. Bartlett, Victor Gabillon, and Michal Valko. A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption. In 30th International Conference on Algorithmic Learning Theory, 2019.
 Bastani and Bayati [2020] Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
 Bubeck et al. [2011] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. $\mathcal{X}$-armed bandits. Journal of Machine Learning Research, 12(46):1655–1695, 2011.
 Chen and Gallego [2022] Ningyuan Chen and Guillermo Gallego. A primal–dual learning algorithm for personalized dynamic pricing with an inventory constraint. Mathematics of Operations Research, 2022. doi: 10.1287/moor.2021.1220.
 Dai et al. [2020] Zhongxiang Dai, Bryan Kian Hsiang Low, and Patrick Jaillet. Federated Bayesian optimization via Thompson sampling. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9687–9699. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/6dfe08eda761bd321f8a9b239f6f4ec3-Paper.pdf.
 Dai et al. [2021] Zhongxiang Dai, Bryan Kian Hsiang Low, and Patrick Jaillet. Differentially private federated Bayesian optimization with distributed exploration. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 9125–9139. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/4c27cea8526af8cfee3be5e183ac9605-Paper.pdf.
 Dubey and Pentland [2020] Abhimanyu Dubey and Alex `Sandy' Pentland. Differentially-private federated linear bandits. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6003–6014. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/4311359ed4969e8401880e3c1836fbe1-Paper.pdf.
 Grill et al. [2015] Jean-Bastien Grill, Michal Valko, and Rémi Munos. Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2015.
 Huang et al. [2021] Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Rt5mjXAqHrY.
 Jones et al. [1993] David Jones, Cary Perttunen, and Bruce Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, Oct 1993.
 Khodak et al. [2021] Mikhail Khodak, Renbo Tu, Tian Li, Liam Li, Maria-Florina F. Balcan, Virginia Smith, and Ameet Talwalkar. Federated hyperparameter tuning: Challenges, baselines, and connections to weight-sharing. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 19184–19197. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/a0205b87490c847182672e8d371e9948-Paper.pdf.

 Kleinberg et al. [2008] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08, pages 681–690, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605580470. doi: 10.1145/1374376.1374475. URL https://doi.org/10.1145/1374376.1374475.
 Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
 Li et al. [2020] Tan Li, Linqi Song, and Christina Fragouli. Federated recommendation system via differential privacy. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2592–2597. IEEE, 2020.
 Li et al. [2021] Wenjie Li, Chi-Hua Wang, Qifan Song, and Guang Cheng. Optimum-statistical collaboration towards general and efficient black-box optimization, 2021.
 Liu et al. [2007] Qiuhua Liu, Xuejun Liao, and Lawrence Carin. Semi-supervised multitask learning. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL https://proceedings.neurips.cc/paper/2007/file/a34bacf839b923770b2c360eefa26748-Paper.pdf.
 McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
 Munos [2011] Rémi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
 Shang et al. [2019] Xuedong Shang, Emilie Kaufmann, and Michal Valko. General parallel optimization without a metric. In Algorithmic Learning Theory, pages 762–788, 2019.
 Shi and Shen [2021a] Chengshuai Shi and Cong Shen. Federated multi-armed bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11):9603–9611, May 2021a. URL https://ojs.aaai.org/index.php/AAAI/article/view/17156.
 Shi et al. [2021b] Chengshuai Shi, Cong Shen, and Jing Yang. Federated multi-armed bandits with personalization. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2917–2925. PMLR, 13–15 Apr 2021b. URL https://proceedings.mlr.press/v130/shi21c.html.
 Wang et al. [2022] Chi-Hua Wang, Wenjie Li, Guang Cheng, and Guang Lin. Federated online sparse decision making, 2022. URL https://arxiv.org/abs/2202.13448.
 Wiecek et al. [2022] Witold Wiecek, Amrita Ahuja, Esha Chaudhuri, Michael Kremer, Alexandre Simoes Gomes, Christopher M. Snyder, Alex Tabarrok, and Brandon Joel Tan. Testing fractional doses of COVID-19 vaccines. Proceedings of the National Academy of Sciences, 119(8):e2116932119, 2022. doi: 10.1073/pnas.2116932119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2116932119.
 Zhu et al. [2021] Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu. Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pages 3–4, 2021.
Appendix A Notations and Useful Lemmas
A.1 Notations
Here we list all the notations used in the proof of our cumulative regret bound:

: The set of phases that are completed up to time .

: The time steps of phase .

: The set of uneliminated/pre-eliminated nodes in phase .

: The set of nodes to be eliminated at the end of phase .

: The set of post-eliminated nodes in phase , i.e., .

: The indices of the node chosen by the server at phase from .

: The node that contains (one of) the global maxima at depth .

: The number of times the node is sampled from client , which is chosen to be in the FedPNE algorithm.

: The minimum number of samples required for a node at depth , defined below.

: The number of times the node is sampled, which is chosen to be in the FedPNE algorithm.
The threshold for every depth. The number of samples needed for the statistical error (the UCB term) of every node at depth to drop below the optimization error is the solution to
(2) 
which is equivalent to the following choice of the threshold
(3) 
Notably, this choice of the threshold is the same as the threshold value in the HCT algorithm [Azar et al., 2014]. In other words, we design our algorithm so that the samples come from the different clients uniformly, which keeps the estimators unbiased, while at the same time minimizing the unspent budget caused by this distribution. Some (manageable) unspent budget remains because of the floor operation in the computation of . However, thanks to the expansion criterion (line 56) in FedPNE, the algorithm descends to deep layers of the partition quickly when there are many clients, and is thus faster than single-client armed bandit algorithms.
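To make the budget accounting concrete, the following Python sketch (hypothetical names; the HCT-style threshold form, the constants c, nu, rho, and the ceiling-based rounding are illustrative assumptions, not FedPNE's exact choices) computes a per-depth sample threshold and splits it uniformly across the clients, showing that the rounding slack is bounded by the number of clients.

```python
import math

def depth_threshold(h, c=1.0, nu=1.0, rho=0.5, delta=0.01):
    """HCT-style threshold: the number of pulls after which the UCB width
    c * sqrt(log(1/delta) / tau) drops below the resolution nu * rho**h."""
    return math.ceil(c ** 2 * math.log(1.0 / delta) / (nu ** 2 * rho ** (2 * h)))

def per_client_allocation(tau, num_clients):
    """Uniform split across clients: each client pulls ceil(tau / M) times,
    so the threshold is met and the overshoot is smaller than M."""
    per_client = math.ceil(tau / num_clients)
    overshoot = per_client * num_clients - tau
    return per_client, overshoot

tau = depth_threshold(h=3)                      # 295 with these constants
per_client, overshoot = per_client_allocation(tau, num_clients=8)
assert per_client * 8 >= tau                    # every node meets its threshold
assert overshoot < 8                            # slack bounded by #clients
```

Deeper levels require geometrically more samples (the threshold grows like rho**(-2h)), which is consistent with the geometric growth of the phase lengths established in Lemma A.6.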
A.2 Supporting Lemmas
Lemma A.1.
(Hoeffding’s Inequality) Let $X_1, \ldots, X_n$ be independent random variables such that $a_i \leq X_i \leq b_i$ almost surely. Consider the sum of these random variables, $S_n = X_1 + \cdots + X_n$. Then for all $t > 0$, we have
$$\mathbb{P}\left(\left|S_n - \mathbb{E}[S_n]\right| \geq t\right) \leq 2\exp\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right).$$
Here $\mathbb{E}[S_n]$ is the expected value of $S_n$.
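As a quick empirical sanity check of the lemma, the sketch below (a hypothetical simulation, not part of the algorithm) compares the Monte-Carlo deviation frequency of a sum of bounded uniform variables against the Hoeffding bound.

```python
import math
import random

def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Two-sided Hoeffding bound on P(|S_n - E[S_n]| >= t) for n independent
    variables supported on [a, b]."""
    return 2.0 * math.exp(-2.0 * t ** 2 / (n * (b - a) ** 2))

def empirical_deviation(n, t, trials=20000, seed=0):
    """Monte-Carlo estimate of P(|S_n - E[S_n]| >= t) for Uniform[0, 1] draws."""
    rng = random.Random(seed)
    expected_sum = n * 0.5  # E[S_n] for Uniform[0, 1]
    hits = sum(
        1 for _ in range(trials)
        if abs(sum(rng.random() for _ in range(n)) - expected_sum) >= t
    )
    return hits / trials

n, t = 50, 10.0
assert empirical_deviation(n, t) <= hoeffding_bound(n, t)  # bound holds
```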
Lemma A.2.
(High Probability Event) Define the “good” event as
(4) 
where the right-hand side is exactly the confidence bound for the node and are two constants. Then for any fixed round , we have
Proof. For every and every , note that the node is sampled the same number of times independently from every local objective . Therefore, by Hoeffding’s inequality (Lemma A.1), for any , we have
(5) 
If we take the average over the clients and the samples in the two summation terms inside the probability expression, we get that for any
(6) 
Therefore by the union bound, the probability of the event can be bounded as
(7)  
where the third inequality follows from the fact that the number of nodes is always smaller than since every client visits every node at least once.
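The union-bound computation can be illustrated numerically; the sketch below (hypothetical parameter names, not the paper's exact constants or node count) picks a per-node confidence level so that summing the two-sided failure probabilities over all visited nodes meets an overall failure target.

```python
def union_bound_failure(num_nodes, per_node_delta):
    """Union bound: the good event fails with probability at most the sum of
    the two-sided per-node failure probabilities, i.e., 2 * delta' per node."""
    return min(1.0, 2.0 * num_nodes * per_node_delta)

def per_node_delta_for(target, num_nodes):
    """Per-node level delta' = target / (2 * num_nodes), which makes the
    union bound meet an overall failure probability of `target`."""
    return target / (2.0 * num_nodes)

# If at most t nodes have been visited by round t (each node is sampled at
# least once), a per-node level of target / (2 * t) suffices.
t, target = 1000, 0.05
delta_prime = per_node_delta_for(target, t)
assert union_bound_failure(t, delta_prime) <= target + 1e-12
```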
Lemma A.3.
(Optimal Node is Never Eliminated). Under the high probability event at time , the node that contains the global optimum of at depth level , denoted by , is never eliminated in Algorithm 2, i.e., .
Proof. Under the high probability event , we have the following inequality for every node
(8) 
Therefore we have the following set of inequalities at the end of each phase for .
(9)  
where the second inequality follows from the local smoothness property of the objective. Therefore we conclude that and the optimal node is never eliminated.
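The elimination rule that Lemma A.3 protects can be sketched as follows (a simplified single-phase illustration with hypothetical names, not the paper's Algorithm 2): a node survives the phase only if its upper confidence bound reaches the largest lower confidence bound among the candidates, so under the good event the truly optimal node always passes.

```python
def surviving_nodes(estimates, conf_width):
    """Phase-end elimination rule: keep node i if and only if
    mean_i + conf_width >= max_j (mean_j - conf_width).
    `estimates` maps a node id to its empirical mean; `conf_width` is the
    shared confidence radius for the current depth."""
    best_lcb = max(mean - conf_width for mean in estimates.values())
    return {i for i, mean in estimates.items() if mean + conf_width >= best_lcb}

# Under the good event every empirical mean is within conf_width of the
# truth, so the UCB of the truly best node exceeds every LCB and it survives.
means = {"a": 0.90, "b": 0.60, "c": 0.30}
kept = surviving_nodes(means, conf_width=0.05)
assert "a" in kept       # the best node is never eliminated
assert "c" not in kept   # clearly suboptimal nodes are pruned
```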
Lemma A.4.
(Remaining Nodes Are Near-Optimal). Under the high probability event at time , the representative point of every uneliminated node in phase , i.e., , is at least optimal, that is
(10) 
Proof. This lemma follows a proof logic similar to that of Lemma A.3. Under the high probability event , we have the following inequality for every node such that
(11) 
Therefore the following set of inequalities hold
(12)  
where the second inequality holds because is not eliminated. The third inequality holds because of the optimality of the node in Algorithm 2, and the last one follows from the weak Lipschitzness assumption (Assumption 2). In conclusion, we have the following upper bound on the regret
(13) 
where the last inequality holds because in phase , we sample each node a sufficient number of times (larger than the threshold ) so that and thus are all smaller than .
Lemma A.5.
(Lemma 3 in Bubeck et al. [2011]) For a node , define to be the maximum of the function on that region. Suppose that for some , then all in are optimal.
Proof. The proof is provided here for completeness. For any positive real number , we denote by such that
(14) 
By the weak Lipschitz property, we have that for all ,
(15)  
Letting and substituting the bounds on the suboptimality and on the diameter of (Assumption 2) concludes the proof.
Lemma A.6.
(Upper Bounds on the Lengths of Phases) Given the choice of the threshold and the design in Algorithm 2, there exists a constant such that for any fixed phase , letting and denote the depths sampled by the algorithm at phases and respectively, the following inequalities hold almost surely
(16) 
Proof. By the design of our algorithm, the length of each (dynamic) phase should be
(17)
where denotes the th power of , and similar notations are used for the other powers. The second equality holds because nodes in are descendants of the uneliminated nodes in , and the size of is bounded by by the definition of the near-optimality dimension (Assumption 3) and Lemma A.4. Define to be the depth where ; then
(18) 
Then when , because and , which means that and thus the algorithm will only increase the depth by 1. Therefore
(19) 
In the other case when , we have and thus because this number is already larger than . Now since , we have the following bound on
(20) 
which completes the proof.
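The geometric growth underlying this bound can be sketched numerically (hypothetical constants; the threshold form and the node-count bound via the near-optimality dimension are illustrative assumptions, not the paper's exact quantities): the length of the phase at depth h is at most the number of retained nodes times the per-node threshold, and these products grow geometrically, so the deepest phases dominate the budget.

```python
import math

def phase_length_bound(h, c=1.0, nu=1.0, rho=0.5, delta=0.01, C=2.0, d=0.5):
    """Upper bound on the length of the phase exploring depth h: the number
    of retained near-optimal nodes (at most C * rho**(-d*h)) times the
    per-node sample threshold at depth h."""
    tau_h = math.ceil(c ** 2 * math.log(1.0 / delta) / (nu ** 2 * rho ** (2 * h)))
    num_nodes = math.ceil(C * rho ** (-d * h))
    return num_nodes * tau_h

lengths = [phase_length_bound(h) for h in range(1, 6)]
# Phase lengths grow geometrically with the depth, so the total budget is
# dominated by the last (deepest) phase.
assert all(a < b for a, b in zip(lengths, lengths[1:]))
```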
Appendix B Main Proof of Theorem 3.1
In this section, we provide the proof of the main theorem of this paper (Theorem 3.1).
Proof. Let denote the indicator of the event , i.e., it is 1 if the event is true and 0 otherwise. We first decompose the regret into two terms
(21)  
For the second term, note that we can bound its expectation as follows
(22)  