1 Introduction
Stochastic Multi-Armed Bandits (MAB) (Bubeck et al., 2012) is a fundamental problem in machine learning with wide applications in the real world. In stochastic MAB, there is an unknown underlying distribution over the rewards of the base arms, and a learner (also called a server) interacts with the environment for $T$ rounds. At each round, the environment draws random rewards for the base arms from this distribution. At the same time, the learner chooses one of the base arms based on previously collected information, and receives the reward of the chosen arm. The goal of the learner is to minimize the regret, measured as the difference between the reward of the best fixed base arm and the learner's total reward in expectation. Multi-Armed Bandits have been used in recommendation systems, clinical trials, etc. However, many of these applications rely heavily on users' sensitive data, which raises serious concerns about data privacy. For example, in recommendation systems, the observations at each round represent the user's preferences over the recommended item set; they are personal information and should be protected.

Since first proposed in 2006, Differential Privacy (DP) (Dwork et al., 2006) has become a gold standard in privacy-preserving machine learning (Dwork and Roth, 2014). We say an algorithm preserves differential privacy if there is not much difference between the outputs of this algorithm over two datasets with Hamming distance 1 (see Section 2 for the rigorous definition in the streaming setting). For differentially private stochastic Multi-Armed Bandits, there have already been extensive studies (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016; Sajed and Sheffet, 2019). Based on the classic non-private optimal UCB algorithm (Auer et al., 2002), as well as the tree-based aggregation technique for computing private summations (Dwork et al., 2010), both Mishra and Thakurta (2015) and Tossou and Dimitrakakis (2016) designed algorithms under a DP guarantee but with suboptimal regret (in fact, Tossou and Dimitrakakis (2016) achieved a better utility bound, but under a weaker privacy guarantee compared with common differential privacy in the streaming setting).
Recently, Sajed and Sheffet (2019) proposed a more involved algorithm based on non-private Successive Elimination (Even-Dar et al., 2002) and the sparse vector technique (Dwork and Roth, 2014) to achieve the optimal regret bound in terms of the minimum reward gap, which matches both the non-private lower bound (Lai and Robbins, 1985) and the differentially private lower bound (Shariff and Sheffet, 2018) in common parameter regimes.

However, stochastic MAB is the simplest model for sequential decision making under uncertainty. Many real-world problems have a combinatorial nature among multiple arms, and possibly even nonlinear reward functions, such as online advertising, online shortest path, and online social influence maximization; these can be modeled via Combinatorial Semi-Bandits (CSB) (Chen et al., 2013, 2016; Lattimore and Szepesvári, 2018). In CSB, the learner chooses a super arm, which is a set of base arms, instead of a single base arm as in MAB; it then observes the outcomes of the chosen base arms as feedback, and receives a reward determined by these outcomes. The reward can be a nonlinear function of these observations. Since many applications modeled via CSB also face privacy leakage, in this paper we study how to design private algorithms for Combinatorial Semi-Bandits under two common assumptions about nonlinear rewards: $B_\infty$ bounded smoothness and $B_1$ bounded smoothness (see Section 2 for definitions), which contain social influence maximization and linear CSB as important examples, respectively (Kveton et al., 2015; Chen et al., 2016; Wang and Chen, 2017).
Main Difficulty: Compared with simple stochastic MAB, it is more difficult to design differentially private algorithms for CSB, due to its large action space and nonlinear rewards. Though each super arm in CSB could be regarded as a base arm in stochastic MAB, a straightforward application of differentially private algorithms for stochastic MAB would lead to a dependence on the size of the decision set of super arms, which can be exponentially large in the number of base arms. Besides the above two differences, at each round we receive the observations of a set of base arms contained in the chosen super arm, instead of a single base arm as in MAB. Denote the maximum cardinality of a super arm by $K$; this means the sensitive data collected at each round roughly lies in a $K$-dimensional ball.
However, protecting differential privacy usually incurs an additional dependence on the dimension of the data in the utility guarantee compared with the corresponding non-private result, which is a notorious side-effect of DP, appearing for example in differentially private empirical risk minimization (ERM) (Bassily et al., 2014), bandit linear optimization (Agarwal and Singh, 2017), and online and bandit convex optimization (Thakurta and Smith, 2013). On one hand, in some cases such as differentially private ERM (Bassily et al., 2014), this additional dependence on the dimension is unavoidable. On the other hand, some researchers have shown that it is possible to eliminate this side-effect given extra structure, such as restricted strong convexity, parameter sets bounded in certain norms, or generalized linear models with suitably bounded data (Kifer et al., 2012; Smith and Thakurta, 2013; Jain and Thakurta, 2014; Talwar et al., 2015). In general, it is unclear whether the dimension dependence brought by privacy protection can be eliminated, let alone in our CSB setting, which does not have any of the extra structure mentioned above.
Besides, compared with differential privacy, which allows the server to collect users' true data, Local Differential Privacy (LDP) is a much stronger notion of privacy, which requires protecting data privacy before collection. Thus LDP is more practical and user-friendly compared with DP (Cormode et al., 2018). Intuitively, learning under an LDP guarantee is more difficult, as the data we collect is already noisy. Moreover, eliminating the side-effect on the dimension is also harder under an LDP guarantee, even with extra assumptions; for example, there are negative results for locally differentially private sparse mean estimation (Duchi et al., 2016).

Our Contributions: Given the above discussion, it seems hard to obtain nearly optimal regret for CSB under DP and the much stronger LDP guarantee. Somewhat surprisingly, without any additional structural assumption such as sparsity, we show that it is indeed possible to achieve nearly optimal regret bounds, by designing private algorithms with theoretical upper bounds and proving corresponding lower bounds in each case. Our upper bounds (nearly) match both our private lower bounds and the non-private lower bounds (see Table 1 for an overview, where the gap parameter is defined in Section 3, and where tilde notation hides polylogarithmic dependence). The main contributions of this paper are summarized as follows:
(1) For $B_\infty$ bounded smooth CSB under LDP and DP, we propose novel algorithms with nearly optimal regret bounds, and prove nearly matching lower bounds;
(2) For $B_1$ bounded smooth CSB under DP, we propose an algorithm with a nearly optimal regret bound and prove a nearly matching lower bound.
In Section 2, we provide some background on Combinatorial Semi-Bandits and (Local) Differential Privacy. Then in Sections 3 and 4, we study both upper and lower bounds for (locally) differentially private $B_\infty$ bounded smooth and $B_1$ bounded smooth CSB, respectively. Finally, we conclude our main results in Section 5.
1.1 Other Related Work
Besides differentially private stochastic MAB, there are also works considering adversarial MAB with DP guarantees (Thakurta and Smith, 2013; Tossou and Dimitrakakis, 2017; Agarwal and Singh, 2017). Later, Shariff and Sheffet (2018) studied contextual linear bandits under a relaxed definition of DP called Joint Differential Privacy. Compared with DP, bandit learning with LDP guarantees has received less attention: only Gajane et al. (2018) study stochastic MAB under LDP. Recently, Basu et al. (2019) investigated the relations among several variants of differential privacy in the MAB setting, and proved some lower bounds. For non-private Combinatorial Semi-Bandits, there is an extensive line of work (György et al., 2007; Chen et al., 2013, 2016; Kveton et al., 2015; Combes et al., 2015; Wang and Chen, 2017, 2018).
2 Preliminaries
Now we detail the concrete setting studied in this paper.
2.1 Combinatorial Semi-Bandits
In Combinatorial Semi-Bandits (CSB), there are $m$ base arms, denoted $[m] = \{1, \dots, m\}$, and a predefined decision set $\mathcal{S}$, each element of which is a subset of $[m]$ with at most $K$ base arms and is called a super arm or an action; i.e., $|S| \le K$ for any $S \in \mathcal{S}$, where $|\cdot|$ represents the cardinality of a set. There is an underlying unknown distribution $\mathcal{D}$ supported on $[0,1]^m$ with expectation $\mu$. There are $T$ rounds in total. At each round $t$, the player chooses a super arm $S_t \in \mathcal{S}$, and the environment draws a fresh random outcome $X_t \sim \mathcal{D}$ independently of any other variables. Then the player receives a reward $R(S_t, X_t)$ and observes the feedback $\{X_{t,i}\}_{i \in S_t}$. We assume the reward function satisfies the following assumptions, which are common in either real applications or previous literature (Chen et al., 2016; Wang and Chen, 2018), such as linear CSB and social influence maximization.
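To make the interaction protocol concrete, here is a minimal sketch of one round, assuming Bernoulli base-arm outcomes and a linear reward (a special case of the general reward under Assumption 1); the names `decision_set`, `mu`, and `choose` are ours, not the paper's.

```python
import random

def csb_round(decision_set, mu, choose):
    """One round of combinatorial semi-bandit interaction.

    decision_set: list of super arms, each a tuple of base-arm indices.
    mu: true means of the Bernoulli base-arm outcomes (unknown to the player).
    choose: the player's policy, mapping the decision set to a super arm.

    Returns the chosen super arm, the semi-bandit feedback (outcomes of the
    chosen base arms only), and the reward, here simply linear.
    """
    S = choose(decision_set)
    feedback = {i: float(random.random() < mu[i]) for i in S}
    reward = sum(feedback.values())  # linear reward; the general case is R(S, X)
    return S, feedback, reward
```

Under semi-bandit feedback the player sees the outcome of every base arm inside the chosen super arm, but nothing about arms outside it.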
Assumption 1.
There exists an expected reward function $r$ such that $\mathbb{E}[R(S, X)] = r_S(\mu)$ for any $S \in \mathcal{S}$, where the expectation is taken over the randomness of the outcome $X \sim \mathcal{D}$.
Under the above assumption, define $\mathrm{opt}_\mu = \max_{S \in \mathcal{S}} r_S(\mu)$ as the optimal expected reward if we knew $\mu$ in advance.
Assumption 2 ($B_p$ bounded smoothness).
There exists a constant $B_p > 0$ such that for an arbitrary super arm $S$ and any two mean vectors $\mu, \mu'$, we have $|r_S(\mu) - r_S(\mu')| \le B_p \|\mu_S - \mu'_S\|_p$, where $\mu_S$ represents the truncated vector of $\mu$ on the subset $S$ (entries outside $S$ set to zero).
Assumption 3 (Monotonicity).
For any $\mu, \mu'$ such that $\mu \le \mu'$ (element-wise comparison), we have $r_S(\mu) \le r_S(\mu')$ for any super arm $S$.
Intuitively, Assumptions 2 and 3 capture the smoothness and monotonicity of the expected reward function $r_S(\mu)$, which are critical for dealing with nonlinear rewards.
In this paper, we mainly consider two norms: the $\ell_1$ norm and the $\ell_\infty$ norm, with corresponding smoothness constants $B_1$ and $B_\infty$. Important examples that satisfy $B_\infty$ bounded smoothness include social influence maximization and the probabilistic maximum coverage bandit (Chen et al., 2013). For $B_1$ bounded smooth CSB, online shortest path and online maximum spanning tree are typical applications (Wang and Chen, 2018). Obviously, linear combinatorial semi-bandits are $B_1$ bounded smooth. We regard $B_1$ and $B_\infty$ as constants throughout the paper. Apparently, $B_1$ bounded smoothness is a weaker assumption than $B_\infty$ bounded smoothness, and we have the following fact:
Fact 1.
Suppose a reward function is $B_\infty$ bounded smooth; then it is also $B_1$ bounded smooth with $B_1 = B_\infty$. On the contrary, suppose a reward function is $B_1$ bounded smooth; then it is $B_\infty$ bounded smooth with $B_\infty = K B_1$.
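Fact 1 follows from the elementary norm comparison for vectors supported on at most $K$ coordinates:

```latex
\|v\|_\infty \;\le\; \|v\|_1 \;\le\; K\,\|v\|_\infty
\qquad \text{for } v \in \mathbb{R}^m \text{ with } |\mathrm{supp}(v)| \le K .
```

Applying the left inequality to $v = \mu_S - \mu'_S$ shows that a $B_\infty$ bounded smooth reward is $B_1$ bounded smooth with the same constant; the right inequality gives the converse with an extra factor of $K$.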
For many combinatorial problems, such as MAX-CUT, minimum weighted set cover, etc., only efficient approximation algorithms are known. Therefore, it is natural to model the offline optimizer as a general approximation oracle, defined as below:
Definition 1.
For some $\alpha, \beta \le 1$, an $(\alpha, \beta)$-approximation oracle is an oracle that takes an expectation vector $\mu$ as input, and outputs a super arm $S$ such that $\Pr[r_S(\mu) \ge \alpha \cdot \mathrm{opt}_\mu] \ge \beta$. Here $\alpha$ is the approximation ratio and $\beta$ is the success probability of the oracle.
With only an approximation oracle, we should consider the corresponding approximation regret, since we can only solve the offline problem approximately:
Definition 2.
The $(\alpha, \beta)$-approximation regret of a CSB algorithm after $T$ rounds using an $(\alpha, \beta)$-approximation oracle under the expectation vector $\mu$ is defined as
$$\mathrm{Reg}_{\mu}^{\alpha, \beta}(T) = T \cdot \alpha \beta \cdot \mathrm{opt}_\mu - \mathbb{E}\left[\sum_{t=1}^{T} r_{S_t}(\mu)\right].$$
2.2 (Local) Differential Privacy
Now we give definitions of DP and LDP, as well as a basic building block.
Definition 3 (Differential Privacy (Dwork et al., 2006; Jain et al., 2012)).
Let $D = (x_1, \dots, x_T)$ be a sequence of data with domain $\mathcal{X}^T$. Let $\mathcal{A}(D) = (a_1, \dots, a_T)$ be the outputs of the randomized algorithm $\mathcal{A}$ on input $D$, released one per step. $\mathcal{A}$ is said to preserve $\epsilon$-differential privacy if, for any two data sequences $D, D'$ that differ in at most one entry, and for any measurable subset $\mathcal{E}$ of the output space, it holds that
$$\Pr[\mathcal{A}(D) \in \mathcal{E}] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in \mathcal{E}].$$
Compared with DP, Local Differential Privacy (LDP) is a stronger notion of privacy; see Kasiviswanathan et al. (2011); Duchi et al. (2013). Since LDP requires each user's data to be randomized before collection in order to protect privacy, there is no need to define a corresponding streaming version. Here we adopt the LDP definition given in Bassily and Smith (2015).
Definition 4 (LDP).
A mechanism $Q$ is said to be $\epsilon$-locally differentially private, or $\epsilon$-LDP, if for any pair of inputs $x, x'$, and any (measurable) subset $\mathcal{E}$ of the output space, it holds that
$$\Pr[Q(x) \in \mathcal{E}] \le e^{\epsilon} \cdot \Pr[Q(x') \in \mathcal{E}].$$
To protect LDP, the most commonly used method is the Laplacian mechanism. Suppose the output domain of an algorithm is bounded by a $d$-dimensional $\ell_1$ ball with radius $r$. The Laplacian mechanism simply injects a $d$-dimensional random noise vector into the true output, with each entry of the noise sampled independently from a Laplace distribution whose scale is calibrated to the $\ell_1$ diameter $2r$ of the domain. (Here $\mathrm{Lap}(b)$ denotes the Laplace distribution centered at $0$ with scale $b$; its p.d.f. is $\frac{1}{2b}\exp(-|x|/b)$ and its variance is $2b^2$.) It is easy to prove that the Laplacian mechanism guarantees $\epsilon$-LDP (Dwork and Roth, 2014).

3 $B_\infty$ Bounded Smooth CSB with Privacy Guarantee
Since learning under LDP is much more difficult than under DP, we mainly consider how to design an optimal algorithm for $B_\infty$ bounded smooth CSB under the LDP guarantee. As we will see, based on our observations for locally differentially private CSB, it is then easy to obtain results for differentially private CSB.

As a warm-up, we show that a simple mechanism can achieve nontrivial regret with an LDP guarantee, but with a suboptimal dependence on the dimension. Next, we design an improved version with an optimal utility bound; the matching lower bound is proved in Subsection 3.3.
3.1 A Straightforward Algorithm with Sub-Optimal Guarantee
Our private algorithm is based on the previous non-private CSB algorithm, Combinatorial UCB (CUCB) (Chen et al., 2013, 2016). Though the reward function is nonlinear in the super arm and we only have access to an approximation oracle, which makes our setting more complicated than previous private stochastic MAB (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016; Sajed and Sheffet, 2019), we show that the most straightforward method, described in Algorithm 1, i.e., applying the Laplacian mechanism to each user's data before collection, is enough to guarantee LDP with a corresponding regret bound.

The key observation is that the mean estimation of each base arm lies at the core of the CUCB algorithm, and adding Laplacian noise to each observation only adds variance to these estimations, which can be handled by relaxed upper confidence bounds. Injecting noise into the reward is used in both Tossou and Dimitrakakis (2017) and Agarwal and Singh (2017) for differentially private adversarial MAB. The idea of relaxed UCBs also appeared before in differentially private stochastic MAB (Mishra and Thakurta, 2015), whereas we study the more general locally differentially private CSB with nonlinear rewards and an approximation oracle. Given the Laplacian mechanism, the privacy guarantee of Algorithm 1 is immediate:
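A minimal sketch of this warm-up idea (our reconstruction, not the paper's pseudocode): each observed entry gets user-side Laplace noise of scale proportional to $K/\epsilon$ before collection, and the server runs CUCB with an enlarged confidence radius. The noise scale, the confidence radius, and the exact-argmax oracle are illustrative assumptions; the paper works with a general $(\alpha, \beta)$-approximation oracle.

```python
import math
import random

def laplace(scale):
    # Difference of two i.i.d. exponentials is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def warmup_ldp_cucb(decision_set, mu, T, epsilon, K):
    """Warm-up LDP variant of CUCB: every observed entry gets Lap(2K/eps)
    noise on the user side; the server widens the UCB bonus accordingly."""
    m = len(mu)
    counts = [0] * m
    means = [0.0] * m
    history = []
    for t in range(1, T + 1):
        def ucb(i):
            if counts[i] == 0:
                return float("inf")
            # Enlarged radius: the Laplace noise inflates the estimator's
            # variance, hence the extra (1 + 2K/epsilon) factor (illustrative).
            return means[i] + (1.0 + 2.0 * K / epsilon) * math.sqrt(
                3.0 * math.log(t) / (2.0 * counts[i]))
        S = max(decision_set, key=lambda s: sum(ucb(i) for i in s))  # exact oracle
        for i in S:
            x = float(random.random() < mu[i])       # user's true observation
            y = x + laplace(2.0 * K / epsilon)       # privatized before collection
            counts[i] += 1
            means[i] += (y - means[i]) / counts[i]   # online mean update
        history.append(S)
    return history
```

Note that the user perturbs all $K$ entries of the feedback, which is exactly what forces the noise scale (and hence the regret) to pick up the dimension factor discussed next.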
Theorem 1.
Algorithm 1 guarantees LDP.
Before stating the regret bound, we define some necessary notations. We say a super arm $S$ is bad if $r_S(\mu) < \alpha \cdot \mathrm{opt}_\mu$, and denote the set of bad super arms by $\mathcal{S}_B$. For any base arm $i$, define
(1)  $\Delta_{\min}^i = \alpha \cdot \mathrm{opt}_\mu - \max\{r_S(\mu) : S \in \mathcal{S}_B, i \in S\}$,
(2)  $\Delta_{\max}^i = \alpha \cdot \mathrm{opt}_\mu - \min\{r_S(\mu) : S \in \mathcal{S}_B, i \in S\}$,
and $\Delta_{\min} = \min_i \Delta_{\min}^i$, $\Delta_{\max} = \max_i \Delta_{\max}^i$.
Now, we state the utility guarantee of Algorithm 1:
Theorem 2.
Under the $B_\infty$ bounded smoothness and monotonicity assumptions, the regret of Algorithm 1 is upper bounded by
(3) 
Compared with the corresponding non-private CUCB regret (Chen et al., 2013, 2016), one can see that the regret bound of Algorithm 1 carries an extra multiplicative factor. According to the lower bound proved in Subsection 3.3, the dependence on the privacy parameter $\epsilon$ is optimal. However, the additional dimension-dependent term brought by privacy protection is undesirable and hurts the final performance for large $K$. In the next subsection, we show how to eliminate this additional factor.
3.2 An Improved Algorithm with the Best Guarantee
In contrast to previous studies that eliminate the dimension side-effect brought by privacy protection under sparsity or low-complexity assumptions (Jain and Thakurta, 2014; Talwar et al., 2015; Zheng et al., 2017), in our general CSB setting the information at each round is contained in a $K$-dimensional ball and we do not have any sparsity assumption, which makes the additional dimension-dependent factor seem unavoidable.
Somewhat surprisingly, after a careful analysis, we find that there is some implicitly redundant information even without any sparsity assumption. In detail, in the analysis of Algorithm 1, the instantaneous regret of choosing super arm $S_t$ at round $t$ is controlled by the largest mean-estimation error among all base arms in $S_t$. This implies that we do not need all the observations of the base arms in $S_t$ to update the corresponding empirical means. Instead, we only use the observation of the least-pulled base arm in $S_t$ to update its empirical mean and keep the others unchanged, as it is the weakest one in $S_t$ and causes the largest estimation error. Since each user now sends only a single entry to the server, noise calibrated to one observation suffices to protect it, which gets rid of the annoying additional factor in the regret guarantee. This variant is shown in Algorithm 2.
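This update rule can be sketched as follows (our illustrative reconstruction, not the paper's pseudocode; the constants and the exact-argmax oracle are assumptions):

```python
import math
import random

def laplace(scale):
    # Difference of two i.i.d. exponentials is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def least_pulled_ldp_cucb(decision_set, mu, T, epsilon):
    """Improved LDP variant: per round, only the least-pulled base arm of the
    chosen super arm sends a noisy observation, so Lap(2/eps) noise suffices
    and the noise scale no longer depends on the super-arm size K."""
    m = len(mu)
    counts = [0] * m
    means = [0.0] * m
    history = []
    for t in range(1, T + 1):
        def ucb(i):
            if counts[i] == 0:
                return float("inf")
            return means[i] + (1.0 + 2.0 / epsilon) * math.sqrt(
                3.0 * math.log(t) / (2.0 * counts[i]))
        S = max(decision_set, key=lambda s: sum(ucb(i) for i in s))
        i = min(S, key=lambda j: counts[j])        # least-pulled arm in S
        x = float(random.random() < mu[i])         # only this entry is revealed
        y = x + laplace(2.0 / epsilon)             # single-entry Laplace noise
        counts[i] += 1
        means[i] += (y - means[i]) / counts[i]
        history.append(S)
    return history
```

Because the instantaneous regret is governed by the worst estimate inside the chosen super arm, discarding the other observations does not change the order of the regret, while the reported message now has constant $\ell_1$ diameter instead of one growing with $K$.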
Again, the privacy guarantee follows directly from the classic Laplacian mechanism:
Theorem 3.
Algorithm 2 guarantees LDP.
Since we significantly condense the information required from each user, reducing it from $K$ observations to a single observation, we can now inject much less noise and prove a much better regret bound than that of Algorithm 1:
Theorem 4.
Under the $B_\infty$ bounded smoothness and monotonicity assumptions, the regret of Algorithm 2 is upper bounded by
(4) 
Compared with the non-private theoretical guarantee, Theorem 4 implies that we can achieve optimal locally differentially private $B_\infty$ bounded smooth CSB without any additional price paid for privacy protection beyond the dependence on $\epsilon$, which is somewhat surprising given previous work on (locally) differentially private learning. See Section A of the supplementary materials for the proof of Theorem 4.
Multi-Armed Bandits (MAB) is a special case of CSB in which $K = 1$ and each super arm is a single base arm. In this case, our Algorithms 1 and 2 coincide, and we obtain an algorithm for MAB under LDP with regret bound $O\big(\sum_{i \neq i^*} \frac{\ln T}{\epsilon^2 \Delta_i}\big)$, where $i^*$ is the optimal base arm and $\Delta_i$ is the gap between arm $i$ and the optimal arm $i^*$. Apparently, this regret bound is also optimal, given the LDP lower bound proved in Basu et al. (2019) and the non-private lower bound (Bubeck et al., 2012).
Finally, if one wants to protect DP rather than LDP, based on the same observation as above, we can simply apply the tree-based aggregation technique (Dwork et al., 2010) to the least-pulled base arm to compute its empirical mean estimate with a DP guarantee. Since tree-based aggregation injects much less noise than Algorithm 2 requires for LDP, it is not hard to prove that this DP variant achieves a better regret bound. (The proof of this result is a combination of the techniques used in this subsection and those in Subsection 4.2, and is hence omitted.)
3.3 Lower Bounds
In this subsection, we prove a regret lower bound for the locally private CSB problem with $B_\infty$ bounded smoothness. Like previous work (Kveton et al., 2015; Wang and Chen, 2017), we only consider the lower bound with an exact oracle, i.e., $\alpha = \beta = 1$.
First we define a class of algorithms that we are interested in:
Definition 5.
An algorithm is called consistent if, for any suboptimal super arm $S$, the expected number of times $S$ is chosen by the algorithm is sub-polynomial in $T$ for any stochastic CSB instance, i.e., $o(T^a)$ for any $a > 0$.
Our lower bound is derived for the class of consistent algorithms, which is natural for stochastic CSB and has been used in lower-bound analyses in many previous results (Lattimore and Szepesvári, 2018; Basu et al., 2019; Lai and Robbins, 1985; Kveton et al., 2015).
Our analysis focuses on CSB instances where the suboptimality gaps of all super arms are equal. Since the general CSB problem is harder than the CSB problem with equal suboptimality gaps (the latter can be reduced to the former), our lower bound applies directly to the general CSB class, with the common gap replaced by the corresponding per-arm gap for each base arm.
Theorem 5.
For any and , and any satisfying , the regret of any consistent locally private algorithm on the CSB problem with bounded smoothness is bounded from below as
Specifically, for , the regret is at least
The lower bound shows that Algorithm 2 achieves optimal regret with respect to all parameters of the CSB instance. The proof of the theorem is an almost direct reduction from private MAB. A previous result (Theorem 2 in Basu et al. (2019)) gives a regret lower bound for any consistent locally private MAB algorithm. Since any MAB instance is a special case of CSB with $K = 1$, the regret lower bound for stochastic CSB with $K = 1$ follows directly by reduction. For the general CSB problem with $B_\infty$ bounded smoothness, we consider a similar instance with the reward of each arm in the MAB instance multiplied by $B_\infty$. See Section B of the supplementary materials for the detailed analysis. For $B_\infty$ bounded smooth CSB under the DP setting, nearly the same technique yields the corresponding lower bound.
4 $B_1$ Bounded Smooth CSB with Privacy Guarantee
4.1 $B_1$ Bounded Smooth CSB under LDP
Though our proposed Algorithm 2 is already optimal for $B_\infty$ bounded smooth CSB, if we use it for $B_1$ bounded smooth CSB, such as the important case of linear CSB, to protect LDP, we obtain a regret bound with an extra dimension-dependent factor due to Fact 1. However, the optimal non-private regret bound for $B_1$ bounded smooth CSB (Kveton et al., 2015; Wang and Chen, 2017) implies a gap with our locally differentially private upper bound. Is it possible to eliminate this additional factor, just as in locally differentially private $B_\infty$ bounded smooth CSB? First, we prove a lower bound for $B_1$ bounded smooth CSB under the LDP guarantee. Our result under the $B_1$ bounded smoothness assumption can be applied to the linear CSB problem by setting $B_1 = 1$.
Theorem 6.
For any and such that is an integer, and any satisfying , the regret of any consistent locally private algorithm on the CSB problem satisfying bounded smoothness is bounded from below as
Specifically, for , the regret is at least
We borrow the hard instance from Kveton et al. (2015) to prove the lower bound. Consider a path semi-bandit problem with $m$ base arms. The feasible super arms are paths, each containing $K$ base arms. The reward of pulling a super arm is $B_1$ times the sum of the weights of its base arms. The weights of the different base arms in the same super arm are identical, while the weights in different paths are sampled i.i.d. Denote the best super arm by $S^*$. The weight of each base arm is a Bernoulli random variable whose mean is slightly higher on the optimal path than on the suboptimal paths.
We use the general canonical bandit model (Lattimore and Szepesvári, 2018) to prove the above theorem. See Section C of the supplementary materials for the detailed proof.
Though we can only prove a lower bound of the same order as the corresponding non-private optimal guarantee, we conjecture that our lower bound is loose and that the right lower bound carries an extra dimension-dependent factor. In other words, there may indeed be some side-effect of privacy protection on the utility guarantee in terms of the dimension under LDP. Intuitively, for $B_1$ bounded smooth CSB, we may have to update all arms in a played super arm for the regret guarantee (instead of only one arm, as we did for $B_\infty$ bounded smooth CSB), and this makes privacy protection harder, with an extra factor of $K$.
Since differential privacy is a relatively weaker notion than LDP, there is hope to further improve the regret bound if we focus on the DP guarantee. In the next two subsections, we show this is indeed the case, by designing a differentially private algorithm with a nearly optimal regret bound and proving a nearly matching lower bound.
4.2 Upper Bound under DP
Compared with LDP, where the learning algorithm (the server) can only receive noisy information, DP only restricts the output of the algorithm, and the server is allowed to collect true data. Thus, it is possible to inject much less noise in the DP setting via an economical allocation of the privacy budget $\epsilon$.
We use the tree-based aggregation scheme (Dwork et al., 2009; Chan et al., 2011) to protect DP in our algorithm. It is an effective method for releasing private continual statistics over a data stream and is frequently used in previous work, such as stochastic MAB (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016) and online convex optimization (Thakurta and Smith, 2013). Consider a data stream $\sigma = (\sigma_1, \sigma_2, \dots)$. At each step $t$, the algorithm receives data $\sigma_t$ and needs to output the sum $\sum_{s \le t} \sigma_s$, while ensuring that the whole output sequence is differentially private. The tree-based mechanism solves this problem in an elegant way with a binary tree. Each leaf node stores the data received at one step, and each internal node stores the sum of the data in the leaves rooted at it. Notice that one only needs access to $O(\log t)$ nodes, summing the values on them, in order to compute $\sum_{s \le t} \sigma_s$. Using the Laplacian mechanism, previous results have shown that adding appropriately scaled i.i.d. Laplace noise to each node ensures differential privacy for the scheme, as stated in the following lemma:
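A compact sketch of this binary (tree-based) mechanism for running sums, following Chan et al. (2011); the `noise_scale` calibration is left as a parameter, since each stream element touches $O(\log T)$ nodes and the per-node scale is chosen accordingly:

```python
import math
import random

def laplace(scale):
    if scale == 0.0:
        return 0.0
    # Difference of two i.i.d. exponentials is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

class TreeAggregator:
    """Binary mechanism: maintains O(log T) noisy partial sums so that each
    released prefix sum touches only O(log t) noisy nodes."""

    def __init__(self, T, noise_scale):
        self.noise_scale = noise_scale
        self.levels = max(1, math.ceil(math.log2(T))) + 1
        self.partial = [0.0] * self.levels   # exact partial sums per level
        self.noisy = [0.0] * self.levels     # their noisy released copies
        self.t = 0

    def add_and_sum(self, x):
        """Insert x and return a noisy running sum of all inserted values."""
        self.t += 1
        t = self.t
        i = 0
        while (t >> i) & 1 == 0:             # lowest set bit of t
            i += 1
        # Merge lower levels into level i, add fresh noise, reset lower levels.
        self.partial[i] = x + sum(self.partial[:i])
        self.noisy[i] = self.partial[i] + laplace(self.noise_scale)
        for j in range(i):
            self.partial[j] = 0.0
            self.noisy[j] = 0.0
        # The prefix sum is assembled from the levels where bit j of t is 1.
        return sum(self.noisy[j] for j in range(self.levels) if (t >> j) & 1)
```

Each released sum aggregates at most a logarithmic number of noisy nodes, so the noise in the running sum grows only polylogarithmically with $t$; this is why the extra confidence width in the DP algorithm decays faster than the sampling deviation.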
Lemma 1 (Dwork et al. (2010); Chan et al. (2011)).
The tree-based aggregation scheme with appropriately scaled i.i.d. Laplace noise added to each node is differentially private.
In our CSB setting, the leaf node of step $t$ stores a vector with support of size at most $K$. Each internal node stores the sum of the vectors in the leaves rooted at it. For each node, we add i.i.d. noise to each dimension of the vector stored on the node to guarantee DP (see Algorithm 3). Based on Lemma 1, we have:
Theorem 7.
Algorithm 3 guarantees DP.
In Algorithm 3, when we estimate the mean weight of a base arm based on previous outcomes, additional Laplace noise is added to the sum due to the tree-based aggregation scheme. Note that the number of Laplace noise terms added (the number of nodes we access) is only logarithmic. This means that the additional confidence width due to Laplace noise scales as $\tilde{O}(1/(\epsilon n))$ for a base arm pulled $n$ times. Compared with the original confidence width for the sub-Gaussian noise, which is of order $\tilde{O}(1/\sqrt{n})$, the additional width for the Laplace noise enjoys a better dependence on $n$. This helps us separate the sampling term and the privacy term in the regret via a delicate analysis, and finally derive a nearly optimal bound in additive form.
Theorem 8.
Under the $B_1$ bounded smoothness and monotonicity assumptions, the regret of Algorithm 3 is upper bounded by
We refer readers to Section D of the supplementary materials for the detailed proof. By relaxing LDP to DP, we have shown that it is possible to eliminate the side-effect on the dimension induced by privacy protection and nearly match the corresponding non-private optimal bound.
4.3 Lower Bound under DP
In this subsection, we prove a lower bound for CSB algorithms under DP. As with the LDP lower bound, we consider consistent CSB algorithms. The lower bound stated below implies that our Algorithm 3 achieves near-optimal regret up to logarithmic factors:
Theorem 9.
For any and such that , and any satisfying , the regret for any consistent CSB algorithm guaranteeing DP is at least .
The theorem is proved in Section E of the supplementary materials; we only sketch the proof here. Previous results have given a regret lower bound for non-private stochastic linear CSB. By slightly modifying the hard instance, we can show a corresponding regret lower bound for non-private CSB with $B_1$ bounded smoothness. Since private CSB is strictly harder than non-private CSB (by reduction), this lower bound carries over to private CSB. It then remains to prove the privacy-dependent term of the lower bound, and combining the two terms completes the proof.

Now we sketch the proof of the privacy-dependent term. Note that a simple extension of Kveton et al. (2015) only yields a weaker bound in our differentially private setting, which is not satisfactory. It is thus necessary to construct a new hard instance to prove Theorem 9.
To solve this problem, we design the following CSB problem as a special case of general CSB with $B_1$ bounded smoothness. Suppose there are $m$ base arms, each associated with a weight sampled from a Bernoulli distribution. These base arms are divided into three sets: the first builds up the optimal super arm; the second contains "public" base arms, which are contained in every suboptimal super arm; and each base arm in the third set, combined with the "public" base arms, builds up one suboptimal super arm. In total we have one optimal super arm and a number of suboptimal super arms. The means of the Bernoulli random variables are chosen so that every suboptimal super arm has the same gap. The weights of the "public" base arms are identical, while the other weights are sampled i.i.d. The reward of pulling a super arm is $B_1$ times the sum of the weights of its base arms. With the coupling argument of Karwa and Vadhan (2017), we can prove that the number of pulls of any fixed suboptimal super arm is bounded from below with high probability. Summing over the suboptimal super arms, we reach the privacy-dependent regret lower bound for private CSB.
5 Conclusion and Future work
In this paper, we study (locally) differentially private algorithms for Combinatorial Semi-Bandits under two common assumptions about the reward function. For $B_\infty$ bounded smooth CSB under LDP and DP, we characterize the optimal regret of the two settings by proving lower bounds and designing (nearly) optimal private algorithms. For the relatively weaker $B_1$ bounded smooth CSB, when we are required to protect DP instead of LDP, we characterize the optimal regret by giving a differentially private algorithm as well as a nearly matching lower bound. Moreover, the optimal performance in our (locally) differentially private CSB is of nearly the same order as in the non-private setting (Kveton et al., 2015; Chen et al., 2016; Wang and Chen, 2017).
Our Algorithm 2 is applicable to locally private CSB with $B_1$ bounded smoothness, with a correspondingly larger regret upper bound in this setting. However, the lower bound we prove does not match this upper bound. We conjecture that our lower bound is loose and that Algorithm 2 is also near-optimal for locally private CSB with $B_1$ bounded smoothness. Improving the lower bound is an interesting open problem for future work.
6 Acknowledgements
This work is supported by the National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026), BJNSF (L172037), and the Beijing Academy of Artificial Intelligence.
References
 The price of differential privacy for online learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 32–40. Cited by: §1.1, §1, §3.1.
 Finitetime analysis of the multiarmed bandit problem. Machine learning 47 (23), pp. 235–256. Cited by: §1.
 Private empirical risk minimization: efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pp. 464–473. Cited by: §1.

Local, private, efficient protocols for succinct histograms.
In
Proceedings of the FortySeventh Annual ACM on Symposium on Theory of Computing
, pp. 127–135. Cited by: §2.2.  Differential privacy for multiarmed bandits: what is it and what is its cost?. arXiv preprint arXiv:1905.12298. Cited by: §1.1, §3.2, §3.3, §3.3, §B.
 Regret analysis of stochastic and nonstochastic multiarmed bandit problems. Foundations and Trends® in Machine Learning 5 (1), pp. 1–122. Cited by: §1, §3.2.
 Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC) 14 (3), pp. 1–24.
 Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research 17 (1), pp. 1746–1778.
 Combinatorial multi-armed bandit: general framework and applications. In International Conference on Machine Learning, pp. 151–159.
 Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pp. 2116–2124.
 Privacy at scale: local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data, pp. 1655–1658.
 Local privacy and minimax bounds: sharp rates for probability estimation. In Advances in Neural Information Processing Systems, pp. 1529–1537.
 Minimax optimal procedures for locally private estimation. arXiv preprint arXiv:1604.02390.
 Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pp. 265–284.
 Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pp. 715–724.
 On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 381–390.
 The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407.

 PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pp. 255–270.
 Corrupt bandits for preserving local privacy. In Algorithmic Learning Theory, pp. 387–412.
 The online shortest path problem under partial monitoring. Journal of Machine Learning Research 8 (Oct), pp. 2369–2403.
 Differentially private online learning. In Conference on Learning Theory, pp. 24.1–24.34.
 (Near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pp. 476–484.

 Finite sample differentially private confidence intervals. arXiv preprint arXiv:1711.03908.
 What can we learn privately? SIAM Journal on Computing 40 (3), pp. 793–826.
 Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research 1 (41), pp. 3–1.
 Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pp. 535–543.
 Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22.
 Bandit algorithms. Preprint.
 An information-theoretic approach to minimax regret in partial monitoring. arXiv preprint arXiv:1902.00470.
 (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 592–601.
 An optimal private stochastic-MAB algorithm based on optimal private stopping rule. In International Conference on Machine Learning, pp. 5579–5588.
 Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems, pp. 4296–4306.
 Differentially private model selection via stability arguments and the robustness of the lasso. Journal of Machine Learning Research Proceedings Track 30, pp. 819–850.
 Nearly optimal private LASSO. In Advances in Neural Information Processing Systems, pp. 3025–3033.
 (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems, pp. 2733–2741.
 Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence.
 Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence.
 Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In Advances in Neural Information Processing Systems, pp. 1161–1171.
 Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5101–5109.
 Collect at once, use effectively: making non-interactive locally private learning possible. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 4130–4139.
Appendices
A Proof of Theorem 4
Theorem 4.
(Restated) For Algorithm 2, we have
(5) 
Proof.
Denote by the event that the oracle fails to produce an approximate answer with respect to the input vector in step . We have . In expectation, the number of steps in which this failure event happens is at most , and the cumulative regret incurred in these steps is at most
Now we consider only the steps in which the failure event doesn't happen. We maintain counters in the proof, and denote their values in step by . The counters are initialized in the same way as , i.e., . In step , if the failure event doesn't happen and the oracle selects a suboptimal super arm, we increment by one, i.e., , where ; otherwise we keep the counters unchanged. This implies . Notice that if a suboptimal super arm is pulled in step , exactly one counter is incremented, and . As a result, we have:
(6) 
Here denotes the suboptimality gap when is incremented from to in a certain step .
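The counter-update rule described above can be made concrete with a small sketch (our own illustrative code, not part of the paper): when a suboptimal super arm is pulled, exactly one counter is incremented, namely the counter of a base arm in the pulled super arm whose count is currently minimal.

```python
def increment_min_counter(counters: dict, pulled_super_arm: list) -> int:
    """Increment the counter of a base arm in the pulled super arm with the
    smallest current count. Exactly one counter changes per suboptimal pull,
    so the counters always sum to the number of suboptimal pulls so far."""
    i = min(pulled_super_arm, key=lambda arm: counters[arm])
    counters[i] += 1
    return i

counters = {arm: 0 for arm in range(5)}
for _ in range(3):
    increment_min_counter(counters, [0, 2, 4])
# After three pulls of the super arm {0, 2, 4}, each of its three base arms
# has been incremented exactly once (ties are broken by list order).
```

This bookkeeping device is what lets the proof charge each suboptimal pull to a single base-arm counter, so summing the per-increment gaps over all counters recovers the total regret of the non-failure steps.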
Now it remains to bound and . We denote the following event by : for a fixed step and a fixed base arm ,
The noise in comes from two parts: the Laplace noise added for privacy, and the randomness of . For the first part, by Bernstein's inequality applied to i.i.d. Laplace random variables, the confidence bound is with probability at least . For the second part, since is bounded, the confidence bound is with probability at least by Hoeffding's inequality. Hence happens with probability . By a union bound over all steps, happens for all and with probability . We denote this event by .
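The Bernstein-type concentration for a sum of i.i.d. Laplace noise terms used above can be sanity-checked numerically. The sketch below (our own, with illustrative constants rather than the paper's exact ones) verifies empirically that the sum of n Laplace(1/eps) variables stays within a sqrt(n log(2/beta))-type radius with frequency far above 1 - beta.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, beta, n, trials = 1.0, 0.01, 10_000, 1_000

# A Bernstein-style tail bound for the sum of n i.i.d. Laplace(1/eps) terms
# (each has variance 2/eps^2) gives, with probability at least 1 - beta,
#     |sum| <= sqrt(4 * n * log(2/beta)) / eps + 2 * log(2/beta) / eps.
# (Constants are illustrative; Laplace variables are sub-exponential, so the
# radius has a sqrt(n) leading term plus a lower-order additive term.)
radius = np.sqrt(4 * n * np.log(2 / beta)) / eps + 2 * np.log(2 / beta) / eps

sums = rng.laplace(scale=1.0 / eps, size=(trials, n)).sum(axis=1)
frac_violations = np.mean(np.abs(sums) > radius)
```

The sqrt(n) scaling of the radius, rather than the naive n * (1/eps) worst case, is exactly what keeps the privacy noise from dominating the sampling noise in the confidence bound.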
Suppose happens; we then have . If a suboptimal arm is pulled in step , we have
(7) 
The first inequality is due to the monotonicity and bounded-smoothness assumptions. The second inequality holds because the oracle returns a which satisfies . The third inequality is due to the definition of and the concentration bound for . The last inequality is due to .
Define . If for some , then by Eq. (7) we have . On the other hand, by the definition of , , which leads to a contradiction. This means that if a suboptimal arm is pulled in step and contains base arm , then the counter is at most