(Locally) Differentially Private Combinatorial Semi-Bandits

06/01/2020 ∙ by Xiaoyu Chen, et al.

In this paper, we study Combinatorial Semi-Bandits (CSB), an extension of the classic Multi-Armed Bandits (MAB) problem, under Differential Privacy (DP) and the stronger Local Differential Privacy (LDP) setting. Since the server receives more information from users in CSB than in MAB, privacy protection usually incurs an additional dependence on the dimension of the data, a notorious side-effect of privacy-preserving learning. However, for CSB under two common smoothness assumptions, we show it is possible to remove this side-effect. In detail, for B_∞-bounded smooth CSB under either ε-LDP or ε-DP, we prove the optimal regret bound is Θ(mB_∞²ln T/(Δε²)) or Θ̃(mB_∞²ln T/(Δε)) respectively, where T is the time horizon, Δ is the gap of rewards, and m is the number of base arms, by proposing novel algorithms and matching lower bounds. For B_1-bounded smooth CSB under ε-DP, we also prove the optimal regret bound is Θ̃(mKB_1²ln T/(Δε)) with both upper and lower bounds, where K is the maximum amount of feedback in each round. All the above results nearly match the corresponding non-private optimal rates, which implies there is no additional price for (locally) differentially private CSB in these common settings.




1 Introduction

Stochastic Multi-Armed Bandits (MAB) (Bubeck et al., 2012) is a fundamental problem in machine learning with wide applications in the real world. In stochastic MAB, there is an unknown underlying distribution over the rewards of m base arms, and a learner (also called a server) interacts with the environment for T rounds. At each round, the environment draws random rewards for the base arms from the distribution. At the same time, the learner chooses one of the m base arms based on previously collected information, and receives the reward of the chosen arm. The goal of the learner is to minimize the regret, measured as the expected difference between the reward of the best fixed base arm and the learner's total reward. Multi-Armed Bandits have been used in recommendation systems, clinical trials, etc. However, many of these applications rely heavily on users' sensitive data, which raises great concerns about data privacy. For example, in recommendation systems, the observations at each round represent a user's preferences over the recommended item set, which is personal information and should be protected.

Since first proposed in 2006, Differential Privacy (DP) (Dwork et al., 2006) has become a gold standard in privacy-preserving machine learning (Dwork and Roth, 2014). We say an algorithm preserves differential privacy if there is not much difference between the outputs of this algorithm over two datasets with Hamming distance 1 (see Section 2 for the rigorous definition in the streaming setting). For ε-differentially private stochastic Multi-Armed Bandits, there have already been extensive studies (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016; Sajed and Sheffet, 2019). Based on the classic non-private optimal UCB algorithm (Auer et al., 2002), as well as the tree-based aggregation technique for computing private sums (Dwork et al., 2010), both Mishra and Thakurta (2015) and Tossou and Dimitrakakis (2016) designed algorithms under a DP guarantee but with sub-optimal regret (in fact, Tossou and Dimitrakakis (2016) achieved a better utility bound, but under a weaker privacy guarantee than the standard streaming definition of differential privacy). Recently, Sajed and Sheffet (2019) proposed an algorithm based on non-private Successive Elimination (Even-Dar et al., 2002) and the sparse vector technique (Dwork and Roth, 2014) that achieves the optimal regret bound, where Δ denotes the minimum gap of rewards; it matches both the non-private lower bound (Lai and Robbins, 1985) and the differentially private lower bound (Shariff and Sheffet, 2018) in common parameter regimes.

However, stochastic MAB is the simplest model for sequential decision making under uncertainty. Many real-world problems have a combinatorial nature among multiple arms and possibly non-linear reward functions, such as online advertising, online shortest path, and online social influence maximization, which can be modeled via Combinatorial Semi-Bandits (CSB) (Chen et al., 2013, 2016; Lattimore and Szepesvári, 2018). In CSB, the learner chooses a super arm, which is a set of base arms rather than a single base arm as in MAB, then observes the outcomes of the chosen base arms as feedback, and receives a reward determined by these outcomes. The reward can be a non-linear function of the observations. Since many applications modeled via CSB also face privacy-leakage issues, in this paper we study how to design private algorithms for Combinatorial Semi-Bandits under two common assumptions on non-linear rewards: B_∞-bounded smoothness and B_1-bounded smoothness (see Section 2 for definitions), which contain social influence maximization and linear CSB as important examples respectively (Kveton et al., 2015; Chen et al., 2016; Wang and Chen, 2017).

Main Difficulty: Compared with simple stochastic MAB, it is more difficult to design differentially private algorithms for CSB, due to its large action space and non-linear rewards. Though each super arm in CSB could be regarded as a base arm in stochastic MAB, a straightforward application of differentially private algorithms for stochastic MAB would lead to a dependence on the size of the decision set of super arms, which can be exponentially large in m. Besides these two differences, at each round we receive observations of a set of base arms contained in the chosen super arm, instead of a single base arm as in MAB. Denote the maximum cardinality of a super arm by K; the sensitive data collected at each round thus lies roughly in a K-dimensional ball.

However, protecting differential privacy usually causes an additional dependence on the dimension of the data in the utility guarantee compared with the corresponding non-private result, which is a notorious side-effect of DP, e.g., in differentially private empirical risk minimization (ERM) (Bassily et al., 2014), bandit linear optimization (Agarwal and Singh, 2017), and online and bandit convex optimization (Thakurta and Smith, 2013). On the one hand, in some cases such as differentially private ERM (Bassily et al., 2014), this additional dependence on the dimension is unavoidable. On the other hand, some works show it is possible to eliminate this side-effect given extra structure, such as restricted strong convexity, norm-bounded parameter sets, or generalized linear models with norm-bounded data (Kifer et al., 2012; Smith and Thakurta, 2013; Jain and Thakurta, 2014; Talwar et al., 2015). In general, it is unclear whether the dimension dependence brought by privacy protection can be eliminated, let alone in our CSB setting, which has none of the extra structure mentioned above.

Besides, compared with differential privacy, which allows the server to collect users' true data, local differential privacy (LDP) is a much stronger notion of privacy, which requires protecting data privacy before collection. Thus LDP is more practical and user-friendly than DP (Cormode et al., 2018). Intuitively, learning under an LDP guarantee is more difficult, as what we collect is already noisy. Moreover, eliminating the side-effect on the dimension is also harder under LDP, even with extra assumptions; for example, there are negative results for locally differentially private sparse mean estimation (Duchi et al., 2016).

Our Contributions: Given the above discussion, it seems hard to obtain nearly optimal regret for CSB under DP, let alone the much stronger LDP guarantee. Somewhat surprisingly, without any additional structural assumption such as sparsity, we show that it is indeed possible to achieve nearly optimal regret bounds, by designing private algorithms with theoretical upper bounds and proving corresponding lower bounds in each case. Our upper bounds (nearly) match both our private lower bounds and the non-private lower bounds (see Table 1 for an overview, where Δ is the gap defined in Section 3, O represents an upper bound, Θ represents matching upper and lower bounds, and the tilde in Θ̃ hides poly-logarithmic dependence). The main contributions of this paper are summarized as follows:

(1) For B_∞-bounded smooth CSB under ε-LDP and ε-DP, we propose novel algorithms with regret bounds O(mB_∞²ln T/(Δε²)) and Õ(mB_∞²ln T/(Δε)) respectively, and prove nearly matching lower bounds;

(2) For B_1-bounded smooth CSB under ε-DP, we propose an algorithm with regret bound Õ(mKB_1²ln T/(Δε)) and prove a nearly matching lower bound.

In Section 2, we provide background on Combinatorial Semi-Bandits and (Local) Differential Privacy. Then in Section 3 and Section 4, we study both upper and lower bounds for (locally) differentially private B_∞-bounded smooth and B_1-bounded smooth CSB respectively. Finally, we conclude our main results in Section 5.

Problem | ε-LDP | ε-DP | Non-Private Result
B_∞-Bounded Smooth CSB | Θ(mB_∞²ln T/(Δε²)) | Θ̃(mB_∞²ln T/(Δε)) | Θ(mB_∞²ln T/Δ) (Chen et al., 2016; Wang and Chen, 2017)
B_1-Bounded Smooth CSB | O(mK²B_1²ln T/(Δε²)) | Θ̃(mKB_1²ln T/(Δε)) | Θ(mKB_1²ln T/Δ) (Kveton et al., 2015; Wang and Chen, 2017)

Table 1: Summary of our results for private CSB. Θ (and Θ̃) represents matching upper and lower bounds; O represents an upper bound only. Our lower bound in the DP setting is actually in an additive form (see Theorem 9); here we write it in a multiplicative form for simplicity, which is natural in common parameter regimes.

1.1 Other Related Work

Besides differentially private stochastic MAB, there are also works considering adversarial MAB with DP guarantees (Thakurta and Smith, 2013; Tossou and Dimitrakakis, 2017; Agarwal and Singh, 2017). Later, Shariff and Sheffet (2018) studied contextual linear bandits under a relaxed definition of DP called Joint Differential Privacy. Compared with DP, bandit learning with LDP guarantees has received less attention: only Gajane et al. (2018) study stochastic MAB under LDP. Recently, Basu et al. (2019) investigated the relations among several variants of differential privacy in the MAB setting and proved some lower bounds. For non-private Combinatorial Semi-Bandits, there is an extensive line of work (György et al., 2007; Chen et al., 2013, 2016; Kveton et al., 2015; Combes et al., 2015; Wang and Chen, 2017, 2018).

2 Preliminaries

Now we detail the concrete setting studied in this paper.

2.1 Combinatorial Semi-Bandits

In Combinatorial Semi-Bandits (CSB), there are m base arms (denote [m] = {1, 2, ..., m}), and a predefined decision set F, each element of which is a subset of [m] with at most K base arms and is called a super arm or an action, i.e., |S| ≤ K for any S ∈ F, where |·| denotes the cardinality of a set. D is an underlying unknown distribution supported on [0,1]^m with expectation μ. There are T rounds in total. At each round t, the player chooses a super arm S_t ∈ F, and the environment draws a fresh random outcome X_t ∈ [0,1]^m from D, independently of all other variables. Then the player receives a reward R(S_t, X_t) and observes the feedback {X_{t,i}}_{i ∈ S_t}. We assume the reward function satisfies the following assumptions, which are common in either real applications or previous literature (Chen et al., 2016; Wang and Chen, 2018), such as linear CSB and social influence maximization.

Assumption 1.

There exists a function r such that E[R(S_t, X_t)] = r(S_t; μ) for any S_t ∈ F, where the expectation is over the randomness of the outcome X_t.

Under the above assumption, define opt(μ) = max_{S ∈ F} r(S; μ) as the optimal reward if we knew μ in advance.

Assumption 2 (B-bounded smoothness).

There exists a constant B such that for an arbitrary super arm S ∈ F and any two mean vectors μ, μ′ ∈ [0,1]^m, we have |r(S; μ) − r(S; μ′)| ≤ B·‖(μ − μ′)_S‖, where (·)_S represents the restriction of a vector to the coordinates in S.

Assumption 3 (Monotonicity).

For any μ, μ′ ∈ [0,1]^m such that μ ≤ μ′ (element-wise), we have r(S; μ) ≤ r(S; μ′) for any S ∈ F.

Intuitively, Assumptions 2 and 3 concern the smoothness and monotonicity of the expected reward function r, which are critical for dealing with non-linear rewards R.

In this paper, we mainly consider two norms: the ℓ1 norm ‖·‖_1 and the ℓ∞ norm ‖·‖_∞, with corresponding smoothness constants B_1 and B_∞. Important examples that satisfy B_∞-bounded smoothness include social influence maximization and the probabilistic maximum coverage bandit (Chen et al., 2013). For B_1-bounded smooth CSB, online shortest path and online maximum spanning tree are typical applications (Wang and Chen, 2018); obviously, linear combinatorial semi-bandits are B_1-bounded smooth. We regard B_1 and B_∞ as constants throughout the paper. Since ‖x‖_∞ ≤ ‖x‖_1, B_1-bounded smoothness is a weaker assumption than B_∞-bounded smoothness with the same constant, and we have the following fact:

Fact 1.

Suppose a reward function is B_∞-bounded smooth; then it is also B_1-bounded smooth with B_1 = B_∞. Conversely, suppose a reward function is B_1-bounded smooth; then it is also B_∞-bounded smooth with B_∞ = K·B_1.
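Both directions of Fact 1 follow from the elementary norm inequalities ‖x‖_∞ ≤ ‖x‖_1 ≤ K‖x‖_∞ for vectors supported on at most K coordinates. A quick numerical sanity check (the vector and sizes below are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
K, m = 5, 20
x = np.zeros(m)
support = rng.choice(m, size=K, replace=False)
x[support] = rng.uniform(-1, 1, size=K)   # vector with at most K nonzeros

l1, linf = np.abs(x).sum(), np.abs(x).max()
assert linf <= l1 <= K * linf             # drives both directions of Fact 1
```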

For many combinatorial problems, such as MAX-CUT and Minimum Weighted Set Cover, only efficient approximation algorithms are known. Therefore, it is natural to model the offline solver as a general approximation oracle, defined below:

Definition 1.

For some α, β ∈ (0, 1], an (α, β)-approximation oracle is an oracle that takes an expectation vector μ as input and outputs a super arm S ∈ F such that Pr[r(S; μ) ≥ α·opt(μ)] ≥ β. Here α is the approximation ratio and β is the success probability of the oracle.

With an approximation oracle, we should consider the corresponding approximation regret, as we can only solve the offline problem approximately:

Definition 2.

The (α, β)-approximation regret of a CMAB algorithm after T rounds, using an (α, β)-approximation oracle under the expectation vector μ, is defined as Reg(T) = T·α·β·opt(μ) − E[Σ_{t=1}^{T} r(S_t; μ)].

2.2 (Local) Differential Privacy

Now we give definitions of DP and LDP, as well as a basic building block.

Definition 3 (Differential Privacy (Dwork et al., 2006; Jain et al., 2012)).

Let X = (x_1, ..., x_T) be a sequence of data with domain X^T, and let A(X) = (a_1, ..., a_T) be the outputs of the randomized algorithm A on input X. A is said to preserve ε-differential privacy if, for any two data sequences X, X′ that differ in at most one entry, and for any subset O of the output space, it holds that Pr[A(X) ∈ O] ≤ e^ε · Pr[A(X′) ∈ O].

Local Differential Privacy (LDP) is a stronger notion of privacy than DP; see Kasiviswanathan et al. (2011); Duchi et al. (2013). Since LDP requires each user's data to be privatized before collection, there is no need to define a corresponding streaming version. Here we adopt the LDP definition given by Bassily and Smith (2015).

Definition 4 (LDP).

A mechanism Q is said to be ε-locally differentially private, or ε-LDP, if for any inputs x, x′ and any (measurable) subset O of the output space, we have Pr[Q(x) ∈ O] ≤ e^ε · Pr[Q(x′) ∈ O].

To protect ε-LDP, the most commonly used method is the Laplacian mechanism. Suppose the output domain of an algorithm is bounded in a d-dimensional ℓ1 ball of radius r. The Laplacian mechanism simply injects a d-dimensional random noise vector into the true output, with each entry of the noise sampled independently from an appropriately scaled Laplace distribution; Lap(2r/ε) suffices, since the ℓ1 distance between any two points in the ball is at most 2r. (Lap(μ, b) denotes the Laplace distribution centered at μ with scale b; its p.d.f. is f(x) = (1/2b)·e^{−|x−μ|/b} and its variance is 2b².) It is easy to prove that the Laplacian mechanism guarantees ε-LDP (Dwork and Roth, 2014).
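As a concrete illustration, the mechanism takes only a few lines; the function name and interface below are ours, not from the paper:

```python
import numpy as np

def laplace_ldp(x, radius, eps, rng):
    """Release a vector x lying in an l1 ball of the given radius under eps-LDP.

    The l1 distance between any two possible inputs is at most 2 * radius,
    so adding independent Lap(2 * radius / eps) noise to every coordinate
    makes the output distributions of any two inputs e^eps-close.
    """
    scale = 2.0 * radius / eps
    return x + rng.laplace(loc=0.0, scale=scale, size=np.shape(x))
```

Note that the per-coordinate noise variance is 2(2r/ε)², so the injected noise grows with the radius of the domain; this is why shrinking the per-round message matters for the regret later on.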

3 B_∞-Bounded Smooth CSB with Privacy Guarantee

Since learning under LDP is much more difficult than under DP, we mainly consider how to design an optimal algorithm for B_∞-bounded smooth CSB under an ε-LDP guarantee. As we will see, given our observations for locally differentially private CSB, it is then easy to obtain results for differentially private CSB.

As a warm-up, we show that a simple mechanism achieves non-trivial regret with an LDP guarantee, but with a sub-optimal dependence on the dimension. We then design an improved version with an optimal utility bound; the matching lower bound is proved in Subsection 3.3.

3.1 A Straightforward Algorithm with Sub-Optimal Guarantee

Our private algorithm is based on the classic non-private CSB algorithm, Combinatorial UCB (CUCB) (Chen et al., 2013, 2016). Though the reward function is non-linear in the super arm and we only have access to an approximation oracle, which makes our setting more complicated than previous private stochastic MAB (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016; Sajed and Sheffet, 2019), we show that the most straightforward method, described in Algorithm 1, i.e., applying the Laplacian mechanism to each user's data before collection, is enough to guarantee LDP with a corresponding regret bound.

The key observation is that the mean estimate of each base arm lies at the core of the CUCB algorithm, and adding Laplacian noise to each observation causes additional variance in these estimates, which can be handled by relaxed upper confidence bounds. Injecting noise into the reward is used in both Tossou and Dimitrakakis (2017) and Agarwal and Singh (2017) for differentially private adversarial MAB. The idea of relaxed UCBs also appeared before in differentially private stochastic MAB (Mishra and Thakurta, 2015), whereas we study the more general locally differentially private CSB with non-linear rewards and an approximation oracle. Given the Laplacian mechanism, the privacy guarantee of Algorithm 1 is obvious:

1:  Input: privacy budget ε
2:  Initialize: counts and empirical means of all base arms set to 0.
3:  for t = 1, 2, ..., T do
4:     Compute the upper confidence bound of each base arm. (If a denominator is 0, the corresponding quantity is defined as +∞.)
5:     Play the super arm S_t returned by the oracle on the UCB vector; if some base arm has never been observed, play a super arm containing it instead.
6:     User t generates the outcome X_{t,i} for each i ∈ S_t, and sends the Laplacian-perturbed observations to the server.
7:     Server updates the count and empirical mean of each i ∈ S_t, and keeps the others unchanged.
8:  end for
Algorithm 1
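To make the scheme concrete, here is a minimal, self-contained sketch in the spirit of Algorithm 1. All names, the exploration constant, and the confidence-widening factor are our own illustrative choices, not the paper's exact specification; the structural point it demonstrates is that each user perturbs all K observed entries locally (noise scale growing with K) before a CUCB-style server update.

```python
import numpy as np

def ldp_cucb(oracle, draw_outcomes, m, K, T, eps, seed=0):
    """CUCB sketch where each user perturbs every observed entry.

    A round's data lives in a K-dimensional l1 ball of radius K, so
    per-entry Lap(2K/eps) noise gives eps-LDP for the whole message;
    the server widens its confidence radius to absorb the extra variance.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(m)
    sums = np.zeros(m)
    scale = 2.0 * K / eps                       # per-entry Laplace scale
    for t in range(1, T + 1):
        with np.errstate(divide="ignore", invalid="ignore"):
            width = np.sqrt(3.0 * np.log(t) / counts)
        ucb = sums / np.maximum(counts, 1) + (1.0 + scale) * width
        ucb[counts == 0] = np.inf               # force exploring unseen arms
        S = oracle(ucb)                         # super arm: array of arm indices
        for i, x in draw_outcomes(S, rng):      # user-side: observe and perturb
            counts[i] += 1
            sums[i] += x + rng.laplace(0.0, scale)
    return sums / np.maximum(counts, 1), counts
```

With a top-K oracle on a small linear instance, the optimal base arms end up pulled far more often than the suboptimal ones despite the local noise.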
Theorem 1.

Algorithm 1 guarantees ε-LDP.

Before stating the regret bound, we define some necessary notation. We say a super arm S is bad if r(S; μ) < α·opt(μ), and denote the set of bad super arms by S_B. For any base arm i, define Δ_i^min (resp. Δ_i^max) as the minimum (resp. maximum) sub-optimality gap α·opt(μ) − r(S; μ) over bad super arms S ∈ S_B containing i.

Now, we state the utility guarantee of Algorithm 1:

Theorem 2.

Under the B_∞-bounded smoothness and monotonicity assumptions, the regret of Algorithm 1 is upper bounded by O(mK²B_∞²ln T/(Δε²)).
Compared with the corresponding non-private CUCB, which achieves regret O(mB_∞²ln T/Δ) (Chen et al., 2013, 2016), one can see that the regret bound of Algorithm 1 has an extra multiplicative factor of K²/ε², which is the price we pay for protecting LDP. According to the lower bound proved in Subsection 3.3, the dependence on the privacy parameter ε is optimal. However, the additional factor K² brought by privacy protection is undesirable and will hurt the final performance for large K. In the next subsection, we show how to eliminate this additional factor.

3.2 An Improved Algorithm with the Best Guarantee

Previous studies eliminate the dimension side-effect of privacy protection under sparsity or low-complexity assumptions (Jain and Thakurta, 2014; Talwar et al., 2015; Zheng et al., 2017). In contrast, in our general CSB setting, the information at each round fills a K-dimensional ball and we have no sparsity assumption, which makes the additional factor seem unavoidable.

Somewhat surprisingly, after a careful analysis, we find that there is some redundant information even without any sparsity assumption. In detail, in the analysis of Algorithm 1, the instantaneous regret of choosing super arm S_t at round t is controlled by the largest mean-estimation error among all base arms in S_t. This implies that we do not need all the observations of base arms in S_t from user t to update the corresponding empirical means. Instead, we use only the observation of the least pulled base arm in S_t to update its empirical mean and keep the others unchanged, as it is the weakest arm in S_t and causes the largest estimation error. Since the user now sends only a single entry to the server, it is enough to add one-dimensional noise to protect it, which gets rid of the annoying additional factor in the regret guarantee. This variant is shown in Algorithm 2.

1:  Input: privacy budget ε
2:  Initialize: counts and empirical means of all base arms set to 0.
3:  for t = 1, 2, ..., T do
4:     Compute the upper confidence bound of each base arm.
5:     Play the super arm S_t returned by the oracle on the UCB vector; if some base arm has never been observed, play a super arm containing it instead.
6:     User t generates the outcome X_{t,i} for each i ∈ S_t, and sends only the Laplacian-perturbed observation of the least pulled base arm in S_t to the server.
7:     Server updates the count and empirical mean of that base arm, and keeps the others unchanged.
8:  end for
Algorithm 2
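The only change relative to Algorithm 1 is the user-side step, which can be sketched as follows (a hypothetical interface, not the paper's exact pseudocode): since a single observation lies in [0, 1], Lap(1/ε) noise on that one entry already gives ε-LDP, independently of K.

```python
import numpy as np

def ldp_update_least_pulled(S, counts, sums, draw_one, eps, rng):
    """User-side step: release only the least pulled arm's observation.

    One entry in [0, 1] has l1 sensitivity 1, so Lap(1/eps) noise suffices
    for eps-LDP; the noise scale no longer grows with K.
    """
    i = min(S, key=lambda a: counts[a])          # least pulled base arm in S
    x = draw_one(i, rng)                         # this user's observation of arm i
    counts[i] += 1
    sums[i] += x + rng.laplace(0.0, 1.0 / eps)   # single perturbed entry sent
    return i
```

All other entries of the user's outcome vector never leave the user, so the server's statistics for the remaining arms are untouched in that round.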

Again, the privacy guarantee follows directly from the classic Laplacian mechanism:

Theorem 3.

Algorithm 2 guarantees ε-LDP.

Since we significantly condense the information required from each user, from K observations down to one, we can inject much less noise and prove a much better regret bound than that of Algorithm 1:

Theorem 4.

Under the B_∞-bounded smoothness and monotonicity assumptions, the regret of Algorithm 2 is upper bounded by O(mB_∞²ln T/(Δε²)).
Compared with the non-private theoretical guarantee, Theorem 4 implies that we can achieve optimal regret for locally differentially private B_∞-bounded smooth CSB without any additional dimension-dependent price for privacy protection, which is a bit surprising given previous work on (locally) differentially private learning. See Section A of the supplementary materials for the proof of Theorem 4.

Multi-Armed Bandits (MAB) is a special case of CSB, where each super arm is a single base arm (K = 1) and the reward is the arm's outcome itself (B_∞ = 1). In this case, our Algorithms 1 and 2 coincide, and we obtain an algorithm for MAB under ε-LDP with regret bound O(Σ_{i ≠ i*} ln T/(Δ_i ε²)), where i* is the optimal base arm and Δ_i is the gap between the mean rewards of arm i and the optimal arm i*. This regret bound is also optimal, given the LDP lower bound proved in Basu et al. (2019) and the non-private lower bound (Bubeck et al., 2012).

Finally, if one wants to protect ε-DP rather than ε-LDP, based on the same observation as above, we can simply apply the tree-based aggregation technique (Dwork et al., 2010) to the least pulled base arm to compute its private empirical mean. Since tree-based aggregation injects much less noise than the local mechanism of Algorithm 2, it is not hard to prove that this variant achieves regret Õ(mB_∞²ln T/(Δε)) under ε-DP. (The proof is a combination of the techniques used in this subsection and those in Subsection 4.2, and is hence omitted.)

3.3 Lower Bounds

In this subsection, we prove a regret lower bound for the locally private CSB problem with B_∞-bounded smoothness. Like previous work (Kveton et al., 2015; Wang and Chen, 2017), we only consider lower bounds with an exact oracle, i.e., α = β = 1.

First we define a class of algorithms that we are interested in:

Definition 5.

An algorithm is called consistent if, for any suboptimal super arm S and any stochastic CSB instance, the expected number of times S is chosen is subpolynomial in T, i.e., o(T^a) for any a > 0.

Our lower bound is derived for the class of consistent algorithms, which is natural for stochastic CSB and has been used in lower bound analyses in many previous results (Lattimore and Szepesvári, 2018; Basu et al., 2019; Lai and Robbins, 1985; Kveton et al., 2015).

Our analysis focuses on CSB instances where the sub-optimality gaps of all super arms are equal. Since the general CSB problem is harder than CSB with equal sub-optimality gaps (the latter can be reduced to the former), our lower bound applies directly to the general CSB class, with the common gap Δ replaced by the corresponding per-arm gap for each base arm.

Theorem 5.

For any m, Δ, and ε in the appropriate range, the regret of any consistent ε-locally private algorithm on a CSB instance with B_∞-bounded smoothness and common gap Δ is lower bounded by Ω(mB_∞²ln T/(Δε²)).

The lower bound shows that Algorithm 2 achieves optimal regret with respect to all the parameters of the CSB instance. The proof of the theorem is an almost direct reduction from private MAB. A previous result (Theorem 2 in Basu et al. (2019)) shows that the regret of any consistent ε-locally private algorithm for MAB is at least Ω(Σ_i ln T/(Δ_i ε²)). Since any MAB instance is a special case of CSB with K = 1, the regret lower bound for stochastic CSB with B_∞ = 1 follows directly by reduction. For the general CSB problem with B_∞-bounded smoothness, we consider a similar instance with the reward of each arm in the MAB instance multiplied by B_∞. See Section B in the supplementary materials for the detailed analysis. For B_∞-bounded smooth CSB in the DP setting, using nearly the same technique, it is not hard to prove that the corresponding lower bound is Ω̃(mB_∞²ln T/(Δε)).

4 B_1-Bounded Smooth CSB with Privacy Guarantee

4.1 B_1-Bounded Smooth CSB under LDP

Though our proposed Algorithm 2 is already optimal for B_∞-bounded smooth CSB, if we use it for B_1-bounded smooth CSB (including the important linear CSB) to protect ε-LDP, we obtain a regret bound of order O(mK²B_1²ln T/(Δε²)) by Fact 1. However, the optimal non-private regret bound for B_1-bounded smooth CSB is Θ(mKB_1²ln T/Δ) (Kveton et al., 2015; Wang and Chen, 2017), which leaves a gap of a factor of K to our locally differentially private upper bound. Is it possible to eliminate this additional factor, just as in the B_∞-bounded smooth case? First, we prove a lower bound for B_1-bounded smooth CSB under an LDP guarantee. Our result under the B_1-bounded smoothness assumption applies to the linear CSB problem by setting B_1 = 1.

Theorem 6.

For any m, K, Δ, and ε in the appropriate range such that m/K is an integer, the regret of any consistent ε-locally private algorithm on a CSB instance satisfying B_1-bounded smoothness is lower bounded by Ω(mKB_1²ln T/(Δε²)).

We borrow the hard instance from Kveton et al. (2015) to prove the lower bound. Consider a K-path semi-bandit problem with m base arms. The feasible super arms are m/K disjoint paths, the i-th of which contains base arms {(i−1)K + 1, ..., iK}. The reward of pulling a super arm is B_1 times the sum of the weights of its base arms. The weights of the different base arms in the same super arm are identical, while the weights in different paths are sampled i.i.d. Denote the best super arm by S*. The weight of each base arm is a Bernoulli random variable whose mean depends on whether the arm lies on the optimal path, with the optimal path's mean larger by a margin corresponding to the gap Δ.

We use the general canonical bandit model (Lattimore and Szepesvári, 2018) to prove the above theorem. See Section C in the supplementary materials for the detailed proof.

Though we can only prove a lower bound of Ω(mKB_1²ln T/(Δε²)), of the same order as the corresponding non-private optimal guarantee (up to the 1/ε² factor), we conjecture that our lower bound is loose and that the right lower bound is Ω(mK²B_1²ln T/(Δε²)). In other words, there may indeed be a dimension-dependent side-effect on the utility guarantee if we insist on LDP. Intuitively, for B_1-bounded smooth CSB, we may have to update all the arms in a played super arm for the regret guarantee (instead of only one arm, as we did for B_∞-bounded smooth CSB), and this makes privacy protection harder by an extra factor of K.

Since differential privacy is a weaker notion than LDP, there is hope to further improve the regret bound if we only require DP. In the next two subsections, we show this is indeed the case, by designing an ε-differentially private algorithm with regret bound Õ(mKB_1²ln T/(Δε)) and proving a nearly matching lower bound.

4.2 Upper Bound under DP

Compared with LDP, under which the learning algorithm (the server) receives only noisy information, DP only restricts the output of the algorithm, and the server is allowed to collect true data. Thus, it is possible to inject much less noise in the DP setting via an economical allocation of the privacy budget ε.

We use the tree-based aggregation scheme (Dwork et al., 2009; Chan et al., 2011) to protect ε-DP in our algorithm. It is an effective method for releasing private running statistics over a data stream and is frequently used in previous work, such as stochastic MAB (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016) and online convex optimization (Thakurta and Smith, 2013). Consider a data stream σ = (σ_1, ..., σ_T). In each step t, the algorithm receives the data σ_t and needs to output the partial sum Σ_{s ≤ t} σ_s, while ensuring that the output sequence is ε-differentially private. The tree-based mechanism solves this problem elegantly with a binary tree: each leaf node stores the data received in one step, and each internal node stores the sum of the data in the leaves rooted at it. Notice that one only needs access to O(log t) nodes, and to sum the values on them, in order to output the partial sum at step t. Since each data item affects at most O(log T) nodes, adding i.i.d. Laplace noise of scale O(log T/ε) to each node ensures ε-differential privacy for the scheme, as stated in the following lemma:

Lemma 1 (Dwork et al. (2010); Chan et al. (2011)).

The tree-based aggregation scheme with i.i.d. Laplace noise of scale O(log T/ε) added to each node is ε-differentially private.
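A compact sketch of the tree-based aggregation scheme for a scalar stream (the class name and constants are ours; the `levels/eps` noise scale matches the per-node Laplace calibration described above):

```python
import numpy as np

class PrivateCounter:
    """Tree-based aggregation for eps-DP prefix sums over a length-T stream.

    Each item touches at most `levels` tree nodes, so Lap(levels/eps) noise
    per node gives eps-DP for the entire output sequence, while each
    released prefix sum contains only O(log T) noise terms.
    """

    def __init__(self, T, eps, seed=0):
        self.levels = int(np.ceil(np.log2(max(T, 2)))) + 1
        self.scale = self.levels / eps
        self.rng = np.random.default_rng(seed)
        self.node = {}   # (level, index) -> noise + sum of covered items
        self.t = 0

    def add_and_query(self, x):
        """Insert x_t, then return a private estimate of sum_{s<=t} x_s."""
        idx, self.t = self.t, self.t + 1
        for lvl in range(self.levels):           # path from leaf to root
            key = (lvl, idx >> lvl)
            if key not in self.node:             # noise drawn once per node
                self.node[key] = self.rng.laplace(0.0, self.scale)
            self.node[key] += x
        total, n = 0.0, self.t                   # decompose [0, t) into
        while n > 0:                             # <= `levels` dyadic blocks
            lvl = (n & -n).bit_length() - 1
            n -= 1 << lvl
            total += self.node[(lvl, n >> lvl)]
        return total
```

The query walks the binary representation of t, so both the update and the query touch only logarithmically many nodes.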

In our CSB setting, the leaf node of step t stores a vector with support of size at most K. Each internal node stores the sum of the vectors in the leaves rooted at it. For each node, we add i.i.d. Laplace noise to each dimension of the stored vector to guarantee ε-DP (see Algorithm 3). Based on Lemma 1, we have:

1:  Input: privacy budget ε.
2:  Initialize: tree-based aggregators; counts and empirical means of all base arms set to 0.
3:  for t = 1 to T do
4:     Compute the upper confidence bound of each base arm.
5:     Play the super arm S_t returned by the oracle on the UCB vector; if some base arm has never been observed, play a super arm containing it instead.
6:     User t generates the outcome X_{t,i} for each i ∈ S_t, and sends it to the server.
7:     Server updates the counts and noisy sums of the base arms in S_t via the tree-based aggregation scheme, and keeps the others unchanged.
8:  end for
Algorithm 3
Theorem 7.

Algorithm 3 guarantees ε-DP.

In Algorithm 3, when we estimate the mean weight of a base arm from its previous outcomes, additional Laplace noise is added to the sum of observations by the tree-based aggregation scheme. Note that the number of Laplace noise terms involved (the number of nodes accessed) is only logarithmic. This means that when a base arm has been pulled n times, the additional confidence width due to Laplace noise is only of order polylog(T)/(εn). Compared with the original confidence width for the sub-Gaussian sampling noise, which is of order √(ln t/n), the additional width for the Laplace noise enjoys a better dependence on n. This helps us separate the terms involving ε and Δ in the regret via a delicate analysis, and finally derive a nearly optimal bound in additive form.
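The comparison in the paragraph above can be made concrete with back-of-the-envelope widths (the constants below are illustrative, not the ones used in the actual analysis): the sampling term shrinks like √(ln T/n), while the tree-mechanism term shrinks like polylog(T)/(εn), so for large pull counts the privacy noise is negligible.

```python
import numpy as np

def confidence_widths(n, T=10**6, eps=1.0):
    """Illustrative widths for a mean estimated from n pulls at horizon T."""
    subgauss = np.sqrt(2.0 * np.log(T) / n)    # sampling (sub-Gaussian) term
    laplace = (np.log2(T) ** 1.5) / (eps * n)  # tree-mechanism Laplace term
    return subgauss, laplace
```

For example, at n = 10^5 with these constants the Laplace term is more than an order of magnitude below the sampling term, which is what allows the ε-dependent and Δ-dependent parts of the regret to separate additively.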

Theorem 8.

Under the B_1-bounded smoothness and monotonicity assumptions, the regret of Algorithm 3 is upper bounded by Õ(mKB_1²ln T/(Δε)).
We refer readers to Section D of the supplementary materials for the detailed proof. By relaxing LDP to DP, we have shown that it is possible to eliminate the dimension side-effect induced by privacy protection and nearly match the corresponding non-private optimal bound Θ(mKB_1²ln T/Δ).

4.3 Lower Bound under DP

In this subsection, we prove a regret lower bound for CSB algorithms under ε-DP. As in the LDP lower bound, we consider consistent CSB algorithms. The lower bound stated below implies that our Algorithm 3 achieves near-optimal regret up to logarithmic factors:

Theorem 9.

For any m, K, Δ, and ε in the appropriate range, the regret of any consistent CSB algorithm guaranteeing ε-DP is at least Ω(mKB_1²ln T/Δ + mKB_1 ln T/ε).

The theorem is proved in Section E of the supplementary materials; we only sketch the proof here. Previous results show that for non-private stochastic linear CSB, the regret lower bound is Ω(mK ln T/Δ). By slightly modifying the hard instance, we can show that the regret lower bound for non-private CSB with B_1-bounded smoothness is Ω(mKB_1²ln T/Δ). Since private CSB is strictly harder than non-private CSB (by reduction), this lower bound carries over to private CSB. Hence, we only need to prove that the regret lower bound for private CSB also contains an Ω(mKB_1 ln T/ε) term, from which the additive lower bound follows.

Now we sketch the proof of the privacy term. Note that a simple extension of Kveton et al. (2015) only yields a weaker bound in our differentially private setting, which is not satisfactory. It is thus necessary to construct a new hard instance to prove Theorem 9.

To solve this problem, we design the following CSB problem as a special case of general CSB with B_1-bounded smoothness. Suppose there are m base arms, each associated with a weight sampled from a Bernoulli distribution. These base arms are divided into three sets A_1, A_2 and A_3. A_1 contains K base arms, which build up the optimal super arm. A_2 contains K − 1 "public" base arms for the sub-optimal super arms; these arms are contained in every sub-optimal super arm. A_3 contains the remaining base arms; each of them, combined with the K − 1 "public" base arms in A_2, builds up a sub-optimal super arm. In total, we have |A_3| sub-optimal super arms and one optimal super arm. The mean of the Bernoulli random variable associated with each base arm is defined so that every sub-optimal super arm falls short of the optimal one by the same gap.

The weights of the base arms in A_2 are identical, while the other weights are sampled i.i.d. The reward of pulling a super arm S is B_1 times the sum of the weights of all base arms in S. As a result, the sub-optimality gap of each sub-optimal super arm is Δ. With the coupling argument of Karwa and Vadhan (2017), we can prove that, with high probability, the number of pulls of any sub-optimal super arm is at least Ω(KB_1 ln T/(εΔ)). Since there are Θ(m) sub-optimal super arms, we conclude that the regret lower bound for private CSB is Ω(mKB_1 ln T/ε).

5 Conclusion and Future Work

In this paper, we study (locally) differentially private algorithms for Combinatorial Semi-Bandits under two common assumptions on the reward function. For B_∞-bounded smooth CSB under ε-LDP and ε-DP, we show the optimal regrets of these two settings are Θ(mB_∞^2 ln T/Δε^2) and Θ̃(mB_∞^2 ln T/Δε) respectively, by proving lower bounds and designing (nearly) optimal private algorithms. For the relatively weaker B_1-bounded smooth CSB, if we are required to guarantee ε-DP instead of ε-LDP, we show the optimal regret is Θ̃(mKB_1^2 ln T/Δε), and give a differentially private algorithm as well as a nearly matching lower bound. Moreover, the optimal regret of our (locally) differentially private CSB is of nearly the same order as in the non-private setting (Kveton et al., 2015; Chen et al., 2016; Wang and Chen, 2017).

Our Algorithm 2 also applies to locally private CSB with B_1-bounded smoothness, with a regret upper bound of in this setting. However, the regret lower bound we prove is only . We conjecture that our lower bound is loose and that Algorithm 2 is also near-optimal for locally private CSB with B_1-bounded smoothness. Improving the lower bound is an interesting open problem for future work.
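The paper's Algorithm 2 is not reproduced in this extract, but the local-privacy side of such an algorithm can be sketched with the standard per-coordinate Laplace mechanism: each user perturbs the observed base-arm rewards before sending them to the server. This is a generic LDP sketch under our own assumptions (rewards in [0, 1], budget split uniformly across the at most K observed coordinates, giving per-coordinate scale K/ε), not necessarily the paper's exact mechanism.

```python
import numpy as np

def privatize_feedback(rewards, eps, rng):
    """Perturb each observed base-arm reward with Laplace noise before it
    leaves the user (generic LDP sketch; assumes rewards lie in [0, 1]).

    Splitting the budget eps across the k observed coordinates gives
    per-coordinate scale k/eps, so the whole feedback vector is eps-LDP
    by basic composition. The noise is zero-mean, so server-side averages
    of many users' reports remain unbiased estimates of the true means.
    """
    rewards = np.asarray(rewards, dtype=float)
    k = rewards.size
    noise = rng.laplace(loc=0.0, scale=k / eps, size=k)
    return rewards + noise

rng = np.random.default_rng(1)
noisy = privatize_feedback([0.2, 0.7, 1.0], eps=1.0, rng=rng)
```

The server can then feed the (noisy but unbiased) per-arm averages into a UCB-style index with an enlarged confidence radius that accounts for the Laplace noise.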

6 Acknowledgements

This work is supported by the National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026), BJNSF (L172037), and the Beijing Academy of Artificial Intelligence.


  • N. Agarwal and K. Singh (2017) The price of differential privacy for online learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 32–40. Cited by: §1.1, §1, §3.1.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2-3), pp. 235–256. Cited by: §1.
  • R. Bassily, A. Smith, and A. Thakurta (2014) Private empirical risk minimization: efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pp. 464–473. Cited by: §1.
  • R. Bassily and A. Smith (2015) Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pp. 127–135. Cited by: §2.2.
  • D. Basu, C. Dimitrakakis, and A. Tossou (2019) Differential privacy for multi-armed bandits: what is it and what is its cost?. arXiv preprint arXiv:1905.12298. Cited by: §1.1, §3.2, §3.3, §3.3, §B.
  • S. Bubeck, N. Cesa-Bianchi, et al. (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning 5 (1), pp. 1–122. Cited by: §1, §3.2.
  • T. H. Chan, E. Shi, and D. Song (2011) Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC) 14 (3), pp. 1–24. Cited by: §4.2, Lemma 1.
  • W. Chen, Y. Wang, Y. Yuan, and Q. Wang (2016) Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research 17 (1), pp. 1746–1778. Cited by: (Locally) Differentially Private Combinatorial Semi-Bandits, §1.1, Table 1, §1, §2.1, §3.1, §3.1, §5.
  • W. Chen, Y. Wang, and Y. Yuan (2013) Combinatorial multi-armed bandit: general framework and applications. In International Conference on Machine Learning, pp. 151–159. Cited by: §1.1, §1, §2.1, §3.1, §3.1.
  • R. Combes, M. S. T. M. Shahi, A. Proutiere, et al. (2015) Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pp. 2116–2124. Cited by: §1.1.
  • G. Cormode, S. Jha, T. Kulkarni, N. Li, D. Srivastava, and T. Wang (2018) Privacy at scale: local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data, pp. 1655–1658. Cited by: §1.
  • J. Duchi, M. J. Wainwright, and M. I. Jordan (2013) Local privacy and minimax bounds: sharp rates for probability estimation. In Advances in Neural Information Processing Systems, pp. 1529–1537. Cited by: §2.2.
  • J. Duchi, M. Wainwright, and M. Jordan (2016) Minimax optimal procedures for locally private estimation. arXiv preprint arXiv:1604.02390. Cited by: §1, Lemma 3.
  • C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In Theory of cryptography, Berlin, Germany, pp. 265–284. Cited by: §1, Definition 3.
  • C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum (2010) Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 715–724. Cited by: §1, §3.2, Lemma 1.
  • C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan (2009) On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pp. 381–390. Cited by: §4.2.
  • C. Dwork and A. Roth (2014) The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9 (3–4), pp. 211–407. Cited by: §1, §2.2.
  • E. Even-Dar, S. Mannor, and Y. Mansour (2002) PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pp. 255–270. Cited by: §1.
  • P. Gajane, T. Urvoy, and E. Kaufmann (2018) Corrupt bandits for preserving local privacy. In Algorithmic Learning Theory, pp. 387–412. Cited by: §1.1.
  • A. György, T. Linder, G. Lugosi, and G. Ottucsák (2007) The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research 8 (Oct), pp. 2369–2403. Cited by: §1.1.
  • P. Jain, P. Kothari, and A. Thakurta (2012) Differentially private online learning. In Conference on Learning Theory, pp. 24–1. Cited by: Definition 3.
  • P. Jain and A. G. Thakurta (2014) (Near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pp. 476–484. Cited by: §1, §3.2.
  • V. Karwa and S. Vadhan (2017) Finite sample differentially private confidence intervals. arXiv preprint arXiv:1711.03908. Cited by: §4.3, §E.
  • S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith (2011) What can we learn privately?. SIAM Journal on Computing 40 (3), pp. 793–826. Cited by: §2.2.
  • D. Kifer, A. Smith, and A. Thakurta (2012) Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research 1 (41), pp. 3–1. Cited by: §1.
  • B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari (2015) Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pp. 535–543. Cited by: (Locally) Differentially Private Combinatorial Semi-Bandits, §1.1, Table 1, §1, §3.3, §3.3, §4.1, §4.1, §4.3, §5, §C, §E, §E, §E.
  • T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: §1, §3.3.
  • T. Lattimore and C. Szepesvári (2018) Bandit algorithms. preprint. Cited by: §1, §3.3, §4.1, §C.
  • T. Lattimore and C. Szepesvári (2019) An information-theoretic approach to minimax regret in partial monitoring. arXiv preprint arXiv:1902.00470. Cited by: §C.
  • N. Mishra and A. Thakurta (2015) (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 592–601. Cited by: §1, §3.1, §3.1, §4.2.
  • T. Sajed and O. Sheffet (2019) An optimal private stochastic-mab algorithm based on optimal private stopping rule. In International Conference on Machine Learning, pp. 5579–5588. Cited by: §1, §3.1.
  • R. Shariff and O. Sheffet (2018) Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems, pp. 4296–4306. Cited by: §1.1, §1, §E.
  • A. Smith and A. Thakurta (2013) Differentially private model selection via stability arguments and the robustness of the lasso. J Mach Learn Res Proc Track 30, pp. 819–850. Cited by: §1.
  • K. Talwar, A. Thakurta, and L. Zhang (2015) Nearly optimal private lasso. In Advances in Neural Information Processing Systems, pp. 3025–3033. Cited by: §1, §3.2.
  • A. G. Thakurta and A. Smith (2013) (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems, pp. 2733–2741. Cited by: §1.1, §1, §4.2.
  • A. C. Y. Tossou and C. Dimitrakakis (2017) Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §1.1, §3.1.
  • A. C. Tossou and C. Dimitrakakis (2016) Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §1, §3.1, §4.2, footnote 1.
  • Q. Wang and W. Chen (2017) Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In Advances in Neural Information Processing Systems, pp. 1161–1171. Cited by: §1.1, Table 1, §1, §3.3, §4.1, §5.
  • S. Wang and W. Chen (2018) Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5101–5109. Cited by: §1.1, §2.1, §2.1.
  • K. Zheng, W. Mou, and L. Wang (2017) Collect at once, use effectively: making non-interactive locally private learning possible. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 4130–4139. Cited by: §3.2.


A Proof of Theorem 4

Theorem 4.

(Restate) For Algorithm 2, we have


Let denote the event that the oracle fails to produce an -approximate answer with respect to the input vector in step . We have , so the expected number of times happens is at most . The cumulative regret in these steps is at most

Now we consider only the steps in which does not happen. We maintain counters in the proof and denote their values in step by . The initialization of is the same as that of , i.e. . In step , if does not happen and the oracle selects a sub-optimal super arm, we increment one counter by one, i.e. , where ; otherwise we keep the counters unchanged. This implies . Notice that if a sub-optimal super arm is pulled in step , exactly one counter is incremented, and . As a result, we have:


Here denotes the sub-optimality gap at the step in which is incremented from to .
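The counter bookkeeping above can be mirrored in a few lines of code: since each sub-optimal pull increments exactly one counter, the total regret equals the gap-weighted sum of final counter values. The helper below is a simplified per-super-arm illustration of ours, not code from the paper.

```python
from collections import defaultdict

def regret_from_counters(pulls, gaps, optimal):
    """Mirror the proof's decomposition: each sub-optimal pull increments
    exactly one counter, so total regret = sum of gap * counter value.
    `pulls` is the sequence of chosen super arms, `gaps[a]` the
    sub-optimality gap of arm a (illustrative helper)."""
    counters = defaultdict(int)
    for arm in pulls:
        if arm != optimal:
            counters[arm] += 1
    return sum(gaps[a] * n for a, n in counters.items())

# toy run: arm 0 is optimal, arms 1 and 2 are sub-optimal
regret = regret_from_counters([0, 1, 2, 1, 0], {1: 0.3, 2: 0.5}, optimal=0)
# regret = 2 * 0.3 + 1 * 0.5 = 1.1
```

Bounding the regret thus reduces to bounding how large each counter can grow, which is what the rest of the proof does.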

Now we only need to bound and . For a fixed step and a fixed base arm , we denote the following event by :

The noise in comes from two parts: the Laplace noise added for privacy and the randomness of . For the first part, by Bernstein's inequality applied to i.i.d. Laplace random variables, the confidence bound is with probability at least . For the second part, since is bounded, the confidence bound is with probability at least by Hoeffding's inequality. Hence happens with probability . By a union bound over all steps, happens for all and with probability . We denote this event by .

Suppose happens; then . If a sub-optimal arm is pulled in step , we have


The first inequality is due to monotonicity and the -bounded smoothness assumption. The second holds because the oracle returns satisfying . The third is due to the definition of and the concentration bound for . The last is due to .

Define . If for any , we have by Eq. (A). On the other hand, by the definition of , , which leads to a contradiction. This means that if a sub-optimal arm is pulled in step and contains base arm , the counter is at most