The well-known stochastic $K$-armed bandit problem [Thompson1933, Robbins and others1952] involves an agent sequentially choosing among a set of arms $\mathcal{A} = \{1, \ldots, K\}$ and obtaining a sequence of scalar rewards $\{r_t\}$, such that, if the agent’s action at time $t$ is $a_t = i$, then it obtains a reward $r_t$ drawn from some distribution $P_i$ with expectation $\mu_i$. The goal of the decision maker is to draw arms so as to maximize the total reward obtained.
This problem is a model for many applications that require trading off exploration and exploitation. The trade-off arises because we only see the reward of the arm we pull. An example is clinical trials, where arms correspond to different treatments or tests, and the goal can be to maximise the number of cured patients over time while being uncertain about the effects of treatments. Other problems, such as search engine advertisement and movie recommendations, can be formalised similarly [Pandey and Olston2006].
It has been previously noted [Jain, Kothari, and Thakurta2012, Thakurta and Smith2013, Mishra and Thakurta2015, Zhao et al.2014] that privacy is an important consideration for many multi-armed bandit applications. Indeed, privacy can be easily violated by observing changes in the prediction of the bandit algorithm. This has been demonstrated for recommender systems such as Amazon by [Calandrino et al.2011] and for user-targeted advertising such as Facebook by [Korolova2010]. In both cases, with a moderate amount of side information and by tracking changes in the output of the system, it was possible to learn private information of any targeted user.
Differential privacy (DP) [Dwork2006] provides an answer to this privacy issue by making the output of an algorithm almost insensitive to any single user's information. That is, no matter what side information is available to an outside observer, the observer learns almost nothing more about a user from the outputs released by the algorithm than they already knew. This goal is achieved by formally bounding the loss in privacy through the use of two parameters $(\epsilon, \delta)$, as shown in Definition 2.1.
For bandit problems, differential privacy implies that the actions taken by the bandit algorithm do not reveal information about the sequence of rewards obtained. In the context of clinical trials and diagnostic tests, it guarantees that even an adversary with arbitrary side information, such as the identity of each patient, cannot learn anything from the output of the learning algorithm about patient history, condition, or test results.
1.1 Related Work
Differential privacy (DP) was introduced by [Dwork et al.2006]; a good overview is given in [Dwork and Roth2013]. While the initial focus in DP was on static databases, interest in its relation to online learning problems has increased recently. In the full information setting, [Jain, Kothari, and Thakurta2012] obtained differentially private algorithms with near-optimal bounds. In the bandit setting, [Thakurta and Smith2013] were the first to present a differentially private algorithm, for the adversarial case, while [Zhao et al.2014] present an application to smart grids in this setting. Then, [Mishra and Thakurta2015] provided a differentially private algorithm for the stochastic bandit problem. Their algorithms are based on two non-private stochastic bandit algorithms: Upper Confidence Bound (UCB, [Auer, Cesa-Bianchi, and Fischer2002]) and Thompson sampling [Thompson1933]. Their results are sub-optimal: although simple index-based algorithms achieving $O(\log T)$ regret exist [Burnetas and Katehakis1996, Auer, Cesa-Bianchi, and Fischer2002], these differentially private algorithms incur additional poly-log terms in time $T$, as well as further linear terms in the number of arms, compared to the non-private optimal regret $O(\log T)$.
We provide a significantly different and improved UCB-style algorithm whose regret only adds a constant, privacy-dependent term to the optimal. We also improve upon previous algorithms by relaxing the need to know the horizon ahead of time, and as a result we obtain a uniform bound. Finally, we also obtain significantly improved bounds for a variant of the original algorithm of [Mishra and Thakurta2015], by using a different proof technique and confidence intervals. Note that, as in their result, we only make distributional assumptions on the data for the regret analysis. To ensure privacy, our algorithms do not make any assumption on the data. We summarize our contributions in the next section.
1.2 Our Contributions
We present a novel differentially private algorithm (DP-UCB-Int) in the stochastic bandit setting that is almost optimal and adds only an additive constant term (depending on the privacy parameter) to the regret of the optimal non-private version. Previous algorithms incurred large multiplicative factors relative to the optimal.
We also provide an incremental but important improvement to the regret of existing differentially private algorithms in the stochastic bandit setting, using the same family of algorithms as previously presented in the literature. This is done by using a simpler confidence bound and a more sophisticated proof technique. These bounds are achieved by the DP-UCB-Bound and DP-UCB algorithms.
We present the first set of differentially private algorithms in the bandit setting that are unbounded and do not require knowledge of the horizon $T$. Furthermore, all our regret analyses hold for any time step $t$.
2.1 Multi-Armed Bandit
The well-known stochastic $K$-armed bandit problem [Thompson1933, Lai and Robbins1985, Auer, Cesa-Bianchi, and Fischer2002] involves an agent sequentially choosing among a set of $K$ arms $\mathcal{A} = \{1, \ldots, K\}$. At each time step $t$, the player selects an action $a_t \in \mathcal{A}$ and obtains a reward $r_t \in [0, 1]$. The reward $r_t$ is drawn from some fixed but unknown distribution $P_{a_t}$ such that $\mathbb{E}(r_t \mid a_t = i) = \mu_i$. The goal of the decision maker is to draw arms so as to maximize the total reward obtained after $T$ interactions. An equivalent notion is to minimize the total regret against an agent who knew the arm with the maximum expectation before the game starts and always plays it. This is defined by:
$$\mathcal{R} \triangleq T\mu_* - \mathbb{E}_\pi \sum_{t=1}^{T} r_t,$$
where $\mu_* \triangleq \max_{a \in \mathcal{A}} \mu_a$ is the mean reward of the optimal arm and $\pi(a_t \mid a_{1:t-1}, r_{1:t-1})$ is the policy of the decision maker, defining a probability distribution on the next action $a_t$ given the history of previous actions $a_{1:t-1}$ and rewards $r_{1:t-1}$. Our goal is to bound the regret uniformly over $T$.
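To make the regret concrete, it can be evaluated for a single run from the arm means and the number of times each arm was pulled. The sketch below is illustrative only: the function name and the example numbers are assumptions, and it computes the pseudo-regret for fixed pull counts rather than the expectation over the policy.

```python
from typing import Sequence


def expected_regret(means: Sequence[float], pulls: Sequence[int]) -> float:
    """Pseudo-regret T * mu_star - sum_a n_a * mu_a for given pull counts."""
    if len(means) != len(pulls):
        raise ValueError("means and pulls must have the same length")
    mu_star = max(means)   # mean reward of the optimal arm
    horizon = sum(pulls)   # total number of interactions T
    return horizon * mu_star - sum(n * mu for n, mu in zip(pulls, means))


# Two arms with means 0.9 and 0.6; the weaker arm is pulled 10 times out of 100.
print(expected_regret([0.9, 0.6], [90, 10]))
```

Only pulls of suboptimal arms contribute, each weighted by its gap to the best mean.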
2.2 Differential Privacy
Differential privacy was originally proposed by [Dwork2006] as a way to formalise the amount of information about the input of an algorithm that is leaked to an adversary observing its output, no matter what the adversary’s side information is. In the context of our setup, the algorithm’s input is the sequence of rewards, and its output is the sequence of actions. Consequently, we use the following definition of differentially private bandit algorithms.
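Definition 2.1 ($(\epsilon, \delta)$-differentially private bandit algorithm).
A bandit algorithm $\pi$ is $(\epsilon, \delta)$-differentially private if for all sequences $r_{1:t-1}$ and $r'_{1:t-1}$ that differ in at most one time step, we have for all $S \subseteq \mathcal{A}$:
$$\pi(a_t \in S \mid a_{1:t-1}, r_{1:t-1}) \le \pi(a_t \in S \mid a_{1:t-1}, r'_{1:t-1})\, e^{\epsilon} + \delta,$$
where $\mathcal{A}$ is the set of actions. When $\delta = 0$, the algorithm is said to be $\epsilon$-differentially private.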
Intuitively, this means that changing any reward $r_t$ for a given arm will not change too much the best arm released at time $t+1$ or later on. If each $r_t$ is private information or a point associated with a single individual, then the definition above means that the presence or absence of that individual will not affect too much the output of the algorithm. Hence, the algorithm will not reveal any extra information about this individual, which leads to privacy protection. The privacy parameters $(\epsilon, \delta)$ determine the extent to which an individual entry affects the output; lower values of $(\epsilon, \delta)$ imply higher levels of privacy.
A natural way to obtain privacy is to add noise, such as Laplace noise, to the output of the algorithm. The main challenge is to obtain the maximum privacy while adding as little noise as possible. This leads to a trade-off between privacy and utility. In this paper, we demonstrate how to optimally trade off these two notions.
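As an illustration of noise addition, the following minimal sketch releases a sum of values with Laplace noise calibrated to its sensitivity. It assumes every value lies in $[0, 1]$; the function names are hypothetical and this is the generic Laplace mechanism, not one of the algorithms of this paper.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(values, epsilon: float, rng: random.Random) -> float:
    """Release sum(values) with epsilon-DP, assuming every value lies in [0, 1]."""
    sensitivity = 1.0  # changing one value in [0, 1] moves the sum by at most 1
    return sum(values) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
rewards = [1.0, 0.0, 1.0, 1.0]
# A smaller epsilon (more privacy) means a larger noise scale and a less useful answer.
print(private_sum(rewards, epsilon=1.0, rng=rng))
print(private_sum(rewards, epsilon=0.1, rng=rng))
```

The noise scale is the sensitivity divided by $\epsilon$, which is the source of the privacy versus utility trade-off described above.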
2.3 Hybrid Mechanism
The hybrid mechanism is an online algorithm used to continually release the sum of some statistics while preserving differential privacy. More formally, there is a stream of statistics $\sigma_1, \sigma_2, \ldots$ with each $\sigma_i$ in $[0, 1]$. At each time step $t$ a new statistic $\sigma_t$ is given. The goal is to output the partial sum ($\sum_{i=1}^{t} \sigma_i$) of the statistics from time step 1 to $t$ without compromising the privacy of the statistics. In other words, we wish to find a randomised mechanism that is $\epsilon$-differentially private.
The hybrid mechanism solves this problem by combining the Logarithm and Binary Noisy Sum mechanisms. Whenever $t = 2^k$ for some integer $k$, it uses the Logarithm mechanism to release a noisy sum by adding Laplace noise of scale . It then builds a binary tree $B$, which is used to release noisy sums until $t = 2^{k+1}$ via the Binary mechanism. This uses the leaf nodes of $B$ to store the inputs $\sigma_i$, while all other nodes store partial sums, with the root containing the sum from $2^k$ to $2^{k+1} - 1$. Since the tree depth is logarithmic, there is only a logarithmic amount of noise added for any given sum, more specifically Laplace noise of scale and mean $0$ which is denoted by .
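The tree-based idea can be sketched with the Binary mechanism alone, which is one building block of the hybrid mechanism. The sketch below is a simplified illustration, assuming a stream of known length with values in $[0, 1]$; the function names and the exact noise calibration (number of tree levels divided by $\epsilon$) are choices made for this example, and it is not the full hybrid mechanism.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def binary_mechanism(stream, epsilon: float, rng: random.Random):
    """Return a noisy prefix sum for every position of `stream` (values in [0, 1])."""
    horizon = len(stream)
    levels = max(1, horizon.bit_length())  # bits needed to write any t <= horizon
    scale = levels / epsilon               # each value enters at most `levels` partial sums
    exact = [0.0] * levels                 # exact partial sums, one per bit position
    noisy = [0.0] * levels                 # their noisy counterparts
    outputs = []
    for t, value in enumerate(stream, start=1):
        i = (t & -t).bit_length() - 1      # position of the lowest set bit of t
        exact[i] = sum(exact[:i]) + value  # merge the lower-level partial sums
        for j in range(i):
            exact[j] = 0.0
            noisy[j] = 0.0
        noisy[i] = exact[i] + laplace_noise(scale, rng)
        # The prefix sum up to t is covered by the partial sums at the set bits of t.
        outputs.append(sum(noisy[j] for j in range(levels) if (t >> j) & 1))
    return outputs


print(binary_mechanism([1.0, 0.0, 1.0, 1.0, 0.5], epsilon=1.0, rng=random.Random(0)))
```

Each released sum combines at most a logarithmic number of noisy partial sums, which is why the error grows only poly-logarithmically with the stream length.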
[Chan, Shi, and Song2010] prove that the hybrid mechanism is $\epsilon$-differentially private for any $n$, where $n$ is the number of statistics seen so far. They also show that with probability at least , the error in the released sum is upper bounded by . In this paper, we derive and use a tighter bound for this same mechanism (see Appendix in Supplementary Material) which is:
3 Private Stochastic Multi-Armed Bandits
We describe here the general technique used by our algorithms to obtain differential privacy. Our algorithms are based on the non-private UCB algorithm by [Auer, Cesa-Bianchi, and Fischer2002]. At each time step, UCB bases its action on an optimistic estimate of the expected reward of each arm. This estimate is the sum of the empirical mean and an upper confidence bound equal to $\sqrt{2 \log t / n_{a,t}}$, where $t$ is the time step and $n_{a,t}$ the number of times arm $a$ has been played up to time $t$. We can observe that the only quantity using the value of the reward is the empirical mean. To achieve differential privacy, it is enough to make the player base its action on differentially private empirical means for each arm. This is so because, once the mean of each arm is computed, the action which will be played is a deterministic function of the means. In particular, we can see the differentially private mechanism as a black box, which keeps track of the vector of non-private empirical means for the player, and outputs a vector of private empirical means. This is then used by the player to select an action, as shown in Figure 1.
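The black-box view can be sketched as follows: the player receives a vector of (already privatised) means and applies the usual UCB rule to it. This is a minimal illustration that assumes the standard UCB1 exploration bonus $\sqrt{2 \log t / n_a}$ of [Auer, Cesa-Bianchi, and Fischer2002]; the function name and arguments are hypothetical.

```python
import math


def ucb_action(private_means, pulls, t):
    """Pick the arm maximising private mean + sqrt(2 log t / n_a).

    `private_means` is assumed to come from a differentially private mechanism;
    this function never touches the raw rewards.
    """
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm  # play every arm once before using the index
    indices = [
        mean + math.sqrt(2.0 * math.log(t) / n)
        for mean, n in zip(private_means, pulls)
    ]
    return max(range(len(indices)), key=indices.__getitem__)


print(ucb_action([0.5, 0.7], [10, 10], t=20))
```

Because the chosen action is a deterministic function of the private means and the pull counts, privacy of the means carries over to the actions.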
We provide three different algorithms that use different techniques to privately compute the mean and calculate the index of each arm. The first, DP-UCB-Bound, employs the hybrid mechanism to compute a private mean and then adds a suitable term to the confidence bound to take into account the additional uncertainty due to privacy. The second, DP-UCB, employs the same mechanism, but in such a way that all arms have the same privacy-induced uncertainty; consequently, the algorithm uses the same index as standard UCB. The final one, DP-UCB-Int, employs a mechanism that only releases a new mean once at the beginning of each interval. This allows us to obtain the optimal regret rate.
3.1 The DP-UCB-Bound Algorithm
DP-UCB-Bound (Algorithm 1) computes the sum of the rewards of each arm with the hybrid mechanism. However, the number and the variance of the Laplace noise terms added by the hybrid mechanism increase as we keep pulling an arm. This means that the sums of different arms receive different amounts of noise, which can be larger than the original confidence bound used by UCB. This makes it difficult to identify the best arms. To solve this issue, we add a tight upper bound, defined in equation (2.2), on the noise added by the hybrid mechanism.
Theorem 3.1. Algorithm 1 is $\epsilon$-differentially private after any number $t$ of plays.
Proof. This follows directly from the fact that the hybrid mechanism is $\epsilon$-DP after any number $t$ of plays and that a single flip of one reward in the sequence of rewards affects only one mechanism. Furthermore, the whole algorithm is a random mapping from the output of the hybrid mechanism to the action taken, and using Proposition 2.1 of [Dwork and Roth2013] completes the proof. ∎
Theorem 3.2. If Algorithm 1 is run with $K$ arms having arbitrary reward distributions, then its expected regret after any number $t$ of plays is bounded by:
for any such that where , …, are the expected values of , …, and .
Proof sketch. We use the bound on the hybrid mechanism defined in equation (2.2) together with the union and Chernoff-Hoeffding bounds. We then select the error probability at each step to be . This leads to a transcendental inequality, solved using the Lambert $W$ function and approximated using section of [Barry et al.2000]. ∎
3.2 The DP-UCB Algorithm
The key observation used in Algorithm 2 is that if at each time step we insert a reward into all hybrid mechanisms, then the scale of the noise will be the same for every arm. This means that there is no longer any need to compensate with an additional bound. More precisely, every time we play an arm $a_t$ and receive the reward $r_t$, we not only add it to the hybrid mechanism corresponding to arm $a_t$ but also add a reward of $0$ to the hybrid mechanisms of all other arms. Since the mechanisms compute sums, the added zeros do not affect subsequent calculations.
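The bookkeeping behind this observation can be sketched as below. The class and method names are hypothetical, and plain lists stand in for the per-arm hybrid mechanisms; the only point illustrated is that padding with zeros keeps every stream at the same length without changing any arm's sum.

```python
class PaddedArmStreams:
    """Per-arm reward streams that all grow by one entry at every time step."""

    def __init__(self, num_arms: int):
        self.streams = [[] for _ in range(num_arms)]  # stand-ins for per-arm mechanisms
        self.pulls = [0] * num_arms

    def record(self, arm: int, reward: float) -> None:
        """Feed `reward` to the played arm and a zero to every other arm."""
        for a, stream in enumerate(self.streams):
            stream.append(reward if a == arm else 0.0)
        self.pulls[arm] += 1


arms = PaddedArmStreams(num_arms=3)
for arm, reward in [(0, 1.0), (2, 0.0), (0, 1.0), (1, 1.0)]:
    arms.record(arm, reward)

print([len(s) for s in arms.streams])  # every stream has seen the same number of entries
print([sum(s) for s in arms.streams])  # the sums only reflect each arm's own rewards
```

Since a mechanism whose noise depends on the stream length now sees the same length for every arm, the privacy-induced uncertainty is identical across arms.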
Theorem 3.3 shows the validity of this approach by demonstrating a regret bound with only an additional factor of to the optimal non private regret.
Theorem 3.3. If Algorithm 2 is run with $K$ arms having arbitrary reward distributions, then its expected regret after any number $t$ of plays is bounded by:
where $\zeta$ denotes the Riemann zeta function.
Proof sketch. The proof is similar to the one for Theorem 3.2, but we have to choose the error probability to be . ∎
3.3 The DP-UCB-Int Algorithm
Both Algorithms 1 and 2 enjoy a logarithmic regret with only a small additional factor in the time step to the optimal non-private regret. However, this includes a multiplicative factor of and respectively. Consequently, increasing privacy scales the total regret proportionally. A natural question is whether it is possible to obtain a differentially private algorithm with only an additive constant to the optimal regret. Algorithm 3 answers this question positively by using novel tricks to achieve differential privacy. Looking at the regret analysis of Algorithms 1 and 2, we observe that by adding noise proportional to , we get a multiplicative factor to the optimal. In other words, to remove this factor, the noise should not depend on . But how can we get -DP in this case?
Note that if we compute and use the mean at each time step with an -DP algorithm, then after time step , our overall privacy is roughly the sum of all . We then change the algorithm so that it only uses a released mean once every times, making privacy . In any case, needs to decrease, at least as , for the sum to be bounded by . However, should also be big enough such that the noise added keeps the UCB confidence interval used at the same order, otherwise, the regret will be higher.
A natural choice for is a p-series. Indeed, by making to be of the form , where is the number of times action has been played until time , its sum will converge to the Riemann zeta function when is appropriately chosen. This choice of leads to the addition of Laplace noise of scale to the mean (see Lemma 3.1). Our trade-off between high privacy and low regret now reduces to choosing a correct value for . Indeed, we can pick , for the privacy to converge; but the noise added at each time step will then be increasing and greater than the UCB bound, which is not desirable. To overcome this issue, we use the more sophisticated $k$-fold adaptive composition theorem (III-3 in [Dwork, Rothblum, and Vadhan2010]). Roughly speaking, this theorem shows that our overall privacy after releasing the mean a number of times depends on the sum of the squares of the individual privacy parameters . So, is enough for convergence and with , the noise added will be decreasing and will eventually become lower than the UCB bound.
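The convergence argument can be checked numerically. The sketch below uses a hypothetical parameterisation, $\epsilon_n = \epsilon / n^{p}$ for the $n$-th release of a mean of $n$ rewards in $[0, 1]$; the exponent name $p$ and this exact form are assumptions made for illustration and may differ from the parameterisation used by Algorithm 3.

```python
def partial_sum(exponent: float, terms: int) -> float:
    """Partial sum of the p-series sum_{n=1}^{terms} n^(-exponent)."""
    return sum(n ** -exponent for n in range(1, terms + 1))


def noise_scale(n: int, p: float, epsilon: float) -> float:
    """Laplace scale for a mean of n rewards in [0, 1] released with budget epsilon / n^p."""
    sensitivity = 1.0 / n  # one reward changes the mean by at most 1/n
    return sensitivity / (epsilon / n ** p)


for p in (0.75, 1.5):
    print(
        f"p={p}: sum eps_n ~ {partial_sum(p, 10_000):.2f}, "
        f"sum eps_n^2 ~ {partial_sum(2 * p, 10_000):.2f}, "
        f"noise scale at n=10: {noise_scale(10, p, 1.0):.3f}, "
        f"at n=1000: {noise_scale(1000, p, 1.0):.3f}"
    )
```

Under this assumed form, an exponent above 1 makes the plain sum of budgets converge but lets the noise scale $n^{p-1}/\epsilon$ grow, whereas an exponent between 0.5 and 1 keeps the sum of squared budgets finite while the noise scale shrinks with $n$.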
In summary, we just need to lazily update the mean of each arm every times. However, we show that the interval of release is much better than and follows a series as defined by Lemma (B.1) in the supplements [Tossou and Dimitrakakis2016]. Algorithm 3 summarizes the idea developed in this section.
The next lemma establishes the privacy each time a new mean is released for a given arm .
Lemma 3.1. The mean computed by Algorithm 3 for a given arm at each interval is -differentially private with respect to the reward sequence observed by that arm.
Proof sketch. This follows directly from the fact that we add Laplace noise of scale . ∎
The next theorem establishes the overall privacy after having played for time steps.
Theorem 3.4. After playing for any $t$ time steps, Algorithm 3 is -differentially private with
for any ,
Proof sketch. We begin by using similar observations as in Theorem 3.1. Then, we compute the privacy of the mean of an arm using the $k$-fold adaptive composition theorem in [Dwork, Rothblum, and Vadhan2010] (see the supplements [Tossou and Dimitrakakis2016]). ∎
The next corollary gives a nicer closed form for the privacy parameter which is needed in practice.
Corollary 3.1. After playing for $t$ time steps, Algorithm 3 is -differentially private with
with $\zeta$ the Riemann zeta function, for any , , .
Proof sketch. We upper bound the first term in Theorem 3.4 using the integral test; for the second term, we use for all to conclude the proof. ∎
The following corollary gives the parameter with which one should run Algorithm 3 to achieve a given privacy.
Corollary 3.2. If Algorithm 3 is run with parameter for any , , , it is at least -differentially private.
Proof. The proof is obtained by inverting the term using the Riemann zeta function in Corollary 3.1. ∎
This is proven using a Laplace concentration inequality to bound the estimate of the mean; we then select the error probability to be . ∎
We perform experiments using arms with rewards drawn from independent Bernoulli distributions. The plot, in logarithmic scale, shows the regret of the algorithms over time steps, averaged over 100 runs. We target two privacy levels: 0.1 and 1. For DP-UCB-Int, we pick such that the overall privacy is -DP with as defined in Corollary 3.2 and , . The input privacy of each algorithm is given in parentheses.
We compare against the non-private UCB algorithm and the algorithm presented in [Mishra and Thakurta2015] (Private-UCB), with a failure probability chosen to be .
We consider two scenarios. The first uses two arms: one with expectation 0.9 and the other with expectation 0.6. The second is a more challenging scenario with 10 arms, all with expectation 0.1 except two, which have expectations 0.55 and 0.2.
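For readers who want to reproduce the setting, the two scenarios can be simulated as sketched below. The sketch runs only the non-private UCB1 baseline, with a small horizon and a single seeded run; the function name, horizon, and seed are assumptions made for illustration, and none of the private algorithms is implemented here.

```python
import math
import random


def run_ucb(means, horizon: int, rng: random.Random) -> float:
    """Run non-private UCB1 on Bernoulli arms and return the pseudo-regret."""
    num_arms = len(means)
    pulls = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # play each arm once first
        else:
            arm = max(
                range(num_arms),
                key=lambda a: sums[a] / pulls[a] + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward
        pulls[arm] += 1
        sums[arm] += reward
    best = max(means)
    return sum(n * (best - mu) for n, mu in zip(pulls, means))


scenario_1 = [0.9, 0.6]               # two arms
scenario_2 = [0.55, 0.2] + [0.1] * 8  # ten arms, eight of them with expectation 0.1

print(run_ucb(scenario_1, horizon=2000, rng=random.Random(0)))
print(run_ucb(scenario_2, horizon=2000, rng=random.Random(0)))
```

Averaging such runs over many seeds and plotting the regret against time on a logarithmic scale gives curves of the kind discussed in this section.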
As expected, the performance of DP-UCB-Int is significantly better than that of all other private algorithms. More importantly, the gap between the regret of DP-UCB-Int and that of the non-private UCB does not increase with time, confirming the theoretical regret. We can notice that DP-UCB is better than DP-UCB-Bound for small time steps. However, as the time step increases, DP-UCB-Bound catches up with the regret of DP-UCB and eventually outperforms it. The reason is that DP-UCB-Bound spends less time distinguishing between arms with close rewards, because the additional factor in its regret depends on , which is not the case for DP-UCB. Private-UCB performs worse than all other algorithms, which is not surprising.
Moreover, we noticed that the difference between the best regret (after 100 runs) and the worst regret is very consistent for all our algorithms and the non-private UCB (it is under 664.5 for the 2-arm scenario). However, this gap reaches for Private-UCB. This means that our algorithms are able to correctly trade off exploration and exploitation, which is not the case for Private-UCB.
5 Conclusion and Future Work
In this paper, we have proposed and analysed differentially private algorithms for the stochastic multi-armed bandit problem, significantly improving upon the state of the art. The first two (DP-UCB and DP-UCB-Bound) are variants of an existing private UCB algorithm [Mishra and Thakurta2015], while the third one uses an interval-based mechanism.
The first two algorithms are only within a factor of and of the non-private algorithm. The last algorithm, DP-UCB-Int, efficiently trades off the privacy level and the regret and is able to achieve the same regret as the non-private algorithm up to an additive constant. This has been achieved by using two key tricks: updating the mean of each arm lazily with a frequency proportional to the privacy and adding noise independent of . Intuitively, the algorithm achieves better privacy without increasing regret because its output is less dependent on individual rewards.
Perhaps it is possible to improve our bounds further if we are willing to settle for asymptotically low regret [Cowan and Katehakis2015]. A natural future work is to study if we can use similar methods for other mechanisms such as Thompson sampling (known to be differentially private [Dimitrakakis et al.2014]) instead of UCB. Another question is whether a similar analysis can be performed for adversarial bandits.
We would also like to connect more to applications through two extensions of our algorithms. The first natural extension is to consider some side-information. In the drug testing example, this could include some information about the drug, the test performed and the user examined or treated. The second extension would relate to generalising the notion of neighbouring databases to take into account the fact that multiple observations in the sequence (say ) can be associated with a single individual. Our algorithms can be easily extended to deal with this setting (by re-scaling the privacy parameter to ). However, in practice, could be quite large, and it would be interesting future work to check whether we can obtain sublinearity in the parameter under certain conditions.
- [Auer, Cesa-Bianchi, and Fischer2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite time analysis of the multiarmed bandit problem. Machine Learning 47(2/3):235–256.
- [Barry et al.2000] Barry, D.; Parlange, J.-Y.; Li, L.; Prommer, H.; Cunningham, C.; and Stagnitti, F. 2000. Analytical approximations for real values of the Lambert W-function. Mathematics and Computers in Simulation 53(1–2):95–103.
- [Burnetas and Katehakis1996] Burnetas, A. N., and Katehakis, M. N. 1996. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17(2):122–142.
- [Calandrino et al.2011] Calandrino, J. A.; Kilzer, A.; Narayanan, A.; Felten, E. W.; and Shmatikov, V. 2011. “You might also like:” privacy risks of collaborative filtering. In 32nd IEEE Symposium on Security and Privacy, 231–246.
- [Chan, Shi, and Song2010] Chan, T. H.; Shi, E.; and Song, D. 2010. Private and continual release of statistics. In Automata, Languages and Programming. Springer. 405–417.
- [Cowan and Katehakis2015] Cowan, W., and Katehakis, M. N. 2015. Asymptotic behavior of minimal-exploration allocation policies: Almost sure, arbitrarily slow growing regret. arXiv preprint arXiv:1505.02865.
- [Dimitrakakis et al.2014] Dimitrakakis, C.; Nelson, B.; Mitrokotsa, A.; and Rubinstein, B. 2014. Robust and private Bayesian inference. In Algorithmic Learning Theory.
- [Dwork and Roth2013] Dwork, C., and Roth, A. 2013. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9(3–4):211–407.
- [Dwork et al.2006] Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, 265–284.
- [Dwork, Rothblum, and Vadhan2010] Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS ’10, 51–60.
- [Dwork2006] Dwork, C. 2006. Differential privacy. In ICALP, 1–12. Springer.
- [Jain, Kothari, and Thakurta2012] Jain, P.; Kothari, P.; and Thakurta, A. 2012. Differentially private online learning. In Mannor, S.; Srebro, N.; and Williamson, R. C., eds., COLT 2012 - The 25th Annual Conference on Learning Theory, volume 23, 24.1–24.34.
- [Korolova2010] Korolova, A. 2010. Privacy violations using microtargeted ads: A case study. In ICDMW 2010, The 10th IEEE International Conference on Data Mining Workshops, 474–482.
- [Lai and Robbins1985] Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1):4–22.
- [Mishra and Thakurta2015] Mishra, N., and Thakurta, A. 2015. (nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the 31st International Conference on Uncertainty in Artificial Intelligence (UAI-2015).
- [Pandey and Olston2006] Pandey, S., and Olston, C. 2006. Handling advertisements of unknown quality in search advertising. In Schölkopf, B.; Platt, J. C.; and Hoffman, T., eds., Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing, 1065–1072.
- [Robbins and others1952] Robbins, H., et al. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5):527–535.
- [Thakurta and Smith2013] Thakurta, A. G., and Smith, A. D. 2013. (nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013., 2733–2741.
- [Thompson1933] Thompson, W. 1933. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of two Samples. Biometrika 25(3-4):285–294.
- [Tossou and Dimitrakakis2016] Tossou, A., and Dimitrakakis, C. 2016. Supplementary Materials for Algorithms for Differentially Private Multi-Armed Bandits. https://www.dropbox.com/s/fotezpnx49nz0i3/single-mab-aaai16-sups-final.pdf. [Online; accessed 1-December-2015].
- [Zhao et al.2014] Zhao, J.; Jung, T.; Wang, Y.; and Li, X. 2014. Achieving differential privacy of data disclosure in the smart grid. In 2014 IEEE Conference on Computer Communications, INFOCOM 2014, 504–512.
Appendix A Collected proofs
A.1 Proof of Theorem 3.2
$a$ will be used to indicate the index of an arm, $\epsilon$ is the differential privacy parameter, and $t$ is used to denote the time step.
From Lemma (B.2), we know that the error between the empirical and private mean is bounded as with probability at least , is the empirical mean returned by the private mechanism, the true empirical mean, the error due to the differentially private mechanism. It is defined as: . We can rewrite this bound into equations A.1 and A.2.
Let be the number of times arm a is played in the first time steps. Let’s denote the original UCB confidence index.
By following similar steps as in the demonstration of UCB in [Auer, Cesa-Bianchi, and Fischer2002], we have
In equation A.3, is the mean returned by the private mechanism for the best arm when it has been played times. Now we can observe that implies that at least one of the following must hold
Let’s choose ; this leads respectively to
Now consider the last condition (A.6). For this, we want to find the minimum number for which event (A.6) is always false. Event (A.6) being false implies that where . We observe that for to hold, it is enough that the following two conditions hold for any such that .
We can rewrite the inequality in a more familiar form:
This is a standard transcendental algebraic inequality whose solution is given by the Lambert $W$ function. So,
where is the Lambert $W$ function of on branch . Note here that the branch is $-1$ and because , we are always guaranteed to find a real number.
By using the approximation of the Lambert $W$ function provided in section of [Barry et al.2000], we can conclude that
which concludes the proof.
A.2 Proof for Theorem 3.3
Appendix B Proofs for UCB-Interval Algorithm
Fact B.1 (Differential privacy of the Laplace mechanism; see Theorem 4 in [Dwork and Roth2013]). For any real function $f$ of the data, a mechanism adding Laplace noise with scale parameter $\beta$ is $\Delta f / \beta$-differentially private, where $\Delta f$ is the sensitivity of $f$.
B.1 Proof of Lemma 3.1
Indeed, for each arm, we add Laplace noise of mean 0 and scale where $n$ is the number of times this arm has been played. As the sensitivity of the mean is $1/n$, we use the differential privacy of the Laplace mechanism (Fact B.1) to conclude the proof. ∎
B.2 Proof of Theorem 3.4
A new Laplace mechanism is used to compute the mean of each arm. However, a new mean is only released times (every time steps) after time steps, where is the interval used.