1 Introduction
We consider multiarmed bandit problems in the adversarial setting whereby an agent selects one from a number of alternatives (called arms) at each round and receives a gain that depends on its choice. The agent’s goal is to maximize its total gain over time. There are two main settings for the bandit problem. In the stochastic one, the gains of each arm are generated i.i.d by some unknown probability law. In the adversarial setting, which is the focus of this paper, the gains are generated adversarially. We are interested in finding algorithms with a total gain over
rounds not much smaller than that of an oracle with additional knowledge about the problem. In both settings, algorithms that achieve the optimal (problemindependent) regret bound of are known [Auer, CesaBianchi, and Fischer2002, Burnetas and Katehakis1996, Pandey and Olston2006, Thompson1933, Auer et al.2003, Auer2002, Agrawal and Goyal2012].This problem is a model for many applications where there is a need for tradingoff exploration and exploitation. This is so because, whenever we make a choice, we only observe the gain generated by that choice, and not the gains that we could have obtained otherwise. An example is clinical trials, where arms correspond to different treatments or tests, and the goal is to maximize the number of cured patients over time while being uncertain about the effects of treatments. Other problems, such as search engine advertisement and movie recommendations can be formalized similarly [Pandey and Olston2006].
Privacy can be a serious issue in the bandit setting (c.f. [Jain, Kothari, and Thakurta2012, Thakurta and Smith2013, Mishra and Thakurta2015, Zhao et al.2014]). For example, in clinical trials, we may want to detect and publish results about the best drug without leaking sensitive information, such as the patient’s health condition and genome. Differential privacy [Dwork2006] formally bounds the amount of information that a third party can learn no matter their power or side information.
Differential privacy has been used before in the stochastic setting [Tossou and Dimitrakakis2016, Mishra and Thakurta2015, Jain, Kothari, and Thakurta2012] where the authors obtain optimal algorithms up to logarithmic factors. In the adversarial setting, [Thakurta and Smith2013] adapts an algorithm called Follow The Approximate Leader to make it private and obtain a regret bound of . In this work, we show that a number of simple algorithms can satisfy privacy guarantees, while achieving nearly optimal regret (up to logarithmic factors) that scales naturally with the level of privacy desired.
Our work is also of independent interest for nonprivate multiarmed bandit algorithms, as there are competitive with the current state of the art against switchingcost adversaries (where we recover the optimal bound). Finally, we provide rigorous empirical results against a variety of adversaries.
The following section gives the main background and notations. Section 3.1 describes metaalgorithms that perturb the gain sequence to achieve privacy, while Section 3.2 explains how to leverage the privacy inherent in the EXP3 algorithm by modifying the way gains are used. Section 4 compares our algorithms with EXP3 in a variety of settings. The full proofs of all our main results are in the full version.
2 Preliminaries
2.1 The MultiArmed Bandit problem
Formally, a bandit game is defined between an adversary and an agent as follows: there is a set of arms , and at each round , the agent plays an arm . Given the choice , the adversary grants the agent a gain . The agent only observes the gain of arm , and not that of any other arms. The goal of this agent is to maximize its total gain after rounds, . A randomized bandit algorithm maps every armgain history to a distribution over the next arm to take.
The nature of the adversary, and specifically, how the gains are generated, determines the nature of the game. For the stochastic adversary [Thompson1933, Auer, CesaBianchi, and Fischer2002], the gain obtained at round is generated i.i.d from a distribution . The more general fully oblivious adversary [Audibert and Bubeck2010] generates the gains independently at round but not necessarily identically from a distribution . Finally, we have the oblivious adversary [Auer et al.2003] whose only constraint is to generate the gain as a function of the current action only, i.e. ignoring previous actions and gains.
While focusing on oblivious adversaries, we discovered that by targeting differential privacy we can also compete against the stronger bounded memory adaptive adversary [CesaBianchi, Dekel, and Shamir2013, Merhav et al.2002, Dekel, Tewari, and Arora2012] who can use up to the last gains. The oblivious adversary is a special case with . Another special case of this adversary is the one with switching costs, who penalises the agent whenever he switches arms, by giving the lowest possible gain of 0 (here ).
Regret.
Relying on the cumulative gain of an agent to evaluate its performance can be misleading. Indeed, consider the case where an adversary gives a zero gain for all arms at every round. The cumulative gain of the agent would look bad but no other agents could have done better. This is why one compares the gap between the agent’s cumulative gain and the one obtained by some hypothetical agent, called oracle, with additional information or computational power. This gap is called the regret.
There are also variants of the oracle that are considered in the literature. The most common variant is the fixed oracle, which always plays the best fixed arm in hindsight. The regret against this oracle is :
In practice, we either prove a high probability bound on or an expected value with:
where the expectation is taken with respect to the random choices of both the agent and adversary. There are other oracles like the shifting oracle but those are out of scope of this paper.
Exp3.
The Exponentialweight for Exploration and Exploitation (EXP3 [Auer et al.2003]) algorithm achieves the optimal bound (up to logarithmic factors) of for the weak regret (i.e. the expected regret compared to the fixed oracle) against an oblivious adversary. EXP3
simply maintains an estimate
for the cumulative gain of arm up to round with where(2.1) 
with a well defined constant.
Finally, EXP3
plays one action randomly according to the probability distribution
with as defined above.2.2 Differential Privacy
The following definition (from [Tossou and Dimitrakakis2016]) specifies what is meant when we called a bandit algorithm differentially private at a single round :
Definition 2.1 (Single round (differentially private bandit algorithm).
A randomized bandit algorithm is differentially private at round , if for all sequence and that differs in at most one round, we have for any action subset :
(2.2) 
where denotes the probability distribution specified by the algorithm and with the gains of all arms at round . When , the algorithm is said to be differential private.
The and parameters quantify the amount of privacy loss. Lower (,) indicate higher privacy and consequently we will also refer to (,) as the privacy loss. Definition 2.1 means that the output of the bandit algorithm at round is almost insensible to any single change in the gains sequence. This implies that whether or not we remove a single round, replace the gains, the bandit algorithm will still play almost the same action. Assuming the gains at round are linked to a user private data (for example his cancer status or the advertisement he clicked), the definition preserves the privacy of that user against any third parties looking at the output. This is the case because the choices or the participation of that user would not almost affect the output. Equation (2.2) specifies how much the output is affected by a single user.
We would like Definition 2.1 to hold for all rounds, so as to protect the privacy of all users. If it does for some , then we say the algorithm has perround or instantaneous privacy loss . Such an algorithm also has a cumulative privacy loss of at most with and after steps. Our goal is to design bandit algorithm such that their cumulative privacy loss are as low as possible while achieving simultaneously a very low regret. In practice, we would like and the regret to be sublinear while should be a very small quantity. Definition 2.2 formalizes clearly the meaning of this cumulative privacy loss and for ease of presentation, we will ignore the term ”cumulative” when referring to it.
Definition 2.2 ((differentially private bandit algorithm).
A randomized bandit algorithm is differentially private up to round , if for all and that differs in at most one round, we have for any action subset :
(2.3) 
where and are as defined in Definition 2.1.
Most of the time, we will refer to Definition 2.2 and whenever we need to use Definition 2.1, this will be made explicit.
The simplest mechanism to achieve differential privacy for a function is to add Laplace noise of scale proportional to its sensitivity. The sensitivity is the maximum amount by which the value of the function can change if we change a single element in the inputs sequence. For example, if the input is a stream of numbers in and the function their sum, we can add Laplace noise of scale to each number and achieve differential privacy with an error of in the sum. However, [Chan, Shi, and Song2010] introduced Hybrid Mechanism, which achieves differential privacy with only polylogarithmic error (with respect to the true sum). The idea is to group the stream of numbers in a binary tree and only add a Laplace noise at the nodes of the tree.
As demonstrated above, the main challenge with differential privacy is thus to tradeoff optimally privacy and utility.
Notation.
In this paper, will be used as an index for an arbitrary arm in , while will be used to indicate an optimal arm and is the arm played by an agent at round . We use to indicate the gain of the th arm at round . is the regret of the algorithm after rounds. The index and are dropped when it is clear from the context. Unless otherwise specified, the regret is defined for oblivious adversaries against the fixed oracle. We use ”” to denote that is generated from distribution . is used to denote the Laplace distribution with scale while
denotes the Bernoulli distribution with parameter
.3 Algorithms and Analysis
3.1 : Differential privacy through additional noise
We start by showing that the obvious technique to achieve a given differential privacy in adversarial bandits already beat the stateofthe art. The main idea is to use any base bandit algorithm as input and add a Laplace noise of scale to each gain before observes it. This technique gives DP differential privacy as the gains are bounded in and the noises are added i.i.d at each round.
However, bandits algorithms require bounded gains while the noisy gains are not. The trick is to ignore rounds where the noisy gains fall outside an interval of the form . We pick the threshold such that, with high probability, the noisy gains will be inside the interval . More precisely, can be chosen such that with high probability, the number of rounds ignored is lower than the upper bound on the regret of . Given that in the standard bandit problem, the gains are bounded in , the gains at accepted rounds are rescaled back to .
Theorem 3.2 shows that all these operations still preserve DP while Theorem 3.1 demonstrates that the upper bound on the expected regret of adds some small additional terms to . To illustrate how small those additional terms are, we instantiate with the EXP3 algorithm. This leads to a mechanism called DPEXP3Lap described in Algorithm 1. With a carefully chosen threshold , corollary 3.1 implies that the additional terms are such that the expected regret of DPEXP3Lap is which is optimal in up to some logarithmic factors. This result is a significant improvement over the best known bound so far of from [Thakurta and Smith2013] and solves simultaneously the challenge (whether or not one can get DP mechanism with optimal regret) posed by the authors.
Theorem 3.1.
If is run with input a base bandit algorithm , the noisy reward of the true reward set to with , the acceptance interval set to with the scaling of the rewards outside done using ; then the regret of satisfies:
(3.1) 
where is the upper bound on the regret of when the rewards are scaled from to
Proof Sketch.
We observed that is an instance of run with the noisy rewards instead of . This means is an upper bound of the regret on . Then, we derived a lower bound on showing how close it is to . This allows us to conclude. ∎
Corollary 3.1.
If is run with EXP3 as its base algorithm and , then its expected regret satisfies
Proof.
The proof comes by combining the regret of EXP3 [Auer et al.2003] with Theorem 3.1 ∎
Theorem 3.2.
is differentially private up to round .
Proof Sketch.
Combining the privacy of Laplace Mechanism with the parallel composition [McSherry2009] and postprocessing theorems [Dwork and Roth2013] concludes the proof. ∎
3.2 Leveraging the inherent privacy of Exp3
On the differential privacy of Exp3
[Dwork and Roth2013] shows that a variation of EXP3 for the fullinformation setting (where the agent observes the gain of all arms at any round regardless of what he played) is already differentially private. Their results imply that one can achieve the optimal regret with only a sublogarithmic privacy loss () after rounds.
We start this section by showing a similar result for EXP3 in Theorem 3.3. Indeed, we show that EXP3 is already differentially private but with a perround privacy loss of . ^{1}^{1}1Assuming we want a sublinear regret. See Theorem 3.3 Our results imply that EXP3 can achieve the optimal regret albeit with a linear privacy loss of DP after rounds. This is a huge gap compared with the fullinformation setting and underlines the significance of our result in section 3.1 where we describe a concrete algorithm demonstrating that the optimal regret can be achieved with only a logarithmic privacy loss after rounds.
Theorem 3.3.
The EXP3 algorithm is:
differentially private up to round .
In practice, we also want EXP3 to have a sublinear regret. This implies that and EXP3 is simply DP over rounds.
Proof Sketch.
The first two terms in the theorem come from the observation that EXP3 is a combination of two mechanisms: the Exponential Mechanism [McSherry and Talwar2007] and a randomized response. The last term comes from the observation that with probability we enjoy a perfect DP. Then, we use Chernoff to bound with high probability the number of times we suffer a nonzero privacy loss. ∎
We will now show that the privacy of EXP3 itself may be improved without any additional noise, and with only a moderate impact on the regret.
On the privacy of a Exp3 wrapper algorithm
The previous paragraph leads to the conclusion that it is impossible to obtain a sublinear privacy loss with a sublinear regret while using the original EXP3. Here, we will prove that an existing technique is already achieving this goal. The algorithm which we called is from [Dekel, Tewari, and Arora2012]. It groups the rounds into disjoint intervals of fixed size where the ’th interval starts on round and ends on round . At the beginning of interval , receives an action from EXP3 and plays it for rounds. During that time, EXP3 does not observe any feedback. At the end of the interval, feeds EXP3 with a single gain, the average gain received during the interval.
Theorem 3.4 borrowed from [Dekel, Tewari, and Arora2012] specifies the upper bound on the regret . It is remarkable that this bound holds against the mmemory bounded adaptive adversary. While in theorem 3.5, we show the privacy loss enjoyed by this algorithm, one gets a better intuition of how good those results are from corollary 3.2 and 3.3. Indeed, we can observe that achieves a sublogarithmic privacy loss of with a regret of against a special case of the mmemory bounded adaptive adversary called the switching costs adversary for which . This is the optimal regret bound (in the sense that there is a matching lower bound [Dekel et al.2014]). This means that in some sense we are getting privacy for free against this adversary.
Theorem 3.4 (Regret of [Dekel, Tewari, and Arora2012]).
The expected regret of is upper bounded by:
against the mmemory bounded adaptive adversary for any .
Theorem 3.5 (Privacy loss of ).
is DP up to round .
Proof.
The sensitivity of each gain is now as we are using the average. Combined with theorem (3.3), it means the perround privacy loss is . Given that EXP3 only observes rounds, using the advanced composition theorem [Dwork, Rothblum, and Vadhan2010] (Theorem III.3) concludes the final privacy loss over rounds. ∎
Corollary 3.2.
run with is differentially private up to round with , . Its expected regret against the switching costs adversary is upper bounded by .
Proof.
Corollary 3.3.
run with is differentially private and its expected regret against the switching costs adversary is upper bounded by:
4 Experiments
We tested DPEXP3Lap, together with the nonprivate EXP3 against a few different adversaries. The privacy parameter of DPEXP3Lap is set as defined in corollary 3.2. This is done so that the regret of DPEXP3Lap and are compared with the same privacy level. All the other parameters of DPEXP3Lap are taken as defined in corollary 3.1 while the parameters of are taken as defined in corollary 3.2.
For all experiments, the horizon is and the number of arms is . We performed independent trials and reported the medianofmeans estimator^{2}^{2}2Used heavily in the streaming literature [Alon, Matias, and Szegedy1996] of the cumulative regret. It partitions the trials into equal groups and return the median of the sample means of each group. Proposition 4.1 is a well known result (also in [Hsu and Sabato2013, Lerasle and Oliveira2011]) giving the accuracy of this estimator. Its convergence is
, with exponential probability tails, even though the random variable
may have heavytails. In comparison, the empirical mean can not provide such guarantee for any and confidence in [Catoni2012].Proposition 4.1.
Let be a random variable with mean
and variance
. Assume that we have independent sample of and let be the medianofmeans computed using groups. With probability at least , satisfies .We also reported the deviation of each algorithm using the Gini’s Mean Difference (GMD hereafter) [Gini and Pearson1912]. GMD computes the deviation as with the th order statistics of the sample (that is ). As shown in [Yitzhaki and others2003, David1968], the GMD provides a superior approximation of the true deviation than the standard one. To account for the fact that the cumulative regret of our algorithms might not follow a symmetric distribution, we computed the GMD separately for the values above and below the medianofmeans.
At round , we computed the cumulative regret against the fixed oracle who plays the best arm assuming that the end of the game is at . The oracle uses the actual sequence of gains to decide his best arm. For a given trial, we make sure that all algorithms are playing the same game by generating the gains for all possible pair of roundarm before the game starts.
Deterministic adversary.
As shown by [Audibert and Bubeck2010], the expected regret of any agent against an oblivious adversary can not be worse than that against the worst case deterministic adversary. In this experiment, arm is the best and gives for every even round. To trick the players into picking the wrong arms, the first arm always gives whereas the third gives for every round multiple of . The remaining arms always give . As shown by the figure, this simple adversary is already powerful enough to make the algorithms attain their upper bound.
Stochastic adversary
This adversary draws the gains of the first arm i.i.d from whereas all other gains are drawn i.i.d from .
Fully oblivious adversary.
For the best arm , it first draws a number uniformly in and generates the gain . For all other arms, is drawn from . This process is repeated at every round. In our experiments,
An oblivious adversary.
This adversary is identical to the fully oblivious one for every round multiple of . Between two multiples of the last gain of the arm is given.
The Switching costs adversary
This adversary (defined at Figure 1 in [Dekel et al.2014]) defines a stochastic processes (including simple Gaussian random walk as special case) for generating the gains. It was used to prove that any algorithm against this adversary must incur a regret of .
Discussion
Figure 6 shows our results against a variety of adversaries, with respect to a fixed oracle. Overall, the performance (in term of regret) of DPEXP3Lap is very competitive against that of EXP3 while providing a significant better privacy. This means that DPEXP3Lap allows us to get privacy for free in the bandit setting against an adversary not more powerful than the oblivious one.
The performance of is worse than that of DPEXP3Lap against an oblivious adversary or one less powerful. However, the situation is completely reversed against the more powerful switching cost adversary. In that setting, outperforms both EXP3 and DPEXP3Lap confirming the theoretical analysis. We can see as the algorithm providing us privacy for free against switching cost adversary and adaptive mbounded memory one in general.
5 Conclusion
We have provided the first results on differentially private adversarial multiarmed bandits, which are optimal up to logarithmic factors. One open question is how differential privacy affects regret in the full reinforcement learning problem. At this point in time, the only known results in the MDP setting obtain differentially private algorithms for Monte Carlo policy evaluation
[Balle, Gomrokchi, and Precup2016]. While this implies that it is possible to obtain policy iteration algorithms, it is unclear how to extend this to the full online reinforcement learning problem.Acknowledgements.
This research was supported by the SNSF grants “Adaptive control with approximate Bayesian computation and differential privacy” and “Swiss Sense Synergy”, by the Marie Curie Actions (REA 608743), the Future of Life Institute “Mechanism Design for AI Architectures” and the CNRS Specific Action on Security.
References

[Agrawal and Goyal2012]
Agrawal, S., and Goyal, N.
2012.
Analysis of thompson sampling for the multiarmed bandit problem.
In COLT 2012. 
[Alon, Matias, and
Szegedy1996]
Alon, N.; Matias, Y.; and Szegedy, M.
1996.
The space complexity of approximating the frequency moments.
In 28th STOC, 20–29. ACM.  [Audibert and Bubeck2010] Audibert, J.Y., and Bubeck, S. 2010. Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11:2785–2836.
 [Auer et al.2003] Auer, P.; CesaBianchi, N.; Freund, Y.; and Schapire, R. E. 2003. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.
 [Auer, CesaBianchi, and Fischer2002] Auer, P.; CesaBianchi, N.; and Fischer, P. 2002. Finite time analysis of the multiarmed bandit problem. Machine Learning 47(2/3):235–256.
 [Auer2002] Auer, P. 2002. Using confidence bounds for exploitationexploration tradeoffs. Journal of Machine Learning Research 3:397–422.
 [Balle, Gomrokchi, and Precup2016] Balle, B.; Gomrokchi, M.; and Precup, D. 2016. Differentially private policy evaluation. In ICML 2016.
 [Burnetas and Katehakis1996] Burnetas, A. N., and Katehakis, M. N. 1996. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17(2):122–142.
 [Catoni2012] Catoni, O. 2012. Challenging the empirical mean and empirical variance: A deviation study. Annales de l’I.H.P. Probabilités et statistiques 48(4):1148–1185.
 [CesaBianchi, Dekel, and Shamir2013] CesaBianchi, N.; Dekel, O.; and Shamir, O. 2013. Online learning with switching costs and other adaptive adversaries. In NIPS, 1160–1168.
 [Chan, Shi, and Song2010] Chan, T. H.; Shi, E.; and Song, D. 2010. Private and continual release of statistics. In Automata, Languages and Programming. Springer. 405–417.
 [David1968] David, H. 1968. Miscellanea: Gini’s mean difference rediscovered. Biometrika 55(3):573–575.

[Dekel et al.2014]
Dekel, O.; Ding, J.; Koren, T.; and Peres, Y.
2014.
Bandits with switching costs: T2/3 regret.
In
Proceedings of the 46th Annual ACM Symposium on Theory of Computing
, STOC ’14, 459–467. New York, NY, USA: ACM.  [Dekel, Tewari, and Arora2012] Dekel, O.; Tewari, A.; and Arora, R. 2012. Online bandit learning against an adaptive adversary: from regret to policy regret. In ICML. icml.cc / Omnipress.
 [Dwork and Roth2013] Dwork, C., and Roth, A. 2013. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9(3–4):211–407.
 [Dwork, Rothblum, and Vadhan2010] Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS ’10, 51–60.
 [Dwork2006] Dwork, C. 2006. Differential privacy. In ICALP, 1–12. Springer.
 [Gini and Pearson1912] Gini, C., and Pearson, K. 1912. Variabilità e mutabilità: contributo allo studio delle distribuzioni e delle relazioni statistiche. Fascicolo 1. tipografia di Paolo Cuppini.
 [Hsu and Sabato2013] Hsu, D., and Sabato, S. 2013. Loss minimization and parameter estimation with heavy tails. arXiv preprint arXiv:1307.1827.
 [Jain, Kothari, and Thakurta2012] Jain, P.; Kothari, P.; and Thakurta, A. 2012. Differentially private online learning. In Mannor, S.; Srebro, N.; and Williamson, R. C., eds., COLT 2012, volume 23, 24.1–24.34.
 [Lerasle and Oliveira2011] Lerasle, M., and Oliveira, R. I. 2011. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914.
 [McSherry and Talwar2007] McSherry, F., and Talwar, K. 2007. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’07, 94–103. Washington, DC, USA: IEEE Computer Society.
 [McSherry2009] McSherry, F. D. 2009. Privacy integrated queries: An extensible platform for privacypreserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09, 19–30. New York, NY, USA: ACM.

[Merhav et al.2002]
Merhav, N.; Ordentlich, E.; Seroussi, G.; and Weinberger, M. J.
2002.
On sequential strategies for loss functions with memory.
IEEE Trans. Information Theory 48(7):1947–1958.  [Mishra and Thakurta2015] Mishra, N., and Thakurta, A. 2015. (nearly) optimal differentially private stochastic multiarm bandits. Proceedings of the 31th UAI.
 [Pandey and Olston2006] Pandey, S., and Olston, C. 2006. Handling advertisements of unknown quality in search advertising. In Schölkopf, B.; Platt, J. C.; and Hoffman, T., eds., Twentieth NIPS, 1065–1072.
 [Thakurta and Smith2013] Thakurta, A. G., and Smith, A. D. 2013. (nearly) optimal algorithms for private online learning in fullinformation and bandit settings. In NIPS, 2733–2741.
 [Thompson1933] Thompson, W. 1933. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of two Samples. Biometrika 25(34):285–294.
 [Tossou and Dimitrakakis2016] Tossou, A. C. Y., and Dimitrakakis, C. 2016. Algorithms for differentially private multiarmed bandits. In AAAI, 2087–2093. AAAI Press.

[Yitzhaki and others2003]
Yitzhaki, S., et al.
2003.
Gini’s mean difference: A superior measure of variability for nonnormal distributions.
Metron 61(2):285–316.  [Zhao et al.2014] Zhao, J.; Jung, T.; Wang, Y.; and Li, X. 2014. Achieving differential privacy of data disclosure in the smart grid. In 2014 IEEE Conference on Computer Communications, INFOCOM 2014, 504–512.
Comments
There are no comments yet.