1 Introduction
In the paradigm of online learning, a learning algorithm makes a sequence of predictions given the (possibly incomplete) knowledge of the correct answers for the past queries. In contrast to statistical learning, online learning algorithms typically offer distribution-free guarantees. Consequently, online learning algorithms are well suited to dynamic and adversarial environments, where real-time learning from changing data is essential, making them ubiquitous in practical applications such as servicing search advertisements. In these settings, the algorithms often interact with sensitive user data, which makes privacy a natural concern. A natural notion of privacy in such settings is differential privacy (Dwork et al., 2006), which ensures that the outputs of an algorithm are indistinguishable whether a user's data is present in or absent from a dataset.
In this paper, we design differentially private algorithms for online linear optimization with near-optimal regret, both in the full-information and partial-information (bandit) settings. This result improves the best known regret bounds for a number of important online learning problems, including prediction from expert advice and non-stochastic multi-armed bandits.
1.1 Full-Information Setting: Privacy for Free
For the full-information setting, where the algorithm gets to see the complete loss vector every round, we design differentially private algorithms with regret bounds that scale as (Theorem 3.1), partially resolving an open question to improve the previously best known bound of posed in (Smith & Thakurta, 2013). A decomposition of the regret bound of this form implies that when , the regret incurred by the differentially private algorithm matches the optimal regret in the non-private setting, i.e. differential privacy is free. Moreover, even when , our results guarantee a sub-constant regret per round, in contrast to the vacuous constant regret per round guaranteed by existing results.
Concretely, consider the case of online linear optimization over the cube, with unit norm-bounded loss vectors. In this setting, (Smith & Thakurta, 2013) achieves a regret bound of , which is meaningful only if . Our theorems imply a regret bound of . This is an improvement on the previous bound regardless of the value of . Furthermore, when is between and , the previous bounds are vacuous whereas our results are still meaningful. Note that the above arguments show an improvement over existing results even for moderate values of . Indeed, when is very small, the magnitude of the improvement is more pronounced.
Beyond the separation between and , the key point of our analysis is that we obtain bounds for general regularization-based algorithms which adapt to the geometry of the underlying problem optimally, unlike the previous algorithms (Smith & Thakurta, 2013), which utilize Euclidean regularization. This allows our results to get rid of a polynomial dependence on (in the term) in some cases. Online linear optimization over the sphere and prediction with expert advice are notable examples.
We summarize our results in Table 1.
Function Class ( Dimensions)  Previous Best Known Regret  Our Regret Bound  Best Nonprivate Regret 

Prediction with Expert Advice  
Online linear optimization over the Sphere  
Online linear optimization over the Cube  
Online Linear Optimization 
1.2 Bandits: Reduction to the Non-private Setting
In the partial-information (bandit) setting, the online learning algorithm only gets to observe the loss of the prediction it prescribed. We outline a reduction technique that translates a non-private bandit algorithm to a differentially private bandit algorithm, while retaining the dependency of the regret bound on the number of rounds of play (Theorem 4.5). This allows us to derive the first differentially private algorithm for bandit linear optimization achieving regret, using the algorithm for the non-private setting from (Abernethy et al., 2012). This answers a question from (Smith & Thakurta, 2013) asking if regret is attainable for differentially private linear bandits.
An important case of the general bandit linear optimization framework is the non-stochastic multi-armed bandits problem (Bubeck et al., 2012b), with applications in website optimization, personalized medicine, advertisement placement and recommendation systems. Here, we propose a differentially private algorithm which enjoys a regret of (Theorem 4.1), improving on the previously best attainable regret of (Smith & Thakurta, 2013).
We summarize our results in Table 2.
Function Class ( Dimensions)  Previous Best Known Regret  Our Regret Bound  Best Nonprivate Regret 

Bandit Linear Optimization  
Non-stochastic Multi-armed Bandits 
1.3 Related Work
The problem of differentially private online learning was first considered in (Dwork et al., 2010), albeit guaranteeing privacy in a weaker setting: ensuring the privacy of the individual entries of the loss vectors. (Dwork et al., 2010) also introduced the tree-based aggregation scheme for releasing the cumulative sums of vectors in a differentially private manner, while ensuring that the total amount of noise added for each cumulative sum is only polylogarithmically dependent on the number of vectors. The stronger notion of privacy protecting entire loss vectors was first studied in (Jain et al., 2012), where gradient-based algorithms were proposed that achieve differential privacy and regret bounds of . (Smith & Thakurta, 2013) proposed a modification of the Follow-the-Approximate-Leader template to achieve regret for strongly convex loss functions, implying a regret bound of for general convex functions. In addition, they also demonstrated that under bandit feedback, it is possible to obtain regret bounds that scale as . (Dwork et al., 2014a; Jain & Thakurta, 2014) proved that in the special case of the prediction with expert advice setting, it is possible to achieve a regret of . While most algorithms for differentially private online learning are based on the regularization template, (Dwork et al., 2014b) used a perturbation-based algorithm to guarantee differential privacy for the problem of online PCA. (Tossou & Dimitrakakis, 2016) showed that it is possible to design differentially private algorithms for the stochastic multi-armed bandit problem with a separation of for the regret bound. Recently, an independent work due to (Tossou & Dimitrakakis, 2017), which we were made aware of after the first manuscript, also demonstrated a regret bound in the non-stochastic multi-armed bandits setting. We match their results (Theorem 4.1), as well as provide a generalization to arbitrary convex sets (Theorem 4.5).
1.4 Overview of Our Techniques
Full Information Setting: We consider the two well-known paradigms for online learning, Follow-the-Regularized-Leader (FTRL) and Follow-the-Perturbed-Leader (FTPL). In both cases, we ensure differential privacy by restricting the mode of access to the inputs (the loss vectors). In particular, the algorithm can only retrieve estimates of the loss vectors released by a tree-based aggregation protocol (Algorithm 2), which is a slight modification of the protocol used in (Jain et al., 2012; Smith & Thakurta, 2013). We outline a tighter analysis of the regret minimization framework by crucially observing that in the case of linear losses, the expected regret of an algorithm that injects identically (though not necessarily independently) distributed noise per step is the same as that of one that injects a single copy of the noise at the very start of the algorithm.
The regret analysis of Follow-the-Leader based algorithms involves two components: a bias term due to the regularization and a stability term which bounds the change in the output of the algorithm per step. In the analysis due to (Smith & Thakurta, 2013), the stability term is affected by the variance of the noise as it changes from step to step. However, in our analysis, since we treat the noise as having been sampled just once, the stability analysis does not factor in the variance, and the magnitude of the noise essentially appears as an additive term in the bias.
Bandit Feedback: In the bandit feedback setting, we show a general reduction that takes a non-private algorithm and outputs a private algorithm (Algorithm 4). Our key observation here (presented as Lemma 4.3) is that, on linear functions, the expected regret of an algorithm on a noisy sequence of loss vectors is the same as its regret on the original loss sequence, as long as the noise is zero-mean. We then bound the regret on the noisy sequence by conditioning out the case when the noise can be large and using exploration techniques from (Bubeck et al., 2012a) and (Abernethy et al., 2008).
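The zero-mean observation can be checked exactly on a toy instance. The sketch below is only an illustration under stated assumptions: the learner (`follow_the_leader`), the two-point noise in {-1, +1}, and the function names are hypothetical choices made here, not the algorithms or the Laplace noise used in the paper. It enumerates every noise pattern and compares the expected loss measured on the noisy sequence with the expected loss measured on the true sequence, for a learner that only ever sees past noisy losses.

```python
from fractions import Fraction
from itertools import product


def follow_the_leader(history, n):
    """Hypothetical learner: play the coordinate with least cumulative loss so far."""
    totals = [sum(loss[i] for loss in history) for i in range(n)]
    best = min(range(n), key=lambda i: totals[i])
    return [1 if i == best else 0 for i in range(n)]


def expected_losses(losses):
    """Average over all +/-1 noise patterns (independent, zero-mean noise).

    Returns (expected loss on the noisy sequence, expected loss on the true
    sequence) for a learner that is fed only the noisy losses of past rounds.
    """
    T, n = len(losses), len(losses[0])
    patterns = list(product([-1, 1], repeat=T * n))
    noisy_total, clean_total = Fraction(0), Fraction(0)
    for flat in patterns:
        noise = [flat[t * n:(t + 1) * n] for t in range(T)]
        noisy = [[l + z for l, z in zip(losses[t], noise[t])] for t in range(T)]
        for t in range(T):
            x = follow_the_leader(noisy[:t], n)  # depends on past noise only
            noisy_total += sum(a * b for a, b in zip(noisy[t], x))
            clean_total += sum(a * b for a, b in zip(losses[t], x))
    return noisy_total / len(patterns), clean_total / len(patterns)
```

Because the play at round t is independent of the noise added at round t, the inner product of that noise with the play averages to zero, so the two expectations coincide. The same holds for any fixed comparator, which is why the regret is unchanged in expectation.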
2 Model and Preliminaries
This section introduces the model of online (linear) learning, the distinction between full and partial feedback scenarios, and the notion of differential privacy in this model.
Full-Information Setting: Online linear optimization (Hazan et al., 2016; Shalev-Shwartz, 2011) involves repeated decision making over rounds of play. At the beginning of every round (say round ), the algorithm chooses a point in , where is a (compact) convex set. Subsequently, it observes the loss and suffers a loss of . The success of such an algorithm, across rounds of play, is measured through regret, which is defined as
where the expectation is over the randomness of the algorithm. In particular, achieving a sublinear regret corresponds to doing almost as well (averaging across rounds) as the fixed decision with the least loss in hindsight. In the non-private setting, a number of algorithms have been devised to achieve regret, with additional dependencies on other parameters dependent on the properties of the specific decision set and loss set . (See (Hazan et al., 2016) for a survey of results.)
The following are three important instantiations of the above framework.

Prediction with Expert Advice: Here the underlying decision set is the simplex and the loss vectors are constrained to the unit cube .

OLO over the Sphere: Here the underlying decision set is the Euclidean ball and the loss vectors are constrained to the unit Euclidean ball .

OLO over the Cube: The decision set is the unit cube , while the loss vectors are constrained to the set .
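For the first instantiation above, regret against the best fixed decision in hindsight can be computed directly. The following is a minimal sketch, assuming the decision set is the probability simplex, so that the best fixed decision for linear losses is attained at a vertex (a single expert). The function names are illustrative and do not come from the paper.

```python
from typing import Sequence


def inner(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def regret_vs_best_vertex(plays: Sequence[Sequence[float]],
                          losses: Sequence[Sequence[float]]) -> float:
    """Total linear loss of the plays minus the loss of the best single expert.

    Assumes every play lies in the simplex; the minimum of a linear function
    over the simplex is then attained at a coordinate vector.
    """
    incurred = sum(inner(x, l) for x, l in zip(plays, losses))
    n = len(losses[0])
    cumulative = [sum(l[i] for l in losses) for i in range(n)]
    return incurred - min(cumulative)
```

For a randomized algorithm, the quantity in the text is the expectation of this value over the algorithm's internal randomness.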
Partial-Information Setting: In the setting of bandit feedback, the critical difference is that the algorithm only gets to observe the value , in contrast to the complete loss vector as in the full-information scenario. Therefore, the only feedback the algorithm receives is the value of the loss it incurs for the decision it takes. This makes designing algorithms for this feedback model challenging. Nevertheless, for the general problem of bandit linear optimization, (Abernethy et al., 2008) introduced a computationally efficient algorithm that achieves an optimal dependence of the incurred regret of on the number of rounds of play. The non-stochastic multi-armed bandit problem (Auer et al., 2002) is the bandit version of the prediction with expert advice framework.
Differential Privacy: Differential privacy (Dwork et al., 2006) is a rigorous framework for establishing guarantees on privacy loss that admits a number of desirable properties, such as graceful degradation of guarantees under composition and robustness to linkage attacks (Dwork et al., 2014a).
Definition 2.1 (Differential Privacy).
A randomized online learning algorithm on the action set and the loss set is differentially private if for any two sequences of loss vectors and differing in at most one vector – that is to say – for all , it holds that
Remark 2.2.
The above definition of Differential Privacy is specific to the online learning scenario in the sense that it assumes the change of a complete loss vector. This has been the standard notion considered earlier in (Jain et al., 2012; Smith & Thakurta, 2013). Note that the definition entails that the entire sequence of predictions produced by the algorithm is differentially private.
Notation: We define , , and , where is the norm. By Hölder's inequality, it is easy to see that for all with . We define the distribution to be the distribution over such that each coordinate is drawn independently from the Laplace distribution with parameter .
3 Full-Information Setting: Privacy for Free
In this section, we describe an algorithmic template (Algorithm 1) for differentially private online linear optimization, based on the Follow-the-Regularized-Leader scheme. Subsequently, we outline the noise injection scheme (Algorithm 2), based on the Tree-based Aggregation Protocol (Dwork et al., 2010), used as a subroutine by Algorithm 1 to ensure input differential privacy. The following is our main theorem in this setting.
Theorem 3.1.
When Algorithm 1 is run with where , regularization , decision set and loss vectors , its regret is bounded by
where
and is the distribution induced by the sum of independent samples from , and represents the dual of the norm with respect to the Hessian of . Moreover, the algorithm is differentially private, i.e. the sequence of predictions produced is differentially private.
The above theorem leads to the following corollary, where we show the bounds obtained in specific instantiations of online linear optimization.
Corollary 3.2.
Substituting the choices of listed below, we specify the regret bounds in each case.

Prediction with Expert Advice: Choosing and ,

OLO over the Sphere: Choosing and

OLO over the Cube: With and
3.1 Proof of Theorem 3.1
We first prove the privacy guarantee, and then prove the claimed bound on the regret. For the analysis, we define the random variable
to be the net amount of noise injected by the TreeBasedAggregation (Algorithm 2) on the true partial sums. Formally, is the difference between the cumulative sum of loss vectors and its differentially private estimate used as input to the argmin oracle. Further, let be the distribution induced by summing independent samples from .
Privacy: To make formal claims about the quality of privacy, we ensure input differential privacy for the algorithm; that is, we ensure that the entire sequence of partial sums of the loss vectors is differentially private. Since the outputs of Algorithm 1 are strictly determined by the prefix sum estimates produced by TreeBasedAgg, by the post-processing theorem, this certifies that the entire sequence of choices made by the algorithm (across all rounds of play) is differentially private. We modify the standard Tree-based Aggregation protocol to make sure that the noise on each output (partial sum) is distributed identically (though not necessarily independently) across time. While this modification is not essential for ensuring privacy, it simplifies the regret analysis.
Lemma 3.3 (Privacy Guarantees with Laplacian Noise).
Choose any . When Algorithm 2 is run with , the following claims hold true:

Privacy: The sequence is differentially private.

Distribution: , where each is independently sampled from .
Proof.
By Theorem 9 of (Jain et al., 2012), we have that the sequence is differentially private. Now the sequence is differentially private because differential privacy is immune to post-processing (Dwork et al., 2014a).
Note that the PrivateSum algorithm adds exactly independent draws from the distribution to , where is the minimum set of already populated nodes in the tree that can compute the required prefix sum. Due to Line 6 in Algorithm 2, it is guaranteed that every prefix sum released is a sum of the true prefix sum and independent draws from . ∎
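The mechanics of tree-based aggregation with identically distributed output noise can be sketched as follows. This is a simplified illustration of the general idea (a binary counter over dyadic blocks, plus padding noise), not a transcription of Algorithm 2: the function names, the padding rule, and the choice of `T.bit_length()` levels are assumptions made here, and the noise scale is left as a free parameter rather than calibrated to a privacy budget.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as a scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_prefix_sums(vectors, scale, rng):
    """Noisy prefix sums via a binary tree of partial sums.

    Each released prefix sum combines at most `levels` noisy tree nodes; it is
    padded with fresh draws so that every output carries exactly `levels`
    independent Laplace terms per coordinate.
    """
    T, n = len(vectors), len(vectors[0])
    levels = max(1, T.bit_length())
    exact = [[0.0] * n for _ in range(levels)]   # true dyadic block sums
    noisy = [[0.0] * n for _ in range(levels)]   # their noisy releases
    outputs = []
    for t in range(1, T + 1):
        i = (t & -t).bit_length() - 1            # lowest set bit of t
        merged = list(vectors[t - 1])
        for j in range(i):                       # fold lower blocks into level i
            merged = [a + b for a, b in zip(merged, exact[j])]
            exact[j] = [0.0] * n
            noisy[j] = [0.0] * n
        exact[i] = merged
        noisy[i] = [m + laplace(scale, rng) for m in merged]
        estimate, used = [0.0] * n, 0
        for j in range(levels):                  # blocks covering rounds 1..t
            if (t >> j) & 1:
                estimate = [a + b for a, b in zip(estimate, noisy[j])]
                used += 1
        for _ in range(levels - used):           # padding noise
            estimate = [a + laplace(scale, rng) for a in estimate]
        outputs.append(estimate)
    return outputs
```

Each input vector contributes to at most `levels` tree nodes, which is why the per-node noise only needs to grow logarithmically with the number of rounds. The padding step adds noise that is not needed for privacy and only serves to make the marginal distribution of the total noise the same at every round.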
Regret Analysis: In this section, we show that for linear loss functions any instantiation of the Follow-the-Regularized-Leader algorithm can be made differentially private with an additive loss in regret.
Theorem 3.4.
For any noise distribution , regularization , decision set and loss vectors , the regret of Algorithm 1 is bounded by
where , , and represents the dual of the norm with respect to the Hessian of .
Proof.
To analyze the regret suffered by Algorithm 1, we consider an alternative algorithm that performs a one-shot noise injection; this alternate algorithm may not be differentially private. The observation here is that the alternate algorithm and Algorithm 1 suffer the same loss in expectation, and therefore the same expected regret, which we bound in the analysis below.
Consider the following alternate algorithm, which, instead of sampling noise at each step, samples noise once at the beginning of the algorithm and plays with respect to that. Formally, consider the sequence of iterates defined as follows. Let .
We have that
(1) 
To see the above equation note that since have the same distribution.
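The step can be spelled out as follows. The notation is assumed here for illustration and may differ from the paper's: $L_{t-1} = \sum_{s<t} l_s$ is the true prefix sum, $Z_t$ is the noise carried by the private estimate at round $t$, $Z$ is the single one-shot draw, $\mathcal{D}$ is their common marginal distribution, and $x_t(z)$ denotes the FTRL iterate computed from $L_{t-1} + z$. The loss sequence is taken to be fixed in advance.

```latex
\begin{align*}
x_t(z) &= \arg\min_{x \in \mathcal{X}} \Big\{ \eta \,\langle L_{t-1} + z,\; x \rangle + R(x) \Big\}, \\
\mathbb{E}_{Z_t \sim \mathcal{D}}\big[\langle l_t,\, x_t(Z_t) \rangle\big]
  &= \mathbb{E}_{Z \sim \mathcal{D}}\big[\langle l_t,\, x_t(Z) \rangle\big]
  \qquad \text{for every } t, \\
\mathbb{E}\Big[\sum_{t=1}^{T} \langle l_t,\, x_t(Z_t) \rangle\Big]
  &= \mathbb{E}\Big[\sum_{t=1}^{T} \langle l_t,\, x_t(Z) \rangle\Big]
  \qquad \text{by linearity of expectation.}
\end{align*}
```

Only the marginal distribution of each $Z_t$ enters the second line, so no independence across rounds is required. The comparator term in the regret does not depend on the noise, hence the two expected regrets agree.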
Therefore it is sufficient to bound the regret of the sequence . The key idea now is to notice that the addition of one-shot noise does not affect the stability term of the FTRL analysis, and therefore the effect of the noise need not be paid at every time step. Our proof will follow the standard template of using the FTL-BTL lemma (Kalai & Vempala, 2005) and then bounding the stability term in the standard way. Formally, define the augmented series of loss functions by defining
where is a sample. Now invoking the Follow the Leader, Be the Leader Lemma (Lemma 5.3, (Hazan et al., 2016)) we get that for any fixed
Therefore we can conclude that
(2)  
(3) 
where Therefore we now need to bound the stability term . Now, the regret bound follows from the standard analysis for the stability term in the FTRL scheme (see for instance (Hazan et al., 2016)). Notice that the bound only depends on the change in the cumulative loss per step i.e. , for which the change is the loss vector across time steps. Therefore we get that
(4) 
∎
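To make the template concrete, here is a minimal sketch of FTRL on the simplex driven by noisy prefix sums. It assumes the negative-entropy regularizer, a common choice for prediction with expert advice, for which the FTRL minimizer has a closed form. The regularizer actually used in Corollary 3.2, the step size, and the function names below are not taken from the paper.

```python
import math


def entropic_ftrl_step(noisy_cumulative_loss, eta):
    """Minimize eta * <L, x> + sum_i x_i log x_i over the simplex.

    The minimizer is proportional to exp(-eta * L); the minimum is subtracted
    before exponentiating purely for numerical stability.
    """
    m = min(noisy_cumulative_loss)
    weights = [math.exp(-eta * (v - m)) for v in noisy_cumulative_loss]
    total = sum(weights)
    return [w / total for w in weights]


def ftrl_plays(noisy_prefix_sums, eta, n):
    """Plays for rounds 1..T, where round t uses the estimate of rounds < t."""
    plays = [entropic_ftrl_step([0.0] * n, eta)]
    for estimate in noisy_prefix_sums[:-1]:
        plays.append(entropic_ftrl_step(estimate, eta))
    return plays
```

The learner touches the loss vectors only through `noisy_prefix_sums`, so if those estimates are released privately, the plays inherit the guarantee by post-processing.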
3.2 Regret Bounds for FTPL
In this section, we outline algorithms based on the Follow-the-Perturbed-Leader template (Kalai & Vempala, 2005). FTPL-based algorithms ensure low regret by perturbing the cumulative sum of loss vectors with noise from a suitably chosen distribution. We show that the noise added in the process of FTPL is sufficient to ensure differential privacy. More concretely, using the regret guarantees due to (Abernethy et al., 2014), for the full-information setting, we establish that the regret guarantees obtained scale as . While Theorem 3.5 is valid for all instances of online linear optimization and achieves regret, it yields sub-optimal dependence on the dimension of the problem. The advantage of FTPL-based approaches over FTRL is that FTPL performs linear optimization over the decision set every round, which is possibly computationally less expensive than solving a convex program every round, as FTRL requires.
Theorem 3.5 (FTPL: Online Linear Optimization).
Let and . Choosing and , we have that is
Moreover the algorithm is differentially private.
The proof of the theorem is deferred to the appendix.
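The computational point about FTPL can be seen in a small sketch over the cube, where the per-round linear optimization reduces to a coordinate-wise sign. This is an illustration under simplifying assumptions: fresh Gaussian noise of standard deviation `sigma` is drawn each round, whereas in the private algorithm the perturbation would come from the noisy prefix-sum estimates. The decision set is taken to be the cube with coordinates in [-1, 1], and the function names are hypothetical.

```python
import random


def linear_min_over_cube(v):
    """argmin of <v, x> over the cube [-1, 1]^n: oppose the sign of each coordinate."""
    return [-1.0 if c > 0 else 1.0 for c in v]


def ftpl_over_cube(losses, sigma, rng):
    """Follow-the-Perturbed-Leader: optimize against perturbed cumulative losses."""
    n = len(losses[0])
    cumulative = [0.0] * n
    plays = []
    for loss in losses:
        perturbed = [c + rng.gauss(0.0, sigma) for c in cumulative]
        plays.append(linear_min_over_cube(perturbed))   # one linear oracle call
        cumulative = [c + l for c, l in zip(cumulative, loss)]
    return plays
```

Each round costs a single call to a linear optimization oracle, in contrast to the regularized convex program solved by FTRL.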
4 Differentially Private Multi-Armed Bandits
In this section, we state our main results regarding bandit linear optimization, the algorithms that achieve them, and prove the associated regret bounds. The following is our main theorem concerning non-stochastic multi-armed bandits.
Theorem 4.1 (Differentially Private Multi-Armed Bandits).
Bandit Feedback: Reduction to the Non-private Setting
We begin by describing an algorithmic reduction that takes as input a non-private bandit algorithm and translates it into a differentially private bandit algorithm. The reduction works in a straightforward manner by adding the requisite magnitude of Laplace noise to ensure differential privacy. For the rest of this section, for ease of exposition we will assume that both and are sufficiently large.
The following lemma characterizes the conditions under which Algorithm 4 is differentially private.
Lemma 4.2 (Privacy Guarantees).
Assume that each loss vector is in the set , such that . For where , the sequence of outputs produced by the Algorithm is differentially private.
The following lemma characterizes the regret of Algorithm 4. In particular, we show that the regret of Algorithm 4 is, in expectation, the same as the regret of the input algorithm on a perturbed version of the loss vectors.
Lemma 4.3 (Noisy Online Optimization).
Consider a loss sequence and a convex set . Define a perturbed version of the sequence as random vectors where is a random vector such that are independent and for all .
Let be a full information (or bandit) online algorithm which outputs a sequence and takes as input (respectively ) at time . Let be a fixed point in the convex set. Then we have that
Proof of Lemma 4.2.
Consider a pair of sequences of loss vectors that differ at exactly one time step – say and . Since the prediction produced by the algorithm at any time step can only depend on the loss vectors in the past , it is clear that the distribution of the output of the algorithm for the first rounds is unaltered. We claim that , it holds that
Before we justify the claim, let us see how this implies the desired statement. To see this, note that conditioned on the value fed to the inner algorithm at time , the distribution of all outputs produced by the algorithm is completely determined, since the feedback to the algorithm at other time steps (discounting ) stays the same (in distribution). By the above discussion, it is sufficient to demonstrate differential privacy for each input fed (as feedback) to the algorithm .
For the sake of analysis, define as follows. If , define . Else, define to be such that if and only if and otherwise, where breaks ties arbitrarily. Define . Now note that .
It suffices to establish that each is differentially private. To see this, note that the Laplace mechanism (Dwork et al., 2014a) ensures exactly this, since the norm of is bounded by . ∎
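For reference, the Laplace mechanism invoked in this argument can be sketched in a few lines. The standard statement is that adding independent Laplace noise of scale (L1 sensitivity) / epsilon to each coordinate yields epsilon-differential privacy. The function names below are illustrative, and the sensitivity value to use for a given loss set is an input that the caller must supply, not something fixed by this sketch.

```python
import random


def laplace_scale(l1_sensitivity, epsilon):
    """Noise scale required by the Laplace mechanism for an epsilon guarantee."""
    if epsilon <= 0 or l1_sensitivity < 0:
        raise ValueError("epsilon must be positive and sensitivity non-negative")
    return l1_sensitivity / epsilon


def laplace_mechanism(vector, l1_sensitivity, epsilon, rng):
    """Release `vector` with independent Laplace noise on every coordinate."""
    scale = laplace_scale(l1_sensitivity, epsilon)
    # A scaled difference of two Exp(1) draws is a Laplace(0, scale) sample.
    return [v + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
            for v in vector]
```

A smaller epsilon (stronger privacy) or a larger sensitivity both increase the noise scale proportionally.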
4.1 Proof of Theorem 4.1
Privacy: Note that since as . Therefore by Lemma 4.2, setting is sufficient to ensure differential privacy.
Regret Analysis: For the purpose of analysis we define the following pseudo loss vectors.
where by definition . The following follows from Fact C.1 proved in the appendix.
Taking a union bound, we have
(5) 
To bound the norm of the loss we define the event . We have from (5) that . We now have that
Since the regret is always bounded by we get that the second term above is at most 1. Therefore we will concern ourselves with bounding the first term above. Note that remains independent and symmetric even when conditioned on the event . Moreover the following statements also hold.
(6) 
(7) 
Equation (6) follows by noting that remains symmetric around the origin even after conditioning. It can now be seen that Lemma 4.3 still applies even when the noise is sampled from conditioned under the event (due to Equation 6). Therefore we have that
(8) 
To bound the above quantity we make use of the following lemma, which is a specialization of Theorem 1 in (Bubeck et al., 2012a) to the case of multi-armed bandits.
Lemma 4.4 (Regret Guarantee for Algorithm 5).
If is such that , we have that the regret of Algorithm 5 is bounded by
Now note that due to the conditioning and therefore we have that
It can be seen that the condition in Lemma 4.4 is satisfied for exploration and under the condition as long as
which holds by the choice of these parameters. Finally
4.2 Differentially Private Bandit Linear Optimization
In this section we prove a general result about bandit linear optimization over general convex sets, the proof of which is deferred to the appendix.
Theorem 4.5 (Bandit Linear Optimization).
Let be a convex set. Fix loss vectors such that . When Algorithm 4 is run with parameters (with ) and the algorithm of (Abernethy et al., 2012) with step parameter , its regret is bounded by
where is the self-concordance parameter of the convex body . Moreover, the algorithm is differentially private.
5 Conclusion
In this work, we demonstrate that ensuring differential privacy leads to only a constant additive increase in the incurred regret for online linear optimization in the full feedback setting. We also show nearly optimal bounds (in terms of T) in the bandit feedback setting. Multiple avenues for future research arise, including extending our bandit results to other challenging partial-information models such as semi-bandits, combinatorial bandits and contextual bandits. Another important unresolved question is whether it is possible to achieve an additive separation in in the adversarial bandit setting.
References
Abernethy et al. (2008) Abernethy, Jacob, Hazan, Elad, and Rakhlin, Alexander. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pp. 263–274, 2008.
Abernethy et al. (2014) Abernethy, Jacob, Lee, Chansoo, Sinha, Abhinav, and Tewari, Ambuj. Online linear optimization via smoothing. In COLT, pp. 807–823, 2014.
Abernethy et al. (2012) Abernethy, Jacob D, Hazan, Elad, and Rakhlin, Alexander. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012.
Auer et al. (2002) Auer, Peter, Cesa-Bianchi, Nicolo, Freund, Yoav, and Schapire, Robert E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
Bubeck et al. (2012a) Bubeck, Sébastien, Cesa-Bianchi, Nicolo, Kakade, Sham M, Mannor, Shie, Srebro, Nathan, and Williamson, Robert C. Towards minimax policies for online linear optimization with bandit feedback. In COLT, volume 23, 2012a.
Bubeck et al. (2012b) Bubeck, Sébastien, Cesa-Bianchi, Nicolo, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012b.
Dwork et al. (2006) Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and Smith, Adam. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284. Springer, 2006.
Dwork et al. (2010) Dwork, Cynthia, Naor, Moni, Pitassi, Toniann, and Rothblum, Guy N. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 715–724. ACM, 2010.
Dwork et al. (2014a) Dwork, Cynthia, Roth, Aaron, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014a.
Dwork et al. (2014b) Dwork, Cynthia, Talwar, Kunal, Thakurta, Abhradeep, and Zhang, Li. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pp. 11–20. ACM, 2014b.
Hazan et al. (2016) Hazan, Elad et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016.
Jain & Thakurta (2014) Jain, Prateek and Thakurta, Abhradeep G. (near) dimension independent risk bounds for differentially private learning. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 476–484, 2014.
Jain et al. (2012) Jain, Prateek, Kothari, Pravesh, and Thakurta, Abhradeep. Differentially private online learning. In COLT, volume 23, pp. 24–1, 2012.
Kalai & Vempala (2005) Kalai, Adam and Vempala, Santosh. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
Shalev-Shwartz (2011) Shalev-Shwartz, Shai. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
Smith & Thakurta (2013) Smith, Adam and Thakurta, Abhradeep Guha. (nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems, pp. 2733–2741, 2013.
Tossou & Dimitrakakis (2016) Tossou, Aristide and Dimitrakakis, Christos. Algorithms for differentially private multi-armed bandits. In AAAI 2016, 2016.
Tossou & Dimitrakakis (2017) Tossou, Aristide C. Y. and Dimitrakakis, Christos. Achieving privacy in the adversarial multi-armed bandit. In 14th International Conference on Artificial Intelligence (AAAI 2017), 2017.
Appendix A Proofs for FTPL (Theorem 3.5)
Theorem A.1 (Privacy Guarantees with Gaussian Noise).
Choose any . When Algorithm 2 is run with , the following claims hold true:

Privacy: The sequence is differentially private.

Distribution: It holds that .
Proof.
By Theorem 9 of (Jain et al., 2012), we have that the sequence is differentially private. Now the sequence is differentially private because differential privacy is immune to post-processing (Dwork et al., 2014a).
Note that the PrivateSum algorithm adds exactly independent draws from the distribution to , where is the minimum set of already populated nodes in the tree that can compute the required prefix sum. Due to Line 6, it is guaranteed that every prefix sum released is a sum of the prefix sum and independent draws from . ∎
Appendix B Noisy OCO Theorem Proof
We now give the proof of Lemma 4.3.
Proof.
The proof of the lemma is a straightforward calculation. First note that
Now we have that