1 Introduction
Contextual bandits are an important learning paradigm used by many recommendation systems (Langford and Zhang, 2008). The paradigm considers a series of interactions between the learner and the environment: in each interaction, the learner receives a context feature and selects an arm based on that context. The environment provides the learner with a reward after the arm is pulled (i.e., an action is executed). In traditional contextual bandit scenarios, the learner is a single party: that is, the party that pulls the arm is also the party who has access to all context features and who receives the reward. In many practical scenarios, however, contextual bandit learning involves multiple parties: for example, recommendation systems generally involve content producers, content consumers, and the party that operates the recommendation service itself. These parties may not be willing or allowed to share all the information with each other that is needed to produce good recommendations. For instance, a travel-recommendation service could recommend better itineraries by taking prior airline bookings, hotel reservations, and airline and hotel reviews into account as context. To do so, the travel-recommendation service requires data from booking, reservation, and review systems that may be operated by other parties. Similarly, a restaurant-recommendation service may be improved by considering a user's prior reservations made via a restaurant-reservation service.
In this paper, we develop a privacy-preserving contextual bandit that successfully learns in such settings. We study a multi-party contextual bandit setting in which: (1) all parties may provide some of the context features but none of the parties may learn each other's features, (2) the party that pulls the arm is the only one who may know which arm was pulled, and (3) the party that receives the reward is the only one who may observe the reward value. To learn effectively, our algorithm combines techniques from secure multi-party computation (Ben-Or et al., 1988) and differential privacy (Dwork et al., 2006). It achieves a high degree of privacy with limited losses in prediction accuracy by exploiting the fact that exploration mechanisms in contextual bandits already provide differential privacy. We provide theoretical guarantees on the privacy of our algorithm and empirically demonstrate its efficacy.
2 Problem Statement
Learning setting. We consider a multi-party contextual bandit setting with P parties, K arms, and T iterations. Each iteration, t, of the contextual-bandit learner comprises three main stages:

1. Each party provides its context features for iteration t in a privacy-preserving way.

2. The arm-pulling party pulls arm a_t without revealing a_t to any of the other parties.

3. The reward-receiving party receives reward r_t without revealing r_t to any of the other parties.
The parties collaborate with the aim of learning a joint policy π that maximizes the expected average reward over all T iterations, (1/T) E[Σ_{t=1}^{T} r_t], where the expectation is over randomness in the environment. Our algorithm comes with a theoretical bound on the regret: the difference between the total reward obtained by taking the best action at each iteration and that obtained by following policy π.
Our policy set, Π, is the set of all models with a linear relation between the context features and the corresponding score for each arm (Li et al., 2010). In particular, let x_t be the concatenated context features of all parties at iteration t. To compute the score for arm a at iteration t, we use a linear model s_{a,t} = w_a^T x_t. Let X_{a,t} be the design matrix at iteration t, which consists of the context vectors of the iterations in which arm a was pulled. Similarly, r_{a,t} is the vector of observed rewards for the same arm. The weights at iteration t are found by minimizing the least-squares objective using the standard linear least-squares solution: w_{a,t} = A_{a,t}^{-1} b_{a,t}, where A_{a,t} = X_{a,t}^T X_{a,t} and b_{a,t} = X_{a,t}^T r_{a,t}.

Security model. In line with the cooperative nature of our learning setting, we assume an honest-but-curious security model (Goldreich, 2009): we assume parties do not collude and follow the specified algorithm, but may try to learn as much as possible from the information they observe when executing the algorithm.
Our algorithm comes with a differential-privacy guarantee on the information that the arm-pulling and reward-receiving parties can obtain on the context features. The action a_t must be revealed to the arm-pulling party so that it can pull the corresponding arm. Similarly, the reward-receiving party must receive the reward r_t. Hence, some information is ultimately revealed to these two parties. We provide a differential-privacy guarantee on this information leakage by exploiting the randomness introduced by epsilon-greedy exploration.
The other parties do not learn anything about context features that they did not provide themselves. The algorithm assumes that all parties have access to privately shared random numbers generated during an offline phase. (These can be generated by a trusted third party or by a leveled homomorphic-encryption implementation (Brakerski et al., 2012).)
3 Privacy-Preserving Contextual Bandits
Our privacy-preserving contextual bandit learner employs a relatively standard epsilon-greedy policy that assumes a linear relation between the context features and the score for an arm. To obtain privacy guarantees, we use arithmetic secret-sharing techniques commonly used in secure multi-party computation (MPC) to implement our algorithm (Damgård et al., 2011). We rely on the differentially private properties of epsilon-greedy policies when performing actions (see Section 4).
In arithmetic secret sharing, a scalar value x in the ring Z_{2^L} (a ring with 2^L elements, where L is large) is shared across the parties in such a way that the sum of the shares reconstructs the original value x. We denote the sharing of x by [x], where [x]_p indicates party p's share of x. The representation has the property that Σ_p [x]_p mod 2^L = x. We use a simple encoding to represent the real-valued context features and model weights in the ring: to obtain the encoding x̂, we multiply x with a large scaling factor B and round to the nearest integer: x̂ = ⌊Bx⌉, where B = 2^b for some precision parameter, b. We decode an encoded value, x̂, by computing x̂/B. Encoding real-valued numbers this way incurs a precision loss that is inversely proportional to B.
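As an illustration, the encoding can be sketched in a few lines of Python. The ring size L = 64 matches the experiments in Section 5; the precision parameter b = 16 is an illustrative choice, not necessarily the paper's setting.

```python
# Sketch of the fixed-point encoding into the ring Z_{2**L}.
L = 64          # ring size 2**L (64-bit ring, as in Section 5)
B_BITS = 16     # precision parameter b (illustrative choice)
B = 2 ** B_BITS
RING = 2 ** L

def encode(x: float) -> int:
    """Encode a real value as an element of Z_{2**L}: scale by B and round."""
    return round(x * B) % RING

def decode(x_enc: int) -> float:
    """Decode, interpreting the upper half of the ring as negative values."""
    if x_enc >= RING // 2:
        x_enc -= RING
    return x_enc / B

# The precision loss from rounding is at most 1 / (2 * B).
assert abs(decode(encode(3.14159)) - 3.14159) <= 1 / B
assert abs(decode(encode(-2.5)) - (-2.5)) <= 1 / B
```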
Party p shares a context feature x by drawing P − 1 numbers uniformly at random from Z_{2^L} and distributing them among the other parties. These form the shares of the other P − 1 parties. Subsequently, party p computes its own share as [x]_p = x − Σ_{q ≠ p} [x]_q mod 2^L. As a result, none of the other parties can infer any information about x from their share. To allow the arm-pulling party to pull an arm, all parties communicate their shares of the selected action to that party, which reconstructs the action a_t and performs it. Subsequently, the reward-receiving party receives the reward, r_t.
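A plaintext simulation of this sharing scheme can be sketched as follows; the ring size 2^64 and the party count are assumptions for illustration, and the network communication between parties is elided.

```python
import random

RING = 2 ** 64  # ring Z_{2**64}

def share(x: int, num_parties: int) -> list[int]:
    """Additively secret-share x: draw P-1 shares uniformly at random; the
    sharing party keeps x minus the sum of those shares (mod 2**64)."""
    shares = [random.randrange(RING) for _ in range(num_parties - 1)]
    shares.append((x - sum(shares)) % RING)
    return shares

def reconstruct(shares: list[int]) -> int:
    """A party holding all shares recovers x by summing them mod 2**64."""
    return sum(shares) % RING

shares = share(12345, num_parties=3)
assert reconstruct(shares) == 12345
```

Each individual share is uniformly distributed over the ring, which is why no single party learns anything about x from its own share.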
Algorithm LABEL:alg:bandits describes our privacy-preserving contextual bandit learner. All computations in the algorithm are performed by the parties directly in the secret-shared representation, without leaking information to the other parties. All computations on shares are performed modulo 2^L.
We rely on the homomorphic properties of additive secret sharing in order to perform computations directly on the encrypted data. Below, we give an overview of how these computations are implemented. The primary cost in executing Algorithm LABEL:alg:bandits is the number of communication rounds needed between parties for certain operations, for example, the evaluation of the argmax function. We note that this communication can sometimes be overlapped with other computations.
Addition. The addition of two secret-shared values, [x] + [y], can be trivially implemented by having each party sum their shares of x and y. That is, each party p computes [x]_p + [y]_p mod 2^L.
Multiplication. To facilitate the multiplication of two secret-shared values, the parties use random Beaver triples (Beaver, 1991) that were generated in an offline preprocessing phase. A Beaver triple of secret-shared values ([a], [b], [c]) satisfies the property c = ab. The parties use the Beaver triple to compute [α] = [x] − [a] and [β] = [y] − [b], and decrypt α and β. This does not leak information because a and b were drawn uniformly at random from the ring. The product [xy] can now be evaluated by computing [c] + α[b] + β[a] + αβ. It is straightforward to confirm that the result of the private multiplication is correct:

c + αb + βa + αβ = ab + (x − a)b + (y − b)a + (x − a)(y − b) = xy.    (1)

Decrypting α and β requires a round of communication among all parties: a communication round. The required correction for the additional scaling term, B, incurs a second communication round.
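The multiplication protocol can be sketched as follows. The Beaver triple is simulated locally here, whereas in the actual protocol it is produced in the offline phase; similarly, the openings of α and β stand in for the communication rounds described above.

```python
import random

RING = 2 ** 64

def share(x, P=3):
    """Additively secret-share x among P parties."""
    s = [random.randrange(RING) for _ in range(P - 1)]
    return s + [(x - sum(s)) % RING]

def reveal(shares):
    """Open a sharing by summing all shares mod 2**64."""
    return sum(shares) % RING

def beaver_mul(x_sh, y_sh, P=3):
    """Multiply two secret-shared values with a Beaver triple (a, b, c = ab).
    Triple generation is simulated here; it belongs to the offline phase."""
    a, b = random.randrange(RING), random.randrange(RING)
    a_sh, b_sh, c_sh = share(a, P), share(b, P), share((a * b) % RING, P)
    # The parties jointly open alpha = x - a and beta = y - b (one
    # communication round, done for both in parallel); these values reveal
    # nothing about x and y because a and b are uniformly random.
    alpha = (reveal(x_sh) - a) % RING
    beta = (reveal(y_sh) - b) % RING
    # [xy] = [c] + alpha*[b] + beta*[a] + alpha*beta (added to one share).
    out = [(c_sh[p] + alpha * b_sh[p] + beta * a_sh[p]) % RING
           for p in range(P)]
    out[0] = (out[0] + alpha * beta) % RING
    return out

x_sh, y_sh = share(7), share(6)
assert reveal(beaver_mul(x_sh, y_sh)) == 42
```

The sketch omits the fixed-point truncation step (dividing the product by the scaling factor B), which in the protocol costs the second communication round mentioned above.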
Square. To compute the square [x^2], the parties use a Beaver pair ([a], [a^2]). As before, the parties compute [α] = [x] − [a], decrypt α, and obtain the result via [x^2] = [a^2] + 2α[a] + α^2.
Dot product, matrix-vector, and matrix-matrix multiplication. The operations on scalars described above can readily be used to perform operations on vectors and matrices that are secret-shared in elementwise fashion. Specifically, dot products combine multiple elementwise multiplications and additions. Matrix-vector and matrix-matrix multiplications are implemented by repeated computation of dot products of two arithmetically secret-shared vectors.
Matrix inverse. At each round, the algorithm computes inverses of the per-arm matrices A_{a,t} = X_{a,t}^T X_{a,t} (in line 10), which is computationally costly. Because we only perform rank-one updates of each matrix, we can maintain a representation of the inverse A_{a,t}^{-1} instead, rendering the matrix inversion in line 10 superfluous, and use the Sherman-Morrison formula (Bartlett, 1951) to perform the update in line 21:
(A + x x^T)^{-1} = A^{-1} − (A^{-1} x x^T A^{-1}) / (1 + x^T A^{-1} x).    (2)
This expression comprises only multiplications, additions, and a reciprocal (see below).
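The update can be sketched in plaintext NumPy; the secure version performs the same multiplications and additions on shares, with the division replaced by the private reciprocal described next.

```python
import numpy as np

def sherman_morrison_update(A_inv: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Rank-one inverse update: returns (A + x x^T)^{-1} given A^{-1}.
    Only multiplications, additions, and one reciprocal are needed, so the
    same computation can be carried out on secret shares."""
    Ax = A_inv @ x
    denom = 1.0 + x @ Ax           # >= 1 because A^{-1} is positive definite
    return A_inv - np.outer(Ax, Ax) / denom

# Check against recomputing the inverse from scratch.
rng = np.random.default_rng(0)
A = np.eye(4)
A_inv = np.eye(4)
for _ in range(5):
    x = rng.standard_normal(4)
    A += np.outer(x, x)
    A_inv = sherman_morrison_update(A_inv, x)
assert np.allclose(A_inv, np.linalg.inv(A))
```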
Reciprocal. We compute the reciprocal 1/x using a Newton-Raphson approximation with iterates y_{n+1} = y_n (2 − x y_n). The Newton-Raphson approximation converges rapidly when y_0 is initialized well. The reciprocal is only used in the Sherman-Morrison formula, so we choose the initialization with this in mind. Because A^{-1} in Equation 2 is the inverse of a positive-definite matrix, it is itself positive definite; the denominator in Equation 2, therefore, is at least 1. Empirically, we find that a fixed initial value works well in this range.
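The iteration can be sketched in plaintext as below; the initial value 0.1 and the iteration count are illustrative choices, not the paper's settings. The iteration converges whenever 0 < y_0 < 2/x, since the error 1 − x y_n squares at every step.

```python
def reciprocal(x: float, y0: float = 0.1, iters: int = 10) -> float:
    """Newton-Raphson iteration y_{n+1} = y_n * (2 - x * y_n) for 1/x.
    Uses only additions and multiplications, so the same computation can
    run on secret shares. y0 = 0.1 is an illustrative initialization."""
    y = y0
    for _ in range(iters):
        y = y * (2.0 - x * y)
    return y

assert abs(reciprocal(4.0) - 0.25) < 1e-6
assert abs(reciprocal(1.0) - 1.0) < 1e-6
```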
Exponential. We note that e^x = lim_{n→∞} (1 + x/n)^n. We approximate this quantity by setting n = 2^m and evaluating the following expression for a moderate value of m:

e^x ≈ (1 + x/2^m)^{2^m}.

Herein, we can efficiently compute the 2^m-th power in m rounds by repeatedly squaring the intermediate result.
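A plaintext sketch of the approximation, with an illustrative choice of m:

```python
def secure_exp(x: float, m: int = 10) -> float:
    """Approximate e^x via (1 + x/2^m)^(2^m), computed with m repeated
    squarings; on secret shares, each squaring costs one round."""
    y = 1.0 + x / (2 ** m)
    for _ in range(m):
        y = y * y
    return y

import math
assert abs(secure_exp(1.0) - math.e) < 1e-2
```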
Argmax. We compute the index of the maximum value of a vector s, represented as a one-hot vector of the same dimension. The algorithm for evaluating the argmax has three main stages:

1. Use Algorithm 1 to construct a vector m of the same length as s that contains ones at the indices of all maximum values of s and zeros elsewhere.

2. Break ties in m by multiplying it elementwise with a random permutation of (1, …, K). This permutation is generated and securely shared offline.

3. Use Algorithm 1 to construct a one-hot vector that indicates the maximum value of the resulting vector.
The permutation in stage 2 randomly breaks ties for the index of the maximum value. We opt for random tie-breaking because breaking ties deterministically may leak information. For example, if ties were broken by always selecting the last maximum value, an adversary that observed a one-hot vector with a one in its first position would learn that s did not have multiple maximum values.
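The three stages can be sketched in plaintext as follows; Algorithm 1's secure max-bit computation is replaced by direct comparisons for illustration.

```python
import random

def private_style_argmax(s: list[int]) -> list[int]:
    """Plaintext sketch of the three-stage argmax: (1) mark all maxima,
    (2) break ties with a random permutation of 1..K, (3) mark the maximum
    again. The result is one-hot; ties are broken uniformly at random."""
    m = [1 if v == max(s) else 0 for v in s]           # stage 1
    pi = random.sample(range(1, len(s) + 1), len(s))   # shared permutation
    t = [mi * p for mi, p in zip(m, pi)]               # stage 2: tie-break
    return [1 if v == max(t) else 0 for v in t]        # stage 3

one_hot = private_style_argmax([3, 7, 7, 1])
# The single 1 lands on one of the two tied maxima, chosen at random.
assert sum(one_hot) == 1 and one_hot[0] == 0 and one_hot[3] == 0
```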
In Algorithm 1, the evaluation of the comparison terms is performed on binary secret shares of the inputs. A binary secret share is a special type of secret sharing for binary data in which the ring size is 2, so that share addition amounts to XOR (Goldreich et al., 1987). To convert an arithmetic share [x] into a binary share ⟨x⟩, each party first secretly shares its arithmetic share with the other parties, and the parties then perform a bitwise addition of the resulting shares. To construct the binary sharing of its arithmetic share [x]_p, party p: (1) draws P − 1 random bit strings and shares those with the other parties, and (2) computes its own binary share as the XOR of [x]_p with those bit strings. The parties have now each obtained a binary share of [x]_p without having to decrypt it. This process is repeated for each party to create binary shares of all arithmetic shares [x]_p. Subsequently, the parties compute ⟨x⟩ = Σ_p ⟨[x]_p⟩. The summation is implemented by a ripple-carry adder in O(log L) rounds (Catrina and De Hoogh, 2010).
Subsequently, the comparison operation is performed by computing [x] − [y], constructing the binary secret sharing per the procedure outlined above, obtaining the most significant (sign) bit, and converting that bit back to an arithmetic share. To convert from a binary share ⟨x⟩ to an arithmetic share [x], the parties compute Σ_l 2^l [⟨x⟩^{(l)}], where ⟨x⟩^{(l)} contains the l-th bits of the binary share and the sum runs over all L bits in the shared secret. To create the arithmetic share of a bit, each party draws numbers uniformly at random from Z_{2^L} and shares the difference between its bit and the random numbers with the other parties. The parties sum all resulting shares to obtain the arithmetic share of the bit.
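The arithmetic that the comparison implements can be sketched in plaintext on ring-encoded values; the binary-share extraction itself is elided, and only the sign-bit logic is shown.

```python
RING = 2 ** 64

def less_than(x_enc: int, y_enc: int) -> int:
    """Plaintext sketch of the secure comparison: compute x - y in the ring
    and read off the most significant (sign) bit of the result. In the
    protocol, this bit is extracted from a binary secret sharing and then
    converted back to an arithmetic share."""
    diff = (x_enc - y_enc) % RING
    # The MSB is 1 iff x - y wraps into the "negative" half of the ring.
    return diff >> 63

assert less_than(5, 9) == 1
assert less_than(9, 5) == 0
assert less_than(7, 7) == 0
```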
Overall, the evaluation of the less-than operation requires seven communication rounds. We parallelize the reduction over the bits, and perform the reduction over the K elements of the vector using a binary reduction tree in O(log K) communication rounds.
4 Privacy Guarantee
The privacy guarantees for our algorithm rely on: (1) well-known guarantees on the security of arithmetic and binary secret-sharing mechanisms and (2) the differentially private opening of actions to the arm-pulling party. For security guarantees of secret sharing, we refer the reader to Damgård et al. (2011). We focus on the differentially private opening of actions. Our primary observation is a natural link between epsilon-greedy exploration and differential privacy, formalized in Theorem 1.
A mechanism M is ε-differentially private if for all datasets D and D′ that differ by a single example, and for all output sets S, the following condition holds (Dwork, 2011): Pr[M(D) ∈ S] ≤ e^ε Pr[M(D′) ∈ S].
Theorem 1.
If only the selected action at round t is revealed, then epsilon-greedy exploration over K arms with exploration probability γ is ε-differentially private with ε = ln(1 + K(1 − γ)/γ).
Proof.
The probability of selecting the action that corresponds to the maximum score is (1 − γ) + γ/K, whereas the probability of selecting any other action is γ/K. We use these probabilities to bound the privacy loss, ε: for any two inputs that differ in a single example and any observed action, the ratio of selection probabilities is largest when the action is greedy under one input but not under the other. Using the fact that the exploration parameter satisfies 0 < γ ≤ 1, we also observe that γ/K ≤ (1 − γ) + γ/K. To complete the proof, we observe that:

ε = ln( ((1 − γ) + γ/K) / (γ/K) ) = ln(1 + K(1 − γ)/γ). ∎
The above result is a generalization of the randomized response protocol (Warner, 1965) to K arms and arbitrary exploration parameter γ. Because the privacy loss grows logarithmically with the number of actions, we obtain differentially private action selection for reasonable settings of the exploration parameter γ.
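This privacy loss is easy to compute numerically; the sketch below assumes the selection probabilities stated above (the greedy arm with probability (1 − γ) + γ/K, every other arm with probability γ/K).

```python
import math

def epsilon_greedy_privacy_loss(K: int, gamma: float) -> float:
    """Privacy loss of revealing one epsilon-greedy action over K arms with
    exploration probability gamma (assumed form of the bound): the worst-case
    probability ratio is ((1 - gamma) + gamma/K) / (gamma/K),
    i.e. 1 + K*(1 - gamma)/gamma."""
    return math.log(1.0 + K * (1.0 - gamma) / gamma)

# More exploration (larger gamma) means a smaller privacy loss.
assert epsilon_greedy_privacy_loss(10, 0.9) < epsilon_greedy_privacy_loss(10, 0.1)
# With full exploration, the action is independent of the context: eps = 0.
assert abs(epsilon_greedy_privacy_loss(10, 1.0)) < 1e-12
```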
5 Experiments
We perform experiments on the MNIST dataset to evaluate the efficiency and reward-privacy trade-off of our algorithm. We reduce the data dimensionality by projecting each digit image onto the first 20 principal components of the dataset, and normalize each resulting vector to have unit length. We perform a single sweep through the images in the MNIST training set. At each iteration, the parties receive a new secret-shared image and need to select one of K = 10 arms (digit classes). The reward is 1 if the selected arm corresponds to the correct digit class, and 0 otherwise. We implement Algorithm LABEL:alg:bandits on a ring of size 2^64. We rely on the wrap-around property of 64-bit integer operations, in which a result that is too large to fit in 64 bits is automatically reduced modulo 2^64. We use a fixed number of bits of precision to encode floating-point values into the ring (see supplemental material) and a fixed number of Newton-Raphson iterations for computing the reciprocal. Most of our experiments are performed using a small number of parties. Code reproducing the results of our experiments is available from http://www.anonymized.com.
Reward and privacy. Figure 1 shows the average reward that our private contextual bandit obtains during its sweep over the dataset for four different values of the exploration parameter. We compared the results obtained by our algorithm to those of a "plaintext" implementation of epsilon-greedy contextual bandits, and confirmed that the observed rewards are the same for a range of exploration values.
Figure 2 shows the average reward (averaged over runs) observed as a function of the differential-privacy parameter, ε (higher values represent less privacy). The results were obtained by varying the exploration parameter and are shown for experiments with three different dataset sizes. The results show that at certain levels of differential privacy, the reward obtained by the private algorithm is higher than that of its non-private counterpart. Indeed, some amount of exploration benefits the learner whilst also providing differential privacy. For higher levels of privacy, however, the reward obtained starts to decrease rapidly because too much exploration is needed to obtain the required level of privacy.
Table 1: Communication rounds and slowdown of key operations.

Operation      | Rounds | Slowdown
Addition       | 0      |
Multiplication | 2      |
Reciprocal     | 30     |
Argmax         |        |
Efficiency and scale. Table 1 reports the runtime of the key operations in Algorithm LABEL:alg:bandits, compared to an implementation of the same operations in PyTorch. In our experiments, the private contextual-bandit implementation with the number of parties and actions described above is nearly 500× slower than a regular implementation. In real-world settings, the slowdown would likely be even higher because of network latency: our experiments were performed on a single machine on which each party is implemented as a separate process. There are two fundamental sources of inefficiency in Algorithm LABEL:alg:bandits:
1. The weight update touches only the weights of the selected arm in a regular contextual-bandit implementation, but must touch the weights of all K arms in Algorithm LABEL:alg:bandits: the private implementation cannot reveal the selected arm to the other parties and, therefore, has to update all the weights. Fortunately, these weight updates are trivially parallelizable over arms.

2. Some operations (e.g., reciprocal and argmax) require additional computation and communication.
We also perform experiments in which we vary the number of arms, K. To increase the number of arms, we construct a k-means clustering of the dataset and set the number of arms to the number of clusters. We define the rewards to be Bernoulli-distributed with success probabilities given by Gaussian-kernel values based on the distances from a data point to the inferred cluster centers: p_a ∝ exp(−‖x − c_a‖² / (2σ²)), where c_a is the a-th cluster center and σ is used to control the difficulty of the problem. We fix σ in our experiments. Figure 3 demonstrates how the contextual-bandit algorithm scales with the number of arms. For small numbers of arms, the implementation overhead dominates the computation time. For larger numbers of arms, we observe quadratic scaling in the number of arms. Figure 4 shows how the algorithm scales as a function of the number of parties, P. The results illustrate that the runtime of our algorithm is quadratic in P: all parties communicate with each other in every communication round, which leads to the quadratic scaling observed.

Figure 5 presents results that demonstrate how the reward changes as a function of the privacy loss when the number of arms, K, is varied. The results show that the privacy loss increases logarithmically in the number of arms, but that the amount of exploration needed also increases. As a result, the optimal privacy loss in terms of reward tends to increase only slightly as the number of arms in the bandit increases. Still, this increase may be prohibitively large for web-scale recommendation applications in which the bandit has to select one arm out of millions of arms.
Membership-inference attacks. To empirically measure the privacy of our contextual-bandit algorithm, we also performed experiments in which an adversary tries to infer whether or not a sample was part of the training dataset by applying the membership-inference attack of Yeom et al. (2018) on model checkpoints saved at various points during training. The membership-inference attack computes empirical estimates of the joint action-reward distribution on the data that was used to train the model and on a held-out test set, respectively (we use the MNIST test set as held-out set). We use the resulting estimates, p_train(a, r) and p_test(a, r), to infer training-data membership for an example x. Specifically, we: (1) evaluate the model on x, (2) observe the selected arm a, (3) receive the corresponding reward r, and (4) predict training-data membership if p_train(a, r) > p_test(a, r).

Following Yeom et al. (2018), we measure the advantage of the adversary: the difference between the true-positive rate and the false-positive rate in predicting training-set membership. The adversary advantage during training is shown in Figure 6 for models trained with different values of the exploration parameter. The results show that, in the early stages of learning, the adversary has a slight advantage; this advantage rapidly decreases after the learner has observed more training examples. (For higher values of the exploration parameter, there is more variance in the model parameters during training, which is reflected in higher variance in the advantage values.) The advantage slightly increases in the later stages of training: interestingly, this happens because the model slightly underfits on the MNIST dataset. Overall, the empirical results suggest that our contextual-bandit learner is, indeed, maintaining privacy well in practice.

6 Related Work
This study fits into a larger body of work on privacy-preserving machine learning. Prior work has used similar techniques from secure multi-party computation (and homomorphic encryption) for secure evaluation and/or training of deep networks (Dowlin et al., 2016; Hynes et al., 2018; Juvekar et al., 2018; Mohassel and Zhang, 2017; Riazi et al., 2017; Shokri and Shmatikov, 2015; Wagh et al., 2018) and has developed secure data-aggregation techniques (Bonawitz et al., 2017) for use in federated-learning scenarios (Bonawitz et al., 2019). To the best of our knowledge, our study is the first to use this family of techniques in online learning, using the randomness introduced by exploration mechanisms to obtain a differential-privacy guarantee on the output produced by the learner.

Most closely related to our work are studies on differentially private online learning (Dwork et al., 2010; Jain et al., 2012; Thakurta and Smith, 2013). In particular, Mishra and Thakurta (2015) develop UCB and Thompson samplers for (non-contextual) bandits with differential-privacy guarantees based on tree-based aggregation (Chan et al., 2010; Dwork et al., 2009). Follow-up work improved differentially private UCB to have better regret bounds (Tossou and Dimitrakakis, 2016). Recent work (Shariff and Sheffet, 2018) also developed a joint differentially private version of LinUCB (Li et al., 2010). In contrast to those prior studies, we study a more challenging setting in which the parties that implement the learner may not leak information about their observed contexts, actions, and rewards to each other. Having said that, our algorithm may be improved using the differentially private mechanisms of Mishra and Thakurta (2015) and Tossou and Dimitrakakis (2016). In this study, we opted for the simpler epsilon-greedy mechanism because it can be implemented efficiently on arithmetically shared data. We leave the implementation of differentially private UCB and Thompson samplers in our secure multi-party computation framework to future work.
7 Discussion and Future Work
We presented a privacy-preserving contextual bandit algorithm that works correctly in practice and that comes with theoretical guarantees on (differential) privacy. Although our experimental evaluation of this algorithm demonstrates its effectiveness, several avenues for improvement of our algorithm remain:


Increase numerical stability. Repeated use of the Sherman-Morrison formula is known to produce cancellation errors that may lead to numerical instabilities. A numerically stable version of our algorithm would regularly compute the actual matrix inverse to eliminate such errors, and add a diagonal regularizer to prevent numerical issues due to ill-conditioning.

Robustness to disappearing parties. In many practical settings, parties may temporarily disappear because of system failures (Bonawitz et al., 2019). To allow the learning algorithm to continue to operate in such scenarios, different types of secret sharing (e.g., Shamir sharing (Shamir, 1979)) may be needed. The contextual bandit itself could learn to be robust to failing parties by employing a kind of “party dropout” at training time (Srivastava et al., 2014).

Security under stricter security models. The current algorithm assumes parties are honest-but-curious, which means that parties do not deviate from the protocol in Algorithm LABEL:alg:bandits. It is important to note that our privacy guarantees do not hold in stricter security models in which one or more parties operate adversarially, or in settings in which the parties collude. Our current algorithm can be extended to provide guarantees under stricter security models: for instance, extending the algorithm to use message authentication codes (Goldreich, 2009) would allow the parties to detect attacks in which a minority of the parties behaves adversarially. Unfortunately, such extensions generally increase the computational and communication requirements of the learner.

Robustness to timing attacks. In practical scenarios, there is a delay between taking an action and receiving the reward. This may affect the learner in that the parameter-update stage operates on different parameters than were used during the prediction stage. More importantly, it makes the learner susceptible to timing attacks (Kocher, 1996): if the distribution of reward delays depends on the action being selected, parties may be able to infer the selected action from the observed time delay, counter to our guarantees. A real-world implementation of our algorithm should, therefore, introduce random time delays in the operations performed by the arm-pulling and reward-receiving parties to prevent information leakage.

Stronger membership-inference attacks. The membership-inference attacks we considered in this study (Yeom et al., 2018) are not designed to use the full action-reward sequence as side information in the attack. It may thus be possible to strengthen these membership-inference attacks by using the full action-reward sequence, which may be observed by an external observer of the algorithm. The development of such stronger attacks may help to obtain better empirical insights into the level of privacy provided by our privacy-preserving contextual bandits.
Acknowledgments
The authors thank Mark Tygert and Ilya Mironov for helpful discussions and comments on early drafts of this paper.
References
 Bartlett (1951) M. S. Bartlett. An inverse matrix adjustment arising in discriminant analysis. The Annals of Mathematical Statistics, 22(1):107–111, 1951.
 Beaver (1991) D. Beaver. Efficient multiparty protocols using circuit randomization. In Annual International Cryptology Conference, pages 420–432. Springer, 1991.

 Ben-Or et al. (1988) M. Ben-Or, S. Goldwasser, and A. Wigderson. Completeness theorems for non-cryptographic fault-tolerant distributed computation. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, pages 1–10. ACM, 1988.
 Bonawitz et al. (2017) K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy preserving machine learning. Cryptology ePrint Archive, Report 2017/281, 2017. https://eprint.iacr.org/2017/281.
 Bonawitz et al. (2019) K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselander. Towards federated learning at scale: System design. In arXiv 1902.01046, 2019.
 Brakerski et al. (2012) Z. Brakerski, C. Gentry, and V. Vaikuntanathan. (leveled) fully homomorphic encryption without bootstrapping. In ITCS, 2012.
 Catrina and De Hoogh (2010) O. Catrina and S. De Hoogh. Improved primitives for secure multiparty integer computation. In International Conference on Security and Cryptography for Networks, pages 182–199. Springer, 2010.
 Chan et al. (2010) T. H. Chan, E. Shi, and D. Song. Private and continual release of statistics. In ICALP, 2010.
 Damgård et al. (2005) I. Damgård, M. Fitzi, E. Kiltz, J. B. Nielsen, and T. Toft. Unconditionally secure constantrounds multiparty computation for equality, comparison, bits and exponentiation. In TCC, 2005.
 Damgård et al. (2011) I. Damgård, V. Pastro, N. Smart, and S. Zakarias. Multiparty computation from somewhat homomorphic encryption. Cryptology ePrint Archive, Report 2011/535, 2011. https://eprint.iacr.org/2011/535.
 Demmler et al. (2015) D. Demmler, T. Schneider, and M. Zohner. ABY – a framework for efficient mixedprotocol secure twoparty computation. In NDSS, 2015.

 Dowlin et al. (2016) N. Dowlin, R. Gilad-Bachrach, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. Technical Report MSR-TR-2016-3, February 2016.
 Dwork (2011) C. Dwork. Differential privacy. Encyclopedia of Cryptography and Security, pages 338–340, 2011.
 Dwork et al. (2006) C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
 Dwork et al. (2009) C. Dwork, M. Naor, O. Reingold, G. Rothblum, and S. Vadhan. On the complexity of differentially private data release: Efficient algorithms and hardness results. In STOC, pages 381–390, 2009.
 Dwork et al. (2010) C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In STOC, 2010.
 Goldreich (2009) O. Goldreich. Foundations of Cryptography. Cambridge University Press, 2009.
 Goldreich et al. (1987) O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game or a completeness theorem for protocols with honest majority. In STOC, pages 218–229, 1987.
 Hynes et al. (2018) N. Hynes, R. Cheng, and D. Song. Efficient deep learning on multisource private data. In arXiv 1807.06689, 2018.
 Jain et al. (2012) P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In COLT, volume 23, 2012.
 Juvekar et al. (2018) C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan. Gazelle: A low latency framework for secure neural network inference. In arXiv 1801.05507, 2018.
 Kocher (1996) P. C. Kocher. Timing attacks on implementations of DiffieHellman, RSA, DSS, and other systems. In CRYPTO, pages 104–113, 1996.

 Langford and Zhang (2008) J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems, volume 20, 2008.
 Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
 Mishra and Thakurta (2015) N. Mishra and A. Thakurta. (nearly) optimal differentially private stochastic multiarm bandits. In UAI, 2015.
 Mohassel and Zhang (2017) P. Mohassel and Y. Zhang. SecureML: A system for scalable privacypreserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages 19–38. IEEE, 2017.
 Riazi et al. (2017) M. S. Riazi, C. Weinert, O. Tkachenko, E. M. Songhori, T. Schneider, and F. Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Cryptology ePrint Archive, volume 2017/1164, 2017.
 Shamir (1979) A. Shamir. How to share a secret. Communications of the ACM, 22(11):612–613, 1979.
 Shariff and Sheffet (2018) R. Shariff and O. Sheffet. Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems, 2018.

 Shokri and Shmatikov (2015) R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the ACM Conference on Computer and Communications Security, 2015.
 Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 Thakurta and Smith (2013) A. G. Thakurta and A. D. Smith. (nearly) Optimal algorithms for private online learning in fullinformation and bandit settings. In Advances in Neural Information Processing Systems, volume 26, pages 2733–2741, 2013.
 Tossou and Dimitrakakis (2016) A. Tossou and C. Dimitrakakis. Algorithms for differentially private multiarmed bandits. In AAAI, 2016.
 Wagh et al. (2018) S. Wagh, D. Gupta, and N. Chandran. SecureNN: Efficient and private neural network training. In Cryptology ePrint Archive, volume 2018/442, 2018.
 Warner (1965) S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
 Yeom et al. (2018) S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In CSF, 2018.
References
 Bartlett (1951) M. S. Bartlett. An inverse matrix adjustment arising in discriminant analysis. The Annals of Mathematical Statistics, 22(1):107–111, 1951.
 Beaver (1991) D. Beaver. Efficient multiparty protocols using circuit randomization. In Annual International Cryptology Conference, pages 420–432. Springer, 1991.

Ben-Or et al. (1988) M. Ben-Or, S. Goldwasser, and A. Wigderson. Completeness theorems for non-cryptographic fault-tolerant distributed computation. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, pages 1–10. ACM, 1988.
 Bonawitz et al. (2017) K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy-preserving machine learning. Cryptology ePrint Archive, Report 2017/281, 2017. https://eprint.iacr.org/2017/281.
 Bonawitz et al. (2019) K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselander. Towards federated learning at scale: System design. In arXiv 1902.01046, 2019.
 Brakerski et al. (2012) Z. Brakerski, C. Gentry, and V. Vaikuntanathan. (leveled) fully homomorphic encryption without bootstrapping. In ITCS, 2012.
 Catrina and De Hoogh (2010) O. Catrina and S. De Hoogh. Improved primitives for secure multiparty integer computation. In International Conference on Security and Cryptography for Networks, pages 182–199. Springer, 2010.
 Chan et al. (2010) T. H. Chan, E. Shi, and D. Song. Private and continual release of statistics. In ICALP, 2010.
 Damgård et al. (2005) I. Damgård, M. Fitzi, E. Kiltz, J. B. Nielsen, and T. Toft. Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation. In TCC, 2005.
 Damgård et al. (2011) I. Damgård, V. Pastro, N. Smart, and S. Zakarias. Multiparty computation from somewhat homomorphic encryption. Cryptology ePrint Archive, Report 2011/535, 2011. https://eprint.iacr.org/2011/535.
 Demmler et al. (2015) D. Demmler, T. Schneider, and M. Zohner. ABY – a framework for efficient mixed-protocol secure two-party computation. In NDSS, 2015.

Dowlin et al. (2016) N. Dowlin, R. Gilad-Bachrach, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. Technical Report MSR-TR-2016-3, February 2016.
 Dwork (2011) C. Dwork. Differential privacy. Encyclopedia of Cryptography and Security, pages 338–340, 2011.
 Dwork et al. (2006) C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
 Dwork et al. (2009) C. Dwork, M. Naor, O. Reingold, G. Rothblum, and S. Vadhan. On the complexity of differentially private data release: Efficient algorithms and hardness results. In STOC, pages 381–390, 2009.
 Dwork et al. (2010) C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In STOC, 2010.
 Goldreich (2009) O. Goldreich. Foundations of Cryptography. Cambridge University Press, 2009.
 Goldreich et al. (1987) O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game or a completeness theorem for protocols with honest majority. In STOC, pages 218–229, 1987.
 Hynes et al. (2018) N. Hynes, R. Cheng, and D. Song. Efficient deep learning on multi-source private data. In arXiv 1807.06689, 2018.
 Jain et al. (2012) P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In COLT, volume 23, 2012.
 Juvekar et al. (2018) C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan. Gazelle: A low latency framework for secure neural network inference. In arXiv 1801.05507, 2018.
 Kocher (1996) P. C. Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In CRYPTO, pages 104–113, 1996.

Langford and Zhang (2008) J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems, volume 20, 2008.
 Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
 Mishra and Thakurta (2015) N. Mishra and A. Thakurta. (Nearly) optimal differentially private stochastic multi-arm bandits. In UAI, 2015.
 Mohassel and Zhang (2017) P. Mohassel and Y. Zhang. SecureML: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages 19–38. IEEE, 2017.
 Riazi et al. (2017) M. S. Riazi, C. Weinert, O. Tkachenko, E. M. Songhori, T. Schneider, and F. Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Cryptology ePrint Archive, volume 2017/1164, 2017.
 Shamir (1979) A. Shamir. How to share a secret. Communications of the ACM, 22(11):612–613, 1979.
 Shariff and Sheffet (2018) R. Shariff and O. Sheffet. Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems, 2018.

Shokri and Shmatikov (2015) R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the ACM Conference on Computer and Communications Security, 2015.
 Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 Thakurta and Smith (2013) A. G. Thakurta and A. D. Smith. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems, volume 26, pages 2733–2741, 2013.
 Tossou and Dimitrakakis (2016) A. Tossou and C. Dimitrakakis. Algorithms for differentially private multi-armed bandits. In AAAI, 2016.
 Wagh et al. (2018) S. Wagh, D. Gupta, and N. Chandran. SecureNN: Efficient and private neural network training. In Cryptology ePrint Archive, volume 2018/442, 2018.
 Warner (1965) S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
 Yeom et al. (2018) S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In CSF, 2018.
Appendix A Secret Sharing
Our privacy-preserving contextual bandits use two different types of secret sharing: (1) arithmetic secret sharing (Damgård et al., 2011); and (2) binary secret sharing (Goldreich et al., 1987). Below, we describe the secret-sharing methods for single values, but they extend trivially to real-valued vectors.
Arithmetic secret sharing.
Arithmetic secret sharing is a type of secret sharing in which the sum of the shares reconstructs the original data $x$. We refer to the shared representation of $x$ as $[x]$. The shared representation across parties $P$ is given by $[x] = \{[x]_p\}_{p \in P}$, where $[x]_p$ indicates the share of $x$ that party $p$ has. The representation has the property that $\sum_{p \in P} [x]_p \equiv x \pmod{Q}$. To make sure that none of the parties can learn any information about $x$ from their share $[x]_p$, shares need to be sampled uniformly from a ring of size $Q$, $\mathbb{Z}/Q\mathbb{Z}$, and all computations on shares must be performed modulo $Q$. If $x$ is real-valued, it is encoded to lie in $\mathbb{Z}/Q\mathbb{Z}$ using the mechanism described in Appendix B before it is encrypted.
To encrypt the unencrypted data $x$, the party $p'$ that possesses $x$ draws $|P| - 1$ numbers uniformly at random from $\mathbb{Z}/Q\mathbb{Z}$ and distributes them among the other parties. Subsequently, party $p'$ computes its own share as $[x]_{p'} = \left(x - \sum_{p \neq p'} [x]_p\right) \bmod Q$. Thus all the parties (including party $p'$) obtain a random number that is uniformly distributed over the ring, from which they cannot infer any information about $x$. To decrypt $[x]$, the parties communicate their shares and compute $x = \left(\sum_{p \in P} [x]_p\right) \bmod Q$.

Binary secret sharing.
Binary secret sharing is a special type of arithmetic secret sharing for binary data, in which the ring size is $Q = 2$ (Goldreich et al., 1987). Because addition modulo two is equivalent to taking an exclusive OR (XOR) of the bits, this type of sharing is often referred to as XOR secret sharing. To distinguish binary shares from arithmetic shares, we denote a binary share of a variable $x$ across parties by $\langle x \rangle$. Just as with arithmetic sharing, binary secret shares allow for "linear" operations on bits without decryption. For example, binary sharing allows for the evaluation of any circuit expressed as XOR and AND gates. While it is much more efficient to do addition and multiplication of integers with arithmetic shares, logical expressions such as comparisons are more efficient to compute with binary shares. In equations, we denote AND by $\wedge$ and XOR by $\oplus$.
To encrypt the unencrypted bit $x$, the party $p'$ that possesses $x$ draws $|P| - 1$ random bits and distributes those among the other parties; these form the shares $\langle x \rangle_p$ for the parties $p \neq p'$. Subsequently, party $p'$ computes its own share as $\langle x \rangle_{p'} = x \oplus \bigoplus_{p \neq p'} \langle x \rangle_p$. Thus all the parties (including party $p'$) obtain a random bit from which they cannot infer any information about $x$.
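Both sharing schemes can be simulated in a few lines. The sketch below is our own illustration (the ring size $Q = 2^{64}$ and the function names are illustrative choices, not taken from the paper); it shows that the shares reconstruct the secret exactly while each individual share is a uniformly random ring element:

```python
import random

Q = 2**64  # illustrative ring size

def arithmetic_share(x, num_parties):
    """Additively share integer x so that the shares sum to x modulo Q."""
    shares = [random.randrange(Q) for _ in range(num_parties - 1)]
    shares.append((x - sum(shares)) % Q)  # share computed by the data owner
    return shares

def arithmetic_reveal(shares):
    """Reconstruct the secret by summing all shares modulo Q."""
    return sum(shares) % Q

def binary_share(bit, num_parties):
    """XOR-share a single bit: arithmetic sharing with ring size 2."""
    shares = [random.randrange(2) for _ in range(num_parties - 1)]
    parity = 0
    for s in shares:
        parity ^= s
    shares.append(bit ^ parity)  # data owner's share
    return shares

def binary_reveal(shares):
    out = 0
    for s in shares:
        out ^= s
    return out
```

For example, `arithmetic_reveal(arithmetic_share(42, 3))` returns 42, while any strict subset of the shares is uniformly distributed and carries no information about the secret.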
A.1 Converting Between Secret-Sharing Types
Contextual bandit algorithms involve both functions that are easier to compute on arithmetic secret shares (e.g., matrix multiplication) and functions that are easier to implement on binary secret shares using binary circuits (e.g., argmax). Therefore, we use both types of secret sharing and convert between the two types using the techniques proposed by Demmler et al. (2015).
From $[x]$ to $\langle x \rangle$: To convert from an arithmetic share $[x]$ to a binary share $\langle x \rangle$, each party first secretly shares its arithmetic share with the other parties, and the parties then perform a private addition of the resulting shares. To construct a binary share of its arithmetic share $[x]_p$, party $p$: (1) draws $|P| - 1$ random bit strings and shares those with the other parties; and (2) computes its own binary share as the XOR of $[x]_p$ with those random bit strings. The parties have now each obtained a binary share $\langle [x]_p \rangle$ of $[x]_p$ without having to decrypt $[x]_p$. This process is repeated for each party $p$ to create binary secret shares of all arithmetic shares. Subsequently, the parties privately compute $\langle x \rangle = \left\langle \sum_{p \in P} [x]_p \right\rangle$. The summation is implemented by a ripple-carry adder circuit that can be evaluated in $\log_2 L$ rounds, where $L$ is the number of bits in a share (Catrina and De Hoogh, 2010; Damgård et al., 2005).
From $\langle x \rangle$ to $[x]$: To convert from a binary share $\langle x \rangle$ to an arithmetic share $[x]$, the parties compute $[x] = \sum_{b=0}^{L-1} 2^b \left[\langle x \rangle^{(b)}\right]$, where $\langle x \rangle^{(b)}$ denotes the $b$-th bit of the binary share and $L$ is the total number of bits in the shared secret. To create the arithmetic share of a bit, $[\langle x \rangle^{(b)}]$, each party draws a number uniformly at random from $\mathbb{Z}/Q\mathbb{Z}$ and shares the difference between their bit and the random number with the other parties. The parties sum all resulting shares to obtain $[\langle x \rangle^{(b)}]$.
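The bit-composition step of the $\langle x \rangle \to [x]$ conversion is linear and therefore purely local. The sketch below is our own illustration (helper names are not from the paper); it assumes the per-bit arithmetic shares $[\langle x \rangle^{(b)}]$ have already been created and composes them into a share of $x$:

```python
import random

Q = 2**64
L = 64  # number of bits in a shared secret

def arithmetic_share(x, num_parties):
    """Additively share integer x modulo Q."""
    shares = [random.randrange(Q) for _ in range(num_parties - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def compose_bits(bit_shares):
    """Locally compute [x] = sum_b 2^b * [x^(b)] from arithmetic shares
    of the individual bits; no communication between parties is needed.
    bit_shares[b] holds one arithmetic share of bit b per party."""
    num_parties = len(bit_shares[0])
    out = [0] * num_parties
    for b, shares in enumerate(bit_shares):
        for p in range(num_parties):
            out[p] = (out[p] + (1 << b) * shares[p]) % Q
    return out
```

Decomposing $x = 11$ into bits, sharing each bit arithmetically, and composing yields arithmetic shares that reconstruct to 11.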
A.2 Logical Operations and the Sign Function
We rely on binary secret sharing to implement logical operations and the sign function.
XOR and AND. XOR and AND are addition and multiplication modulo 2, where the numbers belong to the set $\{0, 1\}$: they are the binary operations in $\mathbb{Z}/2\mathbb{Z}$. As a result, the techniques we use for addition and multiplication of arithmetically shared values (see paper) can be used to implement XOR and AND as well. Evaluating $\langle x \rangle \oplus 1$ amounts to one party computing $\langle x \rangle_p \oplus 1$, and evaluating $\langle x \oplus y \rangle$ for two private values amounts to each party computing $\langle x \rangle_p \oplus \langle y \rangle_p$. Similarly, $\langle x \wedge y \rangle$ for a public value $y$ is evaluated by having each party compute $\langle x \rangle_p \wedge y$. The AND operation between two private values, $\langle x \wedge y \rangle$, is implemented akin to the private multiplication protocol, using Beaver triples (Beaver, 1991).
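The Beaver-triple evaluation of a private AND can be sketched as follows. This is a plaintext simulation in which a trusted dealer generates the triple on the spot; in the actual protocol the triple is produced and secret-shared ahead of time, and the function names here are our own:

```python
import random

def xor_share(bit, num_parties):
    """XOR-share a single bit among num_parties parties."""
    shares = [random.randrange(2) for _ in range(num_parties - 1)]
    parity = 0
    for s in shares:
        parity ^= s
    shares.append(bit ^ parity)
    return shares

def xor_reveal(shares):
    out = 0
    for s in shares:
        out ^= s
    return out

def beaver_and(x_shares, y_shares):
    """Compute XOR shares of x AND y using a random triple (a, b, c = a & b)."""
    n = len(x_shares)
    a, b = random.randrange(2), random.randrange(2)
    c = a & b  # dealer-generated triple, distributed as shares
    a_sh, b_sh, c_sh = xor_share(a, n), xor_share(b, n), xor_share(c, n)
    # The parties open the masked values eps = x ^ a and delta = y ^ b;
    # these reveal nothing about x and y because a and b are random.
    eps = xor_reveal([xs ^ s for xs, s in zip(x_shares, a_sh)])
    delta = xor_reveal([ys ^ s for ys, s in zip(y_shares, b_sh)])
    # Local combination: x & y = c ^ (eps & b) ^ (delta & a) ^ (eps & delta).
    z_sh = [cs ^ (eps & bs) ^ (delta & as_)
            for cs, bs, as_ in zip(c_sh, b_sh, a_sh)]
    z_sh[0] ^= eps & delta  # public term, added by one designated party
    return z_sh
```

The local combination follows from expanding $x \wedge y = (\epsilon \oplus a) \wedge (\delta \oplus b)$, which is why only one opening round of communication is needed per AND gate.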
Sign function. We express the sign function on an arithmetically shared value as $\operatorname{sign}(x) = 1 - 2\,\mathrm{MSB}(x)$, where $\mathrm{MSB}(x)$ denotes the most significant bit of $x$, which is $1$ if and only if $x$ is negative in our encoding. Using this expression, the sign function can be implemented by first converting the arithmetic share, $[x]$, to a binary share, $\langle x \rangle$, using the conversion procedure described above. Subsequently, we obtain the most significant bit, $\langle \mathrm{MSB}(x) \rangle$, and convert it back to an arithmetic share to obtain $[\mathrm{MSB}(x)]$, from which $[\operatorname{sign}(x)]$ follows by local computation.
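In a two's-complement-style encoding in the ring, the most significant bit of the encoded value determines the sign. A plaintext sketch of the identity (the convention that the top half of the ring represents negative numbers is our assumption for illustration):

```python
Q = 2**64
L = 64

def msb(x_enc):
    """Most significant bit of the L-bit ring representation."""
    return (x_enc >> (L - 1)) & 1

def sign(x):
    """sign(x) = 1 - 2 * MSB(x); negative values wrap into the
    top half of the ring, where the most significant bit is 1."""
    x_enc = x % Q
    return 1 - 2 * msb(x_enc)
```

Under this convention, sign(0) evaluates to 1.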
Appendix B Fixed-Precision Encoding
Contextual bandit algorithms generally use real-valued parameters and data. Therefore, we need to encode the real-valued numbers as integers before we can arithmetically share them. We do so by multiplying with a large scaling factor and rounding to the nearest integer: $x_e = \lfloor 2^B x \rceil$, where the scaling factor is $2^B$ for some precision parameter, $B$. We decode an encoded value, $x_e$, by computing $x \approx x_e / 2^B$. Encoding real-valued numbers this way incurs a precision loss that is inversely proportional to $2^B$.
Since we scale by a factor $2^B$ to encode floating-point numbers, we must scale down by a factor $2^B$ after every multiplication of two encoded values. We do this using the public division protocol described in Appendix C.
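A plaintext sketch of the encoding and of the rescale-after-multiply step (the value $B = 16$ is an illustrative choice, not the setting used in the experiments, and the rescale here is a plain shift rather than the private protocol of Appendix C):

```python
B = 16          # precision parameter (illustrative)
SCALE = 1 << B  # scaling factor 2^B

def encode(x):
    """Encode a real number as a fixed-point integer with B fractional bits."""
    return round(x * SCALE)

def decode(x_enc):
    return x_enc / SCALE

def mul(a_enc, b_enc):
    """The raw product carries 2B fractional bits, so we scale
    down by 2^B once to return to B fractional bits."""
    return (a_enc * b_enc) >> B
```

Values with short binary expansions round-trip exactly; in general, the round-trip error is at most $2^{-B-1}$.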
Appendix C Public Division
A simple method to divide an arithmetically shared value, $[x]$, by a public value, $y$, would simply divide the share of each party by $y$ (i.e., each party computes $[x]_p / y$). However, such a method can produce incorrect results when the sum of the shares "wraps around" the ring size, $Q$. Defining $\theta_x$ to be the number of wraps such that $x = \sum_{p \in P} [x]_p - \theta_x Q$, indeed, we observe that:
$$\sum_{p \in P} \frac{[x]_p}{y} = \frac{x + \theta_x Q}{y} = \frac{x}{y} + \theta_x \frac{Q}{y}.$$
Therefore, the simple division method fails whenever $\theta_x \neq 0$, which happens with probability $x / Q$ in the two-party case. Many prior MPC implementations specialize to the case $|P| = 2$ and rely on this probability being negligible (Mohassel and Zhang, 2017; Riazi et al., 2017; Wagh et al., 2018). However, when $|P| > 2$ the probability of failure grows rapidly, and we must account for the number of wraps, $\theta_x$.
We do so by privately computing a secret share of the number of wraps in $[x]$, denoted $[\theta_x]$, using Algorithm 2. We use $[\theta_x]$ to compute the correct value of the division by $y$:
$$\left[\frac{x}{y}\right]_p = \frac{[x]_p}{y} - [\theta_x]_p \frac{Q}{y}.$$
In practice, it can be difficult to compute one of the terms of $[\theta_x]$ in Algorithm 2 (line 8). We note that this term has a fixed probability of being non-zero, irrespective of the number of parties $|P|$. In practice, we therefore skip its computation and simply set it to zero. This implies that our algorithm can produce incorrect results with small probability. For example, when we multiply two real values, $x$ and $y$, the result will be encoded as $2^{2B} x y$, which has probability approximately $2^{2B} x y / Q$ of producing an error. This probability can be reduced by increasing the ring size, $Q$, or reducing the precision parameter, $B$.
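The wrap correction can be checked in a plaintext simulation. Here the wrap count $\theta_x$ is recovered by "cheating" with access to all shares at once; Algorithm 2 computes a secret share of it instead, and the function names below are ours:

```python
import random

Q = 2**64

def share(x, num_parties):
    """Additively share integer x modulo Q."""
    shares = [random.randrange(Q) for _ in range(num_parties - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def public_div(shares, y):
    """Divide a shared value by a public y, correcting for wrap-around.
    theta is computed in the clear here for illustration only."""
    theta = sum(shares) // Q                 # number of wraps around the ring
    quotient_shares = [s // y for s in shares]
    quotient_shares[0] -= theta * (Q // y)   # cancel the theta * Q / y error
    return sum(quotient_shares)
```

Because each party floors its own share, the corrected quotient can still be off by up to $|P| - 1$; without the $\theta$ term, however, the result would typically be wrong by a multiple of $Q / y$.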
Appendix D Numerical Precision
In addition to the experiments presented in the paper, we also performed experiments to measure the impact of varying the precision of the fixed-point encoding and of the numerical approximations. Figure 7 shows the average reward as a function of the number of bits of precision, $B$, used in the encoding of floating-point values. The reward is maximized at an intermediate precision, with a sharp drop in reward below and above that optimum. The drop at low precision is due to precision loss causing numerical instability. Algorithm 1 in the main paper is susceptible to three forms of numerical instability: (1) ill-conditioning due to relying on the normal equations to solve the least-squares problem; (2) degeneracies in the matrix being inverted, which can become singular or non-positive-definite; and (3) cancellation errors due to the use of the Sherman-Morrison update. The drop in observed reward at high precision is due to wrap-around errors that arise because we do not compute the number of wraps exactly (see Appendix C). This causes public divisions to fail catastrophically with higher probability, impeding the accuracy of the learning algorithm.
Figure 8 shows how the average reward changes as a function of the number of Newton-Raphson iterations used to approximate the private reciprocal. The results reveal that three iterations suffice in our experiments. We note that the private reciprocal in Algorithm 1 in the main paper is only evaluated on a restricted positive domain; in all our experiments, we observed empirically that its argument remained within this domain. For use cases that require a larger range of input values, more iterations and a different initial value may be needed to ensure rapid convergence.
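The Newton-Raphson reciprocal iteration uses only additions and multiplications, which is what makes it suitable for evaluation on secret shares. A plaintext sketch (the initial value and iteration counts below are illustrative; they are not the paper's settings):

```python
def reciprocal(a, num_iters=3, y0=0.1):
    """Approximate 1/a via the iteration y <- y * (2 - a * y).
    Convergence is quadratic provided the initial guess satisfies
    0 < a * y0 < 2; otherwise the iteration diverges."""
    y = y0
    for _ in range(num_iters):
        y = y * (2 - a * y)
    return y
```

Writing the relative error as $r_n = 1 - a y_n$, the iteration satisfies $r_{n+1} = r_n^2$, so each step doubles the number of correct bits; when $a y_0$ is far from 1, correspondingly more iterations are needed.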