In this paper we study the problem of stochastic multi-armed bandits (sMAB) with contaminated rewards, or contaminated stochastic bandits (CSB). This scenario assumes that rewards associated with an action are sampled i.i.d. from a fixed distribution but that the learner only observes a reward after an adversary has an opportunity to contaminate it. The observed reward can be unrelated to the reward distribution and can be maliciously chosen to fool the learner. An outline for this setup is presented in Section 2.
We are primarily motivated by the use of bandit algorithms in education, where the rewards often come directly from human opinion. Whether responses come from undergraduate students, a community sample, or paid participants on platforms like MTurk, there is always reason to believe some responses are not thoughtful or relevant to the question or could be assisted by bots (Necka et al., 2016).
An example in education is a recent paper that tests Thompson sampling for identifying high-quality student-generated solution explanations to math problems using MTurk participants (Williams et al., 2016). Using ratings on a scale of 1 to 10 from 150 participants, the results showed that Thompson sampling identified participant-generated explanations that, when viewed by other participants, significantly improved their chance of solving future problems compared to no explanation or “bad” explanations identified by the algorithm. While the proportion of contaminated responses will always depend on the population, recent work suggests that, even when checks for fraudulent participants are already in use, between of MTurk participants give low-quality samples (Ahler et al., 2018; Ryan, 2018; Necka et al., 2016). Given these statistics, accounting for low-quality responses or outright contamination could plausibly improve performance. This is especially relevant in educational settings, where the number of iterations an algorithm can run is often low.
The recent work in CSB has varied assumptions on the power of the adversary, the contamination, and the reward distributions. Many papers require the rewards and contamination to be bounded (Kapoor et al., 2018; Gupta et al., 2018; Lykouris et al., 2018). Others don’t require boundedness, but do assume that the adversary contaminates uniformly across rewards (Altschuler et al., 2019). All works make some assumption on the number of rewards for an action an adversary can contaminate. We discuss previous work more thoroughly in section 3.
Our work expands on these papers by allowing for a full knowledge adaptive adversary that can give unbounded contamination in any manner. Our only constraint is that for some fixed , no more than proportion of rewards for an action are contaminated. We achieve this by proving concentration inequalities for two robust mean estimators in the -contamination context and implementing them in an -contamination robust UCB algorithm. We are able to show that when is small enough, the regret of our algorithm analyzed on the true reward distributions achieves loss. Through simulations, we show that with certain adversaries, our algorithm outperforms algorithms designed for stochastic (UCB1) and adversarial bandits (EXP3) as well as those that have “best of both worlds” guarantees (EXP3++ and TsallisInf) even when our constraint on is broken.
Though we are motivated by applications in education of bandit algorithms and use this context to determine appropriate parameters in the simulations, we point out opportunities for CSB modeling to arise in other contexts as well:
There is always a chance that human feedback is not mindful or effortful, and therefore is not representative of the underlying truth related to an action. This may appear in online surveys that are used for A/B testing, or as is the case above in the explanation generation example.
Internet users who wish to preserve privacy can intentionally click on ads to obfuscate their true interests either manually or through browser apps. Similarly, malware can click on ads from one company to falsely indicate high interest, which can cause higher rankings in searches or more frequent use of the ad than it would otherwise merit (Pearce et al., 2014; Crussell et al., 2014).
2 Problem setting
Here we specify our notation and present the -contaminated stochastic bandit problem. We then argue for a specific notion of regret for CSB. We compare our setting to other current settings in the field in section 3.
We use to represent for , the indicator function to be 1 if true and 0 otherwise. Let be the number of times action has been chosen at time and
to be the vector of all observed rewards for action at time . The suboptimality gap for action is and we define .
2.1 -Contaminated Stochastic Bandits
A basic parameter in our framework is , the fraction of rewards for an action that the adversary is allowed to contaminate. Before play, the environment picks a true reward from fixed distribution for all and . The adversary observes these rewards and then play begins. At time the learner chooses an action . The adversary sees then chooses an observed reward and then the learner observes only .
We present the contaminated stochastic bandits game in algorithm 1.
We allow the adversary to corrupt in any fashion as long as for every time there is no more than an -fraction of contaminated rewards for any action. That is, we constrain the adversary such that,
We allow the adversary to give unbounded contamination that can be chosen with full knowledge of the learner’s history as well as current and future rewards. This setting allows the adversary to act differently across actions and places no constraints on the contamination itself, only on the rate of contamination.
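The protocol above can be sketched as a short game loop. This is only an illustration under stated assumptions: the names `play_csb`, `learner`, `adversary`, and `eps` (standing in for the contamination fraction) are ours, and having the loop itself reject over-budget contamination is a design choice of this sketch rather than part of the model, where the adversary is simply assumed to respect the constraint.

```python
def play_csb(learner, adversary, true_rewards, eps):
    """Run one contaminated stochastic bandit game (illustrative sketch).

    true_rewards[t][a] is the true reward of action a at time t, drawn
    before play. learner(history) returns an action index and
    adversary(t, a, x) returns the observed reward. eps is the assumed
    per-action contamination fraction.
    """
    num_actions = len(true_rewards[0])
    pulls = [0] * num_actions         # times each action has been chosen
    contaminated = [0] * num_actions  # contaminated rewards per action
    history = []                      # (action, observed reward) pairs
    for t, rewards_t in enumerate(true_rewards):
        a = learner(history)
        x = rewards_t[a]
        pulls[a] += 1
        y = adversary(t, a, x)
        # Enforce the budget: revert to the true reward if contaminating
        # would exceed an eps-fraction of this action's pulls so far.
        if y != x and contaminated[a] + 1 > eps * pulls[a]:
            y = x
        if y != x:
            contaminated[a] += 1
        history.append((a, y))
    return history, pulls, contaminated
```

With a round-robin learner and an adversary that always tries to report a large reward, the loop caps the contaminated count of each action at an `eps`-fraction of that action's pulls at every time step.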
2.2 Notion of regret
A traditional goal in bandit learning is to minimize the observed cumulative regret gained over the total number of plays . Because the adversary in this model can affect the observed cumulative regret, we argue to instead use a notion of regret that considers only the underlying true rewards. We call this uncontaminated regret and give the definition below for any time and policy in terms of the true rewards ,
The definition in eq. 2.1 was first mentioned in Kapoor et al. (2018), along with another notion of regret that compares the sum of the observed (possibly contaminated) rewards to the sum of optimal, uncontaminated rewards,
We argue that eq. 2.2 gives little information about the performance of an algorithm. This notion of regret can be negative, and with no bounds on the contamination it can be arbitrarily small and potentially meaningless. We believe that any regret that compares a true component to an observed (possibly contaminated) component is not a useful measure of performance in CSB as it is unclear what regret an optimal strategy should produce.
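A small numerical sketch may help show why the second notion can be negative. The function names and the toy numbers below are our own assumptions, and true means are used as a stand-in for expected true rewards; this is not the paper's formal definition.

```python
def uncontaminated_regret(means, actions):
    """Sum over plays of (best true mean - true mean of the chosen action)."""
    best = max(means)
    return sum(best - means[a] for a in actions)


def observed_vs_optimal_regret(means, observed):
    """Sum of optimal true means minus the sum of observed rewards."""
    return max(means) * len(observed) - sum(observed)


means = [0.5, 0.4]               # hypothetical true means; action 0 is optimal
actions = [1, 1, 1, 1]           # the learner always plays the worse action
observed = [0.4, 0.4, 0.4, 50]   # one reward inflated by the adversary

print(uncontaminated_regret(means, actions))        # positive: every play was suboptimal
print(observed_vs_optimal_regret(means, observed))  # negative: driven by the contamination
```

In this toy case the learner never plays the optimal action, yet the mixed notion reports a negative regret, which is the behavior the argument above objects to.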
3 Related Work
We start by briefly addressing why adversarial and “best of both worlds” algorithms are not optimized for CSB. We then cover relevant work in robust statistics, followed by current work in robust bandits and how our model differs and relates.
3.1 Adversarial bandits
Adversarial bandits make no assumptions on the observed rewards, and their regret is analysed with respect to the best fixed action, where “best” is defined using the observed rewards. There are no theoretical guarantees with respect to the uncontaminated regret, so it is not immediately clear how they will perform in a CSB problem. We remark that adversarial analysis assumes uniformly bounded observed rewards, whereas we allow observed rewards to be unbounded. The general adversarial framework does not take advantage of the structure present in CSB, namely that the adversary can only corrupt a small fraction of rewards, so it is likely that performance improvements can be made.
3.2 Best of both worlds
A developing line of work is algorithms that enjoy “best of both worlds” guarantees. That is, they perform well in both stochastic and adversarial environments without knowing a priori which environment they will face. Early work in this area (Auer and Chiang, 2016; Bubeck and Slivkins, 2012) started by assuming a stochastic environment and implementing some method to detect a failure of the i.i.d. assumption, at which point the algorithm switches to an algorithm for the adversarial environment for the remainder of iterations. Further work implements algorithms that can handle an environment that is some mixture of stochastic and adversarial, as in EXP3++ and TsallisInf (Seldin and Slivkins, 2014; Zimmert and Seldin, 2018).
While these algorithms are well suited to a stochastic environment with some adversarial rewards, they differ from contamination robust algorithms in that all observed rewards are treated as informative. Their uncontaminated regret has not been analysed, and there are no guarantees in the CSB setting.
3.3 Contamination robust statistics
The -contamination model we consider is closely related to the one introduced by Huber (1964). The goal there was to estimate the mean of a Gaussian mixture model where an -fraction of the sample was not sampled from the main Gaussian component. There has been a recent increase of work using this model, especially in extensions to the high-dimensional case (Diakonikolas et al., 2019; Kothari and Steurer, 2017; Lai et al., 2016; Liu et al., 2019). These works often make the assumption of a Gaussian mixture component, though there has been expanding work with non-Gaussian models as well.
3.4 Contamination robust bandits
Some of the first work in CSB started by assuming both rewards and contamination were bounded (Lykouris et al., 2018; Gupta et al., 2018). These works assume an adversary that can contaminate at any time step, but that is constrained in the cumulative contamination. That is, the cumulative absolute difference of the contaminated reward, , to the true reward, , is bounded, . Lykouris et al. provide a layered UCB-type active arm elimination algorithm. Gupta et al. expand on this work to provide a probabilistic active arm elimination algorithm with better regret guarantees.
Recent work on implementing a robust UCB replaces the empirical mean with the empirical median, and gives guarantees for the uncontaminated regret with Gaussian rewards (Kapoor et al., 2018). They consider an adaptive adversary but require the contamination to be bounded, though the bound need not be known. They cite work that can expand their robust UCB to distributions with bounded fourth moments by using the agnostic mean (Lai et al., 2016), though they give no uncontaminated regret guarantees. In one dimension, the agnostic mean takes the mean of the smallest interval containing fraction of points. This estimator is also known as the -shorth mean. Our work expands on this model by allowing for unbounded contamination and analysing the uncontaminated regret for sub-Gaussian rewards.
CSB has also been analysed in the best arm identification problem (Altschuler et al., 2019). Using a Bernoulli adversary that contaminates any reward with probability , Altschuler et al. consider three adversaries of increasing power, from the oblivious adversary, which knows neither the player’s history nor the current action or reward, to a malicious adversary, which can contaminate knowing the player’s history and the current action and reward. They analyse the probability of best arm selection and the sample complexity of an active arm elimination algorithm. While their performance measure is different from ours, we generalize their context to allow an adversary to contaminate in any fashion.
There is also work that explores adaptive adversarial contamination on ε-greedy and UCB algorithms (Jun et al., 2018). They give a thorough analysis, with both theoretical guarantees and simulations, of the effects an adversary can have on these two algorithms when the adversary does not know the optimal action but is otherwise fully adaptive. They show these standard algorithms are susceptible to contamination. Similar work looks at contamination in contextual bandits with a non-adaptive adversary (Ma et al., 2018).
4 Main results
We present concentration inequalities on two robust mean estimators in the -contamination context. We use these robust estimators in two variants of a -contamination robust UCB algorithm and prove uncontaminated regret guarantees for both.
4.1 Contamination robust mean estimators
The estimators we analyse have been in use for many decades as robust statistics. Our contribution is to analyze them within our -contamination model and provide simple finite-sample concentration inequalities for ease of use in UCB-type algorithms.
4.1.1 Trimmed mean
Our first estimator suggested for use in the contaminated model is the -trimmed mean (Liu et al., 2019).
Trim the smallest and largest -fraction of points from the sample and calculate the mean of the remaining points. This estimator uses fraction of sample points.
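The definition above can be sketched in a few lines. The parameter name `alpha` for the trimming fraction and the choice to round the number of trimmed points down are assumptions of this sketch, not details taken from the text.

```python
import math


def trimmed_mean(sample, alpha):
    """Mean after dropping the floor(alpha * n) smallest and largest points."""
    n = len(sample)
    k = math.floor(alpha * n)  # points trimmed from each tail (assumed rounding)
    if 2 * k >= n:
        raise ValueError("alpha is too large for this sample size")
    kept = sorted(sample)[k:n - k]
    return sum(kept) / len(kept)


# One large contaminated point is removed along with the smallest point.
print(trimmed_mean([1, 2, 3, 4, 1000], 0.2))  # 3.0
```

Here the plain mean would be 202, while the trimmed mean ignores the outlier entirely.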
The intuition is that if the contamination is large, it will be removed from the sample; if it is small, it should have little effect on the mean estimate. Next we provide the concentration inequality for the -trimmed mean.
Theorem 1 (Trimmed mean concentration).
Let be the set of points that are drawn from a -sub-Gaussian distribution with mean . Let be a sample where an -fraction of these points are contaminated by an adversary. For , we have,
with probability at least .
Our proof techniques are adapted from Liu et al. (2019).
Let be the set of points that are drawn from a -sub-Gaussian distribution. Without loss of generality assume . Let be a sample where an -fraction of these points are contaminated by an adversary.
Let represent the points which are not contaminated and represent the contaminated points. Then our sample can be represented by the union . Let represent the points that remain after trimming fraction of the largest and smallest points, and be the set of points that were trimmed. Then we have,
Combining we get,
with probability at least . Letting and , and assuming , we have,
with probability at least . ∎
A more detailed proof can be found in the appendix.
4.1.2 -shorth mean
The agnostic mean of Lai et al. (2016), which we refer to by the more common term -shorth mean, can be considered a variation of the trimmed mean.
Take the mean of the shortest interval that removes the smallest and largest fraction of points such that , where are chosen to minimize the interval length of remaining points. This estimator uses fraction of sample points.
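A minimal sketch of a shorth-type mean follows. It assumes, as a convention of this sketch only, that a total of floor(alpha * n) points are removed across the two tails combined; the name `alpha` and this rounding are our assumptions and may differ from the exact fractions used in the definition above.

```python
import math


def shorth_mean(sample, alpha):
    """Mean of the shortest window of consecutive order statistics.

    Keeps m = n - floor(alpha * n) points (assumed convention) and chooses
    how many to drop from each tail so the kept window is as short as possible.
    """
    n = len(sample)
    m = n - math.floor(alpha * n)
    if m < 1:
        raise ValueError("alpha is too large for this sample size")
    xs = sorted(sample)
    # Slide a window of m sorted points and keep the one with the smallest range.
    start = min(range(n - m + 1), key=lambda i: xs[i + m - 1] - xs[i])
    window = xs[start:start + m]
    return sum(window) / m


# Contamination skewed to one side: both contaminated points are dropped.
print(shorth_mean([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 5.0, 5.1], 0.25))
```

In this example the two largest points are removed and none from the lower tail, whereas a symmetric trim of the same total size would discard one good point from the bottom and keep one contaminated point at the top.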
The -shorth mean is less computationally efficient than the trimmed mean, but may be a better mean estimator when the contaminated points are not large outliers and are skewed in one direction. Intuitively this is because the -shorth mean can trim off contamination that would require removing most of the sample with the trimmed mean. Next we provide the concentration inequality for the -shorth mean.
Theorem 2 (-shorth mean concentration).
Let be the set of points that are drawn from a -sub-Gaussian distribution with mean . Let be a sample where an -fraction of these points are contaminated by an adversary. For , , we have,
with probability at least .
The proof is contained in the appendix and follows an approach similar to that shown for the trimmed mean.
Our methods ensured that the first term in each concentration bound is the same, giving them similar regret guarantees when implemented in a UCB algorithm. We emphasize that the -shorth mean uses fraction of a sample while the -trimmed mean uses fraction of a sample. We remark that if there is no contamination and then our inequalities reduce to the standard concentration inequality for the empirical mean of samples drawn from a sub-Gaussian distribution.
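For reference, the standard inequality alluded to here can be written as follows. The symbols $n$ (sample size), $\mu$ (mean), $\sigma$ (sub-Gaussian parameter), $\delta$ (failure probability), and $\bar{X}_n$ (empirical mean) are our notation for this reminder and need not match the constants in Theorems 1 and 2.

```latex
% Empirical mean of n i.i.d. sigma-sub-Gaussian samples with mean mu:
\[
  \Pr\!\left( \left| \bar{X}_n - \mu \right| \ge t \right)
  \le 2 \exp\!\left( -\frac{n t^2}{2 \sigma^2} \right)
  \quad \text{for all } t > 0,
\]
% equivalently, with probability at least 1 - delta,
\[
  \left| \bar{X}_n - \mu \right|
  \le \sigma \sqrt{\frac{2 \log(2/\delta)}{n}} .
\]
```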
4.2 Contamination robust UCB
We present the contamination robust-UCB (crUCB) algorithm for -CSB with sub-Gaussian rewards.
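As a rough illustration of the idea, the sketch below plugs a trimmed mean into a UCB-style selection rule. It is not a reproduction of algorithm 2: the exploration bonus is a generic sub-Gaussian placeholder, and the names `cr_ucb_choose`, `alpha`, and `sigma` are assumptions made for this example.

```python
import math


def trimmed_mean(sample, alpha):
    """Mean after dropping the floor(alpha * n) smallest and largest points."""
    n = len(sample)
    k = math.floor(alpha * n)
    if 2 * k >= n:
        raise ValueError("alpha is too large for this sample size")
    kept = sorted(sample)[k:n - k]
    return sum(kept) / len(kept)


def cr_ucb_choose(rewards_by_action, t, alpha, sigma=1.0):
    """Return the index of the action with the largest robust upper index.

    rewards_by_action[a] holds the observed rewards of action a so far.
    """
    best_action, best_index = None, -math.inf
    for a, rewards in enumerate(rewards_by_action):
        if not rewards:
            return a  # play every action once before using the index
        # Placeholder bonus (assumption): not the constants from the theorems.
        bonus = sigma * math.sqrt(2.0 * math.log(max(t, 2)) / len(rewards))
        index = trimmed_mean(rewards, alpha) + bonus
        if index > best_index:
            best_action, best_index = a, index
    return best_action


# Action 1 has one wildly contaminated reward; the robust index ignores it.
rewards = [[0.5] * 10, [0.2] * 9 + [100.0]]
print(cr_ucb_choose(rewards, t=20, alpha=0.1))  # 0
```

With the plain empirical mean in place of the trimmed mean, the single contaminated reward would make action 1 look best in this example.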
We provide uncontaminated regret guarantees for crUCB below for both the -trimmed and the -shorth mean.
Theorem 3 (-trimmed mean crUCB uncontaminated regret).
Let and . Then with algorithm 2 with the -trimmed mean, sub-Gaussian reward distributions with , and contamination rate , we have the uncontaminated regret bound,
Theorem 4 (-shorth mean crUCB uncontaminated regret).
Let and . Then with algorithm 2 with the -shorth mean, sub-Gaussian reward distributions with , and contamination rate , we have the uncontaminated regret bound,
From Theorems 3 and 4 we get that crUCB has the same order of regret in the CSB setting as UCB1 has in the standard sMAB setting. The constraint on the magnitude of is quite strong, but we show in section 5 that it can be violated while still obtaining good empirical performance.
If then no algorithm can get sublinear regret since distinguishing between the top two actions is statistically impossible even with infinite samples. We give an example in the appendix.
5 Simulations
We compare our crUCB algorithms using the trimmed mean (tUCB) and shorth mean (sUCB) against a standard stochastic algorithm (UCB1, Auer and Cesa-Bianchi (2002)), a standard adversarial algorithm (EXP3, Auer et al. (2002)), two “best of both worlds” algorithms (EXP3++, Seldin and Lugosi (2017), 0.5-TsallisInf, Zimmert and Seldin (2018)), and another contamination robust algorithm (RUCB-MAB, Kapoor et al. (2018)). Each trial has five actions (), is run for 100 iterations (), for and uses optimized parameters for each algorithm. For sUCB and tUCB, we set . The plots are average results over 100 trials.
Our choice of comes from our motivation to apply contaminated bandits in education, where the sample sizes are often much smaller than for example in advertising. We similarly chose number of arms and proportion contamination to be in a realistic range for the application we have in mind.
Rewards and gaps
We look at two true reward distributions with , the binomial(n=10) and the truncated Gaussian (min=0, max = 10, =3). We use gap . All non-optimal actions have the same true distribution. We chose the binomial distribution to simulate the Likert scale often seen in rating questions, and the truncated Gaussian in lieu of a Gaussian distribution so that algorithms requiring bounded rewards can be implemented easily.
We use two types of adversaries: the Bernoulli adversary which gives a contaminated reward at every time step with probability , and the cluster adversary, which gives contaminated rewards in a row when implemented at iteration .
We use random contamination which chooses a contaminated reward uniformly from ranges such that non-optimal actions receive higher rewards than the optimal action.
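The two adversaries and the random contamination can be sketched as follows. The function names, the `(t, action, reward)` call signature, and the specific ranges (lower half of the reward range for the optimal action, upper half for the others) are assumptions of this sketch; the text only states that contaminated rewards are drawn uniformly from ranges that favor non-optimal actions.

```python
import random


def make_random_contamination(optimal, low=0.0, high=10.0):
    """Contaminated rewards that make non-optimal actions look better."""
    mid = (low + high) / 2.0

    def contaminate(action, rng):
        # Assumed ranges: lower half for the optimal action, upper half otherwise.
        if action == optimal:
            return rng.uniform(low, mid)
        return rng.uniform(mid, high)

    return contaminate


def bernoulli_adversary(eps, contaminate, rng):
    """Contaminate each reward independently with probability eps."""
    def adversary(t, action, reward):
        return contaminate(action, rng) if rng.random() < eps else reward
    return adversary


def cluster_adversary(start, length, contaminate, rng):
    """Contaminate `length` consecutive rewards beginning at iteration `start`."""
    def adversary(t, action, reward):
        if start <= t < start + length:
            return contaminate(action, rng)
        return reward
    return adversary


rng = random.Random(0)
contaminate = make_random_contamination(optimal=0)
adversary = cluster_adversary(start=50, length=5, contaminate=contaminate, rng=rng)
print(adversary(10, 1, 3.0))  # outside the cluster: the true reward is returned
```

Passing an explicit `random.Random` instance keeps a simulated run reproducible across algorithms being compared.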
Typically the average regret is used as the performance measure, but because our total number of iterations is so small, we found this visual uninformative except in the most extreme cases. Instead we plot the average number of times the optimal action was chosen at each iteration. This measurement is supported by the uncontaminated regret analysis which proves high probability bounds on the number of non-optimal action plays.
We recommend viewing the plots on a color screen.
In fig. 1(a) we point out that the “best of both worlds” algorithms EXP3++ and TsallisInf consistently perform worse in the purely stochastic environment than all others in this small setting. From Seldin et al. (2012) we get that in the purely stochastic environment, EXP3 performs similarly to UCB1 for small (). To our knowledge this is the first time the behavior of EXP3++ and TsallisInf has been considered for small . The rest of the plots in fig. 1 show how non-contamination robust algorithms degrade in performance with increasing proportions of contamination for a large . The impact of cluster contamination at iteration 50 is shown in fig. 1(d). In fig. 2 we see similar outcomes for the truncated Gaussian.
With small and large , we see in fig. 2(d) and fig. 3(d) that EXP3 has the best performance. As the environment becomes more adversarial, we expect tUCB and sUCB to degrade in performance because they are using less information. It is not clear why EXP3 would necessarily perform better since its goal is still to optimize on all observed rewards. This could be an artifact of our simulation choices that would not hold up to other types of adversaries or contamination.
When assuming few iterations, increasing and/or decreasing quickly ensures few optimal action picks regardless of algorithm. Similarly, small and large can render bounded contamination impotent.
We include a sensitivity analysis of sUCB and tUCB to choices of in the appendix. This generally shows that it is better to choose an larger than than the converse.
We have included RUCB-MAB in our simulations because it is simple to implement and can perform well. We caution the reader that its guarantees only hold for Gaussian true rewards, and that it can perform extremely poorly when contamination is close to the true mean, a scenario that would otherwise have little to no impact on all other algorithms used. We include an example of this in the appendix.
6 Discussion
We have presented two variants of an -contamination robust UCB algorithm to handle uninformative or malicious rewards in the stochastic bandit setting. As the main contribution, we proved concentration inequalities for the -trimmed and -shorth mean in the -contamination setting and analyzed the uncontaminated regret of crUCB with these robust mean estimators. Our algorithms are simple to implement, and provide uncontaminated regret guarantees near those in the sMAB setting provided the contamination is small enough. We have shown through simulation that these algorithms can outperform “best of both worlds” algorithms and those for stochastic or adversarial environments when using a small number of iterations and chosen to be reasonable when implementing bandits in education. Our simulations also show that the tight bounds on the magnitude of required to reach our regret guarantees are likely very conservative, as all our examples use larger than permitted by our theory.
A weak point of these algorithms is that they only have improved performance compared to non-robust algorithms when the contamination is generally in the tail of the true reward distribution. They also require knowledge of beforehand. The first point is less troublesome, as the closer contamination gets to the true mean, the less effect it will have. Choice of may come from domain knowledge, but could also require a separate study.
In this work we assumed a fully adaptive adversarial contamination, constrained only by the total fraction of contamination at any time step. By making more assumptions about the adversary, it is likely possible to improve uncontaminated regret bounds.
The adversaries used in the simulation are quite simple and do not take full advantage of the power we allow in our model. We designed them as a first test of our algorithms and associated theory. In the future, we would like to design simulated adversaries that are modeled on real world contamination. It will also be important to deploy contamination robust algorithms in the real world. This will require thought on how to select various tuning parameters ahead of the deployment.
There remain many open questions in this area. In particular, we think this work could be extended in the following directions.
UCB-type algorithms are often outperformed in applications by the randomized Thompson sampling algorithm. Creating a randomized algorithm that accounts for the contamination model would increase the practicality of this line of work.
Contamination correlated with true rewards
One possibility is that the contaminated rewards contain information about the true rewards. An example of this is when contamination gives a missing reward. If the probability of missingness is different among actions, then it might be that missingness provides some additional information about the optimal action. This occurs, for example, when dropout is correlated with the treatment condition.
L.N. acknowledges the support of NSF via grant DMS-1646108 and thanks Joseph Jay Williams for helpful discussions and for inspiring this work. A.T. would like to acknowledge the support of a Sloan Research Fellowship and NSF grant CAREER IIS-1452099.
- Ahler et al.  Douglas J Ahler, Carolyn E Roush, and Gaurav Sood. Micro-Task Market for Lemons. Working Paper. Epub ahead of print, page 39, 2018.
- Altschuler et al.  Jason Altschuler, Victor-Emmanuel Brunel, and Alan Malek. Best Arm Identification for Contaminated Bandits. Journal of Machine Learning Research, 20:1–39, 2019.
- Auer and Cesa-Bianchi  Peter Auer and Nicolo Cesa-Bianchi. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, page 22, 2002.
- Auer and Chiang  Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. Conference on Learning Theory, pages 116–120, May 2016.
- Auer et al.  Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 32(1):48–77, January 2002. ISSN 0097-5397, 1095-7111. doi: 10.1137/S0097539701398375.
- Bai et al.  Yang Bai, Paul Hibbing, Constantine Mantis, and Gregory J. Welk. Comparative evaluation of heart rate-based monitors: Apple Watch vs Fitbit Charge HR. Journal of Sports Sciences, 36(15):1734–1741, August 2018. ISSN 0264-0414. doi: 10.1080/02640414.2017.1412235.
- Bubeck and Slivkins  Sebastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. Conference on Learning Theory, February 2012.
- Crussell et al.  Jonathan Crussell, Ryan Stevens, and Hao Chen. MAdFraud: Investigating ad fraud in android applications. In Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services - MobiSys ’14, pages 123–134, Bretton Woods, New Hampshire, USA, 2014. ACM Press. ISBN 978-1-4503-2793-0. doi: 10.1145/2594368.2594391.
- Diakonikolas et al.  Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust Estimators in High Dimensions without the Computational Intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
- Feehan et al.  Lynne M Feehan, Jasmina Geldman, Eric C Sayre, Chance Park, Allison M Ezzat, Ju Young Yoo, Clayon B Hamilton, and Linda C Li. Accuracy of Fitbit Devices: Systematic Review and Narrative Syntheses of Quantitative Data. JMIR mHealth and uHealth, 6(8):e10527, August 2018. ISSN 2291-5222. doi: 10.2196/10527.
- Gupta et al.  Anupam Gupta, Tomer Koren, and Kunal Talwar. Better Algorithms for Stochastic Bandits with Adversarial Corruptions. Advances in Neural Information Processing Systems, pages 3640–3649, 2018.
- Huber  Peter Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73–101, March 1964.
- Jun et al.  Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Xiaojin Zhu. Adversarial Attacks on Stochastic Bandits. arXiv:1810.12188 [cs, stat], October 2018.
- Kapoor et al.  Sayash Kapoor, Kumar Kshitij Patel, and Purushottam Kar. Corruption-tolerant bandit learning. Machine Learning, pages 1–29, August 2018. ISSN 0885-6125, 1573-0565. doi: 10.1007/s10994-018-5758-5.
- Kothari and Steurer  Pravesh K. Kothari and David Steurer. Outlier-robust moment-estimation via sum-of-squares. arXiv:1711.11581 [cs, stat], November 2017.
- Lai et al.  K. A. Lai, A. B. Rao, and S. Vempala. Agnostic Estimation of Mean and Covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674, October 2016. doi: 10.1109/FOCS.2016.76.
- Liu et al.  Liu Liu, Tianyang Li, and Constantine Caramanis. High Dimensional Robust Estimation of Sparse Models via Trimmed Hard Thresholding. arXiv:1901.08237 [cs, math, stat], January 2019.
- Lykouris et al.  Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 114–122, March 2018.
- Ma et al.  Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data Poisoning Attacks in Contextual Bandits. International Conference on Decision and Game Theory for Security, page 19, 2018.
- Necka et al.  Elizabeth A. Necka, Stephanie Cacioppo, Greg J. Norman, and John T. Cacioppo. Measuring the Prevalence of Problematic Respondent Behaviors among MTurk, Campus, and Community Participants. PLOS ONE, 11(6):e0157732, June 2016. ISSN 1932-6203. doi: 10.1371/journal.pone.0157732.
- Pearce et al.  Paul Pearce, Vacha Dave, Chris Grier, Kirill Levchenko, Saikat Guha, Damon McCoy, Vern Paxson, Stefan Savage, and Geoffrey M. Voelker. Characterizing Large-Scale Click Fraud in ZeroAccess. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security - CCS ’14, pages 141–152, Scottsdale, Arizona, USA, 2014. ACM Press. ISBN 978-1-4503-2957-6. doi: 10.1145/2660267.2660369.
- Ryan  Timothy Ryan. Data contamination on MTurk — Timothy J. Ryan, 2018.
- Seldin and Lugosi  Yevgeny Seldin and Gábor Lugosi. An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits. Conference on Learning Theory, pages 1743–1759, February 2017.
- Seldin and Slivkins  Yevgeny Seldin and Aleksandrs Slivkins. One Practical Algorithm for Both Stochastic and Adversarial Bandits. ICML, page 13, 2014.
- Seldin et al.  Yevgeny Seldin, Nicolò Cesa-Bianchi, Peter Auer, François Laviolette, and John Shawe-Taylor. PAC-Bayes-Bernstein Inequality for Martingales and its Application to Multiarmed Bandits. Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, 2012.
- Williams et al.  Joseph Jay Williams, Juho Kim, Anna Rafferty, Samuel Maldonado, Krzysztof Z. Gajos, Walter S. Lasecki, and Neil Heffernan. AXIS: Generating Explanations at Scale with Learnersourcing and Machine Learning. In Proceedings of the Third (2016) ACM Conference on Learning @ Scale - L@S ’16, pages 379–388, Edinburgh, Scotland, UK, 2016. ACM Press. ISBN 978-1-4503-3726-7. doi: 10.1145/2876034.2876042.
- Zimmert and Seldin  Julian Zimmert and Yevgeny Seldin. An Optimal Algorithm for Stochastic and Adversarial Bandits. arXiv:1807.07623 [cs, stat], July 2018.
Appendix A Proofs
A.1 Theorem 1
Proof of Theorem 1.
Without loss of generality assume for the underlying true distribution. For -sub-Gaussian, by definition, we have:
Let represent the points which are not contaminated and represent the contaminated points. Then our sample can be represented by the union . Let represent the points that remain after trimming fraction of the largest and smallest points, and be the set of points that were trimmed. Then we have,
Combining we get,
with probability at least . Letting and , and assuming , we have,
with probability at least . ∎
A.2 Theorem 2
Proof of Theorem 2.
Without loss of generality assume for the underlying true distribution. Let -sub-Gaussian.
We want to bound the impact of the contaminated points in our interval. Once we have this bound, the proof follows just as in the trimmed mean.
Assume and . Let be the interval that contains the shortest fraction of , be the interval that contains (i.e. the remaining good points after contamination), and be the interval that contains the points of after trimming the largest and smallest fraction of points. Use to denote the length of interval . It must be that because otherwise the points in would contain fraction of . Let be a point in and be a point in . Recall that trMean is the trimmed mean of the contaminated sample from above. Then we have,
The second step comes from and both being in and because . The third step comes from .
To bound the length of we have,
with probability at least , we get that for ,
Now that we have a bound on the contaminated points in , our analysis follows as before,
Combining we get,
with probability at least . Letting and , and assuming , we have,
with probability at least . ∎
A.3 Theorem 3
Proof of Theorem 3.
First we will show that for non-optimal actions. Assume .
Now to find for non-optimal actions.
Finally, we can find the regret following the standard analysis,
A.4 Theorem 4
Proof of Theorem 4.
The proof for the contamination robust UCB using the -shorth mean is similar to that of the trimmed mean.
Using the analysis from the trimmed mean regret, we again get,
Using this value and standard regret analysis yields
Appendix B Relationship of and
One quick example showing that can prohibit sublinear regret is to consider the CSB game with two actions and Bernoulli rewards. If and then an adversary can choose all the contaminated rewards for to be 1 making it appear that . Thus the actions are indistinguishable to the learner.
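The example can be made concrete with a small deterministic sketch. The numbers (true means 0.6 and 0.5, a contamination budget of 0.1, ten pulls per action) and the helper name `contaminate_to_ones` are our assumptions, chosen so that the budget equals the gap between the two actions.

```python
def contaminate_to_ones(rewards, eps):
    """Flip up to floor(eps * n) zeros to ones, as an adversary with budget eps could."""
    budget = int(eps * len(rewards))
    out = list(rewards)
    for i, r in enumerate(out):
        if budget == 0:
            break
        if r == 0:
            out[i] = 1
            budget -= 1
    return out


arm_1 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]   # empirical mean 0.6 (optimal action)
arm_2 = [1, 0] * 5                        # empirical mean 0.5 before contamination
observed_arm_2 = contaminate_to_ones(arm_2, eps=0.1)

print(sum(arm_1) / len(arm_1))                    # 0.6
print(sum(observed_arm_2) / len(observed_arm_2))  # 0.6
```

After contamination the two observed samples have the same mean, so no amount of additional data collected under the same adversary separates the actions.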
Appendix C Additional Simulations
We first provide plots exploring the sensitivity of crUCB to choices of . Then we look at the sensitivity of RUCB-MAB to certain adversaries.
C.1 Sensitivity of crUCB to
We provide two figures showing the sensitivity to of sUCB and tUCB. The setup used is the same as in section 5.
In fig. 5 we can see that for random cluster contamination at iteration 50 it is best to use an . It appears that tUCB is less sensitive to the choice of than sUCB given that it still performs near optimally for . This suggests it is better to overshoot than undershoot it.
In fig. 6 we see that sUCB is again most sensitive to the choice of and choosing an near gives the best results as we see that choosing an too large can worsen performance.
These plots show that in general sUCB is slightly more sensitive to the choice of , but that both perform best when is close to or greater than .
C.2 Sensitivity of RUCB-MAB to reward and contamination distribution
Here we give some examples where RUCB-MAB (Kapoor et al., 2018), which we refer to as kUCB in this section, can perform poorly.
In fig. 7(a) we use the same setup as in section 5. This plot shows that when the random cluster contamination happens at the beginning of the trials, RUCB-MAB (kUCB) can be severely impacted.
In fig. 7(b), we consider Gaussian true rewards, with , and this time use . Here we see that when the gap is small and the contamination gives the mean of the other action, RUCB-MAB (kUCB) confuses the optimal and non-optimal actions. While this setting can trick RUCB-MAB (kUCB), it has little impact on UCB1, sUCB or tUCB.
We can see from fig. 7 that while RUCB-MAB (kUCB) can perform well in certain contexts, a malicious adversary can easily degrade its performance in ways that have little effect on many other algorithms.