1 Introduction
Multi-armed bandits (MABs), a popular framework for sequential decision making, have been widely investigated and have many applications in a variety of scenarios [5, 20, 19]. The contextual bandit model is an extension of the multi-armed bandit model with contextual information. At each round, the reward is associated with both the arm (a.k.a. action) and the context, whereas the reward of stochastic MABs is associated only with the arm. Contextual bandit algorithms have a broad range of applications, such as recommender systems [21] and wireless networks [32].
In modern industry-scale applications of bandit algorithms, action decisions, reward signal collection, and policy iterations are normally implemented in a distributed network. Action decisions and reward signals may need to be transmitted over communication links. When data packets containing the reward signals, action decisions, etc. are transmitted through the network, an adversary can mount attacks by intercepting and modifying these packets. As a result, poisoning attacks on contextual bandits are possible. In many applications of contextual bandits, an adversary may have an incentive to force the contextual bandit system to learn a specific policy. For example, a restaurant may attack a bandit system to force it into increasing the restaurant’s exposure. Thus, understanding the risks of different kinds of adversarial attacks on contextual bandits is essential for the safe application of the contextual bandit model and for designing robust contextual bandit systems.
While many existing works address adversarial attacks on supervised learning models [34, 29, 7, 10, 35, 8, 6], the understanding of adversarial attacks on contextual bandit models is less complete. Of particular relevance to our work is a line of recent work on adversarial attacks on MABs [18, 22, 24] and on linear contextual bandits [27, 12]. In the MAB setting, recent works consider both reward poisoning attacks and action poisoning attacks. In reward poisoning attacks, an adversary can manipulate the reward signal received by the agent [18, 22]. In action poisoning attacks, the adversary can manipulate the action signal chosen by the agent before the environment receives it [24]. Existing works on adversarial attacks against linear contextual bandits focus on reward poisoning attacks [27, 12] or context poisoning attacks [12]. In context poisoning attacks, the adversary can modify the context observed by the agent without changing the reward associated with the context. There is also recent work on adversarial attacks against reinforcement learning algorithms under various settings [3, 17, 28, 36, 33, 30, 31, 25].
In this paper, we aim to investigate the impact of action poisoning attacks on contextual bandit models. To the best of our knowledge, this paper is the first to analyze the impact of action poisoning attacks on contextual bandit models. More detailed comparisons of various types of attacks against contextual bandits are provided in Section 3. We note that the goal of this paper is not to promote any particular type of poisoning attack. Rather, our goal is to understand the potential risks of action poisoning attacks. For the safe application and design of robust contextual bandit algorithms, it is essential to address all possible weaknesses of the models and to understand the risks of different kinds of adversarial attacks. Since the action poisoning attack is an important class of poisoning attacks and may threaten bandit systems, its potential risks need to be understood.
In this paper, we study action poisoning attacks against linear contextual bandits in both the whitebox and blackbox settings. In the whitebox setting, we assume that the attacker knows the coefficient vectors associated with the arms. Thus, at each round, the attacker knows the mean rewards of all arms. While it is often unrealistic to know the coefficient vectors exactly, understanding whitebox attacks can provide valuable insights into how to design the more practical blackbox attacks. In the blackbox setting, we assume that the attacker has no prior information about the arms and does not know the agent’s algorithm. The only information available to the attacker is the context, the action chosen by the agent, and the reward generated by the environment. In both settings, the attacker aims to manipulate the agent into frequently pulling a target arm chosen by the attacker at minimum cost. The cost is measured by the number of rounds in which the attacker changes the action selected by the agent. The contributions of this paper are:

We propose a new online action poisoning attack against contextual bandits in which the attacker aims to force the agent to frequently pull a target arm chosen by the attacker by strategically changing the agent’s actions.

We introduce a whitebox attack strategy that can manipulate any sublinear-regret linear contextual bandit agent into pulling a target arm rounds over a horizon of rounds, while incurring a cost that is sublinearly dependent on .

We design a blackbox attack strategy whose performance nearly matches that of the whitebox attack strategy. We apply the blackbox attack strategy against a very popular and widely used bandit algorithm: LinUCB. We show that our proposed attack scheme can force the LinUCB agent into pulling a target arm times with attack cost scaling as .

We evaluate our attack strategies using both synthetic and real-world datasets. We observe empirically that the total cost of our blackbox attack is sublinear for a variety of contextual bandit algorithms.
2 Related Work
In this section, we discuss two lines of related work: adversarial attacks that cause standard bandit algorithms to fail, and robust bandit algorithms that can defend against such attacks.
Attack models. In the MAB setting, [18] proposes a reward poisoning attack strategy that can force a Greedy or upper confidence bound (UCB) agent to select a target arm while spending only logarithmic effort. The main idea of the attack scheme in [18] is to modify the reward signals associated with nontarget arms to smaller values. As the agent only observes the modified reward signals, the target arm appears to be the optimal arm for the agent. [22] proposes an optimization-based framework for offline reward poisoning attacks on MABs. Furthermore, it studies online attacks on MABs and proposes an adaptive attack strategy that is effective against any bandit algorithm without knowing which particular algorithm the agent is using. [24] proposes an adaptive action poisoning attack strategy that can force the UCB agent to pull a target arm times over rounds while the total attack cost is only .
In the linear contextual bandit setting, [27] studies offline reward poisoning attacks and investigates the feasibility and the impact of such attacks. The attacker in [27] aims to force the agent to pull a target arm on a particular context. [12] extends the attack idea of [18, 22] to linear contextual bandits. It proves that the proposed reward poisoning attack strategy can force any bandit algorithm to pull a specific set of arms when the rewards are bounded. It introduces an adaptive reward poisoning attack strategy and observes empirically that the total cost of the adaptive attack is sublinear. In addition, [12] analyzes context poisoning attacks in the whitebox setting and shows that LinUCB is vulnerable to such attacks. In the field of adversarial attacks on RL, [25] studies blackbox action poisoning attacks against RL.
Robust algorithms. Much effort has been devoted to designing robust bandit algorithms that defend against adversarial attacks. In the MAB setting, [26] introduces a bandit algorithm, called the Multilayer Active Arm Elimination Race algorithm, that is robust to reward poisoning attacks by using a multilayer approach. [15] presents an algorithm named BARBAR that is robust to reward poisoning attacks and whose regret is nearly optimal. [14] considers a reward poisoning attack model in which an adversary attacks with a certain probability at each round. As the attack value at each round can be arbitrary and unbounded, the attack model can be powerful. The paper proposes algorithms that are robust to these types of attacks. [11] introduces a reward poisoning attack setting in which each arm can only manipulate its own reward. Every arm can be considered an adversary, and each arm seeks to maximize its own expected pull count. Under this setting, [11] proves that Thompson Sampling, UCB, and greedy can be modified to be robust to such attacks. [23] introduces a bandit algorithm, called MOUCB, that is robust to action poisoning attacks and achieves a regret upper bound that increases logarithmically with the number of rounds or linearly with the attack cost.
In the linear contextual bandit setting, [4] proposes a stochastic linear bandit algorithm, called Robust Phased Elimination (RPE), that is robust to reward poisoning attacks. It provides two variants of the RPE algorithm, one for the case of a known attack budget and one for the case of an agnostic attack budget. [9] provides a robust linear contextual bandit algorithm, called RobustBandit, that works under both reward poisoning attacks and context poisoning attacks.
3 Problem Setup
Consider the standard contextual linear bandit model in which the environment consists of arms. In each round , the agent observes a context , pulls an arm and receives a reward . Each arm is associated with an unknown but fixed coefficient vector . In each round , the reward is
where is a conditionally independent zero-mean subgaussian noise and denotes the inner product. Hence, the expected reward of arm under context follows the linear setting:
(1) 
for all and all arm . If we consider the algebra , becomes measurable and becomes measurable.
In this paper, we assume that there exist and , such that for all round and arm , and , where denotes the norm. We assume that there exist such that for all , and, for all and all arm , .
The agent is interested in minimizing the cumulative pseudoregret
(2) 
where .
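As an illustration of the cumulative pseudo-regret in (2), the following minimal sketch sums, over rounds, the gap between the best mean reward at the current context and the mean reward of the arm the agent pulled. The names `contexts`, `thetas`, and `pulls` are illustrative assumptions rather than notation from the paper, and the coefficient vectors are treated as known only for the purpose of evaluation.

```python
def inner(x, theta):
    """Inner product of a context and a coefficient vector."""
    return sum(xi * ti for xi, ti in zip(x, theta))


def pseudo_regret(contexts, thetas, pulls):
    """Cumulative pseudo-regret of a sequence of pulls.

    contexts: list of context vectors, one per round
    thetas:   list of coefficient vectors, one per arm (assumed known here)
    pulls:    list of arm indices chosen by the agent, one per round
    """
    total = 0.0
    for x, arm in zip(contexts, pulls):
        means = [inner(x, th) for th in thetas]
        # Gap between the best arm at this context and the pulled arm.
        total += max(means) - means[arm]
    return total
```

For example, an agent that always pulls the same arm accumulates regret only in the rounds where that arm is not the best one for the observed context.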
In this paper, we introduce a novel adversarial setting in which the attacker can manipulate the action chosen by the agent. In particular, at each round , after the agent chooses an arm , the attacker can manipulate the agent’s action by changing to another . If the attacker decides not to attack, . The environment generates a random reward based on the post-attack arm and the context . Then the agent and the attacker receive reward from the environment. Since the agent is unaware of the attacker’s presence and manipulations, the agent will still view as the reward corresponding to the arm .
The goal of the attacker is to design an attack strategy that manipulates the agent into pulling a target arm very frequently while attacking as rarely as possible. Without loss of generality and for notational convenience, we assume arm is the “attack target” arm or target arm. Define the set of rounds when the attacker decides to attack as . The cumulative attack cost is the total number of rounds where the attacker decides to attack, i.e., . The attacker can monitor the contexts, the actions of the agent and the reward signals from the environment.
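The interaction described above can be sketched as a single simulated round plus a cost counter. This is a minimal illustration under stated assumptions: the function names, the callable `attacker`, and the use of Gaussian noise (one example of subgaussian noise) are choices made here, not part of the paper.

```python
import random


def play_round(context, agent_arm, attacker, thetas, noise_std, rng):
    """Run one round of the action poisoning protocol.

    The attacker sees the context and the agent's arm and returns the
    post-attack arm. The environment draws the reward from the post-attack
    arm; the agent is shown only the reward and credits it to `agent_arm`.
    """
    post_attack_arm = attacker(context, agent_arm)
    mean = sum(x * t for x, t in zip(context, thetas[post_attack_arm]))
    reward = mean + rng.gauss(0.0, noise_std)  # Gaussian noise as an example
    attacked = post_attack_arm != agent_arm
    return post_attack_arm, reward, attacked


def attack_cost(attacked_flags):
    """Cumulative attack cost: number of rounds in which the action was changed."""
    return sum(1 for flag in attacked_flags if flag)


# Toy usage: a naive attacker that redirects every pull to arm 1.
rng = random.Random(0)
thetas = [[1.0, 0.5], [0.2, 0.3]]
flags = []
for agent_arm in [0, 1, 0]:
    _, _, attacked = play_round([1.0, 1.0], agent_arm, lambda x, a: 1,
                                thetas, 0.1, rng)
    flags.append(attacked)
cost = attack_cost(flags)
```

Note that a round in which the attacker returns the agent's own arm does not count toward the cost, matching the definition of the attack cost as the number of rounds in which the action is actually changed.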
We now compare the three types of poisoning attacks against contextual linear bandits: the reward poisoning attack, the action poisoning attack, and the context poisoning attack. In the reward poisoning attack [27, 12], after the agent observes context and chooses arm , the environment generates reward based on context and arm . Then, the attacker can change the reward to and feed to the agent. Compared with reward poisoning attacks, the action poisoning attack considered in this paper is more difficult to carry out. In particular, as the action poisoning attack only changes the action, it can influence but does not directly control the reward signal. By changing the action to , the reward received by the agent is changed from to , which is a random variable drawn from a distribution based on the action and context . This is in contrast to reward poisoning attacks, where the attacker has direct control and can change the reward signal to any value of their choice. In the context poisoning attack [12], the attacker only changes the context shown to the agent. The reward is still generated based on the true context and the agent’s action . Nevertheless, the agent’s action may be indirectly affected by the manipulation of the context, and so may the reward. Since the attacker acts before the agent pulls an arm, the context poisoning attack is the most difficult to carry out. As mentioned in the introduction, the goal of this paper is not to promote any particular type of poisoning attack. Instead, our goal is to understand the potential risks of action poisoning attacks, as the safe application and design of robust contextual bandit algorithms rely on addressing all possible weaknesses of the models.
As the action poisoning attack only changes the actions, it can influence but does not directly control the agent’s observations. Furthermore, when the action space is discrete and finite, the ability of the action poisoning attacker is severely limited. It is therefore reasonable to limit the choice of the target policy. Here we introduce an important assumption that the target arm is not the worst arm:
Assumption 1.
For all , .
If the target arm is the worst arm in most contexts, the attacker would have to replace pulls of the target arm with a better arm or the optimal arm so that the agent learns that the target arm is optimal for almost every context. In this case, the cost of attack may be up to . Assumption 1 does not imply that the target arm is optimal at some contexts. The target arm could be suboptimal for all contexts. Fig. 1 shows an example of a one-dimensional linear contextual bandit model, where the horizontal axis represents the contexts and the vertical axis represents the mean rewards of the arms under different contexts. As shown in Fig. 1, arm 3 and arm 4 satisfy Assumption 1. In addition, arm 3 is not optimal at any context.
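The plain reading of the assumption, namely that the target arm is never the worst arm, can be checked numerically on a finite sample of contexts. The sketch below is an illustration only: the exact inequality of Assumption 1 is not reproduced here, and the function name and the sampled-context check are assumptions made for this example.

```python
def target_not_worst(contexts, thetas, target):
    """Return True if, at every sampled context, at least one other arm has a
    strictly smaller mean reward than the target arm."""
    for x in contexts:
        means = [sum(xi * ti for xi, ti in zip(x, th)) for th in thetas]
        others = [m for i, m in enumerate(means) if i != target]
        if means[target] <= min(others):
            return False
    return True


# Three arms with constant mean rewards 1.0, 0.2 and 0.5 (first context entry is 1).
thetas = [[1.0, 0.0], [0.2, 0.0], [0.5, 0.0]]
contexts = [[1.0, k / 10.0] for k in range(1, 11)]
middle_arm_ok = target_not_worst(contexts, thetas, target=2)
worst_arm_ok = target_not_worst(contexts, thetas, target=1)
```

In this toy instance the middle arm passes the check even though it is never optimal, which mirrors the remark that the target arm may be suboptimal for all contexts.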
4 Attack Schemes and Cost Analysis
In this section, we introduce action poisoning attack schemes in the whitebox setting and the blackbox setting, respectively. To demonstrate the significant security threat that action poisoning attacks pose to linear contextual bandits, we investigate our action poisoning attack strategy against a widely used algorithm, LinUCB. Furthermore, we analyze the attack cost of our action poisoning attack schemes.
4.1 Overview of LinUCB
For the reader’s convenience, we first provide a brief overview of the LinUCB algorithm [21]. The LinUCB algorithm is summarized in Algorithm 1. The main steps of LinUCB are to obtain estimates of the unknown parameters using past observations and then make decisions based on these estimates. Define as the set of rounds up to where the agent pulls arm . Let . Then, at round , the regularized least-squares estimate of with regularization parameter is obtained by [21]
(4) 
where with being the identity matrix.
After ’s are obtained, at each round, an upper confidence bound of the mean reward has to be calculated for each arm (step 5 of Algorithm 1). Then, the LinUCB algorithm picks the arm with the largest upper confidence bound (step 7 of Algorithm 1). By following the setup in “optimism in the face of uncertainty linear algorithm” (OFUL) [1], we set
(5) 
We define . It is easy to verify that is a monotonically increasing function over .
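The two steps of LinUCB described in this subsection, a per-arm regularized least-squares estimate followed by an upper-confidence-bound arm choice, can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: the class and function names are hypothetical, the confidence multiplier `beta` is passed in as a plain number rather than computed from (5), and the inverse of the regularized design matrix is maintained with rank-one (Sherman-Morrison) updates to keep the example dependency-free.

```python
import math


class LinUCBArm:
    """Ridge-regression statistics for one arm, kept via rank-one updates."""

    def __init__(self, dim, lam=1.0):
        # Inverse of (lam * I + sum of x x^T); starts at (1 / lam) * I.
        self.v_inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * context

    def _v_inv_times(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.v_inv]

    def theta_hat(self):
        """Regularized least-squares estimate of the coefficient vector."""
        return self._v_inv_times(self.b)

    def ucb(self, x, beta):
        """Estimated mean reward plus beta times the weighted norm of x."""
        u = self._v_inv_times(x)
        mean = sum(t * xi for t, xi in zip(self.theta_hat(), x))
        width = math.sqrt(max(sum(ui * xi for ui, xi in zip(u, x)), 0.0))
        return mean + beta * width

    def update(self, x, reward):
        """Sherman-Morrison update of the inverse after adding x x^T."""
        u = self._v_inv_times(x)
        denom = 1.0 + sum(ui * xi for ui, xi in zip(u, x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.v_inv[i][j] -= u[i] * u[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def linucb_choose(arms, x, beta):
    """Index of the arm with the largest upper confidence bound."""
    scores = [arm.ucb(x, beta) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)
```

Only the statistics of the arm that the agent believes it pulled are updated each round, which is exactly the property the action poisoning attack exploits: the reward credited to that arm may have been generated by a different arm.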
4.2 Whitebox Attack
We first consider the whitebox attack scenario, in which the attacker has knowledge of the environment. In particular, in the whitebox attack scenario, the attacker knows the coefficient vectors ’s for all . The understanding of this scenario could provide useful insights for the more practical blackbox attack scenario to be discussed in Section 4.3.
The proposed attack strategy works as follows. When the agent chooses arm , the attacker does not attack. When the agent chooses arm , the attacker changes it to arm
(6) 
where and
(7) 
We now explain the main idea behind the attack strategy specified in (6) and (7). Intuitively speaking, using (6) and (7), the attacker can manipulate the agent into learning some particular coefficient vectors. In particular, for arm (the target arm), the agent obtains the true coefficient vector . For any arm , because of the attacks, the agent will obtain an estimate of arm ’s coefficient vector to be close to . This can be seen as follows. Under Assumption 1 and the fact that the mean rewards are positive, we have that . Under the whitebox attack, when the agent pulls a nontarget arm , the mean reward received by the agent should satisfy
(8)  
Because of the positive mean reward assumption and the fact that , the target arm appears to be the optimal arm for any context. Hence the proposed whitebox attack strategy can easily mislead an agent who employs a contextual bandit algorithm with sublinear pseudoregret, as shown in the following proposition whose proof can be found in Appendix A of the supplementary material.
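The mixing idea behind the whitebox attack can be sketched as follows. This is a hedged illustration, not a transcription of (6) and (7): it assumes a hypothetical shrink factor `alpha` and picks the mixing probability so that the expected post-attack mean reward equals `(1 - alpha)` times the target arm's mean reward, which makes every nontarget arm look like a scaled-down copy of the target arm. All function and parameter names are assumptions made for this example.

```python
import random


def mixing_probability(mu_target, mu_worst, alpha):
    """Probability of redirecting to the worst arm so that the expected
    post-attack mean equals (1 - alpha) * mu_target (assumed target value)."""
    gap = mu_target - mu_worst
    if gap <= 0.0:
        return 0.0  # target is (tied for) worst here; mixing cannot help
    return min(1.0, max(0.0, alpha * mu_target / gap))


def whitebox_attack(x, agent_arm, thetas, target, alpha, rng):
    """Return the post-attack arm for one round, given the true coefficients."""
    if agent_arm == target:
        return agent_arm  # never attack pulls of the target arm
    means = [sum(xi * ti for xi, ti in zip(x, th)) for th in thetas]
    worst = min(range(len(thetas)), key=means.__getitem__)
    eps = mixing_probability(means[target], means[worst], alpha)
    return worst if rng.random() < eps else target
```

As a quick check of the design, with target mean 0.5, worst mean 0.1 and `alpha = 0.2`, the mixing probability is 0.25 and the expected post-attack mean is 0.25 * 0.1 + 0.75 * 0.5 = 0.4, that is, 0.8 times the target mean.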
Proposition 1.
For . Suppose the agent employs a contextual bandit algorithm whose cumulative pseudoregret is upper bounded by over with probability . When using the proposed whitebox attack scheme, with probability , the attacker can force the agent to pull the target arm for times, while the total attack cost is upper bounded by .
Proposition 1 applies to any contextual bandit algorithm whose cumulative pseudoregret is upper bounded by . To further illustrate the proposed attack scheme, we now provide a finer analysis of the impact of this attack on LinUCB described in Algorithm 1. We first build a high probability bound for the estimate of parameter at the agent, when the agent computes the estimate of by (4) and the attacker employs the whitebox attack. Recall that .
Lemma 1.
Under the proposed whitebox attack, the estimate of for each arm obtained by LinUCB agent as described in Algorithm 1 satisfies
(9) 
with probability , for all arm and all . Here, is the weighted norm of vector for a positive definite matrix .
The proof of Lemma 1 is provided in Appendix B of the supplementary material. Lemma 1 shows that, under our whitebox attack, the agent’s estimate of the parameter of nontarget arm, i.e. , will converge to . Thus, the agent is misled to believe that arm is the optimal arm for every context in most rounds. The following theorem provides an upper bound of the cumulative cost of the attack.
Theorem 1.
Define . Under the same assumptions as in Lemma 1, for any with probability at least , for all , the attacker can manipulate the LinUCB agent into pulling the target arm in at least rounds, using an attack cost
(10) 
4.3 Blackbox Attack
We now focus on the more practical blackbox setting, in which the attacker does not know the coefficient vector of any arm. The attacker knows the value of (or a lower bound) for which equation (3) holds for all . Although the attacker does not know the coefficient vectors of the arms, the attacker can compute an estimate of the unknown parameters by using past observations. On the other hand, the estimation errors bring multiple challenges that need to be properly addressed.
The proposed blackbox attack strategy works as follows. When the agent chooses arm , the attacker does not attack. When the agent chooses arm , the attacker changes it to arm
(11) 
where
(12) 
and
(13) 
when and , and
(14) 
with where .
For notational convenience, we set and when . We define that, if , and ; and .
(15) 
where and
(16) 
Here, is the estimation of by the attacker, while in (4) is the estimation of at the agent side. We will show in Lemma 2 and Lemma 4 that will be close to the true value of while will be close to a suboptimal value chosen by the attacker. This disparity gives the attacker the advantage and foundation for carrying out the attack.
We now highlight the main idea behind our blackbox attack strategy. As discussed in Section 4.2, if the attacker knows the coefficient vectors of all arms, the proposed whitebox attack scheme can mislead the agent to believe that the coefficient vector of every nontarget arm is , hence the agent will think the target arm is optimal. In the blackbox setting, the attacker does not know the coefficient vector of any arm and must estimate the coefficient vector of each arm. The attacker then uses the estimated coefficient vectors in place of the true coefficient vectors in the whitebox attack scheme. As the attacker does not know the true values of ’s, we need to design the estimator , the attack choice and the probability carefully. In the following, we explain the main ideas behind our design choices.
First, we explain why we design the estimator using the form (15), in which the attacker employs importance sampling to obtain an estimate of . There are two reasons for this. The first is that, for a successful attack, the number of observations of arm will be limited. Hence, if importance sampling is not used, the estimation variance of the mean reward at the attacker side for some contexts may be large. The second is that the attacker’s action is stochastic when the agent pulls a nontarget arm. Thus, the attacker uses the observations at round when the attacker pulls arm with certain probability, i.e. when , to estimate . At the agent side, since the agent’s action is deterministic, the agent uses the observations at round when the agent pulls arm , i.e. when , to estimate .
Second, we explain the ideas behind the choice of in (12). Under our blackbox attack, when the agent pulls a nontarget arm , the mean reward received by the agent satisfies
(17)  
In our whitebox attack scheme, is the worst arm at context . In the blackbox setting, the attacker does not know a priori which arm is the worst. In the proposed blackbox attack scheme, as indicated in (12), we use the lower confidence bound (LCB) method to explore the worst arm, and is the arm whose lower confidence bound is the smallest.
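The LCB selection step can be illustrated with a short sketch. It assumes the attacker already holds, for the current context, an estimated mean reward and a confidence width for each arm; how those quantities are computed in (12) is not reproduced here, and the names are hypothetical.

```python
def lcb_worst_arm(est_means, widths):
    """Index of the arm whose lower confidence bound (estimate minus width)
    is the smallest at the current context."""
    lcbs = [m - w for m, w in zip(est_means, widths)]
    return min(range(len(lcbs)), key=lcbs.__getitem__)


# Arm 2 has a higher estimate than arm 1 but a much wider interval,
# so its lower confidence bound is the smallest and it gets explored.
chosen = lcb_worst_arm([0.5, 0.3, 0.4], [0.0, 0.05, 0.3])
```

Choosing by the lower bound rather than by the point estimate means that rarely observed arms, whose widths are large, keep being tried until the attacker is confident they are not the worst.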
Finally, we explain why we choose using (14). In our whitebox attack scheme, we have that . Thus, in our blackbox attack scheme, we limit the choice of to . Furthermore, in (7) used for the whitebox attack, is computed from the true mean reward. In the blackbox attack, as the attacker does not know the true coefficient vector, the attacker uses the estimate of to compute the second term in the clip function in (14).
In summary, intuitively speaking, our design of , and can ensure that the attacker’s estimation will be close to , while the agent’s estimation will be close to . In the following, we make these statements precise, and formally analyze the performance of the proposed blackbox attack scheme.
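To make the importance-sampling idea concrete, the sketch below implements a generic inverse-probability-weighted ridge regression: each observation is weighted by the reciprocal of the probability with which the attacker selected the arm that produced it. This is an assumption-laden illustration of the general technique and may differ from the estimator in (15); the class name, its interface, and the choice to weight both the design matrix and the response are all choices made here.

```python
class WeightedRidge:
    """Inverse-probability-weighted ridge regression for one arm."""

    def __init__(self, dim, lam=1.0):
        # Inverse of (lam * I + sum of w * x x^T); starts at (1 / lam) * I.
        self.v_inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of w * reward * context

    def add(self, x, reward, prob):
        """Add an observation whose arm was selected with probability `prob`."""
        w = 1.0 / prob
        u = [sum(row[j] * x[j] for j in range(len(x))) for row in self.v_inv]
        # Sherman-Morrison update for adding w * x x^T to the design matrix.
        denom = 1.0 + w * sum(ui * xi for ui, xi in zip(u, x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.v_inv[i][j] -= w * u[i] * u[j] / denom
        self.b = [bi + w * reward * xi for bi, xi in zip(self.b, x)]

    def estimate(self):
        """Current weighted least-squares estimate of the coefficient vector."""
        return [sum(row[j] * self.b[j] for j in range(len(self.b)))
                for row in self.v_inv]
```

Observations gathered with a small selection probability receive a large weight, which compensates for how rarely the corresponding arm is actually played under the attack.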
First, we analyze the estimation at the attacker side. We establish a confidence ellipsoid of at the attacker.
Lemma 2.
Assume the attacker performs the proposed blackbox action poisoning attack. With probability , we have
(18) 
holds for all arm and all simultaneously.
Lemma 2 shows that lies in an ellipsoid with center at with high probability, which implies that the attacker has a good estimate of each arm.
We then analyze the estimation at the agent side. The following lemma provides an upper bound on the absolute difference between and .
Lemma 3.
Under the blackbox attack, with probability , the estimate obtained by a LinUCB agent satisfies
simultaneously for all when .
The bound in Lemma 3 consists of the confidence ellipsoid of the estimate of arm and that of arm . As mentioned above, for a successful attack, the number of observations on arm will be limited. Thus, in our proposed algorithm, the attacker uses importance sampling to obtain the estimate of , which increases the number of observations that can be used to estimate the coefficient vector of arm . Using Lemma 3, we have the following lemma regarding the estimation at the agent side.
Lemma 4.
Consider the same assumption as in Lemma 2. With a probability at least , the estimate obtained by the LinUCB agent will satisfy
(19)  
simultaneously for all arm and all .
Lemma 4 shows that, under the proposed blackbox attack scheme, the agent’s estimate of the parameter of nontarget arm, i.e. , will converge to . As a result, the agent will believe that the target arm is the optimal arm for any context in most rounds. Using these supporting lemmas, we can then analyze the performance of the proposed blackbox attack strategy.
Theorem 2.
Under the same assumptions as in Lemma 4, with probability at least , for all , the attacker can manipulate a LinUCB agent into pulling the target arm in at least rounds, using an attack cost
Theorem 2 shows that our blackbox attack strategy can manipulate a LinUCB agent into pulling a target arm times with attack cost scaling as . Compared with the result for the whitebox attack, the blackbox attack only brings an additional factor.
5 Numerical Experiments
In this section, we empirically evaluate the performance of the proposed action poisoning attack schemes on three contextual bandit algorithms: LinUCB [1], LinTS [2], and Greedy. We run the experiments on three datasets:
Synthetic data: The dimension of contexts and the coefficient vectors is . We set the first entry of every context and coefficient vector to . The other entries of every context and coefficient vector are uniformly drawn from . Thus, for all round and arm , , and mean rewards . The reward noise is drawn from a Gaussian distribution .
Jester dataset [13]: Jester contains 4.1 million ratings of jokes in which the rating values scale from to . We normalize the rating to . The dataset includes 100 jokes, and the ratings were collected from 73,421 users between April 1999 and May 2003. We consider a subset of 10 jokes and 38,432 users. Every joke is rated by each user. We perform a low-rank matrix factorization () on the ratings data and obtain the features for both users and jokes. At each round, the environment randomly selects a user as the context and the reward noise is drawn from a Gaussian distribution .
MovieLens 25M dataset [16]: The MovieLens 25M dataset contains 25 million 5-star ratings of 62,000 movies by 162,000 users. The preprocessing of this data is almost the same as for the Jester dataset, except that we consider a subset of 10 movies and 7,344 users. At each round, the environment randomly selects a user as the context and the reward noise is drawn from a Gaussian distribution .
We set and . For all the experiments, we set the total number of rounds and the number of arms . We independently run ten repeated experiments. Results reported are averaged over the ten experiments.
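The synthetic data construction can be sketched as below. The numeric choices are assumptions made for this illustration, since the values are not reproduced in the text above: the first entry of every vector is fixed to 1.0 and the remaining entries are drawn uniformly from an interval of half-width `1 / sqrt(dim - 1)`, which is one way to keep every mean reward positive. The function names and arguments are hypothetical.

```python
import math
import random


def make_vector(dim, rng):
    """Vector with a fixed first entry and small uniform remaining entries.

    With half-width 1 / sqrt(dim - 1), the inner product of the remaining
    entries of two such vectors has absolute value at most 1, so adding the
    product of the first entries (1.0) keeps the mean reward non-negative.
    """
    half_width = 1.0 / math.sqrt(dim - 1)
    return [1.0] + [rng.uniform(-half_width, half_width) for _ in range(dim - 1)]


def make_synthetic(num_arms, num_rounds, dim, seed=0):
    """Draw coefficient vectors for the arms and one context per round."""
    rng = random.Random(seed)
    thetas = [make_vector(dim, rng) for _ in range(num_arms)]
    contexts = [make_vector(dim, rng) for _ in range(num_rounds)]
    return thetas, contexts
```

Fixing the seed makes a run reproducible, which is convenient when averaging results over several independent repetitions with different seeds.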
The results are shown in Table 1 and Figure 2. These experiments show that the action poisoning attacks can force the three agents to pull the target arm very frequently, while the agents rarely pull the target arm under no attack. Under the attacks, the true regret of the agent becomes linear as the target arm is not optimal for most contexts. Table 1 shows the number of rounds the agent pulls the target arm among total rounds. In the synthetic dataset, under the proposed whitebox attacks, the target arm is pulled more than of the times by the three agents (see Table 1). The target arm is pulled more than of the times in the worst case (the blackbox attacks on LinUCB). Fig. 2 shows the cumulative cost of the attacks on three agents for the three datasets. The results show that the attack cost of every attack scheme on every agent for every dataset scales sublinearly, which exposes a significant security threat of the action poisoning attacks on linear contextual bandits.
Setting                       Synthetic   Jester      MovieLens
Greedy without attacks        2124.6      5908.7      3273.5
Whitebox attack on Greedy     982122.5    971650.9    980065.6
Blackbox attack on Greedy     973378.5    939090.2    935293.8
LinUCB without attacks        8680.9      16927.2     13303.4
Whitebox attack on LinUCB     981018.7    911676.9    969118.6
Blackbox attack on LinUCB     916140.8    875284.7    887373.1
LinTS without attacks         5046.9      18038.0     9759.0
Whitebox attack on LinTS      981112.8    908488.3    956821.1
Blackbox attack on LinTS      918403.8    862556.8    825034.8
6 Conclusion
In this paper, we have proposed a new class of attacks on linear contextual bandits: action poisoning attacks. We have shown that our whitebox attack strategy is able to force any linear contextual bandit agent, whose regret scales sublinearly with the total number of rounds, into pulling a target arm chosen by the attacker. In addition, we have shown that our whitebox attack strategy can force a LinUCB agent into pulling a target arm times with attack cost scaling as . We have further shown that the proposed blackbox attack strategy can force a LinUCB agent into pulling a target arm times with attack cost scaling as . Our results expose a significant security threat to contextual bandit algorithms. In the future, we will investigate defense strategies to mitigate the effects of this attack.
References
 [1] (2011) Improved algorithms for linear stochastic bandits. Advances in neural information processing systems 24, pp. 2312–2320. Cited by: Appendix B, Appendix C, Appendix F, Appendix G, §4.1, §5.

 [2] (17–19 Jun. 2013) Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research, Atlanta, Georgia, USA, pp. 127–135. Cited by: §5.
 [3] (2017) Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pp. 262–275. Cited by: §1.
 [4] (2021) Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pp. 991–999. Cited by: §2.
 [5] (Dec. 2014) Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST) 5 (4), pp. 61:1–61:34. ISSN 2157-6904. Cited by: §1.
 [6] (2020) Teaching with limited information on the learner’s behaviour. In International Conference on Machine Learning, pp. 2016–2026. Cited by: §1.
 [7] (2019) Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320. Cited by: §1.
 [8] (2019) Teaching a blackbox learner. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 1547–1555. Cited by: §1.
 [9] (2021) Robust stochastic linear contextual bandits under adversarial attacks. arXiv preprint arXiv:2106.02978. Cited by: §2.
 [10] (2019) Generalized no free lunch theorem for adversarial robustness. In Proceedings of the 36th International Conference on Machine Learning, pp. 1646–1654. Cited by: §1.
 [11] (2020) The intrinsic robustness of stochastic bandits to strategic manipulation. In International Conference on Machine Learning, pp. 3092–3101. Cited by: §2.
 [12] (2020) Adversarial attacks on linear contextual bandits. In Advances in Neural Information Processing Systems, pp. 14362–14373. Cited by: §1, §2, §3.
 [13] (2001) Eigentaste: a constant time collaborative filtering algorithm. information retrieval 4 (2), pp. 133–151. Cited by: §5.
 [14] (Feb. 2020) Robust stochastic bandit algorithms under probabilistic unbounded adversarial attack. In Proceedings of the AAAI Conference on Artificial Intelligence, New York City, NY, pp. 4036–4043. Cited by: §2.
 [15] (2019) Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pp. 1562–1578. Cited by: §2.
 [16] (2015) The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (4), pp. 1–19. Cited by: §5.

 [17] (2019) Deceptive reinforcement learning under adversarial manipulations on cost signals. In International Conference on Decision and Game Theory for Security, pp. 217–237. Cited by: §1.
 [18] (Dec. 2018) Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, Montréal, Canada, pp. 3644–3653. Cited by: §1, §2, §2.
 [19] (Jul. 2015) Cascading bandits: learning to rank in the cascade model. In International Conference on Machine Learning, Lille, France, pp. 767–776. Cited by: §1.
 [20] (Feb. 2011) Cognitive medium access: exploration, exploitation and competition. IEEE Transactions on Mobile Computing 10 (2), pp. 239–253. Cited by: §1.
 [21] (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. Cited by: §1, §4.1, Algorithm 1.
 [22] (Jun. 2019) Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, Vol. 97, Long Beach, CA, pp. 4042–4050. Cited by: §1, §2, §2.
 [23] (2020) Action-manipulation attacks against stochastic bandits: attacks and defense. IEEE Transactions on Signal Processing 68, pp. 5152–5165. Cited by: §2.
 [24] (2020) Action-manipulation attacks on stochastic bandits. In ICASSP, pp. 3112–3116. Cited by: §1, §2.
 [25] (2021) Provably efficient black-box action poisoning attacks against reinforcement learning. Advances in Neural Information Processing Systems 34. Cited by: §1, §2.

 [26] (Jun. 2018) Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, Los Angeles, CA, pp. 114–122. External Links: ISBN 978-1-4503-5559-9. Cited by: §2.
 [27] (2018) Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pp. 186–204. Cited by: §1, §2, §3.
 [28] (2019) Policy poisoning in batch reinforcement learning and control. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §1.

 [29] (2017) Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1765–1773. Cited by: §1.
 [30] (2020) Policy teaching via environment poisoning: training-time adversarial attacks against reinforcement learning. In International Conference on Machine Learning, pp. 7974–7984. Cited by: §1.
 [31] (2021) Reward poisoning in reinforcement learning: attacks against unknown learners in unknown environments. arXiv preprint arXiv:2102.08492. Cited by: §1.
 [32] (2019) Contextual multi-armed bandits for link adaptation in cellular networks. In Proceedings of the 2019 Workshop on Network Meets AI & ML, pp. 44–49. Cited by: §1.
 [33] (2021) Vulnerability-aware poisoning mechanism for online RL with unknown dynamics. In International Conference on Learning Representations. Cited by: §1.

 [34] (2014) Intriguing properties of neural networks. In International Conference on Learning Representations. Cited by: §1.
 [35] (2019) On the convergence and robustness of adversarial training. In ICML, Vol. 1, pp. 2. Cited by: §1.
 [36] (2020) Adaptive reward-poisoning attacks against reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119, pp. 11225–11234. Cited by: §1.
Appendix A Proof of Proposition 1
When the agent pulls a non-target arm , the mean reward received by the agent should satisfy . In the observation of the agent, the target arm becomes optimal and the non-target arms are associated with the coefficient vector . In addition, the cumulative pseudo-regret should satisfy . If is upper bounded by , is also upper bounded by .
Appendix B Proof of Lemma 1
In our model, the mean reward is bounded by . Since the mean rewards are bounded and the rewards are generated independently, we have and . Thus, is a bounded martingale difference sequence w.r.t. the filtration .
Then, by Azuma’s inequality,
(22)  
where represents the confidence bound. To ensure that the confidence bounds hold for all arms and all rounds simultaneously, we set so that
(23)  
where the last inequality is obtained from the fact that
(24)  
In other words, with probability , we have
(25)  
for all arms and all .
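As a numerical sanity check on the concentration step, the following is a minimal sketch of the generic Azuma-Hoeffding bound for a bounded martingale difference sequence: if each increment is bounded by c in absolute value, the sum of n increments exceeds c·sqrt(2n·log(2/δ)) in absolute value with probability at most δ. It does not reproduce the exact constants of (22) and (23). The function names, the noise model (independent ±c signs), and the parameter values are assumptions made for illustration only.

```python
import math
import random


def azuma_radius(n, c, delta):
    """Radius eps such that P(|X_1 + ... + X_n| >= eps) <= delta when |X_i| <= c."""
    return c * math.sqrt(2.0 * n * math.log(2.0 / delta))


def empirical_violation_rate(n=200, c=1.0, delta=0.05, trials=2000, seed=0):
    """Fraction of simulated runs whose sum falls outside the Azuma radius."""
    rng = random.Random(seed)
    radius = azuma_radius(n, c, delta)
    violations = 0
    for _ in range(trials):
        # Independent zero-mean +-c noise: a simple bounded martingale
        # difference sequence (hypothetical stand-in for the reward noise).
        total = sum(c * rng.choice((-1.0, 1.0)) for _ in range(n))
        if abs(total) >= radius:
            violations += 1
    return violations / trials


rate = empirical_violation_rate()
print(f"empirical violation rate: {rate:.4f} (target delta = 0.05)")
```

Under these assumptions the empirical violation rate should stay below the target δ. A union bound over arms and rounds then amounts to calling `azuma_radius` with a smaller per-event δ.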
Note that is positive definite. We define as the weighted inner product. According to the Cauchy-Schwarz inequality, we have
(26) 
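The weighted Cauchy-Schwarz step can be checked numerically. The sketch below assumes a positive definite matrix, here called `V` and built as a ridge-style design matrix from hypothetical context vectors, and verifies the standard inequality |⟨x, θ⟩| ≤ ‖x‖_{V⁻¹} ‖θ‖_V. The symbols `V`, `x`, `theta`, the dimension, and the regularizer `lam` are illustrative choices rather than the paper's notation.

```python
import numpy as np


def weighted_norm(v, M):
    """Return ||v||_M = sqrt(v^T M v) for a positive definite matrix M."""
    return float(np.sqrt(v @ M @ v))


def cauchy_schwarz_sides(d=5, n_contexts=50, lam=1.0, seed=0):
    """Return (|<x, theta>|, ||x||_{V^-1} * ||theta||_V) for random inputs."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_contexts, d))   # hypothetical context vectors
    V = lam * np.eye(d) + X.T @ X          # positive definite since lam > 0
    V_inv = np.linalg.inv(V)
    x = rng.normal(size=d)
    theta = rng.normal(size=d)
    lhs = abs(float(x @ theta))
    rhs = weighted_norm(x, V_inv) * weighted_norm(theta, V)
    return lhs, rhs


lhs, rhs = cauchy_schwarz_sides()
print(f"|<x, theta>| = {lhs:.4f} <= {rhs:.4f}")
```

The inequality holds for any positive definite `V` because ⟨x, θ⟩ = ⟨V^{-1/2} x, V^{1/2} θ⟩, to which the ordinary Cauchy-Schwarz inequality applies.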
Assume that . From Theorem 1 and Lemma 11 in [1], we know that for any , with probability at least
(27)  
for all arms and all ,
For the third part of the right-hand side of (29),
(28) 