Data Poisoning Attacks in Contextual Bandits

08/17/2018 ∙ by Yuzhe Ma, et al. ∙ Google ∙ University of Wisconsin-Madison

We study offline data poisoning attacks in contextual bandits, a class of reinforcement learning problems with important applications in online recommendation and adaptive medical treatment, among others. We provide a general attack framework based on convex optimization and show that by slightly manipulating rewards in the data, an attacker can force the bandit algorithm to pull a target arm for a target contextual vector. The target arm and target contextual vector are both chosen by the attacker. That is, the attacker can hijack the behavior of a contextual bandit. We also investigate the feasibility and the side effects of such attacks, and identify future directions for defense. Experiments on both synthetic and real-world data demonstrate the efficiency of the attack algorithm.




1 Introduction

As an important step toward trustworthy AI, adversarial learning studies robustness of machine learning systems against malicious attacks [7, 10]. Training set poisoning is a type of attack where the adversary can manipulate the training data such that a machine learning algorithm trained on the poisoned data would produce a defective model. The defective model is often similar to a good model, but affords the adversary certain nefarious leverages [3, 5, 9, 12, 14, 15, 17]. Understanding training set poisoning is essential to developing defense mechanisms.

Recent studies on training set poisoning attacks have focused heavily on supervised learning. There has been little study on poisoning sequential decision making algorithms, even though they are widely employed in the real world. In this paper, we aim to fill in the gap by studying training set poisoning against contextual bandits. Contextual bandits are extensions of multi-armed bandits with side information and have seen wide applications in industry including news recommendation [13], online advertising [6], medical treatment allocation [11], and also promotion of users’ well-being [8].

Let us take news recommendation as a running example for poisoning against contextual bandits. A news website has articles (i.e., arms). It runs an adaptive article recommendation algorithm (the contextual bandit algorithm) to learn a policy in the backend. Every time a user (represented by a context vector) visits the website, the website displays an article that it thinks is most likely to interest the user based on the historical record of all users. Then the website receives a unit reward if the user clicks through the displayed article, and receives no reward otherwise. Usually the website keeps serving users throughout the day and updates its article selection policy periodically (say, during the nights or every few hours). This provides an opportunity for an attacker to perform offline data poisoning attacks, e.g. the attacker can sneak into the website backend at night before the policy is updated, and poison the rewards collected during the daytime. The website unknowingly updates its policy with the poisoned data. On the next day it behaves as the attacker wanted.

More generally, we study adversarial attacks in contextual bandits where the attacker poisons historical rewards in order to force the bandit to pull a target arm under a target context. One can view this attack as a form of offline reward shaping [16], but it is adversarial reward shaping. Our main contribution is an optimization-based attack framework for this attack setting. We also study the feasibility and side effects of the attack. We show on both synthetic and real-world data that the attack is effective. This exposes a security threat in AI systems that involve contextual bandits.

2 Review of Contextual Bandit

This section reviews contextual bandits, which will be the victim of the attack in this paper. A contextual bandit is an abstraction of many real-world decision making problems such as product recommendation and online advertising. Consider for example a news website which strives to recommend the most interesting news articles personalized for individual users. Every time a user visits the website, the website observes certain contextual information that describes the user such as age, gender, location, past news consumption patterns, etc. The website also has a pool of candidate news articles, one of which will be recommended and shown to the user. If the recommended article is interesting, the user may click on it; otherwise, the user may click on other items on the page or navigate to another page. The click probability here depends on both the user (via the context) and the recommended article. Such a dependency can be learned based on click logs and used for better recommendation for future users.

An important aspect of the problem is that the click feedback is observed only for the recommended article, not for others. In other words, the decision (choosing which article to show to a user) is irrevocable; it is impractical to force the user to revisit the webpage so as to recommend a different article. As a result, the feedback data being collected is necessarily biased towards the current recommendation algorithm being employed by the website, raising the need for balancing exploration and exploitation when choosing arms [13]. This is in stark contrast to a typical prediction task solved by supervised learning where predictions do not affect the data collection.

Formally, a contextual bandit has a set $\mathcal{X} \subseteq \mathbb{R}^d$ of contexts and a set of $K$ arms. A contextual bandit algorithm proceeds in rounds $t = 1, 2, \dots, T$. At round $t$, the algorithm observes a context vector $x_t \in \mathcal{X}$, chooses to pull an arm $a_t \in \{1, \dots, K\}$, and observes a reward $r_t$. The goal of the algorithm is to maximize the total reward garnered over the $T$ rounds. In the news recommendation example above, it is natural to define $r_t = 1$ if the user clicks on the article and $r_t = 0$ otherwise, so that maximizing clicks is equivalent to maximizing the click-through rate, a critical business metric in online recommender systems.

In this work, we focus on the most popular and well-studied setting called linear bandits, where the expected reward is a linear map of the context vector. Specifically, we assume each arm $a$ is associated with an unknown vector $\theta_a \in \mathbb{R}^d$ with $\|\theta_a\|_2 \le S$, so that for every $t$:

$$r_t = x_t^\top \theta_{a_t} + \eta_t, \tag{1}$$

where $\eta_t$ is a $\sigma$-subGaussian noise. For simplicity, we assume $\eta_t$ is unbounded and thus the reward can take any value in $\mathbb{R}$.

Most contextual bandit algorithms adopt the optimism-in-face-of-uncertainty (OFU) principle for efficient exploration. The OFU principle constructs an Upper Confidence Bound (UCB) for the mean reward of each arm based on historical data and then selects the arm with the highest UCB at each time step [4, 1]. In round $t$, the historical data consists of the context, action, reward triples $(x_s, a_s, r_s)$ from the previous $t-1$ rounds. It is useful to split the historical data so that the feedback from the same arm is pooled together. Define $[K] = \{1, \dots, K\}$. Let $m_a$ be the number of times arm $a$ was pulled up to time $t-1$. This implies that $\sum_{a \in [K]} m_a = t-1$. For each $a \in [K]$, let $X_a \in \mathbb{R}^{m_a \times d}$ be the design matrix for the rounds where arm $a$ was pulled, so that each row of $X_a$ is a previous context. Similarly, let $y_a \in \mathbb{R}^{m_a}$ be the corresponding reward (column) vector.

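A UCB-style algorithm first forms a point estimate of $\theta_a$ by ridge regression

$$\hat\theta_a = (X_a^\top X_a + \lambda I)^{-1} X_a^\top y_a, \tag{2}$$

where $\lambda > 0$ is a regularization parameter. At round $t$, the algorithm observes the context $x_t$ and then selects the arm with the highest UCB:

$$a_t \in \arg\max_{a \in [K]} \left\{ x_t^\top \hat\theta_a + \alpha_a \|x_t\|_{V_a^{-1}} \right\}, \tag{3}$$

where $\|x\|_{V} = \sqrt{x^\top V x}$ is the Mahalanobis norm and $V_a = X_a^\top X_a + \lambda I$. Intuitively, for a less frequently chosen arm $a$, the second term above tends to be large, thus encouraging exploration. The exploration parameter $\alpha_a$ is algorithm-specific; LinUCB [13] and OFUL [1], for example, each prescribe their own $\alpha_a$ in terms of a confidence parameter $\delta$. Here, we assume $\alpha_a$ may depend on input parameters like $\delta$ and on observed data up to round $t-1$, but not on $x_t$.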
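The ridge estimate and UCB arm selection described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names (`ridge_estimate`, `ucb_scores`, `select_arm`), the tiny two-arm data set, and the constant exploration parameters are assumptions made for the example, not part of the paper.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Return the ridge estimate (X^T X + lam I)^{-1} X^T y and V = X^T X + lam I."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    theta_hat = np.linalg.solve(V, X.T @ y)
    return theta_hat, V

def ucb_scores(x, data, lam, alphas):
    """UCB of every arm for context x, where data[a] = (X_a, y_a)."""
    scores = []
    for (X, y), alpha in zip(data, alphas):
        theta_hat, V = ridge_estimate(X, y, lam)
        width = np.sqrt(x @ np.linalg.solve(V, x))  # Mahalanobis norm ||x||_{V^{-1}}
        scores.append(x @ theta_hat + alpha * width)
    return np.array(scores)

def select_arm(x, data, lam, alphas):
    """Index of the arm with the highest UCB."""
    return int(np.argmax(ucb_scores(x, data, lam, alphas)))

# Hypothetical history: arm 0 was pulled twice, arm 1 once.
data = [
    (np.array([[1.0, 0.0], [1.0, 0.0]]), np.array([1.0, 1.0])),
    (np.array([[0.0, 1.0]]), np.array([0.2])),
]
x = np.array([1.0, 0.0])
scores = ucb_scores(x, data, lam=1.0, alphas=[1.0, 1.0])
chosen = select_arm(x, data, lam=1.0, alphas=[1.0, 1.0])
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of $V_a$ explicitly, which is a common numerical choice rather than something the paper prescribes.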

In Algorithm 1, we summarize the contextual bandit algorithm. While the bandit algorithm updates its estimates in every round (step 3), in practice due to various considerations such updates often happen in mini-batches, e.g., several times an hour, or during the nights when fewer users visit the website [13, 2]. Between these consecutive updates, the bandit algorithm follows a fixed policy obtained from the last update.

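1:  Parameters: confidence $\delta$, regularizer $\lambda$, UCB function $\alpha$.
2:  for $t = 1, 2, \dots$ do
3:     Receive context $x_t$, estimate $\hat\theta_a$ for every arm $a$ with (2).
4:     Pull arm $a_t$ selected by (3).
5:     World generates reward $r_t$.
6:     Append $x_t$ and $r_t$ to $X_{a_t}$ and $y_{a_t}$, respectively.
7:  end for
Algorithm 1 Contextual bandit algorithm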

3 Attack Algorithm in Contextual Bandit

We now introduce an attacker with the following attack goal:

Attack goal: On a particular attack target context $x^*$, force the bandit algorithm to pull an attack target arm $a^*$.

For example, the attacker may want to manipulate the news service so that a particular article is shown to users from certain political bases. The attack is aimed at the current round $t$, or more generally the whole period when the arm-selection policy is fixed. Any suboptimal arm can be the target arm. For concreteness, in our experiments the attacker always picks the worst arm as the target arm. This is defined in the sense of the worst UCB, namely replacing argmax with argmin in (3), resulting in the target arm in (21).

We assume the attacker has full knowledge of the bandit algorithm and has access to all historical data. The attacker has the power to poison the historical reward vectors $y_a$, $a \in [K]$. (Footnote 1: In this paper we restrict the poisoning to modifying rewards for ease of exposition. More generally, the attacker can add, remove, or modify both the rewards and the context vectors. Our optimization-based attack framework can be generalized to such stronger attacks, though the optimization could become combinatorial.) Specifically, the attacker can make arbitrary modifications $\Delta_a \in \mathbb{R}^{m_a}$, so that the reward vector for arm $a$ becomes $y_a + \Delta_a$. After the poisoning attack, the ridge regression performed by the bandit algorithm yields a different solution:

$$\hat\theta_a = (X_a^\top X_a + \lambda I)^{-1} X_a^\top (y_a + \Delta_a). \tag{4}$$
Because such attacks happen on historical rewards in between bandit algorithm updates, we call it offline.

Now we can formally define the attack goal.

Definition 1 (Weak attack)

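A target context $x^*$ is called weakly attacked into pulling target arm $a^*$ if after attack the following inequalities are satisfied:

$$x^{*\top}\hat\theta_{a^*} + \alpha_{a^*}\|x^*\|_{V_{a^*}^{-1}} > x^{*\top}\hat\theta_a + \alpha_a\|x^*\|_{V_a^{-1}}, \quad \forall a \neq a^*, \tag{5}$$

where $\hat\theta_a$ denotes the post-attack estimate and $V_a = X_a^\top X_a + \lambda I$. In other words, the algorithm is manipulated into choosing $a^*$ for context $x^*$.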

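To avoid being detected, the attacker hopes to make the poisoning $\Delta_a$, $a \in [K]$, as small as possible. We measure the magnitude of the attack by the squared $\ell_2$-norm $\sum_{a \in [K]} \|\Delta_a\|_2^2$. (Footnote 2: The choice of norm is application dependent, see e.g., [15, Figure 3]. Any norm works for the attack formulation.) We therefore formulate the attack as the following optimization problem:

$$\begin{aligned} \min_{\Delta_a : a \in [K]} \quad & \sum_{a \in [K]} \|\Delta_a\|_2^2 \\ \text{s.t.} \quad & \hat\theta_a = (X_a^\top X_a + \lambda I)^{-1} X_a^\top (y_a + \Delta_a), \quad \forall a \in [K], \\ & x^{*\top}\hat\theta_{a^*} + \alpha_{a^*}\|x^*\|_{V_{a^*}^{-1}} > x^{*\top}\hat\theta_a + \alpha_a\|x^*\|_{V_a^{-1}}, \quad \forall a \neq a^*. \end{aligned} \tag{6}$$

The weak attack above ensures that, given the target context $x^*$, the bandit algorithm is forced to pull arm $a^*$ instead of any other arm. Unfortunately, the strict inequality constraints do not result in a closed convex set. To formulate the attack as a convex optimization problem, we introduce a stronger notion of attack that implies weak attack: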

Definition 2 (Strong attack)

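A target context $x^*$ is called $\epsilon$-strongly attacked into pulling target arm $a^*$, for some $\epsilon > 0$, if after attack the following holds:

$$x^{*\top}\hat\theta_{a^*} + \alpha_{a^*}\|x^*\|_{V_{a^*}^{-1}} \ge \epsilon + x^{*\top}\hat\theta_a + \alpha_a\|x^*\|_{V_a^{-1}}, \quad \forall a \neq a^*. \tag{7}$$

This is essentially a large margin condition which requires the UCB of $a^*$ to be at least $\epsilon$ greater than the UCB of any other arm $a$. The margin parameter $\epsilon$ is chosen by the attacker. We achieve strong attack with the following optimization problem:

$$\begin{aligned} \min_{\Delta_a : a \in [K]} \quad & \sum_{a \in [K]} \|\Delta_a\|_2^2 \\ \text{s.t.} \quad & \hat\theta_a = (X_a^\top X_a + \lambda I)^{-1} X_a^\top (y_a + \Delta_a), \quad \forall a \in [K], \\ & x^{*\top}\hat\theta_{a^*} + \alpha_{a^*}\|x^*\|_{V_{a^*}^{-1}} \ge \epsilon + x^{*\top}\hat\theta_a + \alpha_a\|x^*\|_{V_a^{-1}}, \quad \forall a \neq a^*. \end{aligned} \tag{8}$$

The optimization problem above is a quadratic program with linear constraints in $\{\Delta_a\}_{a \in [K]}$, since $x^{*\top}\hat\theta_a$ is linear in $\Delta_a$. We summarize the attack in Algorithm 2. In the next section we discuss when the algorithm is feasible.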

1:  Input: victim contextual bandit (Algorithm 1), target context $x^*$, target arm $a^*$, attack margin $\epsilon$, historical data $\{(X_a, y_a)\}_{a \in [K]}$.
2:  Solve (8) for $\Delta_a$, $a \in [K]$.
3:  If a solution is found, poison $y_a \leftarrow y_a + \Delta_a$ for every $a$; otherwise return infeasible.
Algorithm 2 Data Poisoning Attack in Contextual Bandit
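To make the attack concrete, the sketch below solves the strong-attack program in the special case of two arms, where there is a single linear constraint and the minimum-norm poisoning has a closed form (a projection onto a half-space). With more arms a general quadratic programming solver would be needed instead; that case is not shown. The function names, the toy data, and the constant exploration parameters are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def arm_terms(X, y, x_star, lam, alpha):
    """Return (w, ucb) with w = X V^{-1} x*, so that x*^T theta_hat = w^T y."""
    V = X.T @ X + lam * np.eye(X.shape[1])
    v = np.linalg.solve(V, x_star)            # V^{-1} x*
    w = X @ v
    ucb = w @ y + alpha * np.sqrt(x_star @ v)
    return w, ucb

def two_arm_attack(target, other, x_star, lam, alphas, eps):
    """Minimum-norm reward poisoning for K = 2 arms.

    target, other: (X, y) pairs; alphas = (alpha_target, alpha_other).
    Returns (delta_target, delta_other), or None if the attack is infeasible.
    """
    (Xt, yt), (Xo, yo) = target, other
    wt, ucb_t = arm_terms(Xt, yt, x_star, lam, alphas[0])
    wo, ucb_o = arm_terms(Xo, yo, x_star, lam, alphas[1])
    gap = eps + ucb_o - ucb_t                 # how far the margin constraint is violated
    if gap <= 0:
        return np.zeros_like(yt), np.zeros_like(yo)
    g = np.concatenate([wt, -wo])             # constraint reads g^T delta >= gap
    norm_sq = g @ g
    if norm_sq < 1e-12:
        return None                           # the rewards cannot move either UCB
    delta = gap * g / norm_sq                 # projection onto the half-space
    return delta[:len(yt)], delta[len(yt):]

# Hypothetical history; the attacker targets the arm that currently loses.
target = (np.array([[0.0, 1.0]]), np.array([0.2]))
other = (np.array([[1.0, 0.0], [1.0, 0.0]]), np.array([1.0, 1.0]))
x_star = np.array([1.0, 0.0])
d_t, d_o = two_arm_attack(target, other, x_star, lam=1.0, alphas=(1.0, 1.0), eps=0.001)
```

In this toy history the target arm never saw a context aligned with `x_star`, so its rewards cannot help and the whole poisoning falls on the other arm's rewards.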

4 Feasibility of Attack

While one can always write down the training set attack algorithm as optimization (8), there is no guarantee that such an attack is feasible. In particular, the inequality constraints may result in an empty set. One may naturally ask: are there context vectors that simply cannot be strongly attacked? (Footnote 3: Even if some context cannot be strongly attacked, the attacker might be able to weakly attack it. Weak attack is sufficient for the attacker to force an arm pull of $a^*$. However, as $\epsilon \to 0$ strong attack approaches weak attack. Thus we only need to characterize strong attacks.) In this section we present a full characterization of the feasibility question for strong attack. As we will see, attack feasibility depends on the original training data. Understanding the answer helps us to gauge the difficulty of poisoning, and may aid the design of defenses.

The main result of this section is the following theorem that characterizes a sufficient and necessary condition for the strong attack to be feasible.

Theorem 4.1

A context $x^*$ cannot be $\epsilon$-strongly attacked into pulling $a^*$ if and only if there exists $a \neq a^*$ such that the following two conditions are both satisfied:

(i) $x^* \in \mathrm{Null}(X_{a^*}) \cap \mathrm{Null}(X_a)$, and

(ii) $\alpha_{a^*}\|x^*\|_{V_{a^*}^{-1}} < \epsilon + \alpha_a\|x^*\|_{V_a^{-1}}$.

Before presenting the proof, we first provide intuition. The key idea is that a context $x^*$ cannot be strongly attacked if some non-target arm $a$ is always better than $a^*$ for $x^*$ under any attack. This can happen because there are two terms in the arm selection criterion (3) while the attack can affect the first term only. It turns out that under condition (i) the first term becomes zero. If there exists a non-target arm that has a larger second term than that of the target arm (condition (ii)), then no attack can force the bandit algorithm to choose the target arm.

We present an empirical study on the feasibility of attack in Section 6.3.
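A feasibility check along these lines can be sketched as follows. The sketch assumes the two conditions take the form suggested by the strong-attack constraint: (i) the target context lies in the null space of both the target arm's and the blocking arm's design matrix, and (ii) the target arm's confidence width is smaller than the blocking arm's width plus the margin. The function names and the numerical tolerance `tol` are illustrative assumptions.

```python
import numpy as np

def conf_width(X, x, lam, alpha):
    """alpha * ||x||_{V^{-1}} with V = X^T X + lam I."""
    V = X.T @ X + lam * np.eye(X.shape[1])
    return alpha * np.sqrt(x @ np.linalg.solve(V, x))

def blocking_arms(designs, x_star, target, lam, alphas, eps, tol=1e-10):
    """Non-target arms for which conditions (i) and (ii) both hold.

    designs[a] is the design matrix X_a. An empty list means the strong
    attack on x_star is feasible under the assumed reading of the theorem.
    """
    X_t = designs[target]
    if np.linalg.norm(X_t @ x_star) > tol:
        return []  # the target arm's estimate at x_star can be raised freely
    width_t = conf_width(X_t, x_star, lam, alphas[target])
    blockers = []
    for a, X_a in enumerate(designs):
        if a == target:
            continue
        in_null = np.linalg.norm(X_a @ x_star) <= tol          # condition (i)
        width_a = conf_width(X_a, x_star, lam, alphas[a])
        if in_null and width_t < eps + width_a:                # condition (ii)
            blockers.append(a)
    return blockers

# Hypothetical history in which no arm ever saw the second coordinate.
X0 = np.array([[1.0, 0.0]])
X1 = np.array([[1.0, 0.0], [2.0, 0.0]])
blocked = blocking_arms([X0, X1], np.array([0.0, 1.0]), 0, 1.0, [1.0, 1.0], 0.001)
free = blocking_arms([X0, X1], np.array([1.0, 0.0]), 0, 1.0, [1.0, 1.0], 0.001)
```

Here the context `[0, 1]` is orthogonal to every historical context, so no reward change can move either arm's estimate at that context, and arm 1 blocks the attack.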

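Lemma 1

$\mathrm{Null}(X_a) = \mathrm{Null}(X_a V_a^{-1})$, where $V_a = X_a^\top X_a + \lambda I$.

Proof. First, we prove $\mathrm{Null}(X_a) \subseteq \mathrm{Null}(X_a V_a^{-1})$. Note that for any $x \in \mathrm{Null}(X_a)$,

$$V_a x = X_a^\top X_a x + \lambda x = \lambda x.$$

Therefore, we have

$$X_a V_a^{-1} x = \tfrac{1}{\lambda} X_a x = 0.$$

Now we show the other direction. Note that for any $x \in \mathrm{Null}(X_a V_a^{-1})$, writing $z = V_a^{-1} x$ gives $X_a z = 0$ and

$$x = V_a z = X_a^\top X_a z + \lambda z = \lambda z,$$

which implies $X_a x = \lambda X_a z = 0$.∎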

Proof (Theorem 4.1)

($\Leftarrow$) According to Lemma 1, condition (i) implies


Combined with (ii) we have for any and ,


Thus, cannot be attacked.

($\Rightarrow$) This is equivalent to proving that if no arm $a \neq a^*$ satisfies both (i) and (ii), then $x^*$ can be attacked. To show that $x^*$ can be attacked, it suffices to find a feasible solution for the optimization problem (8).

If , then or . Assume (similar for the case ), then . Let . For any , arbitrarily fix some , then define


Let , where . Thus,


Therefore, we have for all that


which means can be attacked.

If , simply letting and suffices, concluding the proof.∎

5 Side Effects of Attack

While the previous section characterized contexts that cannot be strongly attacked, this section asks an opposite question: suppose the attacker was able to strongly attack some $x^*$ by solving (8), what other contexts are affected by the attack? For example, there might exist some context $x \neq x^*$ whose pre-attack chosen arm is $a$, but whose post-attack chosen arm becomes some $a' \neq a$. The side effects can be construed in two ways: on one hand the attack automatically influences more contexts than just $x^*$; on the other hand they make it harder for the attacker to conceal an attack. The latter may be utilized to facilitate detection by a defender. In this section, we study the side effects of the attack and provide insights into future research directions on defense.

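The side effect is quantified by the fraction of contexts in the context space such that the chosen arm is changed by the attacker. Specifically, let $\mathcal{X}$ be the context space and $\nu$ be a probability measure over $\mathcal{X}$. Let $a_x$ and $a'_x$ be the pre-attack and post-attack chosen arm of a context $x \in \mathcal{X}$. Then the side effect fraction is defined as:

$$s = \mathbb{E}_{x \sim \nu}\big[\mathbb{1}[a_x \neq a'_x]\big].$$

One can compute an empirical side effect fraction as follows. First sample $m$ contexts $x_1, \dots, x_m$ from $\nu$, and then let $\hat{s} = \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}[a_{x_i} \neq a'_{x_i}]$. It is easy to show using the Chernoff bound that $|\hat{s} - s|$ decays to $0$ at the rate of $O(1/\sqrt{m})$.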
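The empirical side effect fraction is a plain Monte Carlo average, sketched below for contexts drawn uniformly from the unit sphere. The sketch relies on the fact that reward poisoning leaves the design matrices untouched, so the same inverse matrices are used before and after the attack. The function names and the toy parameters are assumptions for illustration only.

```python
import numpy as np

def chosen_arms(contexts, thetas, V_invs, alphas):
    """Arm with the highest UCB for every row of `contexts`."""
    scores = []
    for theta, V_inv, alpha in zip(thetas, V_invs, alphas):
        mean = contexts @ theta
        # Row-wise quadratic form x^T V^{-1} x
        width = np.sqrt(np.einsum("ij,jk,ik->i", contexts, V_inv, contexts))
        scores.append(mean + alpha * width)
    return np.argmax(np.stack(scores, axis=1), axis=1)

def side_effect_fraction(pre_thetas, post_thetas, V_invs, alphas, m, d, rng):
    """Fraction of m sampled contexts whose chosen arm changes under the attack."""
    z = rng.standard_normal((m, d))
    contexts = z / np.linalg.norm(z, axis=1, keepdims=True)  # uniform on the sphere
    pre = chosen_arms(contexts, pre_thetas, V_invs, alphas)
    post = chosen_arms(contexts, post_thetas, V_invs, alphas)
    return float(np.mean(pre != post))

# Hypothetical pre- and post-attack estimates for two arms in two dimensions.
rng = np.random.default_rng(0)
pre = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
post = [np.array([0.2, 0.0]), np.array([0.0, 1.0])]
V_invs = [np.eye(2), np.eye(2)]
frac = side_effect_fraction(pre, post, V_invs, [1.0, 1.0], m=1000, d=2, rng=rng)
```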

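We now give some properties of the side effect. Specifically, we first show that if $x$ is affected by the attack, $cx$ is also affected by the attack for any $c > 0$.

Proposition 1

If a context $x$ satisfies $a_x \neq a'_x$, then $a_{cx} \neq a'_{cx}$ for any $c > 0$, where $a_x$ and $a'_x$ are the pre-attack and post-attack chosen arm of $x$. Moreover, $a'_{cx} = a'_x$, i.e., the post-attack chosen arms for $x$ and $cx$ are exactly the same.

Proof. First, for any $a \in [K]$, define

$$f_a(x) = x^\top \hat\theta_a + \alpha_a \|x\|_{V_a^{-1}},$$

where $\hat\theta_a$ is the post-attack estimate (4). Note that $a'_x$ is the best arm after attack, thus $f_{a'_x}(x) \ge f_a(x)$, $\forall a$. Therefore, for any $c > 0$, we have

$$f_{a'_x}(cx) = c\, f_{a'_x}(x) \ge c\, f_a(x) = f_a(cx), \quad \forall a,$$

which implies that $a'_{cx} = a'_x$. The same argument may be used to show $a_{cx} = a_x$. Therefore, $a_{cx} = a_x \neq a'_x = a'_{cx}$.∎

Proposition 1 shows that if a context $x$ has a side effect, all contexts on the open ray $\{cx : c > 0\}$ also have the same side effect.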
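The ray property can be checked numerically: because the UCB score scales linearly with a positive multiple of the context (as long as the exploration parameter does not depend on the context), the chosen arm is unchanged along the ray. The snippet below is a small sanity check on randomly generated, hypothetical parameters; it is not taken from the paper's experiments.

```python
import numpy as np

def ucb_arm(x, thetas, V_invs, alphas):
    """Arm with the highest UCB score x^T theta + alpha * ||x||_{V^{-1}}."""
    scores = [
        x @ theta + alpha * np.sqrt(x @ V_inv @ x)
        for theta, V_inv, alpha in zip(thetas, V_invs, alphas)
    ]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
d, K = 3, 4                                   # hypothetical dimension and arm count
thetas = rng.standard_normal((K, d))
V_invs = []
for _ in range(K):
    X = rng.standard_normal((5, d))           # hypothetical historical contexts
    V_invs.append(np.linalg.inv(X.T @ X + 1.0 * np.eye(d)))
alphas = np.ones(K)

x = rng.standard_normal(d)
base = ucb_arm(x, thetas, V_invs, alphas)
# Scale factors are powers of two, so the scaling is exact in floating point.
same_on_ray = all(ucb_arm(c * x, thetas, V_invs, alphas) == base for c in (0.5, 2.0, 64.0))
```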

Proposition 2

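If a context $x^*$ is $\epsilon$-strongly attacked, then $cx^*$ is also $\epsilon$-strongly attacked for any $c \ge 1$.

Proof. First, for any $a \in [K]$, define

$$f_a(x) = x^\top \hat\theta_a + \alpha_a \|x\|_{V_a^{-1}}.$$

Since $x^*$ is strongly attacked, we have $f_{a^*}(x^*) \ge \epsilon + f_a(x^*)$, $\forall a \neq a^*$. Therefore $f_{a^*}(cx^*) = c\, f_{a^*}(x^*) \ge c\,\epsilon + c\, f_a(x^*) \ge \epsilon + f_a(cx^*)$, which shows that $cx^*$ is also strongly attacked.∎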

The above propositions are weak in that they do not directly quantify the side effect fraction. They only tell us that when there is a side effect, the affected contexts form a collection of rays. In the experiment section we empirically study the side effect fraction. Further theoretical understanding of the side effect is left as future work.

6 Experiments

Our proposed attack algorithm works for any contextual bandit algorithm taking the form (3). Throughout the experiments, we choose to attack the OFUL algorithm that has a tight regret bound and can be efficiently implemented.

6.1 Attack Effectiveness and Effort: Toy Experiment

To study the effectiveness of the attack, we consider the following toy experiment. The bandit has $K$ arms, and each arm $a$ has a payoff parameter $\theta_a \in \mathbb{R}^d$, distributed uniformly on the $d$-dimensional sphere, denoted $\mathbb{S}^{d-1}$. To generate $\theta_a$, we first draw $\tilde\theta_a$ from a $d$-dimensional standard Gaussian distribution, and then normalize: $\theta_a = \tilde\theta_a / \|\tilde\theta_a\|_2$.
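The sampling step just described, drawing a standard Gaussian vector and normalizing it, can be sketched as below. The function name and the particular values of the number of arms and the dimension are placeholders chosen for the example.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere in R^d."""
    z = rng.standard_normal((n, d))            # isotropic Gaussian directions
    return z / np.linalg.norm(z, axis=1, keepdims=True)

rng = np.random.default_rng(0)
thetas = sample_sphere(5, 3, rng)              # placeholder values: 5 arms, d = 3
norms = np.linalg.norm(thetas, axis=1)
```

Normalizing a Gaussian vector yields a uniform direction because the standard Gaussian density is rotationally invariant.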

Next, we construct the historical data as follows. We generate historical context vectors again uniformly on . For each historical context , we pretend the world generates all rewards from the arms according to (1), where we set the noise level to . We then choose an arm randomly from a multinomial distribution: , where . This forms one data point , and we repeat it for all points. We then group the historical data to form the appropriate matrices for every . Note that the historical data generated in this way is off-policy with respect to the bandit algorithm. The regularization and confidence parameters are and , respectively.

In each attack trial, we draw a single target context uniformly from . Without attack, the bandit would have chosen the arm with the highest UCB based on historical data (3). To illustrate the attack, we will do the opposite and set the attack target arm as the one with the smallest UCB instead:


where is the UCB parameter of the OFUL algorithm [1]. We set the strong attack margin as . We then run the attack on with Algorithm 2.

We run attack trials. In each trial the arm parameters, historical data, and the target context are regenerated. We make two main observations:

  1. The attacker is effective. All -strongly attacks are successful.

  2. The attacker’s poisoning is small. The total poisoning can be measured by in each attack trial. However, this quantity depends on the scale of the original pre-attack rewards . It is more convenient to look at the poisoning effort ratio:


    Figure 1 shows the histogram for the poisoning effort ratio of the attack trials. The ratio tends to be small, with a median of , which demonstrates that the attacker needs to only manipulate about of the rewards.

These two observations indicate that poisoning attack in contextual bandit is easy to carry out.

Figure 1: Histogram of poisoning effort ratio in the toy experiment

We now analyze a single, representative attack trial to gain deeper insight into the attack strategy. In this trial, the UCBs of the arms without attack are

That is, arm 3 would have been chosen. As mentioned earlier, is chosen to be the target arm as it has the smallest pre-attack UCB. After attack, the UCBs of all arms become:

The attacker successfully forced the bandit to choose arm 2. It did so by poisoning the historical data to make arm 2 look better and arms 3 and 5 look worse. It left arms 1 and 4 unchanged.

Figure 7 shows the attack, where each panel shows the historical rewards for the rounds in which that arm was chosen. We show the original rewards ($y_a$, blue circle) and post-attack rewards ($y_a + \Delta_a$, red cross) for all historical points where arm $a$ was chosen. Intuitively, to decrease the UCB of arm $a$ the attacker should reduce the reward if the historical context $x$ is “similar” to $x^*$, and boost the reward otherwise. To see this, we sort the historical points by the inner product $x^\top x^*$ in ascending order. As shown in Figure 7(c) and (e), the attacker gave the illusion that these arms are not good for $x^*$ by reducing the rewards when $x^\top x^*$ is large. The attacker also increased the rewards when $x^\top x^*$ is very negative, which reinforces the illusion. In contrast, the attacker did the opposite on the target arm as shown in Figure 7(b).

Figure 7: Original reward and post-attack reward for each arm. Panels: (a) arm 1, (b) arm 2, (c) arm 3, (d) arm 4, (e) arm 5.
Figure 13: The reward poisoning for each arm. Panels: (a) arm 1, (b) arm 2, (c) arm 3, (d) arm 4, (e) arm 5.

6.2 Attack on Real Data: Yahoo! News Recommendation

To further demonstrate the effectiveness of the attack algorithm in real applications, we now test it on the Yahoo! Front Page Today Module User Click Log Dataset (R6A). The dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page during the first ten days in May 2009. Specifically, it contains about million user visits, where each user is represented as a -dimensional contextual vector. When a user arrives, the Yahoo! Webscope program selects an article (an arm) from a candidate article pool and displays it to the user. The system receives reward $1$ if the user clicks on the article and $0$ otherwise. Contextual information about users can be found in prior work [13].

To apply the attack algorithm, we require that the set of arms remain unchanged. However, the Yahoo! candidate article pool (i.e., the set of arms) varies as new articles are added and old ones are removed over time. Nonetheless, there are long periods of time where the set of arms is fixed. We restrict ourselves to such a stable time period for our experiment (specifically the period from 7:25 to 10:35 on May 1, 2009) in the Yahoo! data, which contains , user visits. During this period the bandit has fixed arms. We further split the time period such that the first user visits are used as the historical training data to be poisoned, and the remaining data points as the test data. The bandit learning algorithm uses regularization . The confidence parameter is . The subGaussian parameter is set to for binary rewards.

We simulate attacks on three target user context vectors: The most frequent user context vector , a middle user context vector , and the least frequent user context vector in the test data. These three user context vectors appeared , , and times, respectively, in the test data. Note that there are potentially many distinct real-world users that are mapped to the same user contextual vector, therefore the “user” in our experiment does not necessarily mean a real-world individual that appeared thousands of times.

We again choose as the target arm the worst arm on the target user as defined by (21). To determine the target arm, we first simulate the bandit algorithm on the original (pre-attack) training data, and then pick the arm with the smallest UCB for that user. For the three target users we consider, the target arms are , , and respectively. The attacker uses attack margin .

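Different from the toy example where the reward can be any value in $\mathbb{R}$, the reward in the Yahoo! dataset must be binary, corresponding to a click-or-not outcome of the recommendation. Therefore, the attacker must enforce $y_a + \Delta_a \in \{0, 1\}^{m_a}$. However, this results in a combinatorial problem. To preserve convexity, we instead relax the attacked reward into a box constraint: $y_a + \Delta_a \in [0, 1]^{m_a}$. We add these new constraints to (8) and solve the following optimization:

$$\begin{aligned} \min_{\Delta_a : a \in [K]} \quad & \sum_{a \in [K]} \|\Delta_a\|_2^2 \\ \text{s.t.} \quad & \hat\theta_a = (X_a^\top X_a + \lambda I)^{-1} X_a^\top (y_a + \Delta_a), \quad \forall a \in [K], \\ & x^{*\top}\hat\theta_{a^*} + \alpha_{a^*}\|x^*\|_{V_{a^*}^{-1}} \ge \epsilon + x^{*\top}\hat\theta_a + \alpha_a\|x^*\|_{V_a^{-1}}, \quad \forall a \neq a^*, \\ & y_a + \Delta_a \in [0, 1]^{m_a}, \quad \forall a \in [K]. \end{aligned} \tag{23}$$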
After the real-valued $\Delta_a$ is computed, the attacker performs rounding to turn $y_a + \Delta_a$ into $0$ or $1$. Specifically, the attacker thresholds $y_a + \Delta_a$ with a constant $\tau$, so that if an entry exceeds $\tau$, the post-attack reward is $1$; otherwise the post-attack reward is $0$. Note that the poisoned rewards now correspond to “reward flipping” from $0$ to $1$ or vice versa by the attacker. In our experiment, we let the attacker try out thresholds equally distributed in . The attacker examines different thresholds for two reasons. First, there is no guarantee that the thresholded solution still triggers the target arm pull, thus the attacker needs to check whether the selected arm for $x^*$ is $a^*$. If not, the corresponding threshold is inadmissible. Second, among those thresholds that indeed trigger the target arm pull, the attacker selects the one that minimizes the number of flipped rewards, which corresponds to the smallest poisoning effort in the binary reward case.
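The threshold search described above can be sketched as follows. The callback `pulls_target` is a hypothetical stand-in for re-running the bandit's arm selection on the rounded rewards; the threshold grid, the toy rewards, and the toy admissibility rule are likewise assumptions made for the example.

```python
import numpy as np

def round_rewards(relaxed, tau):
    """Threshold relaxed rewards in [0, 1] into binary rewards."""
    return (relaxed > tau).astype(float)

def best_threshold(y_orig, y_relaxed, pulls_target, thresholds):
    """Search thresholds for the admissible rounding with the fewest flips.

    pulls_target(y_bin) must return True when the bandit, retrained on the
    rounded rewards, still pulls the target arm (hypothetical callback).
    Returns (tau, rounded_rewards, n_flips), or None if no threshold works.
    """
    best = None
    for tau in thresholds:
        y_bin = round_rewards(y_relaxed, tau)
        if not pulls_target(y_bin):
            continue                            # inadmissible threshold
        flips = int(np.sum(y_bin != y_orig))
        if best is None or flips < best[2]:
            best = (float(tau), y_bin, flips)
    return best

# Toy stand-in: pretend the target arm is pulled once at least two rewards are 1.
y_orig = np.array([0.0, 0.0, 1.0, 0.0])
y_relaxed = np.array([0.85, 0.35, 1.0, 0.05])
result = best_threshold(
    y_orig, y_relaxed,
    pulls_target=lambda y: y.sum() >= 2,
    thresholds=np.linspace(0.1, 0.9, 9),
)
```

In a real attack the callback would recompute the post-attack UCBs on the target context, which is the check the text asks the attacker to perform for every threshold.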

In Table 1, we summarize the experimental results for attacking the three target users. Note that the attack is successful on all three target users. The best thresholds for , and are , , and , respectively. The number of flipped rewards is small compared to , which demonstrates that the attacker only needs to spend little cost in order to force the bandit to pull the target arm. Note that the poisoning effect ratio is relatively large. This is because most of the pre-attack rewards are 0, in which case the denominator in (22) is small.

                                          most frequent   medium frequent   least frequent
strong attack successful?                 True            True              True
number [percentage] of flipped rewards    []              []                []
poisoning effort ratio                    0.572           0.189             0.275
Table 1: Results of experiments on Yahoo! data

In Figure 17, we show the reward poisoning on the historical data against the three target users, respectively. In all three cases, only a few rewards of the target arm are flipped from $0$ to $1$ by the attacker while those of the other arms remain unchanged. Therefore, we only show the reward poisoning on historical data restricted to the target arm (namely on $y_{a^*}$). The and flipped rewards overlap in Fig. 17(a) and Fig. 17(c). Note that the contexts of those flipped rewards are highly correlated with $x^*$.

Figure 17: The reward poisoning on three target users. Panels: (a) most frequent user, (b) medium frequent user, (c) least frequent user.

6.3 Study on Feasibility

The attack feasibility depends on the historical contexts , the bandit algorithm-specific UCB parameter , the attack margin , the target arm , and the target context . To visualize the infeasible region of strong attack on context, we consider the following toy example.

The bandit has arms. The attacker’s target arm is , and the target context lies in . The historical context vectors are


The problem parameters are and . According to Theorem 4.1, any infeasible target context satisfies . Thus such must lie in the subspace spanned by the -axis and -axis. This allows us to show infeasible regions as 2D plots. In Figure 21(a), we show the infeasible regions. We distinguish the infeasible region due to each non-target arm by a different color. For example, the infeasible region due to arm 1 consists of all contexts on which the target arm can never be $\epsilon$-better than arm 1 regardless of the attack. Note that the infeasible region due to arm 2 is a line segment of finite length, while that due to arm 3 is the whole line. The shape of the infeasible region due to each non-target arm varies because the historical data differs and therefore the conditions in Theorem 4.1 characterize different shapes. Note that the origin satisfies the conditions in Theorem 4.1 and therefore is always infeasible.

One important observation is that, if the bandit algorithm is trained on more historical data, more context vectors can potentially be strongly attacked. Formally, as indicated by Theorem 4.1, as the null space of the historical context matrices shrinks, the infeasible region shrinks as well. To demonstrate this, in Figure 21(b) we add a context [0, 0, 0.5] to such that the historical contexts are:


Now that is reduced, the infeasibility region due to arm 1 shrinks from the circle in Figure 21(a) to a horizontal line segment in Figure 21(b). However, the infeasible region may not shrink to a subset of itself, as indicated by the line segment extending farther along its axis than the original circle does; thus the shrinking happens in the sense of being restricted to a lower-dimensional subspace.

Next we add a historical context to :

Then the infeasibility regions due to arm 1 and arm 2 both shrink to the origin while that due to arm 3 becomes a line segment, as shown in Figure 21(c).

(a) original data
(b) Context added to
(c) Context added to
Figure 21: Infeasible region due to each non-target arm.

In practice, historical data is often abundant so that , spans the whole space, and the only infeasible point is the origin. That is, the attacker can choose to attack essentially any context vector.

Another observation is that the infeasible region shrinks as the attack margin $\epsilon$ decreases, as shown in Figure 25. The historical data for each arm is the same as (24). The reason is that a smaller $\epsilon$ makes the constraints in (8) easier to satisfy and therefore more contexts are feasible. As $\epsilon \to 0$, the infeasible region converges to those contexts that cannot be weakly attacked, which in this example is the line in Figure 25(c). Note that the contexts that cannot be weakly attacked are those that make (6) infeasible. Therefore, we see that without abundant historical data, there will be some contexts that can never be strongly attacked even when $\epsilon \to 0$. Also note that the origin can never be strongly attacked by definition.

Figure 25: Infeasible region shrinks as attack margin decreases.

6.4 Study on Side Effects

We first give an intuitive illustration of the side effect in 2D space. The bandit has arms, where the arm parameters are . We generate historical data same as before with noise . The target context is uniformly sampled from . The bandit algorithm uses regularization weight and confidence parameter . Without attack, the UCB for the three arms are


Therefore without attack arm 3 would have been chosen. By our design choice, the target arm is . The attacker uses margin . After attack the UCBs of all arms become:


As shown in Figure 26, the attacker forces the post-attack parameter of the best arm to deviate from while making closer to . Note that the attacker could also change the norm of the parameter. Note that arm 2 is not attacked, thus and overlap. The side effect is denoted by the brown arcs on the circle, where the arms chosen for those contexts are changed by the attacker. The side effect fraction for this example is .

Figure 26: Side effect shown in 2D context space.

Now we design a toy experiment to study how the side effect depends on the number of arms and the problem dimension. The context space is the -dimensional sphere and is uniform on the sphere. The bandit has arms, where the arm parameters are sampled from . Same as before, we generate historical data with noise . The bandit algorithm uses regularization weight . The target context is sampled from . The attacker’s margin is and the target arm is the worst arm on the target context . We sample contexts from to evaluate .

In Figure 30, we fix and show a histogram of as the number of arm varies. Note that the attack affects about users. The median for the three panels are , , and respectively, which shows that the side effect does not grow with the number of arms.

Figure 30: side effect fraction as arm number increases.

In Figure 34, we fix and show the side effect as the dimension varies. The median for the three panels are , , and , respectively, which implies that in higher dimensional space, the side effect tends to be smaller.

Figure 34: side effect fraction as dimension increases.

As the dimension increases, the attack has less side effect. This exposes the hazard that in real-world applications where the problem dimension is high, the attack will be hard to detect from side effects.

We also study the side effect for the real data experiment. There we use the test users to evaluate the side effect. The side effect fraction for the three users are , , and , respectively. Note that the most frequent user and the least frequent user have a large side effect, which makes the attack easy to detect. In contrast, the side effect of the medium frequent user is extremely small. This implies that the attack can induce different level of side effect for different target users.

7 Conclusions and Future Work

We studied offline data poisoning attacks on contextual bandits. We proposed an optimization-based attack framework against contextual bandit algorithms. By manipulating the historical rewards, the attack can successfully force the bandit algorithm to pull a pre-specified arm for some target context. Experiments on both synthetic and real-world data demonstrate the effectiveness of the attack. This exposes a security concern in AI systems that involve contextual bandits.

There are several future directions that can be explored. For example, our current attack only targets a single context $x^*$. Future work can characterize how to target a set of contexts simultaneously, i.e., force the bandit algorithm to pull the target arm for all contexts in some target set. In the simplest case where the set contains finitely many contexts, one can just replicate the constraint in (8) for each context in the set. The situation is more complicated if the target set is infinite or just too large. Another interesting question is how to develop defense mechanisms to protect the bandit from being attacked. As indicated in this paper, the defender can rely on the side effect to sense the existence of attacks. Conversely, it is also an open question how the attacker might attempt to minimize its side effect during the attack, so that the chances of being detected are minimized. Finally, in this paper we restrict the ability of the attacker to manipulating only the historical rewards. However, there are other types of attacks such as poisoning the historical contexts, adding additional data points, removing existing data points, or combinations of the above. The problem could become non-convex or even combinatorial depending on the type of the attack; some of these settings have been studied under the name “machine teaching” [18, 19]. Future work needs to identify how to extend our current attack framework to more general settings.

Acknowledgment This work is supported in part by NSF 1545481, 1704117, 1623605, 1561512, and the MADLab AF Center of Excellence FA9550-18-1-0166.