Online Data Poisoning Attack

03/05/2019 ∙ by Xuezhou Zhang, et al. ∙ University of Wisconsin-Madison

We study data poisoning attacks in the online learning setting, where training items arrive one at a time and the adversary perturbs the current training item to manipulate present and future learning. In contrast, prior work on data poisoning attacks has focused either on batch learners in the offline setting, or on online learners with full knowledge of the whole training sequence. We show that the online poisoning attack can be formulated as stochastic optimal control, and provide several practical attack algorithms based on control and deep reinforcement learning. Extensive experiments demonstrate the effectiveness of the attacks.




1 Introduction

As machine learning systems are increasingly applied in real-world applications, the security of machine learning algorithms is receiving growing attention. Adversarial machine learning studies the vulnerability of machine learning systems under malicious attacks vorobeychik2018adversarial ; joseph2018adversarial ; zhu2018optimal . Formulating and understanding optimal attacks that might be carried out by an adversary is important, as it prepares us to manage the damage and helps us develop defenses. A particular line of work studies data poisoning attacks, in which the attacker aims to affect the learning process by contaminating the training data xiao2015support ; mei2015using ; burkard2017analysis ; chen2019optimal ; jun2018adversarial ; li2016data .

While there has been a long line of work on data poisoning attacks, it has focused almost exclusively on offline learning, where the victim machine learner performs batch learning on a training set biggio2012poisoning ; munoz2017towards ; xiao2015support ; mei2015using ; sen2018training ; chen2017targeted . Such attack settings have been criticized as unrealistic. For example, it may be practically hard for an attacker to modify training data saved on a company’s internal server. On the other hand, in applications such as product recommendation and e-commerce, user-generated data arrives in a streaming fashion. Such applications are particularly susceptible to adversarial attacks because it is easier for the attacker to perform data poisoning from the outside.

In this paper we present the first principled methods for finding the optimal data poisoning attacks against online learning. Our key insight is to formulate online poisoning attacks as a stochastic optimal control problem. We then propose two algorithms – one based on traditional model-based optimal control, and the other based on deep reinforcement learning – and show that they achieve near-oracle performance in synthetic and real-data experiments.

Figure 1 depicts the threat model we consider in this paper. We distinguish three agents: the environment, the online learning victim, and the attacker.

Figure 1: threat model diagram

The environment produces a data point $z_t \in \mathcal{Z}$ at time $t$, drawn from a time-invariant distribution $P$. For example, $z_t$ can be a feature-label pair $z_t := (x_t, y_t)$ in supervised learning or just the features $z_t := x_t$ in unsupervised learning. Without an attacker, the victim receives $z_t$ as is, and performs a learning update of its model:

$$\theta_{t+1} = f(\theta_t, z_t), \tag{1}$$

where $\theta_t$ is the victim’s learned parameter at time step $t$. For example, $f$ can be one gradient step $f(\theta_t, z_t) := \theta_t - \eta \nabla_\theta \ell(\theta_t, z_t)$ when the learner performs online gradient descent with loss function $\ell$ and step size $\eta$.
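As a concrete illustration of such a learner update, the following minimal Python sketch performs one online gradient descent step for a linear model. The squared loss, the function name `ogd_update`, and the default step size are assumptions chosen for illustration; they are not the learners evaluated later in the paper.

```python
def ogd_update(theta, z, eta=0.1):
    """One online gradient descent step on the squared loss 0.5 * (theta.x - y)^2.

    theta: list of floats (current model); z: pair (x, y) with x a list of floats.
    The squared loss is an illustrative assumption.
    """
    x, y = z
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    # The gradient of 0.5 * residual^2 with respect to theta is residual * x.
    return [t - eta * residual * xi for t, xi in zip(theta, x)]


# The victim consumes the stream one item at a time.
theta = [0.0, 0.0]
for z in [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]:
    theta = ogd_update(theta, z)
```

An attacker who sits between the environment and the victim would replace `z` by a perturbed item before it reaches `ogd_update`.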

The attacker sits in between the environment and the victim. In this paper, we assume a white-box attacker who has knowledge of the victim’s update formula $f$, the victim’s initial model $\theta_0$, the data $z_{0:t}$ generated by the environment so far, and optionally a “prehistorical” data set $z_{-N:-1}$. Importantly, however, at time $t$ the attacker does not have the clairvoyant knowledge of future data points $z_{t'}$ for $t' > t$.

The attacker can perform only one type of action: it can choose to manipulate the data point $z_t$ into a different $a_t \in \mathcal{Z}$. The attacker incurs a perturbation cost $g_{\text{per}}(z_t, a_t)$, for example $\|a_t - z_t\|_p$ under an appropriate norm. The victim then receives the (potentially manipulated) data point $a_t$ and proceeds with model update (1), i.e. $\theta_{t+1} = f(\theta_t, a_t)$.

The attacker’s goal is to force the victim’s learned model to satisfy certain properties while paying a small perturbation cost. This goal can be captured by a discounted cumulative cost on the attacker’s side:

$$J = \sum_{t=0}^{T} \gamma^t \, g(\theta_t, z_t, a_t), \tag{2}$$

where $\gamma \in (0, 1)$ is a discounting factor and the running cost function at every step is defined by

$$g(\theta_t, z_t, a_t) := \lambda \, g_{\text{att}}(\theta_{t+1}) + g_{\text{per}}(z_t, a_t). \tag{3}$$

Here $\lambda$ is a weight chosen by the attacker to balance the attack loss function $g_{\text{att}}$ and the perturbation cost $g_{\text{per}}$. The attack loss function is in fact concerned with $\theta_{t+1}$ rather than $\theta_t$ because $\theta_t$ is already known and fixed at time step $t$, and what the attacker can affect, through the adversarial perturbation, is $\theta_{t+1} = f(\theta_t, a_t)$. The attack loss function can encode a variety of different attack properties considered in the literature such as

  • targeted attack $g_{\text{att}}(\theta_{t+1}) = \|\theta_{t+1} - \theta^\dagger\|$, in which the goal is to drive the learned model to an attacker-defined target model $\theta^\dagger$;

  • aversion attack $g_{\text{att}}(\theta_{t+1}) = -\|\theta_{t+1} - \hat\theta\|$ (note the sign), in which the goal is to drive the learned model away from the “correct” model $\hat\theta$ (such as one estimated from the pre-attack data);

  • backdoor attack, in which the goal is to plant a backdoor into the learned model such that the model predicts correctly for typical test items, but will behave unexpectedly on special examples li2016data ; sen2018training ; chen2017targeted .
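The running cost and the first two attack types above can be sketched for a scalar model as follows. This is a minimal illustration under stated assumptions: the function names, the use of the absolute value as the norm, and the placement of the weight `lam` on the attack loss are all choices made here for concreteness.

```python
def attack_loss(theta_next, reference, kind="targeted"):
    """Attack loss on the victim's next parameter (scalar case).

    kind="targeted": distance to an attacker-chosen target model.
    kind="aversion": negative distance to the "correct" model (note the sign).
    """
    distance = abs(theta_next - reference)
    if kind == "targeted":
        return distance
    if kind == "aversion":
        return -distance
    raise ValueError("unknown attack kind: " + kind)


def running_cost(theta_next, z, a, reference, lam=1.0, kind="targeted"):
    """Weighted attack loss plus the perturbation cost |a - z|.

    Placing the weight lam on the attack loss is an assumption of this sketch.
    """
    return lam * attack_loss(theta_next, reference, kind) + abs(a - z)
```

A larger `lam` makes the attacker willing to pay for larger perturbations in exchange for moving the victim's model.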

2 Related Work

Data poisoning attacks have been studied against a wide range of learning systems. Poisoning attacks against SVM in both online and offline settings have been developed in biggio2012poisoning ; xiao2015support ; burkard2017analysis . Such attacks are generalized into a bilevel optimization framework against general offline learners with a convex objective function in mei2015using . A variety of attacks against other learners have been developed, including neural networks munoz2017towards ; koh2017understanding , autoregressive models chen2019optimal ; alfeld2016data , linear and stochastic bandits ma2018data ; jun2018adversarial , collaborative filtering li2016data , and sentiment analysis newell2014practicality . However, this body of prior work has almost exclusively focused on the offline setting, where the whole training set is available to the attacker and the attacker can poison any part of the training set at once. In contrast, our paper focuses on the more difficult online setting where the attacker has to act sequentially.

There is an intermediate attack setting which we call clairvoyant online attacks, where the attacker performs attack actions sequentially but has full knowledge of all future input data. Two examples are burkard2017analysis and wang2018data . While some application scenarios may warrant the clairvoyant assumption, it is not representative of how online learning is applied in real-world scenarios. Our paper focuses instead on the more common and more difficult setting where the attacker has no knowledge of the future data stream.

Finally, existing sequential attacks are narrowly focused. burkard2017analysis only studied heuristic attacks against SVM learning from data streams, and wang2018data focused exclusively on binary classification with an online gradient descent (OGD) learner. Our paper advocates a general optimal control view that supersedes these prior settings. For example, the two “styles of attack” in wang2018data called semi-online and fully-online can be unified into a single framework using the language of optimal control. We discuss this optimal control view in detail next.

3 An Optimal Control Formulation of Online Data Poisoning Attacks

For concreteness, we study the infinite time-horizon setting where $T$ in (2) goes to infinity. Generalizing our framework to the finite time-horizon setting is straightforward. We now show that the online data poisoning attack can be formulated as a standard discrete-time stochastic optimal control problem. Specifically, an optimal control problem is defined by specifying a number of key components:

  • The plant to be controlled is the combined system of the victim and the environment.

  • The state of the plant at time $t$ consists of the learned parameter $\theta_t$ (generated by the victim) and the current data point $z_t$ (generated by the environment), i.e. $s_t := [\theta_t, z_t]$.

  • The attacker’s control input is the perturbed training point $a_t$.

  • From the attacker’s perspective, the system dynamics describes how the plant’s state evolves given the control input. In our problem the system evolution is discrete-time since the online learning update is performed one step at a time. Formally, our system dynamics can be written as

    $$s_{t+1} = [\theta_{t+1}, z_{t+1}] = [f(\theta_t, a_t), z_{t+1}], \tag{4}$$

    in which the next environment data point $z_{t+1}$ is viewed as a stochastic “disturbance” that is sampled from the environment’s time-invariant distribution $P$.

  • Finally, the quality of the control is specified by the running cost. In our problem, the running cost is precisely $g(s_t, a_t) = g(\theta_t, z_t, a_t)$ in (3).
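The components listed above can be sketched as a single state-transition function. In this minimal Python illustration the names `plant_step`, `f`, and `sample_env`, as well as the example scalar learner and Gaussian environment, are hypothetical choices made here.

```python
import random


def plant_step(state, action, f, sample_env):
    """Advance the plant by one time step.

    state: pair (theta, z) of the victim's parameter and the current clean point.
    action: the perturbed point that the victim actually receives.
    f: the victim's update rule; sample_env: draws the next clean data point.
    """
    theta, _z = state
    theta_next = f(theta, action)   # the victim learns from the attacker's action
    z_next = sample_env()           # stochastic disturbance from the environment
    return (theta_next, z_next)


# Hypothetical scalar learner and environment, for illustration only.
rng = random.Random(0)
f = lambda theta, a: theta + 0.5 * (a - theta)
state = plant_step((0.0, 1.0), 2.0, f, lambda: rng.gauss(0.0, 1.0))
```

Note that the clean point in the current state affects the transition only through the attacker's choice of action; the victim never sees it directly.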

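The attacker’s optimal attack problem is characterized as a stochastic optimal control problem, namely finding a control policy $\phi$, mapping a state $s_t$ to an action $a_t = \phi(s_t)$, that minimizes the expected future discounted cumulative running cost:

$$\min_{\phi} \; \mathbb{E}_{P} \sum_{t=0}^{\infty} \gamma^t \, g(s_t, \phi(s_t)) \quad \text{s.t.} \quad s_0 = [\theta_0, z_0], \quad s_{t+1} = [f(\theta_t, \phi(s_t)), z_{t+1}]. \tag{5}$$

The expectation in equation (5) is over the randomness in the future data points $z_t$ sampled from $P$, as well as in $\phi$ if it is a stochastic policy. Here and in the rest of the paper, we hide the auxiliary optimization variables $\theta_t$ (which should appear under $\min$) to avoid clutter.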

It is important to point out that (5) is not computable because the attacker does not know the data generating distribution $P$, so it cannot evaluate the expectation. As we discuss in the next section, our strategy is to approximately and incrementally solve for the optimal policy as the attacker gathers more information about $P$ while the attack continues.

4 Near-Optimal Attacks via Model Predictive Control

The key obstacle that prevents the attacker from obtaining an optimal attack via (5) is the unknown data distribution $P$. However, a useful resource that the attacker possesses is the growing set of observed historical data points $z_{0:t}$ sampled from $P$. Using this growing dataset, the attacker can build an increasingly accurate estimate $\hat P_t$ that approaches the true distribution $P$. In this paper we use the empirical distribution over the union of $z_{0:t}$ and any prehistorical data $z_{-N:-1}$.

Specifically, at time $t$ with $\hat P_t$ in place of $P$, the attacker can solve for the optimal control policy of a surrogate control problem:

$$\phi_t = \arg\min_{\phi} \; \mathbb{E}_{\hat P_t} \sum_{\tau=t}^{\infty} \gamma^{\tau - t} \, g(s_\tau, \phi(s_\tau)) \quad \text{s.t.} \quad s_{\tau+1} = [f(\theta_\tau, \phi(s_\tau)), z_{\tau+1}], \tag{6}$$

which starts from the current state $s_t$ and differs from (5) in the distribution under which the expectation is taken. The attacker then uses the surrogate policy $\phi_t$ to perform a one-step attack:

$$a_t = \phi_t(s_t). \tag{7}$$

The attacker iterates over (6) and (7) as time goes on. Such a repeated procedure of (re)-planning ahead but only executing the immediate action is characteristic of Model Predictive Control (MPC). MPC is a common heuristic in stochastic nonlinear control. At each time step $t$, MPC performs planning with incomplete information (in particular $\hat P_t$ instead of $P$), and solves for an attack policy $\phi_t$ that optimizes the surrogate problem. However, MPC then carries out only the first control action $a_t = \phi_t(s_t)$. This allows MPC to adapt to future data inputs.
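The plan-then-execute-one-step loop described above can be sketched in Python as follows. This is a skeleton under stated assumptions: `mpc_attack`, `plan`, and `f` are hypothetical names, and the planner is passed in as a black box that returns a planned action sequence given the victim's current parameter, the current clean point, and the data observed so far.

```python
def mpc_attack(theta0, stream, f, plan, history=()):
    """Receding-horizon attack loop.

    plan(theta, z, seen) returns a planned action sequence; only its first
    element is executed, and planning is repeated at the next time step.
    """
    theta, seen, actions = theta0, list(history), []
    for z in stream:
        seen.append(z)                 # data behind the empirical distribution
        planned = plan(theta, z, seen)
        a = planned[0]                 # execute only the immediate action
        theta = f(theta, a)            # the victim updates on the poisoned point
        actions.append(a)
    return theta, actions


# Hypothetical learner and a trivial stand-in planner, for illustration only.
f = lambda theta, a: theta + 0.5 * (a - theta)
plan = lambda theta, z, seen: [z + 1.0, z + 2.0]
final_theta, executed = mpc_attack(0.0, [0.0, 0.0], f, plan)
```

Because `seen` grows at every step, a planner that samples future inputs from it automatically works with a progressively better estimate of the data distribution.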

Note that this general MPC framework applies to any type of attack problem. We next propose two different methods that the attacker may use to solve the surrogate control problem (6). For concreteness, we will focus on describing each method in the setting of a continuous action space, i.e. the set of admissible perturbed data points is a continuous set. We will also briefly discuss how each method can be adapted to solve discrete attack problems.

4.1 Nonlinear Programming (NLP)

In the NLP method the attacker further approximates the surrogate objective as

$$\mathbb{E}_{\hat P_t} \sum_{\tau=t}^{\infty} \gamma^{\tau-t} g(s_\tau, \phi(s_\tau)) \;\approx\; \mathbb{E}_{\hat P_t} \sum_{\tau=t}^{t+h-1} \gamma^{\tau-t} g(s_\tau, \phi(s_\tau)) \;\approx\; \sum_{\tau=t}^{t+h-1} \gamma^{\tau-t} g(\theta_\tau, \tilde z_\tau, a_\tau).$$

The first approximation introduces a time horizon $h$ steps ahead of $t$, making it a finite-horizon control problem. The second approximation replaces the expectation by one random instantiation of the future input trajectory $\tilde z_{t+1:t+h-1}$ drawn from $\hat P_t$, with $\tilde z_t = z_t$. It is of course possible to use the average of multiple future input trajectories to better approximate the expectation, though empirically we found that one trajectory is sufficient for MPC.

The attacker now solves the following optimization problem at every time $t$, where the action sequence $a_{t:t+h-1}$ replaces the policy function $\phi$ as the optimizing variable:

$$\min_{a_{t:t+h-1}} \; \sum_{\tau=t}^{t+h-1} \gamma^{\tau-t} g(\theta_\tau, \tilde z_\tau, a_\tau) \quad \text{s.t.} \quad \theta_{\tau+1} = f(\theta_\tau, a_\tau), \quad \tau = t, \ldots, t+h-1, \quad \theta_t \text{ given}. \tag{10}$$

The resulting attack problem (10) in general has a nonlinear objective stemming from $g_{\text{att}}$ and $g_{\text{per}}$ in (3), and nonconvex equality constraints stemming from the victim learner’s model update $f$ in (1). Nonetheless, the attacker can solve modest-sized attack problems using modern nonlinear programming solvers such as IPOPT wachter2006implementation . In cases where the action space is discrete, e.g. when attacking classification labels, (10) can still be formulated and will become a mixed integer nonlinear program (MINLP), which can be solved using commercial solvers such as KNITRO byrd2006k .

Recall that even though the attacker planned for the action sequence $a_{t:t+h-1}$, it will then only execute the immediate action $a_t$ in accordance with MPC.
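To make the finite-horizon planning step concrete, the sketch below plans an action sequence for a toy scalar learner. It does not use IPOPT: as a simple stand-in it eliminates the equality constraints by simulating the learner forward and runs plain gradient descent with central finite differences. The learner `f`, the smooth running cost `g`, the function names, and all numeric settings are assumptions made for illustration.

```python
def rollout_cost(theta, z_traj, actions, f, g, gamma):
    """Discounted cost of an action sequence, obtained by simulating the learner."""
    total = 0.0
    for k, (z, a) in enumerate(zip(z_traj, actions)):
        theta_next = f(theta, a)            # learner update on the poisoned point
        total += gamma ** k * g(theta_next, z, a)
        theta = theta_next
    return total


def plan_actions(theta, z_traj, f, g, gamma=0.9, steps=200, lr=0.05, eps=1e-5):
    """Finite-horizon planning by gradient descent with central differences."""
    actions = list(z_traj)                  # initialize at the clean data (no attack)
    for _ in range(steps):
        grad = []
        for i in range(len(actions)):
            up = actions[:i] + [actions[i] + eps] + actions[i + 1:]
            down = actions[:i] + [actions[i] - eps] + actions[i + 1:]
            grad.append((rollout_cost(theta, z_traj, up, f, g, gamma)
                         - rollout_cost(theta, z_traj, down, f, g, gamma)) / (2 * eps))
        actions = [a - lr * d for a, d in zip(actions, grad)]
    return actions


# Hypothetical scalar learner and smooth running cost, chosen for illustration.
f = lambda theta, a: theta + 0.5 * (a - theta)
g = lambda theta_next, z, a: 5.0 * (theta_next - 2.0) ** 2 + (a - z) ** 2

z_traj = [0.0, 0.0, 0.0]                    # current point plus sampled future inputs
planned = plan_actions(0.0, z_traj, f, g)
first_action = planned[0]                   # under MPC only this action is executed
```

A dedicated nonlinear programming solver would handle constraints and larger problems far better; the point here is only the structure of the problem, with one decision variable per planned action.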

4.2 Policy Learning with Deep Deterministic Policy Gradient (DDPG)

Instead of further approximating (6) with a deterministic objective, one can directly solve (6) for the optimal policy using reinforcement learning. In this paper, we utilize the deep deterministic policy gradient (DDPG) method lillicrap2015continuous to handle the continuous action space. In cases of a discrete action space, other reinforcement learning methods such as REINFORCE sutton2000policy or DQN mnih2013playing can be used.

DDPG learns a deterministic policy with an actor-critic framework. Roughly speaking, it simultaneously learns an actor network $\mu(s)$ parametrized by $w$ and a critic network $Q(s, a)$ parametrized by $v$. The actor network represents the currently learned policy while the critic network estimates the action-value function of the current policy, whose functional gradient guides the actor network to improve its policy. Specifically, the policy gradient can be written as:

$$\nabla_{w} J \approx \mathbb{E}_{s} \left[ \nabla_a Q(s, a; v) \big|_{a = \mu(s; w)} \, \nabla_{w} \mu(s; w) \right].$$

The critic network is updated using the standard temporal-difference update. We refer the reader to the original paper lillicrap2015continuous for a more detailed discussion of this algorithm and other deep learning implementation details.
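For reference, one common way to write the temporal-difference update of the critic is shown below, adapted here to the cost-minimization convention of this paper. The notation is an assumption of this sketch: $Q(s, a; v)$ denotes the critic, $\mu(s; w)$ the actor, $Q'$ and $\mu'$ slowly updated target copies as in the original DDPG paper, $g$ the running cost, and $M$ the minibatch size.

```latex
y_t = g(s_t, a_t) + \gamma \, Q'\big(s_{t+1}, \mu'(s_{t+1})\big),
\qquad
L(v) = \frac{1}{M} \sum_{t} \big( Q(s_t, a_t; v) - y_t \big)^2 .
```

Since $Q$ estimates a cost rather than a reward in this convention, the actor parameters are moved along the negative of the policy gradient.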

There are two advantages of this policy learning approach over the direct NLP approach. First, it actually learns a policy which can generalize to more than one step of attack. Second, it is a model-free method, which means that it does not require knowledge of the analytical form of the system dynamics $f$, which is necessary for the direct approach. Even though this paper focuses on the white-box setting, DDPG also applies to the black-box setting, in which the attacker can only call the learner as a black box.

In order to demonstrate the generalizability of the learned policy, in our experiments described later, we only allow the DDPG method to train once at the beginning of the attack using $z_0$ and the prehistorical data $z_{-N:-1}$. The learned policy is then applied to all later attack rounds without retraining. Of course, in practice, the attacker can also periodically improve its attack policy as it accumulates more clean data to have a better estimate $\hat P_t$ of the data distribution.

5 Experiments

In this section we empirically evaluate the performance of the aforementioned attack methods against a number of baselines on synthetic and real data. As an empirical measure of attack efficacy, we compare the different attack methods by the empirical discounted cumulative cost $\tilde J(t)$, defined as

$$\tilde J(t) := \sum_{\tau=0}^{t} \gamma^{\tau} \, g(\theta_\tau, z_\tau, a_\tau).$$

Note that $\tilde J(t)$ is computed on the actual instantiation of the environment data stream $z_0, \ldots, z_t$. Better attack methods tend to have smaller $\tilde J(t)$.
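Given the per-step running costs recorded during an attack, the empirical discounted cumulative cost can be accumulated in one pass. The following Python sketch uses a hypothetical function name and assumes that time is indexed from zero.

```python
def empirical_cumulative_cost(running_costs, gamma):
    """Return the list of cumulative discounted costs after each time step.

    running_costs[tau] is the running cost incurred at time step tau.
    """
    totals, acc = [], 0.0
    for tau, cost in enumerate(running_costs):
        acc += gamma ** tau * cost
        totals.append(acc)
    return totals


curve = empirical_cumulative_cost([1.0, 2.0, 4.0], gamma=0.5)
```

Plotting such a curve for each attack method on the same data stream gives a direct comparison of attack efficacy over time.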

5.1 Some Attack Baselines for Comparison

Null Attack:

This is the baseline without attack, namely $a_t = z_t$ for all $t$. We expect the null attack to form an upper bound on any attack method’s empirical discounted cumulative cost $\tilde J(t)$.

Greedy Attack:

The greedy strategy is applied widely as a practical heuristic in solving sequential decision problems (liu2017iterative ; lessard2018optimal ). For our problem, at time step $t$ the greedy attacker uses a time-invariant attack policy which minimizes the current step’s running cost $g(\theta_t, z_t, a_t)$. Specifically, the greedy attack policy can be written as

$$\phi^{G}(s_t) = \arg\min_{a} \; g(\theta_t, z_t, a).$$

Both the null attack and the greedy attack can be viewed as time-invariant policies that do not utilize the information in the historical data.
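For a simple enough learner and cost, the greedy action has a closed form. The sketch below works this out for a hypothetical scalar learner that moves its parameter a fraction `eta` of the way toward each received point, with a squared attack loss and a squared perturbation cost. These choices, the placement of the weight `lam`, and the function names are assumptions made for illustration. Setting the derivative of the one-step cost with respect to the action to zero yields the expression in `greedy_action`.

```python
def one_step_cost(theta, z, a, target, eta, lam):
    """Running cost for a scalar learner theta_next = theta + eta * (a - theta)."""
    theta_next = theta + eta * (a - theta)
    return lam * (theta_next - target) ** 2 + (a - z) ** 2


def greedy_action(theta, z, target, eta, lam):
    """Minimizer of one_step_cost over the action a.

    Derivation: 2*lam*eta*(theta_next - target) + 2*(a - z) = 0, solved for a.
    """
    return (z + lam * eta * (target - (1.0 - eta) * theta)) / (1.0 + lam * eta ** 2)


a_star = greedy_action(theta=0.0, z=0.0, target=2.0, eta=0.5, lam=4.0)
```

The greedy action depends only on the current state, which is why the policy is time-invariant and ignores everything observed in the past.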

Clairvoyant Attack:

A clairvoyant attacker is an idealized attacker who knows the number of attack steps $T$, and the exact past, present, and future data stream. As mentioned earlier, in most realistic online data poisoning settings an attacker would only have knowledge of the data stream up to $t$, the present time. Therefore, the clairvoyant attacker has strictly more information, and we expect it to form a lower bound on any realistic attack method’s empirical discounted cumulative cost $\tilde J(t)$. The clairvoyant attacker solves a finite time-horizon optimal control problem, equivalent to the formulation in wang2018data but without the terminal cost:

$$\min_{a_{0:T}} \; \sum_{t=0}^{T} \gamma^t \, g(\theta_t, z_t, a_t) \quad \text{s.t.} \quad \theta_0 \text{ given}, \quad \theta_{t+1} = f(\theta_t, a_t), \quad t = 0, \ldots, T. \tag{15}$$

It is then solved similarly as a nonlinear program.

5.2 Poisoning Task Specification

To specify a poisoning task is to define the victim learner $f$ in (1) and the attacker’s running cost $g$ in (3). We evaluate all attacks on two types of victim learners: online logistic regression, a supervised learning algorithm, and online soft k-means clustering, an unsupervised learning algorithm.

Online logistic regression.

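Online logistic regression performs a binary classification task. The incoming data takes the form of $z_t = (x_t, y_t)$, where $x_t \in \mathbb{R}^d$ is the feature vector and $y_t$ is the binary label. In the experiments, we focus on attacking the feature part of the data, as is done in a number of prior works mei2015using ; wang2018data ; koh2017understanding . The learner’s update rule is one step of gradient descent on the log likelihood with step size $\eta$:

$$f(\theta_t, z_t) = \theta_t - \eta \nabla_\theta \ell(\theta_t, z_t),$$

where $\ell$ is the logistic loss, i.e. the negative log likelihood.

The attacker wants to force the victim learner to stay close to a target parameter $\theta^\dagger$, i.e. this is a targeted attack. The attacker’s cost function is a weighted sum of two terms: the attack loss function $g_{\text{att}}$ is the negative cosine similarity between the victim’s parameter and the target parameter, and the perturbation cost $g_{\text{per}}$ is the L2 distance between the perturbed feature vector and the clean one. Writing $a_t = (x'_t, y_t)$ for the perturbed data point,

$$g_{\text{att}}(\theta_{t+1}) = -\cos(\theta_{t+1}, \theta^\dagger), \qquad g_{\text{per}}(z_t, a_t) = \|x'_t - x_t\|_2.$$

Recall that $\theta_{t+1} = f(\theta_t, a_t)$.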
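The logistic regression poisoning task can be sketched in plain Python as follows. Several details are assumptions of this illustration rather than facts from the paper: labels are encoded as -1 or +1, the weight `lam` multiplies the attack loss, and the function names are hypothetical.

```python
import math


def logistic_update(theta, x, y, eta):
    """One gradient step on the logistic loss log(1 + exp(-y * theta.x)), y in {-1, +1}."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    coeff = y / (1.0 + math.exp(margin))    # negative loss gradient is coeff * x
    return [t + eta * coeff * xi for t, xi in zip(theta, x)]


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def logistic_attack_cost(theta, x, a, y, theta_target, eta, lam):
    """Running cost when the clean features x are replaced by the perturbed features a."""
    theta_next = logistic_update(theta, a, y, eta)
    perturbation = math.sqrt(sum((ai - xi) ** 2 for ai, xi in zip(a, x)))
    return -lam * cosine(theta_next, theta_target) + perturbation
```

Because cosine similarity ignores the length of the parameter vector, this attack loss only rewards steering the direction of the decision boundary; it is undefined if the updated parameter is exactly zero.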

Online soft k-means.

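Online soft k-means performs a k-means clustering task. The incoming data contains only the feature vector, i.e. $z_t = x_t$. Its only difference from traditional k-means is that instead of updating only the centroid closest to the current data point, it updates all the centroids but the updates are weighted by their squared distances to the current data point using the softmax function bezdek1984fcm . Specifically, the learner’s update rule is one step of soft k-means update with step size $\eta$ on all $k$ centroids $\theta_t = (\theta_t^{(1)}, \ldots, \theta_t^{(k)})$. Given the received data point $a_t$ (equal to $z_t$ when there is no attack),

$$\theta_{t+1}^{(j)} = \theta_t^{(j)} + \eta \, r_j \, (a_t - \theta_t^{(j)}), \quad j = 1, \ldots, k, \qquad r = \mathrm{softmax}\left(-\|a_t - \theta_t^{(1)}\|_2^2, \ldots, -\|a_t - \theta_t^{(k)}\|_2^2\right).$$

Recall that $\theta_{t+1} = f(\theta_t, a_t)$. Similar to online logistic regression, we consider a targeted attack objective. The attacker wants to force the learned centroids to each stay close to the corresponding target centroid $\theta^{\dagger(j)}$. The attacker’s cost function is a weighted sum of two terms: the attack loss function $g_{\text{att}}$ is the sum of the squared distances between each of the victim’s centroids and the corresponding target centroid, and the perturbation cost $g_{\text{per}}$ is the L2 distance between the perturbed feature vector and the clean one:

$$g_{\text{att}}(\theta_{t+1}) = \sum_{j=1}^{k} \|\theta_{t+1}^{(j)} - \theta^{\dagger(j)}\|_2^2, \qquad g_{\text{per}}(z_t, a_t) = \|a_t - z_t\|_2.$$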
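The soft k-means update described above can be sketched as follows. The function name is hypothetical, and the weights are computed as a softmax over negative squared distances so that closer centroids receive larger updates; subtracting the smallest squared distance before exponentiating is a standard numerical-stability step that leaves the softmax unchanged.

```python
import math


def soft_kmeans_update(centroids, a, eta):
    """Move every centroid toward the received point a, weighted by a softmax.

    centroids: list of k centroids, each a list of floats; a: list of floats.
    """
    sq_dists = [sum((ai - ci) ** 2 for ai, ci in zip(a, c)) for c in centroids]
    smallest = min(sq_dists)
    weights = [math.exp(-(s - smallest)) for s in sq_dists]
    total = sum(weights)
    r = [w / total for w in weights]        # softmax of negative squared distances
    return [[ci + eta * rj * (ai - ci) for ai, ci in zip(a, c)]
            for c, rj in zip(centroids, r)]


updated = soft_kmeans_update([[-1.0], [1.0]], [0.0], eta=0.5)
```

When one centroid is much closer to the data point than the others, its weight approaches one and the update behaves like the traditional hard-assignment k-means step.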


5.3 Synthetic Data Experiments

We first show a synthetic data experiment where the attack policy can be visualized and understood.

Experiment setup

We let the data generating distribution $P$ be a mixture of two 1D Gaussians. The victim learner is online soft k-means with $k = 2$. The victim’s initial parameter consists of the two centroids $\theta_0^{(1)}$ and $\theta_0^{(2)}$, and the attack target model consists of $\theta^{\dagger(1)}$ and $\theta^{\dagger(2)}$. In other words, the attacker wants to drag the victim’s learned parameters in the opposite direction against what the clean data from $P$ would otherwise indicate, namely the true cluster centers. Other hyperparameters are: learning rate $\eta$, cost regularizer $\lambda$, discounting factor $\gamma$, evaluation length $T$ and look-ahead horizon $h$ for MPC.

For attack methods that require solving a nonlinear program, including GREEDY, NLP and Clairvoyant, we use the JuMP modeling language dunning2017jump and the IPOPT interior-point solver wachter2006implementation . Following the above specification, we run each attack method on the same data stream and compare their behavior.

Figure 2: The cumulative costs and attack trajectory plots for the five attacks. Panels: (a) Cumulative Attack Costs, (b) MPC Attack Trajectory, (c) DDPG Attack Trajectory, (d) NULL Attack Trajectory, (e) GREEDY Attack Trajectory, (f) Clairvoyant Attack Trajectory. In (b)-(f), the x-axis indicates the time step $t$; the transparent blue and red dots indicate the clean positive and negative data points at time step $t$, the solid blue and red dots indicate the attacker-perturbed data points, and the transparent lines in between indicate the perturbations.

Figure 2 shows the empirical discounted cumulative cost $\tilde J(t)$ as the attacks go on. On this toy example, the greedy attacker is only slightly more effective than the null attack baseline, while NLP and DDPG almost match Clairvoyant. As expected, the null and clairvoyant attacks form upper and lower bounds on $\tilde J(t)$.

Figure 2b-f shows the attack trajectory as learning goes on. Without any attacks (null) the model parameter gradually converges to the two true cluster centers.

The greedy attack only perturbs each data point slightly, moving the blue points up and the red points down. As a result, the victim model parameters are forced away from the true cluster centers, but only slightly. This failure to attack is due to its greedy nature: the immediate cost at each round is indeed minimized, but the model parameters never get close enough to the target centroids. This distance is then penalized in all rounds cumulatively.

In contrast, the optimal control-based attack methods NLP and DDPG exhibit a different strategy in the earlier rounds. They are willing to inject larger perturbations to the data points and sacrifice larger immediate costs in order to drive the victim’s model parameters quickly towards the target centroids. They then only need to stabilize the victim’s parameters near the target centroids with smaller per-step cost.

The clairvoyant attack strategy exhibits another interesting behavior towards the end of the episode. Because the clairvoyant attacker knows ahead of time that the attack is only going to be evaluated up to time $T$, in the last few rounds it starts to behave greedily. It makes only smaller perturbations and allows the victim model parameters to drift back towards the clean centroids and away from the target centroids. This allows the clairvoyant attacker to gain an edge over the other two methods towards the end of the episode in terms of the cost $\tilde J(t)$.

5.4 Real Data Experiments

In the real data experiments, we run each attack method on 10 data sets across two victim learners.


We use 5 datasets for online logistic regression: Banknote Authentication, Breast Cancer, Cardiotocography (CTG), Sonar, and MNIST 1 vs. 7, and 5 datasets for online k-means clustering: User Knowledge, Breast Cancer, Seeds, Posture, and MNIST 1 vs. 7. All datasets except for MNIST can be found in the UCI Machine Learning Repository Dua:2019 . Note that two datasets, Breast Cancer and MNIST, are shared across both tasks.


To reduce the running time, for high-dimensional datasets we reduce the dimension via PCA projection. We discuss the challenges of scalability and potential solutions in Section 6. Then, all datasets are normalized so that each feature has mean 0 and variance 1. Each dataset is then turned into a data stream by random sampling. Specifically, each training data point $z_t$ is sampled uniformly from the dataset with replacement.
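The normalization and streaming steps can be sketched with the Python standard library as follows. The function names and the fixed seed are hypothetical, the PCA step is omitted, and the sketch assumes that no feature is constant (otherwise the standard deviation would be zero).

```python
import random
import statistics


def standardize(rows):
    """Rescale every feature column to mean 0 and variance 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]   # population standard deviation
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def data_stream(rows, length, seed=0):
    """Sample `length` points uniformly from the dataset, with replacement."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(length)]


dataset = standardize([[1.0, 10.0], [3.0, 30.0]])
stream = data_stream(dataset, length=5)
```

Sampling with replacement makes the stream an i.i.d. draw from the empirical distribution of the dataset, which matches the time-invariant environment assumed by the threat model.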

Experiment Setup

In order to demonstrate the general applicability of our methods, we draw both the victim’s initial model $\theta_0$ and the attacker’s target $\theta^\dagger$ at random from a standard Gaussian distribution of the appropriate dimension, for both online logistic regression and online k-means in all 10 datasets. Across all datasets, we use the following hyperparameters:

. For online logistic regression while for online k-means .

For the DDPG attacker we only perform policy learning at the beginning to obtain a policy $\phi$; the learned policy is then fixed and used to perform all the attack actions in later rounds. In order to give it a fair chance, we give it a prehistorical dataset of size $N$. For the sake of fair comparison, we give the same prehistorical dataset to NLP as well. Note that NULL, GREEDY and Clairvoyant do not utilize the historical data.

For the NLP attack we set the look-ahead horizon $h$ such that the total runtime to perform all attacks does not exceed the DDPG training time, which is 24 hours. This results in a shorter horizon for online logistic regression on CTG, Sonar and MNIST than in all other experiments.

Figure 3: The empirical discounted cumulative cost for the five attack methods across 10 real datasets. The first row is on online logistic regression: (a) Banknote, (b) Breast, (c) CTG, (d) Sonar, (e) MNIST 1 vs. 7. The second row is on online k-means: (f) Knowledge, (g) Breast, (h) Seeds, (i) Posture, (j) MNIST 1 vs. 7.

The experiment results are shown in Figure 3. Several consistent patterns emerge from the experiments. The clairvoyant attacker consistently achieves the lowest cumulative cost across all 10 datasets. This is not surprising, as the clairvoyant attacker has extra information about the future. The NLP attack achieves clairvoyant-matching performance on all 7 datasets in which it is given a large enough look-ahead horizon. DDPG follows closely behind MPC and Clairvoyant on most of the datasets, indicating that the pretrained policy can achieve reasonably good attack performance in most cases. On the 3 datasets where NLP is restricted to a short horizon, DDPG outperforms the short-sighted NLP, indicating that when computational resources are limited, DDPG has an advantage by avoiding the iterative re-planning that NLP cannot bypass. GREEDY does not do well on any of the 10 datasets, achieving only a slightly lower cost than the NULL baseline. This matches our observations in the synthetic experiment.

Each of the attack methods also exhibits strategic behavioral patterns similar to what we observe in the synthetic experiment. In particular, the optimal-control based methods NLP and DDPG sacrifice larger immediate costs in earlier rounds in order to achieve smaller attack costs in later rounds. This is especially obvious in the online logistic regression plots 3b-e, where the cumulative costs rise dramatically in the first 50 rounds, becoming higher than the cost of NULL and GREEDY around that time. This early sacrifice pays off in later rounds, where the cumulative cost starts to fall much faster. In 3c-e, however, the short-sighted NLP (with a short look-ahead horizon) fails to fully pick up this long-term strategy, and exhibits a behavior close to an interpolation of greedy and optimal. This is not surprising, as NLP with horizon $h = 1$ is indeed equivalent to the GREEDY method. Thus, there is a spectrum of methods between GREEDY and NLP that can achieve various levels of performance with different computational costs.

6 Discussions and Conclusion

Our exercise in identifying the optimal online data poisoning attacks reveals some takeaway messages:

1. Optimal control-based methods NLP and DDPG achieve significantly better performance than heuristic methods such as GREEDY, and in some cases, they even achieve clairvoyant-level performance.

2. In the case that the learner’s dynamics is known to the attacker and is differentiable, and that the induced nonlinear program can be solved efficiently, NLP is a strong attack method.

3. DDPG, on the other hand, is able to learn a reasonable attack policy given enough prehistorical data. The attack policy can be fixed and deployed, which is advantageous when the data stream comes in quickly and there is no time to redo the planning in MPC/NLP. In our experiments, we did not allow DDPG to refine its policy based on newly available data. This could be one reason that DDPG is slightly behind MPC in most experiments.

4. As mentioned earlier, DDPG can work with a black-box victim learner. All it needs is for the victim to produce $\theta_{t+1}$ when given $a_t$. Therefore, even when the victim’s learning algorithm is unknown or not differentiable, DDPG is still a viable option.

There are many interesting directions for future work. Can the attack methods scale to high-dimensional datasets? Even though methods like deep reinforcement learning have been successful in dealing with high-dimensional state spaces such as in the Atari games, a high-dimensional continuous action space is still very challenging. As a concrete example, in lillicrap2015continuous , the highest action dimension they were able to tackle is 12, whereas if we want to perform the attack on the original MNIST data set, the action space dimensionality is 784. There are possible heuristic ways to manually reduce the action space dimension. One way is to extract the principal components of the high-dimensional data and only allow adversarial examples to lie within the subspace spanned by the first few principal components, as we did. How should one perform the attack if the clean training sequence is not i.i.d., but selectively sampled by the victim learner? This is the case when the victim learner is a sequential decision-maker such as an active learning algorithm, a multi-armed bandit, or a reinforcement learning agent itself. Finally, how should the victim defend against such online poisoning attacks?

To conclude, in this paper we formulated online poisoning attacks as stochastic optimal control. We proposed two high-performing attack algorithms: one based on MPC and nonlinear programming that achieves near-clairvoyant performance, and the other based on reinforcement learning that learns an attack policy which can be applied without retraining. We compared these attack algorithms with several baseline methods via extensive real-data experiments and found that they achieve consistently better performance.


This work is supported in part by NSF 1837132, 1545481, 1704117, 1623605, 1561512, the MADLab AF Center of Excellence FA9550-18-1-0166, and the University of Wisconsin.