1 Introduction
As machine learning systems are increasingly deployed in real-world applications, the security of machine learning algorithms is receiving growing attention. Adversarial machine learning studies the vulnerability of machine learning systems to malicious attacks vorobeychik2018adversarial ; joseph2018adversarial ; zhu2018optimal . Formulating and understanding the optimal attacks that an adversary might carry out is important, as it prepares us to manage the damage and helps us develop defenses. A particular line of work studies data poisoning attacks, in which the attacker aims to affect the learning process by contaminating the training data xiao2015support ; mei2015using ; burkard2017analysis ; chen2019optimal ; jun2018adversarial ; li2016data .
While there has been a long line of work on data poisoning attacks, it has focused almost exclusively on offline learning, where the victim machine learner performs batch learning on a training set biggio2012poisoning ; munoz2017towards ; xiao2015support ; mei2015using ; sen2018training ; chen2017targeted . Such attack settings have been criticized as unrealistic. For example, it may be practically hard for an attacker to modify training data saved on a company’s internal server. On the other hand, in applications such as product recommendation and e-commerce, user-generated data arrives in a streaming fashion. Such applications are particularly susceptible to adversarial attacks because it is easier for the attacker to perform data poisoning from the outside.
In this paper we present the first principled methods for finding optimal data poisoning attacks against online learning. Our key insight is to formulate online poisoning attacks as a stochastic optimal control problem. We then propose two algorithms, one based on traditional model-based optimal control and the other based on deep reinforcement learning, and show that they achieve near-oracle performance in synthetic and real-data experiments.
Figure 1 depicts the threat model we consider in this paper. We distinguish three agents: the environment, the online learning victim, and the attacker.
The environment produces a data point $z_t \in \mathcal{Z}$ at time $t$, drawn from a time-invariant distribution $P$. For example, $z_t$ can be a feature-label pair $z_t := (x_t, y_t)$ in supervised learning or just the features $z_t := x_t$ in unsupervised learning. Without an attacker, the victim receives $z_t$ as is, and performs a learning update of its model
$$\theta_{t+1} = f(\theta_t, z_t) \qquad (1)$$
where $\theta_t$ is the victim’s learned parameter at time step $t$. For example, $f$ can be one gradient step $f(\theta_t, z_t) := \theta_t - \eta \nabla_\theta \ell(\theta_t, z_t)$ when the learner performs online gradient descent with loss function $\ell$ and step size $\eta$.
The attacker sits in between the environment and the victim. In this paper, we assume a white-box attacker who has knowledge of the victim’s update formula $f$, the victim’s initial model $\theta_0$, the data $z_{0:t}$ generated by the environment so far, and optionally a “pre-historical” data set $z_{-N:-1}$. Importantly, however, at time $t$ the attacker does not have clairvoyant knowledge of future data points $z_\tau$ for $\tau > t$.
The attacker can perform only one type of action: it can choose to manipulate the data point $z_t$ into a different $a_t \in \mathcal{Z}$. The attacker incurs a perturbation cost $g_{\mathrm{per}}(z_t, a_t)$, for example $\|a_t - z_t\|_p$ under an appropriate $p$-norm. The victim then receives the (potentially manipulated) data point $a_t$ and proceeds with model update (1), i.e. $\theta_{t+1} = f(\theta_t, a_t)$.
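The interaction between environment, attacker and victim can be sketched in a few lines of Python. This is a minimal illustration rather than the setup used in the paper: the squared-loss learner, the function names (`ogd_update`, `run_stream`) and the label-shifting attacker are all assumptions made for the example.

```python
def ogd_update(theta, z, eta):
    """One online-gradient-descent step on the squared loss 0.5 * (theta.x - y)^2.

    The squared loss is an illustrative choice; any differentiable loss fits here.
    """
    x, y = z
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [t - eta * residual * xi for t, xi in zip(theta, x)]


def run_stream(theta0, stream, attacker, eta=0.1):
    """Pass every environment point through the attacker, then on to the victim."""
    theta = list(theta0)
    for z in stream:
        a = attacker(theta, z)             # attacker sees the current model and clean point
        theta = ogd_update(theta, a, eta)  # victim updates on whatever it receives
    return theta


def null_attacker(theta, z):
    """No attack: forward the clean data point unchanged."""
    return z


def label_shift_attacker(theta, z):
    """Hypothetical attacker that adds 1 to every label and leaves features alone."""
    x, y = z
    return (x, y + 1.0)
```

With a stream that repeats the point `([1.0], 2.0)`, the unattacked victim settles at a parameter of 2, while the label-shifted stream drags it to 3, which shows how a small per-point manipulation accumulates through the update rule.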
The attacker’s goal is to force the victim’s learned model to satisfy certain properties while paying a small perturbation cost. This goal can be captured by a discounted cumulative cost on the attacker’s side:
$$\sum_{t=0}^{T} \gamma^t g(\theta_t, z_t, a_t) \qquad (2)$$
where $\gamma$ is a discounting factor and the running cost function $g$ at every step is defined by
$$g(\theta_t, z_t, a_t) := \lambda\, g_{\mathrm{att}}(\theta_{t+1}) + g_{\mathrm{per}}(z_t, a_t), \quad \theta_{t+1} = f(\theta_t, a_t) \qquad (3)$$
Here $\lambda$ is a weight chosen by the attacker to balance the attack loss function $g_{\mathrm{att}}$ and the perturbation cost $g_{\mathrm{per}}$. The attack loss function is concerned with $\theta_{t+1}$ rather than $\theta_t$ because $\theta_t$ is already known and fixed at time step $t$, and what the attacker can affect, through the adversarial perturbation, is $\theta_{t+1}$. The attack loss function can encode a variety of different attack properties considered in the literature, such as

targeted attack $g_{\mathrm{att}}(\theta) = \|\theta - \theta^\dagger\|$, in which the goal is to drive the learned model to an attacker-defined target model $\theta^\dagger$;

aversion attack $g_{\mathrm{att}}(\theta) = -\|\theta - \hat{\theta}\|$ (note the sign), in which the goal is to drive the learned model away from the “correct” model $\hat{\theta}$ (such as one estimated from the pre-attack data);

backdoor attack, in which the goal is to plant a backdoor into the learned model such that the model predicts correctly for typical test items, but will behave unexpectedly on special examples li2016data ; sen2018training ; chen2017targeted .
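To make the running cost concrete, the following Python sketch combines an attack loss with a perturbation cost. It is only an illustration under stated assumptions: the Euclidean norm, the placement of the weight `lam` on the attack loss, and all function names are choices made for this example.

```python
import math


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def targeted_loss(theta, target):
    """Small when the learned model is close to the attacker's target model."""
    return l2(theta, target)


def aversion_loss(theta, reference):
    """Note the sign: small (very negative) when the model is far from `reference`."""
    return -l2(theta, reference)


def running_cost(theta_next, z, a, attack_loss, lam):
    """Weighted attack loss on the updated model plus the perturbation cost."""
    return lam * attack_loss(theta_next) + l2(z, a)
```

For example, `running_cost(theta_next, z, a, lambda th: targeted_loss(th, target), lam)` scores one attack step; swapping in `aversion_loss` changes the attacker's objective without touching the rest of the pipeline.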
2 Related Work
Data poisoning attacks have been studied against a wide range of learning systems. Poisoning attacks against SVMs in both online and offline settings have been developed in biggio2012poisoning ; xiao2015support ; burkard2017analysis . Such attacks are generalized into a bilevel optimization framework against general offline learners with a convex objective function in mei2015using . A variety of attacks against other learners have been developed, including neural networks munoz2017towards ; koh2017understanding , autoregressive models chen2019optimal ; alfeld2016data , linear and stochastic bandits ma2018data ; jun2018adversarial , collaborative filtering li2016data , and sentiment analysis newell2014practicality . However, this body of prior work has almost exclusively focused on the offline setting, where the whole training set is available to the attacker and the attacker can poison any part of the training set at once. In contrast, our paper focuses on the more difficult online setting where the attacker has to act sequentially.
There is an intermediate attack setting, which we call clairvoyant online attacks, where the attacker performs attack actions sequentially but has full knowledge of all future input data. Two examples are burkard2017analysis and wang2018data . While some application scenarios may warrant the clairvoyant assumption, it is not representative of how online learning is applied in real-world scenarios. Our paper focuses instead on the more common and more difficult setting where the attacker has no knowledge of the future data stream.
Finally, existing sequential attacks are narrowly focused. burkard2017analysis only studied heuristic attacks against SVM learning from data streams, and wang2018data focused exclusively on binary classification with an online gradient descent (OGD) learner. Our paper advocates a general optimal control view that supersedes these prior settings. For example, the two “styles of attack” in wang2018data , called semi-online and fully-online, can be unified into a single framework using the language of optimal control. We discuss this optimal control view in detail next.
3 An Optimal Control Formulation of Online Data Poisoning Attacks
For concreteness, we study the infinite time-horizon setting where $T$ in (2) goes to infinity. Generalizing our framework to the finite time-horizon setting is straightforward. We now show that the online data poisoning attack can be formulated as a standard discrete-time stochastic optimal control problem. Specifically, an optimal control problem is defined by specifying a number of key components:

The plant to be controlled is the combined system of the victim and the environment.

The state $s_t$ of the plant at time $t$ consists of the learned parameter $\theta_t$ (generated by the victim) and the current data point $z_t$ (generated by the environment), i.e. $s_t := (\theta_t, z_t)$.

The attacker’s control input is the perturbed training point $a_t$.

From the attacker’s perspective, the system dynamics describes how the plant’s state evolves given the control input. In our problem the system evolution is discrete-time since the online learning update is performed one step at a time. Formally, our system dynamics can be written as
$$s_{t+1} = (\theta_{t+1}, z_{t+1}) = (f(\theta_t, a_t), z_{t+1}) \qquad (4)$$
in which the next environment data point $z_{t+1}$ is viewed as a stochastic “disturbance” that is sampled from the environment’s time-invariant distribution $P$.

Finally, the quality of the control is specified by the running cost. In our problem, the running cost is precisely $g(\theta_t, z_t, a_t)$ in (3).
The attacker’s optimal attack problem is characterized as a stochastic optimal control problem, namely finding a control policy $\phi$, which maps the state $s_t$ to an action $a_t = \phi(s_t)$, that minimizes the expected future discounted cumulative running cost:
$$\min_{\phi} \; \mathbb{E}_{P} \sum_{t=0}^{\infty} \gamma^t g(\theta_t, z_t, \phi(s_t)) \qquad (5)$$
$$\text{s.t.} \quad \theta_{t+1} = f(\theta_t, \phi(s_t)), \quad t \ge 0$$
The expectation in equation (5) is over the randomness in the future data points $z_t$ sampled from $P$, as well as in $\phi$ if it is a stochastic policy. Here and in the rest of the paper, we hide the auxiliary optimization variables $\theta_t$ (which should appear under $\min$) to avoid clutter.
It is important to point out that (5) is not computable because the attacker does not know the data generating distribution $P$, so it cannot evaluate the expectation. As we discuss in the next section, our strategy is to approximately and incrementally solve for the optimal policy as the attacker gathers more information about $P$ while the attack continues.
4 Near-Optimal Attacks via Model Predictive Control
The key obstacle that prevents the attacker from obtaining an optimal attack via (5) is the unknown data distribution $P$. However, a useful resource that the attacker possesses is the growing observation of historical data points $z_{0:t}$ sampled from $P$. Using this growing dataset, the attacker can build an increasingly accurate estimate $\hat{P}_t$ that approaches the true distribution $P$. In this paper we use the empirical distribution over the union of $z_{0:t}$ and any pre-historical data $z_{-N:-1}$.
Specifically, at time $t$ with $\hat{P}_t$ in place of $P$, the attacker can solve for the optimal control policy of a surrogate control problem:
$$\phi_t = \arg\min_{\phi} \; \mathbb{E}_{\hat{P}_t} \sum_{\tau=t}^{\infty} \gamma^{\tau-t} g(\theta_\tau, z_\tau, \phi(s_\tau)) \qquad (6)$$
$$\text{s.t.} \quad \theta_{\tau+1} = f(\theta_\tau, \phi(s_\tau)), \quad \tau \ge t$$
in which the only difference from (5) is the expectation. The attacker then uses the surrogate policy $\phi_t$ to perform a one-step attack:
$$a_t = \phi_t(s_t) \qquad (7)$$
The attacker iterates over (6) and (7) as time goes on. Such a repeated procedure of (re)planning ahead but only executing the immediate action is characteristic of Model Predictive Control (MPC). MPC is a common heuristic in stochastic nonlinear control. At each time step $t$, MPC performs planning with incomplete information (in particular $\hat{P}_t$ instead of $P$), and solves for an attack policy $\phi_t$ that optimizes the surrogate problem. However, MPC then carries out only the first control action $a_t$. This allows MPC to adapt to future data inputs.
Note that this general MPC framework applies to any type of attack problem. We next propose two different methods that the attacker may use to solve the surrogate control problem (6). For concreteness, we will focus on describing each method in the setting of a continuous action space, i.e. $\mathcal{Z}$ is a continuous set. We will also briefly discuss how each method can be adapted to solve discrete attack problems.
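The replan-then-execute-one-step loop can be sketched as follows. This is a toy illustration, not the solvers proposed in this paper: the scalar mean-tracking victim, the squared costs, the restriction of the planner to constant-offset policies, and every name and default value are assumptions made so that the example stays short and runnable.

```python
import random


def victim_update(theta, a, eta=0.2):
    """Victim: online mean tracking, theta <- theta + eta * (a - theta)."""
    return theta + eta * (a - theta)


def running_cost(theta_next, z, a, target, lam=1.0):
    """Weighted attack loss plus perturbation cost (both squared for simplicity)."""
    return lam * (theta_next - target) ** 2 + (a - z) ** 2


def plan(theta, z_now, history, target, horizon, gamma, rng, offsets):
    """Approximately solve the surrogate problem from the current state.

    Future points are resampled from `history` (the empirical distribution),
    and the search is restricted to constant-offset policies a = z + offset.
    """
    future = [z_now] + [rng.choice(history) for _ in range(horizon - 1)]
    best_cost, best_actions = float("inf"), None
    for offset in offsets:
        th, cost, actions = theta, 0.0, []
        for k, z in enumerate(future):
            a = z + offset
            th = victim_update(th, a)
            cost += gamma ** k * running_cost(th, z, a, target)
            actions.append(a)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions


def mpc_attack(theta0, stream, prehistory, target, horizon=5, gamma=0.9,
               offsets=(0.0, 0.5), seed=0):
    """Replan at every step, but execute only the first planned action."""
    rng = random.Random(seed)
    theta, history, total = theta0, list(prehistory), 0.0
    for t, z in enumerate(stream):
        history.append(z)                 # the estimate grows with the stream
        a = plan(theta, z, history, target, horizon, gamma, rng, offsets)[0]
        theta = victim_update(theta, a)   # the victim sees the poisoned point
        total += gamma ** t * running_cost(theta, z, a, target)
    return theta, total
```

Calling `mpc_attack(0.0, stream, prehistory, target=1.0)` returns the victim's final parameter and the attacker's discounted cumulative cost. Swapping `plan` for a stronger solver leaves the outer loop unchanged, which is the point of the MPC structure.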
4.1 Nonlinear Programming (NLP)
In the NLP method the attacker further approximates the surrogate objective as
$$\mathbb{E}_{\hat{P}_t} \sum_{\tau=t}^{\infty} \gamma^{\tau-t} g(\theta_\tau, z_\tau, a_\tau) \approx \mathbb{E}_{\hat{P}_t} \sum_{\tau=t}^{t+h-1} \gamma^{\tau-t} g(\theta_\tau, z_\tau, a_\tau) \qquad (8)$$
$$\approx \sum_{\tau=t}^{t+h-1} \gamma^{\tau-t} g(\theta_\tau, z_\tau, a_\tau) \qquad (9)$$
The first approximation introduces a time horizon $h$ steps ahead of $t$, making it a finite-horizon control problem. The second approximation replaces the expectation by one random instantiation of the future input trajectory $z_{t+1:t+h-1} \sim \hat{P}_t$. It is of course possible to use the average of multiple future input trajectories to better approximate the expectation, though empirically we found that one trajectory is sufficient for MPC.
The attacker now solves the following optimization problem at every time $t$, where the action sequence $a_{t:t+h-1}$ replaces the policy function $\phi$ as the optimization variable:
$$\min_{a_{t:t+h-1}} \; \sum_{\tau=t}^{t+h-1} \gamma^{\tau-t} g(\theta_\tau, z_\tau, a_\tau) \qquad (10)$$
$$\text{s.t.} \quad \theta_{\tau+1} = f(\theta_\tau, a_\tau), \quad \tau = t, \dots, t+h-1$$
The resulting attack problem (10) in general has a nonlinear objective stemming from $g_{\mathrm{att}}$ and $g_{\mathrm{per}}$ in (3), and nonconvex equality constraints stemming from the victim learner’s model update (1). Nonetheless, the attacker can solve modest-sized attack problems using modern nonlinear programming solvers such as IPOPT wachter2006implementation . In cases where the action space is discrete, e.g. when attacking classification labels, (10) can still be formulated and becomes a mixed-integer nonlinear program (MINLP), which can be solved using commercial solvers such as KNITRO byrd2006k .
Recall that even though the attacker planned for the action sequence $a_{t:t+h-1}$, it will then only execute the immediate action $a_t$, in accordance with MPC.
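As a rough illustration of the finite-horizon planning problem, the sketch below optimizes an action sequence over one sampled future trajectory. It is a stand-in under stated assumptions: plain gradient descent with finite-difference gradients replaces a real nonlinear programming solver such as IPOPT, the victim is a scalar mean tracker, both cost terms are squared, and all names and default values are hypothetical.

```python
def rollout_cost(actions, theta, zs, target, lam=1.0, gamma=0.9, eta=0.2):
    """Discounted cost of an action sequence along one sampled trajectory `zs`.

    The victim's update is substituted directly into the objective, which
    removes the equality constraints of the original program.
    """
    cost = 0.0
    for k, (a, z) in enumerate(zip(actions, zs)):
        theta = theta + eta * (a - theta)  # victim update on the poisoned point
        cost += gamma ** k * (lam * (theta - target) ** 2 + (a - z) ** 2)
    return cost


def solve_plan(theta, zs, target, lr=0.1, iters=500, eps=1e-6):
    """Minimize the rollout cost over the whole action sequence."""
    actions = list(zs)  # warm start from the no-attack sequence
    for _ in range(iters):
        grad = []
        for i in range(len(actions)):
            up = actions[:i] + [actions[i] + eps] + actions[i + 1:]
            down = actions[:i] + [actions[i] - eps] + actions[i + 1:]
            diff = rollout_cost(up, theta, zs, target) - rollout_cost(down, theta, zs, target)
            grad.append(diff / (2 * eps))  # central finite difference
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions
```

Here `len(zs)` plays the role of the look-ahead horizon. Under MPC only `solve_plan(...)[0]` would be executed before replanning on the next data point, and a horizon of one reduces to minimizing the current step's cost alone.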
4.2 Policy Learning with Deep Deterministic Policy Gradient (DDPG)
Instead of further approximating (6) with a deterministic objective, one can directly solve (6) for the optimal policy using reinforcement learning. In this paper, we utilize the deep deterministic policy gradient (DDPG) method lillicrap2015continuous to handle the continuous action space. In cases of discrete action space, other reinforcement learning methods such as REINFORCE sutton2000policy or DQN mnih2013playing can be used.
DDPG learns a deterministic policy with an actor-critic framework. Roughly speaking, it simultaneously learns an actor network $\mu(s)$ parametrized by $\theta^\mu$ and a critic network $Q(s, a)$ parametrized by $\theta^Q$. The actor network represents the currently learned policy while the critic network estimates the action-value function of the current policy, whose functional gradient guides the actor network to improve its policy. Specifically, the policy gradient can be written as:
$$\nabla_{\theta^\mu} J = \mathbb{E}_{s} \left[ \nabla_a Q(s, a \mid \theta^Q) \big|_{a = \mu(s)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \right] \qquad (11)$$
The critic network is updated using the standard temporal-difference update. We refer the reader to the original paper lillicrap2015continuous for a more detailed discussion of this algorithm and other deep learning implementation details.
There are two advantages of this policy learning approach over the direct NLP approach. First, it actually learns a policy, which can generalize to more than one step of attack. Second, it is a model-free method, which means that it does not require knowledge of the analytical form of the system dynamics $f$, which is necessary for the direct approach. Even though this paper focuses on the white-box setting, DDPG also applies to the black-box setting, in which the attacker can only call the learner as a black box.
In order to demonstrate the generalizability of the learned policy, in our experiments described later, we only allow the DDPG method to train once at the beginning of the attack using and the pre-historical data $z_{-N:-1}$. The learned policy is then applied to all later attack rounds without retraining. Of course, in practice, the attacker can also periodically improve its attack policy as it accumulates more clean data and obtains a better estimate $\hat{P}_t$.
5 Experiments
In this section we empirically evaluate the performance of the aforementioned attack methods against a number of baselines on synthetic and real data. As an empirical measure of attack efficacy, we compare the different attack methods by the empirical discounted cumulative cost $J(t)$, defined as
$$J(t) := \sum_{\tau=0}^{t} \gamma^\tau g(\theta_\tau, z_\tau, a_\tau) \qquad (12)$$
Note that $J(t)$ is computed on the actual instantiation of the environment data stream $z_{0:t}$. Better attack methods tend to have smaller $J(t)$.
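Given the per-step running costs recorded during an attack, the whole cost curve can be computed in one pass. The helper below is a small sketch; the function name and the choice to return every prefix sum (one value per time step) are assumptions made for illustration.

```python
from itertools import accumulate


def cumulative_cost_curve(step_costs, gamma):
    """Return the discounted cumulative cost after each step of the attack.

    step_costs[t] is the running cost actually incurred at time t.
    """
    return list(accumulate(gamma ** t * c for t, c in enumerate(step_costs)))
```

Plotting the curves of two attack methods computed on the same data stream gives the kind of comparison used in the experiments: at any time step, the lower curve belongs to the more effective attack.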
5.1 Some Attack Baselines for Comparison
Null Attack:
This is the baseline without attack, namely $a_t = z_t$ for all $t$. We expect the null attack to form an upper bound on any attack method’s empirical discounted cumulative cost $J(t)$.
Greedy Attack:
The greedy strategy is applied widely as a practical heuristic in solving sequential decision problems (liu2017iterative ; lessard2018optimal ). For our problem, at time step $t$ the greedy attacker uses a time-invariant attack policy which minimizes the current step’s running cost $g$. Specifically, the greedy attack policy can be written as
$$\phi^{G}(s_t) = \arg\min_{a} \; g(\theta_t, z_t, a) \qquad (13)$$
Both the null attack and the greedy attack can be viewed as time-invariant policies that do not utilize the information in the historical data.
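A greedy attacker only needs a one-dimensional minimization when the action is a scalar. The sketch below uses golden-section search for that purpose. It is illustrative only: the scalar mean-tracking victim, the squared cost terms, the search radius, and all names are assumptions, and golden-section search is simply one convenient method for a convex one-step cost.

```python
import math


def golden_section_min(fn, lo, hi, tol=1e-8):
    """Minimize a unimodal scalar function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if fn(c) < fn(d):
            hi, d = d, c                  # minimum lies in [lo, d]
            c = hi - inv_phi * (hi - lo)
        else:
            lo, c = c, d                  # minimum lies in [c, hi]
            d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2


def greedy_action(theta, z, target, lam=1.0, eta=0.2, radius=10.0):
    """Pick the action that minimizes only the current step's running cost."""
    def one_step_cost(a):
        theta_next = theta + eta * (a - theta)  # victim update on the poisoned point
        return lam * (theta_next - target) ** 2 + (a - z) ** 2
    return golden_section_min(one_step_cost, z - radius, z + radius)
```

Because the greedy action ignores how the current perturbation changes future costs, it tends to stay close to the clean point, which matches the short-sighted behavior attributed to the greedy baseline.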
Clairvoyant Attack:
A clairvoyant attacker is an idealized attacker who knows the number of steps $T$ of the attack, and the exact past, present, and future data stream. As mentioned earlier, in most realistic online data poisoning settings an attacker would only have knowledge of the data stream up to $t$, the present time. Therefore, the clairvoyant attacker has strictly more information, and we expect it to form a lower bound on any realistic attack method’s empirical discounted cumulative cost $J(t)$. The clairvoyant attacker solves a finite time-horizon optimal control problem, equivalent to the formulation in wang2018data but without the terminal cost:
$$\min_{a_{0:T}} \; \sum_{t=0}^{T} \gamma^t g(\theta_t, z_t, a_t) \qquad (14)$$
$$\text{s.t.} \quad \theta_{t+1} = f(\theta_t, a_t), \quad t = 0, \dots, T \qquad (15)$$
It is then likewise solved as a nonlinear program.
5.2 Poisoning Task Specification
To specify a poisoning task is to define the victim learner $f$ in (1) and the attacker’s running cost $g$ in (3). We evaluate all attacks on two types of victim learners: online logistic regression, a supervised learning algorithm, and online soft k-means clustering, an unsupervised learning algorithm.
Online logistic regression.
Online logistic regression performs a binary classification task. The incoming data takes the form of $z_t = (x_t, y_t)$, where $x_t \in \mathbb{R}^d$ is the feature vector and $y_t \in \{-1, 1\}$ is the binary label. In the experiments, we focus on attacking the feature part of the data, as is done in a number of prior works mei2015using ; wang2018data ; koh2017understanding . The learner’s update rule is one step of gradient descent on the negative log-likelihood with step size $\eta$:
$$f(\theta_t, (x_t, y_t)) = \theta_t + \eta \frac{y_t x_t}{1 + \exp(y_t \theta_t^\top x_t)} \qquad (16)$$
The attacker wants to force the victim learner to stay close to a target parameter $\theta^\dagger$, i.e. this is a targeted attack. The attacker’s cost function $g$ is a weighted sum of two terms: the attack loss function $g_{\mathrm{att}}$ is the negative cosine similarity between the victim’s parameter and the target parameter, and the perturbation cost $g_{\mathrm{per}}$ is the L2 distance between the perturbed feature vector and the clean one:
$$g(\theta_t, z_t, a_t) = -\lambda \cos(\theta_{t+1}, \theta^\dagger) + \|a_t - x_t\|_2 \qquad (17)$$
Recall that $\theta_{t+1}$ is the victim’s model after it updates on the perturbed feature vector $a_t$ with the original label, i.e. $\theta_{t+1} = f(\theta_t, (a_t, y_t))$.
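The logistic-regression poisoning task can be written out in a few functions. This is a hedged sketch rather than the experimental code: it assumes labels in {-1, +1}, an unsquared L2 perturbation cost, a weight `lam` on the cosine term, and helper names chosen for this example.

```python
import math


def inv_one_plus_exp(m):
    """Numerically stable evaluation of 1 / (1 + exp(m))."""
    if m >= 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))


def logistic_update(theta, x, y, eta):
    """One gradient step on the logistic loss log(1 + exp(-y * theta.x))."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    scale = eta * y * inv_one_plus_exp(margin)
    return [t + scale * xi for t, xi in zip(theta, x)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attack_cost(theta, x_clean, y, x_poisoned, target, lam, eta):
    """Negative cosine similarity to the target plus L2 perturbation of the features."""
    theta_next = logistic_update(theta, x_poisoned, y, eta)  # victim sees poisoned features
    perturbation = math.sqrt(sum((p - c) ** 2 for p, c in zip(x_poisoned, x_clean)))
    return -lam * cosine(theta_next, target) + perturbation
```

Evaluating `attack_cost` for a candidate `x_poisoned` shows the trade-off directly: moving the features further from `x_clean` can align the updated model with `target`, but each unit of movement is paid for in the second term.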
Online soft k-means.
Online soft k-means performs a k-means clustering task. The incoming data contains only the feature vector, i.e. $z_t = x_t$. Its only difference from traditional k-means is that instead of updating only the centroid closest to the current data point, it updates all the centroids, with the updates weighted by a softmax over their negative squared distances to the current data point bezdek1984fcm . Specifically, the learner’s update rule is one step of soft k-means update with step size $\eta$ on all centroids $\theta_{t,1}, \dots, \theta_{t,k}$:
$$\theta_{t+1,j} = \theta_{t,j} + \eta\, r_{t,j} (z_t - \theta_{t,j}), \quad j = 1, \dots, k \qquad (18)$$
$$r_{t,j} = \frac{\exp(-\|z_t - \theta_{t,j}\|_2^2)}{\sum_{i=1}^{k} \exp(-\|z_t - \theta_{t,i}\|_2^2)} \qquad (19)$$
Recall that under attack the victim receives $a_t$ in place of $z_t$, so that $\theta_{t+1} = f(\theta_t, a_t)$. Similar to online logistic regression, we consider a targeted attack objective. The attacker wants to force the learned centroids to each stay close to the corresponding target centroid $\theta^\dagger_j$. The attacker’s cost function is a weighted sum of two terms: the attack loss function is the sum of the squared distances between each of the victim’s centroids and the corresponding target centroid, and the perturbation cost is the L2 distance between the perturbed feature vector and the clean one:
$$g(\theta_t, z_t, a_t) = \lambda \sum_{j=1}^{k} \|\theta_{t+1,j} - \theta^\dagger_j\|_2^2 + \|a_t - z_t\|_2 \qquad (20)$$
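One step of the soft k-means learner described above can be sketched as follows. The exact form of the weights (a softmax over negative squared distances with unit temperature) and the function name are assumptions made for illustration.

```python
import math


def soft_kmeans_update(centroids, point, eta):
    """Move every centroid toward `point`, weighted by a softmax over
    the negative squared distances between the point and the centroids."""
    sq_dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                for centroid in centroids]
    smallest = min(sq_dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - smallest)) for d in sq_dists]
    total = sum(weights)
    resp = [w / total for w in weights]
    return [[c + eta * r * (p - c) for p, c in zip(point, centroid)]
            for centroid, r in zip(centroids, resp)]
```

A point equidistant from two centroids pulls both equally, while a point sitting on one centroid leaves the distant centroid essentially untouched. An attacker can exploit this by placing poisoned points where the responsibilities favor the centroid it wants to move.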
5.3 Synthetic Data Experiments
We first show a synthetic data experiment where the attack policy can be visualized and understood.
Experiment setup
We let the data generating distribution $P$ be a mixture of two 1-D Gaussians: . The victim learner is online soft k-means with $k = 2$. The victim’s initial parameter is set to and . The attack target model is and . In other words, the attacker wants to drag the victim’s learned parameters in the opposite direction against what the clean data from $P$ would otherwise indicate, namely and . Other hyperparameters are: learning rate , cost regularizer , discounting factor , evaluation length and look-ahead horizon for MPC .
For attack methods that require solving a nonlinear program, including GREEDY, NLP and Clairvoyant, we use the JuMP modeling language dunning2017jump and the IPOPT interior-point solver wachter2006implementation . Following the above specification, we run each attack method on the same data stream and compare their behavior.
Results
Figure 2 shows the empirical discounted cumulative cost $J(t)$ as the attacks go on. On this toy example, the null attack baseline achieves at . The greedy attacker is only slightly effective at . NLP and DDPG achieve and , respectively, almost matching Clairvoyant’s . As expected, the null and clairvoyant attacks form upper and lower bounds on $J(t)$.
Figure 2b-f shows the attack trajectory as learning goes on. Without any attacks (null) the model parameter gradually converges to the true cluster centers and .
The greedy attack only perturbs each data point slightly, moving the blue points up and the red points down. As a result, the victim model parameters are forced away from the true cluster centers, but only slightly. This failure to attack is due to its greedy nature: the immediate cost at each round is indeed minimized, but the model parameters never get close enough to the target centroids. This distance is then penalized in all rounds cumulatively.
In contrast, the optimal control-based attack methods NLP and DDPG exhibit a different strategy in the earlier rounds. They are willing to inject larger perturbations to the data points and sacrifice larger immediate costs in order to drive the victim’s model parameters quickly towards the target centroids. They then only need to stabilize the victim’s parameters near the target centroids with a smaller per-step cost.
The clairvoyant attack strategy exhibits another interesting behavior towards the end of the episode. Because the clairvoyant attacker knows ahead of time that the attack is only going to be evaluated up to , in the last few rounds after the clairvoyant attacker starts to behave greedily. It makes only smaller perturbations and allows the victim model parameters to drift back towards the clean centroids and away from the target centroids. This allows the clairvoyant attacker to gain an edge over the other two methods towards the end of the episode in terms of the cost .
5.4 Real Data Experiments
In the real data experiments, we run each attack method on 10 data sets across two victim learners.
Datasets
We use 5 datasets for online logistic regression: Banknote Authentication (with feature dimension ), Breast Cancer (), Cardiotocography (), Sonar (), and MNIST 1 vs. 7 (), and 5 datasets for online k-means clustering: User Knowledge (), Breast Cancer (), Seeds (), Posture (), and MNIST 1 vs. 7 (). All datasets except for MNIST can be found in the UCI Machine Learning Repository Dua:2019 . Note that two datasets, Breast Cancer and MNIST, are shared across both tasks.
Preprocessing
To reduce the running time, for datasets with dimensionality , we reduce the dimension to via PCA projection. We will discuss the challenges of scalability and potential solutions in Section 6. Then, all datasets are normalized so that each feature has mean 0 and variance 1. Each dataset is then turned into a data stream by random sampling. Specifically, each training data point $z_t$ is sampled uniformly from the dataset with replacement.
Experiment Setup
In order to demonstrate the general applicability of our methods, we draw both the victim’s initial model $\theta_0$ and the attacker’s target $\theta^\dagger$ at random from a standard Gaussian distribution of the appropriate dimension, for both online logistic regression and online k-means in all 10 datasets. Across all datasets, we use the following hyperparameters: . For online logistic regression while for online k-means .
For the DDPG attacker we only perform policy learning at the beginning to obtain a policy; the learned policy is then fixed and used to perform all the attack actions in later rounds. In order to give it a fair chance, we give it a pre-historical dataset of size . For the sake of fair comparison, we give the same pre-historical dataset to NLP as well. Note that NULL, GREEDY and Clairvoyant do not utilize the historical data.
For the NLP attack we set the look-ahead horizon $h$ such that the total runtime to perform attacks does not exceed the DDPG training time, which is 24 hours. This results in for online logistic regression on CTG, Sonar and MNIST, and in all other experiments.
Results
The experiment results are shown in Figure 3. Several consistent patterns emerge from the experiments:

The clairvoyant attacker consistently achieves the lowest cumulative cost across all 10 datasets. This is not surprising, as the clairvoyant attacker has extra information about the future.

The NLP attack achieves clairvoyant-matching performance on all 7 datasets in which it is given a large enough look-ahead horizon, i.e. .

DDPG follows closely behind MPC and Clairvoyant on most of the datasets, indicating that the pre-trained policy can achieve reasonably good attack performance in most cases.

On the 3 datasets where for NLP, DDPG outperforms the short-sighted NLP, indicating that when computational resources are limited, DDPG has an advantage by avoiding the iterative replanning that NLP cannot bypass.

GREEDY does not do well on any of the 10 datasets, achieving only a slightly lower cost than the NULL baseline. This matches our observations in the synthetic experiment.

Each of the attack methods also exhibits strategic behavioral patterns similar to what we observe in the synthetic experiment. In particular, the optimal control-based methods NLP and DDPG sacrifice larger immediate costs in earlier rounds in order to achieve smaller attack costs in later rounds. This is especially obvious in the online logistic regression plots 3b-e, where the cumulative costs rise dramatically in the first 50 rounds, becoming higher than the cost of NULL and GREEDY around that time. This early sacrifice pays off after where the cumulative cost starts to fall much faster. In 3c-e, however, the short-sighted NLP (with ) fails to fully pick up this long-term strategy, and exhibits a behavior close to an interpolation of greedy and optimal. This is not surprising, as NLP with horizon $h = 1$ is indeed equivalent to the GREEDY method. Thus, there is a spectrum of methods between GREEDY and NLP that can achieve various levels of performance with different computational costs.
6 Discussions and Conclusion
Our exercise in identifying the optimal online data poisoning attacks reveals some takeaway messages:
1. Optimal control-based methods NLP and DDPG achieve significantly better performance than heuristic methods such as GREEDY, and in some cases, they even achieve clairvoyant-level performance.
2. When the learner’s dynamics is known to the attacker and is differentiable, and the induced nonlinear program can be solved efficiently, NLP is a strong attack method.
3. DDPG, on the other hand, is able to learn a reasonable attack policy given enough pre-historical data. The attack policy can be fixed and deployed, which is advantageous when the data stream comes in quickly and there is no time to redo planning in MPC/NLP. In our experiments, we did not allow DDPG to refine its policy based on newly available data. This could be one reason that DDPG is slightly behind MPC in most experiments.
4. As mentioned earlier, DDPG can work with a black-box victim learner. All it needs is for the victim to produce its updated model when given a data point. Therefore, even when the victim’s learning algorithm is unknown or not differentiable, DDPG is still a viable option.
There are many interesting directions for future work. Can the attack methods scale to high-dimensional datasets? Even though methods like deep reinforcement learning have been successful in dealing with high-dimensional state spaces such as in the Atari games, a high-dimensional continuous action space is still very challenging. As a concrete example, in lillicrap2015continuous , the highest action dimension they were able to tackle is 12, whereas if we want to perform the attack on the original MNIST dataset, the action space dimensionality is 784. There are possible heuristic ways to manually reduce the action space dimension. One way is to extract the principal components of the high-dimensional data and only allow adversarial examples to lie within the subspace spanned by the first few principal components, as we did. How should one perform the attack if the clean training sequence is not i.i.d., but selectively sampled by the victim learner? This is the case when the victim learner is a sequential decision-maker such as an active learning algorithm, a multi-armed bandit, or a reinforcement learning agent itself. Finally, how should the victim defend against such online poisoning attacks?
To conclude, in this paper we formulated online poisoning attacks as stochastic optimal control. We proposed two high-performing attack algorithms: one based on MPC and nonlinear programming that achieves near-clairvoyant performance, and the other based on reinforcement learning, which learns an attack policy that can be applied without retraining. We compared these attack algorithms with several baseline methods via extensive real-data experiments and found that they achieve consistently better performance.
Acknowledgments
This work is supported in part by NSF 1837132, 1545481, 1704117, 1623605, 1561512, the MADLab AF Center of Excellence FA95501810166, and the University of Wisconsin.
References

(1) Scott Alfeld, Xiaojin Zhu, and Paul Barford. Data poisoning attacks against autoregressive models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
(2) James C Bezdek, Robert Ehrlich, and William Full. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3):191–203, 1984.
(3) Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.
(4) Cody Burkard and Brent Lagesse. Analysis of causative attacks against SVMs learning from data streams. In Proceedings of the 3rd ACM on International Workshop on Security And Privacy Analytics, pages 31–36. ACM, 2017.
(5) Richard H Byrd, Jorge Nocedal, and Richard A Waltz. KNITRO: An integrated package for nonlinear optimization. In Large-Scale Nonlinear Optimization, pages 35–59. Springer, 2006.
(6) Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
(7) Yiding Chen and Xiaojin Zhu. Optimal adversarial attack on autoregressive models. arXiv preprint arXiv:1902.00202, 2019.
(8) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
(9) Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.
(10) Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial Machine Learning. Cambridge University Press, 2018.
(11) Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3644–3653, 2018.
(12) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1885–1894. JMLR.org, 2017.
(13) Laurent Lessard, Xuezhou Zhang, and Xiaojin Zhu. An optimal control approach to sequential machine teaching. arXiv preprint arXiv:1810.06175, 2018.
(14) Bo Li, Yining Wang, Aarti Singh, and Yevgeniy Vorobeychik. Data poisoning attacks on factorization-based collaborative filtering. In Advances in Neural Information Processing Systems, pages 1885–1893, 2016.
(15) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
(16) Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B Smith, James M Rehg, and Le Song. Iterative machine teaching. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2149–2158. JMLR.org, 2017.
(17) Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186–204. Springer, 2018.
(18) Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
(19) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
(20) Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 27–38. ACM, 2017.
(21) Andrew Newell, Rahul Potharaju, Luojie Xiang, and Cristina Nita-Rotaru. On the practicality of integrity attacks on document-level sentiment analysis. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop, pages 83–93. ACM, 2014.
(22) Ayon Sen, Scott Alfeld, Xuezhou Zhang, Ara Vartanian, Yuzhe Ma, and Xiaojin Zhu. Training set camouflage. In International Conference on Decision and Game Theory for Security, pages 59–79. Springer, 2018.
(23) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
(24) Yevgeniy Vorobeychik and Murat Kantarcioglu. Adversarial machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–169, 2018.
(25) Andreas Wächter and Lorenz T Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.
(26) Yizhen Wang and Kamalika Chaudhuri. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.
(27) Huang Xiao, Battista Biggio, Blaine Nelson, Han Xiao, Claudia Eckert, and Fabio Roli. Support vector machines under adversarial label contamination. Neurocomputing, 160:53–62, 2015.
(28) Xiaojin Zhu. An optimal control view of adversarial machine learning. arXiv preprint arXiv:1811.04422, 2018.