1 Introduction
Machine learning is now widely deployed in a broad array of applications. An important class of applications uses machine learning to enable scalable security solutions, such as spam filtering, traffic analysis, and fraud detection [1, 2, 3]. In these applications, reliability of the machine learning system is crucial to ensure service quality and enforce security, but strong incentives exist to reduce learning efficacy (e.g., to bypass spam filters). Indeed, recent research demonstrates that existing systems are vulnerable in the presence of adversaries who can manipulate either the training data (poisoning attacks) or the test data (evasion attacks) [4, 5, 6, 7]. Consequently, an important agenda in machine learning research is to develop learning algorithms that are robust to data manipulation. In this work, we focus on designing supervised learning algorithms that are robust to poisoning attacks.
Existing research on robust machine learning dates back to algorithms for robust PCA [8]. Most of this work assumes that a portion of the underlying dataset is randomly, rather than adversarially, corrupted. Recently, Chen et al. [9] and Feng et al. [10] considered recovery strategies when the corruption is adversarially chosen to achieve some malicious goal; the former considers a robust linear regression model and the latter a robust logistic regression model. However, both make an extremely strong assumption that each feature is sub-Gaussian with vanishing variance and that features are independent, rendering them impractical and severely limiting the scope of the associated theoretical guarantees.
In this work, we propose a novel algorithmic framework for making linear regression robust to data poisoning. Our framework requires neither sub-Gaussian nor independence assumptions on the feature matrix X. Instead, we assume that X is generated through adversarial corruption of an approximately low-rank matrix. Our goal is to make regression which uses dimensionality reduction, such as PCA, robust to adversarial manipulation. The technical challenge is twofold: first, we must ensure that the dimensionality reduction step can reliably recover the low-rank subspace, and second, that the resulting regression performed on the subspace can produce sufficiently accurate predictions, in both cases despite both noise and adversarial examples. While these problems have previously been considered in isolation, ours is the first integrated approach. More significantly, the effectiveness of our approach relies on weaker assumptions than prior art, and, as a result, our proposed practical algorithms significantly outperform state-of-the-art alternatives.
Specifically, we assume that the labels y are a linear function of the true feature matrix with additive zero-mean noise. In addition, the true feature matrix is corrupted with noise, and the adversary subsequently adds a collection of corrupted rows to the training data. In this model, our approach involves two parts: first, we develop a novel robust matrix factorization algorithm which correctly recovers the subspace whenever this is possible, and second, a trimmed principal component regression, which uses the recovered basis and trimmed optimization to estimate the linear model parameters.
Our main contributions are as follows:


Novel algorithm for robust matrix factorization: We develop a novel algorithm that reliably recovers the low-rank subspace of the feature matrix despite both noise (about which we make few assumptions) and adversarial examples. We prove that our algorithm is effective if and only if subspace recovery is possible.

Novel robust regression algorithm with significantly weaker assumptions:
In contrast to prior robust regression work, we require neither feature independence nor a low-variance sub-Gaussian distribution of features. We prove that our algorithm can reliably learn the low-dimensional linear model despite data corruption and noise.

Significant improvement in running time and accuracy: We present efficient algorithms which significantly outperform prior art in running time and prediction efficacy.
Related Work: Robust PCA is widely used as a statistical tool for data analysis and dimensionality reduction that is robust to i.i.d. noise [8]. However, these methods cannot deal with “malicious” corruptions, where sophisticated adversaries can manipulate rows from the subspace of the true feature matrix. In contrast, our approach handles both noise and malicious corruption. Recently, robust learning methods for several models, such as linear and logistic regression, have also been proposed [9, 10]. The limitation of these approaches is their assumption that the feature matrix is sub-Gaussian with vanishing variance and that features are independent. Our approach, in contrast, only assumes that the true feature matrix (prior to corruption) is low rank. Yan et al. [11] proposed an outlier pursuit algorithm to deal with the matrix completion problem with corruptions, and a similar algorithm is applied by Xu et al. [12] to deal with a noisy version of the feature matrix. However, these methods only consider the matrix recovery problem and are not scalable. A more scalable algorithm based on the alternating minimization approach was recently proposed by Rodriguez et al. [13]; however, this method does not consider data noise or corruption. A number of heuristic techniques have also been proposed against poisoning attacks [14, 15, 16] for other problems, such as robust anomaly detection and source identification.
2 Problem Setup and Solution Overview
We start with a pristine training dataset of labeled examples, which subsequently suffers from two types of corruption: noise is added to the feature vectors, and the adversary adds malicious examples (feature vectors and labels) to best mislead the learning. We assume that the adversary has full knowledge of the learning algorithm. The learner’s goal is to learn a model on the corrupted dataset which is similar to the true model. The feature space is high-dimensional, and the learner will perform dimensionality reduction prior to learning. In particular, we assume that the true feature matrix is low-rank with a basis B, and that the true model is the associated low-dimensional linear regression.
Formally, the observed training data is generated as follows:

Ground truth: labels are generated by a linear model applied to the low-dimensional embedding of the true (low-rank) feature matrix.

Noise: the true feature matrix is perturbed by an additive noise matrix N, and each label is perturbed by i.i.d. zero-mean Gaussian noise.

Corruption: the attacker adds adversarially crafted examples (feature vectors and labels), chosen to maximally skew the prediction performance of low-dimensional linear regression.
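To make the generative model concrete, the following sketch instantiates the three steps above. All names, shapes, and parameter values (including the noise level and the random rows standing in for a real poisoning strategy) are hypothetical choices for illustration, not the paper's exact setup:

```python
import numpy as np

# Illustrative sketch of the generative model; all names, shapes, and
# parameter values are hypothetical.
rng = np.random.default_rng(0)
n, d, k, n_adv, sigma = 200, 30, 5, 20, 0.01

# Ground truth: rank-k features and labels linear in the low-dim embedding.
U_true = rng.normal(size=(n, k))      # low-dimensional embedding
B = rng.normal(size=(k, d))           # basis of the row space
X_true = U_true @ B                   # rank-k pristine feature matrix
w_low = rng.normal(size=k)            # low-dimensional linear model
y_true = U_true @ w_low

# Noise: additive zero-mean perturbations of features and labels.
X_noisy = X_true + sigma * rng.normal(size=(n, d))
y_noisy = y_true + sigma * rng.normal(size=n)

# Corruption: the attacker appends crafted rows; random rows stand in
# for a real poisoning strategy here.
X = np.vstack([X_noisy, rng.normal(size=(n_adv, d))])
y = np.concatenate([y_noisy, rng.normal(size=n_adv)])
```

The learner observes only the corrupted pair (X, y); the pristine quantities are shown here only to make the model explicit.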
To formally characterize how well the learner performs in this setting, we define (1) a model function, which maps the corrupted training data to a learned model; (2) a loss function; and (3) a threshold function, increasing in its argument. Our metric is tolerance:
Definition 1 (tolerance).
We say that a learner is tolerant if, for any attacker, the loss of the learned model is bounded by the threshold function with high probability, for some constant governing the probability bound.
In our setting, the learner returns the recovered basis together with the estimated coefficients, and the loss is the expected quadratic loss.
For convenience, we let one index set denote the (unknown) indices of the samples in X coming from the pristine data, and another the indices of the adversarial samples in X. For an index set and a matrix, the corresponding submatrix contains only the rows in that set; similar notation is used for vectors. We define the corruption ratio as the ratio of corrupted to pristine data.
2.1 Solution overview and paper organization
Our goal is to design a learner that estimates the coefficients of the true model using a low-dimensional embedding of a high-dimensional model. We achieve this goal in two steps: (1) recover the subspace basis B of the true feature matrix; (2) project X onto B, and estimate the model using robust principal component regression. The key challenge is that an adversary can design corrupted data to interfere with both steps of the process.
For the first step (Section 3), we develop a robust subspace recovery algorithm which accounts for both the noise N and adversarial examples in correctly recovering the true subspace. We characterize necessary and sufficient conditions for successful subspace recovery, showing that our algorithm succeeds whenever recovery is possible. The challenge in the second step (Section 4) is that the adversary can construct malicious examples from the same subspace as the pristine data, but with a different distribution. To address this, we propose the trimmed principal component regression algorithm, which minimizes the loss function over only a subset of the dataset, ensuring that the adversary has only a limited impact: corrupted samples that deviate too much are simply discarded. Our theoretical results demonstrate that the combined approach is a tolerant learning algorithm. Finally, in Section 5, we present an efficient practical implementation of our methods, which we evaluate in Section 6.
In our analysis, we use the corruption parameter and the rank of the low-dimensional embedding to characterize the theoretical results. In our experiments, however, we show that our techniques only require rough bounds on these parameters to work.
3 Robust Subspace Recovery
In this section, we discuss how to recover the low-rank subspace of the true feature matrix from X. Our goal is to exactly recover the low-rank subspace, i.e., to return a basis for it. We show sufficient and necessary conditions for this problem to be solvable, and provide algorithms for when this is possible. As a warm-up, we first discuss the noise-free version of the problem, and then present our results for the problem with noise. Proofs of the theorems presented in this section can be found in Appendix A. Formally, we consider the following problem:
Problem Definition 1 (Subspace Recovery).
Design an algorithm which takes as input X and returns a set of vectors forming a basis of the true subspace.
3.1 Warm-up: Noise-free Subspace Recovery
We first consider an easier version of Problem 1 without noise, so the pristine rows lie exactly in the true subspace. We assume that we know the rank (or have an upper bound on it). Below we demonstrate that there exists a sharp threshold on the number of adversarial examples such that below this threshold we can solve Problem 1 exactly with high probability, whereas above it Problem 1 cannot be solved. To characterize this threshold, we define the cardinality of the maximal rank-deficient subspace as the optimal value of the following problem:
Intuitively, the adversary can insert samples to form a subspace of its own which does not span the true subspace. The following theorem shows that in this case there is indeed no learner that can successfully recover the true subspace.
Theorem 1.
If the number of adversarial examples reaches this threshold, then there exists an adversary such that no algorithm solves Problem 1 with probability greater than 1/2.
On the other hand, when the number of adversarial examples is below this threshold, we can use the following simple algorithm to recover the true subspace:
In fact, we can prove the following theorem.
Theorems 1 and 2 together give the necessary and sufficient conditions for when Problem 1 is solvable, and Algorithm 1 provides a solution. We further show an implication of these theorems for the corruption ratio (see Appendix A). Combining this with Theorem 1 yields the following upper bound on the corruption ratio.
Corollary 1.
If the corruption ratio exceeds this bound, then Problem 1 cannot be solved.
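To illustrate noise-free recovery, the following brute-force sketch searches for the rank-k row subspace that contains the most rows of X. Below the threshold, the pristine rows outnumber the rows of any adversarial subspace, so the winning subspace is the true one. All names are hypothetical, and the enumeration is exponential in the number of rows, matching the discussion in Section 5:

```python
import numpy as np
from itertools import combinations

def recover_subspace_noise_free(X, k, tol=1e-8):
    """Brute-force noise-free recovery sketch: return a basis of the rank-k
    row subspace containing the largest number of rows of X. Exponential
    time; for illustration only."""
    best_count, best_basis = -1, None
    for idx in combinations(range(X.shape[0]), k):
        sub = X[list(idx)]
        if np.linalg.matrix_rank(sub, tol=tol) < k:
            continue  # these rows do not span a rank-k subspace
        # Orthonormal basis of the candidate subspace (rows of Vt).
        _, _, Vt = np.linalg.svd(sub, full_matrices=False)
        basis = Vt[:k]
        # Count rows lying (numerically) in the candidate subspace.
        resid = X - (X @ basis.T) @ basis
        count = int(np.sum(np.linalg.norm(resid, axis=1) < tol))
        if count > best_count:
            best_count, best_basis = count, basis
    return best_basis

# Tiny example: 8 pristine rank-2 rows plus 2 adversarial rows.
rng = np.random.default_rng(1)
X_star = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
X = np.vstack([X_star, rng.normal(size=(2, 6))])
B_hat = recover_subspace_noise_free(X, k=2)
# The recovered subspace contains every pristine row.
assert np.allclose(X_star - (X_star @ B_hat.T) @ B_hat, 0, atol=1e-6)
```

Here the eight pristine rows share a rank-2 subspace while any subspace touched by an adversarial row contains at most two rows, so the count-maximizing subspace is the true one.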
3.2 Dealing with Noise
We now consider Problem 1 with noise. Before we discuss the adversary, we first need to assume that the uncorrupted version of the problem is solvable. In particular, we assume that the true subspace optimizes the following problem:
(1a)  
(1b) 
Unless otherwise mentioned, we use the Frobenius norm. We put no additional restrictions on N beyond the above. Note that this assumption is implied by the classical PCA problem [17, 18, 19]. We emphasize the optimal value of the above problem, which we call the noise residual. The noise residual is a key component in characterizing the necessary and sufficient condition for the solvability of Problem 1.
Whether the defender can accurately recover the true basis B after the attacker adds malicious instances depends on the attacker’s ability to mislead the defender into thinking that some other basis better represents the data. Intuitively, since the defender does not know the true matrix, its basis, or which rows of the data matrix X are adversarial, this comes down to the ability to identify the rows that correspond to the correct basis (note that it suffices to obtain the correct basis even if some adversarial rows are used, since the adversary may be forced to align malicious examples with the correct basis to evade explicit detection). As we show below, whether the defender can succeed is determined by the relationship between the noise residual and the submatrix residual, which we define as the optimal value of the following problem:
(2a)  
s.t.  (2b)  
(2c) 
We now explain the above optimization problem. The two decision variables are matrices of the appropriate dimensions. One of them is a basis which the attacker “targets”; for convenience, we require this target basis to be orthogonal. Since the attacker succeeds only if they can induce a basis different from the true B, we require that the target basis does not span the true subspace. The optimization problem thus seeks rows of the pristine matrix, indexed by the corresponding index set, and the objective is to minimize the distance between those rows and the span of the target basis.
Solve the following optimization problem and get .
(3) 
return a basis of .
To understand the importance of the submatrix residual, consider Algorithm 2 for recovering the basis B. If the optimal objective value of optimization problem (2) exceeds the noise residual, it follows that the defender can obtain the correct basis B using Algorithm 2, as it yields a better low-rank approximation of X than any other basis. Otherwise, it is indeed possible for the adversary to induce an incorrect choice of basis. The following theorem formalizes this argument.
Theorem 3.
To draw a connection between the noisy case and the noise-free case, we can view Theorems 1 and 2 as special cases of Theorem 3.
Theorem 4.
When , if and only if .
4 Trimmed Principal Component Regression
In this section, we present the trimmed principal component regression (TPCR) algorithm. The key idea is to leverage the principal component regression (PCR) approach to estimate the model, while trimming out malicious samples that attempt to bias the estimator away from the true one. In the following, we present the approach, which is similar to standard PCR, though we do not require computing the exact singular vectors of the feature matrix.
Assume we recover a basis B of the true subspace. Without loss of generality, we assume that B is an orthogonal basis of row vectors. Projecting the pristine feature matrix onto B then yields its low-dimensional representation, and the OLS estimator recovers the low-dimensional regression coefficients.
To estimate the high-dimensional model, we thus convert the estimation problem from the high-dimensional space to the estimation of low-dimensional coefficients, and after estimating them, convert back. Notice that this is similar to principal component regression [20].
However, the adversary may corrupt rows of the projected matrix U to fool the learner into a wrong estimate of the low-dimensional coefficients, and thus of the model. To mitigate this problem, we design Algorithm 3.
Input:


Use Algorithm 2 to compute a basis from X, and orthogonalize it to get B

Project X onto the span space of B and get

Solve the following minimization problem to get
(4) where denotes the th smallest element in sequence .

return .
Intuitively, during the training process, we trim out the top samples that maximize the difference between the observed response and the predicted response. Since the variances of these differences are small for pristine samples (recall from Section 2 that the label noise is zero-mean Gaussian), the samples corresponding to the largest differences are more likely to be adversarial ones.
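The trimming idea above can be sketched as follows. This is a heuristic alternating version (fit, then re-select the rows with smallest residual) rather than the exact trimmed optimization of (4), and all names and parameters are hypothetical:

```python
import numpy as np

def trimmed_pcr(X, y, B, n_keep, n_iter=50):
    """Sketch of trimmed PCR: project X onto the recovered orthonormal basis
    B, then alternate between fitting OLS on the currently kept rows and
    re-keeping the n_keep rows with smallest residual."""
    U = X @ B.T                          # low-dimensional representation
    keep = np.arange(len(y))[:n_keep]    # arbitrary initial subset
    for _ in range(n_iter):
        w, *_ = np.linalg.lstsq(U[keep], y[keep], rcond=None)
        resid = (U @ w - y) ** 2
        keep = np.argsort(resid)[:n_keep]  # trim the largest residuals
    return B.T @ w                       # lift back to high-dimensional model

# Usage sketch: rank-3 features, last 10 labels corrupted.
rng = np.random.default_rng(2)
d, k, n = 12, 3, 100
B = np.linalg.qr(rng.normal(size=(d, k)))[0].T   # orthonormal rows
U = rng.normal(size=(n, k))
X = U @ B
y = U @ rng.normal(size=k)
y[-10:] += 50.0                                   # corrupted labels
w_hat = trimmed_pcr(X, y, B, n_keep=90)
```

As with any alternating trimming scheme, the result can depend on the initial subset; the exact optimization in (4) avoids this at exponential cost.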
Next, we theoretically bound the difference in predictions between our model and the linear regression model learned on the pristine data.
Lemma 1.
Algorithm 3 returns an estimator such that, for any real value, with at least a certain probability for some constant, we have
(5) 
We explain the intuition behind this lemma, and defer the detailed proof to Appendix B. If an adversary wants to fool Algorithm 3, it needs to generate samples whose losses are among the smallest retained values. Since the losses of pristine samples are already bounded, the adversary cannot significantly skew the estimator. In particular, if there is no noise in generating the pristine labels, then the adversary can do nothing, and the estimator coincides with the linear regression estimator on the uncorrupted data.
As an immediate consequence of Lemma 1, we have
Theorem 5.
Given , Algorithm 3 is tolerant.
5 Practical Algorithms
Algorithms 1, 2, and 3 require enumerating subsets of indices, and thus take exponential time. To make our approach practical, we develop efficient implementations of Algorithms 2 and 3.
5.1 Efficient Robust Subspace Recovery
Consider the objective function (3). Since the sought matrix is low-rank, we can rewrite it as a product of two factors U and B of conformable shapes. Therefore, we can rewrite objective (3) as
which is equivalent to
(6) 
where the indexed vectors denote the corresponding rows of X and U, respectively. Objective (6) can be solved using alternating minimization, which iteratively optimizes the objective for U and B while fixing the other. Specifically, in each iteration, we optimize the following two objectives:
Notice that the second step computes the entire U regardless of the submatrix restriction. This is because the entire U is needed to update B. The key challenge is the subset selection in each iteration, which is, again, a trimmed optimization problem. The next section presents a scalable solution for such problems.
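A minimal sketch of this alternating scheme, assuming the basis is kept orthonormal so the U-update reduces to projection; the row re-selection plays the role of the submatrix restriction (names, initialization, and iteration counts are hypothetical, not the paper's exact implementation):

```python
import numpy as np

def robust_subspace_alt_min(X, k, n_keep, n_iter=30, seed=0):
    """Alternating-minimization sketch for objective (6): (i) compute each
    row's coefficients U by least squares (projection, since B is kept
    orthonormal), (ii) keep the n_keep best-explained rows, (iii) refit the
    rank-k basis B from the kept rows via SVD."""
    rng = np.random.default_rng(seed)
    B = np.linalg.qr(rng.normal(size=(X.shape[1], k)))[0].T  # random orthonormal rows
    for _ in range(n_iter):
        U = X @ B.T                          # row-wise least squares
        resid = np.linalg.norm(X - U @ B, axis=1)
        keep = np.argsort(resid)[:n_keep]    # trim worst-fit rows
        _, _, Vt = np.linalg.svd(X[keep], full_matrices=False)
        B = Vt[:k]                           # best rank-k basis for kept rows
    return B
```

In the noise-free case, once the kept set coincides with the pristine rows, the SVD step returns an exact basis of the true subspace and the iteration becomes stationary.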
5.2 Efficient Algorithm for Trimmed Optimization Problems
Both robust subspace recovery and optimizing for (4) rely on solving optimization problems in the form
where computes the prediction over using parameter , and is the loss function. We refer to such problems as trimmed optimization problems. It is easy to see that solving this problem is equivalent to solving
We can use the alternating minimization technique to solve this problem, optimizing alternately over the model parameters and the sample-selection variables. We present this in Algorithm 4. In particular, the algorithm iteratively seeks optimal arguments for each in turn. Optimizing the model parameters is a standard least-squares problem. When optimizing the selection variables, it is easy to see that a sample is dropped exactly when its loss is among the largest; therefore, this step reduces to simple sorting. While this algorithm is not guaranteed to converge to a global optimum, in our evaluation we observe that a random start typically yields near-optimal solutions.
The following theorem shows that the algorithm converges.
Theorem 6.
Given a loss function for which a lower bound exists, Algorithm 4 converges. That is, the sequence of per-iteration losses converges.
The proof can be found in Appendix C.
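For the least-squares instance, Algorithm 4's alternation can be sketched as follows (hypothetical names; per Theorem 6, the recorded losses are monotonically non-increasing, though only a local optimum is guaranteed):

```python
import numpy as np

def trimmed_least_squares(A, y, n_keep, n_iter=50):
    """Sketch of Algorithm 4 on a least-squares instance: alternate between
    re-fitting the parameters on the kept samples and re-selecting the
    n_keep samples with smallest loss (a sorting step)."""
    keep = np.arange(len(y))[:n_keep]    # arbitrary initial subset
    losses = []
    for _ in range(n_iter):
        theta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        per_sample = (A @ theta - y) ** 2
        keep = np.argsort(per_sample)[:n_keep]   # indicator update = sorting
        losses.append(float(per_sample[keep].sum()))
    return theta, losses
```

Each re-fit can only lower the loss on the kept set, and each re-selection can only lower it further, which is exactly the monotonicity argument behind Theorem 6.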
6 Experimental Results
We evaluate the proposed algorithms in comparison with state-of-the-art alternatives on synthetic data. For the subspace recovery problem, we compare to two state-of-the-art approaches: Chen et al. [21] and Xu et al. [12]. We compare the combined TPCR algorithm with the recent robust regression approach [9] and the standard ridge regression algorithm. For most experiments, we fix the data size. The only exception is that, when we evaluate the impact of data size on runtime, we vary it from 1,000 to 8,000. We also vary the rank of the pristine matrix and the number of corruptions. Results are averages of 30 runs after dropping the largest and smallest values.
Data: For a given rank, we generate the pristine matrix as follows: sample two factor matrices such that each element is sampled independently from a standard Gaussian, ensuring that both have the desired rank, and take their product. Next, we generate the corruptions also as a low-rank matrix, by generating factors as above. For the adversarial basis, we set the first half of its rows by choosing rows of the true basis, and generate the remaining rows randomly, ensuring full rank. We then concatenate the pristine and adversarial matrices and shuffle the rows. We do not add noise to the pristine matrix unless explicitly stated. In doing so, we know that the adversarial matrix shares a common subspace with the pristine one, but the two subspaces are still different. This approach to data generation is significantly more adversarial than prior methods of generating adversarial instances, as we show below. To generate labels, we generate a random model, and y is then constructed as described in Section 2, where we apply the method of Xiao et al. [4] to create adversarial labels.
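The data construction above can be sketched as follows; the exact ranks, sizes, and the fraction of shared basis rows are hypothetical stand-ins for the values used in our experiments:

```python
import numpy as np

def make_adversarial_lowrank(n, n_adv, d, k, k_shared, rng):
    """Sketch of the synthetic data construction: pristine features form a
    random rank-k matrix; corruptions form another low-rank matrix whose
    basis shares k_shared rows with the true basis, so the two subspaces
    overlap but differ."""
    B = rng.normal(size=(k, d))                       # true basis
    X_star = rng.normal(size=(n, k)) @ B              # pristine rank-k rows
    # Adversarial basis: first k_shared rows copied from B, rest random.
    B_adv = np.vstack([B[:k_shared], rng.normal(size=(k - k_shared, d))])
    X_adv = rng.normal(size=(n_adv, k)) @ B_adv       # corrupted rows
    X = np.vstack([X_star, X_adv])
    rng.shuffle(X)                                    # shuffle the rows
    return X_star, X_adv, X
```

Because the two bases share only k_shared rows, the combined row space has dimension 2k - k_shared, so the corruptions are aligned with, but not contained in, the true subspace.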
Runtime: Figures 1(a) and (b) present the runtime comparison results. Our approach is implemented in Scala without any special optimization, while both [21] and [12] are implemented in Matlab, leveraging Matlab’s built-in optimizations for matrix operations. In Figure 1(a), we vary the rank from 1 to 20 with the data size fixed. Our algorithm is significantly faster than [21] and [12]. Figure 1(b) shows runtime as a function of data size, with the rank and number of corruptions fixed to 20 and 50, respectively. We can observe that the scalability of [21] and [12] is quite poor in the size of the problem. In contrast, our algorithm remains quite efficient, with low running time in all cases.
Identification rate of corrupted rows: Figure 1(c) presents the percentage of all corrupted rows identified by the algorithms. We generate the data with a fixed intrinsic rank, varying the number of inserted corruptions from 10 to 150 while keeping the total data size fixed. We evaluate our approach varying the rank parameter from 10 to 20 and the subset-size parameter over a range of values. The results show that our approach achieves 100% accuracy regardless of the chosen parameters, as long as the rank parameter is no less than the intrinsic rank and the subset-size parameter is no bigger than the number of pristine rows. We also compare our approach with prior work, [12] and [21], whose results are identical in this experiment; we refer to both as Xu et al. [12]. We can observe that their identification rate plummets as the number of corruptions grows, even though only 5% of the rows are corrupted.
Figure 1: (a), (b) runtime comparisons; (c) identification rate of corrupted rows.
Error on noisy data: We add noise to the pristine matrix to evaluate performance on noisy data. Since [21] cannot handle noise, we only compare with [12]. To each element of the pristine matrix, we add noise sampled from a zero-mean Gaussian. Figures 2(a) and 2(b) show the RMSE of the difference between the recovered and the true matrix; this metric is used by [12] as well. We use grayscale to denote the RMSE: lighter color corresponds to smaller RMSE. On most test cases our algorithm successfully recovers the true subspace, while [12] fails in most cases. In particular, for small noise levels, our approach completely recovers the underlying low-rank matrix. When the noise increases, the solvability condition might not hold, and Theorem 3 says that no algorithm can recover the true subspace with probability greater than 1/2. However, this theorem does not prevent our algorithm from succeeding with smaller probability, which is why we observe several white spots in the high-noise regime.
Figure 2: (a) TPCR; (b) Xu et al. [12]; (c) robust regression comparison.
Robust Regression: We evaluate our TPCR Algorithm 3 against the robust regression method of Chen et al. [9], which is the only alternative method for linear regression robust to adversarial data poisoning. As a baseline, we also present results for standard OLS linear regression with and without adversarial instances (LR(O+A) and LR(O), respectively). Results are evaluated using a ground-truth test set not used for training. The results, shown in Figure 2(c), demonstrate that our algorithm significantly outperforms the alternatives. Indeed, while the robust regression of Chen et al. [9] does better than the non-robust baseline, our method works nearly as well as linear regression without adversarial instances.
7 Conclusion
This paper considers poisoning attacks on linear regression with dimensionality reduction. We address the problem in two steps: (1) we introduce a novel robust matrix factorization method to recover the true subspace, and (2) a novel robust principal component regression method to prune adversarial instances based on the basis recovered in step (1). We characterize necessary and sufficient conditions for our approach to successfully recover the true subspace, and present a bound on the expected prediction loss compared to ground truth. Experimental results suggest that the proposed approach is extremely effective, and significantly outperforms prior art.
References
 (1) Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sakkis, Constantine D Spyropoulos, and Panagiotis Stamatopoulos. Learning to filter spam email: A comparison of a naive bayesian and a memory-based approach. arXiv preprint cs/0009009, 2000.
 (2) Philip K Chan and Salvatore J Stolfo. Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In KDD, volume 1998, pages 164–168, 1998.
 (3) Salvatore Stolfo, David W Fan, Wenke Lee, Andreas Prodromidis, and P Chan. Credit card fraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.

 (4) Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. Is feature selection secure against training data poisoning? In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1689–1698, 2015.
 (5) Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 641–647. ACM, 2005.
 (6) Bo Li and Yevgeniy Vorobeychik. Feature crosssubstitution in adversarial classification. In Advances in Neural Information Processing Systems, pages 2087–2095, 2014.
 (7) Bo Li and Yevgeniy Vorobeychik. Scalable optimization of randomized operational decisions in adversarial classification settings. In AISTATS, 2015.

 (8) Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
 (9) Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust high dimensional sparse regression and matching pursuit. arXiv preprint arXiv:1301.2725, 2013.
 (10) Jiashi Feng, Huan Xu, Shie Mannor, and Shuicheng Yan. Robust logistic regression and classification. In Advances in Neural Information Processing Systems, pages 253–261, 2014.
 (11) Ming Yan, Yi Yang, and Stanley Osher. Exact low-rank matrix completion from sparsely corrupted entries via adaptive outlier pursuit. Journal of Scientific Computing, 56(3):433–449, 2013.
 (12) Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust pca via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.
 (13) Paul Rodriguez and Brendt Wohlberg. Fast principal component pursuit via alternating minimization. In Image Processing (ICIP), 2013 20th IEEE International Conference on, pages 69–73. IEEE, 2013.
 (14) Mauro Barni and Benedetta Tondi. Source distinguishability under corrupted training. In Information Forensics and Security (WIFS), 2014 IEEE International Workshop on, pages 197–202. IEEE, 2014.
 (15) Benjamin IP Rubinstein, Blaine Nelson, Ling Huang, Anthony D Joseph, Shinghon Lau, Satish Rao, Nina Taft, and JD Tygar. Antidote: understanding and defending against poisoning of anomaly detectors. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, pages 1–14. ACM, 2009.

 (16) Battista Biggio, Igino Corona, Giorgio Fumera, Giorgio Giacinto, and Fabio Roli. Bagging classifiers for fighting poisoning attacks in adversarial classification tasks. In Multiple Classifier Systems, pages 350–359. Springer, 2011.
 (17) Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
 (18) Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
 (19) Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.
 (20) Ian T Jolliffe. A note on the use of principal components in regression. Applied Statistics, pages 300–303, 1982.
 (21) Yudong Chen, Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust matrix completion and corrupted columns. In Proc. of ICML 11, 2011.
Appendix A Proof of Theorems in Section 3
Since this section does not involve y, we omit it without loss of clarity.
A.1 Theorem 1
Proof of Theorem 1.
We prove by contradiction. Assume we have a learner that can solve Problem 1 with probability more than 1/2. We want to show that there exist two different subspaces of the required rank, and one input X, such that the learner should return both spaces with probability greater than 1/2, which is impossible. In the following, we construct two such spaces; in particular, we discuss how the adversary can manipulate the matrix.
The adversary chooses the maximal subset of pristine rows spanning a rank-deficient subspace, as in the definition of the threshold.
Suppose we fix a set of basis vectors for the row space of this submatrix. The adversary then chooses a vector orthogonal to them, so the span of the augmented basis differs from the true subspace. The adversary then draws samples from this augmented span and inserts them into the pristine data to form X. Moreover, consider the submatrix whose first rows are the chosen pristine rows and whose remaining rows are the adversarial samples; it is also a submatrix of X, and at most a threshold number of rows of X lie outside it.
In doing so, X can be seen as constructed by corrupting the pristine matrix; on the other hand, X can equally be seen as the result of corrupting the alternative submatrix by inserting rows. Therefore, the learner should return both subspaces, each with probability greater than 1/2, which is impossible. Therefore, our conclusion holds true. ∎
A.2 Theorem 2
Proof of Theorem 2.
We show that Algorithm 1 recovers the true subspace exactly. Assume B is returned by Algorithm 1 on X. We only need to show that B is a basis of the true subspace. By Algorithm 1, we know that B is a basis of a large set of rows of X. Since any adversary can corrupt at most a bounded number of rows, combining this with the threshold assumption, we know that
(7) 
Therefore, we know B is a basis for the pristine rows. By the definition of the threshold and inequality (7), we conclude that the recovered span is exactly the same subspace as the true one, and thus B is the basis of the true subspace. ∎
A.3 Corollary 1
Lemma 2.
Proof.
We can choose the , then we have . Therefore, . ∎
Now, we can prove Corollary 1.
A.4 Theorem 3
Proof of Theorem 3.
The proof of this theorem is similar to the proofs of Theorems 1 and 2. First, when the submatrix residual does not exceed the noise residual, the adversary can construct X such that two subspaces should each be recovered with probability greater than 1/2. In particular, consider the minimizers of objective (2). The adversary samples rows from the span of the target basis which do not belong to the true subspace, and adds a small perturbation so that (1) the perturbed rows minimize their distance to the target span, and (2) the construction is consistent with the noise budget. The adversary then inserts these rows into the pristine data to get X. In this case, the pristine rows optimize their distance from the true basis, while the concatenation of the selected pristine rows and the adversarial rows optimizes its distance from the target basis. Further, by definition, both of these distances are equal. Therefore, the learner should recover from X both the true and the target subspaces, each with probability greater than 1/2. This is impossible, so the first part of the theorem holds true.
For the second part, we follow the proof of Theorem 2 and present only the differences. We show that Algorithm 2 recovers the true subspace exactly. Assume B is returned by Algorithm 2 on X. We only need to show that B is a basis of the true subspace. By Algorithm 2, we know that B optimizes its span distance from a subset of rows of X. Since any adversary can corrupt at most a bounded number of rows, we know that
(8) 
If B is not a basis of the true subspace, then the distance between the span of B and the pristine rows is greater than the submatrix residual. This is impossible, since Algorithm 2 guarantees that this distance is no greater than the noise residual, a contradiction. Therefore, the second part of the theorem holds true. ∎
A.5 Theorem 4
Appendix B Proof of Lemma 1
Proof.
Assume we are given the solution of this optimization problem, and suppose the adversary wants to induce the regression system to compute a particular estimator. In this case, it has to corrupt rows among those retained. W.l.o.g. we can assume a suitable normalization. Since the pristine losses are bounded, we know that
Since the solution optimizes Eq. (4), we may assume the retained values are the smallest losses.
Then we have
Therefore we have
(9) 
Further, we know
(10)  
According to the Cauchy–Schwarz inequality, we have
We assume , then, we have
Substituting this inequality into (10) and combining with (9), we have
By simple rearrangement, we have
Since we know , we know that for any parameter , we have for some constant . Therefore, we know, with high probability (at least ), we have
Therefore, we have
and thus
Therefore, we know
We notice that the right-hand side of the above inequality does not depend on the free parameter. Therefore, taking the limit and applying the law of large numbers, we obtain an inequality whose left-hand side is the expected loss. The conclusion of Lemma 1 is then a simple rearrangement of the above inequality. ∎
Appendix C Proof of Theorem 6
We consider
and
According to the algorithm, it is easy to see that
Therefore, the loss sequence is monotonically non-increasing. Since a lower bound on the sequence exists, we take its infimum as the limit candidate. Therefore, we know that
Further, we know that
and thus
Therefore, we have
Q.E.D.