Robust High-Dimensional Linear Regression

The effectiveness of supervised learning techniques has made them ubiquitous in research and practice. In high-dimensional settings, supervised learning commonly relies on dimensionality reduction to improve performance and identify the most important factors in predicting outcomes. However, the economic importance of learning has made it a natural target for adversarial manipulation of training data, which we term poisoning attacks. Prior approaches to robust supervised learning rely on strong assumptions about the nature of the feature matrix, such as feature independence and sub-Gaussian noise with low variance. We propose an integrated method for robust regression that relaxes these assumptions, assuming only that the feature matrix can be well approximated by a low-rank matrix. Our techniques integrate improved robust low-rank matrix approximation and robust principal component regression, and yield strong performance guarantees. Moreover, we experimentally show that our methods significantly outperform the state of the art both in running time and prediction error.

1 Introduction

Machine learning has come to be widely deployed in a broad array of applications. An important class of applications of machine learning uses it to enable scalable security solutions, such as spam filtering, traffic analysis, and fraud detection androutsopoulos2000learning ; chan1998toward ; stolfo1997credit . In these applications, reliability of the machine learning system is crucial to ensure service quality and enforce security, but strong incentives exist to reduce learning efficacy (e.g., to bypass spam filters). Indeed, recent research demonstrates that existing systems are vulnerable in the presence of adversaries who can manipulate either the training data (i.e., the poisoning attack) or the test data (i.e., the evasion attack) xiao2015feature ; lowd2005adversarial ; li2014feature ; li2015scalable . Consequently, an important agenda in machine learning research is to develop learning algorithms that are robust to data manipulation. In this work, we focus on designing supervised learning algorithms that are robust to poisoning attacks.

Existing research on robust machine learning dates back to algorithms for robust PCA robustpca . Most of them assume that a portion of the underlying dataset is randomly, rather than adversarially, corrupted. Recently, Chen et al. chen2013robust and Feng et al. feng2014robust considered recovery strategies when the corruption is adversarially chosen to achieve some malicious goal. The former considers robust linear regression and the latter robust logistic regression. However, both make the extremely strong assumptions that each feature is sub-Gaussian with vanishing variance and that features are independent, rendering them impractical and severely limiting the scope of the associated theoretical guarantees.

In this work, we propose a novel algorithmic framework for making linear regression robust to data poisoning. Our framework does not require either sub-Gaussian or independence assumptions on the feature matrix X. Instead, we assume that X is generated through adversarial corruption of an approximately low-rank matrix. Our goal is to make regression which uses dimensionality reduction, such as PCA, robust to adversarial manipulation. The technical challenge is two-fold: first, we must make sure that the dimensionality reduction step can reliably recover the low-rank subspace, and second, that the resulting regression performed on the subspace can recover sufficiently accurate predictions, in both cases despite both noise and adversarial examples. While these problems have previously been considered in isolation, ours is the first integrated approach. More significantly, the effectiveness of our approach relies on weaker assumptions than prior art, and, as a result, our proposed practical algorithms significantly outperform state-of-the-art alternatives.

Specifically, we assume that labels y are a linear function of the true feature matrix with additive zero-mean noise. In addition, the true feature matrix is corrupted with noise, and the adversary subsequently adds a collection of corrupted rows to the training data. In this model, our approach involves two parts: first, we develop a novel robust matrix factorization algorithm which correctly recovers the subspace whenever this is possible, and second, a trimmed principal component regression, which uses the recovered basis and trimmed optimization to estimate linear model parameters.

Our main contributions are as follows:

• Novel algorithm for robust matrix factorization: We develop a novel algorithm that reliably recovers the low-rank subspace of the feature matrix despite both noise (about which we make few assumptions) and adversarial examples. We prove that our algorithm is effective iff subspace recovery is possible.

• Novel robust regression algorithm with significantly weaker assumptions: In contrast to prior robust regression work, we do not require either feature independence or low-variance sub-Gaussian distribution of features. We prove that our algorithm can reliably learn the low-dimensional linear model despite data corruption and noise.

• Significant improvement in running time and accuracy: We present efficient algorithms which significantly outperform prior art in running time and prediction efficacy.

Related Work: Robust PCA is widely used as a statistical tool for data analysis and dimensionality reduction that is robust to i.i.d. noise robustpca . However, these methods cannot deal with “malicious” corruptions, where sophisticated adversaries can manipulate rows from the subspace of the true feature matrix. In contrast, our approach handles both noise and malicious corruption. Recently, robust learning methods for several models, such as linear and logistic regression, have also been proposed chen2013robust ; feng2014robust . The limitation of these approaches is their assumption that the feature matrix is sub-Gaussian with vanishing variance, and that features are independent. Our approach, in contrast, only assumes that the true feature matrix (prior to corruption) is low rank. Yan et al. proposed an outlier pursuit algorithm to deal with the matrix completion problem with corruptions yan2013exact , and a similar algorithm is applied by Xu et al. to deal with the noisy version of the feature matrix xu2010 . However, these methods only consider the matrix recovery problem and are not scalable. A more scalable algorithm based on the alternating minimization approach was recently proposed by Rodriguez et al. rodriguez2013fast ; however, this method does not consider data noise or corruption. A number of heuristic techniques have also been proposed for poisoning attacks barni2014source ; rubinstein2009antidote ; biggio2011bagging for other problems, such as robust anomaly detection and source identification.

2 Problem Setup and Solution Overview

We start with the pristine training data, which subsequently suffers from two types of corruption: noise is added to feature vectors, and the adversary adds $n_1$ malicious examples (feature vectors and labels) to best mislead the learning. We assume that the adversary has full knowledge of the learning algorithm. The learner’s goal is to learn a model on the corrupted dataset which is similar to the true model. The feature space is high-dimensional, and the learner will perform dimensionality reduction prior to learning. In particular, we assume that the true feature matrix $X_\star$ is low-rank with a basis B, and we assume that the true model is the associated low-dimensional linear regression.

Formally, observed training data is generated as follows:

1. Ground truth: , where is the true model, is its low-dimensional representation, and is the low-dimensional embedding of .

2. Noise: , where N is a noise matrix with ; , where is i.i.d. zero-mean Gaussian noise with variance .

3. Corruption: the adversary adds malicious rows (feature vectors and labels) to the training data, which maximally skews prediction performance of low-dimensional linear regression.
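The three generation steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `make_poisoned_data`, the sizes, the noise scales, and in particular the random "adversary" are hypothetical placeholders. The attacker in the text chooses rows that maximally skew the regression, which this sketch does not attempt to do.

```python
import numpy as np

def make_poisoned_data(n=50, m=20, k=3, n1=5, noise_scale=0.01, sigma=0.1, seed=0):
    """Toy version of the data model: low-rank truth, noise, then appended rows."""
    rng = np.random.default_rng(seed)

    # Ground truth: a rank-k feature matrix and labels that are linear in it.
    U_star = rng.normal(size=(n, k))          # low-dimensional embedding
    B_star = rng.normal(size=(k, m))          # basis of the row space
    X_star = U_star @ B_star                  # true (pristine) feature matrix
    beta_star = rng.normal(size=m)            # true linear model

    # Noise: perturb the features, add zero-mean Gaussian noise to the labels.
    X0 = X_star + noise_scale * rng.normal(size=(n, m))
    y0 = X_star @ beta_star + sigma * rng.normal(size=n)

    # Corruption (placeholder): a real adversary would optimise these rows.
    X_adv = rng.normal(size=(n1, m))
    y_adv = rng.normal(size=n1)

    # The learner only sees the shuffled union of pristine and malicious rows.
    X = np.vstack([X0, X_adv])
    y = np.concatenate([y0, y_adv])
    perm = rng.permutation(n + n1)
    return X[perm], y[perm], X_star, beta_star
```

The returned `X` has `n + n1` rows, matching the index ranges used in the trimmed objectives of Section 5.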

To formally characterize how well the learner performs in this setting, we define (1) a model function which is the model learned on ; (2) a loss function ; and (3) a threshold function which takes as input , and is increasing in . Our metric is -tolerance:

Definition 1 ((f,l,δ)-tolerance).

We say that learner is -tolerant, if for any attacker, and any , we have

with probability at least

, for some constant .

In our setting, return and is expected quadratic loss .

For convenience, we let denote the set of (unknown) indices of the samples in X coming from and the set of indices for adversarial samples in X. For an index set and matrix , denotes the sub-matrix containing only rows in ; similar notation is used for vectors. We define as the corruption ratio, or the ratio of corrupted and pristine data.

2.1 Solution overview and paper organization

Our goal is to design a learner to estimate the coefficients of the true model using low-dimensional embedding of a high-dimensional model. We achieve this goal in two steps: (1) recover the subspace B of ; (2) project X onto B, and estimate using robust principal component regression. The key challenge is that an adversary can design corrupted data to interfere both with the first and second step of the process.

For the first step (Section 3), we develop a robust subspace recovery algorithm which can account for both noise N and adversarial examples in correctly recovering the subspace of . We characterize necessary and sufficient conditions for successful subspace recovery, showing that our algorithm succeeds whenever recovery is possible. The challenge in the second step (Section 4) is that the adversary can construct from the same subspace as , but with a different distribution of from . To address this, we propose the trimmed principal component regression algorithm, which minimizes the loss function over only a subset of the dataset. This ensures that the adversary can have only a limited impact by adding arbitrary corrupted samples without these instances being discarded. Our theoretical results demonstrate that the combined approach is a -tolerant learning algorithm. Finally, in Section 5, we present an efficient practical implementation of our methods, which we evaluate in Section 6.

In our analysis, we use the corruption parameter and the rank of the low-dimensional embedding to characterize the theoretical results. In our experiments, however, we show that we only require a lower bound on and an upper bound on for our techniques to work.

3 Robust Subspace Recovery

In this section, we discuss how to recover the low-rank subspace of from X. Our goal is to exactly recover the low-rank subspace, i.e., to return a basis for . We show necessary and sufficient conditions for this problem to be solvable, and provide algorithms for the cases where it is. As a warmup, we first discuss the noise-free version of the problem, and then present our results for the problem with noise. Proofs of the theorems presented in this section can be found in Appendix A. Formally, we consider the following problem:

Problem Definition 1 (Subspace Recovery).

Design an algorithm , which takes as input X, and returns a set of vectors which form the basis of .

3.1 Warmup: Noise-free Subspace Recovery

We first consider an easier version of Problem 1 with . In this case, we know that . We assume that we know (or have an upper bound on it). Below we demonstrate that there exists a sharp threshold on such that whenever , we can solve Problem 1 exactly with high probability, whereas if , Problem 1 cannot be solved. To characterize this threshold, we define the cardinality of the maximal rank subspace as the optimal value of the following problem:

$$
\max_{I} \; |I| \quad \text{s.t.} \quad \operatorname{rank}\left(X_\star^{I}\right) \le k-1
$$

Intuitively, the adversary can insert samples to form a rank subspace, which does not span . The following theorem shows that in this case, there is indeed no learner that can successfully recover the subspace of .
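For very small matrices, the optimization problem above can be evaluated by brute force, which may help build intuition for the threshold. The sketch below assumes NumPy; the function name `max_rank_subspace_cardinality` is a hypothetical label, and the enumeration is exponential in the number of rows, so it is not meant for real data.

```python
from itertools import combinations

import numpy as np

def max_rank_subspace_cardinality(X_star, k):
    """Largest |I| such that the rows X_star[I] have rank at most k - 1."""
    n = X_star.shape[0]
    # Try subset sizes from largest to smallest; the first hit is the maximum.
    for size in range(n, 0, -1):
        for idx in combinations(range(n), size):
            if np.linalg.matrix_rank(X_star[list(idx)]) <= k - 1:
                return size
    return 0  # only the empty index set satisfies the rank constraint

# Three rows spanning a rank-2 space; two of them are collinear.
X_star = np.array([[1.0, 0.0],
                   [2.0, 0.0],
                   [0.0, 1.0]])
print(max_rank_subspace_cardinality(X_star, k=2))  # prints 2
```

In this toy example the two collinear rows form the largest subset that stays inside a rank-1 subspace.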

Theorem 1.

If , then there exists an adversary such that no algorithm solves Problem 1 with probability greater than .

On the other hand, when is below this threshold, we can use the following simple algorithm to recover the subspace of :

In fact, we can prove the following theorem.

Theorem 2.

If , then Algorithm  1 solves Problem 1 for any adversary.
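The listing of Algorithm 1 did not survive in this copy of the text. Based on the proof in Appendix A, which works with an index set of size $n$ whose rows have rank $k$, a plausible reading is an exhaustive search of the following kind. Treat this as an assumption-laden sketch rather than the authors' algorithm; the function name `recover_subspace_noise_free` and the tolerance `tol` are hypothetical.

```python
from itertools import combinations

import numpy as np

def recover_subspace_noise_free(X, n, k, tol=1e-9):
    """Search for n rows of X that lie in a rank-k subspace; return its basis."""
    for idx in combinations(range(X.shape[0]), n):
        rows = X[list(idx)]
        if np.linalg.matrix_rank(rows, tol=tol) == k:
            # The top-k right singular vectors span the row space of `rows`.
            _, _, vt = np.linalg.svd(rows, full_matrices=False)
            return vt[:k], idx
    return None, None

# One adversarial row (index 0) plus four pristine rows in span{e1, e2}.
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 0.0]])
basis, idx = recover_subspace_noise_free(X, n=4, k=2)
```

Here $n_1 = 1$ and no two pristine rows are collinear, so $n - n_1 = 3$ exceeds the maximal rank-$(k-1)$ subset size of 1, which is the condition used in inequality (7) of the appendix.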

Theorems 1 and 2 together give the necessary and sufficient conditions on when Problem 1 is solvable, and Algorithm 1 provides a solution. We further show an implication of these theorems on the corruption ratio . We can prove that (see Appendix A). Combining this with Theorem 1, we can have the following upper bound on .

Corollary 1.

If , then Problem 1 cannot be solved.

3.2 Dealing with Noise

We now consider Problem 1 with noise. Before we discuss the adversary, we first need to assume that the uncorrupted version is solvable. In particular, we assume that optimizes the following problem:

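$$
\begin{aligned}
\min_{X'} \quad & \|X_0 - X'\| && \text{(1a)} \\
\text{s.t.} \quad & \operatorname{rank}(X') \le k. && \text{(1b)}
\end{aligned}
$$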

Unless otherwise mentioned, we use $\|\cdot\|$ to denote the Frobenius norm. We put no additional restrictions on N except the one above. Note that this assumption is implied by the classical PCA problem eckart1936approximation ; hotelling1933analysis ; jolliffe2002principal . We draw attention to the optimal value of the above problem, which we call the noise residual and denote as . The noise residual is a key component in characterizing the necessary and sufficient condition for the solvability of Problem 1.
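By the Eckart-Young theorem cited above, the optimal value of problem (1) in the Frobenius norm is determined by the singular values of $X_0$ beyond the $k$-th. A minimal NumPy sketch follows; the function names `noise_residual` and `best_rank_k` are hypothetical labels chosen for this illustration.

```python
import numpy as np

def noise_residual(X0, k):
    """Optimal value of problem (1): Frobenius distance to the best rank-k matrix."""
    s = np.linalg.svd(X0, compute_uv=False)   # singular values, descending
    return float(np.sqrt(np.sum(s[k:] ** 2)))

def best_rank_k(X0, k):
    """A minimiser of problem (1), obtained by truncating the SVD of X0."""
    u, s, vt = np.linalg.svd(X0, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k]

X0 = np.diag([3.0, 2.0, 1.0])
print(noise_residual(X0, k=2))                        # 1.0 up to rounding
print(np.linalg.norm(X0 - best_rank_k(X0, k=2)))      # same value
```

`np.linalg.norm` on a 2-D array defaults to the Frobenius norm, which matches the norm convention stated in the text.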

Characterization of the defender’s ability to accurately recover the true basis B of after the attacker adds malicious instances stems from the attacker’s ability to mislead the defender into thinking that some other basis, , better represents . Intuitively, since the defender does not know , , or which rows of the data matrix X are adversarial, this comes down to the ability to identify the rows that correspond to the correct basis (note that it will suffice to obtain the correct basis even if some adversarial rows are used, since the adversary may be forced to align malicious examples with the correct basis to evade explicit detection). As we show below, whether the defender can succeed is determined by the relationship between the noise residual and sub-matrix residual, denoted as , which we define as the value optimizing the following problem:

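$$
\begin{aligned}
\min_{I,\bar{B},U} \quad & \|X_0^{I} - U\bar{B}\| && \text{(2a)} \\
\text{s.t.} \quad & \operatorname{rank}(\bar{B}) = k,\quad \bar{B}\bar{B}^{T} = I_k,\quad X_\star \bar{B}^{T}\bar{B} \ne X_\star && \text{(2b)} \\
& I \subseteq \{1,2,\dots,n\},\quad |I| = n - n_1. && \text{(2c)}
\end{aligned}
$$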

We now explain the above optimization problem. The matrix $U$ has $n - n_1$ rows and $k$ columns, and $\bar{B}$ has $k$ rows, one per basis vector. Here $\bar{B}$ is a basis which the attacker “targets”; for convenience, we require $\bar{B}$ to be orthogonal (i.e., $\bar{B}\bar{B}^{T} = I_k$). Since the attacker succeeds only if they can induce a basis different from the true B, we require that $\bar{B}$ does not span the row space of $X_\star$, which is equivalent to saying $X_\star \bar{B}^{T}\bar{B} \ne X_\star$. Thus, this optimization problem seeks $n - n_1$ rows of $X_0$, where $I$ is the corresponding index set. The objective is to minimize the distance between $X_0^{I}$ and the span of the target basis $\bar{B}$ (i.e., $\|X_0^{I} - U\bar{B}\|$).

To understand the importance of , consider Algorithm 2 for recovering the basis of , B. If the optimal objective value of optimization problem (2), , exceeds the noise , it follows that the defender can obtain the correct basis B using Algorithm 2, as it yields a better low-rank approximation of X than any other basis. Else, it is, indeed, possible for the adversary to induce an incorrect choice of basis. The following theorem formalizes this argument.

Theorem 3.

If , then no algorithm recovers the exact subspace of with probability greater than . If , then Algorithm 2 solves Problem 1.

To draw a connection between the noisy case and the noise-free case, we can view Theorems 1 and 2 as special cases of Theorem 3.

Theorem 4.

When , if and only if .

4 Trimmed Principal Component Regression

In this section, we present the trimmed principal component regression (T-PCR) algorithm. The key idea is to leverage the principal component regression (PCR) approach to estimate , while trimming out malicious samples that try to pull the estimator away from the true one. In the following, we present the approach, which is similar to the standard PCR approach, though we do not require computing the exact singular vectors of .

Assume we recover a basis B of . Without loss of generality, we assume that B is an orthogonal basis of row vectors. Since B is a basis for , we assume . Then we know that, by optimization (1), . We compute , and, by definition, we know . By OLS estimator, we know that , and thus .

To estimate , we assume . Since , we convert the estimation problem of from a high dimensional space to the estimation problem of from a low dimensional space, such that . After estimating for , we can convert it back to get . Notice that this is similar to principal component regression jolliffe1982note .
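The reduction described above (project onto the recovered basis, regress in the low-dimensional space, then map the coefficients back) can be sketched without any trimming as follows. This assumes B is stored as a $k \times m$ array with orthonormal rows, as stated in the text; the function name `pcr_fit` and the specific mapping `B.T @ beta_low` are assumptions of this sketch, since the corresponding formulas are missing from this copy.

```python
import numpy as np

def pcr_fit(X, y, B):
    """Principal-component-style regression on a given orthonormal row basis B."""
    U = X @ B.T                                   # low-dimensional embedding of X
    beta_low, *_ = np.linalg.lstsq(U, y, rcond=None)   # OLS in k dimensions
    return B.T @ beta_low                         # coefficients in the original space

# Rows of X lie in span{e1, e2}; the labels follow beta = (2, -1, 0).
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])
y = X @ np.array([2.0, -1.0, 0.0])
beta_hat = pcr_fit(X, y, B)
```

Because the regression runs on $k$ columns instead of the full feature dimension, the least-squares problem stays well posed even when X itself is rank deficient.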

However, the adversary may corrupt rows in U to fool the learner into making a wrong estimate of , and thus of . To mitigate this problem, we design Algorithm 3.

Intuitively, during the training process, we trim out the top samples that maximize the difference between the observed response and the predicted response , where denotes the -th row of . Since we know the variances of these differences are small (i.e., recall Section 2, is the variance of the random noise ), these samples corresponding to the largest differences are more likely to be the adversarial ones.

Next, we theoretically bound the prediction differences between our model and the linear regression model learnt on .

Lemma 1.

Algorithm 3 returns , such that for any real value with at least probability for some constant , we have

$$
\mathbb{E}_{x}\left[\left(x\left(\hat{\beta} - \beta_\star\right)\right)^{2}\right] \le 4\sigma^{2}\left(1 + \sqrt{\frac{1}{1-\gamma}}\right)^{2} \log c \qquad \text{(5)}
$$

We explain the intuition behind this lemma, and defer the detailed proof to Appendix B. If an adversary wants to fool Algorithm 3, it needs to generate samples , such that the loss function is among the smallest . Since for samples from , their loss values are already bounded according to , the adversary cannot significantly skew the estimator. In particular, if , i.e., there is no error while generating from , then the adversary can do nothing when , and thus the estimator is the same as the linear regression estimator on the uncorrupted data.

As an immediate consequence of Lemma 1, we have

Theorem 5.

Given , Algorithm 3 is -tolerant.

5 Practical Algorithms

Algorithms 1, 2, and 3 require enumerating subsets of indices, and thus all take exponential time. To make our approach practical, we develop efficient implementations of Algorithms 2 and 3.

5.1 Efficient Robust Subspace Recovery

Consider the objective function (3). Since , we can rewrite where U’s and B’s shapes are , and respectively. Therefore, we can rewrite objective  (3) as

$$
\min_{I,U,B} \; \|X^{I} - UB^{T}\| \quad \text{s.t.} \quad |I| = n
$$

which is equivalent to

$$
\min_{U,B} \; \sum_{j=1}^{n} \left\{ \|x_i - u_i B^{T}\| \ \text{for } i = 1,\dots,n+n_1 \right\}_{(j)} \qquad \text{(6)}
$$

where $x_i$ and $u_i$ denote the $i$-th row of X and U respectively, and $\{\cdot\}_{(j)}$ denotes the $j$-th smallest element of the set. Solving Objective (6) can be done using alternating minimization, which iteratively optimizes the objective for U and B while fixing the other. Specifically, in the $w$-th iteration, we optimize for the following two objectives:

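$$
\begin{aligned}
U^{w+1} &= \arg\min_{U} \; \|X - U (B^{w})^{T}\| \\
B^{w+1} &= \arg\min_{B} \; \sum_{j=1}^{n} \left\{ \|x_i - u_i^{w+1} B^{T}\| \ \text{for } i = 1,\dots,n+n_1 \right\}_{(j)}.
\end{aligned}
$$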

Notice that the first step computes the entire U regardless of the sub-matrix restriction. This is because we need the entire U to be computed to update B. The key challenge is to compute $B^{w+1}$ in each iteration, which is, again, a trimmed optimization problem. The next section presents a scalable solution for such problems.
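The two alternating updates can be sketched in NumPy as below. Several choices here are assumptions of this illustration and are not taken from the text: B is initialised from the top-$k$ right singular vectors of X, the trimmed B-update is approximated by a single "select the $n$ best-fitting rows, then refit" pass instead of a full inner solve, and the function name `robust_subspace_alt_min` is hypothetical. Like any alternating scheme of this kind, it can settle on a poor local solution for unfavourable starting points.

```python
import numpy as np

def robust_subspace_alt_min(X, n, k, n_iter=20):
    """Alternate between fitting U for all rows and refitting B on the n best rows."""
    # Assumed initialisation: top-k right singular vectors of the corrupted X.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    B = vt[:k].T                                    # shape (m, k)
    keep = np.arange(X.shape[0])
    for _ in range(n_iter):
        # U-step: least-squares coefficients of every row of X in the basis B.
        U = np.linalg.lstsq(B, X.T, rcond=None)[0].T        # shape (n + n1, k)
        # Trimming: keep the n rows with the smallest reconstruction residual.
        resid = np.linalg.norm(X - U @ B.T, axis=1)
        keep = np.sort(np.argsort(resid, kind="stable")[:n])
        # B-step: refit the basis using only the retained rows.
        B = np.linalg.lstsq(U[keep], X[keep], rcond=None)[0].T
    q, _ = np.linalg.qr(B)                          # orthonormalise the columns
    return q.T, keep                                # basis as k row vectors

# Eight pristine rows on the first axis, two outliers on the second axis.
pristine = np.array([[t, 0.0] for t in range(1, 9)], dtype=float)
outliers = np.array([[0.0, 3.0], [0.0, -3.0]])
basis, keep = robust_subspace_alt_min(np.vstack([pristine, outliers]), n=8, k=1)
```

In this toy run the retained rows are exactly the pristine ones and the returned basis is the first coordinate axis up to sign.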

5.2 Efficient Algorithm for Trimmed Optimization Problems

Both robust subspace recovery and optimizing for (4) rely on solving optimization problems in the form

$$
\min_{\theta} \; \sum_{j=1}^{n} \left\{ l\left(y_i, f_\theta(x_i)\right) \ \text{for } i = 1,\dots,n+n_1 \right\}_{(j)}
$$

where $f_\theta(x_i)$ computes the prediction over $x_i$ using parameter $\theta$, and $l(\cdot,\cdot)$ is the loss function. We refer to such problems as trimmed optimization problems. It is easy to see that solving this problem is equivalent to solving

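$$
\min_{\theta,\tau_1,\dots,\tau_{n+n_1}} \; \sum_{i=1}^{n+n_1} \tau_i \, l\left(y_i, f_\theta(x_i)\right) \quad \text{s.t.} \quad 0 \le \tau_i \le 1,\quad \sum_{i=1}^{n+n_1} \tau_i = n
$$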

We can use the alternating minimization technique to solve this problem, by optimizing for $\theta$ and for $\tau_1,\dots,\tau_{n+n_1}$ in turn. We present this in Algorithm 4. In particular, the algorithm iteratively seeks optimal arguments for $\theta$ and $\tau_1,\dots,\tau_{n+n_1}$ respectively. Optimizing for $\theta$ is a standard least-squares optimization problem. When optimizing $\tau_1,\dots,\tau_{n+n_1}$, it is easy to see that $\tau_i = 0$ if $l(y_i, f_\theta(x_i))$ is among the largest $n_1$ losses, and $\tau_i = 1$ otherwise. Therefore, optimizing for $\tau_1,\dots,\tau_{n+n_1}$ is a simple sorting step. While this algorithm is not guaranteed to converge to a global optimum, in our evaluation, we observe that a random start typically yields near-optimal solutions.
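The alternating scheme described above can be sketched for a linear model with squared loss as follows. The listing of Algorithm 4 is not reproduced in this copy, so this is an illustrative reading rather than the authors' code; the function name `trimmed_least_squares`, the iteration cap, and the stopping rule (stop when the retained set no longer changes) are assumptions.

```python
import numpy as np

def trimmed_least_squares(X, y, n, n_iter=50, seed=0):
    """Alternate a least-squares fit (theta) with a sort-based selection (tau)."""
    rng = np.random.default_rng(seed)
    total = X.shape[0]                              # n + n1 rows in total
    # Random start: tau_i = 1 on a random subset of n rows, 0 elsewhere.
    keep = np.sort(rng.choice(total, size=n, replace=False))
    theta = None
    for _ in range(n_iter):
        # theta-step: ordinary least squares on the currently retained rows.
        theta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        # tau-step: drop the n1 largest losses, i.e. keep the n smallest.
        losses = (y - X @ theta) ** 2
        new_keep = np.sort(np.argsort(losses, kind="stable")[:n])
        if np.array_equal(new_keep, keep):
            break                                   # selection is stable
        keep = new_keep
    return theta, keep

# Five points on the line y = x plus one poisoned label at index 5.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, -50.0])
theta, keep = trimmed_least_squares(X, y, n=5)
```

On this toy input the poisoned point has by far the largest residual after the first fit, so the procedure discards it and recovers a slope of 1 regardless of which random subset it starts from.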

The following theorem shows that the algorithm converges.

Theorem 6.

Given a loss function such that a lower bound exists, i.e., , Algorithm 4 converges. That is, assuming $l_j$ is the loss after the $j$-th iteration, we have

$$
\lim_{j \to +\infty} \|l_{j+1} - l_j\| = 0
$$

The proof can be found in Appendix C.

6 Experimental Results

We evaluate the proposed algorithms in comparison with state-of-the-art alternatives on synthetic data. For the subspace recovery problem, we compare to two state-of-the-art approaches: Chen et al. chen-icml11 and Xu et al. xu2010 . We compare the combined T-PCR algorithm with the recent robust regression approach chen2013robust , and the standard ridge regression algorithm. For most experiments, we set and . The only exception is that, when we evaluate the impact of on runtime, we vary from 1,000 to 8,000. We vary the rank of , and . Results are averages of 30 runs after dropping the largest and smallest values.
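The aggregation rule in the last sentence can be sketched as below. The text does not say how many extreme values are dropped, so this sketch assumes exactly one largest and one smallest value per set of runs; the function name `average_without_extremes` is hypothetical.

```python
def average_without_extremes(values):
    """Mean of the values after removing one minimum and one maximum."""
    if len(values) < 3:
        raise ValueError("need at least three runs to drop both extremes")
    trimmed = sorted(values)[1:-1]      # drop the smallest and the largest run
    return sum(trimmed) / len(trimmed)

print(average_without_extremes([5.0, 1.0, 3.0, 100.0]))  # prints 4.0
```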

Data: For a given , we generate as follows: sample two matrices with shape and respectively, such that each element is sampled independently from , ensuring that both have rank , and we set . Next, we generate corruptions also as a low rank matrix by generating and as above. For , we set the first half of by choosing rows of , and generating the remaining rows randomly, ensuring that B has rank . We then concatenate and and shuffle the rows. We do not add noise to , unless explicitly stated. In doing so, we know that shares a common subspace of rank with , but the two subspaces are still different. This approach of data generation is significantly more adversarial than prior methods of generating adversarial instances, as we show below. To generate labels, we generate a random , and y is then constructed as described in Section 2, where we apply the method of Xiao et al. xiao2015feature to create adversarial labels.

Runtime: Figures 1(a) and (b) present the runtime comparison results. Our approach is implemented in Scala without any special optimization, and both chen-icml11 and xu2010 are implemented in Matlab, leveraging Matlab’s built-in optimizations for matrix operations. In Figure 1(a), we vary the rank from 1 to 20, and fix . Our algorithm is significantly faster than chen-icml11 and xu2010 . Figure 1(b) shows runtime as a function of , with rank and fixed to 20 and 50 respectively. We can observe that scalability of chen-icml11 and xu2010 is quite poor in the size of the problem. In contrast, our algorithm remains quite efficient (with running time under seconds in all cases).

Identification rate of corrupted rows: Figure 1(c) presents the percentage of all corrupted rows identified by the algorithms. We generate the data fixing intrinsic rank to be , and varying inserted corruptions from 10 to 150, keeping the total data size to be . We evaluate our approach varying algorithm parameter from 10 to 20, and parameter from to . The results show that our approach achieves 100% accuracy regardless of the chosen parameters as long as is no less than the intrinsic rank, and is no bigger than the number of pristine rows. We also compare our approach with prior work, xu2010 and chen-icml11 , which are identical. We refer to both as Xu et al. xu2010 . We can observe that the identification rate plummets for , even though only 5% of the rows are corrupted.

Error on noisy data: We add noise to to evaluate performance on noisy data. Since chen-icml11 cannot handle noise, we only compare with xu2010 . To each element of , we add noise sampled from . Figures 2(a) and 2(b) show the RMSE of the difference between the recovered and the true . This metric is used by xu2010 as well. We use grayscale to denote the RMSE: a lighter color corresponds to a smaller RMSE. On most test cases our algorithm successfully recovers the true subspace, while xu2010 fails in most cases. In particular, when , our approach can completely recover the underlying low-rank matrix. When increases, the condition might not hold, and Theorem 3 says that no algorithm can recover the true subspace with probability greater than . However, this theorem does not prevent our algorithm from succeeding with probability , which is why we observe several white spots for high .

Robust Regression: We evaluate our T-PCR Algorithm 3 against the robust regression method of Chen et al. chen2013robust , which is the only alternative method for linear regression robust to adversarial data poisoning. As a baseline, we also present results for standard OLS linear regression with and without adversarial instances (LR(O+A) and LR(O), respectively). Results are evaluated using a ground truth test set not used for training. The results, shown in Figure 2(c), demonstrate that our algorithm significantly outperforms the alternatives. Indeed, while robust regression of Chen et al. chen2013robust does better than the non-robust baseline, our method works nearly as well as linear regression without adversarial instances!

7 Conclusion

This paper considers poisoning attacks on linear regression with dimensionality reduction. We address the problem in two steps: (1) introducing a novel robust matrix factorization method to recover the true subspace, and (2) a novel robust principal component regression to prune adversarial instances based on the basis recovered in step (1). We characterize necessary and sufficient conditions for our approach to be successful in recovering the true subspace, and present a bound on expected prediction loss compared to ground truth. Experimental results suggest that the proposed approach is extremely effective, and significantly outperforms prior art.

References

• (1) Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sakkis, Constantine D Spyropoulos, and Panagiotis Stamatopoulos. Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. arXiv preprint cs/0009009, 2000.
• (2) Philip K Chan and Salvatore J Stolfo. Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In KDD, volume 1998, pages 164–168, 1998.
• (3) Salvatore Stolfo, David W Fan, Wenke Lee, Andreas Prodromidis, and P Chan. Credit card fraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
• (4) Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. Is feature selection secure against training data poisoning? In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1689–1698, 2015.
• (5) Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 641–647. ACM, 2005.
• (6) Bo Li and Yevgeniy Vorobeychik. Feature cross-substitution in adversarial classification. In Advances in Neural Information Processing Systems, pages 2087–2095, 2014.
• (7) Bo Li and Yevgeniy Vorobeychik. Scalable optimization of randomized operational decisions in adversarial classification settings. In AISTATS, 2015.
• (8) Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
• (9) Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust high dimensional sparse regression and matching pursuit. arXiv preprint arXiv:1301.2725, 2013.
• (10) Jiashi Feng, Huan Xu, Shie Mannor, and Shuicheng Yan. Robust logistic regression and classification. In Advances in Neural Information Processing Systems, pages 253–261, 2014.
• (11) Ming Yan, Yi Yang, and Stanley Osher. Exact low-rank matrix completion from sparsely corrupted entries via adaptive outlier pursuit. Journal of Scientific Computing, 56(3):433–449, 2013.
• (12) Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust pca via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.
• (13) Paul Rodriguez and Brendt Wohlberg. Fast principal component pursuit via alternating minimization. In Image Processing (ICIP), 2013 20th IEEE International Conference on, pages 69–73. IEEE, 2013.
• (14) Mauro Barni and Benedetta Tondi. Source distinguishability under corrupted training. In Information Forensics and Security (WIFS), 2014 IEEE International Workshop on, pages 197–202. IEEE, 2014.
• (15) Benjamin IP Rubinstein, Blaine Nelson, Ling Huang, Anthony D Joseph, Shing-hon Lau, Satish Rao, Nina Taft, and JD Tygar. Antidote: understanding and defending against poisoning of anomaly detectors. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, pages 1–14. ACM, 2009.
• (16) Battista Biggio, Igino Corona, Giorgio Fumera, Giorgio Giacinto, and Fabio Roli. In Multiple Classifier Systems, pages 350–359. Springer, 2011.
• (17) Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
• (18) Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
• (19) Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.
• (20) Ian T Jolliffe. A note on the use of principal components in regression. Applied Statistics, pages 300–303, 1982.
• (21) Yudong Chen, Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust matrix completion and corrupted columns. In Proc. of ICML 11, 2011.

Appendix A Proof of Theorems in Section 3

Since this section does not involve y, we omit y without loss of clarity.

A.1 Theorem 1

Proof of Theorem 1.

We prove by contradiction. Assume we have a learner , can solve Problem 1 with probability more than 1/2. We want to show that there exist two different spaces of rank-, and one input X such that should return both two spaces with a probability , which is impossible. In the following, we construct such two spaces. Particularly, we will discuss how adversary can manipulate the matrix.

The adversary can choose which maximize such that while . We know . This means that .

Let be a basis for the row space of . The adversary then chooses a vector that is orthogonal to . The span of is therefore different from . The adversary then draws samples from the span of and inserts them into to form X. Moreover, we denote by a matrix of rows whose first rows are and whose remaining rows are those sampled by the adversary. Therefore, is also a submatrix of X, and at most rows in X do not come from .

In doing so, X is constructed by corrupting . On the other hand, X can also be seen as the result of corrupting by inserting rows. Therefore, the learner should return both and , each with probability greater than , which is impossible. Hence the theorem holds. ∎
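The construction in this proof can be illustrated numerically. The following is a minimal sketch, not the authors' construction verbatim: the sizes `n`, `m`, `k`, `n1` and all variable names are hypothetical, and for simplicity the adversary's extra direction `v` is taken orthogonal to the entire clean row space. It shows that rows drawn from a second rank-k subspace, once inserted, give an input whose rows fit two different rank-k subspaces.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 20, 6, 3   # hypothetical sizes: clean rows, features, rank
n1 = 5               # hypothetical number of rows inserted by the adversary

# A rank-k "clean" matrix, standing in for the uncorrupted feature matrix.
X_clean = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))

# The full SVD gives an orthonormal basis of the row space (first k rows of Vt)
# and of its orthogonal complement (remaining rows of Vt).
_, _, Vt = np.linalg.svd(X_clean)
row_basis = Vt[:k]
v = Vt[k]            # unit vector orthogonal to every clean row

# A second rank-k subspace: k-1 of the original directions plus v.
alt_basis = np.vstack([row_basis[:k - 1], v])

# Adversarial rows are sampled from the second subspace and appended.
X_adv = rng.standard_normal((n1, k)) @ alt_basis
X = np.vstack([X_clean, X_adv])

rank_clean = np.linalg.matrix_rank(X_clean)
rank_adv = np.linalg.matrix_rank(X_adv)
rank_all = np.linalg.matrix_rank(X)
```

Both `X_clean` and `X_adv` have rank k, while the stacked matrix `X` has rank k + 1, so no single rank-k subspace contains all rows of `X`.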

A.2 Theorem 2

Proof of Theorem 2.

We show that Algorithm 1 recovers the subspace of exactly. Let B be the output of Algorithm 1 on X. We only need to show that B is a basis of . By Algorithm 1, B is a basis of the rows in of X. Since the adversary can corrupt at most rows, we have . Combining this with the assumption , we obtain

$$|I \cap O| = |I| - |I \cap A| \ge n - n_1 > \mathrm{MS}_{k-1}(X_\star) \tag{7}$$

Therefore, we know B is a basis for . By the definition of and inequality (7), we know that

$$\operatorname{rank}\left(X_\star^{I \cap O}\right) = k$$

Therefore, is exactly the same subspace as , and thus B is a basis of . ∎

A.3 Corollary 1

Proof.

We can choose the , then we have . Therefore, . ∎

Now, we can prove Corollary 1.

Proof of Corollary 1.

Given , we have

$$n_1 + (k-1) \ge n$$

Combining , we know

$$n_1 + \mathrm{MS}_{k-1}(X_\star) \ge n_1 + (k-1) \ge n$$

Applying Theorem 1, we can conclude this corollary. ∎

A.4 Theorem 3

Proof of Theorem 3.

The proof of this theorem is similar to the proofs of Theorems 1 and 2. First, when , the adversary can construct X such that two subspaces should each be recovered with probability greater than . In particular, we assume minimizes objective 2, and thus . The adversary samples rows from the span of B, which does not belong to the span of . We add a small noise to to get , such that (1) minimizes ; and (2) . The adversary then inserts into to get X. In this case, optimizes its distance from , while optimizes its distance from , where we use to denote the concatenation of rows from and . Further, by definition, both of these distances are . Therefore, the learner should recover from X both and , each with probability greater than . This is impossible, so the first part of the theorem holds.

For the second part, we follow the proof of Theorem 2 and present only the differences. We show that Algorithm 2 recovers the subspace of exactly. Let B be the output of Algorithm 2 on X. We only need to show that B is a basis of . By Algorithm 2, B optimizes its span distance from a subset of rows in X, which is denoted as . Since the adversary can corrupt at most rows, we have . Therefore,

$$|I \cap O| = |I| - |I \cap A| \ge n - n_1 \tag{8}$$

If B is not a basis of , which means that , then the distance between the span of B and is greater than . This is impossible, since Algorithm 2 guarantees that this distance is no greater than , a contradiction. Therefore, the second part of the theorem holds. ∎

A.5 Theorem 4

Proof of Theorem 4.

When , we know that if and only if . This means that for any , implies that (condition (2b)), which implies that for all . Therefore, we know , which concludes this theorem. ∎

Appendix B Proof of Lemma 1

Proof.

Assume is the solution for this optimization problem. We assume the adversary wants to induce the regression system to compute . In this case, he has to corrupt rows in . W.L.O.G. we can assume . We denote . Since , we know that

$$y - X_\star \beta^\star = y - U_\star \beta^\star_U$$

Since optimize Eq (4), we assume are the smallest values for .

Then we have

Therefore we have

 (9)

Further, we know

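$$
\begin{aligned}
\sum_{i=n_1+1}^{n} \left(y_i - u_i \widehat{\beta}_U\right)^2
&= \sum_{i=n_1+1}^{n} \left( \left(y_i - u_i \beta^\star_U\right) + u_i\left(\beta^\star_U - \widehat{\beta}_U\right) \right)^2 \\
&= \sum_{i=n_1+1}^{n} \left(y_i - u_i \beta^\star_U\right)^2 + \sum_{i=n_1+1}^{n} \left(u_i\left(\beta^\star_U - \widehat{\beta}_U\right)\right)^2 + 2\sum_{i=n_1+1}^{n} u_i\left(\beta^\star_U - \widehat{\beta}_U\right)\left(y_i - u_i \beta^\star_U\right) \\
&\ge \sum_{i=n_1+1}^{n} \left(y_i - u_i \beta^\star_U\right)^2 + \sum_{i=n_1+1}^{n} \left(u_i\left(\beta^\star_U - \widehat{\beta}_U\right)\right)^2 - 2\left(\sum_{i=n_1+1}^{n} \left|u_i\left(\beta^\star_U - \widehat{\beta}_U\right)\right| \cdot \left|y_i - u_i \beta^\star_U\right|\right)
\end{aligned}
\tag{10}
$$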

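According to the Cauchy-Schwarz inequality, we have

$$\sum_{i=n_1+1}^{n} \left|u_i\left(\beta^\star_U - \widehat{\beta}_U\right)\right| \cdot \left|y_i - u_i \beta^\star_U\right| \le \sqrt{\sum_{i=n_1+1}^{n} \left(u_i\left(\beta^\star_U - \widehat{\beta}_U\right)\right)^2} \, \sqrt{\sum_{i=n_1+1}^{n} \left(y_i - u_i \beta^\star_U\right)^2}$$

We assume $C = \sqrt{\sum_{i=n_1+1}^{n} \left(u_i\left(\beta^\star_U - \widehat{\beta}_U\right)\right)^2}$; then, using $e_i = y_i - u_i \beta^\star_U$, we have

$$-2\left(\sum_{i=n_1+1}^{n} \left|u_i\left(\beta^\star_U - \widehat{\beta}_U\right)\right| \cdot \left|y_i - u_i \beta^\star_U\right|\right) \ge -2\sqrt{\sum_{i=n_1+1}^{n} \left(u_i\left(\beta^\star_U - \widehat{\beta}_U\right)\right)^2} \, \sqrt{\sum_{i=n_1+1}^{n} e_i^2} = -2C\sqrt{\sum_{i=n_1+1}^{n} e_i^2}$$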

Substituting this inequality into (10) and combining with (9), we have

$$\sum_{i=n_1+1}^{n+n_1} e_i^2 \ge \sum_{i=n_1+1}^{n} e_i^2 + C^2 - 2C\sqrt{\sum_{i=n_1+1}^{n} e_i^2}$$

By simple rearrangement, we have

$$C^2 - 2C\sqrt{\sum_{i=n_1+1}^{n} e_i^2} \le \sum_{i=n+1}^{n+n_1} e_i^2$$

Since we know , we know that for any parameter , we have for some constant . Therefore, we know, with high probability (at least ), we have

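$$C^2 - 2\sqrt{n-n_1}\,C\left(2\sigma\sqrt{\log h}\right) \le C^2 - 2C\sqrt{\sum_{i=n_1+1}^{n} e_i^2} \le \sum_{i=n+1}^{n+n_1} e_i^2 \le n_1\left(2\sigma\sqrt{\log h}\right)^2$$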

Therefore, we have

$$\left(C - 2\sigma\sqrt{n-n_1}\sqrt{\log h}\right)^2 \le n\left(2\sigma\sqrt{\log h}\right)^2$$

and thus

$$C \le 2\sigma\left(\sqrt{n} + \sqrt{n-n_1}\right)\sqrt{\log h}$$
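The last two steps amount to completing the square. As a worked sketch, write $a = 2\sigma\sqrt{\log h}$ (a shorthand introduced here only for brevity, not notation from the paper). The chain of inequalities above gives $C^2 - 2\sqrt{n-n_1}\,aC \le n_1 a^2$, from which the bound on C follows:

```latex
\begin{aligned}
C^2 - 2\sqrt{n-n_1}\,aC &\le n_1 a^2 \\
C^2 - 2\sqrt{n-n_1}\,aC + (n-n_1)a^2 &\le n_1 a^2 + (n-n_1)a^2 \\
\left(C - \sqrt{n-n_1}\,a\right)^2 &\le n a^2 \\
C - \sqrt{n-n_1}\,a &\le \sqrt{n}\,a \\
C &\le \left(\sqrt{n} + \sqrt{n-n_1}\right)a
   = 2\sigma\left(\sqrt{n} + \sqrt{n-n_1}\right)\sqrt{\log h}
\end{aligned}
```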

Therefore, we know

We notice the right hand side of the above inequality does not depend on . Therefore, we take , and we know that

, and apply the law of large numbers, we have

where left hand side is the same as . Then the conclusion of Lemma 1 is a simple rearrangement of the above inequality. ∎

Appendix C Proof of Theorem 6

We consider

$$\ell_j = \sum_{i=1}^{n+n_1} \tau_i^{(j)}\, l\left(y_i, f_{\theta^{(j)}}(x_i)\right)$$

and

$$\ell'_j = \sum_{i=1}^{n+n_1} \tau_i^{(j)}\, l\left(y_i, f_{\theta^{(j+1)}}(x_i)\right)$$

According to the algorithm, it is easy to see that

$$\ell_j \ge \ell'_j \ge \ell_{j+1}$$

Therefore, $\{\ell_j\}$ is a monotonically decreasing sequence. Since the sequence is bounded below, it converges to its infimum, which we denote by $\bar{B}$. Therefore, we know that

$$\lim_{j\to+\infty} \left(\ell_j - \bar{B}\right) = \lim_{j\to+\infty} \left|\ell_j - \bar{B}\right| = 0$$

Further, we know that

$$0 \le \left|\ell_{j+1} - \ell_j\right| \le \left|\ell_j - \bar{B}\right|$$

and thus

$$0 \le \lim_{j\to+\infty} \left|\ell_{j+1} - \ell_j\right| \le \lim_{j\to+\infty} \left|\ell_j - \bar{B}\right| = 0$$

Therefore, we have

$$\lim_{j\to+\infty} \left|\ell_{j+1} - \ell_j\right| = 0$$

Q.E.D.
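The monotonicity argument can be checked numerically on a small example. The sketch below rests on assumptions not stated in this appendix: a linear model, squared loss, and an alternating scheme that first refits θ on the currently selected rows and then reselects the rows with the smallest loss. The function name `trimmed_least_squares` and the synthetic data are hypothetical and only illustrate why the recorded losses cannot increase.

```python
import numpy as np

def trimmed_least_squares(X, y, n_keep, n_iter=15):
    """Alternate between refitting theta and reselecting the n_keep best-fit rows."""
    N, d = X.shape
    theta = np.zeros(d)                   # theta^(0)
    keep = np.zeros(N, dtype=bool)
    keep[:n_keep] = True                  # tau^(0): arbitrary initial selection
    losses = []
    for _ in range(n_iter):
        # l_j: loss of theta^(j) on the rows selected by tau^(j)
        losses.append(np.sum((y[keep] - X[keep] @ theta) ** 2))
        # theta step: least squares on the selected rows cannot increase the loss
        theta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        # tau step: keeping the n_keep smallest residuals cannot increase it either
        residuals = (y - X @ theta) ** 2
        keep = np.zeros(N, dtype=bool)
        keep[np.argsort(residuals)[:n_keep]] = True
    return theta, np.array(losses)

# Synthetic data: n clean rows plus n1 corrupted rows (all values hypothetical).
rng = np.random.default_rng(0)
n, n1, d = 60, 15, 4
beta = rng.standard_normal(d)
X_clean = rng.standard_normal((n, d))
y_clean = X_clean @ beta + 0.05 * rng.standard_normal(n)
X_bad = rng.standard_normal((n1, d))
y_bad = X_bad @ (-beta) + 5.0
perm = rng.permutation(n + n1)
X = np.vstack([X_clean, X_bad])[perm]
y = np.concatenate([y_clean, y_bad])[perm]

theta, losses = trimmed_least_squares(X, y, n_keep=n)
```

The array `losses` holds the sequence of trimmed losses. It is non-increasing up to floating-point error, so successive differences shrink toward zero, as the proof concludes.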