 # Robust Risk Minimization for Statistical Learning

We consider a general statistical learning problem where an unknown fraction of the training data is corrupted. We develop a robust learning method that only requires specifying an upper bound on the corrupted data fraction. The method is formulated as a risk minimization problem that can be solved using a blockwise coordinate descent algorithm. We demonstrate the wide-ranging applicability of the method, including regression, classification, unsupervised learning and classic parameter estimation, with state-of-the-art performance.


## 1 Introduction

Statistical learning problems encompass regression, classification, unsupervised learning and parameter estimation. The common goal is to find a model, indexed by a parameter $\theta$, that minimizes some loss function $\ell_\theta(z)$ on average, using training data $z_1, \dots, z_n$. The loss function is chosen to target data from a class of distributions.

It is commonly assumed that the training data is drawn from a nominal distribution $p_o(z)$. In practice, however, training data is often corrupted by outliers, systematic mislabeling, or even an adversary. Under such conditions, standard learning methods degrade rapidly [2, 11, 14, 23]. See Figure 1 for an illustration. Here we consider the Huber contamination model, which is capable of modeling the inherent corruption of data and is common in the robust statistics literature [16, 17, 19]. Specifically, the training data is assumed to be drawn from the unknown mixture distribution

$$p(z) = (1-\epsilon)\,p_o(z) + \epsilon\,q(z), \tag{1}$$

so that roughly $\epsilon n$ of the $n$ samples come from a corrupting distribution $q(z)$. The fraction of outliers, $\epsilon$, is comparatively small in routine datasets, but in data collected with less dedicated effort or under time constraints it can easily exceed 10% [15, ch. 1].
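The contamination model (1) can be simulated directly. The following is a minimal sketch, assuming a scalar Gaussian nominal distribution and a shifted Gaussian corrupting distribution; both choices, the sample size and the helper name `sample_huber_mixture` are illustrative and not taken from the experiments.

```python
import numpy as np

def sample_huber_mixture(n, eps, sample_nominal, sample_corrupt, rng):
    """Draw n points from p = (1 - eps) * p_o + eps * q, as in (1)."""
    is_outlier = rng.random(n) < eps   # each point is corrupted with probability eps
    z = np.where(is_outlier, sample_corrupt(n, rng), sample_nominal(n, rng))
    return z, is_outlier

rng = np.random.default_rng(0)
z, is_outlier = sample_huber_mixture(
    n=10_000,
    eps=0.1,
    sample_nominal=lambda n, rng: rng.normal(0.0, 1.0, n),   # assumed p_o
    sample_corrupt=lambda n, rng: rng.normal(8.0, 1.0, n),   # assumed q
    rng=rng,
)

# Roughly eps * n points are corrupted, and they shift the sample mean away from 0.
print("fraction corrupted:", is_outlier.mean())
print("sample mean:", z.mean())
```

Even a 10% contamination moves the sample mean well away from the nominal mean in this toy setting, which is the kind of degradation the robust method is meant to prevent.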

For this setting, several robust methods have been proposed in the literature. A classical approach is to modify the loss function so as to be less sensitive to outliers [17, 19, 23]. Examples of such functions are Huber's loss function and Tukey's loss function. Another approach is to identify the corrupted points in the training data based on some criteria and then remove them [18, 5, 4, 3, 20]. For example, for mean and covariance estimation, one such method identifies corrupted points by projecting the training data onto an estimated dominant signal subspace and then comparing the magnitude of the projected data against some threshold. The main limitation of the above approaches is that they are problem-specific and must be tailored to each learning problem.

Recent work has been directed toward developing a more general method for robust statistical learning that is applicable to a wide range of loss functions [9, 21, 7]. These state-of-the-art methods do, however, exhibit some important limitations. Firstly, the cited methods assume that the fraction of corrupted data, i.e., $\epsilon$, is known. For example, one of the proposed algorithms scores each data point and removes a fraction of the training data based on the score and $\epsilon$. Similarly, another algorithm has two steps, where the first step solves a regularized problem in which the regularizer depends on $\epsilon$. In practice, a user may not be able to precisely specify the percentage of corrupted data. Secondly, the cited methods rely on removing data points based on a specified threshold. Since there are different means of scoring and thresholding, it is not clear which are better and how problem-dependent the choices are. Moreover, the threshold against which the score of the data points is compared often depends on additional user-defined parameters that are needed as input to the algorithm.

The main contribution of this paper is a general robust method with the following properties:

• it is applicable to any statistical learning problem that minimizes an expected loss function,

• it requires only specifying an upper bound $\tilde\epsilon$ on the corrupted data fraction $\epsilon$,

• it is formulated as a minimization problem that can be solved using a blockwise algorithm.

We illustrate and evaluate the robust method in several standard statistical learning problems.

## 2 Problem

Consider a set of models indexed by a parameter $\theta \in \Theta$. The predictive loss of a model is denoted $\ell_\theta(z)$, where $z$ is a randomly drawn datapoint. The optimal model is obtained by minimizing the expected loss, or risk, i.e.,

$$\theta^\star = \arg\min_{\theta \in \Theta} \; \mathbb{E}[\ell_\theta(z)], \tag{2}$$

where $z \sim p_o(z)$. Because the distribution $p_o(z)$ is typically unknown, a common learning strategy is to use $n$ independent samples $z_1, \dots, z_n$ to find the empirical risk minimizing (Erm) parameter vector

$$\hat\theta = \arg\min_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell_\theta(z_i). \tag{3}$$

###### Example 1

In regression problems, data consists of features and outcomes, $z = (x, y)$, and $\theta$ parameterizes a predictor of $y$ given $x$. The standard squared-error loss function targets distributions with thin-tailed noise and conditional means that fit the predictor.

###### Example 2

In general parameter estimation problems, a standard loss function is $\ell_\theta(z) = -\ln p_\theta(z)$, which targets distributions spanned by $p_\theta(z)$. For this choice of loss function, (3) corresponds to the maximum likelihood estimator.

In real applications, a certain fraction of the data is corrupted. We model the unknown data generating process by (1). Under such corrupted data conditions, Erm degrades rapidly as $q(z)$ diverges from $p_o(z)$ or as $\epsilon$ increases. While $\epsilon$ is unknown, it can typically be upper bounded, that is, $\epsilon \le \tilde\epsilon$ [15, ch. 1]. Our goal is to formulate a general method of risk minimization which, given the training samples and $\tilde\epsilon$, learns a model that is robust against corrupted training samples.

## 3 Method

Consider the following risk function

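$$R(\theta, \pi) = \mathbb{E}_{z \sim p_\pi(z)}\left[\ell_\theta(z)\right], \tag{4}$$

where $p_\pi(z)$ is the following empirical distribution

$$p_\pi(z) = \sum_{i=1}^{n} \pi_i \, \delta(z - z_i), \tag{5}$$

and where the weight vector $\pi = [\pi_1, \dots, \pi_n]^\top$ belongs to the probability simplex

$$\Pi = \left\{ \pi \in \mathbb{R}^n_+ : \mathbf{1}^\top \pi = 1 \right\}.$$

We denote the entropy of $\pi$ as

$$H(\pi) \triangleq -\sum_{i} \pi_i \ln \pi_i \ge 0.$$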

The maximum entropy distribution is obtained when $\pi = n^{-1}\mathbf{1}$, in which case $H(\pi) = \ln n$. Minimizing the risk then yields the Erm parameter in (3) and, if the data were uncorrupted ($\epsilon = 0$), the maximum entropy distribution would yield an asymptotically consistent estimate of $\theta^\star$ under standard regularity conditions.
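The quantities in (4) and (5) and the entropy are simple to compute for a given weight vector. The sketch below uses made-up per-sample losses (the last two mimic outliers) and hypothetical helper names `entropy` and `weighted_risk`. It illustrates that uniform weights attain entropy $\ln n$, and that weights uniform on $m$ of the $n$ points attain entropy $\ln m$.

```python
import numpy as np

def entropy(pi):
    """H(pi) = -sum_i pi_i ln pi_i, with the convention 0 ln 0 = 0."""
    pi = np.asarray(pi, dtype=float)
    nz = pi > 0
    return float(-np.sum(pi[nz] * np.log(pi[nz])))

def weighted_risk(losses, pi):
    """R(theta, pi) in (4)-(5): the pi-weighted average of per-sample losses."""
    return float(np.dot(pi, losses))

n = 8
losses = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 9.0, 7.5])  # illustrative values
uniform = np.full(n, 1.0 / n)                      # maximum-entropy weights
trimmed = np.r_[np.full(6, 1.0 / 6), 0.0, 0.0]     # uniform on the first six points

print(entropy(uniform), np.log(n))    # both equal ln 8
print(entropy(trimmed), np.log(6))    # both equal ln 6
print(weighted_risk(losses, uniform), weighted_risk(losses, trimmed))
```

With uniform weights the weighted risk is the empirical risk in (3); moving weight away from the two high-loss points lowers both the risk and the entropy.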

### 3.1 Robust risk minimization

To approximate the target distribution $p_o(z)$, we would like the support of $p_\pi(z)$ to cover only the unknown uncorrupted samples in the training data, in which case its maximum entropy would be $\ln[(1-\epsilon)n]$. Therefore we seek a distribution with entropy no less than $\ln[(1-\tilde\epsilon)n]$, using the bound $\tilde\epsilon \ge \epsilon$. This leads to a joint optimization problem

$$\min_{\theta \in \Theta,\; \pi \in \Pi} R(\theta, \pi) \quad \text{subject to} \quad H(\pi) \ge \ln\left[(1-\tilde\epsilon)\,n\right]. \tag{6}$$

Intuitively, the above minimization problem finds a model and assigns weights to a set of points, which jointly provide the lowest expected loss. Points in the data that fit the model class obtain higher weights and contribute more to the objective function than points that do not fit. Furthermore, the entropy constraint mitigates overfitting to noise inherent even to the noncorrupted data. In this way, learning is robust against outliers in the training data.
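As a worked instance of the constraint in (6), take the illustrative values $n = 100$ and $\tilde\epsilon = 0.2$ (assumed here for the arithmetic only, not taken from the experiments):

$$H(\pi) \;\ge\; \ln\left[(1-\tilde\epsilon)\,n\right] = \ln(0.8 \times 100) = \ln 80 \approx 4.38, \qquad \text{compared with} \qquad \ln n = \ln 100 \approx 4.61 .$$

Since a distribution supported on $m$ points has entropy at most $\ln m$, any feasible $\pi$ in this example must place nonzero weight on at least 80 of the 100 points.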

Note that the learned probability weights can automatically identify corrupted samples, as illustrated in Fig. 1. This capability can be useful as a diagnostic tool in certain applications.

### 3.2 Blockwise minimization algorithm

We now propose a practical computational method of finding a solution of (6). Given fixed parameters $\tilde\theta$ and $\tilde\pi$, we define (for a given $\tilde\theta$)

$$\hat\pi(\tilde\theta) = \arg\min_{\pi \in \Pi} \; R(\tilde\theta, \pi) \quad \text{s.t.} \quad H(\pi) \ge \ln\left[(1-\tilde\epsilon)\,n\right], \tag{7}$$

which is the solution to a convex optimization problem and can be computed efficiently using standard numerical packages [12], and (for a given $\tilde\pi$)

$$\hat\theta(\tilde\pi) = \arg\min_{\theta \in \Theta} \; R(\theta, \tilde\pi), \tag{8}$$

which is the solution to a risk minimization problem. Solving both problems in a cyclic manner constitutes a blockwise coordinate descent method, which we summarize in Algorithm 1. When $\Theta$ is closed and convex, the algorithm is guaranteed to converge to a critical point of (6), see [13].
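The alternation between (7) and (8) can be sketched as follows. Two assumptions are made here and are not taken from the text. First, instead of a general-purpose convex solver, the $\pi$-step uses the fact that the Lagrangian conditions of (7) give weights of the form $\pi_i \propto \exp(-\ell_i/\tau)$, with the temperature $\tau > 0$ found by bisection so that the entropy constraint is met. Second, the loop starts from uniform weights and runs a fixed number of iterations. The function names and the toy location-estimation problem are illustrative, and the bound is assumed to satisfy $0 < \tilde\epsilon < 1$.

```python
import numpy as np

def entropy(pi):
    nz = pi > 0
    return float(-np.sum(pi[nz] * np.log(pi[nz])))

def gibbs(losses, tau):
    """Weights proportional to exp(-loss / tau), shifted for numerical stability."""
    w = np.exp(-(losses - losses.min()) / tau)
    return w / w.sum()

def pi_step(losses, eps_bound, iters=100):
    """Approximate solution of (7) for fixed per-sample losses."""
    n = losses.size
    target = np.log((1.0 - eps_bound) * n)
    spread = losses.max() - losses.min()
    if spread == 0.0:
        return np.full(n, 1.0 / n)          # all losses equal: uniform weights are optimal
    lo, hi = 1e-8 * spread, 1e8 * spread    # entropy increases with tau on this bracket
    for _ in range(iters):
        mid = np.sqrt(lo * hi)              # bisection on a logarithmic scale
        if entropy(gibbs(losses, mid)) < target:
            lo = mid
        else:
            hi = mid
    return gibbs(losses, hi)                # the upper end satisfies the constraint

def rrm(z, eps_bound, loss_fn, theta_step, n_iter=50):
    """Blockwise minimization: alternate the theta-step (8) and the pi-step (7)."""
    pi = np.full(len(z), 1.0 / len(z))      # assumed initialization: uniform weights
    for _ in range(n_iter):
        theta = theta_step(z, pi)
        pi = pi_step(loss_fn(z, theta), eps_bound)
    return theta, pi

# Toy problem: estimate a location with squared loss, 10% of points shifted to 10.
rng = np.random.default_rng(0)
z = np.r_[rng.normal(0.0, 1.0, 90), rng.normal(10.0, 0.5, 10)]
sq_loss = lambda z, theta: (z - theta) ** 2
weighted_mean = lambda z, pi: float(pi @ z)
theta_hat, pi_hat = rrm(z, eps_bound=0.2, loss_fn=sq_loss, theta_step=weighted_mean)

print("Erm (sample mean):", z.mean())
print("Rrm estimate:", theta_hat)
print("total weight on shifted points:", pi_hat[90:].sum())
```

In this toy setting the shifted points end up with negligible weight, so the learned weights also act as a diagnostic for which samples were treated as corrupted.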

The general form of the proposed method renders it applicable to a diverse range of learning problems in which Erm is conventionally used. In the next section, we illustrate the performance and generality of the proposed method using numerical experiments for different supervised and unsupervised machine learning problems.

## 4 Numerical Experiments

We illustrate the generality of our framework by addressing four common problems in regression, classification, unsupervised learning and parameter estimation. For the sake of comparison, we also evaluate the recently proposed robust Sever method [9], which was derived on very different grounds as a means of augmenting gradient-based learning algorithms with outlier rejection capabilities. We use the same threshold settings for the Sever algorithm as were used in the experiments in [9], with $\tilde\epsilon$ in lieu of the unknown fraction $\epsilon$.

### 4.1 Linear Regression

Consider data $z = (x, y)$, where $x$ and $y$ denote feature vectors and outcomes, respectively. We consider a class of linear predictors $x^\top\theta$ and a squared-error predictive loss $\ell_\theta(z) = (y - x^\top\theta)^2$. This loss function targets thin-tailed distributions with a linear conditional mean function.

We learn $\theta$ using i.i.d. training samples drawn from

$$p(x, y) = (1-\epsilon)\,p(x)\,p_o(y|x) + \epsilon\,p(x)\,q(y|x), \tag{9}$$

where

$$p_o(y|x) = \mathcal{N}(x^\top\theta^\star, \sigma^2), \qquad q(y|x) = t(x^\top\theta^\star, \nu). \tag{10}$$

The above data generator yields observations concentrated around a hyperplane, where roughly $\epsilon n$ observations are corrupted by heavy-tailed t-distributed noise. Data is generated with a fixed $\theta^\star$ and noise standard deviation $\sigma$.

We evaluate the distribution of estimation errors relative to $\theta^\star$ using Monte Carlo runs. In the first experiment, we set $\epsilon$ to 20% and choose $\nu$ such that the tails of $q(y|x)$ are so heavy that the variance is undefined. We apply Rrm with a conservative upper bound $\tilde\epsilon$. Note that (8) is then a weighted least-squares problem with a closed-form solution. The distributions of errors for Erm, Sever and Rrm are summarized in Figure 1(a). We also include the Huber method, which is tailored specifically for linear regression [23, ch. 2.6.2]. Rrm and Sever perform similarly in this case and are substantially better than Erm, reducing the errors by almost a half.
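For the squared-error loss, the $\theta$-step (8) reduces to the weighted normal equations. A minimal sketch follows, with an illustrative $\theta^\star$, a single grossly corrupted outcome and a hypothetical helper name `weighted_least_squares` (none of these values come from the experiments):

```python
import numpy as np

def weighted_least_squares(X, y, pi):
    """argmin_theta sum_i pi_i (y_i - x_i^T theta)^2 via the weighted normal equations."""
    Xw = X * pi[:, None]                        # each row of X scaled by its weight
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)  # (X^T W X)^{-1} X^T W y

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])         # illustrative value
y = X @ theta_star                              # noiseless outcomes for clarity
y[0] += 100.0                                   # one grossly corrupted outcome

uniform = np.full(n, 1.0 / n)
masked = uniform.copy()
masked[0] = 0.0                                 # zero weight on the corrupted point
masked /= masked.sum()

theta_erm = weighted_least_squares(X, y, uniform)   # coincides with ordinary least squares
theta_w = weighted_least_squares(X, y, masked)      # recovers theta_star
print(theta_erm, theta_w)
```

With uniform weights the solution coincides with ordinary least squares; once the corrupted sample receives zero weight, the noiseless $\theta^\star$ is recovered.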

Next, we study the performance as the percentage of corrupted data $\epsilon$ increases. We set $\nu$ so that the variance of the corrupting distribution is defined. Figure 1(b) shows the expected relative error against $\epsilon$ for the different methods, where the robust methods, once again, perform similarly to one another, slightly better than Huber's, and much better than Erm.

### 4.2 Logistic Regression

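Consider data $z = (x, y)$, where $x$ is a feature vector and $y \in \{0, 1\}$ an associated class label. We consider the cross-entropy loss

$$\ell_\theta(x, y) = -y \ln\left(\sigma_\theta(x)\right) - (1-y) \ln\left(1 - \sigma_\theta(x)\right), \tag{11}$$

where

$$\sigma_\theta(x) = \left(1 + \exp\left(\phi^\top(x)\,\theta\right)\right)^{-1}$$

and $\phi(x)$ is a vector of features of $x$. Thus the loss function targets distributions with linearly separable classes.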

We learn $\theta$ using i.i.d. points drawn from

$$p(x, y) = (1-\epsilon)\,p_o(x)\,p_o(y|x) + \epsilon\,q(x)\,q(y|x). \tag{12}$$

An illustration of the nominal distribution $p_o(x)\,p_o(y|x)$ is given in Figure 2(a), where the separating hyperplane corresponds to $\theta^\star$. The corrupting distribution $q(x)\,q(y|x)$ is illustrated in Figure 2(b).

Data is generated according to (12) with $\epsilon$ equal to 5%.

We apply Rrm with an upper bound $\tilde\epsilon$. Note that $\hat\theta(\tilde\pi)$ is readily computed using the standard iteratively reweighted least squares or MM algorithms, with minor modifications to take into account the fact that the data points are weighted by $\tilde\pi$. Figure 2(b) shows the learned separating planes, parameterized by $\hat\theta$, for a single realization. We observe that the planes learned by Erm and the robust Sever are shifted towards the outliers. By contrast, the proposed Rrm method is only marginally affected by the corrupting distribution. Figure 2(c) summarizes the distribution of angles between $\hat\theta$ and $\theta^\star$ using Monte Carlo simulations. Rrm outperforms the other two methods in this case.
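The modification needed for weighted data points is small: the weights simply multiply each sample's contribution to the gradient and Hessian of the cross-entropy loss. The sketch below uses Newton's method, which for logistic regression is an iteratively reweighted least squares scheme. It assumes the common sign convention $\sigma(u) = 1/(1+e^{-u})$, which differs from the form written above only by the sign of $\theta$. The data, `theta_true`, the tiny ridge term and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def weighted_logistic_newton(Phi, y, pi, n_iter=25, ridge=1e-10):
    """Minimize sum_i pi_i * cross_entropy(y_i, sigmoid(phi_i^T theta)) by Newton steps."""
    d = Phi.shape[1]
    theta = np.zeros(d)
    for _ in range(n_iter):
        s = sigmoid(Phi @ theta)
        grad = Phi.T @ (pi * (s - y))                 # weighted gradient
        curv = pi * s * (1.0 - s)                     # weighted per-sample curvature
        hess = Phi.T @ (Phi * curv[:, None]) + ridge * np.eye(d)
        theta = theta - np.linalg.solve(hess, grad)
    return theta

rng = np.random.default_rng(2)
n = 400
x = rng.normal(size=(n, 2))
Phi = np.c_[np.ones(n), x]                            # features with an intercept
theta_true = np.array([0.5, 1.0, -1.0])               # illustrative value
y = (rng.random(n) < sigmoid(Phi @ theta_true)).astype(float)

pi = np.full(n, 1.0 / n)                              # uniform weights reproduce Erm
theta_hat = weighted_logistic_newton(Phi, y, pi)
grad_at_solution = Phi.T @ (pi * (sigmoid(Phi @ theta_hat) - y))
print(theta_hat, np.linalg.norm(grad_at_solution))
```

Inside the blockwise algorithm, `pi` would be replaced by the current weight vector $\tilde\pi$ at each $\theta$-step.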

### 4.3 Principal Component Analysis

Consider data $z$, which we assume to have zero mean. Our goal is to approximate $z$ by projecting it onto a subspace. We consider the loss $\ell_\theta(z) = \|z - P_\theta z\|^2$, where $P_\theta$ is an orthogonal projection matrix. The loss function targets distributions where the data is concentrated around a linear subspace. In the case of a one-dimensional subspace, $P_\theta = \theta\theta^\top$, where $\|\theta\| = 1$.

We learn $\theta$ using i.i.d. datapoints drawn from

$$p(z) = (1-\epsilon)\,\underbrace{p_o(z_2|z_1)\,p_o(z_1)}_{p_o(z)} + \epsilon\,q(z), \tag{13}$$

where

$$p_o(z_2|z_1) = \mathcal{N}(2 z_1, \sigma^2), \qquad p_o(z_1) = \mathcal{N}(0, 1), \tag{14}$$

and $q(z)$ generates the outliers. Note that $p_o(z)$ in (13) corresponds to a subspace parameterized by $\theta^\star$.

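Data is generated with $\epsilon$ set to 20%. We apply Rrm with an upper bound $\tilde\epsilon$. Note that $\hat\theta(\tilde\pi)$ can be obtained as

$$\hat\theta(\tilde\pi) = \arg\max_{\theta \in \Theta} \; \theta^\top R\,\theta, \tag{15}$$

which is equivalent to maximizing the Rayleigh quotient, and the solution is simply the dominant eigenvector of the covariance matrix

$$R = \sum_{i=1}^{n} \tilde\pi_i \, z_i z_i^\top. \tag{16}$$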
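The $\theta$-step in (15) and (16) amounts to one symmetric eigendecomposition. A minimal sketch follows, assuming nominal data generated as in (14) with an illustrative $\sigma = 0.1$ and outliers placed along the orthogonal direction (the outlier distribution, sample sizes and helper name are assumptions made for illustration):

```python
import numpy as np

def weighted_pca_direction(Z, pi):
    """Dominant eigenvector of R = sum_i pi_i z_i z_i^T, as in (15)-(16)."""
    R = (Z * pi[:, None]).T @ Z
    eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues returned in ascending order
    return eigvecs[:, -1]                  # unit-norm eigenvector of the largest eigenvalue

rng = np.random.default_rng(3)
z1 = rng.normal(0.0, 1.0, 500)
inliers = np.c_[z1, 2.0 * z1 + rng.normal(0.0, 0.1, 500)]            # as in (14), sigma = 0.1
outliers = 30.0 * rng.normal(size=(20, 1)) * np.array([[2.0, -1.0]]) / np.sqrt(5.0)
Z = np.r_[inliers, outliers]

u = np.array([1.0, 2.0]) / np.sqrt(5.0)    # direction of the line z2 = 2 z1
uniform = np.full(len(Z), 1.0 / len(Z))
clean = np.r_[np.full(500, 1.0 / 500), np.zeros(20)]                 # zero weight on outliers

theta_erm = weighted_pca_direction(Z, uniform)
theta_w = weighted_pca_direction(Z, clean)
print(abs(theta_erm @ u), abs(theta_w @ u))   # alignment with the nominal direction
```

The sign of an eigenvector is arbitrary, which is why alignment is measured through an absolute value.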

We evaluate the misalignment of the subspaces using the metric $1 - |\cos(\hat\theta^\top\theta^\star)|$, evaluated over Monte Carlo simulations. Figure 4 summarizes the distribution of errors for the three different methods. For this problem, Rrm outperforms both Erm and Sever.

Figure 4: Principal component analysis. Box plot of subspace misalignment error $1 - |\cos(\hat\theta^\top\theta^\star)|$.

### 4.4 Covariance Estimation

Consider data $z$ with an unknown mean $\mu$ and covariance $\Sigma$. We consider the loss function

$$\ell_\theta(z) = (z-\mu)^\top \Sigma^{-1} (z-\mu) + \ln|\Sigma|,$$

where $\theta = (\mu, \Sigma)$. This loss function targets sub-Gaussian distributions.

We learn $\theta$ using i.i.d. samples drawn from

$$p(z) = (1-\epsilon)\,p_o(z) + \epsilon\,q(z). \tag{17}$$

Data is generated using (17), with the parameters of $q(z)$ set such that the corrupting distribution has no finite covariance matrix.

We apply Rrm with an upper bound $\tilde\epsilon$. Note that $\hat\theta(\tilde\pi)$ has a closed-form solution, given by the weighted sample mean and covariance matrix with the weight vector equal to $\tilde\pi$. We evaluate the error relative to $\Sigma^\star$ over Monte Carlo simulations and show it in Figure 5. We see that Sever is prone to break down due to the heavy-tailed outliers, whereas Rrm is stable.

Figure 5: Covariance estimation. Box plot of distribution of relative errors $\|\Sigma^\star - \Sigma\|_F / \|\Sigma^\star\|_F$. Note that the expected relative error for Sever is too large to be contained in the given plot.
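The closed-form $\theta$-step for this problem, the weighted sample mean and covariance, together with the relative Frobenius error used for evaluation, can be sketched as below. The helper names and the Gaussian test data are illustrative assumptions.

```python
import numpy as np

def weighted_mean_cov(Z, pi):
    """Weighted sample mean and covariance for weights pi on the rows of Z."""
    mu = pi @ Z                              # sum_i pi_i z_i
    C = Z - mu                               # centered data
    Sigma = (C * pi[:, None]).T @ C          # sum_i pi_i (z_i - mu)(z_i - mu)^T
    return mu, Sigma

def relative_frobenius_error(Sigma_star, Sigma):
    """|| Sigma_star - Sigma ||_F / || Sigma_star ||_F."""
    return np.linalg.norm(Sigma_star - Sigma, "fro") / np.linalg.norm(Sigma_star, "fro")

rng = np.random.default_rng(4)
A = np.array([[1.0, 0.5], [0.0, 1.0]])       # illustrative mixing matrix
Z = rng.normal(size=(200, 2)) @ A
pi = np.full(200, 1.0 / 200)                 # uniform weights give the usual estimates

mu_hat, Sigma_hat = weighted_mean_cov(Z, pi)
print(mu_hat)
print(relative_frobenius_error(A.T @ A, Sigma_hat))
```

With uniform weights these are the ordinary sample mean and the (biased) sample covariance; zero weights remove the corresponding points from both estimates.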

## 5 Real Data

Finally, we test the performance of Rrm on real data. We use the Wisconsin breast cancer dataset from the UCI repository [1]. The dataset consists of points with features $x$ and labels $y$. The two class labels correspond to 'benign' and 'malignant' cancers, respectively. A portion of the data was used for training, and was subsequently corrupted by flipping the labels of a subset of the datapoints from one class to the other. The goal is to estimate a linear separating plane to predict the class labels of test data. We use the cross-entropy loss function in (11) and apply the proposed Rrm method with an upper bound $\tilde\epsilon$. For comparison, we also use the standard Erm and the robust Sever methods.

Tables 1 (Erm), 2 (Sever) and 3 (Rrm) summarize the results using the confusion matrix as the metric. The classification accuracy for the Rrm method is visibly higher than that of Erm and Sever for one of the two classes.

## 6 Conclusion

We proposed a general risk minimization approach which provides robustness to a wide range of statistical learning problems in cases where a fraction of the observed data comes from a corrupting/adversarial distribution. Unlike existing robust methods, our approach neither assumes knowledge of the said fraction nor depends on any specific scoring functions and thresholding techniques to remove the corrupting points from data as are used in existing literature. We illustrated the wide applicability and good performance of our method by testing it on several classical supervised and unsupervised statistical learning problems using both simulated and real data.

## References

1. Breast Cancer Wisconsin UCI repository.
2. Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 284–293, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
3. Pranjal Awasthi, Maria Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, January 2017.
4. Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2110–2119, 2017.
5. Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
6. Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.
7. Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 47–60, New York, NY, USA, 2017. ACM.
8. Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
9. Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1596–1606, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
10. Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 999–1008. JMLR.org, 2017.
11. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
12. Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1, 2014.
13. Luigi Grippo and Marco Sciandrone. On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Operations Research Letters, 26(3):127–136, 2000.
14. Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. IEEE, 2017.
15. Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
16. Peter J. Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.
17. Peter J. Huber. Robust statistics. Springer, 2011.
18. Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(Dec):2715–2740, 2009.
19. Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai, and Matías Salibián-Barrera. Robust statistics: theory and methods (with R). John Wiley & Sons, 2019.
20. Andrea Paudice, Luis Muñoz-González, Andras Gyorgy, and Emil C. Lupu. Detection of adversarial training examples in poisoning attacks through anomaly detection. arXiv preprint arXiv:1802.03041, 2018.