# Outlier Detection Using Distributionally Robust Optimization under the Wasserstein Metric

We present a Distributionally Robust Optimization (DRO) approach to outlier detection in a linear regression setting, where the closeness of probability distributions is measured using the Wasserstein metric. Training samples contaminated with outliers skew the regression plane computed by least squares and thus impede outlier detection. Classical approaches, such as robust regression, remedy this problem by downweighting the contribution of atypical data points. In contrast, our Wasserstein DRO approach hedges against a family of distributions that are close to the empirical distribution. We show that the resulting formulation encompasses a class of models, which include the regularized Least Absolute Deviation (LAD) as a special case. We provide new insights into the regularization term and give guidance on the selection of the regularization coefficient from the standpoint of a confidence region. We establish two types of performance guarantees for the solution to our formulation under mild conditions. One is related to its out-of-sample behavior, and the other concerns the discrepancy between the estimated and true regression planes. Extensive numerical results demonstrate the superiority of our approach to both robust regression and the regularized LAD in terms of estimation accuracy and outlier detection rates.

## Authors

• 7 publications
• 23 publications
06/10/2020

### Robustified Multivariate Regression and Classification Using Distributionally Robust Optimization under the Wasserstein Metric

We develop Distributionally Robust Optimization (DRO) formulations for M...
03/21/2019

### Convergence of Parameter Estimates for Regularized Mixed Linear Regression Models

We consider Mixed Linear Regression (MLR), where training data have bee...
12/22/2020

### Probabilistic Outlier Detection and Generation

A new method for outlier detection and generation is introduced by lifti...
06/10/2020

### Robust Grouped Variable Selection Using Distributionally Robust Optimization

We propose a Distributionally Robust Optimization (DRO) formulation with...
09/27/2021

### Distributionally Robust Multiclass Classification and Applications in Deep CNN Image Classifiers

We develop a Distributionally Robust Optimization (DRO) formulation for ...
10/11/2019

### Robust Hierarchical-Optimization RLS Against Sparse Outliers

This paper fortifies the recently introduced hierarchical-optimization r...
08/20/2021

### Distributionally Robust Learning

This monograph develops a comprehensive statistical learning framework t...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider a linear regression model with response

, predictor vector

, regression coefficient and error :

 y=x′β∗+ϵ.

Given samples , we are interested in estimating . The minimizes the sum of squared residuals , and works well if all the samples are generated from the underlying true model. However, when faced with adversarial perturbations in the training data, the OLS estimator will deviate from the true regression plane to accommodate the noise. Alternatively, one can choose to minimize the sum of absolute residuals , as done in Least Absolute Deviation (LAD), to mitigate the influence of large residuals. Another commonly used approach for hedging against outliers is M-estimation (Huber, 1964, 1973)

, which minimizes a symmetric loss function

of the residuals in the form , that downweights the influence of samples with large absolute residuals. Several choices for include the Huber function (Huber, 1964, 1973), the Tukey’s Biweight function (Rousseeuw and Leroy, 2005), the logistic function (Coleman et al., 1980), the Talwar function (Hinich and Talwar, 1975), and the Fair function (Fair, 1974).

Both LAD and M-estimation are not resistant to large deviations in the predictors. For contamination present in the predictor space, high breakdown value methods are required. Examples include the Least Median of Squares (LMS) (Rousseeuw, 1984) which minimizes the median of the absolute residuals, the Least Trimmed Squares (LTS) (Rousseeuw, 1985) which minimizes the sum of the smallest squared residuals, and S-estimation (Rousseeuw and Yohai, 1984) which has a higher statistical efficiency than LTS with the same breakdown value. A combination of the high breakdown value method and M-estimation is the MM-estimation (Yohai, 1987). It has a higher statistical efficiency than S-estimation. We refer the reader to the book of Rousseeuw and Leroy (2005) for an elaborate description of these robust regression methods.

The aforementioned robust estimation procedures focus on modifying the objective function in a heuristic way with the intent of minimizing the effect of outliers. A more rigorous line of research explores the underlying stochastic program that leads to the sample-based estimation procedures. For example, the OLS objective can be viewed as minimizing the expected squared residual under the uniform empirical distribution over the samples. It has been well recognized that optimizing under the empirical distribution yields estimators that are sensitive to perturbations in the data and suffer from overfitting. The reason is that, when the data

is adversarially corrupted by outliers, the observed samples are not representative enough to encode the true underlying uncertainty of the data. But on the other hand, the samples are typically the only information available. Instead of equally weighting all the samples as in the empirical distribution, we may wish to include more informative distributions that “drive out” the corrupted samples. One way to realize this is to hedge the expected loss against a family of distributions that include the true data-generating mechanism with a high confidence, which is called Distributionally Robust Optimization (DRO). DRO minimizes the worst-case expected loss over a probabilistic ambiguity set that is constructed from the observed samples and characterized by certain known properties of the true data-generating distribution. For example, Mehrotra and Zhang (2014) study the distributionally robust least squares problem with

defined through either moment constraints, norm bounds with moment constraints, or a confidence region over a reference probability measure. Compared to the single distribution-based stochastic optimization, DRO often results in better out-of-sample performance due to its distributional robustness.

The existing literature on DRO can be split into two main branches according to the way in which is defined. One is through a moment ambiguity set, which contains all distributions that satisfy certain moment constraints (see Popescu, 2007; Delage and Ye, 2010; Goh and Sim, 2010; Zymler et al., 2013; Wiesemann et al., 2014). In many cases it leads to a tractable DRO problem but has been criticized for yielding overly conservative solutions (Wang et al., 2016). The other is to define as a ball of distributions using some probabilistic distance functions such as the -divergences (Bayraksan and Love, 2015), which include the Kullback-Leibler (KL) divergence (Hu and Hong, 2013; Jiang and Guan, 2015) as a special case, the Prokhorov metric (Erdoğan and Iyengar, 2006), and the Wasserstein distance (Esfahani and Kuhn, 2017; Gao and Kleywegt, 2016; Zhao and Guan, 2015; Luo and Mehrotra, 2017; Blanchet and Murthy, 2016). Deviating from the stochastic setting, there are also some works focusing on deterministic robustness. El Ghaoui and Lebret (1997) consider the least squares problem with unknown but bounded, non-random disturbance and solve it in polynomial time. Xu et al. (2010) study the robust linear regression problem with norm-bounded feature perturbation and show that it is equivalent to the -regularized regression. See Yang and Xu (2013); Bertsimas and Copenhaver (2017) which also use a deterministic robustness.

In this paper we consider a DRO problem with containing distributions that are close to the discrete empirical distribution in the sense of Wasserstein distance. The reason for choosing the Wasserstein metric is two-fold. On one hand, the Wasserstein ambiguity set is rich enough to contain both continuous and discrete relevant distributions, while other metrics such as the KL divergence, exclude all continuous distributions if the nominal distribution is discrete (Esfahani and Kuhn, 2017; Gao and Kleywegt, 2016). Furthermore, considering distributions within a KL distance from the empirical, does not allow for probability mass outside the support of the empirical distribution. On the other hand, measure concentration results guarantee that the Wasserstein set contains the true data-generating distribution with high confidence for a sufficiently large sample size (Fournier and Guillin, 2015). Moreover, the Wasserstein metric takes into account the closeness between support points while other metrics such as the

-divergence only consider the probabilities on these points. The image retrieval example in

Gao and Kleywegt (2016) suggests that the probabilistic ambiguity set constructed based on the KL divergence prefers the pathological distribution to the true distribution, whereas the Wasserstein distance does not exhibit such a problem. The reason lies in that -divergence does not incorporate a notion of closeness between two points, which in the context of image retrieval represents the perceptual similarity in color.

Our DRO problem minimizes the worst-case absolute residual over a Wasserstein ball of distributions, and could be relaxed to the following form:

 infβ1NN∑i=1|yi−x′iβ|+ϵ∥(−β,1)∥∗, (1)

where is the radius of the Wasserstein ball, and is the dual norm of the norm space where the Wasserstein metric is defined on. Formulation (1) incorporates a wide class of models whose specific form depends on the notion of transportation cost embedded in the Wasserstein metric (see Section 2

). Although the Wasserstein DRO formulation simply reduces to regularized regression models, we want to emphasize a few new insights brought by this methodology. First, the regularization term controls the conservativeness of the Wasserstein set, or the amount of ambiguity in the data, which differentiates itself from the heuristically added regularizers in traditional regression models that serve the purpose of preventing overfitting, error/variance reduction, or sparsity recovery. Second, the regularization term is determined by the dual norm of the regression coefficient, which controls the

growth rate of the -loss function, and the radius of the Wasserstein set. This connection provides guidance on the selection of the regularization coefficient and may lead to significant computational savings compared to cross-validation. DRO essentially enables new and more accurate interpretations of the regularizer, and establishes its dependence on the growth rate of the loss, the underlying metric space and the reliability of the observed samples.

The connection between robustness and regularization has been established in several works. The earliest one may be credited to El Ghaoui and Lebret (1997), who show that minimizing the worst-case squared residual within a Frobenius norm-based perturbation set is equivalent to Tikhonov regularization. In more recent works, using properly selected uncertainty sets, Xu et al. (2010) has shown the equivalence between robust linear regression with feature perturbations and the Least Absolute Shrinkage and Selection Operator (LASSO). Yang and Xu (2013) extend this to more general LASSO-like procedures, including versions of the grouped LASSO. Bertsimas and Copenhaver (2017) give a comprehensive characterization of the conditions under which robustification and regularization are equivalent for regression models with deterministic norm-bounded perturbations on the features. For classification problems, Xu et al. (2009)

show the equivalence between the regularized support vector machines (SVMs) and a robust optimization formulation, by allowing potentially correlated disturbances in the covariates.

consider a robust version of logistic regression under the assumption that the probability distributions under consideration lie in a Wasserstein ball, and they show that the regularized logistic regression is a special case of this robust formulation. Recently,

Shafieezadeh-Abadeh et al. (2017); Gao et al. (2017) have provided a unified framework for connecting the Wasserstein DRO with regularized learning procedures, for various regression and classification models.

Our work is motivated by the problem of identifying patients who receive an abnormally high radiation exposure in CT exams, given the patient characteristics and exam-related variables (Chen et al., 2018)

. This could be casted as an outlier detection problem; specifically, estimating a robustified regression plane that is immunized against outliers and learns the underlying true relationship between radiation dose and the relevant predictors. We focus on robust learning of the parameter in regression models under distributional perturbations residing within a Wasserstein ball. While the applicability of the Wasserstein DRO methodology is not restricted to regression analysis

(Sinha et al., 2017; Gao et al., 2017; Shafieezadeh-Abadeh et al., 2017), or a particular form of the loss function (as long as it satisfies certain smoothness conditions (Gao et al., 2017)), we focus on the absolute residual loss in linear regression in light of our motivating application and for the purpose of enhancing robustness. Our contributions may be summarized as follows:

1. We develop a DRO approach to robustify linear regression using an loss function and an ambiguity set around the empirical distribution of the training samples defined based on the Wasserstein metric. The formulation is general enough to include any norm-induced Wasserstein metric and incorporate additional regularization constraints on the regression coefficients (e.g., -norm constraints). It provides an intuitive connection between the amount of ambiguity allowed and a regularization penalty term in the robust formulation, which provides a natural way to adjust the latter.

2. We establish novel performance guarantees on both the out-of-sample loss (prediction bias) and the discrepancy between the estimated and the true regression coefficients (estimation bias). Our guarantees manifest the role of the regularizer, which is related to the dual norm of the regression coefficients, in bounding the biases and are in concert with the theoretical foundation that leads to the regularized problem. The generalization error bound, in particular, builds a connection between the loss function and the form of the regularizer via Rademacher complexity, providing a rigorous explanation for the commonly observed good out-of-sample performance of regularized regression. On the other hand, the estimation error bound corroborates the validity of the -loss function, which tends to incur a lower estimation bias than other candidates such as the and losses. Our results are novel in the robust regression setting and different from earlier work in the DRO literature, enabling new perspectives and interpretations of the norm-based regularization, and providing justifications for the -loss based learning algorithms.

3. We empirically explore three important aspects of the Wasserstein DRO formulation, including the advantages of the -loss function, the selection of a proper norm for the Wasserstein metric, and the implication of penalizing the extended regression coefficient , through comparing with a series of regression models on a number of synthetic datasets. We show the superiority of the Wasserstein DRO approach, presenting a thorough analysis, under four different experimental setups. We also consider the application of our methodology to outlier detection, and compare with M-estimation in terms of the ability of identifying outliers (ROC (Receiver Operating Characteristic) curves). The Wasserstein DRO formulation achieves significantly higher AUC (Area Under Curve) values.

The rest of the paper is organized as follows. In Section 2, we introduce the Wasserstein metric and derive the general Wasserstein DRO formulation in a linear regression framework. Section 3 establishes performance guarantees for both the general formulation and the special case where the Wasserstein metric is defined on the -norm space. The numerical experimental results are presented in Section 4. We conclude the paper in Section 5.

Notational conventions: We use boldfaced lowercase letters to denote vectors, ordinary lowercase letters to denote scalars, boldfaced uppercase letters to denote matrices, and calligraphic capital letters to denote sets. denotes expectation and probability of an event. All vectors are column vectors. For space saving reasons, we write to denote the column vector , where is the dimension of . We use prime to denote the transpose of a vector, for the general norm operator, for the norm, for the norm, and for the infinity norm. denotes the set of probability measures supported on . denotes the -th unit vector, the vector of ones, a vector of zeros, and

the identity matrix. Given a norm

on , the dual norm is defined as: . For a function , its convex conjugate is defined as: where denotes the domain of the function .

## 2 Problem Statement and Justification of Our Formulation

Consider a linear regression problem where we are given a predictor/feature vector

, and a response variable

. Our goal is to obtain an accurate estimate of the regression plane that is robust with respect to the adversarial perturbations in the data. We consider an -loss function , motivated by the observation that the absolute loss function is more robust to large residuals than the squared loss (see Fig. 1). Moreover, the estimation error analysis presented in Section 3.2 suggests that the -loss function leads to a smaller estimation bias than others. Our Wasserstein DRO problem using the -loss function is formulated as:

 (2)

where is the regression coefficient vector that belongs to some set . could be , or if we wish to induce sparsity, with being some pre-specified number. is the probability distribution of , belonging to some set which is defined as:

 Ω≜{Q∈P(Z):Wp(Q, ^PN)≤ϵ},

where is the set of possible values for ; is the space of all probability distributions supported on ; is a pre-specified radius of the Wasserstein ball; and is the order- Wasserstein distance between and (see definition in (3)), with the uniform empirical distribution over samples. The formulation in (2) is robust since it minimizes over the regression coefficients the worst case expected loss, that is, the expected loss maximized over all probability distributions in the ambiguity set .

Before deriving a tractable reformulation for (2), let us first define the Wasserstein metric. Let be a metric space where is a set and is a metric on . The Wasserstein metric of order defines the distance between two probability distributions and in the following way:

 (3)

where

is the joint distribution of

and with marginals and , respectively. The Wasserstein distance between and represents the cost of an optimal mass transportation plan, where the cost is measured through the metric . The order should be selected in such a way as to ensure that the worst-case expected loss is meaningfully defined, i.e.,

 EQ[hβ(x,y)]<∞, ∀Q∈Ω. (4)

Notice that the ambiguity set is centered at the empirical distribution and has radius . It may be desirable to translate (4) into:

 ∣∣EQ[hβ(x,y)]−E^PN[hβ(x,y)]∣∣<∞, ∀Q∈Ω. (5)

We want to relate (5) with the Wasserstein distance , which is no larger than for all . The LHS of (5) could be written as:

 ∣∣EQ[hβ(x,y)]−E^PN[hβ(x,y)]∣∣ (6) = ∣∣∣∫Zhβ(x1,y1)Q(d(x1,y1))−∫Zhβ(x2,y2)^PN(d(x2,y2))∣∣∣ = ∣∣∣∫Zhβ(x1,y1)∫ZΠ0(d(x1,y1),d(x2,y2))−∫Zhβ(x2,y2)∫ZΠ0(d(x1,y1),d(x2,y2))∣∣∣ ≤

where is the joint distribution of and with marginals and , respectively. Comparing (6) with (3), we see that for (5) to hold, the following quantity which characterizes the growth rate of the loss function needs to be bounded:

 GRhβ((x1,y1),(x2,y2))≜∣∣hβ(x1,y1)−hβ(x2,y2)∣∣(s((x1,y1),(x2,y2)))p, ∀(x1,y1),(x2,y2)∈Z. (7)

A formal definition of the growth rate is due to Gao and Kleywegt (2016), which takes the limit of (7) as , to eliminate its dependence on . One important aspect they have pointed out is that when the growth rate of the loss function is infinite, strong duality for the worst-case problem fails to hold, in which case the DRO problem (2) becomes intractable. Assuming that the metric is induced by some norm , the bounded growth rate requirement is expressed as follows:

 limsup∥(x1,y1)−(x2,y2)∥→∞|hβ(x1,y1)−hβ(x2,y2)|∥(x1,y1)−(x2,y2)∥p≤limsup∥(x1,y1)−(x2,y2)∥→∞|y1−x′1β−(y2−x′2β)|∥(x1,y1)−(x2,y2)∥p≤limsup∥(x1,y1)−(x2,y2)∥→∞∥(x1,y1)−(x2,y2)∥∥(−β,1)∥∗∥(x1,y1)−(x2,y2)∥p<∞, (8)

where is the dual norm of , and the second inequality is due to the Cauchy-Schwarz inequality. Notice that by taking , (8) is equivalently translated into the condition that , which we will see in Section 3 is an essential requirement to guarantee a good generalization performance for the Wasserstein DRO estimator. The growth rate essentially reveals the underlying metric space used by the Wasserstein distance. Taking leads to zero growth rate in the limit of (8), which is not desirable since it removes the Wasserstein ball structure from our formulation and renders it an optimization problem over a singleton distribution. This will be made more clear in the following analysis. We thus choose the order- Wasserstein metric with being induced by some norm to define our DRO problem.

Next, we will discuss how to convert (2) into a tractable formulation. Suppose we have independently and identically distributed realizations of , denoted by . We make the assumption that comes from a mixture of two distributions, with probability from the outlying distribution and with probability from the true distribution . Recall that

is the discrete uniform distribution over the

samples. Our goal is to generate estimators that are consistent with the true distribution . We claim that when is small, if the Wasserstein ball radius is chosen judiciously, the true distribution will be included in the set while the outlying distribution will be excluded. To see this, consider a simple example where is a discrete distribution that assigns equal probability to data points equally spaced between and , and assigns probability to two data points and . We generate samples and plot the Wasserstein distances from for both and .

From Fig. 2 we observe that for below , the true distribution is closer to whereas the outlying distribution is further away. If the radius is chosen between the red () and blue () lines, the Wasserstein ball that we are hedging against will exclude the outlying distribution and the resulting estimator will be robust to the adversarial perturbations. Moreover, as becomes smaller, the gap between the red and blue lines becomes larger. One implication from this observation is that as the data becomes purer, the radius of the Wasserstein ball tends to be smaller, and the confidence in the observed samples is higher. For large values, the DRO formulation seems to fail. However, as outliers are defined to be the data points that do not conform to the majority of data, we can safely claim that is the distribution of the minority and is always below .

We now look at the inner supremum in (2). Esfahani and Kuhn (2017, Theorem 6.3) show that when the set is closed and convex, and the loss function is convex in ,

 supQ∈ΩEQ[hβ(x,y)]≤κϵ+1NN∑i=1hβ(xi,yi), ∀ϵ≥0, (9)

where , with the convex conjugate function of . Through (9), we can relax problem (2) by minimizing the right hand side of (9) instead of the worst-case expected loss. Moreover, as shown in Esfahani and Kuhn (2017), (9) becomes an equality when . In Theorem 2.1, we compute the value of for the specific loss function we use. The proof of this Theorem and all results hereafter are included in Appendix A.

###### Theorem 2.1.

Define , where is the dual norm of , and is the conjugate function of . When the loss function is , we have .

Due to Theorem 2.1, (2) could be formulated as the following optimization problem:

 infβ∈Bϵ∥(−β,1)∥∗+1NN∑i=1|yi−x′iβ|. (10)

Note that the regularization term of (10) is the product of the growth rate of the loss and the Wasserstein ball radius. The growth rate is closely related to the way the Wasserstein metric defines the transportation costs on the data . As mentioned earlier, a zero growth rate diminishes the effect of the Wasserstein distributional uncertainty set, and the resulting formulation would simply be an empirical loss minimization problem. The parameter controls the conservativeness of the formulation, whose selection depends on the sample size, the dimensionality of the data, and the confidence that the Wasserstein ball contains the true distribution (see eq. (8) in Esfahani and Kuhn, 2017). Roughly speaking, when the sample size is large enough, and for a fixed confidence level, is inversely proportional to .

Formulation (10) incorporates a class of models whose specific form depends on the norm space we choose, which could be application-dependent and practically useful. For example, when the Wasserstein metric is induced by and the set is the intersection of a polyhedron with convex quadratic inequalities, (10) is a convex quadratic problem which can be solved to optimality very efficiently. Specifically, it could be converted to:

 mina, b1,…,bN, β aϵ+1NN∑i=1bi (11) s.t. ∥β∥22+1≤a2, yi−x′iβ≤bi, i=1,…,N, −(yi−x′iβ)≤bi, i=1,…,N, a, bi≥0, i=1,…,N, β∈B.

When the Wasserstein metric is defined using and the set is a polyhedron, (10

) is a linear programming problem:

 mina, b1,…,bN, β aϵ+1NN∑i=1bi (12) s.t. a≥β′ei, i=1,…,m−1, a≥−β′ei, i=1,…,m−1, yi−x′iβ≤bi, i=1,…,N, −(yi−x′iβ)≤bi, i=1,…,N, a≥1, bi≥0, i=1,…,N, β∈B.

More generally, when the coordinates of differ from each other substantially, a properly chosen, positive definite weight matrix could scale correspondingly different coordinates of by using the -weighted norm:

 ∥(x,y)∥M=√(x,y)′M(x,y).

It can be shown that (10) in this case becomes:

 infβ∈Bϵ√(−β,1)′M−1(−β,1)+1NN∑i=1|yi−x′iβ|. (13)

We note that this Wasserstein DRO framework could be applied to a broad class of loss functions and the tractable reformulations have been derived in Shafieezadeh-Abadeh et al. (2017); Gao et al. (2017) for regression and classification models. We adopt the absolute residual loss in this paper to enhance the robustness of the formulation, which is the focus of our work and serves the purpose of estimating robust parameters that are immunized against perturbations/outliers. Notice that (10) coincides with the regularized LAD models (Pollard, 1991; Wang et al., 2006), except that we are regularizing a variant of the regression coefficient. We would like to highlight several novel viewpoints that are brought by the Wasserstein DRO framework and justify the value and novelty of (10). First, (10) is obtained as an outcome of a fundamental DRO formulation, which enables new interpretations of the regularizer from the standpoint of distributional robustness, and provides rigorous theoretical foundation on why the -regularizer prevents overfitting to the training data. The regularizer could be seen as a control over the amount of ambiguity in the data and reveals the reliability of the contaminated samples. Second, the geometry of the Wasserstein ball is embedded in the regularization term, which penalizes the regression coefficient on the dual Wasserstein space, with the magnitude of penalty being the radius of the ball. This offers an intuitive interpretation and provides guidance on how to set the regularization coefficient. Moreover, different from the traditional regularized LAD models that directly penalize the regression coefficient , we regularize the vector , where the takes into account the transportation cost along the direction. Penalizing only on corresponds to an infinite transportation cost along . Our model is more general in this sense, and establishes the connection between the metric space on data and the form of the regularizer.

## 3 Performance Guarantees

Having obtained a tractable reformulation for the Wasserstein DRO problem, we next establish guarantees on the predictive power and estimation quality for the solution to (10). Two types of results will be presented in this section, one of which bounds the prediction bias of the estimator on new, future data (given in Section 3.1). The other one that bounds the discrepancy between the estimated and true regression planes (estimation bias), is given in Section 3.2.

### 3.1 Out-of-Sample Performance

In this subsection we investigate generalization characteristics of the solution to (10), which involves measuring the error generated by our estimator on a new random sample . We would like to obtain estimates that not only explain the observed samples well, but, more importantly, possess strong generalization abilities. The derivation is mainly based on Rademacher complexity (see Bartlett and Mendelson, 2002), which is a measurement of the complexity of a class of functions. We would like to emphasize the applicability of such a proof technique to general loss functions, as long as their empirical Rademacher complexity could be bounded. The bound we derive for the prediction bias depends on both the sample average loss (the training error) and the dual norm of the regression coefficient (the regularizer), which corroborates the validity and necessity of our regularized formulation. Moreover, the generalization result also builds a connection between the loss function and the form of the regularizer via Rademacher complexity, which enables new insights into the regularization term and explains the commonly observed good out-of-sample performance of regularized regression in a rigorous way. We first make several mild assumptions that are needed for the generalization result.

###### Assumption A.

The norm of the uncertainty parameter is bounded above almost surely, i.e.,

###### Assumption B.

The dual norm of is bounded above within the feasible region, namely,

 supβ∈B∥(−β,1)∥∗=¯B.

Under these two assumptions, the absolute loss could be bounded via the Cauchy-Schwarz inequality.

###### Lemma 3.1.

For every feasible , it follows

 |y−x′β|≤¯BR,almost % surely.

With the above result, the idea is to bound the generalization error using the empirical Rademacher complexity of the following class of loss functions:

 H={(x,y)↦hβ(x,y):hβ(x,y)=|y−x′β|, β∈B}.

We need to show that the empirical Rademacher complexity of , denoted by , is upper bounded. The following result, similar to Lemma 3 in Bertsimas et al. (2015), provides a bound that is inversely proportional to the square root of the sample size.

###### Lemma 3.2.
 RN(H)≤2¯BR√N.

Let be an optimal solution to (10), obtained using the samples , . Suppose we draw a new i.i.d. sample . In Theorem 3.3 we establish bounds on the error .

###### Theorem 3.3.

Under Assumptions A and B, for any , with probability at least with respect to the sampling,

 E[|y−x′^β|]≤1NN∑i=1|yi−x′i^β|+2¯BR√N+¯BR√8log(2/δ)N , (14)

and for any ,

 P(|y−x′^β|≥1NN∑i=1|yi−x′i^β|+ζ)≤1N∑Ni=1|yi−x′i^β|+2¯BR√N+¯BR√8log(2/δ)N1N∑Ni=1|yi−x′i^β|+ζ. (15)

There are two probability measures in the statement of Theorem 3.3. One is related to the new data , while the other is related to the samples . The expectation in (14) (and the probability in (15)) is taken w.r.t. the new data . For a given set of samples, (14) (and (15)) holds with probability at least w.r.t. the measure of samples. Theorem 3.3 essentially says that given typical samples, the expected loss on new data using our Wasserstein DRO estimator could be bounded above by the average sample loss plus extra terms that depend on the supremum of (our regularizer), and are proportional to . This result validates the dual norm-based regularized regression from the perspective of generalization ability, and could be generalized to any bounded loss function. It also provides implications on the form of the regularizer. For example, if given an -loss function, the dependency on for the generalization error bound will be of the form , which suggests using

as a regularizer, reducing to a variant of ridge regression

(Hoerl and Kennard, 1970) for induced Wasserstein metric.

We also note that the upper bounds in (14) and (15) do not depend on the dimension of . This dimensionality-free characteristic implies direct applicability of our Wasserstein approach to high-dimensional settings and is particularly useful in many real applications where, potentially, hundreds of features may be present. Theorem 3.3 also provides guidance on the number of samples that are needed to achieve satisfactory out-of-sample performance.

###### Corollary 3.4.

Suppose is the optimal solution to (10). For a fixed confidence level and some threshold parameter , to guarantee that the percentage difference between the expected absolute loss on new data and the sample average loss is less than , that is,

 E[|y−x′^β|]−1N∑Ni=1|yi−x′i^β|¯BR≤τ,

the sample size must satisfy

 N≥[2(1+√2log(2/δ) )τ]2. (16)
###### Corollary 3.5.

Suppose is the optimal solution to (10). For a fixed confidence level , some and , to guarantee that

 P(|y−x′^β|−1N∑Ni=1|yi−x′i^β|¯BR≥γ)≤τ,

the sample size must satisfy

 N≥[2(1+√2log(2/δ) )τ⋅γ+τ−1]2, (17)

provided that  .

In Corollaries 3.4 and 3.5, the sample size is inversely proportional to both and , which is reasonable since the more confident we want to be, the more samples we need. Moreover, the smaller is, the stricter a requirement we impose on the performance, and thus more samples are needed.

### 3.2 Discrepancy between Estimated and True Regression Planes

In addition to the generalization performance, we are also interested in the accuracy of the estimator. In this section we seek to bound the difference between the estimated and true regression coefficients, under a certain distributional assumption on . Throughout the section we will use to denote the estimated regression coefficients, obtained as an optimal solution to (18), and for the true (unknown) regression coefficients. The bound we will derive turns out to be related to the Gaussian width (see definition in the Appendix) of the unit ball in , the sub-Gaussian norm of the uncertainty parameter , as well as the geometric structure of the true regression coefficients. We note that this proof technique may be applied to several other loss functions, e.g., and losses, with slight modifications. However, we will see that the -loss function incurs a relatively low estimation bias compared to others, further demonstrating the superiority of our absolute error minimization formulation.

To facilitate the analysis, we will use the following equivalent form of problem (10):

 minβ ∥(−β,1)∥∗ (18) s.t. ∥(−β,1)′Z∥1≤γN, β∈B,

where is the matrix with columns , and is some exogenous parameter related to . One can show that for properly chosen , (18) produces the same solution with (10) (Bertsekas, 1999). (18) is similar to (11) in Chen and Banerjee (2016), with the difference lying in that we impose a constraint on the error instead of the gradient, and we consider a more general notion of norm on the coefficient. On the other hand, due to their similarity, we will follow the line of development in Chen and Banerjee (2016). Still, our analysis is self-contained and the bound we obtain is in a different form, which provides meaningful insights into our specific problem. We list below the assumptions that are needed to bound the estimation error.

###### Assumption C.

The norm of is bounded above within the feasible region, namely,

 supβ∈B∥(−β,1)∥2=¯B2.
###### Assumption D (Restricted Eigenvalue Condition).

For some set and some positive scalar , where is the unit sphere in the -dimensional Euclidean space,

 infv∈A(β∗)v′ZZ′v≥α––.
###### Assumption E.

The true coefficient is a feasible solution to (18), i.e.,

 ∥Z′(−β∗,1)∥1≤γN,β∗∈B.
###### Assumption F.

is a centered sub-Gaussian random vector (see definition in the Appendix), i.e., it has zero mean and satisfies the following condition:

###### Assumption G.

The covariance matrix of

has bounded positive eigenvalues. Set

; then,

 0<λmin≜λmin(Γ)≤λmax(Γ)≜λmax<∞.

Notice that both in Assumption D and in Assumption E are related to the random observation matrix . A probabilistic description for these two quantities will be provided later. We next present a preliminary result, similar to Lemma 2 in Chen and Banerjee (2016), that bounds the -norm of the estimation bias in terms of a quantity that is related to the geometric structure of the true coefficients. This result gives a rough idea on the factors that affect the estimation error, and shows the advantages of using the -loss from the perspective of its dual norm. The bound derived in Theorem 3.6 is crude in the sense that it is a function of several random parameters that are related to the random observation matrix . This randomness will be described in a probabilistic way in the subsequent analysis.

###### Theorem 3.6.

Suppose the true regression coefficient vector is and the solution to (18) is . For the set , under Assumptions A, D, and E, we have:

 ∥^β−β∗∥2≤2RγNα––Ψ(β∗), (19)

where .

Notice that the bound in (19) does not explicitly depend on the sample size . If we change to the -loss function, problem (18) will become:

 minβ ∥(−β,1)∥∗ s.t. ∥(−β,1)′Z∥2≤γN, β∈B.

The proof of Theorem 3.6 still applies with slight modification. We will find out that in the case of -loss, the estimation error bound is in the following form:

 ∥^β−β∗∥2≤2R√NγNα––Ψ(β∗).

Similarly, the -loss, which considers only the maximum absolute loss among the samples, turns (18) into:

 minβ ∥(−β,1)∥∗ s.t. ∥(−β,1)′Z∥∞≤γN, β∈B.

The corresponding bound becomes:

 ∥^β−β∗∥2≤2RNγNα––Ψ(β∗).

We see that by using either or -loss, an explicit dependency on is introduced. As a result, the estimation error bounds become worse. The reason is that for the -loss function, its dual norm operator only picks out the maximum absolute coordinate and thus avoids the dependence on the dimension, which in our case is the sample size (see Eq.(28)), whereas other norms, e.g., -norm, sum over all the coordinates and thus introduce a dependence on .

As mentioned earlier, (19) provides a random upper bound, revealed in and , that depends on the randomness in . We therefore would like to replace these two parameters by non-random quantities. The acts as the minimum eigenvalue of the matrix restricted to a subspace of , and thus a proper substitute should be related to the minimum eigenvalue of the covariance matrix of , i.e., the matrix (cf. Assumption G), given that is zero mean. See Lemmas 3.7, 3.8 and 3.9 for the derivation.

###### Lemma 3.7.

Consider the set , where is defined as in Theorem 3.6, and . Under Assumptions F and G, when the sample size , where , and is the Gaussian width of , with probability at least , we have

where and are positive constants.

Note that the sample size requirement stated in Lemma 3.7 depends on the Gaussian width of , where relates to . The following lemma shows that their Gaussian widths are also related. This relation is built upon the square root of the eigenvalues of , which measures the extent to which expands .

###### Lemma 3.8 (Lemma 4 in Chen and Banerjee (2016)).

Let be the -norm of a standard Gaussian random vector , and , be defined as in Lemma 3.7. Then, under Assumption G,

 w(AΓ)≤C3μ0√λmaxλmin(w(A(β∗))+3),

for some positive constant .

Combining Lemmas 3.7 and 3.8, and expressing the covariance matrix using its eigenvalues, we arrive at the following result.

###### Corollary 3.9.

Under Assumptions F and G, and the conditions in Lemmas 3.7 and 3.8, when , with probability at least ,

 v′ZZ′v≥Nλmin2,∀ v∈A(β∗),

where and are positive constants.

Next we derive the smallest possible value of such that is feasible. The derivation uses the dual norm operator of the -loss, resulting in a bound that depends on the Gaussian width of the unit ball in the dual norm space (). See Lemma 3.10 for details.

###### Lemma 3.10.

Under Assumptions C and F, for any feasible , with probability at least ,

 ∥(−β,1)′Z∥1≤Cμ¯B2w(Bu),

where is the unit ball of norm , , and positive constants.

We note that for other loss functions, e.g., the and losses, similar results can be obtained, where is defined to be the unit -ball in , with being the dual norm of the loss. Combining Theorem 3.6, Corollary 3.9 and Lemma 3.10, we have the following main performance guarantee result that bounds the estimation bias of the solution to (18).

###### Theorem 3.11.

Under Assumptions A, C, D, E, F, G, and the conditions of Theorem 3.6, Corollary 3.9 and Lemma 3.10, when , with probability at least ,

 ∥^β−β∗∥2≤¯CR¯B2μNλminw(Bu)Ψ(β∗). (20)

From (20) we see that the bias is decreased as the sample size increases and the uncertainty embedded in (revealed in and ) is reduced. The estimation error bound depends on the geometric structure of the true coefficients, defined using the dual norm space of the Wasserstein metric, the Gaussian width of the unit -ball in , and the minimum eigenvalue of the covariance matrix of , with a convergence rate for the -loss we applied. As mentioned earlier, other loss functions may incur a dependence on in the numerator of the bound, thus resulting in a slower convergence rate, which substantiates the benefit of using an -loss function.

## 4 Simulation Experiments on Synthetic Datasets

In this section we will explore the robustness of the Wasserstein formulation in terms of its Absolute Deviation (AD) loss function and the dual norm regularizer on the extended regression coefficient . Recall that our Wasserstein formulation is in the following form:

 infβ∈B1NN∑i=1|yi−x′iβ|+ϵ∥(−β,1)∥∗. (21)

We will focus on the following three aspects of this formulation:

1. How to choose a proper norm for the Wasserstein metric?

2. Why do we penalize the extended regression coefficient rather than ?

3. What is the advantage of the AD loss compared to the Squared Residuals (SR) loss?

To answer Question 1, we will connect the choice of for the Wasserstein metric with the characteristics/structures of the data . Specifically, we will design two sets of experiments, one with a dense regression coefficient , where all coordinates of play a role in determining the value of the response , and another with a sparse implying that only a few predictors are relevant/important in predicting . Two Wasserstein formulations will be tested and compared, one induced by the (Wasserstein ), which leads to an -regularizer in (21), and the other one induced by the (Wasserstein ) and resulting in an -regularizer in (21). Intuitively, and based on the past experience in implementing the regularization techniques, the Wasserstein should outperform the Wasserstein in the dense setting, while in the sparse setting, the reverse is true. Researchers have well identified the sparsity inducing property of the -regularizer and provided a nice geometrical interpretation for it (Friedman et al., 2001). Here, we try to offer a different explanation from the perspective of the Wasserstein DRO formulation, through projecting the sparsity of onto the space and establishing a sparse distance metric that only extracts a subset of coordinates from to measure the closeness between samples.

For the second question, we first note that if the Wasserstein metric is induced by the following metric :

 sc(x,y)=∥(x,cy)∥2,

for a positive constant , then as , the resulting Wasserstein DRO formulation becomes:

 infβ∈B1NN∑i=1|yi−x′iβ|+ϵ∥β∥2,

which is the -regularized LAD. This can be proved by recognizing that , with a diagonal matrix whose diagonal elements are , and then applying (13). Alternatively, if we let

 sc(x,y)=∥(x,cy)∥∞,

it can be shown that as , the corresponding Wasserstein formulation becomes:

 infβ∈B1NN∑i=1|yi−x′iβ|+ϵ∥β∥1,

which is the -regularized LAD (see proof in the Appendix). It follows that regularizing over implies an infinite transportation cost along . In other words, for two data points and , if , then they are considered to be infinitely far away. By contrast, our Wasserstein formulation, which regularizes over the extended regression coefficient , stems from a finite cost along that is equally weighted with . We will see the disadvantages of penalizing only in the analysis of the experimental results.

To answer Question 3, we will compare with several commonly used regression models that employ the SR loss function, e.g., ridge regression (Hoerl and Kennard, 1970), LASSO (Tibshirani, 1996), and Elastic Net (EN) (Zou and Hastie, 2005). We will also compare against M-estimation (Huber, 1964, 1973), which uses a variant of the SR loss and is equivalent to solving a weighted least squares problem, where the weights are determined by the residuals. These models will be compared under two different experimental setups, one involving adversarial perturbations in both and , and the other with perturbations only in . The purpose is to investigate the behavior of these approaches when the noise in is substantially reduced. As shown by Fig. 1, compared to the SR loss, the AD loss is less vulnerable to large residuals, and hence, it is advantageous in the scenarios where large perturbations appear in . We are interested in studying whether its performance is consistently good when the corruptions appear mainly in .

We next describe the data generation process. Each training sample has a probability of being drawn from the outlying distribution, and a probability of being drawn from the true (clean) distribution. Given the true regression coefficient , we generate the training data as follows: