1 Introduction
Most existing work on robust statistical inference addresses distributional robustness issues such as outliers or model uncertainties (Huber, 1964; Huber and Ronchetti, 2009; Hampel et al., 2009). As machine learning and statistical inference algorithms are increasingly used in safety-critical and security-related applications (Huval et al., 2015; Buczak and Guven, 2016; Litjens et al., 2016; Nelson et al., 2008; Soule et al., 2005; Stamp, 2018; Suthaharan, 2014; Vallon et al., 2017; Hoermann et al., 2018), there is a growing interest in investigating the robustness of statistical inference algorithms in adversarial environments. These environments pose more severe situations than those addressed in classic robust statistical inference. One such scenario is where an adversary can observe the whole dataset and then devise its attack vector to modify all entries of the data points, hoping to cause the maximum inference error or to control the inference results. One example is the adversarial example phenomenon in deep neural networks (Pimentel-Alarcon et al., 2017; Szegedy et al., 2013; Nguyen et al., 2015; Goodfellow et al., 2015; Carlini and Wagner, 2017), where an adversary can observe the whole picture and then carefully modify its pixels with the goal of fooling the classifier. As another example, it was shown in (Jagielski et al., 2018) that an adversary can modify the training data so that the model produced by linear regression is controlled by the adversary. The existence of such powerful adversaries calls for new models and methodologies for adversarially robust inference.
In a canonical statistical inference problem, one infers parameters of interest from given data points $x_1, \ldots, x_N$. Classic robust inference mainly deals with distributional robustness, i.e., the case where the shape of the true underlying distribution deviates from the assumed model (Huber, 1964; Huber and Ronchetti, 2009; Hampel et al., 2009). More specifically, let $F_\theta$ be the cumulative distribution function (CDF) of the assumed model with parameter $\theta$; classic robust inference then deals with the situation where the data points are independently and identically generated by an unknown CDF $G$ in an $\epsilon$-neighborhood $\mathcal{P}_\epsilon(F_\theta)$ of the assumed model $F_\theta$. The goal of classic robust inference is to design inference algorithms that perform well for any $G \in \mathcal{P}_\epsilon(F_\theta)$. For example, $\mathcal{P}_\epsilon(F_\theta)$ can be a Lévy neighborhood or a contamination neighborhood $\{G : G = (1-\epsilon)F_\theta + \epsilon H\}$, in which $H$ can be any probability measure. The contamination neighborhood model can be viewed as having an $\epsilon$ fraction of the data as outliers, while the Lévy neighborhood (or another related neighborhood) is useful in scenarios with model uncertainties. Various concepts, such as the influence function (IF), the breakdown point, and the change of variance, were developed to quantify the robustness of estimators against the presence of outliers.
In this paper, we consider a setup with more powerful adversaries than those considered in classic robust inference and investigate adversarial robustness. In particular, in the considered setup, after the data points $x_n$ are generated, the adversary can observe the whole dataset and then modify all data points to $x_n + \Delta x_n$, where each $\Delta x_n$ is carefully designed and depends on the whole dataset. It is easy to see that the adversary in the adversarially robust model is more powerful. In particular, for any $G$ in the classic model, the adversary in the adversarially robust model can mimic the behavior of $G$ by simply replacing the dataset with i.i.d. samples generated from $G$. Clearly, this is not the optimal strategy for the adversary in our model, as the adversary is not limited to this type of i.i.d. attack after observing the whole dataset. It can construct correlated attack signals that are based on the whole dataset. As a result, it is important to understand the following questions: 1) What is the attacker’s optimal strategy in choosing $\Delta x_n$? 2) What are the impacts of these attacks? 3) How should we design inference algorithms to minimize the impact?
In our recent work (Lai and Bayraktar, 2018), we made some progress in addressing these problems for the case of scalar parameter estimation, in which the parameter to be estimated is a scalar and each sample is also a scalar. In particular, given a dataset $\mathbf{x} = \{x_1, \ldots, x_N\}$ with $x_n$ being i.i.d. realizations of a random variable $X$ that has CDF $F_\theta$ with unknown scalar parameter $\theta$, we would like to estimate $\theta$. There is an adversary who can observe the whole dataset and can modify it to $\mathbf{x}' = \mathbf{x} + \Delta\mathbf{x}$, in which $\Delta\mathbf{x}$ is the attack vector chosen by the adversary after observing $\mathbf{x}$. Certain restrictions need to be put on $\Delta\mathbf{x}$; otherwise the estimation problem will not be meaningful. In (Lai and Bayraktar, 2018), we assume that $\frac{1}{N}\|\Delta\mathbf{x}\|_p^p \le \eta^p$, in which $\|\cdot\|_p$ is the $\ell_p$ norm. This type of constraint is reasonable and is motivated by real-life examples. For example, in generating adversarial examples in images (Pimentel-Alarcon et al., 2017; Szegedy et al., 2013; Nguyen et al., 2015; Goodfellow et al., 2015; Carlini and Wagner, 2017), the total distortion should be limited; otherwise human eyes will be able to detect such changes. The classic setup with the contamination model (Huber, 1964; Huber and Ronchetti, 2009; Hampel et al., 2009) can be viewed as a special case of our formulation by letting $p = 0$, i.e., the classic setup has a constraint on the total number of data points that the attacker can modify. For a given estimator, we would like to characterize how sensitive the estimator is to the adversarial attack. In (Lai and Bayraktar, 2018), we considered a scenario where the goal of the attacker is to maximize the estimation error caused by the attack. We introduced a concept named “adversarial influence function” (AIF) to quantify the asymptotic rate at which the attacker can introduce estimation error through its optimal attack. From the defender’s perspective, the smaller the AIF is, the more adversarially robust the estimator is. Building on the characterization of AIF, we characterized the optimal estimator, among a certain class of estimators, that minimizes AIF. From this characterization, we showed that there is a tradeoff between robustness against adversarial attacks and robustness against outliers. We further designed the optimal estimator that achieves the optimal tradeoff between these quantities for the scalar case.
In this paper, we extend our work in (Lai and Bayraktar, 2018) to the multivariate setup, in which the goal is to jointly estimate multiple parameters from vector observations. The multivariate setup includes many important cases, such as joint location-scale estimation and robust linear regression. In this multivariate setup, we have $N$ data points, each of which is a vector. These data points are realizations of a random variable $X$ that has CDF $F_\theta$ with unknown parameter vector $\theta$. We use the matrix $\mathbf{X}$ to denote the given data matrix. From this given dataset, we would like to estimate the unknown parameter $\theta$. The adversary will modify the data to $\mathbf{X}' = \mathbf{X} + \Delta\mathbf{X}$, in which $\Delta\mathbf{X}$ is the attack matrix chosen by the adversary after observing $\mathbf{X}$. Similar to the situation in the classic robust inference problem (Hubert et al., 2008), the multivariate adversarial robustness setup is significantly more challenging than the scalar case.
Firstly, the characterization of the optimal attack strategy is much more difficult. There are many more degrees of freedom for the attacker to choose from, as the dimension of $\Delta\mathbf{X}$ is $m \times N$. Furthermore, each modification will affect all components of the estimated vector in a different but coupled manner. In this paper, we focus on the class of $M$-estimators specified by $q$-dimensional functions $\psi$. For this class of estimators, we characterize the optimal attack vector and the corresponding AIF. We further simplify this general formula for robust linear regression and evaluate the adversarial robustness of various existing robust algorithms.
Secondly, the characterization of the optimal defense strategy is also much harder. For example, in the $M$-estimator case, $\psi$ is now a $q$-dimensional function, and the corresponding optimization problem of minimizing AIF becomes a coupled multi-dimensional calculus of variations problem, which is in general very challenging. In this paper, for the important case of the joint location-scale estimation problem, we show that the characterization of the optimal defense can be decoupled into two scalar problems. Building on this, we identify the optimal $M$-estimator that minimizes AIF. In addition, similar to the scalar problem, we show that there exists a tradeoff between robustness against adversarial attacks and robustness against outliers. We further characterize the optimal $M$-estimator that achieves the optimal tradeoff between these robustness metrics.
The remainder of the paper is organized as follows. In Section 2, we introduce the considered model. In Section 3, we derive general AIF results and simplify the results for robust linear regression problems. In Section 4, we focus on the special case of joint location-scale estimation problem and characterize the optimal estimators that achieve the optimal AIF for both cases with and without constraints on the robustness against outliers. In Section 5, we use several numerical examples to illustrate results derived in this paper. Section 6 provides concluding remarks.
In this section, we first introduce our problem formulation. We will then briefly review results from classic robust estimation that are directly related to our study.
2.1 Problem Formulation
We have $N$ data points $x_n \in \mathbb{R}^m$, $n = 1, \ldots, N$. These data points are i.i.d. realizations of a random variable $X$ that has CDF $F_\theta$ with unknown parameter $\theta \in \Theta \subset \mathbb{R}^q$. Here, $\Theta$ is a compact set. We will use $f_\theta$ to denote the corresponding probability density function (pdf). We use the $m \times N$ matrix $\mathbf{X} = [x_1, \ldots, x_N]$ to denote the given data matrix. From this given dataset, we would like to estimate the unknown parameter $\theta$. However, as the adversary has access to the whole dataset, it will modify the data to $\mathbf{X}' = \mathbf{X} + \Delta\mathbf{X}$, in which $\Delta\mathbf{X}$ is the attack matrix chosen by the adversary after observing $\mathbf{X}$. We will discuss the attacker’s optimal strategy in choosing $\Delta\mathbf{X}$ in the sequel. In this work, we consider the case where the attacker can modify all data points, which is a more suitable setup for recent data analytics applications. However, certain restrictions need to be put on $\Delta\mathbf{X}$; otherwise the estimation problem will not be meaningful. In this paper, we assume that
$$\frac{1}{Nm}\|\Delta\mathbf{X}\|_p^p \le \eta^p, \qquad (2.1)$$
in which $\|\cdot\|_p$ is the entry-wise matrix $\ell_p$-norm:
$$\|\Delta\mathbf{X}\|_p = \|\mathrm{vec}(\Delta\mathbf{X})\|_p,$$
with $\mathrm{vec}(\Delta\mathbf{X})$ being the vectorization of the matrix $\Delta\mathbf{X}$. In (2.1), we have the normalization term $1/(Nm)$ as the matrix $\Delta\mathbf{X}$ is of size $m \times N$. The normalization factor implies that the per-dimension change (on average) is upper-bounded by $\eta$. As mentioned in the introduction, this type of constraint is reasonable and is motivated by real-life examples.
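To make the budget constraint concrete, the following Python sketch evaluates the normalized entry-wise norm of a perturbation and checks it against a budget. It assumes the constraint has the form $\left(\frac{1}{Nm}\sum_{i,j}|\Delta X_{ij}|^p\right)^{1/p} \le \eta$, in line with the normalization described above. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def normalized_lp_size(delta, p):
    """Return ((1 / (N*m)) * sum_ij |delta_ij|^p)^(1/p) for a matrix given as a list of rows."""
    entries = [abs(v) for row in delta for v in row]
    return (sum(v ** p for v in entries) / len(entries)) ** (1.0 / p)


def within_budget(delta, p, eta, tol=1e-12):
    """Check the (assumed) attack constraint: normalized l_p size at most eta."""
    return normalized_lp_size(delta, p) <= eta + tol


# Toy scalar data (m = 1) stored as a 1 x N matrix; the values are illustrative only.
x = [[0.3, -1.2, 0.8, 2.1, -0.5]]
n_points = len(x[0])
eta = 0.1

# Attack A: shift every point by eta, spreading the budget evenly.
spread = [[eta] * n_points]
# Attack B: put the whole l_1 budget N * eta on a single point.
concentrated = [[n_points * eta] + [0.0] * (n_points - 1)]

# Effect of attack A on the sample mean (a feasible attack, not claimed to be optimal).
attacked = [a + d for a, d in zip(x[0], spread[0])]
mean_shift_spread = fmean(attacked) - fmean(x[0])
```

In this toy setting the evenly spread shift satisfies the budget for every $p$ and moves the sample mean by $\eta$, whereas the concentrated perturbation is feasible for $p = 1$ but not for $p = 2$. This illustrates how the choice of $p$ shapes the set of admissible attacks.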
Following the notation used in robust statistics (Huber and Ronchetti, 2009; Hampel et al., 2009), we will use $T$, a $q$-dimensional vector, to denote an estimator. For a given estimator $T$, we would like to characterize how sensitive the estimator is to the adversarial attack. In this paper, we consider a scenario where the goal of the attacker is to maximize the estimation error caused by the attack. In particular, the attacker aims to choose $\Delta\mathbf{X}$ by solving the following optimization problem
in which is the norm.
We use to denote the optimal value obtained from the optimization problem (2.2), and define the adversarial influence function (AIF) of estimator at under norm constraint as
This quantity, a generalization of the concept of influence function (IF) used in classic robust estimation (a brief review of IF will be provided in Section 2.2), quantifies the asymptotic rate at which the attacker can introduce estimation error through its attack.
From the defender’s perspective, the smaller AIF is, the more robust the estimator is. In this paper, building on the characterization of , we will characterize the optimal estimator , among a certain class of estimators , that minimizes . In particular, we will investigate
Note that depends on the data matrix . Based on the characterization of AIF for a given data realization matrix with columns (each column representing one data point), we will then study the population version of AIF where each column of is i.i.d generated by . We will examine the behavior of as increases. We will see that for a large class of estimators has a well-defined limit as . We will use to denote this limit when it exists.
From the defender’s perspective, we would like to design an estimator that is least sensitive to the adversarial attack. Again, we will characterize the optimal estimator , among a certain class of estimators , that minimizes . That is, for a certain class of estimators , we will solve
It will be clear in the sequel that the solution to the optimization problem (2.3), even though it is robust against adversarial attacks, performs poorly in guarding against outliers. This motivates us to design estimators that strike a desirable tradeoff between these two robustness measures. In particular, we will solve (2.3) with an additional constraint on IF. After the corresponding quantities are introduced in later sections, the precise formulation of this optimization problem with the additional IF constraint will be given in (4.7).
2.2 M-Estimator and Influence Function (IF)
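In this paper, we mainly focus on a class of estimators commonly used in robust statistics: the $M$-estimator (Huber, 1964), in which one obtains an estimate $T$ of $\theta$ by solving
$$\sum_{n=1}^{N} \psi(x_n, T) = 0. \qquad (2.4)$$
Here $\psi(x, \theta)$ is a vector function of the data point $x$ and the parameter $\theta$ to be estimated. We use $\psi_i$, $i = 1, \ldots, q$, to denote each component of $\psi$. Different choices of $\psi$ lead to different robust estimators. For example, the maximum likelihood estimator (MLE) can be obtained by setting $\psi(x, \theta) = \partial \log f_\theta(x) / \partial \theta$.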
As the form of determines , in the remainder of the paper, we will use and interchangeably. For example, we will denote as .
It is typically assumed that $\psi$ is continuous and almost everywhere differentiable. This assumption is valid for all commonly used choices of $\psi$. It is also typically required that the estimator is Fisher consistent (Hampel et al., 2009):
$$\mathbb{E}_{F_\theta}[\psi(X, \theta)] = 0,$$
in which $\mathbb{E}_{F_\theta}$ means expectation under $F_\theta$. Intuitively speaking, this implies that the true parameter $\theta$ is the solution of the $M$-estimator equation as the number of i.i.d. data points generated from $F_\theta$ grows.
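As an illustration of how an $M$-estimate is obtained in practice, the following Python sketch solves the scalar location version of the estimating equation, $\sum_n \psi(x_n - T) = 0$, by bisection. It uses Huber's clipped $\psi$ with the conventional tuning constant $c = 1.345$. The function names, the solver, and the sample are assumptions made for illustration and are not the method of the paper.

```python
def huber_psi(r, c=1.345):
    """Huber's psi: linear on [-c, c], clipped to +/- c outside."""
    return max(-c, min(c, r))


def m_estimate_location(data, psi=huber_psi, tol=1e-10):
    """Solve sum_n psi(x_n - T) = 0 for a scalar location T by bisection.

    For a non-decreasing psi with psi(0) = 0, the sum is non-increasing in T,
    non-negative at T = min(data) and non-positive at T = max(data).
    """
    def score(t):
        return sum(psi(x - t) for x in data)

    lo, hi = min(data), max(data)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies at or to the left of mid
    return 0.5 * (lo + hi)


# Five inliers and one gross outlier (illustrative values only).
sample = [0.1, -0.3, 0.2, 0.05, -0.1, 50.0]
huber_estimate = m_estimate_location(sample)
# psi(r) = r recovers the sample mean as a special case.
mean_estimate = m_estimate_location(sample, psi=lambda r: r)
```

With the identity $\psi$ the root is the sample mean, which the single outlier drags far from the bulk of the data. The clipped $\psi$ caps the outlier's contribution at $c$, so the estimate stays close to the inliers.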
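In the contamination model of the classic robust estimation setup, it is assumed that a fraction $\epsilon$ of data points are outliers, while the remainder of the data points are generated from the true distribution $F_\theta$. For a given estimator $T$, the concept of IF introduced by Hampel (Hampel, 1968) is defined as
$$\mathrm{IF}(x, T, F_\theta) = \lim_{\epsilon \downarrow 0} \frac{T\big((1-\epsilon)F_\theta + \epsilon\delta_x\big) - T(F_\theta)}{\epsilon}.$$
Here, $\mathrm{IF}(x, T, F_\theta)$ is a $q$-dimensional vector. In this definition, $\delta_x$ is a distribution that puts mass 1 at point $x$. In addition, $T(F_\theta)$ is the estimate obtained when all data points are generated i.i.d. from $F_\theta$, and $T((1-\epsilon)F_\theta + \epsilon\delta_x)$ is the estimate obtained when a $1-\epsilon$ fraction of data points are generated i.i.d. from $F_\theta$ while an $\epsilon$ fraction of the data points are at $x$. Hence, $\mathrm{IF}(x, T, F_\theta)$ measures the asymptotic influence of having outliers at point $x$ as $\epsilon \to 0$. As above, since $T$ is determined by $\psi$ for an $M$-estimator, in the following we will also denote $\mathrm{IF}(x, T, F_\theta)$ as $\mathrm{IF}(x, \psi, F_\theta)$.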
Furthermore, to characterize the impact of the worst outliers, Hampel (Hampel et al., 2009) introduced the (unstandardized) gross-error sensitivity:
in which is the norm.
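As a short worked example of these two quantities, consider scalar location estimation. The derivation below assumes the standard contamination-based definition of IF, with $\delta_x$ a point mass at $x$, and uses the known IF formula for a location $M$-estimator with odd, almost everywhere differentiable $\psi$. The symbols $\mu$, $c$, $\psi_c$, $\gamma^*$ and $\Phi$ (the standard normal CDF) are introduced here for illustration only.

```latex
% Sample mean as a functional: T(F) = \int x \, dF(x), with \mu = T(F).
T\big((1-\epsilon)F + \epsilon\delta_x\big) = (1-\epsilon)\mu + \epsilon x
\;\Longrightarrow\;
\mathrm{IF}(x, T, F) = \lim_{\epsilon \downarrow 0}
  \frac{(1-\epsilon)\mu + \epsilon x - \mu}{\epsilon} = x - \mu ,
\qquad
\gamma^* = \sup_x |x - \mu| = \infty .

% Location M-estimator with Huber's \psi_c(r) = \max(-c, \min(c, r)) at F = \Phi:
\mathrm{IF}(x, \psi_c, \Phi) = \frac{\psi_c(x)}{\mathbb{E}_\Phi[\psi_c'(X)]}
  = \frac{\psi_c(x)}{2\Phi(c) - 1},
\qquad
\gamma^* = \frac{c}{2\Phi(c) - 1} < \infty .
```

The mean therefore has unbounded gross-error sensitivity, while clipping $\psi$ bounds it. This is the sense in which a bounded $\psi$ limits the influence of a single outlier.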
3 Characterizing AIF
In this section, for a given data matrix $\mathbf{X}$, we analyze the AIF for any given $M$-estimator as specified in (2.4). As $\psi$ is a $q$-dimensional vector, (2.4) consists of $q$ equations. To simplify the presentation, we write each equation as
and denote . Using this notation, (2.4) can be written as
3.1 General Case
To proceed further, we write
We have the following characterization.
Theorem 3.1. For , suppose is invertible; then we have
in which is the set of length vectors with each entry being either or .
Proof. First, from (3.1), we have
Using Taylor expansion, we have
When is small, the adversary can focus on the following optimization problem to obtain an optimal solution
Let be the optimal value obtained in the optimization problem (P1). Let be the optimal value of the following optimization problem
In Appendix A, we show that . Hence, we can focus on problem (P2).
The inner maximization problem in (P2) is the same as
Using (3.3), we have
To simplify the notation, we denote
which is a row vector with entries. Even though is only a row vector, we denote these elements as for and to better connect with each elements of . Hence, corresponds to . Using this notation, the optimization problem can be written as
For $p = 1$, this is a linear programming problem, whose solution is simple. In particular, let , and be the corresponding index; it is easy to check that we have
and for other s. Hence,
For $p > 1$, (3.1) is a convex optimization problem. To solve this, we form the Lagrangian
The corresponding optimality conditions are:
From (3.5), we know that , hence
From (3.7) and the fact that is positive, we know , and hence we have
which can be simplified further to
Combining these with (3.6), we obtain the value of :
As a result, we have
Hence, the optimal value of the inner maximization of (P2) is
which finishes the proof. ∎
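The inner maximization handled in the proof, a linear objective under an $\ell_p$ budget, is an instance of norm duality (Hölder's inequality). The Python sketch below illustrates that generic fact only. The coefficient vector `a` is a hypothetical stand-in for the row vector in the proof, and `radius` is assumed to absorb the normalization and the budget $\eta$. For $p = 1$ the whole budget goes on the largest-magnitude coefficient, and for finite $p > 1$ the maximizer is proportional to $\mathrm{sign}(a_i)|a_i|^{1/(p-1)}$.

```python
import math


def best_linear_attack(a, p, radius):
    """Maximize sum_i a_i * d_i subject to (sum_i |d_i|^p)^(1/p) <= radius, for finite p >= 1."""
    if p == 1:
        # Linear program: spend the whole budget on the largest |a_i|.
        k = max(range(len(a)), key=lambda i: abs(a[i]))
        d = [0.0] * len(a)
        d[k] = math.copysign(radius, a[k])
        return d
    q = p / (p - 1.0)  # conjugate exponent, 1/p + 1/q = 1
    norm_q = sum(abs(v) ** q for v in a) ** (1.0 / q)
    if norm_q == 0.0:
        return [0.0] * len(a)
    scale = norm_q ** (q - 1.0)
    # d_i = radius * sign(a_i) * |a_i|^(q-1) / ||a||_q^(q-1) attains radius * ||a||_q.
    return [radius * math.copysign(abs(v) ** (q - 1.0), v) / scale for v in a]


def attack_value(a, d):
    """Value of the linear objective sum_i a_i * d_i."""
    return sum(ai * di for ai, di in zip(a, d))


a = [3.0, -4.0]
d_l1 = best_linear_attack(a, 1, 1.0)
d_l2 = best_linear_attack(a, 2, 1.0)
```

For this toy vector the $p = 1$ attack concentrates on the second coordinate, while the $p = 2$ attack aligns with $a$ itself. The attained values are the dual norms $\|a\|_\infty$ and $\|a\|_2$ scaled by the radius.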
3.2 Robust Regression
In this section, we use robust linear regression, an important multivariate parameter estimation problem, as an example to illustrate the result derived in Section 3.1. In linear regression problems, the data points are , with and . Hence, . In the following, we let , and still denote . From the data, we would like to fit a linear model, i.e., we would like to find such that is a good approximation of . Hence, the parameters to be estimated are . Furthermore, each data point , hence . We denote
as the residual error.
The commonly used ordinary least squares (OLS) approach finds by solving
which is equivalent to solving
The solution is well known. In the subsequent discussion, we will use a related quantity called the hat matrix
It is known that OLS solution is not robust to outliers (Hampel et al., 2009). Various robust linear regression schemes were proposed (Wilcox, 2005; Mallows, 1975; Huber, 1973; Merrill and Schweppe, 1971). They generally set in the form with function and weights and . That is, for robust linear regression, one obtains the estimate of by solving
The weights and can be chosen to not only depend on but also the whole data matrix . For example, it is common (Wilcox, 2005; Mallows, 1975) to use , in which is the th diagonal element of the hat matrix (3.9). It is known (Huber and Ronchetti, 2009) that . Comparing (3.10) with (3.8), we can see that one replaces in (3.8) with and replaces in (3.8) with . The main idea is to use to limit the impact of outliers in and use to limit the impact of outliers in the residual while taking the location of into consideration. From (3.10), we have
Different choices of functions lead to different classes of robust linear regression methods. For example:
leads to Mallows’s proposal.
is Schweppe’s approach (Merrill and Schweppe, 1971).
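To illustrate the role of the hat matrix in these weighting schemes, the Python sketch below computes OLS coefficients and leverages for simple linear regression with an intercept. In that special case the diagonal of $H = A(A^\top A)^{-1}A^\top$, with design rows $(1, x_i)$, has the closed form $h_{ii} = 1/N + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The weight $\sqrt{1 - h_{ii}}$ is shown as one commonly used Mallows-type choice. That formula, the function names, and the data are assumptions made for illustration and are not taken from the paper.

```python
from math import sqrt


def ols_simple_regression(x, y):
    """Return (intercept, slope) minimizing sum_i (y_i - intercept - slope * x_i)^2."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    sxy = sum((u - xbar) * (v - ybar) for u, v in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def leverages_simple_regression(x):
    """Diagonal of the hat matrix for the design with rows (1, x_i)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    return [1.0 / n + (v - xbar) ** 2 / sxx for v in x]


# Four regular design points and one high-leverage point (illustrative values only).
x = [0.0, 1.0, 2.0, 3.0, 10.0]
y = [1.0 + 2.0 * v for v in x]

intercept, slope = ols_simple_regression(x, y)
h = leverages_simple_regression(x)
# Assumed Mallows-type weights: down-weight points with high leverage.
weights = [sqrt(1.0 - hi) for hi in h]
```

Each leverage lies in $[0, 1]$ and the leverages sum to the number of fitted parameters (two here), so the isolated point at $x = 10$ receives the largest leverage and hence the smallest weight.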
In the following, we calculate and .
First, we compute . For ; ; and , we have
in which is the indicator function and is the th element of . In addition,
in which we denote
Furthermore, each entry of can be computed as
From this, we know that
Using the result in Theorem 3.1, assuming is invertible, we have the following characterization of AIF of robust linear regression.
Proposition 3.1. For robust linear regression,
with defined in (3.11).
We now apply Proposition 3.1 to various specific (robust) linear regression approaches.
OLS: For OLS, we have , and . In this case, we have
and Furthermore, and hence
Mallows’s proposal: In Mallows’s proposal, we have . In this case, we have
and Furthermore, with .
Schweppe’s approach (Merrill and Schweppe, 1971): In Schweppe’s approach, we have . In this case, we have
and Furthermore, and
We will compare these methods numerically in Section 5.
4 Optimal Adversarial Robustness vs Outlier Robustness Tradeoff
In this section, we specialize the results to the joint estimation of location and scale. Building on these results, we will design $M$-estimators that minimize AIF or achieve the optimal tradeoff between AIF, i.e., adversarial robustness, and IF, i.e., outlier robustness.
In the joint location-scale estimation, given with , the goal is to jointly estimate the location parameter and the scale parameter . We will assume that is bounded and that there is a constant such that . Hence, the dimension of each data point is $m = 1$ and the dimension of the parameter is $q = 2$. We focus on a large class of models called the location-scale model (Hampel et al., 2009). In the location-scale model, we have
in which is a random variable with symmetric pdf and CDF , and means the distribution of . This class of models includes many important models in statistics. For example, the joint estimation of the mean and variance of a Gaussian random variable belongs to this class. Another important example is the linear regression discussed in Section 3.2.
Following the convention, we use to denote the estimation of location and to denote the estimation of the scale . In the location-scale model, one typically obtains by solving the following equations (Huber and Ronchetti, 2009):
with properly chosen and . In the following, for presentation and notation convenience, we denote
For this class of models, one typically focuses on equivariant $M$-estimators (Hampel et al., 2009) with
This implies that the first component of $\psi$ is an odd function, while the second component of $\psi$ is an even function. Furthermore, and are assumed to be monotone functions in . Without loss of generality, we will focus on monotone increasing functions, hence and for .
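The structure of the joint location-scale estimating equations can be sketched in Python as follows. The sketch assumes the standard form $\sum_n \psi((x_n - \mu)/\sigma) = 0$ and $\sum_n \chi((x_n - \mu)/\sigma) = 0$, with an odd location component and an even scale component. It uses the classical non-robust choice $\psi(y) = y$, $\chi(y) = y^2 - 1$, for which the solution is the sample mean and the (biased) sample standard deviation. The names `psi_location`, `chi_scale`, and the data are hypothetical.

```python
from math import sqrt


def psi_location(y):
    """Odd location component; psi(y) = y is the classical non-robust choice."""
    return y


def chi_scale(y):
    """Even scale component; chi(y) = y^2 - 1 is the classical non-robust choice."""
    return y * y - 1.0


def estimating_equations(data, mu, sigma, psi=psi_location, chi=chi_scale):
    """Return the two sums that a joint location-scale M-estimate sets to zero."""
    standardized = [(x - mu) / sigma for x in data]
    return sum(psi(r) for r in standardized), sum(chi(r) for r in standardized)


# Illustrative sample only.
data = [1.2, 0.4, 2.9, 1.7, 0.8, 2.2]
mu_hat = sum(data) / len(data)
sigma_hat = sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))

# Both sums vanish (up to rounding) at the sample mean and standard deviation.
eq_location, eq_scale = estimating_equations(data, mu_hat, sigma_hat)
```

Robust variants keep the same two-equation structure and the same odd/even symmetry but replace `psi_location` and `chi_scale` with bounded functions, which is the design space explored in this section.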
4.1 Given Sample Case
We first use the results derived in Section 3 to characterize the AIF for a given data matrix . We note that in this joint location-scale estimation problem, , , and
Since, , . Due to symmetry, we only need to consider