Valid Inference Corrected for Outlier Removal

11/29/2017 · by Shuxiao Chen, et al.

Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to first identify and remove outliers by looking at the data, and then to fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. We show in this paper that this "detect-and-forget" approach can lead to invalid inference, and we propose a framework that properly accounts for outlier detection and removal to provide valid confidence intervals and hypothesis tests. Our inferential procedures apply to any outlier removal procedure that can be characterized by a set of quadratic constraints on the response vector, and we show that several of the most commonly used outlier detection procedures are of this form. Our methodology is built upon recent advances in selective inference (Taylor & Tibshirani, 2015), which are focused on inference corrected for variable selection. We conduct simulations to corroborate the theoretical results, and we apply our method to two classic data sets considered in the outlier detection literature to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.

1 Introduction

Linear regression is routinely used in just about every field of science. In introductory statistics courses, students are shown cautionary examples of how even a single outlier can wreak havoc on ordinary least squares (OLS). Outliers can arise for a variety of reasons, including recording errors and the occurrence of rare phenomena, and they often go unnoticed without careful inspection (see, e.g., Belsley et al., 2005). Given this reality, one route taken by statisticians is robust regression, in which the least-squares loss function is replaced with other functions that are less sensitive to outliers, such as the Huber loss (Huber & Ronchetti, 1981) or the least absolute deviation loss (see, e.g., Bloomfield & Steiger, 1984). Yet in practice OLS remains the predominant approach, presumably due to its accompanying inferential procedures, which are elegant and easy to use. Thus, the most common approach is a two-step procedure, which we will refer to as detect-and-forget:

  1. detect and then remove outliers;

  2. fit OLS and perform inference on the remaining data as if this were the original data set.
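To make the procedure concrete, here is a minimal base-R sketch of this two-step pipeline, using Cook's distance for Step 1 (as in the illustration below); the simulated data and the "remove the single largest Cook's distance" rule are illustrative choices, not the paper's exact setup.

```r
set.seed(1)
n <- 20
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
y[n] <- y[n] + 3                      # shift one observation to act as the outlier

fit_full <- lm(y ~ x)

## Step 1: detect and remove the observation with the largest Cook's distance.
drop_idx <- which.max(cooks.distance(fit_full))
keep     <- setdiff(seq_len(n), drop_idx)

## Step 2: refit OLS and report intervals and p-values as if the reduced data
## were the data originally collected (the step this paper corrects).
fit_reduced <- lm(y[keep] ~ x[keep])
confint(fit_reduced)                  # naive (uncorrected) confidence intervals
summary(fit_reduced)$coefficients     # naive t-statistics and p-values
```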

While this common approach is attractive for its simplicity, we will show that it can lead to confidence intervals and hypothesis tests with incorrect operating characteristics. In particular, this detect-and-forget approach is problematic because it uses the same data twice. While the term “outlier removal” might lead one to think of Step 1 as a clear-cut, essentially deterministic step, Step 1 should instead be thought of as “potential outlier removal”: an imperfect process that has some probability of removing non-outliers and that can thereby alter the distribution of the data. The act of searching for and removing potential outliers must be considered part of the data-fitting procedure and thus must be accounted for in Step 2 when inference is performed.

Similar concerns over “double dipping” are well-known in prediction problems, in which sample splitting (into training and testing sets) is a common remedy. However, such a strategy does not translate in an obvious way to the outlier problem: If one removes outliers on a subset of observations and performs inference on the remaining observations, then one is of course left vulnerable to outliers in the second set that could throw off the inference stage.

To illustrate how the detect-and-forget strategy can be problematic, consider the situation shown in Figure 1, in which there are 19 “normal” points (in black), and a single “outlier” point (in red) has been shifted upward by different magnitudes. For this illustration, we use a well-known approach for outlier detection called Cook’s distance (Cook, 1977):

$$D_i = \frac{r_i^2}{p\,\hat\sigma^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}, \tag{1}$$

where $r_i$ is the $i$-th residual from OLS on the entire data set, $\hat\sigma^2 = \sum_{j=1}^n r_j^2 / (n - p)$ is the scaled sum of squares, and $h_{ii}$ is the $i$-th diagonal entry of the hat matrix $H = X(X^\top X)^{-1} X^\top$. We declare the observation with the largest Cook’s distance to be the outlier (indicated in the figure by a yellow cross) and then refit the regression model with this point removed (black regression line). We then construct confidence intervals for the regression surface in two different ways: first, using the traditional detect-and-forget strategy, which ignores the outlier removal step, and second using corrected-exact, a method we will introduce in this paper, which properly corrects for the removal. When the outlier is obvious (leftmost panel), our method makes no discernible correction. With such a pronounced separation between the outlier and non-outliers, Step 1 is unlikely to have removed a non-outlier, and thus the distribution of the data for inference is likely unaltered. However, when the outlier is less easily distinguished from the data, our corrected confidence intervals are noticeably different from the classical ones. In particular, the corrected-exact intervals are pulled in the direction of the removed data point, thereby accounting for the possibility that the removed point may not in fact have been an outlier.
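As a sanity check on the expression for $D_i$ in (1), the following base-R snippet computes the Cook's distances by hand and compares them with R's built-in cooks.distance(); the toy data are illustrative.

```r
set.seed(2)
n <- 19; p <- 2
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

r  <- residuals(fit)                  # r_i
h  <- hatvalues(fit)                  # h_ii, diagonal of the hat matrix
s2 <- sum(r^2) / (n - p)              # scaled sum of squares
D  <- (r^2 / (p * s2)) * h / (1 - h)^2

max(abs(D - cooks.distance(fit)))     # numerically zero
which.max(D)                          # the observation that would be removed
```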

Figure 1: Confidence intervals for the regression surface. Normal data are in black, while the only outlier is marked in red. The point marked with a yellow cross is the detected outlier, i.e., the point with the largest Cook’s distance. The black line is the regression line fitted after removing the detected outlier.

While Figure 1 shows only a single realization of the two intervals in four different scenarios, Figure 2 shows the empirical coverage probability, averaged over 2000 realizations, of these two types of confidence intervals along the regression surface for the same four scenarios. We see that when the outlier signal is strong (leftmost panel), both detect-and-forget and corrected intervals achieve the nominal coverage, as desired. However, as the outlier signal decreases, the detect-and-forget intervals begin to break down, while our corrected intervals remain unaffected. Indeed, we will show in this paper how all sorts of inferential statements (confidence intervals for regression coefficients, coefficient $t$-tests, $F$-tests, etc.) can be thrown off by a detect-and-forget strategy but can be corrected with a proper accounting for the outlier detection and removal step.

Figure 2: Empirical coverage probability along the regression surface (across 2000 realizations). The dashed line represents the nominal coverage level.
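The flavor of this experiment can be reproduced in a few lines of base R. The sketch below checks the empirical coverage of the naive detect-and-forget interval for the regression surface near the contaminated point; the shift size, sample size, and 95% level are illustrative assumptions, not the paper's exact settings.

```r
set.seed(3)
n <- 20; reps <- 2000; shift <- 1.5     # a "borderline" outlier shift
x  <- seq(0, 1, length.out = n)
x0 <- 1                                  # check coverage near the contaminated point
truth <- 1 + 2 * x0                      # true regression surface at x0

covered <- logical(reps)
for (b in seq_len(reps)) {
  y <- 1 + 2 * x + rnorm(n, sd = 0.5)
  y[n] <- y[n] + shift                   # contaminate the observation at x = 1
  dat  <- data.frame(x = x, y = y)
  drop <- which.max(cooks.distance(lm(y ~ x, data = dat)))
  fit  <- lm(y ~ x, data = dat[-drop, ])
  ci   <- predict(fit, newdata = data.frame(x = x0),
                  interval = "confidence", level = 0.95)
  covered[b] <- ci[1, "lwr"] <= truth && truth <= ci[1, "upr"]
}
mean(covered)    # often noticeably below 0.95 for shifts of moderate size
```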

The machinery underlying our methodology is built on recent advances in selective inference (Fithian et al., 2014; Taylor & Tibshirani, 2015), specifically the framework introduced in Lee et al. (2016); Loftus & Taylor (2015), and it fits within the framework of inferactive data analysis introduced by Bi et al. (2017). We give a brief introduction to the philosophy of selective inference in the context of outlier detection and refer readers to Fithian et al. (2014) for more details.

We assume a standard regression setting, with $y_i = x_i^\top \beta + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$ for $i \in \mathcal{I}^*$. Here $\mathcal{I}^* \subseteq \{1, \dots, n\}$ is the set of true non-outliers. If $\mathcal{I}$ is the set of detected non-outliers, then the detect-and-forget strategy forms the OLS estimator on the subset of observations in $\mathcal{I}$, namely $\hat\beta^{\mathcal{I}} = X_{\mathcal{I}}^{+} y_{\mathcal{I}}$ (where $X_{\mathcal{I}}$ and $y_{\mathcal{I}}$ are formed by taking rows indexed by $\mathcal{I}$, and $X_{\mathcal{I}}^{+}$ is the Moore-Penrose pseudoinverse of $X_{\mathcal{I}}$), and then proceeds with inference assuming that

$$\hat\beta^{\mathcal{I}} \sim N\big(\beta, \ \sigma^2 (X_{\mathcal{I}}^\top X_{\mathcal{I}})^{-1}\big).$$

However, the above assumes that $\mathcal{I}$ is non-random (or at least independent of $y$) and that $\mathcal{I} \subseteq \mathcal{I}^*$, i.e., all true outliers have been successfully removed. In practice, the set of declared non-outliers is in fact a function of the data, $\mathcal{I}(y)$, and thus to perform inference one would in principle require an understanding of the distribution of the much more complicated random variable

$$\hat\beta^{\mathcal{I}(y)} = X_{\mathcal{I}(y)}^{+}\, y_{\mathcal{I}(y)}.$$

For general outlier removal procedures $\mathcal{I}(\cdot)$, such as “make plots and inspect by eye,” the above distribution may be completely unobtainable. However, in this paper we define a class of outlier removal procedures for which the conditional distribution

$$\hat\beta^{\mathcal{I}} \,\big|\, \{\mathcal{I}(y) = \mathcal{I}\}$$

can be precisely characterized. Access to this conditional distribution will allow us to construct confidence intervals and $p$-values that are valid conditional on the set of outliers selected. For example, we will produce a procedure for forming outlier-removal-aware confidence intervals $C_j^{\mathcal{I}}$ such that

$$P\big(\beta_j \in C_j^{\mathcal{I}} \,\big|\, \mathcal{I}(y) = \mathcal{I}\big) \ge 1 - \alpha$$

for all subsets $\mathcal{I}$ that do not include a true outlier. If one could be certain that $P\big(\mathcal{I}(y) \subseteq \mathcal{I}^*\big) = 1$ (i.e., the procedure is adjusted to be sufficiently conservative and outliers are known to be sufficiently large), then such conditional coverage statements can be translated into a marginal (i.e., traditional) coverage statement:

$$P\big(\beta_j \in C_j^{\mathcal{I}(y)}\big) \ge 1 - \alpha.$$

However, in practice we do not know if all true outliers have been successfully removed. If $\mathcal{I}(y) \not\subseteq \mathcal{I}^*$, then OLS is no longer guaranteed to produce an unbiased estimate of $\beta$. OLS performed on the observations in $\mathcal{I}$ instead estimates a parameter $\beta^{\mathcal{I}}$, which depends on both $\beta$ and on the mean of the true outliers that were not detected:

$$\beta^{\mathcal{I}} = X_{\mathcal{I}}^{+}\, \mathbb{E}[y_{\mathcal{I}}]. \tag{2}$$

The goal of this paper is not to improve the performance of outlier removal procedures; certainly there is already extensive work in the literature on outlier removal. Rather, our goal is to provide valid inferential statements for someone who has chosen to use a particular outlier removal procedure, $\mathcal{I}(\cdot)$. Thus, to stay within the scope of this problem, we simply acknowledge that if a procedure is prone to failing to identify outliers, then one cannot hope to estimate $\beta$ but must instead focus on estimating and performing inference for $\beta^{\mathcal{I}}$, which reflects, more accurately than $\beta$, the relationship between $X$ and $y$ in the data provided to us by $\mathcal{I}(\cdot)$. For example, we will provide intervals with guaranteed coverage of $\beta_j^{\mathcal{I}}$:

$$P\big(\beta_j^{\mathcal{I}} \in C_j^{\mathcal{I}} \,\big|\, \mathcal{I}(y) = \mathcal{I}\big) \ge 1 - \alpha.$$

We will likewise provide all the standard confidence intervals and hypothesis tests for regression, but focused on $\beta^{\mathcal{I}}$ in place of $\beta$.
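The following base-R snippet illustrates numerically, under the notation used above, how the target $\beta^{\mathcal{I}}$ in (2) coincides with $\beta$ when the retained set excludes the true outlier and is pulled away from $\beta$ when it does not; all numbers are illustrative.

```r
set.seed(4)
n <- 20; p <- 2
X    <- cbind(1, runif(n))            # intercept plus one predictor
beta <- c(1, 2)
u    <- rep(0, n); u[n] <- 3          # one true outlier (mean shift of 3)

Ey <- X %*% beta + u                  # E[y] under the mean-shift model

beta_I <- function(I) solve(crossprod(X[I, ]), crossprod(X[I, ], Ey[I]))

beta_I(seq_len(n - 1))   # outlier excluded: recovers beta = (1, 2)
beta_I(seq_len(n))       # outlier retained: pulled away from beta
```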

The rest of the paper is organized as follows. In Section 2 we formulate the problem more precisely and describe the class of outlier detection procedures to which our framework applies; Section 3 describes our methodology for forming confidence intervals and $p$-values that are properly corrected for outlier removal; Section 4 provides empirical comparisons of the naive detect-and-forget strategy and our method, both through comprehensive simulations and a re-analysis of two data sets previously studied in the outlier removal literature; Section 5 gives a discussion and possible next steps.

We conclude this section by introducing some notation that will be used throughout this paper. For $n \in \mathbb{N}$, we let $[n] = \{1, \dots, n\}$. For a matrix $A$, we let $\mathrm{col}(A)$ be its column space and $\mathrm{tr}(A)$ be its trace. We let $A_{S,T}$ be the submatrix formed by rows and columns indexed by $S$ and $T$, respectively, and we let $A_S$ be the submatrix formed by rows indexed by $S$. We denote by $A^{+}$ its Moore-Penrose pseudoinverse. We let $P_A$ be the projection matrix onto $\mathrm{col}(A)$ and $P_A^{\perp} = I - P_A$. For a submatrix $X_{\mathcal{I}}$, we write $P_{\mathcal{I}} = P_{X_{\mathcal{I}}}$ when there is no ambiguity. We use $\perp\!\!\!\perp$ to denote statistical independence.

2 Problem Formulation

2.1 The General Setup

We elaborate on the framework described in the previous section, introducing some additional notation. We assume $y \in \mathbb{R}^n$, where $n > p$, and consider the mean-shift model,

$$y = X\beta + u + \epsilon, \tag{3}$$

where $\epsilon \sim N(0, \sigma^2 I_n)$, $u \in \mathbb{R}^n$, and $X \in \mathbb{R}^{n \times p}$ is a non-random matrix of predictors. The set $\mathcal{I}^* = \{i \in [n] : u_i = 0\}$ is the index set of true non-outliers; equivalently, $(\mathcal{I}^*)^c$ is the index set of true outliers. By definition of $\mathcal{I}^*$, $u_i = 0$ for $i \in \mathcal{I}^*$ and $u_i \neq 0$ for $i \in (\mathcal{I}^*)^c$. We denote a data-dependent outlier removal procedure, $\mathcal{I}(y)$, as a function mapping the data to the index set of detected non-outliers (for notational ease, we suppress the dependence of $\mathcal{I}(\cdot)$ on $X$ since $X$ is treated as non-random). We will assume throughout that $X_{\mathcal{I}(y)}$ has linearly independent columns.

For a fixed subset of the observations $\mathcal{I} \subseteq [n]$, the parameter $\beta^{\mathcal{I}}$ defined in (2) represents the best linear approximation of $\mathbb{E}[y_{\mathcal{I}}]$ using the predictors in $X_{\mathcal{I}}$. In what follows, we will provide hypothesis tests and confidence intervals for $\beta^{\mathcal{I}}$ conditional on the event $\{\mathcal{I}(y) = \mathcal{I}\}$.

Combining (2) and (3) with the assumption that $X_{\mathcal{I}}$ has linearly independent columns,

$$\beta^{\mathcal{I}} = X_{\mathcal{I}}^{+}\,(X_{\mathcal{I}}\beta + u_{\mathcal{I}}) = \beta + (X_{\mathcal{I}}^\top X_{\mathcal{I}})^{-1} X_{\mathcal{I}}^\top u_{\mathcal{I}}.$$

Since $u_i = 0$ for $i \in \mathcal{I}^*$, it follows that $\beta^{\mathcal{I}} = \beta$ when $\mathcal{I} \subseteq \mathcal{I}^*$. This result makes it clear that if one wishes to make statements about $\beta$, then one must ensure that the procedure is screening out all outliers.

Our focus will be on performing inference on $\beta^{\mathcal{I}}$ conditional on the event $\{\mathcal{I}(y) = \mathcal{I}\}$. Importantly, such inferential procedures in fact provide asymptotically valid inferences for $\beta$ as long as one’s outlier removal procedure asymptotically detects all outliers. For example, the next proposition establishes that confidence intervals providing conditional coverage of $\beta_j^{\mathcal{I}}$ given $\{\mathcal{I}(y) = \mathcal{I}\}$ do in fact achieve traditional (i.e., unconditional) coverage of $\beta_j$ asymptotically if one is using an outlier detection procedure that is guaranteed to screen out all outliers as $n \to \infty$.

Proposition 2.1.

For $y$ generated through the mean-shift model (3), consider intervals $C_j^{\mathcal{I}}$ satisfying $P\big(\beta_j^{\mathcal{I}} \in C_j^{\mathcal{I}} \mid \mathcal{I}(y) = \mathcal{I}\big) \ge 1 - \alpha$ for every $\mathcal{I}$. If the outlier detection procedure satisfies $P\big(\mathcal{I}(y) \subseteq \mathcal{I}^*\big) \to 1$ as $n \to \infty$, then we have $\liminf_{n \to \infty} P\big(\beta_j \in C_j^{\mathcal{I}(y)}\big) \ge 1 - \alpha$.

Proof.

See Appendix D.1 for a detailed proof. ∎

This proposition is based on two simple observations: first, that conditional coverage of $\beta_j^{\mathcal{I}}$ given $\{\mathcal{I}(y) = \mathcal{I}\}$ implies unconditional coverage of $\beta_j^{\mathcal{I}(y)}$; second, that $\mathcal{I}(y) \subseteq \mathcal{I}^*$ implies that $\beta_j^{\mathcal{I}(y)} = \beta_j$.

Such a screening property is reasonable to demand of an outlier detection procedure, and related results exist in the literature (Zhao et al., 2013). For example, consider using Cook’s distance (1) to detect outliers:

$$\mathcal{I}(y) = \{i \in [n] : D_i \le \lambda\}, \tag{4}$$

where $\lambda$ is a prespecified cutoff. In Appendix A, we provide conditions (based on a result of Zhao et al. 2013) under which $P\big(\mathcal{I}(y) \subseteq \mathcal{I}^*\big) \to 1$ for an appropriate choice of $\lambda$. While $\mathcal{I}(y) = \emptyset$ would trivially satisfy the screening property, we of course need a procedure that leaves sufficient observations for estimation and inference.
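A thresholding rule of the form (4) is simple to implement in base R, as sketched below. The default cutoff $4/n$ is a common rule of thumb and is not necessarily the choice of $\lambda$ analyzed in Appendix A.

```r
detect_nonoutliers <- function(X, y, lambda = 4 / length(y)) {
  D <- cooks.distance(lm(y ~ X - 1))  # X is assumed to already contain an intercept column
  which(D <= lambda)                  # index set of detected non-outliers
}

## Illustrative use:
set.seed(5)
n <- 50
X <- cbind(1, rnorm(n))
y <- as.numeric(X %*% c(1, 2) + rnorm(n))
y[1] <- y[1] + 5
detect_nonoutliers(X, y)              # observation 1 should be excluded from the returned set
```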

2.2 Quadratic Outlier Detection Procedures

In this section we define a general class of outlier detection procedures to which our methodology applies. We then show that this class includes several of the most commonly used outlier detection procedures.

Definition 2.2.

We say an outlier detection procedure $\mathcal{I}(\cdot)$ is quadratic if the event $\{y : \mathcal{I}(y) = \mathcal{I}\}$ can be written as

$$\{y : \mathcal{I}(y) = \mathcal{I}\} = \mathcal{L}\big(\{E_k\}_{k \in K}\big), \tag{5}$$
$$E_k = \{y : y^\top Q_k y + a_k^\top y + b_k \ge 0\}, \tag{6}$$

for some finite index set $K$ and some $Q_k \in \mathbb{R}^{n \times n}$, $a_k \in \mathbb{R}^n$, $b_k \in \mathbb{R}$, where $\mathcal{L}$ denotes a general set operator that maps a finite family of sets to a single set.

Generally, $\mathcal{L}$ should be thought of as taking finite unions, intersections, and complements. The above definition is a direct generalization of Definition 1.1 of Loftus & Taylor (2015), in which $\mathcal{L}$ is restricted to taking intersections. We will see that many outlier detection procedures are quadratic in the sense of Definition 2.2. While most of the time the definition in Loftus & Taylor (2015) will apply, there are certain cases that require our generalization (see Appendix C.2 for a specific example). The next proposition shows that outlier detection using Cook’s distance is quadratic in the sense of Definition 2.2.

Proposition 2.3.

Outlier detection using Cook’s distance (4) is quadratic with

$$Q_i = \frac{\lambda p}{n - p}\,(I_n - H) - \frac{h_{ii}}{(1 - h_{ii})^2}\,(I_n - H)\, e_i e_i^\top (I_n - H), \tag{7}$$
$$a_i = 0, \qquad b_i = 0, \tag{8}$$
$$\{y : \mathcal{I}(y) = \mathcal{I}\} = \Big( \bigcap_{i \in \mathcal{I}} E_i \Big) \cap \Big( \bigcap_{i \notin \mathcal{I}} E_i^c \Big), \qquad E_i = \{y : y^\top Q_i y \ge 0\}. \tag{9}$$

Proof.

We may write $r_i = e_i^\top (I_n - H) y$, where $e_i$ is the $i$-th standard basis vector of $\mathbb{R}^n$ and $H = X(X^\top X)^{-1} X^\top$. Plugging this expression into (1) and (4) gives the desired result. ∎
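To see the quadratic structure concretely, the base-R check below verifies numerically that each event $\{D_i \le \lambda\}$ coincides with a constraint of the form $\{y : y^\top Q_i y \ge 0\}$, with $Q_i$ as written in (7); the check is only an illustration of the proposition, not code from the paper.

```r
set.seed(6)
n <- 30; p <- 3; lambda <- 4 / n
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- as.numeric(X %*% rnorm(p) + rnorm(n))

H <- X %*% solve(crossprod(X), t(X))    # hat matrix
M <- diag(n) - H                        # I - H
h <- diag(H)
r <- as.numeric(M %*% y)                # residuals
s2 <- sum(r^2) / (n - p)
D  <- (r^2 / (p * s2)) * h / (1 - h)^2  # Cook's distances, as in (1)

Q_i <- function(i) {
  lambda * p / (n - p) * M -
    (h[i] / (1 - h[i])^2) * (M %*% tcrossprod(diag(n)[, i]) %*% M)
}
quad_sign <- sapply(seq_len(n), function(i) drop(t(y) %*% Q_i(i) %*% y) >= 0)
all(quad_sign == (D <= lambda))         # TRUE: the two descriptions agree
```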

As a second example, we consider the soft-IPOD method (She & Owen, 2011), which identifies outliers using a lasso penalty (Tibshirani, 1996) within the context of the mean-shift model (3):

$$(\hat\beta, \hat u) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p,\, u \in \mathbb{R}^n} \ \tfrac{1}{2} \| y - X\beta - u \|_2^2 + \lambda \| u \|_1. \tag{10}$$

The $\ell_1$-penalty induces sparsity in $\hat u$, and one takes $\mathcal{I}(y) = \{i : \hat u_i = 0\}$ as the detected non-outliers. In Appendix C, this approach is shown to be a quadratic outlier detection procedure. Interestingly, $\hat\beta$ can be equivalently expressed as the minimizer of the Huber loss function (as observed in She & Owen 2011):

$$\hat\beta = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \ \sum_{i=1}^n \rho_\lambda\big(y_i - x_i^\top \beta\big), \qquad \rho_\lambda(t) = \begin{cases} t^2 / 2, & |t| \le \lambda, \\ \lambda |t| - \lambda^2 / 2, & |t| > \lambda, \end{cases}$$

with the elements of $\mathcal{I}(y)$ corresponding to those residuals in the quadratic (rather than linear) region of Huber’s loss function. Thus, applied in this example, the methodology of this paper provides valid “inference after robust regression.”
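In practice, the soft-IPOD fit in (10) can be computed with an off-the-shelf lasso solver by augmenting the design with an identity block and penalizing only the mean-shift coefficients. The sketch below uses the glmnet package; the particular penalty level and the choice to disable standardization are illustrative assumptions, not recommendations from the paper.

```r
library(glmnet)

set.seed(7)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- as.numeric(X %*% c(1, 2, -1) + rnorm(n))
y[c(3, 17)] <- y[c(3, 17)] + 6          # two mean-shift outliers

## Augmented design [X, I_n]; penalty.factor = 0 leaves the beta-block unpenalized,
## so only u is lasso-penalized, matching (10).
XI  <- cbind(X, diag(n))
fit <- glmnet(XI, y, intercept = FALSE, standardize = FALSE,
              penalty.factor = c(rep(0, p), rep(1, n)))

b_full <- as.numeric(coef(fit, s = 0.05))   # (intercept, beta, u) at one penalty level
u_hat  <- b_full[(1 + p + 1):(1 + p + n)]
which(abs(u_hat) > 0)                       # detected outliers (here, hopefully 3 and 17)
```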

One more outlier detection method, called DFFITS (Welsch & Kuh, 1977), is described in Appendix C, where it is shown to be quadratic. The only outlier detection method we know of that is not quadratic is the IPOD procedure of She & Owen (2011) when the thresholding rule it uses is not soft-thresholding.

3 Inference Corrected for Outlier Removal

In this section, we describe how the standard inferential tools of OLS can be corrected to account for outlier removal. The only requirement is that the outlier detection procedure be quadratic (as defined in the previous section). The inferential statements are made conditional on the event $\{\mathcal{I}(y) = \mathcal{I}\}$ and concern the parameter $\beta^{\mathcal{I}}$. As previously discussed, such statements translate to unconditional statements about $\beta$ when $\mathcal{I}(y) \subseteq \mathcal{I}^*$, that is, when all true outliers are removed. Section 3.1 treats the case in which $\sigma^2$ is known. Section 3.2 provides procedures for the case in which $\sigma^2$ is unknown.

3.1 Confidence Intervals and Hypothesis Tests When $\sigma^2$ Is Known

In this section, we suppose that $\sigma^2$ is known and provide confidence intervals and hypothesis tests. In the classical setting, inference is based on the normal and $\chi^2$ distributions and typically involves individual regression coefficients $\beta_j^{\mathcal{I}}$, the regression surface $x^\top \beta^{\mathcal{I}}$, or groups of regression coefficients $\beta_G^{\mathcal{I}}$. We begin by observing that both $\beta_j^{\mathcal{I}}$ and $x^\top \beta^{\mathcal{I}}$ are of the form $\nu^\top \mathbb{E}[y]$ for some vector $\nu$ that depends on $\mathcal{I}$. The next theorem gives a unified treatment of these two cases and will allow us to construct confidence intervals and $p$-values that properly account for outlier removal.

Theorem 3.1.

Assume the outlier detection procedure is quadratic as in Definition 2.2. Let $\nu$ be a vector that may depend on $\mathcal{I}$. Define

We have

(11)

where $TN(\mu, \tau^2, \mathcal{T})$ denotes a $N(\mu, \tau^2)$ random variable truncated to the set $\mathcal{T}$. We can compute $\mathcal{T}$ by finding the roots of a finite set of quadratic polynomials:

(12)
(13)

Thus, letting $F^{\mathcal{T}}_{\mu, \tau^2}$ be the CDF of a $TN(\mu, \tau^2, \mathcal{T})$ random variable, we have

(14)
Proof.

See Appendix D.3. ∎

The classical analogue to the above theorem is the (much simpler!) statement that $\nu^\top y \sim N\big(\nu^\top \mathbb{E}[y],\ \sigma^2 \|\nu\|_2^2\big)$.

This theorem is essentially a generalization of Lee et al. (2016, Theorem 5.2) and a special case of Loftus & Taylor (2015, Theorem 3.1); however, a key difference is that these works are focused on accounting for variable selection rather than outlier removal (which, in essence, is “observation selection”).
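Computationally, the heart of Theorem 3.1 is evaluating the CDF of a normal distribution truncated to a union of intervals and using it as a pivot. A generic base-R helper of that kind is sketched below; the interval representation of the truncation set and the example numbers are hypothetical.

```r
## CDF of N(mu, sd^2) truncated to a union of disjoint intervals, supplied as a
## two-column matrix of (lower, upper) endpoints.
ptrunc_norm <- function(x, mu, sd, intervals) {
  lo <- intervals[, 1]; hi <- intervals[, 2]
  total <- sum(pnorm(hi, mu, sd) - pnorm(lo, mu, sd))
  below <- sum(pnorm(pmin(pmax(x, lo), hi), mu, sd) - pnorm(lo, mu, sd))
  below / total
}

## Example: a two-sided selective p-value for H0: nu' E[y] = 0, given the observed
## statistic nu' y, its null standard deviation, and a hypothetical truncation set.
trunc_set <- rbind(c(-Inf, -1.2), c(0.8, Inf))
stat <- 2.3; sd0 <- 1
cdf  <- ptrunc_norm(stat, mu = 0, sd = sd0, intervals = trunc_set)
2 * min(cdf, 1 - cdf)                   # two-sided selective p-value
```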

3.1.1 Corrected Confidence Intervals

We begin by applying Theorem 3.1 to get confidence intervals corrected for outlier removal.

Corollary 3.2.

Under the conditions and notation of Theorem 3.1, if we find $L$ and $U$ such that

(15)
(16)

then $[L, U]$ is a valid selective confidence interval for $\nu^\top \mathbb{E}[y]$. That is,

(17)
Proof.

See Appendix D.3. ∎

This result encompasses the two most common types of confidence intervals arising in regression: intervals for the regression coefficients $\beta_j^{\mathcal{I}}$ and intervals for the regression surface $x^\top \beta^{\mathcal{I}}$.

Corollary 3.3.

We write $\beta_j^{\mathcal{I}} = \nu^\top \mathbb{E}[y]$ and $x^\top \beta^{\mathcal{I}} = \tilde\nu^\top \mathbb{E}[y]$, where $\nu$ is supported on $\mathcal{I}$ with $\nu_{\mathcal{I}} = (X_{\mathcal{I}}^{+})^\top e_j$, and $\tilde\nu$ is supported on $\mathcal{I}$ with $\tilde\nu_{\mathcal{I}} = (X_{\mathcal{I}}^{+})^\top x$. Then Theorem 3.1 and Corollary 3.2 apply.
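Corollary 3.2 is applied by inverting the truncated-normal pivot of Theorem 3.1 in the mean parameter. The base-R sketch below does the inversion numerically with uniroot; the truncation set, observed statistic, and search width are hypothetical inputs used only for illustration.

```r
## CDF of N(mu, sd^2) truncated to a union of intervals (as in the helper above).
ptrunc_norm <- function(x, mu, sd, intervals) {
  lo <- intervals[, 1]; hi <- intervals[, 2]
  sum(pnorm(pmin(pmax(x, lo), hi), mu, sd) - pnorm(lo, mu, sd)) /
    sum(pnorm(hi, mu, sd) - pnorm(lo, mu, sd))
}

selective_ci <- function(stat, sd0, intervals, alpha = 0.05, width = 20) {
  ## L solves F_mu(stat) = 1 - alpha/2 and U solves F_mu(stat) = alpha/2, where
  ## F_mu is the CDF of N(mu, sd0^2) truncated to `intervals`.
  f <- function(mu, target) ptrunc_norm(stat, mu, sd0, intervals) - target
  L <- uniroot(f, interval = stat + c(-width, width) * sd0, target = 1 - alpha / 2)$root
  U <- uniroot(f, interval = stat + c(-width, width) * sd0, target = alpha / 2)$root
  c(lower = L, upper = U)
}

## Hypothetical truncation set and observed statistic:
selective_ci(stat = 2.3, sd0 = 1,
             intervals = rbind(c(-Inf, -1.2), c(0.8, Inf)))
```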

A third type of interval common in regression is the prediction interval, intended to cover the response at a new data point $x$, which involves fresh noise independent of $y$. While $x^\top \hat\beta^{\mathcal{I}}$ is (conditionally) a truncated normal random variable, its sum with the fresh noise is not, so the strategy adopted in Theorem 3.1 does not directly apply to this case. Instead we employ a simple (but conservative) strategy.

Proposition 3.4.

Let $\epsilon_{\text{new}}$ be the new noise term, independent of $y$. For a given significance level $\alpha$, let $\alpha_1 + \alpha_2 = \alpha$ with $\alpha_1, \alpha_2 > 0$. Given $\alpha_1$, let $[L, U]$ be the selective confidence interval for $x^\top \beta^{\mathcal{I}}$ as defined in (15) and (16). Then we have

(18)

where $\Phi$ is the CDF of a standard normal distribution.

Proof.

See Appendix D.4 for a detailed proof. ∎

Remark.

In practice, we can optimize over the split $(\alpha_1, \alpha_2)$ so that the length of the interval is minimized.

3.1.2 Corrected Hypothesis Tests

Theorem 3.1 allows us to form selective hypothesis tests about the parameter $\nu^\top \mathbb{E}[y]$, where $\nu$ may depend on the selected index set of observations $\mathcal{I}$.

Corollary 3.5.

Under the conditions and notation of Theorem 3.1, the quantity

gives a valid selective $p$-value for testing $H_0 : \nu^\top \mathbb{E}[y] = 0$.

Proof.

See Appendix D.3. ∎

The most common application of the above would be testing whether a specific regression coefficient is zero, conditional on $\mathcal{I}$ being the selected set of non-outliers: $H_{0,j} : \beta_j^{\mathcal{I}} = 0$ for each $j$.

As a generalization, we next focus on testing $H_0 : \beta_G^{\mathcal{I}} = 0$ for a group of coefficients $G$. We begin with an alternative characterization of this null hypothesis.

Proposition 3.6.

Set , where is the projection matrix onto . Let be the projection matrix onto . Then we have

(19)

Further, define

Then is an orthogonal projection matrix (it is symmetric and idempotent), and we have

(20)
Proof.

See Appendix D.5 for a detailed proof. ∎

Remark.

This proposition characterizes as testing the projection of . In the non-selective case, testing for some projection matrix can be done based on under . We would expect that in the selective case, such tests can be done based on a truncated distribution.

Theorem 3.7.

Assume the outlier detection procedure is quadratic as in Definition 2.2. Define

(21)
(22)
(23)

Under $H_0$, we have

(24)

where the right-hand side is a central $\chi$ random variable, with the appropriate degrees of freedom, truncated to a set $\mathcal{T}$. We can compute $\mathcal{T}$ by finding the roots of a finite set of quadratic polynomials:

(25)
(26)

Further, letting $F_{\mathcal{T}}$ be the CDF of this truncated random variable, we have

(27)

which is a valid selective $p$-value for testing $H_0$.

Proof.

See Appendix D.6 for a detailed proof. ∎

Remark.

This theorem is adapted from Loftus & Taylor (2015, Theorem 3.1) to the outlier detection context. In the special case where is a single index, direct computation can show that , so that

Then this theorem nearly reduces to Theorem 3.1, except that in this theorem, we need to condition on the sign of .

3.2 Extension to the Unknown $\sigma^2$ Case

In this section, we extend the results of Section 3.1.2 to the case where $\sigma^2$ is unknown. In the non-selective case, the hypothesis $\beta_G^{\mathcal{I}} = 0$ can be re-expressed in terms of projections of $\mathbb{E}[y]$, under which the numerator and denominator of the classical test statistic are built from centered normal random variables, and the test can be based on an $F$ distribution. By analogy, we might expect the selective test to be based on a truncated $F$ distribution; however, we will see in the rest of the section that this is only partially true.

Proposition 3.8.

We have but . Moreover, if , then .

Proof.

See Appendix D.7 for a detailed proof. ∎

In order to form an $F$ statistic, we need both the numerator and the denominator to be composed of centered random variables, so it is necessary to make the assumption in Proposition 3.8. Hence this proposition says that this restricted form of the null hypothesis is the best we can do. Our next result adapts a truncated significance test from Loftus & Taylor (2015) to our purposes.

Theorem 3.9.

Assume the outlier detection procedure is quadratic as in Definition 2.2. Let , where

Define

(28)
(29)
(30)

Under $H_0$, we have

(31)

where the right-hand side is a central $F$ random variable, with the appropriate degrees of freedom, truncated to a set $\mathcal{T}$, which we can compute by

(32)
(33)

Further, letting $F_{\mathcal{T}}$ be the CDF of this truncated random variable, we have

(34)

which is a valid selective $p$-value for testing $H_0$.

Proof.

See Appendix D.8 for a detailed proof. ∎

Computing the truncation set in the unknown $\sigma^2$ case is non-trivial, since the constraint along each slice is no longer a quadratic function. We adopt the strategy suggested by Loftus & Taylor (2015, Section 4.1). For completeness, we provide the details of their strategy (adapted to our notation) in Appendix E.1.

We conclude this section by noting that Theorem 3.9 does not give us a way to construct confidence intervals for $\beta_G^{\mathcal{I}}$. In order to form confidence intervals, one would need to be able to test null hypotheses of the form $\beta_G^{\mathcal{I}} = c$ for some non-zero constant $c$. Under such a null, the test statistic does not necessarily reduce to the square of a truncated random variable: first, the centering used in Theorem 3.9 does not necessarily hold, so the numerator may not even be centered; second, the independence between the numerator and the denominator may not hold. Hence the construction of confidence intervals does not follow directly from Theorem 3.9 and is left as future work.

4 Empirical Examples

We provide simulations and real data examples in this section. We note that our method requires evaluation of the survival functions (equivalently, the CDFs) of truncated normal, truncated $\chi$, and truncated $F$ distributions. We refer readers to Appendix E.2 for implementation details.

4.1 Simulations

In this section, we focus on the case where outlier detection is done by Cook’s distance, and we assume $\sigma^2$ is unknown. We refer readers to the supplementary materials for more detailed and comprehensive simulations. We compare the performance of the following three inferential procedures:

  • detect-and-forget: After outlier detection, refit an OLS regression model using the remaining data and do inference based on the classical (non-selective) theory (we use $t$ and $F$ distributions since $\sigma^2$ is unknown);

  • corrected-est: Do selective inference as developed in Sections 3.1.1 and 3.1.2, with $\sigma^2$ replaced by the estimate $\hat\sigma^2 = \| y - X \hat\beta_{\text{lasso}} \|_2^2 / (n - |\hat{S}|)$, where we fit a lasso regression of $y$ on $X$ to get $\hat\beta_{\text{lasso}}$, and $\hat{S}$ is the support of $\hat\beta_{\text{lasso}}$ (see the sketch after this list). Reid et al. (2013) demonstrate that such a strategy gives a reasonably good estimate of $\sigma^2$ in a wide range of situations.

  • corrected-exact: Do selective inference assuming $\sigma^2$ unknown, as developed in Section 3.2 (note: this method does not give confidence intervals).
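The variance estimate used by corrected-est can be computed along the following lines with cv.glmnet; the use of the cross-validated lambda.min and the handling of the intercept are implementation choices made here for illustration, not necessarily those of the outference package.

```r
library(glmnet)

estimate_sigma2 <- function(X, y) {
  cvfit <- cv.glmnet(X, y)
  b_hat <- coef(cvfit, s = "lambda.min")            # includes the intercept
  S_hat <- which(as.numeric(b_hat)[-1] != 0)        # support of the lasso fit
  resid <- y - predict(cvfit, newx = X, s = "lambda.min")
  sum(resid^2) / (length(y) - length(S_hat) - 1)    # "-1" counts the intercept
}

## Illustrative use:
set.seed(8)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))
estimate_sigma2(X, y)   # should be roughly 1 (the true noise variance) here
```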

We fix $n$ and $p$. Our indexing of variables starts from $0$ (i.e., index $0$ corresponds to the intercept). The first column of $X$ is set to be the all-ones vector, and the rest of the columns are generated with i.i.d. Gaussian entries and scaled to have a common norm. We fix $X$ across all replications.

To examine the coverage of confidence intervals for , we let and . We then fix , and we vary . Outliers are then detected using Cook’s distance with different cutoffs as introduced in Equation (4). For each configuration, we do the following times: we generate the response , where ; we then detect outliers and form confidence intervals. The detect-and-forget confidence intervals are set to be

where (note that is different from and, as noted in Fithian et al. 2014, is generally not considered a good estimate of ). Figure 3 shows the empirical coverage probability for and . As our theories predict, corrected-est intervals give