outliersTests
The R package for identifying outliers in data
We propose a simple multiple-outlier identification method for parametric location-scale and shape-scale models when the number of possible outliers is not specified. The method is based on a result giving asymptotic properties of extreme z-scores; robust estimators of the model parameters are used to define the z-scores. An extensive simulation study was carried out to compare the proposed method with existing methods. For the normal family, the method is compared with the well-known Davies-Gather, Rosner's, Hawkins's and Bolshev's multiple-outlier identification methods. The choice of an upper limit for the number of possible outliers when applying Rosner's test is discussed. For other families, the proposed method is compared with a method generalizing the Davies-Gather method. In most situations, the new method has the highest outlier identification power in terms of masking and swamping values. We also created the R package outliersTests implementing the proposed test.
The problem of multiple outlier identification has received the attention of many authors. The majority of outlier identification methods define rules for the rejection of the most extreme observations. The bulk of publications concentrate on the normal distribution (see [bolshev_chauvenets_1975, davies_identification_1993, dixon_analysis_1950, grubbs_sample_1950, rosner_detection_1975, tietjen_grubbs-type_1972]; see surveys in [barnett1974outliers, zerbet_statistical_2002]). For the non-normal case, most of the literature pertains to the exponential and gamma distributions, see [chikkagoudar_distributions_1983, kabe_testing_1970, kimber_testing_1988, lalitha_multiple_2012, lewis_recursive_1979, likes_distribution_1967, lin_exact_2009, lin_tests_2014, zerbet_new_2003]. When constructing outlier identification methods, most authors suppose that the number of observations suspected to be outliers is specified. These methods have a serious drawback: only two conclusions are possible, either exactly the specified number of observations are declared outliers, or it is concluded that outliers are absent. It is more natural to consider methods which do not specify the number of suspected observations, or at least only specify an upper limit for it. Such methods are not very numerous, and they concern normal or exponential samples: the methods of [rosner_detection_1975, bolshev_chauvenets_1975, hawkins_identification_1980] for normal samples, and of [kimber_tests_1982, lin_exact_2009, lin_tests_2014] for exponential samples. The only method which does not specify the upper limit is the [davies_identification_1993] method for normal samples.
We give a competitive and simple method for outlier identification in samples from location-scale and shape-scale families of probability distributions. As in the case of the Davies-Gather method, the upper limit for the number of outliers is not specified. The method is based on a theorem giving asymptotic properties of extreme z-scores; robust estimators of the model parameters are used to define the z-scores.

In Section 2 we present a short overview of the notion of the outlier region given by [davies_identification_1993]. In Section 3 we give asymptotic properties of extreme z-scores based on equivariant estimators of the model parameters, and introduce a new outlier identification method for parametric models based on the asymptotic result and robust estimators. In Section 4 we consider rather evident generalizations of the Davies-Gather test for normal data to location-scale families. In Section 5 we give a short overview of known multiple-outlier identification methods for normal samples which do not specify an exact number of suspected outliers. In Section 6 we compare the performance of the new and existing methods.
Consider a parametric family of absolutely continuous cumulative distribution functions with continuous unimodal densities on the support of the c.d.f. Suppose that, if the data are not contaminated with unusual observations, the following null hypothesis is true: there exist parameter values such that (1) holds.
There are two different definitions of an outlier. In the first case, an outlier is an observation which falls into some outlier region. The outlier region is a set such that the probability that at least one observation from a sample falls into it is small if the null hypothesis is true. In such a case the probability that a specified observation falls into the outlier region is very small. If an observation has a distribution different from that under the null hypothesis, then this probability may be considerably higher.

In the second case, an observation is an outlier if its probability distribution is different from that under the null hypothesis. In this case outliers are often called contaminants.

So in the first case there is a very small probability of having an outlier under the null hypothesis. In the second case outliers (contaminants) do not exist under the null hypothesis, and outliers (contaminants) do not necessarily fall into the outlier region. Both definitions give approximately the same outliers if the alternative distribution is concentrated in the outlier region. Precisely such contaminants can be called outliers in the sense that outliers are anomalous extreme observations. In such a case it is possible to compare outlier and contaminant search methods.
In this paper, we consider location-scale and shape-scale families. Location-scale families have the form F((x − μ)/σ) with a completely specified baseline c.d.f. F and p.d.f. f. Shape-scale families have the form G((x/θ)^ν) with a completely specified baseline c.d.f. G and p.d.f. g. By a logarithmic transformation the shape-scale families are transformed to location-scale families, so we concentrate on location-scale families. Methods for such families are easily modified to methods for shape-scale families.
The right-sided α-outlier region for a location-scale family is the set of x with x > μ + σ F^(-1)(1 − α), and the left-sided α-outlier region is the set of x with x < μ + σ F^(-1)(α). The two-sided α-outlier region (2) is the union of the left- and right-sided (α/2)-outlier regions. If the density f is symmetric, then the two-sided outlier region has the simpler form {x : |x − μ| > σ F^(-1)(1 − α/2)}.
The value of α_n is chosen depending on the size n of the sample. The choice is based on the assumption that under the null hypothesis, for some α close to zero,

P(no one of X_1, ..., X_n falls into the α_n-outlier region) = 1 − α.  (3)

The equality (3) means that under the null hypothesis the probability that none of X_1, ..., X_n falls into the α_n-outlier region is 1 − α. It implies that

α_n = 1 − (1 − α)^(1/n).  (4)

The sequence α_n decreases from α to 0 as n goes from 1 to infinity.
The first definition of an outlier is then the following: for a sample of size n, a realization of X_i is called an outlier if it falls into the α_n-outlier region, and a right outlier if it falls into the right-sided α_n-outlier region.

The number of outliers under the null hypothesis has the binomial distribution B(n, α_n), and the expected number of outliers in the sample under the null hypothesis is n α_n. Note that n α_n → −ln(1 − α) as n → ∞. For example, if α = 0.05, then −ln(1 − α) ≈ 0.0513, so the expected number of outliers is approximately 0.05, i.e. it practically does not depend on n. So under the null hypothesis the expected number of outliers is negligible with respect to the sample size n.

Suppose that the c.d.f. also belongs to the domain of attraction of an extreme value distribution (see [de2007extreme]).
If the c.d.f. belongs to the Gumbel domain of attraction, then there exist normalizing constants a_n > 0 and b_n such that the normalized maxima converge in distribution; a similar statement holds for the minima. One possible choice of the sequences a_n and b_n is

b_n = F^(-1)(1 − 1/n),   a_n = 1/(n f(b_n)).  (5)

In the particular case of the normal distribution an equivalent asymptotic form can be used. Expressions of a_n and b_n for some of the most used distributions are given in Table 1.
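As a concrete illustration, one standard choice of the constants in the Gumbel domain of attraction is b_n = F^(-1)(1 − 1/n) and a_n = 1/(n f(b_n)). A minimal sketch for the standard normal baseline (the finite-sample forms tabulated in Table 1 may differ):

```python
from statistics import NormalDist

def norming_constants_normal(n):
    """One standard choice of normalizing constants for the standard
    normal c.d.f. in the Gumbel domain of attraction:
    b_n = Phi^{-1}(1 - 1/n), a_n = 1 / (n * phi(b_n))."""
    nd = NormalDist()
    b_n = nd.inv_cdf(1.0 - 1.0 / n)       # centering constant
    a_n = 1.0 / (n * nd.pdf(b_n))         # scaling constant
    return a_n, b_n
```

For n = 100 this gives b_n ≈ 2.326 (the 99% normal quantile), the typical size of the largest of 100 standard normal observations.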
Table 1. Expressions of the normalizing constants a_n and b_n for the normal, type I extreme value, type II extreme value, logistic, Laplace and Cauchy distributions.
Condition A. Consider a model that satisfies the following conditions:

1. the estimators of the location and scale parameters are consistent;
2. the limit distribution of the normalized estimators is non-degenerate.

Condition A3 is satisfied for many location-scale models, including the normal, type I extreme value, type II extreme value, logistic, Laplace and Cauchy models.
Set z_i = (X_i − μ̂)/σ̂, i = 1, ..., n. The random variables z_1, ..., z_n are called z-scores. Denote by z_(1) ≤ ... ≤ z_(n) the respective order statistics.

The following theorem is useful for the construction of right-outlier detection tests.

Theorem 3.1. If the baseline c.d.f. belongs to the Gumbel domain of attraction and Conditions A hold, then for fixed s

((z_(n) − b_n)/a_n, ..., (z_(n−s+1) − b_n)/a_n) converges in distribution to (−ln E_1, −ln(E_1 + E_2), ..., −ln(E_1 + ... + E_s))

as n → ∞, where E_1, E_2, ... are i.i.d. standard exponential random variables.
Note that the distribution of each limiting component is explicit. It implies that if the baseline c.d.f. is in the Gumbel domain of attraction, then for fixed s and i = 1, ..., s, the convergence (6) holds for the i-th largest z-score. Similarly, under the corresponding condition on the left tail, the analogous convergence (7) holds for the smallest z-scores for fixed s and i = 1, ..., s.
The following theorem is useful for the construction of outlier detection tests in the two-sided case, when the density f is symmetric. For any sequence, denote by |z|_(1) ≤ ... ≤ |z|_(n) the ordered absolute values.
Suppose now that the density f is not symmetric. Consider the absolute values of the z-scores; their c.d.f. and p.d.f. are determined by F and f, and the corresponding normalizing constants are defined by (10).

For example, explicit expressions of these constants are obtained for the type I extreme value distribution; for the type II extreme value distribution the normalizing constants have the same expressions as for the type I extreme value distribution.
Similarly as in Theorem 3.1, we have that if the baseline c.d.f. is in the Gumbel domain of attraction, then the convergence (11) holds for fixed s and i = 1, ..., s, and under the corresponding condition on the left tail the analogous result (12) holds.
The choice of the estimators of μ and σ is important in the outlier detection problem: the maximum likelihood estimators from the complete sample are not stable when outliers are present.

In the case of location-scale families, highly efficient robust estimators of the location and scale parameters μ and σ are (see [rousseeuw_alternatives_1993])

μ̂ = M,   σ̂ = d D_(k),  (13)

where M is the empirical median, the D_i are the absolute values of the pairwise differences of the observations, and D_(k) is the k-th order statistic of the D_i. The constant d has the form d = 1/H^(-1)(1/4), where H^(-1) is the inverse of the c.d.f. H of |Y_1 − Y_2|, Y_1 and Y_2 being independent random variables with the baseline c.d.f. Values of d for some well-known location-scale families are given in Table 2.
Distribution | d
---|---
Normal | 2.2219
Type I extr.val. | 1.9576
Type II extr.val. | 1.9576
Logistic | 1.3079
Laplace | 1.9306
Cauchy | 1.2071
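The scale estimator in (13) matches the Rousseeuw-Croux Q_n estimator (the constant 2.2219 in Table 2 is the Q_n consistency constant for the normal family). A minimal sketch, assuming the Q_n order-statistic index k = C(h, 2) with h = floor(n/2) + 1 (the extracted text does not state k explicitly) and the sample median as the location estimate:

```python
import numpy as np

def qn_scale(x, d=2.2219):
    """Rousseeuw-Croux-type Qn scale estimate: d times the k-th order
    statistic of the pairwise absolute differences |x_i - x_j|, i < j,
    with k = C(h, 2), h = floor(n/2) + 1 (assumed convention)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    i, j = np.triu_indices(n, k=1)
    diffs = np.sort(np.abs(x[i] - x[j]))
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return d * diffs[k - 1]

def robust_z_scores(x, d=2.2219):
    """z-scores built from the median and the robust scale estimate."""
    x = np.asarray(x, dtype=float)
    return (x - np.median(x)) / qn_scale(x, d)
```

With d = 2.2219 the scale estimate is consistent for the standard deviation of normal data, so z-scores of clean normal samples behave approximately like standard normal variables even when a few gross outliers are present.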
The estimators considered above are equivariant: for any shift b and any scale c > 0, transforming the sample as cX_i + b transforms the estimators to c μ̂ + b and c σ̂. Equivariant estimators have the following property: the distributions of (μ̂ − μ)/σ̂ and σ̂/σ, and hence of the z-scores, are parameter-free. In particular, the distribution of the test statistic is parameter-free for any fixed k.
Denote the critical value of the statistic accordingly; note that it is an exact, not an asymptotic, critical value under the null hypothesis.

Theorem 3.1 implies that the limit distribution (as n → ∞) of the test statistic coincides with the distribution of a random variable defined through the sums E_1 + ... + E_i of i.i.d. standard exponential random variables. The components of the limiting vector are dependent and identically distributed, and the distribution of each component is uniform on (0, 1).

Denote the corresponding critical values of the limiting random variable accordingly. They are easily found by simulation: generate i.i.d. standard exponential random variables many times and compute the values of the limiting random variable.
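Since the limiting law is a function of i.i.d. standard exponentials, its upper critical values can be obtained by plain Monte Carlo. A generic sketch; the exact functional form of the paper's limiting statistic is not reproduced in the extracted text, so it is passed in via the hypothetical argument `stat_fn`:

```python
import numpy as np

def simulate_critical_value(stat_fn, k, alpha=0.05, n_sim=100_000, seed=1):
    """Monte Carlo upper (1 - alpha) critical value of a statistic that is
    a function of k i.i.d. standard exponential random variables.
    stat_fn maps an (n_sim, k) array of exponentials to n_sim values."""
    rng = np.random.default_rng(seed)
    e = rng.exponential(size=(n_sim, k))
    return float(np.quantile(stat_fn(e), 1.0 - alpha))
```

As a sanity check, for stat_fn equal to the row maximum the simulated value approaches the closed form −ln(1 − (1 − α)^(1/k)), the exact upper α critical value of the maximum of k standard exponentials.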
Our simulations showed that the outlier identification method proposed below gives practically the same results with exact and with approximate critical values of the statistic, so for moderate and large samples we recommend approximating the critical level of the statistic by the limiting critical values, which depend only on α. We shall see that for the purpose of outlier identification only a few critical values are needed.
We write shortly BP method for the method considered below.

BP method for right outliers. Begin the outlier search using the observations corresponding to the largest values of the z-scores; we recommend beginning with the five largest. So take s = 5 and compute the values of the corresponding statistics.

If none of the statistics exceeds its critical value, then it is concluded that outliers are absent and no further investigation is done. Under the null hypothesis the probability of this event is approximately 1 − α.

If some statistic exceeds its critical value, then it is concluded that outliers exist.

Note (see the classification scheme below) that in this case at least one observation is declared an outlier, so the probability of declaring the absence of outliers does not depend on the classification scheme that follows.

If it is concluded that outliers exist, the search for outliers is done using the following steps.
Step 1. Let m be the largest index k ≤ s for which the corresponding statistic exceeds its critical value. Note that the maximum exists because outliers were detected.

If m < s, then the classification is finished at this step: the observations corresponding to the m largest z-scores are declared right outliers, because if an observation is declared an outlier, then it is natural to declare the observations with larger z-scores as outliers, too.

If m = s, then it is possible that the number of outliers is higher than s. Then the observation corresponding to the largest z-score is declared an outlier and we proceed to Step 2.

Step 2. The above procedure is repeated with the remaining n − 1 observations, and m is defined analogously. If m < s, the classification is finished, and the observations corresponding to the m largest remaining z-scores are declared outliers in addition to the one declared in Step 1.

If m = s, then it is possible that the number of outliers is higher still. Then the observation corresponding to the largest remaining z-score is declared an outlier, so in total two observations are declared outliers, and we proceed to Step 3. And so on: classification finishes at the d-th step when m < s. So we declare d − 1 outliers in the previous steps and m outliers in the last one; the total number of observations declared as outliers is d − 1 + m.
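The stepwise scheme above can be sketched as follows. The decision rule is abstracted into a hypothetical function `exceeds(k, zs)` reporting whether the statistic for the k-th largest z-score of the current subsample exceeds its critical value; the paper's actual statistic and critical values are not reproduced here:

```python
def bp_right_outliers(z, exceeds, s=5):
    """Stepwise BP-style classification sketch for right outliers.
    z: sequence of z-scores; exceeds(k, zs): hypothetical test of the
    k-th largest z-score of the current subsample zs against its
    critical value. Returns the indices declared outliers."""
    order = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    declared, current = [], order
    while len(current) >= s:
        zs = [z[i] for i in current]
        hits = [k for k in range(1, s + 1) if exceeds(k, zs)]
        if not hits:                 # nothing exceeds: outliers absent / done
            break
        m = max(hits)
        if m < s:                    # classification finishes at this step
            declared += current[:m]
            break
        declared.append(current[0])  # m == s: reject the top one and repeat
        current = current[1:]
    return declared
```

With a toy rule that flags the k-th largest z-score when it exceeds 3, a sample containing three large values yields exactly those three indices, and a clean sample yields an empty list (outliers absent).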
Let be the normalizing constants defined by (10). If , , then set
If , , then replace by . Denote by the critical value of the statistic .
Theorem 3.1 and Remark 3 imply that the limit distribution (as ) of the random variable coincides with the distribution of the random variable . So the critical values are approximated by the critical values .
The left-outlier search method coincides with the right-outlier search method, with the corresponding replacement made in all formulas.
Let be defined by (5). If , , then set
If , , then replace by . Denote by the critical value of the statistic .
Theorem 3.1 and Remark 2 imply that the limit distribution (as ) of the random variable coincides with the distribution of the random variable . So the critical values are approximated by the critical values .
The two-sided outlier search method coincides with the right-outlier search method, skipping the upper index in all formulas.
Suppose now that the density f is not symmetric. Let the normalizing constants be defined by (10).

Begin the outlier search using the observations corresponding to the largest and the smallest values of the z-scores; we recommend beginning with the five smallest and five largest. So compute the values of the corresponding left-tail and right-tail statistics. If neither statistic exceeds its critical value, then it is concluded that outliers are absent and no further investigation is done.

If either statistic exceeds its critical value, then it is concluded that outliers exist. If the left-tail statistic exceeds its critical value, then left outliers are searched for as in Section 3.3. If the right-tail statistic exceeds its critical value, then right outliers are searched for as in Section 3.2. The only difference is that the critical value is replaced accordingly in all formulas.
If shape-scale families of the form G((t/θ)^ν) with specified G are considered, then the above tests for location-scale families can be used, because if T_1, ..., T_n is a sample from a shape-scale family, then X_i = ln T_i, i = 1, ..., n, is a sample from a location-scale family with μ = ln θ and σ = 1/ν.
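The reduction is a one-line transformation: taking logarithms of a positive shape-scale sample produces a location-scale sample to which the tests above apply. A minimal sketch:

```python
import numpy as np

def shape_scale_to_location_scale(t):
    """Log-transform a positive shape-scale sample so that location-scale
    outlier tests can be applied: X_i = ln T_i."""
    t = np.asarray(t, dtype=float)
    if np.any(t <= 0):
        raise ValueError("shape-scale observations must be positive")
    return np.log(t)
```

For example, a Weibull sample (a shape-scale family) is mapped to a type I extreme value location-scale sample with μ = ln θ and σ = 1/ν.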
To illustrate the simplicity of the BP method, let us consider an illustrative example of its application (sample size n = 20, 7 outliers). A sample of size 20 from the standard normal distribution was generated, and the 1st-3rd and 17th-20th observations were replaced by outliers. The observations x_i, the absolute values of the z-scores |z_i|, and the ranks of |z_i| are presented in Table 3.
i | x_i | abs z_i | rank | i | x_i | abs z_i | rank
---|---|---|---|---|---|---|---
1 | 6.10 | 3.18 | 16 | 11 | -0.69 | 0.28 | 9
2 | 10 | 5.17 | 18 | 12 | -0 | 0.07 | 5
3 | 6.20 | 3.23 | 17 | 13 | 0.05 | 0.10 | 6
4 | -0.08 | 0.03 | 2 | 14 | -0.20 | 0.03 | 1
5 | 0.63 | 0.39 | 11 | 15 | -0.25 | 0.06 | 4
6 | -0.54 | 0.21 | 7 | 16 | -0.64 | 0.25 | 8
7 | 1.37 | 0.77 | 13 | 17 | -6.30 | 3.14 | 15
8 | 0.46 | 0.30 | 10 | 18 | -5.50 | 2.73 | 14
9 | -0.22 | 0.04 | 3 | 19 | -12.10 | 6.10 | 19
10 | 0.94 | 0.55 | 12 | 20 | -20 | 10.13 | 20
Table 4. Values of the test statistics at the successive classification steps:

1.000000 | 1.000000 | 1.000000 | 0.999998 | 1.000000 | 1.000000
0.999685 | 0.999998 | 0.999916 | 0.999998 | 1.000000 | 1.000000
0.998046 | 0.996970 | 0.999893 | 0.999997 | 0.999997 | 0.999997
0.924219 | 0.996446 | 0.999871 | 0.999940 | 0.084290 | 0.999940
In Table 4 we present the steps of the classification procedure by the BP method. First, we compute (see line 1 of Table 4) the value of the statistic^1. Since it exceeds its critical value, we reject the null hypothesis, conclude that outliers exist and begin the search for outliers.

^1 In fact the value of the statistic is 0.999999999; it is rounded to 1.
Step 1. The inequality for the fifth-largest statistic (which corresponds to the fifth largest observation in absolute value) implies that m = s = 5. So it is possible that the number of outliers is greater than 5. We declare the largest in absolute value, the 20th observation, an outlier and continue the search.

Step 2. The analogous inequality for the fifth largest of the remaining 19 observations again implies that m = 5. So the number of outliers might be greater than 6. We declare the second largest in absolute value observation, the 19th, an outlier. So two observations (the 19th and 20th) are declared outliers, and we continue the search.

Step 3. The analogous inequality again implies that m = 5. We declare the third largest in absolute value observation, the 2nd, an outlier. So three observations (the 2nd, 19th and 20th) are declared outliers, and we continue the search.

Step 4. The inequalities now imply that m = 4 < 5. So four additional observations (the fourth to seventh largest in absolute value), namely the 3rd, 1st, 17th and 18th, are declared outliers, and the outlier search is finished. In all, 7 observations were declared outliers: the 1st-3rd and 17th-20th, as was expected. Note that since the outlier search procedure was done after rejection of the null hypothesis, the significance level did not change.
We created the R package outliersTests^2 so that the proposed BP test can be used in practice within R.

^2 The R package outliersTests can be accessed at: https://github.com/linas-p/outliersTests
Let us consider location-scale families. Following the idea of Davies and Gather [davies_identification_1993], define an empirical analogue of the right outlier region as the random region
(15)
where the constant is found using the condition
(16)
with robust equivariant estimators of the parameters μ and σ.
Set
The distribution of this random variable is parameter-free under the null hypothesis.
The equation (16) is equivalent to the equation
So the constant is the upper critical value of the corresponding random variable, and it is easily computed by simulation. Generalized Davies-Gather method for right-outlier identification: if no observation falls into the empirical right outlier region, then it is concluded that right outliers are absent; the probability of this event is 1 − α. Otherwise it is concluded that right outliers exist, and the value of an observation is admitted as an outlier if it falls into the empirical right outlier region; otherwise it is admitted as a non-outlier.
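The critical constant can be computed by simulation along the following lines, sketched here for a standard normal baseline. The median/MAD pair is used as a stand-in for the robust equivariant estimators (an assumption; the paper uses the estimators of (13)), and by parameter-freeness simulating under the standard baseline suffices:

```python
import numpy as np

def dg_right_critical_value(n, alpha=0.05, n_sim=20_000, seed=2):
    """Monte Carlo sketch of the upper critical value of
    T = max_i (X_i - mu_hat) / sigma_hat under a standard normal null,
    with median/MAD stand-ins for the robust equivariant estimators."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_sim, n))
    mu = np.median(x, axis=1, keepdims=True)
    sigma = 1.4826 * np.median(np.abs(x - mu), axis=1, keepdims=True)
    t = np.max((x - mu) / sigma, axis=1)          # largest robust z-score
    return float(np.quantile(t, 1.0 - alpha))
```

An observation is then declared a right outlier when its robust z-score exceeds the simulated constant; smaller α gives a larger constant, as expected.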
An empirical analogue of the left outlier region is defined as the random region
(17)
where the constant is found using the condition
(18)
Set
The distribution of this random variable is parameter-free under the null hypothesis.
The equation (18) is equivalent to the equation
So the constant is the upper critical value of the corresponding random variable, and it is easily computed by simulation. Generalized Davies-Gather method for left-outlier identification: if no observation falls into the empirical left outlier region, then it is concluded that left outliers are absent; the probability of this event is 1 − α. Otherwise it is concluded that left outliers exist, and the value of an observation is admitted as an outlier if it falls into the empirical left outlier region; otherwise it is admitted as a non-outlier.
Let us consider the two-sided case.
If the distribution is symmetric, then the empirical analogue of the outlier region is the random region
(19)
In this case