# Bias Reduced Peaks over Threshold Tail Estimation

In recent years several attempts have been made to extend tail modelling towards the modal part of the data. Frigessi et al. (2002) introduced dynamic mixtures of two components with a weight function π = π(x) smoothly connecting the bulk and the tail of the distribution. Recently, Naveau et al. (2016) reviewed this topic and, continuing the work of Papastathopoulos and Tawn (2013), proposed a statistical model which complies with extreme value theory and allows for a smooth transition between the modal and tail parts. Incorporating second order rates of convergence for distributions of peaks over thresholds (POT), Beirlant et al. (2002, 2009) constructed models that can be viewed as special cases of both approaches discussed above. When fitting such second order models it turns out that the bias of the resulting extreme value estimators is significantly reduced compared to classical tail fits using only the first order tail component, based on Pareto or generalized Pareto fits to peaks over threshold distributions. In this paper we provide novel bias reduced tail fitting techniques, improving upon the classical generalized Pareto (GP) approximation for POTs using the flexible semiparametric GP modelling introduced in Tencaliec et al. (2018). We also revisit and extend the second-order refined POT approach started in Beirlant et al. (2009) to all max-domains of attraction, using flexible semiparametric modelling of the second order component. In this way we relax the classical second order regular variation assumptions.


## 1 Introduction

Extreme value (EV) methodology starts from the assumption that the distribution of the available sample $X_1,\dots,X_n$ belongs to the domain of attraction of a generalized extreme value distribution, i.e. there exist sequences $a_n>0$ and $b_n$ such that, as $n\to\infty$,

$$\frac{\max(X_1,X_2,\dots,X_n)-b_n}{a_n}\ \overset{d}{\longrightarrow}\ Y_\xi, \qquad (1)$$

where $P(Y_\xi\le y)=\exp\left(-(1+\xi y)^{-1/\xi}\right)$ for some $\xi\in\mathbb{R}$ with $1+\xi y>0$. The parameter $\xi$ is termed the extreme value index (EVI). It is well known (see e.g. Beirlant et al., 2004, and de Haan and Ferreira, 2006) that (1) is equivalent to the existence of a positive function $\sigma_t$ such that

$$P\left(\frac{X-t}{\sigma_t}>y\;\Big|\;X>t\right)=\frac{\bar F(t+y\sigma_t)}{\bar F(t)}\ \xrightarrow[t\to x^+]{}\ \bar H^{GP}_{\xi}(y)=(1+\xi y)^{-1/\xi}, \qquad (2)$$

where $\bar F=1-F$ and $x^+$ denotes the endpoint of the distribution of $X$. The conditional distribution of $(X-t)/\sigma_t$ given $X>t$ is called the peaks over threshold (POT) distribution, while $\bar H^{GP}_{\xi}$ is the survival function of the generalized Pareto distribution (GPD).
In case $\xi>0$, the limit in (1) holds if and only if $\bar F$ is of Pareto-type, i.e.

$$\bar F(x)=x^{-1/\xi}\ell(x), \qquad (3)$$

for some slowly varying function $\ell$, i.e. satisfying $\ell(yt)/\ell(t)\to1$ as $t\to\infty$, for every $y>1$. Pareto-type distributions satisfy a simpler POT limit result: as $t\to\infty$,

$$P\left(\frac{X}{t}>y\;\Big|\;X>t\right)\ \to\ \bar H^{P}_{\xi}(y):=y^{-1/\xi},\quad y>1. \qquad (4)$$

Estimation of $\xi$ and of tail quantities such as return periods is then based on fitting a GPD to the observed excesses $X-t$ given $X>t$, respectively a simple Pareto distribution with survival function $\bar H^{P}_{\xi}$ to the ratios $X/t$ given $X>t$ in case $\xi>0$. The main difficulty in such an EV application is the choice of the threshold $t$. Most often, the threshold is chosen as one of the top data points $X_{n-k,n}$ for some $k\in\{1,\dots,n-1\}$, where $X_{1,n}\le X_{2,n}\le\dots\le X_{n,n}$ denotes the ordered sample. The limit results in (2) and (4) require $t$ to be chosen as large as possible (or, equivalently, $k$ as small as possible) for the bias in the estimation of $\xi$ and other tail parameters to be limited. However, in order to limit the estimation variance, $t$ should be as small as possible, i.e. the number $k$ of data points used in the estimation should be as large as possible. Several adaptive procedures for choosing $t$ or $k$ have been proposed, but mainly in the Pareto-type case with $\xi>0$ and under further second-order specifications of (3) or (4); see for instance Chapter 3 in Beirlant et al. (2004), or Matthys and Beirlant (2000).
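The classical POT workflow just described, i.e. pick $k$, take the excesses over $X_{n-k,n}$, and fit a GPD, can be sketched as follows. This is a minimal illustration using `scipy.stats.genpareto` on a simulated Pareto-type sample; the sample and the values of $k$ are arbitrary choices for the demonstration.

```python
import numpy as np
from scipy.stats import genpareto

def gpd_fit_excesses(sample, k):
    """Fit a GPD to the k largest excesses over the order statistic X_{n-k,n}.

    Returns (xi_hat, sigma_hat): shape and scale of the fitted GPD.
    """
    x = np.sort(sample)
    t = x[-k - 1]                  # threshold X_{n-k,n}
    excesses = x[-k:] - t          # the k peaks over threshold
    # loc fixed at 0: we model the excesses themselves
    xi_hat, _, sigma_hat = genpareto.fit(excesses, floc=0)
    return xi_hat, sigma_hat

rng = np.random.default_rng(1)
# Frechet-type sample with xi = 0.5 (standard Pareto with alpha = 2)
data = rng.pareto(2.0, size=5000) + 1.0
for k in (100, 500, 2000):
    xi_k, _ = gpd_fit_excesses(data, k)
    print(k, round(xi_k, 2))
```

Plotting the fitted $\hat\xi$ against $k$ reproduces the usual bias-variance trade-off: small $k$ gives noisy estimates, large $k$ introduces bias.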
In case of a real-valued EVI, the selection of an appropriate threshold is even more difficult and only a few methods are available. Dupuis (1999) suggested a robust model validation mechanism to guide the threshold selection, assigning weights between 0 and 1 to each data point where a high weight means that the point should be retained since a GPD model is fitting it well. However, thresholding is required at the level of the weights and hence the method cannot be used in an unsupervised manner.

Another approach consists of proposing penultimate limit distributions in (2) and (4). In case $\xi>0$, under the mathematical theory of second-order slow variation, i.e. assuming that

$$\frac{\ell(yt)}{\ell(t)}-1=\delta_t\left(y^{-\beta}-1\right), \qquad (5)$$

where $\delta_t\to0$ as $t\to\infty$, with $\beta>0$ and $t\mapsto t^{\beta}|\delta_t|$ slowly varying at infinity (see section 2.3 in de Haan and Ferreira, 2006), the left hand side of (4) equals

$$\frac{\bar F(yt)}{\bar F(t)}=y^{-1/\xi}\,\frac{\ell(yt)}{\ell(t)}=y^{-1/\xi}\left(1+\delta_t\left(y^{-\beta}-1\right)\right),\quad y>1.$$

This then leads to the extension of the Pareto distribution (EPD) to approximate the distribution of $X/t$ given $X>t$ as $t\to\infty$:

$$\bar H^{EP}_{\xi,\delta}(y):=y^{-1/\xi}\left(1+\delta_t\left(\left(y^{-1/\xi}\right)^{\beta\xi}-1\right)\right),\quad y>1, \qquad (6)$$

with $\delta_t$ satisfying $\delta_t\to0$ as $t\to\infty$. In cases where the second order model (5) holds, such a mixture model will improve the approximation of the POT distribution for values of $t$ which are smaller than the appropriate $t$-values when modelling the POTs using $\bar H^{P}_{\xi}$. So the extension can work when modelling large and moderate extremes. As a byproduct, it may at times even work for the full sample.
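The EPD survival function (6) is straightforward to evaluate. The following sketch checks that setting $\delta_t=0$ recovers the simple Pareto survival function; the parameter values are arbitrary illustrations.

```python
import numpy as np

def epd_survival(y, xi, delta, beta):
    """Survival function of the extended Pareto distribution (6):
    y^(-1/xi) * (1 + delta * ((y^(-1/xi))^(beta*xi) - 1)),
    which equals y^(-1/xi) * (1 + delta * (y^(-beta) - 1)).
    """
    u = y ** (-1.0 / xi)
    return u * (1.0 + delta * (u ** (beta * xi) - 1.0))

y = np.linspace(1.0, 50.0, 200)
# delta = 0 recovers the simple Pareto survival y^(-1/xi)
assert np.allclose(epd_survival(y, 0.5, 0.0, 1.0), y ** -2.0)
# at y = 1 the survival equals 1 for any delta
assert abs(epd_survival(1.0, 0.5, 0.3, 2.0) - 1.0) < 1e-12
```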
In Beirlant et al. (2009), using an external estimator of $\beta$, the parameters $(\xi,\delta)$ are estimated by fitting the EPD (slightly adapted) by maximum likelihood to the exceedances over a random threshold $X_{n-k,n}$, i.e. to $Y_{j,k}=X_{n-j+1,n}/X_{n-k,n}$, $j=1,\dots,k$. The result of this procedure is two-fold:

• First, the estimates of $\xi$ are more stable as a function of $k$ compared to the original ML estimator derived by Hill (1975),

$$H_{k,n}=\frac{1}{k}\sum_{j=1}^{k}\log\frac{X_{n-j+1,n}}{X_{n-k,n}},$$

which is obtained by fitting the Pareto distribution to the exceedances following (4). Indeed, the bias in the simple POT model (4) is estimated when fitting $\delta_t$, and it is shown that, under the assumption that the EP model for the exceedances is correct and that $\beta$ is estimated consistently, the asymptotic bias of the EPD-based estimator of $\xi$ is 0 as long as $\sqrt{k}\,\delta_{X_{n-k,n}}$ stays bounded as $k,n\to\infty$, while the asymptotic bias of $H_{k,n}$ is only 0 when $\sqrt{k}\,\delta_{X_{n-k,n}}\to0$.

• On the other hand, the asymptotic variance of the EPD-based estimator of $\xi$ equals $\left(\frac{1-\rho}{\rho}\right)^2\sigma_H^2$ with $\rho=-\beta\xi$, where $\sigma_H^2=\xi^2/k$ is the asymptotic variance of $H_{k,n}$.
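The Hill estimator $H_{k,n}$ appearing above is a one-line computation on the ordered sample. A minimal sketch follows; the simulated standard Pareto sample with $\xi=0.5$ is an arbitrary choice for illustration.

```python
import numpy as np

def hill(sample, k):
    """Hill (1975) estimator H_{k,n} = (1/k) * sum_j log(X_{n-j+1,n} / X_{n-k,n})."""
    x = np.sort(sample)
    return np.mean(np.log(x[-k:] / x[-k - 1]))

rng = np.random.default_rng(0)
# inverse-transform sample from a standard Pareto tail with xi = 0.5
data = (1.0 - rng.random(10000)) ** -0.5
print(round(hill(data, 200), 2))
```

For exactly Pareto data the estimator is unbiased, with standard deviation of order $\xi/\sqrt{k}$; bias only appears through the slowly varying part $\ell$.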

As an example, Figure 1 shows both the Hill estimates $H_{k,n}$ and the bias reduced estimates, obtained from maximum likelihood fitting of (6) using $\rho=-0.25$, $-0.5$ and $-1$, as a function of $k$ for a dataset of Belgian ultimate car insurance claims between 1995 and 2010, discussed in more detail in Albrecher et al. (2017). Note that the bias reduced estimates help to interpret the original Hill "horror" plot. Here, from the bias reduced estimator, a level around 0.5 becomes apparent at the larger values of $k$, and a lower value between 0.3 and 0.4 for smaller values of $k$. In fact, in insurance claim data, mixtures in the ultimate tail do appear quite often. Moreover, the EPD fit appears to extend quite well down to the lower threshold values, i.e. with $k$ up to 600 (but not when using almost all data). In this sense, classical first order extreme value modelling can in some cases be extended using mixture modelling in order to capture the characteristics of the bulk of the data.

Figure 1: Ultimates of Belgian car insurance claims: bias reduction of the Hill estimator (full line) using $\bar H^{EP}_{\xi,\delta}$ with $\rho=-0.25$ (dashed line), $\rho=-0.5$ (dotted line) and $\rho=-1$ (dash-dotted line).

Other bias reduction techniques in the Pareto-type case have been proposed, among others, in Feuerverger and Hall (1999), Gomes et al. (2000), Beirlant et al. (1999, 2002) and Gomes and Martins (2002). In Caeiro and Gomes (2011) methods are proposed to limit the variance of bias-reduced estimators to the level of the variance of the Hill estimator $H_{k,n}$. The price to pay is then to assume a third-order slow variation model specifying (5) even further. These methods focus on the distribution of the log-spacings of high order statistics. Other construction methods for asymptotically unbiased estimators of $\xi$ were introduced in Peng (1998) and Drees (1996).

In this paper we concentrate on bias reduction when using the GPD approximation to the distribution of POTs, on which the literature is quite limited. This allows us to extend bias reduction to the general case $\xi\in\mathbb{R}$. We apply the flexible semiparametric GP modelling introduced in Tencaliec et al. (2018) to the POT distributions. We also extend the second-order refined POT approach using $\bar H^{EP}_{\xi,\delta}$ from (6) to all max-domains of attraction. Here the corresponding basic second order regular variation theory can be found in Theorem 2.3.8 in de Haan and Ferreira (2006), stating that

$$\lim_{t\to x^+}\frac{P\left(\frac{X-t}{\sigma_t}>y\,\big|\,X>t\right)-(1+\xi y)^{-1/\xi}}{\delta(t)}=(1+\xi y)^{-1-1/\xi}\,\Psi_{\xi,\tilde\rho}\left((1+\xi y)^{1/\xi}\right), \qquad (7)$$

with $\delta(t)\to0$ as $t\to x^+$, and where $\Psi_{\xi,\tilde\rho}$, in the cases $\xi=0$ and/or $\xi+\tilde\rho=0$, is understood as the limit as $\xi\to0$ and/or $\xi+\tilde\rho\to0$. We further allow more flexible second-order models than the ones arising from second-order regular variation theory such as in (7), using non-parametric modelling of the second-order component. These new methods are also applied to the specific case of Pareto-type distributions.

In the next section we propose our transformed and extended GPD models, and detail the estimation methods. Some basic asymptotic results are provided in section 3. In the final section we discuss simulation results of the proposed methods and some practical case studies. We then also discuss the evaluation of the overall goodness-of-fit behaviour of the fitted models.

## 2 Transformed and extended GPD models

Recently, Naveau et al. (2016), generalizing Papastathopoulos and Tawn (2013), proposed full models for rainfall intensity data that are able to capture low, moderate and heavy rainfall intensities without a threshold selection procedure. These authors, considering only applications with a positive EVI however, propose to model all data jointly using transformation models with survival function

$$\bar F(x)=1-\bar G_0\left(H^{GP}_{\xi}\left(\tfrac{x}{\sigma}\right)\right)=:G_0\left(\bar H^{GP}_{\xi}\left(\tfrac{x}{\sigma}\right)\right), \qquad (8)$$

with $\sigma>0$, and $G_0$ and $\bar G_0$ distribution functions on $[0,1]$ linked by $G_0(u)=1-\bar G_0(1-u)$ ($u\in[0,1]$), satisfying constraints to preserve the classical tail GPD fit and a power behaviour for small rainfall intensities:

• $G_0(u)/u\to a$ as $u\downarrow0$, for some $a>0$,

• $1-G_0(1-u)\sim c\,u^{\kappa}$ as $u\downarrow0$, for some $\kappa>0$ and $c>0$.

In Naveau et al. (2016) the authors propose parametric examples for $G_0$. In Tencaliec et al. (2018) a non-parametric approach is taken using Bernstein polynomials of degree $m$ to approximate $G_0$, i.e. using mixtures of the beta densities

$$b_{j,m}(u)=\binom{m}{j}u^{j}(1-u)^{m-j},\quad u\in(0,1).$$
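By the binomial theorem these basis functions sum to one at every $u$, which is what makes a Bernstein approximation a proper mixture of them. A quick numerical check:

```python
import numpy as np
from math import comb

def bernstein_basis(u, m):
    """The m+1 Bernstein basis functions b_{j,m}(u) = C(m,j) u^j (1-u)^(m-j)."""
    u = np.asarray(u, dtype=float)
    return np.array([comb(m, j) * u**j * (1 - u)**(m - j) for j in range(m + 1)])

u = np.linspace(0.01, 0.99, 50)
B = bernstein_basis(u, 10)
# partition of unity: the basis sums to 1 at every u
assert np.allclose(B.sum(axis=0), 1.0)
```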

In Naveau et al. (2016) and Tencaliec et al. (2018) the primary goal is the search for a model fitting the whole outcome set, while the fit of the proposed model to POT values, for extrapolation purposes in order to estimate extreme quantiles and tail probabilities, is imposed using the condition $G_0(u)/u\to a$ as $u\downarrow0$. However, the bias and MSE properties of the resulting estimators of $\xi$ and $\sigma$ are still to be analyzed.

To encompass the above mentioned methods from Beirlant et al. (2009), Naveau et al. (2016) and Tencaliec et al. (2018), we propose to approximate the POT distribution of $Y=X-t$ given $X>t$ with a transformation model with right tail function of the type

$$(T):\quad \bar F^{T}_{t}(y)=G_t\left(\bar H^{GP}_{\xi}\left(\tfrac{y}{\sigma}\right)\right),$$

where $G_t(u)/u\to1$ for all $u$ as $t\to x^+$. Note here that, for $Y=X-t$ given $X>t$ and exact equality in $(T)$,

$$G_t(u)=P\left(\bar H^{GP}_{\xi}\left(\tfrac{Y}{\sigma}\right)\le u\right). \qquad (9)$$

We also consider a submodel of $(T)$, approximating the POT distribution with an extended GPD model

$$(E):\quad \bar F^{E}_{t}(y)=\bar H^{GP}_{\xi}\left(\tfrac{y}{\sigma}\right)\left\{1+\delta_t\,B_\eta\left(\bar H^{GP}_{\xi}\left(\tfrac{y}{\sigma}\right)\right)\right\},$$

where

• $\delta_t\to0$ as $t\to x^+$,

• $B_\eta(1)=0$ and $B_\eta$ is bounded on $[0,1]$,

• $B_\eta$ is twice continuously differentiable.

Here the parameter $\eta$ represents a second order (nuisance) parameter. For negative $\delta_t$-values one needs restrictions on $\delta_t$ to obtain a valid distribution. At $t=0$ the function $u\{1+\delta_t B_\eta(u)\}$ then corresponds to $G_0$ in (8), while $\delta_t\to0$ as $t\to x^+$ leads to the GPD survival function at large thresholds.
Note that model $(E)$ is a direct generalization of the EPD model (6), replacing the Pareto distribution by the GPD and considering a general function $B_\eta$ rather than $u\mapsto u^{\beta\xi}-1$.

Now several possibilities for bias reduction appear:

1. Estimation under the transformed model $(T)$. Modelling the distribution of the excesses with model $(T)$ and estimating $\xi$, $\sigma$ and $G_t$ for every threshold, we propose to use the algorithm from Tencaliec et al. (2018) for every $t$ or $k$. This approach is further denoted with $(T)$.
Here we apply Bernstein approximation and estimation of $G_t$, which is the distribution function of $\bar H^{GP}_{\xi}(Y/\sigma)$. The Bernstein approximation of order $m$ of a continuous distribution function $G$ on $[0,1]$ is given by

$$G^{(m)}(u)=\sum_{j=0}^{m}G\left(\tfrac{j}{m}\right)\binom{m}{j}u^{j}(1-u)^{m-j},\quad u\in[0,1].$$

As in Babu et al. (2002), one then replaces the unknown distribution function $G$ itself with the empirical distribution function $\hat G_n$ of the available data in order to obtain a smooth estimator of $G$:

$$\hat G^{(m)}_{n}(u)=\sum_{j=0}^{m}\hat G_{n}\left(\tfrac{j}{m}\right)\binom{m}{j}u^{j}(1-u)^{m-j}.$$
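A minimal sketch of this plug-in smoothing step follows, assuming the data have already been mapped to $[0,1]$; here a simulated uniform sample stands in for the transformed observations.

```python
import numpy as np
from math import comb

def bernstein_edf(data, m):
    """Smooth estimator G_hat^(m)(u) = sum_j Gn(j/m) C(m,j) u^j (1-u)^(m-j),
    with Gn the empirical distribution function of data on [0, 1]."""
    data = np.sort(np.asarray(data))
    # empirical df evaluated at the grid points j/m
    Gn = np.searchsorted(data, np.arange(m + 1) / m, side="right") / len(data)

    def G_m(u):
        u = np.asarray(u, dtype=float)
        return sum(Gn[j] * comb(m, j) * u**j * (1 - u)**(m - j)
                   for j in range(m + 1))

    return G_m

rng = np.random.default_rng(2)
G = bernstein_edf(rng.random(500), m=30)
# the smoothed df respects the endpoints of [0, 1]
assert abs(G(0.0)) < 1e-12 and abs(G(1.0) - 1.0) < 1e-12
```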

In the present application, data from $G_t$ are only available after imputing values for $(\xi,\sigma)$. This then leads to the iterative algorithm from Tencaliec et al. (2018), which is applied to every threshold $t=X_{n-k,n}$, or every number $k$ of top data. We here detail the algorithm for the excesses $Y_{j,k}=X_{n-j+1,n}-X_{n-k,n}$ ($j=1,\dots,k$), using the reparametrization $\tau=\xi/\sigma$:

Algorithm $(T)$

(i) Set starting values $(\hat\xi^{(0)}_k,\hat\tau^{(0)}_k)$. Here one can use the classical GPD ML estimates, obtained from $(T)$ with $G_t(u)=u$.

(ii) Iterate for $r=0,1,2,\dots$ until the difference in log-likelihood taken in $(\hat\xi^{(r+1)}_k,\hat\tau^{(r+1)}_k)$ and $(\hat\xi^{(r)}_k,\hat\tau^{(r)}_k)$ is smaller than a prescribed value:

(ii.a) Given $(\hat\xi^{(r)}_k,\hat\tau^{(r)}_k)$, construct rv's $\hat Z_{j,k}$, $j=1,\dots,k$.

(ii.b) Construct the Bernstein approximation based on the $\hat Z_{j,k}$:

$$\hat G^{(m)}_{k}(u)=\sum_{j=0}^{m}\hat G_{k}\left(\tfrac{j}{m}\right)\binom{m}{j}u^{j}(1-u)^{m-j},$$

with $\hat G_k$ the empirical distribution function of the values $\left(1+\hat\tau^{(r)}_k\hat Z_{j,k}\right)^{-1/\hat\xi^{(r)}_k}$.

(ii.c) Obtain new estimates $(\hat\xi^{(r+1)}_k,\hat\tau^{(r+1)}_k)$ with ML:

$$(\hat\xi^{(r+1)}_{k},\hat\tau^{(r+1)}_{k})=\arg\max\left\{\sum_{j=1}^{k}\log\hat g^{(m)}_{k}\left(\left(1+\tau\hat Z_{j,k}\right)^{-1/\xi}\right)+\sum_{j=1}^{k}\log\left\{\tfrac{\tau}{\xi}\left(1+\tau\hat Z_{j,k}\right)^{-1-1/\xi}\right\}\right\},$$

with $\hat g^{(m)}_{k}$ denoting the derivative of $\hat G^{(m)}_{k}$.

The final estimates of $\xi$ and $\tau$ are denoted here by $\hat\xi^{T}_{k}$ and $\hat\tau^{T}_{k}$. As noted in Tencaliec et al. (2018), the theoretical study of these estimates is difficult. In the simulation study the finite sample characteristics of these estimators are given using a fixed value of the Bernstein order $m$, chosen in order to stabilize the plots of the estimates of $\xi$ as much as possible. Note that this estimation procedure is computationally demanding.

Estimates of small tail probabilities can be obtained through

$$\hat P^{T}_{k}(X>c)=\frac{k}{n}\,\hat G^{T}_{k}\left(\bar H_{\hat\xi^{T}_{k}}\left(\frac{\hat\tau^{T}_{k}}{\hat\xi^{T}_{k}}\left(c-X_{n-k,n}\right)\right)\right).$$

Finally, bias reduced estimators of extreme quantiles corresponding to small tail probabilities $p$ are obtained by setting the above formulas equal to $p$ and solving for $c$.
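Solving $\hat P(X>c)=p$ for $c$ is a one-dimensional root-finding problem. As a sketch, we invert a simple Weissman-type Pareto tail estimate, used here only as a stand-in for the bias-reduced formulas above; the numbers are arbitrary.

```python
import numpy as np
from scipy.optimize import brentq

def tail_prob(c, k, n, x_threshold, xi):
    """Weissman-type tail estimate (k/n) * (c / X_{n-k,n})^(-1/xi),
    standing in for the bias-reduced P_hat(X > c) formulas."""
    return (k / n) * (c / x_threshold) ** (-1.0 / xi)

def extreme_quantile(p, k, n, x_threshold, xi):
    """Solve tail_prob(c) = p for c by bracketing and root finding."""
    f = lambda c: tail_prob(c, k, n, x_threshold, xi) - p
    return brentq(f, x_threshold, 1e12 * x_threshold)

q = extreme_quantile(1e-4, k=200, n=10000, x_threshold=7.0, xi=0.5)
# consistency check: plugging the quantile back returns the target probability
assert abs(tail_prob(q, 200, 10000, 7.0, 0.5) - 1e-4) < 1e-10
```

The same bracketing works for the $(T)$ and $(E)$ formulas because the fitted tail probability is monotone in $c$.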

2. Estimation under the extended model $(E)$. Modelling the distribution of the exceedances with model $(E)$ leads to maximum likelihood estimators based on the excesses $Y_{j,k}=X_{n-j+1,n}-X_{n-k,n}$:

$$(\hat\xi^{E}_{k},\hat\tau^{E}_{k},\hat\delta_{k})=\arg\max\left\{\sum_{j=1}^{k}\log\left(1+\delta_k\,b_\eta\left(\left(1+\tau Y_{j,k}\right)^{-1/\xi}\right)\right)+\sum_{j=1}^{k}\log\left\{\tfrac{\tau}{\xi}\left(1+\tau Y_{j,k}\right)^{-1-1/\xi}\right\}\right\}, \qquad (10)$$

with $\tau=\xi/\sigma$, for a given choice of $b_\eta$.
Estimates of small tail probabilities are then obtained through

$$\hat P^{E}_{k}(X>c)=\frac{k}{n}\,\bar H^{GP}_{\hat\xi^{E}_{k}}\left(\frac{\hat\tau^{E}_{k}}{\hat\xi^{E}_{k}}\left(c-X_{n-k,n}\right)\right)\left(1+\hat\delta^{E}_{k}\,\hat B_{\eta}\left(\bar H^{GP}_{\hat\xi^{E}_{k}}\left(\frac{\hat\tau^{E}_{k}}{\hat\xi^{E}_{k}}\left(c-X_{n-k,n}\right)\right)\right)\right).$$
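The optimization in (10) can be carried out with a generic optimizer. Below is a hedged sketch in which the bias function $b_\eta(u)=u-1/2$ is an illustrative mean-zero choice, not one of the paper's, and the toy excesses are exactly GP distributed.

```python
import numpy as np
from scipy.optimize import minimize

def b_eta(u):
    # illustrative mean-zero bias function (an assumption, not the paper's choice)
    return u - 0.5

def neg_loglik(params, Y):
    """Negative of the objective in (10), with tau = xi / sigma."""
    xi, tau, delta = params
    if xi <= 0.0 or tau <= 0.0:
        return np.inf
    u = (1.0 + tau * Y) ** (-1.0 / xi)       # fitted GP survival at each excess
    mix = 1.0 + delta * b_eta(u)
    if np.any(mix <= 0.0):                   # keep the mixture factor valid
        return np.inf
    gp_logdens = np.log(tau / xi) + (-1.0 - 1.0 / xi) * np.log1p(tau * Y)
    return -np.sum(np.log(mix) + gp_logdens)

rng = np.random.default_rng(3)
Y = rng.pareto(2.0, size=1000)               # Lomax excesses: exact GP, xi = 0.5, tau = 1
res = minimize(neg_loglik, x0=[0.5, 1.0, 0.0], args=(Y,), method="Nelder-Mead")
xi_hat, tau_hat, delta_hat = res.x
print(round(xi_hat, 2), round(tau_hat, 2), round(delta_hat, 2))
```

Note that the objective is only bounded in $\delta_k$ if $b_\eta$ changes sign on $(0,1)$, which the parametric and Bernstein choices discussed below are designed to accommodate.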

As in Naveau et al. (2016), respectively Tencaliec et al. (2018), two approaches can be taken towards the bias function $B_\eta$: a parametric approach, respectively a non-parametric approach.

• In the parametric approach, the second-order result (7) leads to a parametric choice of $b_\eta$ in case $\xi\neq0$, $\tilde\rho\neq0$ and $\xi+\tilde\rho\neq0$.
This allows for bias reduced estimation of $\xi$ under the assumption that the corresponding second-order model (7) is correct for the POTs. Note that it generalizes the approach taken in Beirlant et al. (2009) to all max-domains of attraction. When model $(E)$ is used as a model for all observations, i.e. taking $t=0$, this model directly encompasses the models from Frigessi et al. (2002) and Naveau et al. (2016).
Here

$$b_\eta(u)=u^{-\tilde\rho}\left(\frac{1-\tilde\rho}{\tilde\rho\,(\xi+\tilde\rho)}\right)+u^{\xi}\left(\frac{1-\xi}{\xi\,(\xi+\tilde\rho)}\right)-\frac{1}{\xi\tilde\rho},$$

in which a classical estimator of the second-order parameter $\tilde\rho$, or an appropriate fixed value, is used to substitute $\tilde\rho$. One can also choose a value of $\tilde\rho$ minimizing the variance in the plot of the resulting estimates of $\xi$ as a function of $k$.

• Alternatively, a non-parametric approach can be performed using the Bernstein polynomial algorithm from Tencaliec et al. (2018). In fact, in practice a particular distribution probably follows laws of nature, environment or business, and does not have to follow the second-order regular variation assumptions as in (7). Moreover, in the case of a real-valued EVI, the function $b_\eta$ can take different mathematical forms depending on $\xi$ and $\xi+\tilde\rho$ being close to 0 or not.
Here a Bernstein type approximation for $B_\eta$ is obtained from $\hat G^{(m)}_{k}$, computed through algorithm $(T)$ and the reparametrization $G_t(u)=u\{1+\delta_t B_\eta(u)\}$, with an appropriate value of the number of top data used. The function $b_\eta$ in (10) is then substituted by this approximation.

The methods described above can of course be rewritten for the specific case of Pareto-type distributions, where the distributions of the POTs $X/t$ given $X>t$ are approximated by transformed Pareto distributions. The models are then rephrased as

$$(T^{+}):\quad \bar F^{T^{+}}_{t}(y)=G_t\left(\bar H^{P}_{\xi}(y)\right),$$

where, for $Y=X/t$ given $X>t$,

$$G_t(u)=P\left(\bar H^{P}_{\xi}(Y)\le u\right),$$

and

$$(E^{+}):\quad \bar F^{E^{+}}_{t}(y)=\bar H^{P}_{\xi}(y)\left\{1+\delta_t\,B_\eta\left(\bar H^{P}_{\xi}(y)\right)\right\}.$$
The above algorithms, now based on the exceedances $Y_{j,k}=X_{n-j+1,n}/X_{n-k,n}$ ($j=1,\dots,k$), are then adapted as follows.
In algorithm $(T)$, step (ii.c) is replaced by

$$\hat\xi^{(r+1)}_{k}=\arg\max\left\{\sum_{j=1}^{k}\log\hat g^{(m)}_{k}\left(\left(\hat Z_{j,k}\right)^{-1/\xi}\right)+\sum_{j=1}^{k}\log\left\{\tfrac{1}{\xi}\left(\hat Z_{j,k}\right)^{-1-1/\xi}\right\}\right\},$$

with the $\hat Z_{j,k}$ now constructed from the exceedances $Y_{j,k}$. The resulting estimates are denoted with $\hat\xi^{T^{+}}_{k}$ and $\hat G^{T^{+}}_{k}$.
In approach $(E^{+})$ the likelihood solutions are given by

$$(\hat\xi^{E^{+}}_{k},\hat\delta^{E^{+}}_{k})=\arg\max\left\{\sum_{j=1}^{k}\log\left(1+\delta_k\,b_\eta\left(Y_{j,k}^{-1/\xi}\right)\right)+\sum_{j=1}^{k}\log\left\{\tfrac{1}{\xi}\,Y_{j,k}^{-1-1/\xi}\right\}\right\}. \qquad (11)$$

Note that the $(E^{+})$ approach using the parametric version of $b_\eta$ for a particular fixed $\tilde\rho$ equals the EPD method from Beirlant et al. (2009), while the non-parametric version is new.

Estimators of tail probabilities are then given by

$$\hat P^{T^{+}}_{k}(X>c)=\frac{k}{n}\,\hat G^{T^{+}}_{k}\left(\bar H_{\hat\theta^{T^{+}}_{k}}\left(c/X_{n-k,n}\right)\right),$$

respectively

$$\hat P^{E^{+}}_{k}(X>c)=\frac{k}{n}\left(\frac{c}{X_{n-k,n}}\right)^{-1/\hat\xi^{E^{+}}_{k}}\left(1+\hat\delta^{E^{+}}_{k}\,b_\eta\left(\left(\frac{c}{X_{n-k,n}}\right)^{-1/\hat\xi^{E^{+}}_{k}}\right)\right).$$
## 3 Basic asymptotics under model (E)

We discuss here in detail the asymptotic properties of the maximum likelihood estimators solving (10) and (11). To this end, as in Beirlant et al. (2009), we develop the likelihood equations up to linear terms in $\delta_k$, since $\delta_k\to0$ with decreasing value of $k/n$. Below we set $\theta=(\xi,\tau)$ when using extended GPD modelling, while $\theta=\xi$ when using extended Pareto modelling under $\xi>0$.

Extended Pareto POT modelling. The likelihood problem (11) was already considered in Beirlant et al. (2009) in case of parametric modelling for $b_\eta$. We here propose a more general treatment. The limit statements in the derivation can be obtained using the methods from Beirlant et al. (2009). The likelihood equations following from (11) are given by

$$\begin{cases}\dfrac{\partial}{\partial\xi}\ell=-\dfrac{k}{\xi}+\dfrac{1}{\xi^{2}}\displaystyle\sum_{j=1}^{k}\log Y_{j,k}+\dfrac{\delta_k}{\xi^{2}}\displaystyle\sum_{j=1}^{k}\dfrac{b'_\eta\left(\bar H_{\theta}(Y_{j,k})\right)\bar H_{\theta}(Y_{j,k})\log Y_{j,k}}{1+\delta_k\,b_\eta\left(\bar H_{\theta}(Y_{j,k})\right)},\\[3mm]\dfrac{\partial}{\partial\delta_k}\ell=\displaystyle\sum_{j=1}^{k}b_\eta\left(\bar H_{\theta}(Y_{j,k})\right)-\delta_k\displaystyle\sum_{j=1}^{k}b^{2}_\eta\left(\bar H_{\theta}(Y_{j,k})\right).\end{cases} \qquad (12)$$

Extended Generalized Pareto POT modelling. The likelihood equations following from (10), up to linear terms in $\delta_k$, are now given by

$$\begin{cases}\dfrac{\partial}{\partial\xi}\ell=-\dfrac{k}{\xi}+\dfrac{1}{\xi^{2}}\displaystyle\sum_{j=1}^{k}\log(1+\tau Y_{j,k})+\dfrac{\delta_k}{\xi^{2}}\displaystyle\sum_{j=1}^{k}b'_\eta\left(\bar H_{\theta}(Y_{j,k})\right)\bar H_{\theta}(Y_{j,k})\log(1+\tau Y_{j,k}),\\[3mm]\dfrac{\partial}{\partial\tau}\ell=\dfrac{k}{\xi\tau}\left\{-1+(1+\xi)\dfrac{1}{k}\displaystyle\sum_{j=1}^{k}\dfrac{1}{1+\tau Y_{j,k}}-\dfrac{\delta_k}{k}\displaystyle\sum_{j=1}^{k}b'_\eta\left(\bar H_{\theta}(Y_{j,k})\right)(\tau Y_{j,k})\,(1+\tau Y_{j,k})^{-1-1/\xi}\right\},\\[3mm]\dfrac{\partial}{\partial\delta_k}\ell=\displaystyle\sum_{j=1}^{k}b_\eta\left(\bar H_{\theta}(Y_{j,k})\right)-\delta_k\displaystyle\sum_{j=1}^{k}b^{2}_\eta\left(\bar H_{\theta}(Y_{j,k})\right),\end{cases}$$

from which

$$\begin{cases}\hat\delta_k=\dfrac{\sum_{j=1}^{k}b_\eta\left(\bar H_{\hat\theta_k}(Y_{j,k})\right)}{\sum_{j=1}^{k}b^{2}_\eta\left(\bar H_{\hat\theta_k}(Y_{j,k})\right)},\\[3mm]\dfrac{1}{k}\displaystyle\sum_{j=1}^{k}\log(1+\hat\tau_k Y_{j,k})=\hat\xi_k-\dfrac{\hat\delta_k}{k}\displaystyle\sum_{j=1}^{k}b'_\eta\left(\bar H_{\hat\theta_k}(Y_{j,k})\right)\bar H_{\hat\theta_k}(Y_{j,k})\log(1+\hat\tau_k Y_{j,k}),\\[3mm]\dfrac{1}{k}\displaystyle\sum_{j=1}^{k}\dfrac{1}{1+\hat\tau_k Y_{j,k}}=\dfrac{1}{1+\hat\xi_k}+\dfrac{\hat\delta_k}{1+\hat\xi_k}\left\{\dfrac{1}{k}\displaystyle\sum_{j=1}^{k}b'_\eta\left(\bar H_{\hat\theta_k}(Y_{j,k})\right)\bar H_{\hat\theta_k}(Y_{j,k})-\dfrac{1}{k}\displaystyle\sum_{j=1}^{k}b'_\eta\left(\bar H_{\hat\theta_k}(Y_{j,k})\right)\bar H_{\hat\theta_k}(Y_{j,k})\dfrac{1}{1+\hat\tau_k Y_{j,k}}\right\}.\end{cases} \qquad (13)$$
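The first equation of (13) gives $\hat\delta_k$ in closed form once $\hat\theta_k$ is available. A sketch with exactly GP-distributed excesses and an illustrative mean-zero bias function, both assumptions for the demonstration, in which case $\hat\delta_k$ should come out close to 0:

```python
import numpy as np

def delta_hat(Y, xi, tau, b):
    """First equation of (13): delta_hat = sum b(H_bar) / sum b(H_bar)^2,
    with H_bar the fitted GP survival evaluated at the excesses."""
    u = (1.0 + tau * Y) ** (-1.0 / xi)
    bu = b(u)
    return bu.sum() / (bu ** 2).sum()

rng = np.random.default_rng(4)
Y = rng.pareto(2.0, size=2000)       # Lomax excesses: exact GP with xi = 0.5, tau = 1
b = lambda u: u - 0.5                # illustrative mean-zero bias function
d = delta_hat(Y, 0.5, 1.0, b)
print(round(d, 3))
```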

Under the extended model we now state the asymptotic distribution of the estimators $\hat\xi^{E}_{k}$ and $\hat\tau^{E}_{k}$. To this end let $Q$ denote the quantile function of $F$, and let $U(v)=Q(1-1/v)$ denote the corresponding tail quantile function. Model assumption $(E)$ can be rephrased in terms of $U$:

$$(\tilde E):\quad \frac{\dfrac{U(vx)-U(v)}{\sigma_{U(v)}}-h_\xi(x)}{\delta(U(v))}\ \xrightarrow[v\to\infty]{}\ x^{\xi}B_\eta(1/x),$$

where $h_\xi(x)=(x^{\xi}-1)/\xi$ and $\delta$ is regularly varying at infinity with negative index. Moreover, in the mathematical derivations one needs the extra condition that for every $\epsilon,\nu>0$, and $v$ and $vx$ sufficiently large,

$$(\tilde E_2):\quad \left|\frac{\dfrac{U(vx)-U(v)}{\sigma_{U(v)}}-h_\xi(x)}{\delta(U(v))}-x^{\xi}B_\eta(1/x)\right|\le\epsilon\,x^{\xi}\left|B_\eta(1/x)\right|\max\{x^{\nu},x^{-\nu}\}.$$

Similarly, $(E^{+})$ is rewritten as

$$(\tilde E^{+}):\quad \frac{\dfrac{U(vx)}{U(v)}-x^{\xi}}{\xi\,\delta(U(v))}\ \xrightarrow[v\to\infty]{}\ x^{\xi}B_\eta(1/x).$$

The analogue of $(\tilde E_2)$ in this specific case is given by

$$(\tilde E^{+}_{2}):\quad \left|\frac{\dfrac{U(vx)}{U(v)}-x^{\xi}}{\xi\,\delta(U(v))}-x^{\xi}B_\eta(1/x)\right|\le\epsilon\,x^{\xi}\left|B_\eta(1/x)\right|\max\{x^{\nu},x^{-\nu}\},$$

with $\delta$ regularly varying at infinity with negative index.
Finally, in the expression of the asymptotic variances we use

$$Eb^{2}_{\eta}=\int_{0}^{1}b^{2}_{\eta}(u)\,du,\qquad EB_{\eta}=\int_{0}^{1}B_{\eta}(u)\,du,\qquad EC_{\eta}=\int_{0}^{1}u^{\xi}B_{\eta}(u)\,du.$$
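The constants $Eb^{2}_{\eta}$, $EB_{\eta}$ and $EC_{\eta}$ can be computed by numerical quadrature for any candidate bias function. The sketch below uses the illustrative choice $b_\eta=B_\eta=u-1/2$, an assumption rather than one of the paper's choices, and assembles the asymptotic variance appearing in part 1 of Theorem 1 below.

```python
from scipy.integrate import quad

# illustrative choice (an assumption): b_eta = B_eta = u - 1/2
b = lambda u: u - 0.5
xi = 0.5

Eb2, _ = quad(lambda u: b(u) ** 2, 0.0, 1.0)        # int_0^1 b_eta^2
EB, _ = quad(b, 0.0, 1.0)                           # int_0^1 B_eta
EC, _ = quad(lambda u: u ** xi * b(u), 0.0, 1.0)    # int_0^1 u^xi B_eta
# asymptotic variance of sqrt(k)(xi_hat - xi) in part 1 of the theorem
avar = xi ** 2 * Eb2 / (Eb2 - EB ** 2)
print(round(Eb2, 4), round(EB, 4), round(EC, 4), round(avar, 4))
```

For this mean-zero choice $EB_{\eta}=0$, so the variance reduces to $\xi^{2}$, the Hill-type baseline; other choices inflate it.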

The proof of the next theorem is outlined in the Appendix. It allows one to construct confidence intervals for the estimators of $\xi$ obtained under the extended models.

Theorem 1. Let $k=k_n$ be a sequence such that $k\to\infty$ and $k/n\to0$, and such that $\sqrt{k}\,\delta(U(n/k))\to\lambda\in\mathbb{R}$. Moreover assume that in (10) and (11), $\eta$ is substituted by a consistent estimator as $k,n\to\infty$. Then

1. when $\xi>0$, under $(\tilde E^{+})$ and $(\tilde E^{+}_{2})$,

$$\sqrt{k}\left(\hat\xi^{E^{+}}_{k}-\xi\right)\ \xrightarrow{d}\ N\left(0,\ \frac{\xi^{2}\,Eb^{2}_{\eta}}{Eb^{2}_{\eta}-(EB_{\eta})^{2}}\right),$$

2. under $(\tilde E)$ and $(\tilde E_{2})$,

$$\left(\sqrt{k}\left(\hat\xi^{E}_{k}-\xi\right),\ \sqrt{k}\left(\frac{\hat\tau^{E}_{k}}{\tau}-1\right)\right)\ \xrightarrow{d}\ N_2(0,\Sigma),$$

$$\Sigma=\frac{\xi^{2}}{D}\begin{pmatrix}\dfrac{1}{(1+\xi)^{2}(1+2\xi)}-\dfrac{(EC_{\eta})^{2}}{Eb^{2}_{\eta}} & \dfrac{1}{\xi(1+\xi)^{3}}-\dfrac{EB_{\eta}\,EC_{\eta}}{\xi(1+\xi)\,Eb^{2}_{\eta}}\\[3mm]\dfrac{1}{\xi(1+\xi)^{3}}-\dfrac{EB_{\eta}\,EC_{\eta}}{\xi(1+\xi)\,Eb^{2}_{\eta}} & \dfrac{1}{\xi^{2}(1+\xi)^{2}}\left(1-\dfrac{(EB_{\eta})^{2}}{Eb^{2}_{\eta}}\right)\end{pmatrix},$$

where

$$D=\left(\frac{1}{(1+\xi)^{2}(1+2\xi)}-\frac{(EC_{\eta})^{2}}{Eb^{2}_{\eta}}\right)\left(1-\frac{(EB_{\eta})^{2}}{Eb^{2}_{\eta}}\right)