Tilting maximum Lq-Likelihood estimation for extreme values drawing on block maxima

One of the most common anticipated difficulties in applying mainstream maximum likelihood inference upon extreme values is articulated on the scarcity of extreme observations for bringing the extreme value theorem to hold across a series of maxima. This paper introduces a new variant of the Lq-likelihood method through its linkage with a particular deformed logarithm which preserves the self-dual property of the standard logarithm. Since the focus is on relatively small samples consisting of those maximum values within each sub-sampled block (by splitting the sample into blocks of equal length), the maximum Lq estimation will favour reducing uncertainty associated with the variance leaving the bias unchallenged. A comprehensive simulation study demonstrates that the introduction of a more sophisticated treatment of maximum likelihood improves the estimation of extreme characteristics, with significant implications for return-level estimation which is a crucial component in risk assessment for many operational settings prone to extreme hazards, such as earthquakes, floods or epidemics. We provide an illustrative example of how the proposed tilting of Lq-likelihood can improve inference on extreme events by drawing on public health data.

Authors

• 1 publication
• 4 publications
08/14/2020

Uniqueness and global optimality of the maximum likelihood estimator for the generalized extreme value distribution

The three-parameter generalized extreme value distribution arises from c...
03/03/2021

Estimation of Dirichlet distribution parameters with bias-reducing adjusted score functions

The Dirichlet distribution, also known as multivariate beta, is the most...
12/29/2017

Finite-sample risk bounds for maximum likelihood estimation with arbitrary penalties

The MDL two-part coding index of resolvability provides a finite-sampl...
07/04/2019

Bayesian analysis of extreme values in economic indexes and climate data: Simulation and application

Mixed modeling of extreme values and random effects is relatively unexpl...
01/10/2019

Inference for extreme values under threshold-based stopping rules

There is a propensity for an extreme value analyses to be conducted as a...
04/03/2020

A sequential design for extreme quantiles estimation under binary sampling

We propose a sequential design method aiming at the estimation of an ext...
09/23/2021

Semiparametric bivariate extreme-value copulas

Extreme-value copulas arise as the limiting dependence structure of comp...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Maximum likelihood methods constitute the usual approach within parametric inference and still play a leading role in the development of new estimators. Their main advantage stems from their desirable properties of consistency and asymptotic normality, which enables the exploitation of their asymptotic efficiency thoroughly, as long as the natural regularity conditions for maximum likelihood are met. Drawing on these asymptotic properties, confidence intervals and hypothesis testing procedures can be devised in a straightforward manner. The Maximum Lq-likelihood (MLq) method was introduced by

Ferrari and Yang (2010) and independently by Hasegawa and Arita (2009), thus enriching the class of maximum likelihood estimators. MLq methodology has been claimed a good alternative to the standard maximum likelihood estimation in connection with small up to moderate sample sizes. The general asymptotic theory pertaining to MLq is well established in Ferrari and Yang (2010). Applications of this theory to deal with extreme value distributions can be found in Ferrari and Paterlini (2009); Huang et al. (2013). A recent uptake of the MLq in other applied areas crops up, for example, in the works by Qin and Priebe (2013) and Wu et al. (2017).

The main purpose of this paper is to discuss the maximum Lq-likelihood method in the context of extreme value analysis of a series of annual maxima and alike. The study of extreme and rare events, that can dominate the risk because their impact is so high, is essential in many applied contexts. For instance, the year 2018 is set to be among the hottest in the UK since records began. Residents blasted their air conditioners so much, they caused power shortages. Other examples in which extreme events may have dire consequences are the concentration of pollutants, sea-level rise, very heavy rainfall, lifespan, and closing values of stock indices. Extreme value theory provides a rigorous and prolific framework for studying rare events with severe impact. The basic assumption is that the observations are independent and identically distributed. Over the last decade there has been an astounding growth in the statistical models and techniques to analyse extreme values. The usual setting regards the Generalised Extreme Value (GEV) distribution as the appropriate building block for parametric inference. The dual problem postulates the Generalised Pareto distribution (GPD) as the probabilistic instrument fitting those observations above a given high threshold.

The present paper addresses MLq estimators for the parameters and return levels ascribed to a limiting GEV distribution within the extreme value domains of attraction framework. It outlines a semi-parametric approach to this inference problem based on the block maxima (BM) method, meaning that we will take observations in blocks of equal size (supposedly large ) and apply the MLq estimation procedure under the assumption that the maximum in each block (e.g. within a year of i.i.d. observations) follows approximately a GEV distribution, with distribution function (d.f.) depending on the shape parameter :

 Gγ(x)=exp{−(1+γx)−1/γ},1+γx>0. (1.1)

Since the seminal works by Gumbel (Gumbel, 1958), aiming at hydrological applications, the BM method has been one of the most enduring truisms in statistics for extreme values. But the BM has gained new impetus recently, particularly in connection with time series analysis (see e.g. Jaruškova and Hanek (2006); Caires (2009); Ferreira and de Haan (2015); Bücher and Segers (2017, 2018)). The paper by Ferreira and de Haan (2015)

lists a number of relative merits of the BM method and sets up the appropriate theoretical framework for semi-parametric inference drawing on block maxima. To the best of our knowledge, there is yet no actual combined use of the MLq methodology and the BM in a semi-parametric framework whilst working on max-domains of attraction. This paper aims to bridge this gap and exploit the MLq estimation procedure given the interest is in the block maxima, an approach widely employed even in the situation where the original sample is wholly available and not only the annual largest values. We will tackle the estimation of the extreme value index and how the real constants intervening in the linear normalisation of the partial maxima of i.i.d. random variables

, are absorbed into location and scale parameters as we move towards the limiting GEV distribution with tending to infinity. We will analyse the extent to which the required linear normalisation affects the estimation of return levels. This work significantly amplifies the developments in Ferrari and Paterlini (2009), in the sense that we will be tackling the case of a general extreme value distribution rather than restricting ourselves to the heavy-tailed case of . Moreover, the work by Ferrari and Paterlini (2009) focuses solely on the limiting GEV (or the dual GP) parametric distribution itself with unknown shape, location and scale parameters. Our aim is to delve into domains of attraction of the GEV distribution with enough detail to form the basis to a more efficient method for estimating critical extreme value indices (such as return levels), particularly when facing small samples. In this respect, we will formulate the extreme value condition for block maxima and its second order refinement in such a way as to enable a better understanding to be gained from the emergent connections with the well-known peaks over threshold (POT) method. Alongside the latter, our simulation study will complement the findings in Ferreira and de Haan (2015) as we aim to strike the proper balance between the number of block maxima and the distortion parameter

being used to deform the logarithm. If the blocks are lengthy then estimation upon a few block maxima leads to a low precision as the estimators tend to have large variance in this case. In contrast, small block sizes render a large number of block maxima, instilling bias and leading to low accuracy in the estimation. This is an equivalent problem to the usual bias/variance trade-off in statistics: with small block sizes, the fit to the limiting GEV is likely to perform poorly, and the bias tends to override the extrapolation to more extreme measurements, whereas larger blocks yield fewer maxima and spark the variance onto increased levels of uncertainty. By quantifying the amount of uncertainty intrinsic to the GEV fit, the Lq-likelihood down-weighs the least informative and hence most uncertain observed maxima as these deviate from the GEV limiting model. The evaluation of the relative performance of the MLq estimator and its tilted version introduced in this paper will be based on a comprehensive finite-sample simulation study, conducted for a wide array of different distributions belonging to different domains of attraction. Then, semi-parametric estimators for return levels attached to a minute probability, including right endpoint estimators, will also be developed and exploited.

The remainder of the paper is organised as follows. In Section 2, we introduce some notation and preliminaries on the MLq estimation method. Section 3 expounds the max-domains of attraction framework, where we provide a BM characterisation which aligns with the POT domains characterisation via the underpinning theory of regular variation (see Appendix B of de Haan and Ferreira (2006) for regular variation). In Section 4, we introduce the two MLq estimators devised to meet the above described aims. Section 5 contains a simulation study for performance evaluation of the novel MLq variant proposed in this paper. Section 6 illustrates the connection between the MLq estimation culture and the mainstream ML estimation for extremes by drawing on BM collected from historical data records of Influenza epidemics. Finally, in Section 7, we articulate the meaningful advances of MLq estimation, in the way of a take home message for practitioners with interest in extreme value statistics, which we advocate as the most appropriate branch of statistical methodology to be implemented for risk assessment and forecasting. The theoretical result for aligning BM and POT domains is deferred to the Appendix.

2 Maximum Lq-likelihood: notation and preliminaries

Let be a distribution function (d.f.) underlying a population with a random feature quantified in the random variable . Assume is a sequence of i.i.d. copies of with the same d.f. . Here and throughout this paper, we assume continuous with either finite or infinite right endpoint . Let denote the family of distributions given by

with probability density function (p.d.f.)

. We are going to consider estimators that maximises a certain function of the parameters of interest, say , where x is a

-vector of sample realisations and

is a -vector of parameters. The function is the so-called criterion function. Note that this setting includes but is not limited to maximum-likelihood-like estimators. Now, let

denote the true value of the parameter being estimated. The Kullback-Leibler (K-L) divergence is a measure of closeness of a probability distribution in

to the target distribution , that is

 K(θ)=K(fθ,fθ0):=−∫Rfθ0(x)logfθ(x)fθ0(x)dx=Eθ0[logfθ0(X)fθ(X)].

In parametric statistical inference, the maximum likelihood estimator of can be viewed as the minimiser of the empirical counterpart of the K-L divergence between the true and the model-fit distributions (see e.g. Ekström (2008)). Under usual regularity conditions in the maximum likelihood sense, the MLE estimator of is expected to converge to a point of maximum of . The maximum Lq-likelihood method, introduced by Ferrari and Yang (2010) and independently by Hasegawa and Arita (2009), embeds a deformed version of the classical logarithm – the well-known Tsallis logarithm – whereby the maximum Lq-likelihood estimator of (again, real or vector-valued parameter) maximises the Lq-likelihood

 Lq(θ;x)=n∑i=1ℓq(fθ(xi)) (2.1)

with , for . Here, is interpreted in the continuity sense as the mainstream log-likelihood criterion . The criterion function is akin to the Box-Cox transformation in statistics, where the parameter gauges the degree of distortion in the underlying density. If , then the Lq-likelihood assigns more weight to data points with a high likelihood and puts less weight on those with a low likelihood; if , then the standard MLE is recovered. Maximising the L-likelihood mirrors the K-L setting in the way that it arises from the empirical minimisation of the Tsallis divergence between and (as pointed out in Hasegawa and Arita (2009)). However, the main impact of modifying the criterion function is that it precludes the holistic view of likelihood in the way of a function that is proportional to the density, i.e. as some function that can be factored into the product , with not depending on the unknown parameter . Therefore, when working out the ordinary maximum likelihood estimator we can sift out the kernel component of the density, and proceed with focus on it whilst discarding all multiplicative factors that do not contain the parameter

being estimated. This remarkable feature of mainstream maximum likelihood alleviates the computational effort involved in the estimation of parameters for many group families of distributions. Not for nothing, the log-transform group does benefit significantly from this appealing aspect in standard likelihood maximisation. The log-transform is widely used to unify distributions with varying degrees of tail heaviness. We see this feature in the Pareto stemming from the exponential distribution, and more predominantly in applications relating the lognormal distribution with the normal distribution

(see e.g. Lehmann and Romano, 1998, page 486). The log-transform group families of distributions can have their parameters estimated explicitly at a stroke just by taking the corresponding log-transformed observations on to the mainstream ML procedure. Such an advantage does not take place with deformed versions of the logarithm embedded in maximum -likelihood methodology. An example follows in this respect. Suppose the simple case of the Exponential distribution given by the c.d.f. , , , and their relationship with the Pareto distribution with c.d.f. given by These two cases of simple inference lend themselves well to the purpose of illustrating the role of the distortion parameter in MLq concomitant with the view of the L-likelihood as a weighted likelihood.

Example 2.1

Suppose a random sample from the Exponential distribution with p.d.f. If then the standard MLE is given by However, for , the -likelihood score function becomes

 n∑i=1∂∂θℓq(f(yi;θ))=n∑i=1(1−θyi)exp{−(1−q)θyi},

meaning that there is no closed-form expression for the MLq estimator, but after some rearrangement, we can recognise the MLq estimator for the mean value in the form of the weighted average , with the weights , for all , .

Furthermore, suppose a random sample from the Pareto distribution with p.d.f given by . Note that . Whence, prompts the Hill estimator (Hill, 1975) for the Pareto’s shape parameter , notably However, if , with positive, then the -likelihood score function is

 n∑i=1∂∂θℓq(f(xi;θ))=n∑i=1[(1−q)θ−qx−(1−q)(θ+1)i−θ1−qx−(1−q)(θ+1)ilogxi],

for all , , with corresponding the -likelihood equation given by With , the MLq estimator rephrases as the weighted average

 ~γ=1~θ=n∑i=1ωilogXin∑i=1ωi,

but the weights are not directly transferable from the exponential case, i.e., . Again, there is no explicit expression for this estimator.

Whilst mainstream ML estimators can accommodate log-transformations in the data, thus preserving their functional form, it is evident from Example 2.1 that the analogue property does not hold in the broader sense for MLq methods. Even within the exponential family of distributions, MLq estimators offer some resistance to the log-transformation in the data. Clearly, maximising with respect to involves weighing the tail with and extra factor , which will smooth out for near 1, i.e. as the MLq approaches the mainstream ML procedure. This implies that, in MLq estimation, we are not allowed to toggle back and forth between the estimation of scale and shape parameters directly but rather we need to adjust the weights accordingly if using this device. Naturally, these difficulties tend to disappear as tends to 1. Furthermore, likelihood maximisation does not lead to closed-form estimators but rather to estimates over-reliant on numerical optimisation techniques, with potential to lead to convergence issues more often than desirable. This can pose a substantial difficulty within the BM framework. For example, the prototypical extreme value distribution for maxima of heavy-tailed distributions is the Fréchet distribution with shape parameter which then reduces to the light-tailed Gumbel with scale if a log-transform is applied. But the mainstream ML method does not provide an explicit estimator for the scale in the Gumbel distribution to begin with. Hence, the general MLq is not disadvantaged since numerical maximisation techniques are needed for all , including . The key insight to the MLq estimation is that the -likelihood does not behave as the standard -likelihood, in that the sequence of solutions do not zero in the mean value of the score function under , i.e. the equation

 Eθ0[∂∂θℓq(fθ(X))]=Eθ0[∂∂θℓ1(fθ(X))(fθ(X))1−q]=0,

holds true if . We note that stands for the vector of partial derivatives with respect to of a scalar function taking on a vector variable. In Ferrari and Yang (2010), the designated target has been coined “surrogate” parameter of . Therefore, MLq estimators are inherently biased estimators, but this difficulty tends to disappear when tends to 1. On the face of the discussion above, why should one favour any MLq estimator over the standard and fully-fledged unbiased and asymptotically normal ML estimators for BM? We defer the reader to the works by Dombry (2015); Dombry and Ferreira (2017); Bücher and Segers (2017) for a comprehensive account of ML estimation in the BM setting. Here, we emphasise that a compelling argument for adopting the more sophisticated MLq estimation lies in its verified efficiency when drawing inference on small samples. MLq estimators successfully aim to trade an increase in bias for a decrease in variance resulting in an overall decrease in mean squared error. Although MLq estimators tend to lack a closed-form expression, their asymptotic variance can quite often be presented explicitly (cf. Wu et al., 2017

, in connection with gamma distributions)

. In this paper, we tilt the MLq method at the edge of the sample (i.e. for extreme values) by shifting the attention to a new type of deformed logarithm within the spirit of minimising the empirical K-L divergence between the true distribution and its extreme value fit. To this end, we pick up the more niche (than the Tsallis) deformed logarithm, introduced in Trivellato (2013), and here expressed in terms of the distortion parameter :

 ℓNq(x)=x1−q−x−(1−q)(1−q)(x1−q+x−(1−q)). (2.2)

The novelty in this paper stems from conjoining the BM method and the Lq-likelihood principle endowed with the deformed logarithm (2.2).

3 Extreme values framework

Because there is no essential difference in maximisation and minimisation, we assume the EVT holds upon the maximum of the random sample , for a sufficiently large sample size . We denote the sample maximum by , that is , and we shall always be concerned with sample maxima. Corresponding results for minima are readily accessible by using the device . The celebrated Fisher and Tippet theorem (Fisher and Tippett, 1928) or the Extreme Value theorem (EVT), with prominent unifying contributions by Gnedenko (1943) and de Haan (1970), establishes the GEV distribution as the class of limiting distributions for the linearly normalised partial maxima . More concretely, if there exist real constants , such that

 limn→∞P(Xn,n−bnan≤x)=limn→∞Fn(anx+bn)=G(x), (3.1)

for every continuity point of , then given by (1.1). We then say that is in the max-domain of attraction of , for some extreme value index [notation: ]. For , the right-hand side is interpreted by continuity as . With some effort, we can replace with running through the reals and consider normalising functions and , where denotes the integer part of . By taking the logarithm in both sides of the extreme value condition (3.1) followed by Taylor’s expansion we have that

 (3.2)

for those such that . As a precursor to the statistical approach for BM that follows, we are going to formulate the above in terms of inverse (quantile) functions. Let be the tail quantile function defined by the generalised inverse of , i.e. for . We put so that . Inverting the limit in (3.2) with some rebranding (Theorem 1.1.8 of de Haan and Ferreira (2006) enables the latter), we get the well-known condition of extended regular variation (Bingham et al., 1987; de Haan, 1970; de Haan and Ferreira, 2006) that

 limt→∞U(tx)−U(t)a(t)=G←γ(e−1/x)=xγ−1γ, (3.3)

for all [notation: ]. The limit in (3.3) coincides with the -function of the GPD, with distribution function , which resonates the popular statistical culture for drawing inference on the excesses above a high threshold ascribed to the POT method. The theory of regular variation provides necessary and sufficient conditions for . In particular if and only if there exists a positive measurable function such that the pertaining tail quantile function . In fact, the extreme value condition (3.3) on the tail quantile function is the usual assumption in semi-parametric inference for extreme outcomes. However, we will not pursue this direction any further. Instead, we will use the equivalent extreme value condition provided in Ferreira and de Haan (2015) for dealing with block length and/or block number as opposed to the number of upper order statistics above a sufficiently high (random) threshold. To this effect, we define the random sample consisting of i.i.d. block maxima as

 Mi=max(i−1)m

The above states that we are dividing the whole sample of size into blocks of equal length (time) . For the EVT to hold within each block, the block length must be sufficiently large, i.e. one needs to impose tending to infinity. Now, let be the left generalised inverse of , i.e. . In other words, , for . Again, we attempt inversion of (3.2) to meet our purpose of grasping a max-domain of attraction characterisation. Similarly as before, we put in such a way to get to

 limt→∞V(tx)−b(t)a(t)=G←γ(e−1/x)=xγ−1γ,

for all . In contrast to the previous case of associating relation (3.1) with (3.3), there is now an asymptotically negligible factor creeping in when substituting with . At this point, we refrain from delving into the details on how links with . The bias stemming from absorbing (or ) into the location parameter of the GEV limit distribution (see (3.1)) is somewhat difficult to control, but we will have a closer look at this later on in the simulation study which comprises section 5. For now, we highlight that the BM construction in (3.4) suggests the approximate equality

 P(Mi>x)≈1−Gγ(x−b(m)a(m)),

provided sufficiently large (cf. Eq.3.1). Hence, can be eventually regarded as the return level with an average recurrence interval (interval between successive exceedances) of . For a more detailed discussion on comparative approaches in univariate extreme values we refer the reader to de Haan et al. (2015). Ferreira and de Haan (2015) establishes that it is possible to redefine appropriately so that , for some , if and only if

 limm→∞V(mx)−V(m)a(m)=xγ−1γ, (3.5)

for . The theoretical development for working out the order of convergence in (3.5) and (3.3) in tandem is deferred to Proposition A.1 in the appendix (Appendix A).

The extreme value condition (3.5) (i.e. ) is the main condition in the paper as it provides the max-domain of attraction characterisation. This is essentially what distinguishes the proposed approach from mainstream parametric ones: the aim is to show numerically that the MLq estimator introduced in this paper (which will be devised upon (4.2)) is not only valid when the observations come from the exact limiting extreme value distribution but also under the more realistic assumption that the observations come from a distribution belonging to some extreme domain of attraction attached to the extreme value index . The extreme value index (EVI) is often regarded as a gauge of tail heaviness: if , then we are in the presence of a heavy-tailed distribution with polynomially decaying tail. All distribution functions belonging to the max-domain of attraction with negative are light-tailed with finite right endpoint. The intermediate case is of particular interest in many applied sciences where extremes are relevant, not only because of the simplicity of inference within the Gumbel domain but also for the great variety of distributions possessing an exponentially decaying tail whether having infinite or finite right endpoint.

The key insight to statistics of extremes is to position ourselves at the edge of the sampled observations so as to enable extrapolation beyond the sample range as ascertained by the extreme value theorem (3.1). This aspect is here accounted for in the consideration of a minute probability , depending on the block size in such a way that , as , ties with the return level through the extreme value condition (3.5). Setting , we obtain the approximate relation:

 xm=F←(1−pm)≈V(m)+a(m)(−mlog(1−pm))−γ−1γ. (3.6)

Furthermore, by letting in case , a class of estimators arises for the right endpoint via the approximation , as , and by noticing that . In this sequence, the existing finite right endpoint can be viewed as the ultimate return level. When estimating extreme characteristics of this sort, we are required to replace all the unknowns in (3.6) by their empirical analogues, yielding the estimators

 ^xm:=^V(m)+^a(m)(−mlog(1−pm))−^γ−1^γ, (3.7)

and

 ^xF:=^V(m)−^a(m)^γ, (3.8)

respectively. The quantities , and stand for appropriate consistent estimators for the scale and location functions and , and EVI .

In the heavy tailed case of , the extreme value condition (3.5) simplifies to the regular variation of at infinity, that is , for all , thus giving rise to the estimator for the return level associated with the return period :

 ^xHm:=^V(m)(−mlog(1−pm))−^γ. (3.9)

Expressions (3.7)–(3.9) highlight the distinctiveness of the semi-parametric approach, in the sense that and pertain to the true (unknown) distribution function underlying the sampled data, thus making any statistical inference procedure context-dependent but distribution-free, where bespoke estimation methods can be devised upon summary statistics in a close relationship with extreme value conditions akin to (3.5). In contrast, the parametric approach bears its significancy and adequacy on the GEV limiting distribution, with ensuing context-free but distribution-dependent estimators. The latter is the approach adopted in Ferrari and Paterlini (2009), for instance. This paper is concerned with the former.

4 Tilting the Lq-likelihood

Under a semi-parametric approach, maximum likelihood estimators for the vector-valued parameter are obtained by pretending (which is approximately true) that the random variables are independent and identically distributed as maxima of GEV distribution with d.f. given by

 Gθ(x)=exp{−(1+γx−μσ)−1/γ},

for those such that . The density of the parametric fit to the BM framework is the GEV density, which we denote by , may be differ slightly from the true unknown p.d.f. underlying the sampled data. Hence, the less stringent assumption we make in the semi-parametric approach is that the corresponding d.f. belongs to some max-domain of attraction of , provided a suitable linear normalisation with constants and (cf. (3.5) and explanatory text around this condition). We typically estimate these constants and via maximum likelihood, despite these being absorbed into the scale and location parameters of the parametric limiting distribution thus assumed fixed, eventually. As a result, BM-type estimators are not so accurate for small block sizes since these estimators must rely on blocks of reasonable length to fulfil the extreme value theorem.

There are two alternative criterion functions under comparison in this paper, each of which giving rise to a MLq estimator as in

 ˜θ:=argmaxθ∈Θk∑i=1ℓq(gθ(xi)).

For , the Tsallis and the more niche (cf. Eq.2.2) deformed logarithms thus lead to:

 ˜θT = argmaxθ∈Θk∑i=1(gθ(xi))1−q−11−q,q≥0, (4.1) ˜θN = argmaxθ∈Θk∑i=1(gθ(xi))1−q−(gθ(xi))−(1−q)(1−q)((gθ(xi))1−q+(gθ(xi))−(1−q)). (4.2)

The MLq estimation method picks up the standard maximum likelihood estimator (MLE) if one sets . This line of reasoning can be stretched on to a continuous path, that is, as tends to 1, the MLq estimator approaches the usual MLE. The common understanding is that values of closer to one are preferable when we have numerous maxima drawn from large blocks since this will give enough scope for the EVT to be accessible and applicable. In practice, we often encounter limited sample sizes in the sense that either a small number of extremes ( sample maxima) or blocks of insufficient length to contain even one extreme are available. In this situation, we cannot afford the advantages of the usual asymptotic theory for maximum likelihood, meaning that we cannot rely on the traditional Fisher information for the efficient assessment of probabilistic uncertainty in the form of confidence intervals. MLq estimators have been recognised as particularly useful to deal with small sample sizes, which is often the situation in the context of the analysis of extreme values due to the inherent scarcity of extreme events with catastrophic impact. Previous research by Ferrari and Yang (2010) and Ferrari and Paterlini (2009) shows that the main contribution towards the relative decrease in the mean squared error stems from the variance reduction, which is the operative statement in small sample estimation. This is in contrast with the bias reduction often sought after in connection with large sample inference. Large enough samples tend to yield stable and smooth trajectories in the estimates-paths, allowing scope for bias to set in, and eventually reflecting the regularity conditions in the maximum likelihood sense. Once these asymptotic conditions are attained, the dominant component of the bias starts to emerge, and by then it can be efficiently removed. This is often implemented at the expense of an increased variance, for the usual bias/variance trade off in statistics seems never to offer anything worthwhile on one side without also providing a detriment to the other. In this paper, we are aiming to trade off bias for variance by sifting through the distortion line of (with fixed rather than depending on the block length ).

5 Simulation results

Before discussing the simulation results, we highlight that this section aims at extreme value estimation in the max-domain of attraction. This is tantamount to a semi-parametric approach within a distribution-free framework, where the exact fit to the GEV distribution still finds its way through the max-stability property as in there exist constants , such that . In order to draw comparison between MLq estimators upon either side of the limiting relation (3.1), we need to account for eventual divergences between the true d.f. underlying the data, and the target (model fit) GEV distribution prescribed in the limit. In this sequence, we generated i.i.d. samples from different parent distributions belonging to the same max-domain of attraction. More concretely, we have conducted a Monte Carlo simulation study with replicates consisting of independent samples of size , each of which was split into blocks of equal length (i.e. ). We have conducted a wider simulation study offline to which the GEV, Burr and Reversed Burr (RevBurr) distributions are taken here as key-examples. The background calculations for supporting the statements below regarding the Burr distribution are provided in Example A.5 in the Appendix. The RevBurr builds easily on the Burr distribution, therefore we omit further details than those given in the following:

• GEV, where we fix the extreme value index (EVI) at . The GEV distribution satisfies the extreme value condition (3.5) exactly rather than in the approximate limiting sense. The second order condition (A.3) does not hold and, consequently, we stipulate that (see also Remark A.4 for the dual formulation in terms of the GP distribution).

• Burr, with d.f. , , . According to Example A.5, satisfies the extreme value condition of second order (A.3) with and , whenever . We have set .

• RevBurr with right endpoint . A random variable is said to follow a Reversed Burr distribution with parameters if , with a Burr random variable. Proposition A.1 ascertains that (A.3) holds with and . Again, we set .

Two variants of the MLq estimation are applied, both the standard MLq based on Tsallis logarithm (defined in (4.1), notation: MLq), and the tilted self-dual MLq estimation (defined in (4.2), notation: MLqAlt). The idea is to tinker with the distortion parameter with values set in to turn the variance reduction into an effective gain in the mean squared error. To this end, we shall look at the empirical analogue of the ratio mean squared error (RMSE),

 RMSE(q)=MSE(~θMLq)MSE(^θMLE)=Var(~θ)+(~θ−θ)2Var(^θ)+(^θ−θ)2. (5.1)

Figure 1 displays estimated bias and ratio mean squared error regarding the estimation of the extreme value index when sampling from the GEV distribution with true . The symmetrical pattern in the estimates yield of MLqAlt stems from the equality involving the criterion function: , . The fact that we are simulating from the GEV distribution already hints at a potentially limited gain in using MLq rather than the mainstream ML estimation. Figure 1(a) clearly depicts the effect of the distortion on the bias: the standard MLE shows the smallest possible empirical bias when , and only by selecting can we achieve smaller bias in the MLq than in mainstream ML estimation. So, perhaps there is not enough reason to favour a more demanding estimation method which is likely to be mired in computational effort. Figure 1(b) evidences that, despite sampling from the ideal model (i.e., the prototypical GEV distribution) a more efficient estimation of the EVI can still trickle down from the MLq method, and more prominently from the MLqAlt. The RMSE stays belows below 1 whenever is set between 0.8 and 1 in whichever variant of MLq we choose to pursue.

The potential gain from using MLq becomes more apparent as we proceed to the estimation of a return level by substituting the triplet of estimates into (3.7). We now carry this performance evaluation of MLq estimators forward on to the return level estimation given in (3.7). In this sequence, we assign a recurrence probability , for example, set at . This way, is forced to be slightly smaller than , just enough to ensure actual extrapolation beyond the sample range. The plot in Figure 2 shows the RMSE (see (5.1)) for the return level (or high quantile) estimation. Although the decrease in the RMSE is not so pronounced for the tilted MLq [notation: MLqAlt] as that for the plain MLq, there is roughly a 20% efficiency gain in adopting a MLq-type estimator. The slow decay of the RMSE with respect to the MLqAlt as moves away from 1 is not necessarily a bad feature in the sense that it suggests that the MLqAlt method is more robust to any miss-specifications which could result in a poor choice of . In Figures 3 and 4 we repeat the same exercise for the Burr parent distribution. We note that we are now dealing with a heavy-tailed distribution belonging to the max-domain of attraction of the GEV distribution with . The second order parameter , which governs the speed of convergence to this extreme value limiting law, is in this case equal to (cf. Example A.5 in the Appendix). This value entails a moderate speed of convergence, a situation where, by weighting block maxima differently, the MLq estimation can be particularly useful in accounting for deviations between the true underlying distribution and the GEV model fit. Figure 3 pertains to the estimation of the EVI, where the true value was set at . We have similar findings as in Figure 1 regarding the GEV distribution with a near zero EVI of . In the interest of comparison, the right panel in Figure 4 shows a plot of the empirical RMSE built on replicates from the actual GEV. For the two models that Figure 4 encompasses, the MLq estimation seems to spark out of control after the RMSE has slumped to the lowest of low values, that is, when becomes smaller than 0.8, approximately. Furthermore, although the MLqAlt is not so sharp for values of within a close neighbourhood of 1, it seems to be able to curb the acclaimed efficiency of the mainstream MLE, particularly when the interest is to go on to the estimation of return levels.

Finally, as a representative for the max-domain of attraction with negative EVI, enclosing short tailed distributions with a finite upper bound , we have selected the Reversed Burr distribution with right endpoint . The simulation results are summarised in Figures 5 and 6. Some considerations are in order at this point. The aim is to proceed with the maximum likelihood estimation ascribed to the BM, that is, we are seeking to verify equations (4.1) and (4.2) as a result of deforming the customary maximum likelihood estimation which uses the standard log as criterion function evaluated at the GEV pdf. The caveat is that, in case of a negative shape parameter , the right endpoint is equal to (cf. Remark 1.1.5.c in de Haan and Ferreira (2006)) and therefore the usual regularity conditions in the maximum likelihood sense are not met. This partly explains why the mainstream MLE is surpassed by both MLq and MLqAlt in connection with . Notably, the new tilted MLq performs better in the estimation of a negative EVI; it is capable of reducing bias with the concomitant reduction in the MSE. Unfortunately, the good performance does not cross over to the estimation of the endpoint. In this case, the MLqAlt estimates stay very close to the MLE within the region , whereas the MLq estimator can perform marginally better for a limited range of -values greater than 1.

6 Case study

As an illustrative example, we are going to revisit the public health data set studied in Thomas et al. (2016), where the mainstream maximum likelihood has been applied albeit from a purely parametric point of view. The cumulative rate of Pneumonia and Influenza mortality (cP&I) is defined as the sum of weekly P&I mortality over eight consecutive weeks using a moving time window throughout the entire time series of historical data. The resulting cP&I mortality index is represented in Figure 7(a). The eight-week period fairly coincides with the length of a typical influenza epidemic. Following up on the BM construction summarised in (3.4), we consider the random sample as the vector storing maxima of cP&I observations within a respiratory year. The period from July to June is here identified as one block worth of observations, just enough to encompass annual influenza epidemics. Figure 7(b) displays the plot of the extracted series of maxima. There are block maxima realisations, recorded from July 1979 to June 2012. The highest maximum of 12 deaths per 100,000 was observed during the 1999-2000 respiratory year. Although the cP&I time series suggests the presence of a quadratic trend, this seems to be diluted at the higher levels of the process. We might still argue there is a decreasing trend over the later 7 years of maximal records, but we do not have enough observations to verify this claim from a statistical perspective, and therefore we proceed under the assumption that the block maxima are stationary.

The autocorrelation function plot on the right hand-side of Figure 8 seems to attest that it is reasonable to assume that the block maxima consist of independent realisations of . The right panel in Figure 8 encloses probability and quantile plots for checking the model’s fit to the target GEV distribution. Overall, there is a reasonable fit but the QQ-plot seems to flag up the larger sampled maxima as drifting away from the postulated GEV fit to the BM.

It is worthwhile emphasising that we do not often encounter the exact GEV distribution in the wild, so much so that there is a wealth of literature on semi-parametric inference for extreme values, notably the references Beirlant et al. (2004); de Haan and Ferreira (2006). We shall proceed under this semi-parametric setting, operating behind the veil of ignorance with regard to the true model generating the data. Hence, we shall follow a distribution-free and context-dependent approach to tilt the MLq procedure via the criterion function introduced in (2.2). The simulation study has shown that this alternative MLq method can lead to significant improvement in forecasting extreme risk, harnessing modelling issues often posed by those deviant observations from the target GEV fit. Furthermore, we will address how the maximum likelihood compares with the maximum product of spacings (MPS) estimator in this case study. The MPS estimator of maximises the product of spacings

 k+1∏i=1Di(θ)=k+1∏i=1{Gθ(xi,k)−Gθ(xi−1,k)},

with and , or equivalently the log-spacings

 LMPS(θ;x)=k+1∑i=1logDi(θ). (6.1)

The MPS method was introduced by Cheng and Amin (1983), and independently by Ranneby (1984). A generalisation of this method is proposed and studied in great depth by Ekström (2001). The MPS method was further exploited by Huang and Lin (2013) in estimating and testing for the only possible three types of extreme value distributions (Fréchet, Gumbel and Weibull), all unified in the GEV distribution. Following a similar line of reasoning as before with the Lq-likelihood, we are going to replace the standard logarithm in (6.1) by the Tsallis logarithm. Figure 9 displays the results for the estimation of the EVI as well as a plot for the corresponding return level estimates (see Eq.3.7). The horizontal grey line marks the recorded historical maximum of . Informed by the simulations, we find it appropriate to restrict to the interval , where all the estimators are seen to perform to their best. In finding a suitable to obtain a point estimate from, we should bear in mind the most distinctive traits of the sort of estimators reported in the simulation section 5. In particular, the MLq estimators are biased and tend to be unbiased as the distortion alleviates (i.e. approaches 1) up to the point where the standard MLE is attained (i.e. ). On the other hand, a substantial reduction in the variance can be achieved for values of away from 1, thus prompting a decrease in the mean squared error. The MLq method is more sensible to small deviations of from 1 than the new tilted MLqAlt. In the absence of an automatic procedure for selecting , we shall adopt a common technique in the extreme value analysis culture and pick up a suitable just by eye-balling throughout the plots, screening for a range of values of that can yield identical estimates. The left panel in Figure 9, concerning the EVI estimator, deems appropriate to select a value of between 0.74 and 0.82, since this is where all the estimates-yields of MLq, MLqAlt and MPS are in the closest vicinity of each other. Furthermore, it is within this range of values of that the smoother variant MLqAlt tops to its maximum surely by instilling some bias with the distortion less than 1, but also and most importantly by deploying a dip in the variance. When the distortion has induced all possible reduction in the variance, i.e. as approaches 0.5, the MLqAlt estimator starts to behave erratically yielding a jagged trajectory. This typified behaviour of MLqAlt comes across very clearly in the simulations presented in Section 5. Taking all into consideration, we find sensible to choose . Whilst taking on a value nearly at the crossing of MLq and MPS estimators and where the MLqAlt picks up a local maximum, this choice of will deliver reasonable estimates for moderately heavy-tails in connection with a EVI of magnitude around . Therefore, we evaluate the three estimators at and average them out to come up with a sensible estimate for the EVI , that is, .

The right panel in Figure 9 displays the associated return level estimates by setting . Note that , which means that we are aiming at extrapolating beyond the range of the sample of maxima. It is reasonable to conclude that the observed maximum of 12 deaths per 100,000 observed during the 1999-2000 respiratory year stands approximately on the 1 in every 1000 years event, i.e. an event of 11.5 deaths per 100,000 is characterised as an event which is exceed, on average, once in every 1000 respiratory years.

7 Conclusions

Few statistical procedures are more constantly studied or fervidly applied right now than maximum likelihood methods. Much of the attraction of maximum likelihood estimators and hypothesis testing is based on their remarkably neat properties for large sample sizes, in particular the asymptotic normality ensuing from the regularity conditions revolving around likelihood. The dominant idea in statistics culture is that small sample sizes can be tackled using bootstrap resampling techniques. MLq estimation, however, sprints up the variance to overtake bias in an effective minimisation of the mean squared error, which is a critical aspect in the bias-variance trade off when only small sample sizes are available. This brings efficient sample estimation within reach of maximum likelihood-type technique. We found compelling evidence in the numerical experiments to establish that extreme value maximum likelihood estimators are substantially improved through the MLq method. Moreover, tilting the MLq by adopting the novel criterion function introduced in this paper – a deformed logarithm having the self-dual property in common with the standard logarithm – leads to a more robust estimation method than the MLq in connection with Tsallis logarithm. This means that the now proposed tilted MLq seems to strike a good balance between robustness and efficiency. Secondly, we find significant empirical evidence to ascertain that it is more important to have lengthly blocks than a large number of blocks when using MLq estimation, regardless of the deformed the logarithm set in the criterion function. In essence, we do not need that many blocks, but an important requirement remains in that blocks need to be long enough for extreme value theorem to hold within each block. Even in the case of an exceptional good fit to the GEV distribution, the estimation of return levels can benefit from the use of the MLq, particularly in the new tilted variant being proposed in this paper. Figure 10 depicts the estimated RMSE defined in (5.1) when sampling from the Burr distribution. This distribution belongs to the max-domain of the GEV with . The convergence of this specified Burr distribution to the limiting GEV is fairly quick as identified by the value (cf. Section 5). Figure 10 shows that the both MLq and MLqAlt estimators (defined in (4.1) and (4.2)) yield sharper results for the estimated return level (3.7) (note that RMSE is consistently below 1) if the blocks are lengthier (i.e. with ) despite fewer maxima will be retained for inference (i.e. ).

Finally, for a small number of blocks, there is more flexibility in the choice of since for larger samples the mainstream MLE tends persist as a good competitor to the MLq. This paper deals with fixed , in the sense that the MLq estimators are regular M-estimators. We have experienced what is reported by Ferrari and Yang (2010) (cf. Remark (i) in p. 7) in that the numerical results do not look promising if one attempts the obvious bias correction by . The distortion parameter governs the sensitivity of the estimation to very extreme observations: the smaller the , the less able MLq is to capture extremes. In this sequence, Ferrari and Yang (2010) found that choices of such that lies between and tend to increase the relative performance of the MLq method over the mainstream ML. We anticipate that it is difficult to develop a unified analytical procedure for the choose of within the BM framework. But it can be worthwhile to select slightly less than one so as to enable the meaningful improvements we found to be gained from weighing extremes differently, and to attain some robustness against contamination from different sources of extremeness. The choice of in MLq and the suitable direction (greater or smaller than 1) is inherited by the particular extreme characteristic we are aiming to estimate. For example, the estimation of a high quantile can require a distinct choice of to that of the estimator for the extreme value index . Although this caveat dissipates with the new tilted MLq, the relevant statistical theory that could underpin a systematic choice of is yet to be developed for max-domains of attraction under the umbrella of regular variation theory. We envision a considerable effort will be put into achieving this aim before too long.

Appendix A Second Order Refinement of BM

Before getting underway with linking second order developments in POT and BM approaches, let us examine how the first order behaviour of the tail quantile function ties with the its dual BM quantile function . The extreme value condition (3.3) implies that

whence (cf. Eq.3.5) with auxiliary (positive) function , i.e.

 limt→∞V(tx)−V(t)˜a(t)=xγ−1γ,x>0, (A.1)

Moreover, since is of regular variation at infinity with index , i.e. , as , and this convergence is locally uniform, then

 limt→∞˜a(t)a(t)=limt→∞a(t1/t1−e−1/t)a(t)=(limt→∞1/t1−e−1/t)γ=1,

that is, the two auxiliary functions intervening in the regular variation statements associated with and are asymptotically equivalent.

We are now ready to exploit the link between a second order strengthening of (A.1) and that of (3.3).

Proposition A.1

Assume condition (3.3) (i.e. ) and that is of extended regular variation of second order, that is, there exists a positive or negative function with and a non-positive parameter , such that for ,

 limt→∞U(tx)−U(t)a(t)−xγ−1γA(t)=1ρ(xγ+ρ−1γ+ρ−xγ−1γ)=:Hγ,ρ(x). (A.2)

Define

 ˜A(t):=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩1−γ2t−1,γ≠1,ρ<−1,A(t)+1−γ2t−1,γ≠1,ρ=−1,A(t),ρ>−1 or (γ=1,ρ>−2),A(t)+112t−2,γ=1,ρ=−2,112t−2,γ=1,ρ<−2.

If is either a positive or negative function near infinity, then with

 ~a(t):=⎧⎪ ⎪⎨⎪ ⎪⎩a(t)(1+γ−12t−1),γ≠1,ρ≤−1,a(t),ρ>−1 or (γ=1,ρ>−2),a(t)(1−112t−2),γ=1,ρ≤−2,

the following second order condition holds

 limt→∞V(tx)−V(t)~a(t)−xγ−1γ˜A(t)=Hγ,~ρ(x), (A.3)

for all , where if , and if .

Remark A.2

For , or , the function is understood in the limiting sense. The second order parameter governs the speed of converge in (3.3). Analogously, is the second order parameter which regulates the convergence in (A.1).

Remark A.3

Although the second order condition (A.2) (resp. (A.3)) is a second-order refinement of (3.3) (resp. (A.1)), it remains quite a general condition, verified for all usual distributions belonging to some max-domain of attraction. The most straightforward expansion for can be given as follows (cf. Eq.9 in Neves, 2009): if the second-order relation (A.2) holds with an auxiliary function such that , then (A.2) is equivalent to

 U(t)=c0+c1tγ−1γ+c2tγ+ργ+ρ+o(tγ+ρ), (A.4)

as , with and such that . On this note, we highlight that those distributions satisfying (A.2) with and , for which asymptotically equal to (thus entailing ) are not fenced by Proposition A.1. For such distributions we find tending to faster than (thus implying ). This result ties with Corollary 4.1 in Drees et al. (2003) somewhat. The most salient example in this respect is the GEV distribution, of which the standard Fréchet distribution is a particular case yielding (and precluding the representation (A.4) since .

Remark A.4

There are distributions that do not satisfy the second order condition (A.2) but for which condition (A.3) holds. The GPD with tail quantile function is the prototypical example. In view of Remark 2.4.4 in de Haan and Ferreira (2006) and Table 3.2 therein, we assign to the second order speed of convergence of the GPD in (A.2). This concession enables a stretching of Proposition A.1 to every distribution exhibiting faster convergence than any negative power of in the tail (i.e. such that , for all ). By this token, if one recognises as the appropriate characterisation for the role of the GPD w.r.t. (A.2), then Proposition A.1 deems condition (A.3) verified with if and if . This conclusion for the GPD manifests itself in the proof of Proposition A.1 given below (see also Example A.7).

Proof: We observe that

 V(tx)−V(t)a(t)=V(tx)−U(t)a(t)+U(t)−V(t)a(t). (A.5)

By noticing that and are closely related via , the first term on the right hand-side rephrases as

 U(11−e−1tx)−U(t)a(t),

which, under the second order setting in this lemma, expands to

 (1/t1−e−1tx)γ−1γ+A(t)Hγ,ρ(1/t1−e−1tx)+o(A(t)),

as . Using Taylor’s expansion upon the above, we get for the first building block in (A.5), as ,

 V(tx)−U(t)a(t)=xγ−1γ+12txγ−1+3γ−124t2xγ−2+A(t)Hγ,ρ+o(1t2)+O(t−1A(t)). (A.6)

The analogous expansion for the second building block (i.e. the bias term) stems from (A.6) evaluated at , whereby the stated result follows from bringing back both these expansions into (A.5), together with a power series development of for :

 V(tx)−V(t)a(t)=(1+γ−12t+(3γ−1)(γ−2)24t2)xγ−1γ+1−γ2tHγ,−1(x)+A(t)Hγ,ρ(x)+(3γ−1)(2−γ)12t2Hγ,−2(x)+o(1t2)+O(t−1A(t)).

Finally, we provide four examples of application of Proposition A.1 alongside further details as to how the prominent GPD can, at a first glance, escape the grasp of this proposition.

Example A.5

Burr. This example develops along similar lines to the proof of Proposition A.1. The Burr distribution, with d.f. , , , provides a very flexible model which mirrors well the GEV behaviour in the limit of linearly normalised maxima, also allowing a wide scope for tweaking the order of convergence through changes in the parameter . The associated tail quantile function is . Upon Taylor’s expansion of , the extreme tail condition up to second order (see Eq.A.2) arises:

as . Whence, the second order condition on the tail given in (A.2) holds for and , , with

 a(t)=t1λτλτ(1−(1τ−1)t−1λ) and A(t)=1λ(1τ−1)t−1λ=(γ+ρ)tρ.

Proposition A.1 is clearly applicable and therefore the Burr distribution satisfies the extreme value condition of second order (A.3) with and if .

Example A.6

Cauchy. The relevant d.f. is , The corresponding tail quantile function is