1 Introduction
Prediction of extremely tail quantiles or equivalently, so called return levels, is an important problem in various applications of extreme value analysis including but not limited to meteorology, hydrology, climatology and finance. The main task of inference in many of these prediction problems involves the computation of probabilities of yet unobserved rare events that are not seen in the historical records. Specifically, the problem may involve the estimation of a high quantile, say at level
, based on a sample of size and, where is very close to 1 so that is small. Examples include the prediction of a quantile of precipitation which could be realized once in every 200 years based on 50 to 100 year rainfall data, or the prediction of high financial loss based on few years of data, just to name a few.Extreme value theory (EVT) provides tools to make prediction of extreme events, based on the assumption that, the underlying distribution of the normalized random variable resides in the domain of attraction of an extreme value distribution
(Fisher and Tippett, 1928). In that case, extreme quantiles can be predicted by estimating the parameters of the specific extreme value distribution and the normalizing constants. Detailed reviews on different models and methods of estimation in this context can be found in Embrechts et al. (1997), and Coles (2001), among others. Many distinct models and forecasts could be applicable for predicting an extreme quantile in a given situation, thereby calling for a comparative assessment of their predictive performance.To our knowledge, relatively little work has been done on evaluation of quantile prediction (point forecast) in the context of extreme events. Friederichs and Thorarinsdottir (2012) considers probabilistic forecast, and derived continuous rank probability score (CRPS) for two common extreme value distributions, to assess forecasts for predictive distribution of the peak wind prediction. This method also used in Bentzien and Friederichs (2014a) and Scheuerer and Moller (2015) for precipitation forecast ensembles through the assumption of a parametric form for the prediction distribution. Lerch et al. (2017) proposed a weighted CRPS for probabilistic forecast by imposing weights on tail of the distribution with emphasis on extreme events. In case of point prediction, specifically for evaluating quantile prediction, the common tool to use is the quantile score which is a strictly proper and consistent scoring rule for quantile functional (Gneiting and Raftery, 2007). However, for estimating a high target quantile, due to the sparseness of the data at extreme tails, it may very well be possible that, all the estimates fall outside the range of the observed data therefore leaving no effective information about the estimates to validate. This also implies that the quantile score is optimized trivially at the lowest estimate (see equation (2) of Section 2).
To address this problem, we propose to predict a set of equally extremal quantiles on different subsets of the data when comparing various quantile prediction methods. The quantile points are chosen such that the predicted quantiles are observed in the rest of the data (test sample). Each of the competing predictor is then used to estimate the series of quantiles on different parts of the data and validated in the rest of the data. This method then uses a combined quantile scoring function and cross validation to assess different competing extreme quantile predictors.
The outline of the paper is as follows. In Section 2, we present different methods of assessment of the extreme quantile predictors and also list few examples of the extreme quantile estimate. We present a simulation study in Section 3 to demonstrate the performance of the optimum scoring predictors under different data generating mechanisms. In Section 4, we apply the proposed methodology to two different datasets: the LANL netflow data and the precipitation data corresponding to one station of the Global Historical Climatology Network data. We conclude with some remarks in Section 5.
2 Methodology
Consider a random sample of size , , where is the sample space of the random variable . Assume that the sample is drawn from a continuous bivariate distribution having support . Let denote the th quantile of , where
is the cumulative distribution function.
Let be a set of competing predictors for the th quantile of with being very close to 1. We aim to evaluate the above competing predictors and make an informed choice amongst the members of .
Evaluation of point predictors is typically done by means of a scoring function that assigns a numerical score when the point forecast issued and the observation realized (Gneiting and Raftery, 2007). For evaluating a point predictor consistently, it is recommended to use a strictly proper and consistent scoring rule for the functional of interest (Gneiting, 2011). The quantile score is such a rule for assessing quantile predictors at given probability levels and used in various applications (Bentzien and Friederichs, 2014b). Given a set of competing predictors, the optimum quantile predictor can be defined as the minimizer of the total quantile score
(1) 
where is the quantile check function (Koenker, 1984).
Predicting quantiles from very extreme tails is a difficult task due to spareness of the data at the tails, specially for a heavy tailed distribution (Wang et al., 2012). This fact also makes it challenging to evaluate the predictions, as they may not be realized inside the range of the observed data. Further, if all the competing predictions are larger than the observed responses , the aggregate quantile score (1) reduces to
(2) 
with . Therefore, the equation (2) is always minimized at . In other words, the smallest quantile estimate is always the optimal estimate of (1). This may not be a reasonable answer when assessing prediction of very extreme quantiles. This shortcoming motivated us to consider an improved scoring method for assessing predictors for high quantiles. The developed method can be easily adopted to the extreme lower quantiles with close to 0.
2.1 Proposed scoring method
First approach: For predicting the th quantile with close to 1, let us consider predicting a new quantile at level (say), , based on a subsample (training sample) of size of the original sample of size . We then propose to choose and such that, estimating the quantile at probability level based on is equally extreme as of estimating based on the original sample of size . By equally extreme, we mean, the expected number of exceedances above the estimated quantile of levels and are the same in training and original sample, respectively. This condition implies
(3) 
To solve for the unknowns parameters and and to assess the above competing estimates, we consider an additional constraint: the expected number of observations exceeding th quantile in the rest of the sample (validation sample) is fixed, (say), where . Therefore, for a given , this condition is written as:
(4) 
Conditions (3) and (4) produce a unique solution for and as
(5)  
(6) 
Note that, represents the average number of observations in the test dataset (size ) exceeding th quantile of the distribution of . A smaller (or larger) value of leads to a trial quantile (probability level ) closer (or distant) to the target quantile (probability level ) and leaves less (more) observations for evaluation of the predictions. Therefore, the selection of an appropriate choice of is an crucial issue depending on how much importance is to be given to the estimation and validation of the extremes.
We consider subsets of the full dataset , with sizes , where and is the greatest integer less or equal to . The idea is that, at a time, the competing models (predictors) will be trained on one of the subsets and then tested on the remainder of the data, which, for fold , we denote as , . We use fold cross validation, such that, at a time one subsample of size is used as training sample and collection of the rest of the subsamples are used as the test sample. Thus each data point appears in the training sets exactly once but can be seen times in the test set. This is opposite to the conventional fold crossvalidation method where, a collection of subsets are used as training set and one fold is left for validation.
Consider distinct values of the tuning parameter as , where each of the choices satisfies the condition (4) and through (5) and (6) produces corresponding set of trial quantiles to be estimated based on the subsamples with sizes , respectively. For a competing predictor , we define a combined score for predicting of th quantile
(7) 
where, , and is the quantile check function at level as defined in (1
). This combined loss function is obtained by pooling information assessing the predictor at multiple and similar extreme trial quantile levels. The optimal score predictor then defined as the minimizer of the combined quantile score
(8) 
Note that, the optimum predictor is the minimiser of the quantity in the parenthesis of (7). This principle coincides with estimation of the quantile regression at a given level (Koenker, 2004). Here we would like to tackle the question of choosing such that the prediction can be validated with exceedances and also optimize the check function in (7) in the test sample.
The use of multiple quantile levels to estimate regression coefficients appeared in the quantile regression context (Koenker, 1984, 2004; Zou and Yuan, 2008) and in extreme quantile estimation (Wang et al., 2012; Velthoen et al., 2019). Here quantile levels are chosen such that the optimal asymptotic properties of the regression parameters can be achieved. Following this theory one can choose values of such that values of the coincide with the above quantile levels used in Wang et al. (2012) and Velthoen et al. (2019).
Second approach: Alternatively, for a given with a fold sample, consider training predictions using subsamples and test them on one subsample. For a given , let us consider predicting a new quantile at probability level based on a sample of size such that
(9)  
(10) 
where is the number of subsamples (folds) of size . Here we propose to train the estimates based on subsamples (the training sample size is ) and test on one subsample of size . In the whole process of the fold cross validation with the above scheme, each observation can be seen times in the training and once in the test part. One can thus expect to see observations exceeding the quantile estimate in the whole process. The optimal score predictor is defined as
2.2 Examples of extreme quantile predictor
The two commonly used models for predicting extreme quantiles are the Generalized Extreme Value Distribution (GEV) for block maxima and the Generalized Pareto Distribution (GPD) for peaksoverthresholds of the variable of interest.
GEV model: The GEV model, parametrized by the location parameter , scale parameter and shape parameter has cumulative distribution function of the form
(13) 
where for .
For a given level and dataset , the th quantile of can be estimated as
(14) 
where , and are estimates of , and .
GPD model: Let us assume that there exists a nondegenerate limiting distribution for appropriately linearly rescaled excesses of a sequence of independently and identically distributed observations above a threshold . Then under general regularity conditions, the limiting distribution will be a GPD as (Pickands, 1975). The GPD is parametrized by a shape parameter and a thresholddependent scale parameter , with cumulative distribution function ,
(15) 
and and .
The GPD model depends on a threshold
which needs to be chosen apriori. The choice of the threshold is a crucial issue, as too low of a threshold leads to bias from model misspecification, and too high of a threshold increases the variance of the estimators: a bias variance tradeoff. Many threshold selection procedures have been proposed in the literature, see
(Davison and Smith, 1990; Drees et al., 2000; Coles, 2001) including many others.Once an appropriate threshold is selected, the parameters can be estimated through maximum likelihood method and subsequently, the target high quantile estimate can be extracted by plugging in the estimates of , and in
(16) 
where , and are the maximum likelihood estimates based on the GPD assumption for the exceedances ((Coles, 2001), Section 4.3.3) . Also, can be approximated by the ratio where is the number of excesses of the threshold in the given sample of size .
3 Simulation Study
In this section, we carry out a simulation study to investigate the performance of the optimum score predictors and obtained through the proposed scoring methods and compare them with that of the conventional optimal quantile score predictor at high tails.
We simulate samples from the random variable from the following four sets of distribution with densities:

with three choices of the shape parameter : () = 0.5 , () =0, and () =0.5,

, with two different choices of the weight parameter : () = 0.5 and () =0.99,

,

,
where
denotes the probability density function (pdf) of the Uniform distribution (0, 10) and
denotes the pdf of the GPD with location , scale , and shape . The three different values of the shape parameter in model correspond to the three types of the extreme value families: (Weibull), (Gumbel) and (Frechet). We choose standard scale and location in each of the cases: to . The model, is a mixture with mixing probability such that it produces lower 100 percent samples from and upper 100(1) percent samples from the GPD. We choose and in model. The model is obtained by equally mixing () two GPDs with different shape parameters. Modelis the Gamma distribution with rate, scale and shape parameter taking values as 1, 1 and 0.1, respectively.
For each of the simulated datasets, we apply the conventional scoring and the proposed scoring methods and to evaluate and select the optimum score predictors of the quantile at the level ) based on a sample of size . We consider , this value of can be thought of as 50 years of daily observations (150 days each season) in the rainfall data analyzed in Section 4.2.
As a competing set of predictors for the target quantile at probability level , we consider the GPD model based predictors given in (16) estimated by maximum likelihood. For selecting the threshold in the GPD model, we consider the “rulesofthumb” approach (Ferreira et al., 2003), which uses a fixed upper fraction of data to fit the GPD. We consider two sets of thresholds. The first set is based on ten upper order statistics: 150, 125, 100, 75, 50, 40, 30, 20, 10 and 3, and we refer to this set as . Here the largest order statistics corresponds to the lowest GP threshold. We use the numbers starting from 1 to 10, to denote the predictors in serial order, such that, 1 and 10 corresponds to threshold based predictor with the highest and lowest value of the upper order statistics, respectively.
Note that, each of the members of the set uses an equal number of observations to be fitted to the GPD in the training and the original prediction. We consider another set of thresholds, determined by sample percentiles with probability levels: 0.98, 0.9833, 0.9867, 0.99, 0.993, 0.995, 0.996, 0.9973, 0.9987, and 0.9996, and refer to this set as . This implies that, for the set , the number of samples fitted to the GPD in original prediction is larger than those in the training stage. The above percentile levels are chosen such that, the number of exceedances in the original sample of size , are approximately the same for the sets and . For notational convenience, similar to the set , we denote the members of by the numbers starting from 11 to 20.
We select the tuning parameter of and using the same principle that is used in selecting intermediate quantile levels in extreme high quantile estimation (Wang et al., 2012; Velthoen et al., 2019). These probability levels are chosen such that they confirm fixed number of exceedances and also asymptotic convergence of the estimated quantiles are guaranteed. In their work, Wang et al. (2012) choose the set of intermediate quantile with probability levels where , indicates the integer part, and Velthoen et al. (2019) consider . We consider the largest value of as and use the set . It is found that, the use of additional values of doesnot change the performance of the proposed methods much.
We generate replicates each of size from the models –. For each of the simulated dataset, we use , and to find the optimum quantile score predictors from the competing predictorset for th quantile, and for each of the three methods, we compute the root mean squared error (RMSE) across iterations as
(17) 
where is the optimal prediction in the th iteration.
Table 1 shows the RMSE of the three optimum predictors , and for the sets of predictors , and union of the two sets, . The first proposed approach outperforms the conventional method , in terms of RMSE, in most of the cases. The only exception is model, where the optimum predictor produces more RMSE than . This data generating model consists of very few GPD components, specifically, for a given sample of size and for a given in (5), the proposed approach returns a training sample of size containing on average four GPD samples. This implies most of the competing predictors of are poorly fitted to the GPD in the training sample, thus commit high bias and high RMSE. An increase of the GPD component in (50 percent data from GPD) returns lower RMSE for compared to . It is also found that, the presence of only 5 percent GPD component in model (for and it provides atleast 20 GPD sample in the training data) enables the competing predictors to fit the GPD reasonably well and, in this case, outperforms the other two methods (results are not reported here).
Note that, for the conventional scoring method , the predictors are trained on the entire original sample of size 7500. Therefore, in the case of model, all the competing predictors are trained on at least 75 GPD sample points (1 percent of the total sample), thus are reasonably well fitted to the GPD, and outperforms (row 4 of table 1). It can be concluded that, is more efficient than only in the case when the data contains only few extreme components.
In contrast to , the second approach uses most of the data for training. For the model and for sample of size is 7500, it uses a training sample of size 6750 (for in (10)), which again consists of 67 GPD sample points. The presence of a sufficient number of GPD samples enables the predictors to be well fitted to the GPD and the optimum predictor returns less error than . Table 1 also shows that outperforms the classical optimum score predictor in all the cases. And in general is more efficient than . The superiority of over , in model, could be because of better estimation of the competing quantile predictors (based on larger training sample ) and at the same time having reasonable amount of exceedances in the cross validation (for example ).
Figure 1 shows the frequency distributions of the competing predictors set selected as optimum through , and when the data is generated from the pure GPD distribution model. It is found that, uniformly selects the predictors with slightly more preference to higher threshold based predictors. Whereas frequently selects smaller upper order statistics based thresholds, indicating preference to the predictors with a large number of observations to be fitted to the GPD. It is also found that, selects the lower thresholds of both and frequently and have same pattern over the two sets. It can be said that has less discriminate power compared to for distinguishing sets and . Note that, for the original sample of size , the training sample size for turns out to be 7000 giving a reasonably large sample to estimate all the predictors of . When the training sample size is reduced to 5000 or 2500, the method tends to select the predictor from the set (the results are omitted for brevity).
AB  A  B  

Model  
0.012  0.013  0.013  0.012  0.013  0.013  0.012  0.013  0.013  
1.222  1.049  1.189  1.218  1.049  1.182  1.219  1.105  1.179  
472.752  124.597  140.406  471.768  125.476  136.789  472.171  454.680  143.459  
364.772  92.306  106.079  361.164  92.269  99.362  364.840  314.388  105.599  
33.542  505.818  25.395  36.453  589.432  24.324  33.816  26.475  21.957  
365.025  99.671  100.051  361.381  100.007  97.724  364.963  317.206  108.985  
1.114  1.054  1.110  1.108  1.046  1.113  1.108  1.031  1.101 
We then compare the optimum predictors with a randomly chosen predictor and the median of the competing prediction set. As before, we use competing set for estimating quantile at probability level and use the tuning parameter . Table 2 shows the RMSE of the optimum score predictors , , , along with those of the random and the median prediction. As expected, the performance of the random predictor is worst compared to the other predictors. The predictor dominates the median prediction except model and .
Model  

0.012  0.013  0.013  0.013  0.013  
1.222  1.049  1.189  1.106  1.285  
472.752  124.597  140.406  142.581  282.812  
364.772  92.306  106.079  107.145  206.844  
33.542  505.818  25.395  20.588  358.044  
365.025  99.671  100.051  107.435  781.103  
1.114  1.054  1.120  0.982  1.049 
4 Assessment of quantile predictions at high tails: real data examples
In this section, we illustrate the proposed scoring methods with two different real data examples: cyber netflow bytes transfer and daily precipitation.
4.1 LANL cyber netflow data
The unified netflow and host event dataset, available from Los Alamos National Laboratory (LANL, https://csr.lanl.gov/data/2017.html), throughly described in Turcotte et al. (2018)
, is one of the most commonly used datasets in cyber security research for anomaly detection with enormous importance in industry and society
(Adams and Heard, 2016). The data comprise of records describing communication events between various devices connected to the LANL enterprise. The daily netflow data consisting of hundreds of thousands of records, is available for 90 days (day 2 to day 91, starting from a specific epoch time). Each of the flow records (Netflow V9) is an aggregate summary of a bidirectional network communication with the following components:
StartTime, EndTime, SrcIP, DstIP, Protocol, SrcPort, DstPort, SrcPackets, DstPackets, SrcBytes and DstBytes (Turcotte et al., 2017). In this work we analyze one important component of this data set.As a variable of interest, we consider total volume of bytes transfer (through an edge) which started in an epoch time. The data is available for every second of the day, so in a day we have =86400 observations. Figure 2 shows the histogram of the byte transfer, indicating the heavy tailed nature of the underlying distribution. We consider the target quantile as and use the competing set of 20 predictors considered in the previous section.
The optimum scoring predictors through methods , and turn out to be predictors with percentile based threshold with levels 0.9946667, 0.9867 and 0.9933. The optimal prediction of th quantile of total byte transfer using the above three methods are 218.4887 giga bytes, 157.08 giga bytes and 233.33 giga bytes, respectively.
4.2 GHCN daily precipitation data
Prediction of extreme high return values of daily precipitation is a common task in hydrological engineering where these values correspond to return periods of 100 or 1000 years based on relatively few years of data (Bader et al., 2018). Daily precipitation data is available from Global Historical Climatology Network (GHCN) and can be downloaded freely from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ for tens of thousands of surface weather sites around the world (Menne et al., 2012). To illustrate the proposed methodology, we consider a sample station chosen randomly from the set of stations in one of the US coastal state, California.
We consider precipitation data for the months, November to March for each year for the time period 1940 to 2015 (following Bader et al. (2018)) with the total number of daily data points as 3303 after excluding zeros and some missing observations. Figure 3
shows the histogram of the daily precipitation data, indicating the heavy tailed nature of the response variable. Here we set our target as to predict the
observation (150year) return level, implying . As a set of predictors, we consider 20 unconditional predictors considered in Section 3. As the values of we take the set of values (1, 2, 4, 8). The optimum predictors using the three methods , and are the 20th, 4th and 1st member of the competing set and the optimum predictions are 1847.21, 2227.97 and 2727.22 millimeter, respectively.5 Concluding remarks
We propose two distributionfree scoring methods for extreme high quantile predictors based on their predictive performances. The first proposed method evaluates the competing predictors by employing them to predict a series of equally extreme quantiles, at levels on different parts of the samples of sizes and evaluate their predictive performance in the other parts of the sample . These parameters are obtained by solving two equations for a given tuning parameter . The other method uses the collection of subsample to train the predictors and test them on one subsample. A numerical study shows the increased efficiency of the proposed methods compared to the conventional scoring method.
The use of covariate information to strengthen the scoring methods is a possible extension of this work. In future research one might investigate further how the tuning parameter can be modeled using the extreme value index of the underlying distribution.
Acknowledgment
The research of Kaushik Jana is supported by the Alan Turing Institute Lloyd’s Register Foundation Programme on DataCentric Engineering. The authors also like to thank Dr. Tobias Fissler of Imperial College for many helpful discussions.
References
 Dynamic networks and cybersecurity. edition, World Scientific (Europe), . External Links: Document, Link, https://www.worldscientific.com/doi/pdf/10.1142/q0022 Cited by: §4.1.
 Automated threshold selection for extreme value analysis via ordered goodnessoffit tests with adjustment for false discovery rate. Ann. Appl. Stat. 12 (1), pp. 310–329. External Links: Document, Link Cited by: §4.2, §4.2.
 Decomposition and graphical portrayal of the quantile score. Quarterly Journal of the Royal Meteorological Society 140 (683), pp. 1924–1934. Cited by: §1.
 Decomposition and graphical portrayal of the quantile score. Quarterly Journal of the Royal Meteorological Society 140 (683), pp. 1924–1934. Cited by: §2.
 An introduction to statistical modeling of extreme values. Springer, London. Cited by: §1, §2.2, §2.2.
 Models for exceedances over high thresholds. Journal of the Royal Statistical Society. Series B (Methodological) 52 (3), pp. 393–442. Cited by: §2.2.
 How to make a hill plot. Ann. Statist. 28 (1), pp. 254–274. External Links: Document, Link Cited by: §2.2.
 Modelling extremal events. Berlin: Springer. Cited by: §1.

On optimising the estimation of high quantiles of a probability distribution
. Statistics 37 (5), pp. 401–434. Cited by: §3.  Limiting forms of the frequency distribution of the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical Society 24 (2), pp. 180–190. External Links: Document Cited by: §1.
 Forecast verification for extreme value distributions with an application to probabilistic peak wind prediction. Environmetrics 23 (7), pp. 579–594. Cited by: §1.
 Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378. External Links: Document Cited by: §1, §2.
 Making and evaluating point forecasts. Journal of the American Statistical Association 106 (494), pp. 746–762. Cited by: §2.
 A note on lestimates for linear models. Statistics & Probability Letters 2 (6), pp. 323 – 325. Cited by: §2.1, §2.

Quantile regression for longitudinal data.
Journal of Multivariate Analysis
91 (1), pp. 74 – 89. Cited by: §2.1, §2.1.  Forecaster’s dilemma: extreme events and forecast evaluation. Statist. Sci. 32 (1), pp. 106–127. External Links: Document, Link Cited by: §1.
 An overview of the global historical climatology networkdaily database. Journal of Atmospheric and Oceanic Technology 29 (7), pp. 897–910. Cited by: §4.2.
 Statistical inference using extreme order statistics. Ann. Statist. 3 (1), pp. 119–131. External Links: Document, Link Cited by: §2.2.
 Probabilistic wind speed forecasting on a grid based on ensemble model output statistics. Ann. Appl. Stat. 9 (3), pp. 1328–1349. External Links: Document, Link Cited by: §1.
 Unified host and network data set. ArXiv 1708.07518v1 (), pp. . External Links: ISSN Cited by: §4.1.
 Unified host and network data set. In Data Science for CyberSecurity, pp. 1–22. External Links: Document, Link, https://www.worldscientific.com/doi/pdf/10.1142/9781786345646_001 Cited by: §4.1.
 Improving precipitation forecasts using extreme quantile regression. Extremes. Cited by: §2.1, §3.
 Estimation of high conditional quantiles for heavytailed distributions. Journal of the American Statistical Association 107 (500), pp. 1453–1464. Cited by: §2.1, §2, §3.
 Composite quantile regression and the oracle model selection theory. Ann. Statist. 36 (3), pp. 1108–1126. External Links: Document Cited by: §2.1.
Comments
There are no comments yet.