Bayesian regional flood frequency analysis for large catchments

by Thordis L. Thorarinsdottir et al.

Regional flood frequency analysis is commonly applied in situations where there exists insufficient data at a location for a reliable estimation of flood quantiles. We develop a Bayesian hierarchical modeling framework for a regional analysis of data from 203 large catchments in Norway with the generalized extreme value (GEV) distribution as the underlying model. Generalized linear models on the parameters of the GEV distribution are able to incorporate location-specific geographic and meteorological information and thereby accommodate these effects on the flood quantiles. A Bayesian model averaging component additionally assesses model uncertainty in the effect of the proposed covariates. The resulting regional model is seen to give substantially better predictive performance than the regional model currently used in Norway.





1 Introduction

Flood frequency analysis (FFA) is a statistical data-based approach to determine the magnitude of a flood event with a certain return period. If sufficient data are available at a single site, an extreme value distribution is usually fitted to the observed data (at-site FFA or local model). However, ungauged sites or sites suffering from incomplete data require the use of data from nearby or comparable gauged stations. We refer to this as regional flood frequency analysis (RFFA).

The motivation for this study is the need to update the current guidelines for estimation of the design flood in ungauged catchments in Norway (Castellarin et al., 2012; Midttømme et al., 2011). The current guidelines are based on recommendations that are more than 20 years old (Sælthun et al., 1997). With increased data availability (20 more years) and the development of new Bayesian inference methods, including for engineering applications (e.g. Ball et al., 2016), there is significant potential for improving the current RFFA methods.

The classical approach to RFFA is the index-flood method, see e.g. Dalrymple (1960) and Hosking and Wallis (1997). This method assumes that the flood frequency curves for all sites in a region follow the same distribution up to a scaling factor. The index-flood method hence consists of three independent steps: (1) identification of homogeneous regions or similar stations, (2) estimation of the index flood, and (3) derivation of the growth curve that gives factors for scaling the index flood to a suitable return level. If no appropriate data are available, the index flood is derived by regional regression analysis or based on nearby measurement stations (scaled with respect to catchment area), and a regional growth curve is applied. An overview of European procedures for RFFA is given in Castellarin et al. (2012). For step (1), typical approaches are to use fixed geographical regions, station similarity, or focused pooling (region of influence); in Austria, an interpolation approach is used. For step (2), linear regression, possibly combined with transformation of the independent and/or dependent variables, is used. For step (3), maximum likelihood, ordinary moment estimation or estimation based on L-moments are commonly used.
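To make the three steps concrete, the following toy calculation scales an index flood by a regional growth curve; the index flood and growth factors are purely hypothetical illustration values, not estimates from any Norwegian region.

```python
# Toy index-flood calculation; all numbers are hypothetical illustrations.
index_flood = 320.0                        # index flood for the site, in m3/s
growth = {10: 1.4, 100: 1.9, 1000: 2.4}    # regional growth factor per return period T
design_floods = {T: round(index_flood * g, 1) for T, g in growth.items()}
print(design_floods)                       # {10: 448.0, 100: 608.0, 1000: 768.0}
```

The design flood for return period T is simply the index flood multiplied by the corresponding growth factor.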

The regional approach currently used in Norway is based on (1) fixed geographical regions, (2) linear regression on transformed variables, and (3) a unique growth curve for each region. The growth curve of each region is an average of all growth curves within the region based on an L-moment estimator. This approach does not account for parameter uncertainty, either in the regression equation for the index flood estimation or in the growth curves. The model uncertainty originates from the model selection in the regression analysis for the index flood model as well as the selection of a parametric distribution for the growth curve, whereas the parameter uncertainty originates from a limited sample size and measurement errors (e.g. Steinbakk et al., 2016).

A Bayesian approach accounts for these uncertainties and easily provides the predictive distribution of design floods (e.g. Fawcett and Walshaw, 2016; Kochanek et al., 2014). The potential for increasing the reliability of design flood estimates by including/improving the knowledge basis from which flood estimates are derived has increased the popularity of Bayesian methods. In addition, such methods are well suited for combining sources of information, such as historical information and expert judgments (e.g. Parent and Bernier, 2002; Reis and Stedinger, 2005). In the most recent update of the guidelines for FFA in Australia, the Bayesian approach is recommended (Ball et al., 2016). In a recent study on RFFA in small catchments in Norway, a Bayesian index flood approach is recommended (Glad et al., 2014).

Bayesian hierarchical models taking into account spatial and temporal structures have been used to describe extreme values in meteorology including extreme precipitation (Cooley et al., 2007; Renard et al., 2013; Dyrrdal et al., 2015), wind and storm surges (Fawcett and Walshaw, 2016). These Bayesian hierarchical models have a data layer described by an extreme value distribution at each site which depends on some unknown parameters, typically location, scale and shape. The spatial dependency is, in most cases, modeled by letting each parameter in the extreme value distribution depend on geographical, meteorological, and other location-specific characteristics. For some applications, a spatial dependency is based on distance, e.g. the Austrian RFFA procedure described in Castellarin et al. (2012) and the regional approaches for precipitation presented in Renard et al. (2013) and Dyrrdal et al. (2015).

The main objective of this study is to establish a hierarchical Bayesian model that can be used for estimating design floods in ungauged catchments. The following sub-objectives were identified:

  • Estimate the regional model and identify the most important predictors;

  • Evaluate the predictive performance of the regional model;

  • Compare the regional model to the existing model of Sælthun et al. (1997).

In order to achieve these aims, we build upon the work of Dyrrdal et al. (2015) and use a hierarchical Bayesian model to spatially describe the parameters of a generalized extreme value (GEV) distribution for annual maximum daily discharge in 203 catchments in Norway. For comparison, we also fit a local extreme value distribution to the discharge series at individual sites, using a three-parameter GEV distribution if more than 50 years of data are available and the simpler two-parameter Gumbel distribution if between 20 and 50 years of data are available, following the current guidelines for statistical FFA in Norway (Midttømme et al., 2011).

The remainder of the paper is organized as follows: Sections 2 and 3 present the data and the regional GEV model, the latter detailing the Bayesian hierarchical framework and the Markov chain Monte Carlo sampling. Section 4 gives an overview of the regional model currently used in Norway, and Section 5 outlines the setup for model validation. Section 6 first presents the resulting model and discusses model validation in terms of reliability and stability, then explores individual return levels for selected stations, and ends with a comparison of the model to the current regional and local models. The final Section 7 contains some concluding discussion, and details of the Bayesian inference algorithm are given in the appendix.

2 Data

The flood data consist of annual maximum floods from 203 streamflow stations of the Norwegian hydrological database “Hydra II” with at least 20 years of quality-controlled data for periods with minimal influence from river regulations (see Engeland et al. (2016) for details). For all gauging stations, we extracted the set of catchment properties listed in Table 1. Climatological temperature and precipitation information is derived from the gridded daily data product seNorge. Histograms for record length, catchment areas, lake percentage, mean annual temperature and precipitation, and the contribution of rain to floods are shown in Figure 1. Figure 2 shows the spatial distribution of mean annual precipitation and temperature, mean annual maximum floods and rain contribution.

Explanatory variable	Description
Effective lake	Percentage of effective lake
Average fraction of rain	Average relative contribution of rain (vs. snowmelt) to the floods
Catchment area	Total area of the catchment, including parts outside Norway
Inflow	Average inflow per year
Average rain in April	Average over the period 1960–1990
Average rain in August	Average over the period 1960–1990
Snow melting in March	Average over the period 1960–1990
Catchment gradient	Difference in height (m) between the 20th and 90th percentiles of the gradient profile, standardized by the total catchment length
Exposed bedrock	Percentage of mountainous area
Relative catchment area	Total area divided by catchment length
Table 1: Overview of covariate information used in the regional Bayesian model.
Figure 1: Histograms for record length (years), catchment areas (km²), lake percentage (%), mean annual temperature (°C), mean annual precipitation (mm) and the estimated rain contribution to floods (%).
Figure 2: Spatial distribution of mean annual precipitation (mm), mean annual temperature (°C), mean annual maximum floods and the rain contribution to floods (%).

The catchment areas range between 50 and 18 110 km² with a median of 247 km². The presence of lakes influences flood sizes, and 49.1% of the catchments have more than 1% of the catchment area (effectively) covered by lakes. For these catchments, the median effective lake percentage is 2.73%. The mean annual precipitation ranges from 291 to 3571 mm with a median of 732 mm. We see a strong west-east gradient in the spatial distribution, with the highest precipitation on the west coast. The mean annual temperature ranges from -4.6 to 6.0 °C with a median of 0.1 °C; the temperature is influenced by elevation as well as latitude in that it decreases with both. The relative contribution of rain is estimated by calculating the ratio of accumulated rain and snowmelt in a time window prior to each flood and then averaging these ratios over all floods, see Engeland et al. (2016) for details. Rainfall gives the major contribution to floods in most coastal catchments, whereas snowmelt is important in inland, northern and high-altitude catchments. The typical flood season for catchments dominated by snowmelt floods is spring and early summer, while no clear seasonal pattern is seen for the catchments dominated by rain floods.

The flood records and the associated catchment properties (catchment area, record length, mean annual runoff and several other catchment descriptors) are available as supplementary materials.

3 Methods

Extreme value theory provides a framework for modeling the tail of probability distributions. Let $X_1, X_2, \ldots$ denote continuous, univariate random variables assumed to be independent and identically distributed. If the suitably normalized distribution of the maximum $M_n = \max\{X_1, \ldots, X_n\}$ converges as $n \to \infty$, then it converges to a generalized extreme value (GEV) distribution (Fisher and Tippett, 1928; Jenkinson, 1955). For this reason, a GEV distribution is commonly used to model block maxima (the maxima over equally sized blocks of data) such as the annual maximum. See Coles (2001) for an introduction to the statistical application of extreme value theory.
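As a minimal, self-contained illustration of the block-maxima approach (synthetic data, not part of the paper's analysis), one can take annual maxima of simulated daily values and fit a GEV with scipy; note that `scipy.stats.genextreme` uses the shape convention c = -ξ, opposite in sign to the convention used here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic "daily discharge" for 60 years; the annual maximum is the
# block maximum over each year of 365 values.
daily = rng.gumbel(loc=100.0, scale=20.0, size=(60, 365))
annual_max = daily.max(axis=1)

# Fit a GEV to the block maxima. Since maxima of Gumbel variables are
# again Gumbel, the fitted shape c should be close to zero here.
c, loc, scale = stats.genextreme.fit(annual_max)
print(f"shape c = {c:.2f}, location = {loc:.0f}, scale = {scale:.0f}")
```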

In Norway, the three-parameter GEV distribution or a special case thereof, the two-parameter Gumbel distribution, is recommended for analyzing long data series from individual stations, as these models have been found to provide the best fit for Norwegian data (Midttømme et al., 2011; Castellarin et al., 2012). We have thus chosen to base our regional model on the three-parameter GEV distribution. Alternative models may provide a better fit in other regions. An overview of methods used for operational flood frequency analysis in Europe is given in Castellarin et al. (2012).

3.1 Regional GEV model

3.1.1 Model formulation

The data are given by series of annual maximum floods. Denote by $y_{t,s}$ the maximum flood in year $t \in \{1, \ldots, T_s\}$ at station $s \in \mathcal{S}$, the set of all stations in the data set, where $T_s$ is the number of annual floods observed at station $s$. We assume the floods to be independent and identically distributed over time with site-specific covariates $\mathbf{x}(s)$.

The three-parameter GEV distribution is here parametrized in terms of the location $\mu$, inverse scale $\lambda$, and shape $\xi$. The distribution is usually parametrized with the scale parameter $\sigma = 1/\lambda$, rather than the inverse scale (Coles, 2001), but the current parametrization is common in the Bayesian setting, see e.g. Dyrrdal et al. (2015). The density is given by

$$f(y \mid \mu, \lambda, \xi) = \lambda \, h(y)^{\xi + 1} \exp\{-h(y)\}, \qquad h(y) = [1 + \xi \lambda (y - \mu)]^{-1/\xi},  \tag{1}$$

for $1 + \xi \lambda (y - \mu) > 0$ when $\xi \neq 0$. For $\xi = 0$ the density is given by the Gumbel distribution

$$f(y \mid \mu, \lambda) = \lambda \exp\{-\lambda (y - \mu)\} \exp\{-\exp[-\lambda (y - \mu)]\}$$

for $y \in \mathbb{R}$.
For the purposes of dam safety, we are interested in estimates of certain high quantiles of the resulting GEV distribution at site $s$. The tail behavior is driven by the value of the shape parameter and generally falls in three classes: the Fréchet type ($\xi > 0$) has a heavy upper tail, the Gumbel type ($\xi = 0$) is characterized by a light upper tail, while the Weibull type ($\xi < 0$) is bounded from above. The shape parameter thus provides vital information on the statistical properties of the variable of interest and is, concurrently, difficult to estimate due to the involved parametric form of the density in (1) as a function of $\xi$. To estimate the $p$ quantile of the resulting GEV distribution function, we employ the GEV quantile function

$$q_p = \begin{cases} \mu + \dfrac{(-\log p)^{-\xi} - 1}{\lambda \xi}, & \xi \neq 0, \\[1ex] \mu - \dfrac{\log(-\log p)}{\lambda}, & \xi = 0. \end{cases}  \tag{2}$$

That is, $q_p$ is the return level associated with the return period $1/(1 - p)$ at site $s$.
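The quantile function translates directly into code; the sketch below uses the (μ, λ, ξ) parametrization above with hypothetical parameter values for a single site.

```python
import numpy as np

def gev_return_level(mu, lam, xi, T):
    """T-year return level for a GEV with location mu, inverse scale lam
    and shape xi; p = 1 - 1/T is the annual non-exceedance probability."""
    p = 1.0 - 1.0 / T
    if xi == 0.0:                                # Gumbel case
        return mu - np.log(-np.log(p)) / lam
    return mu + ((-np.log(p)) ** (-xi) - 1.0) / (lam * xi)

# Hypothetical site: mu = 200, scale 1/lam = 50, slightly heavy tail.
for T in (10, 100, 1000):
    print(T, round(gev_return_level(200.0, 0.02, 0.1, T), 1))
```

The return level grows without bound in T for ξ ≥ 0 and approaches the upper endpoint μ − 1/(λξ) for ξ < 0.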

Note that the model formulation in (1) assumes stationarity in time, ignoring e.g. potential effects of climate change. This follows Wilson et al. (2010) who found no systematic trends over time when analyzing annual maximum flood magnitudes in the Nordic countries.

3.1.2 Bayesian hierarchical framework

The model in (1) assumes a set of site-specific parameters $\mu(s)$, $\lambda(s)$ and $\xi(s)$ at each station $s$. The spatial variability is the result of a number of factors related to the variation in terrain and climate, which we aim to capture through the covariates listed in Table 1. Each of the parameters $\mu$, $\lambda$ and $\xi$ is specified by a linear model, e.g. for the location parameter

$$\mu(s) = \mathbf{x}(s)^\top \boldsymbol{\beta}_\mu,  \tag{3}$$

where the first covariate in the vector $\mathbf{x}(s)$ is a constant equal to $1$. Here, we assume a fixed model $M_\mu$ in which some of the elements of $\boldsymbol{\beta}_\mu$ are assumed to take real values, while others may be restricted to be equal to zero. The constraint $\beta_{\mu,j} = 0$ implies that the $j$th covariate does not influence the location parameter under the model $M_\mu$. In addition to performing inference over the parameter vector $\boldsymbol{\beta}_\mu$, we thus also perform inference over the set of possible models through Bayesian model averaging as described in Dyrrdal et al. (2015, Section 3.3). We can then assess the posterior marginal inclusion probability of each covariate.

The linear model in (3) assumes that the variability in the GEV parameters across $s$ is fully captured by the variability in the covariates, but in practice there may be additional heterogeneity not directly captured by $\mathbf{x}(s)$. To account for this, we include a site-specific random factor for each of the parameters, resulting in the model

$$\theta(s) = \mathbf{x}(s)^\top \boldsymbol{\beta}_\theta + \tau_\theta(s), \qquad \theta \in \{\mu, \lambda, \xi\},$$

where the random terms $\tau_\theta(s)$ are independent and given a zero-mean Gaussian prior distribution. Our full model can thus be written as

$$y_{t,s} \sim \mathrm{GEV}\big(\mu(s), \lambda(s), \xi(s)\big), \qquad \tau_\theta(s) \sim \mathcal{N}(0, \kappa_\theta^{-1}),$$

for $\theta \in \{\mu, \lambda, \xi\}$. We aim to use uninformative priors, with a gamma distribution for the precision $\kappa_\theta$ and independent standard normal priors for the elements of $\boldsymbol{\beta}_\theta$ for $\theta \in \{\mu, \lambda, \xi\}$.

The model in Section 3.1.2 is a slight simplification of the model discussed by Dyrrdal et al. (2015) as it assumes the random factors $\tau_\theta(s)$ for $\theta \in \{\mu, \lambda, \xi\}$ to be independent across the stations $s$. The inference over the parameters can thus be performed as described in Dyrrdal et al. (2015) using an appropriate modification of the associated package SpatGEVBMA (Lenkoski, 2014), available in R (R Core Team, 2016). The model in Dyrrdal et al. (2015) assumes an identity link on the precision parameter. The extension to a logarithmic link requires the calculation of new proposal distributions for the Markov chain Monte Carlo (MCMC) algorithm; this is described in the appendix.
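The generative side of the hierarchical structure can be sketched by forward simulation. The covariates, coefficient values and random-effect scales below are hypothetical stand-ins, and a log link is used on the inverse scale to keep it positive, in the spirit of the logarithmic-link extension above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_cov = 5, 4

# Hypothetical covariate matrix (first column is the constant 1) and
# fixed-effect vectors for the three GEV parameters.
X = np.column_stack([np.ones(n_sites), rng.normal(size=(n_sites, n_cov - 1))])
beta_mu  = np.array([200.0, 15.0, -8.0, 4.0])
beta_lam = np.array([-3.9, 0.1, 0.05, -0.02])   # on the log scale
beta_xi  = np.array([0.05, 0.02, -0.01, 0.0])

# Site-specific parameters: linear predictor plus independent zero-mean
# Gaussian random effects; exp() keeps the inverse scale positive.
tau = rng.normal(scale=[5.0, 0.05, 0.01], size=(n_sites, 3))
mu  = X @ beta_mu + tau[:, 0]
lam = np.exp(X @ beta_lam + tau[:, 1])
xi  = X @ beta_xi + tau[:, 2]

# Simulate 30 "annual maxima" per site by inverting the GEV cdf.
u = rng.uniform(size=(30, n_sites))
y = mu + ((-np.log(u)) ** (-xi) - 1.0) / (lam * xi)
print(y.shape)
```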

3.1.3 Posterior return levels

We run a Markov chain to return a collection of samples

$$\big\{\mu^{(b)}(s), \lambda^{(b)}(s), \xi^{(b)}(s)\big\}_{b=1}^{B},$$

where $B$ is typically in the range of 50 000 to 100 000, with a suitable number of burn-in samples removed, i.e. the first 10 000 to 20 000 samples. This yields a Markov sample of the parameter set, including both fixed and random effects. The sample of the corresponding GEV distributions directly yields, by using the GEV quantile function in (2), a sample of quantiles

$$\big\{q_p^{(b)}(s)\big\}_{b=1}^{B}.$$

This sample approximates the posterior distribution of the $p$th return level at site $s$ given $\mathbf{y}_s$, where $\mathbf{y}_s$ denotes the vector of maximum floods at site $s$ for all years for which measurements are available. Given this sample, it is straightforward to derive approximations for the posterior mean and median with point-wise posterior credible intervals for the $p$th return level.
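Given such a Markov sample of return levels at a site, the posterior summaries are simple empirical statistics; the lognormal draws below merely stand in for an actual MCMC sample.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for B Markov samples of the 100-year return level at one site.
q100_samples = rng.lognormal(mean=6.0, sigma=0.15, size=50_000)

post_mean   = q100_samples.mean()
post_median = np.median(q100_samples)
ci_80       = np.percentile(q100_samples, [10, 90])   # point-wise 80% credible interval
print(round(post_mean, 1), round(post_median, 1), np.round(ci_80, 1))
```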

The Markov sample also constitutes a mixture distribution based on the GEV density function in (1),

$$\hat{f}(y \mid \mathbf{y}_s) = \frac{1}{B} \sum_{b=1}^{B} f\big(y \mid \mu^{(b)}(s), \lambda^{(b)}(s), \xi^{(b)}(s)\big),  \tag{6}$$

approximating the posterior predictive distribution of a future observation at site $s$. Such mixture distributions do not have an explicit quantile function, and the return level is found by simulating a number of observations from each mixture component, using the empirical quantile of the observations pooled over all mixture components as an approximation. Due to the large number of Markov samples, only around 10 to 100 simulated observations from each component are needed to achieve a good approximation of the true return level. Equation (6) embodies Bayesian model averaging, as the predictive distribution is averaged over the different models.

Now assume we are interested in estimating the $p$th return level at a new site $s_0$ not used to estimate the model, but for which the covariates listed in Table 1 are available. The $p$th return level of the posterior predictive distribution for site $s_0$ is given by the empirical $p$th quantile found when combining simulated observations from all mixture components, based on the fixed effects $\mathbf{x}(s_0)^\top \boldsymbol{\beta}_\theta^{(b)}$ for $\theta \in \{\mu, \lambda, \xi\}$. The $p$th quantile of the predictive distribution will be close, but not identical, to (the approximation of) the median of the posterior distribution of the $p$th return level.
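The pooled-simulation approximation of the predictive return level can be sketched as follows, with stand-in posterior draws rather than output from an actual MCMC run; note that scipy's shape convention is c = -ξ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_mcmc, m = 2_000, 50     # Markov samples and simulated observations per component

# Stand-in posterior samples of (mu, lambda, xi) for one site.
mu  = rng.normal(200.0, 5.0, n_mcmc)
lam = np.exp(rng.normal(np.log(0.02), 0.05, n_mcmc))
xi  = rng.normal(0.05, 0.02, n_mcmc)

# Draw m observations from each mixture component and read off the
# pooled empirical quantile as the predictive 100-year return level.
pooled = stats.genextreme.rvs(c=-np.repeat(xi, m), loc=np.repeat(mu, m),
                              scale=1.0 / np.repeat(lam, m), random_state=rng)
q100 = np.quantile(pooled, 1.0 - 1.0 / 100)
print(round(q100, 1))
```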

The uncertainty is quantified by the point-wise 80% posterior credible intervals of the quantiles corresponding to a posterior sample of the three GEV parameters, given as follows:

  1. sample $\tau_\mu^{(b)}(s_0) \sim \mathcal{N}\big(0, 1/\kappa_\mu^{(b)}\big)$;

  2. set $\mu^{(b)}(s_0) = \mathbf{x}(s_0)^\top \boldsymbol{\beta}_\mu^{(b)} + \tau_\mu^{(b)}(s_0)$,

and similarly for $\lambda$ and $\xi$. The additional sampling of random effects gives a better (out-of-sample) calibration. Note that in general there is a higher level of uncertainty for a new site $s_0$ than for a site used to estimate the model, due to the independent sampling in step 1. This additional sampling of the random effects ensures that the uncertainty is not underestimated.

3.2 Local GEV model

We compare the regional model to a local non-hierarchical GEV model for at-site analysis without any spatial structure. In this setting, we assume that the annual maxima are described by Equation (1) with independent parameters $\mu(s)$, $\lambda(s)$ and $\xi(s)$, not sharing informative covariates. The unknown parameters are estimated by MCMC within a Bayesian framework such that $\mu(s)$, $\lambda(s)$ and $\xi(s)$ are updated in turn, see Steinbakk et al. (2016) for details on the prior distributions and their updates. For stations with data series shorter than 50 years, the shape parameter is assumed to be zero and a Gumbel distribution is fitted instead of the GEV distribution, according to the FFA recommendations for Norway (Midttømme et al., 2011; Castellarin et al., 2012). Figure 3 shows the spatial distribution of the estimated parameters from the local GEV model. The map of the shape parameter reveals that a considerable proportion of sites are estimated using a Gumbel distribution due to short data series.

Figure 3: Spatial distribution of the median parameter estimates from the local GEV model.

4 Current regional model for Norway

The Norwegian Water Resources and Energy Directorate (NVE) is responsible for determining guidelines for flood frequency analysis in Norway. The current regional framework for estimating flood return levels was established by Sælthun et al. (1997), see also Castellarin et al. (2012). The return values are found by first estimating the index flood based on geographical and hydrological parameters and, second, using a growth curve to scale up the index flood to a certain design flood. The two steps are determined by different procedures. Both the model for the index flood and the growth curve are based on initially dividing sites into distinct flood types (spring, autumn, glacier or all-year floods) within geographical regions: four different zones for spring floods and three zones for autumn floods. Each subdivision then has a separate regression model and relevant covariates. An overview of the covariates, with descriptions, used in the separate regression models is given in Table 2.

Parameter name	Unit
Catchment area	km²
Mean specific annual runoff	l/s per km²
Mean annual precipitation	mm
Effective lake	%
Exposed bedrock	%
Catchment length	km
Gradient of the main river	m/km
Table 2: Overview of key parameters for computing the index flood in the current system of NVE.

One difficulty with the current regional model is the lack of a rigorous definition of the flood regions. It can be difficult to determine the appropriate region of a new site, and the estimated return values vary with the choice of region. Hence it will be highly beneficial to model the regional or spatial characteristics of floods as a continuum in a more complex regression model. In addition, the growth curve is only available for certain predetermined return periods.

5 Model validation

To validate the models we follow Renard et al. (2013) assessing reliability, or calibration, and stability. Reliability describes the consistency between validation data (data not used for calibration) and FFA predictions. A reliable, or well-calibrated, model should yield an estimated distribution close to the unknown true distribution of the data. Stability, on the other hand, quantifies the ability of the model to yield similar estimates when calibration data change.

We assess the predictive power of the regional model through a cross-validation study, such that reliability is assessed through the consistency between predictions and hold-out data. Due to the heavy computational burden of the Markov sampling in the hierarchical model, a smaller number of sites, specifically 27 stations, were selected by experts to represent the range of different sites in Norway. We employ a leave-one-out cross-validation scheme, where all data for each of the 27 stations, in turn, are left out of the model fitting. The distribution of the random effects gives the main difference between in-sample and out-of-sample predictions, as the in-sample parameter estimates allow for, and usually have, correlated random effects, while the random effects are independently drawn for the out-of-sample estimates.

5.1 Reliability

The main reliability assessment tool is the probability integral transform (PIT), displayed graphically by histograms and probability-probability (PP) plots. If the observations follow the estimated distribution, the PIT values will be uniformly distributed (Dawid, 1984). Histograms of PIT values can be assessed at a local level if the data series are long, whereas histograms with too few observations do not allow for a useful graphical assessment. Instead one should assess regional average reliability by combining PIT values from several or all locations in a single histogram. The reliability of individual stations is assessed by PP plots displaying the observed empirical distribution of PIT values against the theoretical uniform distribution. The values should follow the one-to-one line as closely as possible; deviations indicate poor reliability, such as over- or underestimation compared to the observed values.
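A PIT value is simply the predictive cdf evaluated at the observation. In the synthetic sketch below the predictive distribution is correctly specified, so the PIT values are (approximately) uniform; scipy's shape convention c = -ξ is used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic annual maxima and a matching predictive GEV distribution.
obs = stats.genextreme.rvs(c=-0.1, loc=200.0, scale=50.0, size=40, random_state=rng)
pit = stats.genextreme.cdf(obs, c=-0.1, loc=200.0, scale=50.0)

# Under a reliable model the PIT values are uniform on [0, 1]; pooling
# them across stations gives the regional reliability histogram.
print(round(pit.mean(), 2))
```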

We assess the reliability in the tail of the predictive distribution using proper scoring rules, in particular the quantile score, see e.g. Friederichs and Hense (2007), Gneiting and Raftery (2007) and references therein. Denoting the predictive distribution by $F$, its $\alpha$-quantile by $q_\alpha = F^{-1}(\alpha)$ and the realized observation by $y$, the quantile score is given by

$$\mathrm{QS}_\alpha(F, y) = \big(\mathbb{1}\{y < q_\alpha\} - \alpha\big)(q_\alpha - y)$$

for a specific probability level $\alpha \in (0, 1)$. Alternatively, one could use the weighted continuous ranked probability score, integrating over all quantiles greater than some threshold, say the 50-year return level (Gneiting and Ranjan, 2011). But as the current regional model used in Norway is only (easily) available for certain pre-determined return periods, the models are compared using quantile scores.
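The quantile score is the pinball loss familiar from quantile regression; a small check on synthetic standard normal observations confirms that the true α-quantile minimizes the mean score.

```python
import numpy as np

def quantile_score(q, y, alpha):
    """Pinball/quantile score for quantile forecast q at level alpha;
    negatively oriented (smaller is better)."""
    y = np.asarray(y, dtype=float)
    return np.where(y < q, (1 - alpha) * (q - y), alpha * (y - q))

rng = np.random.default_rng(4)
y = rng.normal(size=100_000)
for q in (0.5, 1.2816, 2.5):       # 1.2816 is the true N(0,1) 0.90-quantile
    print(q, round(quantile_score(q, y, 0.90).mean(), 3))
```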

While histograms and PP plots of PIT values only assess the reliability of the predictions, scoring rules such as the quantile score simultaneously assess several aspects of the predictive distributions, see e.g. Stephenson et al. (2008) and Bentzien and Friederichs (2014). That is, by conditioning (stratifying) on the predicted probabilities, the scores may be decomposed into the sum of three components: reliability, resolution and uncertainty. The resolution is related to the information content of the prediction model. It describes the variability of the observations under different forecasts and indicates whether the prediction model can discriminate between different outcomes of an observation. In our setting, a prediction model with a good resolution is able to discriminate between locations with different flood characteristics while a prediction model with no resolution would issue the same predictive distribution at all locations. The uncertainty component refers to the variability in the observations and is thus identical for competing models when assessed under the same data set.

5.2 Stability

Within a comparison framework, Renard et al. (2013) advise to first assess reliability and then use stability to further discriminate between models if several models are equally reliable. Stability quantifies to which degree the statistical model yields similar predictive distributions when trained on different, but identically distributed, data sets. This is a property solely of the statistical model, and thus arbitrarily large return periods can be assessed. A general procedure is as follows: first, the statistical model is fitted using all available data, yielding an estimated predictive distribution $\hat{F}$; this is followed by estimation on a set of leave-one-out scenarios, yielding estimated predictive distributions $\hat{F}_{-i}$. Then each $\hat{F}_{-i}$ is compared to $\hat{F}$ using some measure of divergence, giving an average or maximum divergence. We will assess the stability of our model through the variability of the parameter estimates, in particular the intercept terms, over the leave-one-out cross-validation scheme.
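The leave-one-out stability procedure can be sketched with a deliberately simple stand-in "model" (a Gaussian fit by moments); the point here is the refitting loop and the divergence summary, not the model itself.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(size=200)                      # stand-in calibration data

def fit(sample):
    """Toy model fit: the predictive distribution is summarized by
    its estimated mean and standard deviation."""
    return np.array([sample.mean(), sample.std()])

ref = fit(data)
# Refit on each leave-one-out data set and summarize stability as the
# maximum shift of the fitted parameters relative to the full-data fit.
loo = np.array([fit(np.delete(data, i)) for i in range(len(data))])
max_shift = np.abs(loo - ref).max(axis=0)
print(np.round(max_shift, 3))
```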

6 Results

This section presents the results of applying the aforementioned models to the prediction of annual flood maxima in Norway. We first assess the reliability of the regional model through the cross-validation study and then present in-depth results for selected stations, chosen to showcase the range of individual site behavior.

6.1 Bayesian regional model

Relevant covariates for the regional model were first explored by assessing relationships between the covariates and the estimated $\mu$, $\lambda$ and $\xi$ from the local model. This revealed that some covariates needed to be transformed or combined into new variables. To decide on the specific set of covariates, we used stepwise linear regression optimizing AIC (Siotani et al., 1985), with the index (mean level) flood for each station as the response. The resulting best model contained 13 covariates, seen in Table 1, and the selected variables overlap to a large degree with the covariates used by the current regional model; for details see Table 2. New covariates not considered in the current framework are the average fraction of rain versus snowmelt, the meteorological variables (i.e. rain in April and August and snowmelt in March) and, in particular, longitude and latitude. The regional model was run using 100 000 MCMC iterations with 20 000 burn-in samples; the posterior marginal inclusion probability of each covariate is given in Table 3.

Covariate	$\mu$	$\lambda$	$\xi$
Constant	100	100	100
Longitude	53	99	6
Latitude	84	100	8
Percent of effective lake	98	100	2
Inflow	6	11	9
Average fraction of rain	3	22	58
Catchment area	2	5	12
Average rain in April	42	75	5
Average rain in August	100	100	5
Snow melting in March	8	16	4
Catchment gradient	45	12	2
Percent of bedrock	22	8	2
Relative area to length	13	95	11
Table 3: Inclusion probabilities (%) of the covariates in the models for the location parameter $\mu$, scale parameter $\lambda$ and shape parameter $\xi$.

It is seen that the spatial location of a station, in terms of longitude and latitude, is crucial for all GEV parameters, thereby reflecting the importance of flood regions and the spatial similarity between flood sites. Maps of the median estimates of the fixed effects for $\mu$, $\lambda$ and $\xi$ are shown in Figure 4. There are clear spatial structures with east-to-west and north-to-south gradients in both the location and the scale parameter, while the shape parameter mainly displays a difference between coast and inland. From Figure 3 one can see that the estimated fixed effects for $\mu$ and $\lambda$ agree with the distribution of the parameters from the local model, while the pattern of $\xi$ is not observed in the local model. These results highlight that pooling information across catchment sites via the linear regression model is highly beneficial.

Figure 4: Spatial distribution of the median fixed effects per site, for 203 stations, for the three GEV parameters $\mu$, $\lambda$ and $\xi$.

Further, both the parameters $\mu$ and $\lambda$ are explained by the percentage of effective lake, as a dominant lake in the catchment area can dampen both the mean flood level and the yearly variations, and by the average rain in April and August, as more rain will increase both average flood levels and variability. The derived covariate, catchment area divided by the basin length, is seen to significantly affect the scale $\lambda$, suggesting that “long” catchments (elongated along the length) experience less variability, possibly due to a dampening effect. The percentage of bedrock and the catchment gradient, on the other hand, mainly influence the location parameter $\mu$, as a larger degree of mountainous terrain within a catchment and a steeper gradient will increase the average flood level. Lastly, the shape parameter $\xi$ is mainly explained by the average fraction of rain in the flood, and to a smaller degree by the catchment area, the relative area to length, and the inflow. In areas with a smaller fraction of rain compared to snowmelt, the annual maximum flood is more often a spring flood caused by snowmelt, which will to a larger degree be limited by an upper threshold. This characteristic would be accounted for by a negative shape parameter.

Figure 5 shows the median estimates of the random effects for the three parameters, per station, and can reveal whether additional spatial effects need to be accounted for. Overall, the random effects appear to be spatially independent, apart from scattered but small clusters present in all three parameters, which could be due to river networks or regional catchment areas.

Figure 5: Spatial distribution of the median random effects per site, for 203 stations, for the three GEV parameters (location, scale and shape).

We validate the model following Section 5 and start by comparing histograms of aggregated PIT values for the out-of-sample and in-sample regional models and the local model. Figure 6 shows that the local and in-sample regional models are well calibrated over the 27 selected stations, while the out-of-sample model is somewhat overdispersive. The PP plots in Figure 7 show that the regional model severely over- or underestimates the return levels at some out-of-sample stations, while the predictive distributions at most stations are well calibrated. We assess the stability of the model through the variability of the fixed effects in each of the three parameters. Figure 8 shows box plots of the mean regression coefficients over the 27 leave-one-out cross-validation models, for each of the 13 selected covariates and the three parameters. All estimates are seen to be very stable, in particular for the location and scale parameters; only the coefficient estimates of the rain contribution and the catchment area for the shape parameter are slightly less stable.
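As a concrete illustration of this calibration check, the PIT value of an observation is simply the predictive CDF evaluated at that observation; under a well-calibrated model the pooled PIT values are approximately uniform on [0, 1]. A minimal sketch in Python (the paper's implementation is in R; the GEV parameter values below are purely illustrative, not estimates from the model):

```python
import math

def gev_cdf(y, mu, sigma, xi):
    """CDF of the GEV distribution; xi = 0 gives the Gumbel limit."""
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-(y - mu) / sigma))
    z = 1.0 + xi * (y - mu) / sigma
    if z <= 0.0:
        # Observation outside the support: CDF is 0 (xi > 0) or 1 (xi < 0)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-z ** (-1.0 / xi))

# PIT values: predictive CDF evaluated at each observed annual maximum.
observations = [410.0, 523.0, 388.0, 610.0, 455.0]   # hypothetical floods
pit = [gev_cdf(y, mu=450.0, sigma=80.0, xi=0.1) for y in observations]

# For a calibrated model, a histogram of such PIT values (pooled over
# stations, as in Figure 6) should be close to uniform on [0, 1].
assert all(0.0 <= p <= 1.0 for p in pit)
```

An overdispersive predictive distribution shows up here as an excess of PIT values near the center of the unit interval, which is the pattern reported for the out-of-sample regional model.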

Figure 6: Histograms of PIT values for all observations from the 27 cross-validation stations, a total of 1305 observations. The regional out-of-sample model shows an excess of PIT values away from 0 and 1.
Figure 7: Probability-probability plots of the PIT values for the 27 cross-validation stations. For the regional out-of-sample model, the return levels at some stations are highly overestimated, while at others they are underestimated.
Figure 8: Assessment of stability: Box plots displaying the variability of the posterior mean regression coefficients over the 27 cross-validation models, shown for each of the 13 covariates and the three GEV parameters.

To compare the fit of the different models with the observed data, we explore the results at six selected sites. Figure 9 shows the (out-of-sample) estimated return level plotted against the return period (on a log scale) for three sites with short (32 years), medium (56 years) and long (93 years) data series. The figure shows the observed data (black dots), the median estimate of the local model (black lines), and the predictive distribution (dashed lines) and pointwise 80% credible bands (gray area) of the return level for the regional model, together with the current regional model (dotted lines with squares). The displayed stations have been selected to reflect a good fit of the regional model to the observed data, such that the regional predictive distribution gives a 1000 year return level in agreement with the local model. For these three sites, the new regional model greatly improves on the regional model currently used in Norway. It is also clear that the current regional model stays within the 80% credible bands of the new regional model.

Figure 9: Estimated return levels for three stations (32, 56 and 93 years of observations, respectively) showing good agreement between the regional and local model. Black lines: Local Bayesian model. Dashed lines: Posterior predictive distribution. Gray area: 80% credible interval for the posterior quantile distribution. Dotted line with squares: Standard regional model. Black dots: Data.
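The return level curves in plots such as Figure 9 are quantiles of the fitted GEV distribution: the T-year return level is the (1 − 1/T)-quantile. The standard GEV quantile formula (see e.g. Coles, 2001) can be sketched in Python as follows (the paper's implementation is in R; the parameter values here are purely illustrative):

```python
import math

def gev_return_level(T, mu, sigma, xi):
    """T-year return level: the (1 - 1/T)-quantile of the GEV distribution."""
    y_T = -math.log(1.0 - 1.0 / T)       # = -log(1 - 1/T)
    if abs(xi) < 1e-12:                  # Gumbel limit
        return mu - sigma * math.log(y_T)
    return mu + (sigma / xi) * (y_T ** (-xi) - 1.0)

# A return level curve such as those in Figure 9 is traced by varying the
# return period T on a log scale; parameters below are illustrative only.
for T in (10, 50, 100, 1000):
    level = gev_return_level(T, mu=450.0, sigma=80.0, xi=0.1)
```

For a positive shape parameter the curve is unbounded and convex on the log scale, while a negative shape parameter, as expected for snowmelt-dominated catchments, yields an upper bound on the return levels.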

Figure 10 shows the same out-of-sample return level plots for three sites with short, medium and long data series, but selected to reflect poor agreement between the regional model and the observed data. At all three sites the regional model overestimates the return levels compared to the local model, but in all three cases our regional model still gives a better fit than the current regional model. Table 4 displays the average quantile scores of the out-of-sample predictions at the 27 selected stations, comparing the regional model to the local model and the regional model currently used by NVE. For all return periods considered, the new regional model performs significantly better than the current regional model, but not as well as the local model.

Figure 10: Estimated return levels for three stations (32, 56 and 93 years of observations, respectively) showing poor agreement between the regional and local model. Black lines: Local Bayesian model. Dashed lines: Posterior predictive distribution. Gray area: 80% credible interval for the posterior quantile distribution. Dotted line with squares: Standard regional model. Black dots: Data.
NVE model        6.71 [6.23, 7.20]    4.04 [3.63, 4.46]    3.17 [2.80, 3.55]
Regional model   2.94 [2.72, 3.17]    1.01 [0.87, 1.17]    0.62 [0.51, 0.75]
Local model      1.95 [1.80, 2.11]    0.63 [0.57, 0.70]    0.36 [0.32, 0.40]
Table 4: Average quantile scores for three return periods, the longer two being 50 and 100 years, comparing the regional model with the local model and the regional model currently in use at NVE (Sælthun et al., 1997). The 90% uncertainty intervals are obtained by bootstrapping.
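The quantile score used in Table 4 is the standard consistent scoring function for quantile predictions (Gneiting and Raftery, 2007; Bentzien and Friederichs, 2014): lower is better, and a return period T corresponds to the quantile level 1 − 1/T. A Python sketch with purely illustrative numbers (the paper's implementation is in R):

```python
def quantile_score(q, y, tau):
    """Quantile score for a predicted tau-quantile q and observation y.

    Nonnegative, and zero only when q = y; tau = 1 - 1/T for a T-year
    return period."""
    indicator = 1.0 if y <= q else 0.0
    return (indicator - tau) * (q - y)

# Average score over stations, as in Table 4 (numbers are hypothetical).
predicted_quantiles = [620.0, 540.0, 700.0]   # predicted 50-year levels
observed = [600.0, 580.0, 650.0]              # hypothetical observations
tau = 1.0 - 1.0 / 50.0
mean_qs = sum(quantile_score(q, y, tau)
              for q, y in zip(predicted_quantiles, observed)) / len(observed)
```

Because the score is consistent for the quantile, averaging it over stations and bootstrap samples gives a fair comparison of the three models at each return period.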

7 Discussion

In accordance with our main objective, we have developed a regional model for extreme flood estimation to be used when little or no data are available at a catchment site. The model is a Bayesian hierarchical framework with site-specific GEV parameters based on geographical, hydrological and meteorological covariates, such that information is shared across catchment sites. We have identified 12 important predictors of this type to describe the local variation in the model parameters. By evaluating quantile scores for different return periods, it is seen that the Bayesian regional model gives better predictive performance than the current regional model used by NVE in Norway. Recent work by Yan and Moradkhani (2016) similarly finds that Bayesian hierarchical modeling and model averaging are beneficial for analyzing extreme flood data, in particular when the quantification of extreme flood uncertainty is emphasized.

While the new regional model significantly improves upon the current NVE model, a reliability analysis at 27 out-of-sample stations reveals that the model is, overall, somewhat overdispersive. A further analysis of the performance at individual stations indicates that the selected set of covariates may not be able to pick up the location-specific effects at all locations. Among the selected stations, the regional model tends to provide higher estimates of flood sizes for high return periods than the local model. The Viksvatn station on the Gaular river in Sogn is one example of a catchment site where the regional model highly overestimates the flood distribution compared to the local model, which is based on 113 years of observed annual floods, as seen in Figure 10. However, in other, unrelated analyses, we have found that recorded levels of rain at catchment sites in the Sogn area appear to be consistently estimated too high. As the average rain in August is one of the most influential predictors of the overall flood level, any overestimation of this covariate will have a large impact on the predicted return levels. This highlights the importance of good data quality for achieving reliable return level predictions.

The covariate describing the average contributions of rain and snowmelt to the floods is found to be the best predictor of the shape parameter. However, the exact calculation of this covariate requires some observed flood data. While the covariate may be estimated solely from available meteorological data products such as SeNorge, it remains to be assessed how much additional uncertainty this introduces.


Acknowledgements

This work was jointly funded by The Research Council of Norway and Energy Norway through grant 235710 (FlomQ). The flood and hydrological data were extracted from the national hydrological database at the Norwegian Water Resources and Energy Directorate and are listed in the supplementary materials. Climatological data are derived from the data product SeNorge. The source code for estimating the flood frequency distributions is implemented in the statistical programming language R (R Core Team, 2016) as part of the R package SpatGEVBMA, which is available on GitHub.


  • Ball et al. (2016) Ball, J., M. Babister, R. Nathan, W. Weeks, E. Weinmann, M. Retallick, and I. Testoni (2016). Australian rainfall and runoff: A guide to flood estimation. Commonwealth of Australia.
  • Bentzien and Friederichs (2014) Bentzien, S. and P. Friederichs (2014). Decomposition and graphical portrayal of the quantile score. Quarterly Journal of the Royal Meteorological Society 140, 1924–1934.
  • Castellarin et al. (2012) Castellarin, A., S. Kohnová, L. Gaál, A. Fleig, J. L. Salinas, A. Toumazis, T. R. Kjeldsen, and N. Macdonald (Eds.) (2012). Review of applied statistical methods for flood frequency analysis in Europe. NERC/Centre for Ecology & Hydrology. ISBN 978-1-906698-32-4.
  • Coles (2001) Coles, S. (2001). An introduction to statistical modeling of extreme values. Springer, London.
  • Cooley et al. (2007) Cooley, D., D. Nychka, and P. Naveau (2007). Bayesian Spatial Modeling of Extreme Precipitation Return Levels. Journal of the American Statistical Association 102(479), 824–840.
  • Dalrymple (1960) Dalrymple, T. (1960). Flood-frequency analyses. U.S. Geological Survey Water Supply Paper 1543-A.
  • Dawid (1984) Dawid, A. P. (1984). Statistical theory: The prequential approach (with discussion and rejoinder). Journal of the Royal Statistical Society Ser. A 147, 278–292.
  • Dyrrdal et al. (2015) Dyrrdal, A. V., A. Lenkoski, T. L. Thorarinsdottir, and F. Stordal (2015). Bayesian hierarchical modeling of extreme hourly precipitation in Norway. Environmetrics 26, 89–106.
  • Engeland et al. (2016) Engeland, K., L. Schlichting, F. Randen, K. S. Nordtun, T. Reitan, T. Wang, E. Holmqvist, A. Voksø, and V. Eiden (2016). Utvalg og kvalitetssikring av flomdata for flomfrekvensanalyser. Technical Report 85-2016, NVE.
  • Fawcett and Walshaw (2016) Fawcett, L. and D. Walshaw (2016). Sea-surge and wind speed extremes: optimal estimation strategies for planners and engineers. Stochastic Environmental Research and Risk Assessment 30(2), 463–480.
  • Fisher and Tippett (1928) Fisher, R. A. and L. H. C. Tippett (1928). Limiting Forms of the Frequency Distribution of the Largest and Smallest Member of a Sample. Proceedings of the Cambridge Philosophical Society 24, 180–190.
  • Friederichs and Hense (2007) Friederichs, P. and A. Hense (2007). Statistical downscaling of extreme precipitation events using censored quantile regression. Monthly weather review 135(6), 2365–2378.
  • Glad et al. (2014) Glad, P. A., T. Reitan, and S. Stenius (2014). Regionalt formelverk for flomberegning i små nedbørfelt. Technical Report 62-2014, NVE, Oslo.
  • Gneiting and Raftery (2007) Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378.
  • Gneiting and Ranjan (2011) Gneiting, T. and R. Ranjan (2011). Comparing density forecasts using threshold-and quantile-weighted scoring rules. Journal of Business & Economic Statistics 29(3), 411–422.
  • Hosking and Wallis (1997) Hosking, J. R. M. and J. R. Wallis (1997). Regional frequency analysis. Cambridge University Press, UK.
  • Jenkinson (1955) Jenkinson, A. F. (1955). The Frequency Distribution of the Annual Maximum (or Minimum) values of Meteorological elements. Quarterly Journal of the Royal Meteorological Society 81, 158–171.
  • Kochanek et al. (2014) Kochanek, K., B. Renard, P. Arnaud, Y. Aubert, M. Lang, T. Cipriani, and E. Sauquet (2014). A data-based comparison of flood frequency analysis methods used in France. Natural Hazards and Earth System Sciences 14, 295–308.
  • Lenkoski (2014) Lenkoski, A. (2014). SpatGEVBMA: Hierarchical spatial generalized extreme value (GEV) modeling with Bayesian Model Averaging (BMA). R package version 1.0.
  • Midttømme et al. (2011) Midttømme, G., L. Pettersson, E. Holmqvist, Ø. Nøtsund, H. Hisdal, and R. Sivertsgård (2011). Guidelines for flood estimation (in Norwegian). Oslo: NVE. Retningslinjer nr. 4/2011.
  • Parent and Bernier (2002) Parent, E. and J. Bernier (2002). Bayesian POT modelling for historical data. Journal of Hydrology 274, 95–108.
  • R Core Team (2016) R Core Team (2016). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
  • Reis and Stedinger (2005) Reis, D. and J. Stedinger (2005). Bayesian MCMC flood frequency analysis with historical information. Journal of Hydrology 313(1–2), 97–116.
  • Renard et al. (2013) Renard, B., K. Kochanek, M. Lang, F. Garavaglia, E. Paquet, L. Neppel, K. Najib, J. Carreau, P. Arnaud, Y. Aubert, F. Borchi, J.-M. Soubeyroux, S. Jourdain, J.-M. Veysseire, E. Sauquet, T. Cipriani, and A. Auffray (2013). Data-based comparison of frequency analysis methods: A general framework. Water Resources Research 49, 825–843.
  • Rue and Held (2005) Rue, H. and L. Held (2005). Gaussian Markov random fields: theory and applications. CRC press.
  • Sælthun et al. (1997) Sælthun, N., O. Tveito, T. Bønsnes, and L. Roald (1997). Regional flomfrekvensanalyse for norske vassdrag. Technical Report 14-97, NVE.
  • Siotani et al. (1985) Siotani, T., Y. Fujikoshi, and T. Hayakawa (1985). Modern multivariate statistical analysis, a graduate course and handbook. American Sciences Press.
  • Steinbakk et al. (2016) Steinbakk, G. H., T. L. Thorarinsdottir, T. Reitan, L. Schlichting, S. Hølleland, and K. Engeland (2016). Propagation of rating curve uncertainty in design flood estimation. Water Resources Research 52(9), 6897–6915.
  • Stephenson et al. (2008) Stephenson, D. B., C. A. S. Coelho, and I. T. Jolliffe (2008). Two extra components in the Brier score decomposition. Weather and Forecasting 23, 752–757.
  • Wilson et al. (2010) Wilson, D., H. Hisdal, and D. Lawrence (2010). Has streamflow changed in the Nordic countries? – recent trends and comparisons to hydrological projections. Journal of Hydrology 394, 334–346.
  • Yan and Moradkhani (2016) Yan, H. and H. Moradkhani (2016). Toward more robust extreme flood prediction by Bayesian hierarchical and multimodeling. Natural Hazards 91, 203–225. DOI:10.1007/s11069-015-2070-6.

Appendix A Hierarchical model with a link on the precision

This section discusses an extension of the MCMC algorithm described in Dyrrdal et al. (2015) in which the regression equation for the precision parameter is defined with the log-transformed precision as the response variable. In general, assume we want to update a parameter $\theta$ in our model, where $\theta^{(0)}$ is the current value. We draw a new value $\theta^*$ from a proposal distribution $q(\theta^* \mid \theta^{(0)})$ and accept the proposal with probability $\min\{1, r\}$, where

$$r = \frac{L(\mathbf{x} \mid \theta^*)\, p(\theta^*)\, q(\theta^{(0)} \mid \theta^*)}{L(\mathbf{x} \mid \theta^{(0)})\, p(\theta^{(0)})\, q(\theta^* \mid \theta^{(0)})}.$$

Here, $L(\mathbf{x} \mid \cdot)$ denotes the likelihood of the full data set, which depends on $\theta$ and potentially other parameters that are kept fixed throughout, and $p$ is the prior distribution of $\theta$, which similarly might depend on other parts of the model. Given the complexity of the model, it is vital to design efficient proposal distributions that return good proposals and are robust in the sense that they do not require fine-tuning for each individual data set.
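The generic update step above can be sketched as follows. This is an illustrative Python version of a standard Metropolis-Hastings update (the paper's implementation is the R package SpatGEVBMA), demonstrated on a toy standard normal target:

```python
import math
import random

def mh_step(theta, log_post, propose, log_q):
    """One Metropolis-Hastings update of a scalar parameter.

    log_post: log of likelihood times prior, up to an additive constant;
    propose(theta) draws theta* from q(. | theta);
    log_q(a, b) is log q(a | b)."""
    theta_star = propose(theta)
    log_r = (log_post(theta_star) - log_post(theta)
             + log_q(theta, theta_star) - log_q(theta_star, theta))
    if random.random() < math.exp(min(0.0, log_r)):
        return theta_star          # accept the proposal
    return theta                   # reject: keep the current value

# Toy target: standard normal log-posterior with a symmetric random-walk
# proposal, for which the q-terms cancel.
log_post = lambda t: -0.5 * t * t
propose = lambda t: t + random.gauss(0.0, 1.0)
log_q = lambda a, b: 0.0           # symmetric proposal: constant in (a, b)

random.seed(1)
chain = [0.0]
for _ in range(1000):
    chain.append(mh_step(chain[-1], log_post, propose, log_q))
```

The Gaussian approximation described next supplies the proposal distribution `propose` and the corresponding `log_q` terms in the actual sampler, replacing the generic random walk used in this toy example.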

For designing the proposal distribution, we employ a Gaussian approximation (Rue and Held, 2005, Ch. 4.4). Assume that the posterior distribution of the parameter $\theta$ can be written on the form

$$\pi(\theta \mid \mathbf{x}) \propto \exp\{f(\theta)\}$$

for some function $f$. A quadratic Taylor expansion of the log-posterior around the current value $\theta^{(0)}$ gives

$$f(\theta) \approx f(\theta^{(0)}) + f'(\theta^{(0)})(\theta - \theta^{(0)}) + \tfrac{1}{2} f''(\theta^{(0)})(\theta - \theta^{(0)})^2 = a + b\theta - \tfrac{1}{2} c\theta^2,$$

where $b = f'(\theta^{(0)}) - \theta^{(0)} f''(\theta^{(0)})$ and $c = -f''(\theta^{(0)})$. The posterior distribution may now be approximated by

$$\tilde{\pi}(\theta \mid \mathbf{x}) \propto \exp\{b\theta - \tfrac{1}{2} c\theta^2\},$$

the density of the Gaussian distribution $\mathcal{N}(b/c, 1/c)$. We thus choose $\mathcal{N}(b/c, 1/c)$ as our proposal distribution. This implies that in order to update the model in Dyrrdal et al. (2015) to include a logarithmic link for the precision, we need to calculate the first two derivatives of the log-likelihood function with respect to the random effect.

A.1 The case ξ ≠ 0

We have , where and . Now, fix and set . We then have


It follows that


A.2 The case ξ = 0

Using the same notation as above, we have


Note that