1 Introduction
Flood frequency analysis (FFA) is a statistical databased approach to determine the magnitude of a flood event with a certain return period. If sufficient data are available at a single site, an extreme value distribution is usually fitted to the observed data (atsite FFA or local model). However, ungauged sites or sites suffering from incomplete data require the use of data from nearby or comparable gauged stations. We refer to this as regional flood frequency analysis (RFFA).
The motivation for this study is the need to update the current guidelines for estimation of the design flood in ungauged catchments in Norway (Castellarin et al., 2012; Midttømme et al., 2011). The current guidelines are based on recommendations that are more than 20 years old (Sælthun et al., 1997)
. With increased data availability (20 more years) and the development of new Bayesian inference methods, including for engineering applications (e.g.
Ball et al. (2016)), there is a significant potential for improvement to the current RFFA methods.The classical approach for RFFA is the indexflood method, see e.g. Dalrymple (1960) and Hosking and Wallis (1997)
. This method assumes that the flood frequency curve for all sites in a region follows the same distribution up to a scaling factor. The index flood method hence consists of three independent steps: (1) identification of homogeneous regions or similar stations, (2) estimation of the index flood, and (3) derivation of the growth curve that gives factors for scaling an index flood to a suitable return level. If no appropriate data are available, the index flood is derived by regional regression analysis or based on nearby measurement stations (scaled with respect to catchment area), and a regional growth curve is applied. An overview of European procedures for RFFA is given in
Castellarin et al. (2012). For step (1), typical approaches are to use fixed geographical regions, station similarity, or focused pooling (region of influence). In Austria, and interpolation approach is used. For step (2), linear regression, possibly combined with transformation of independent and/or dependent variables is used. For step (3), maximum likelihood, ordinary moment estimation or estimation based on lmoments are commonly used.
The regional approach currently used in Norway is based on (1) fixed geographical regions, (2) linear regression on transformed variables, and (3) a unique growth curve for each region. The growth curve of each region is an average of all growth curves within a region based on an lmoment estimator. This approach does not account for parameter uncertainty, neither in the regression equation for the index flood estimation nor in the growth curves. The model uncertainty originates from the model selection in the regression analysis for the index flood model as well as the selection of a parametric distribution for the growth curve, whereas the parameter uncertainty originates from a limited sample size and measurement errors (e.g. Steinbakk et al., 2016).
A Bayesian approach accounts for these uncertainties and easily provides the predictive distribution of design floods (e.g. Fawcett and Walshaw, 2016; Kochanek et al., 2014). The potential for increasing the reliability of design flood estimates by including/improving the knowledge basis from which flood estimates are derived has increased the popularity of Bayesian methods. In addition, such methods are well suited for combining sources of information, such as historical information and expert judgments (e.g. Parent and Bernier, 2002; Reis and Stedinger, 2005). In the most recent update of the guidelines for FFA in Australia, the Bayesian approach is recommended (Ball et al., 2016). In a recent study on RFFA in small catchments in Norway, a Bayesian index flood approach is recommended (Glad et al., 2014).
Bayesian hierarchical models taking into account spatial and temporal structures have been used to describe extreme values in meteorology including extreme precipitation (Cooley et al., 2007; Renard et al., 2013; Dyrrdal et al., 2015), wind and storm surges (Fawcett and Walshaw, 2016). These Bayesian hierarchical models have a data layer described by an extreme value distribution at each site which depends on some unknown parameters, typically location, scale and shape. The spatial dependency is, in most cases, modeled by letting each parameter in the extreme value distribution depend on geographical, meteorological, and other locationspecific characteristics. For some applications, a spatial dependency is based on distance, e.g. the Austrian RFFA procedure described in Castellarin et al. (2012) and the regional approaches for precipitation presented in Renard et al. (2013) and Dyrrdal et al. (2015).
The main objective of this study is to establish a hierarchical Bayesian model that can be used for estimating design floods in ungauged catchments. The following subobjectives were identified:

Estimate the regional model and identify the most important predictors;

Evaluate the predictive performance of the regional model;

Compare the regional model to the existing model of Sælthun et al. (1997).
In order to achieve these aims, we build upon the work of Dyrrdal et al. (2015) and use a hierarchical Bayesian model to spatially describe the parameters of a generalized extreme value (GEV) distribution for annual maximum daily discharge in 203 catchments in Norway. For comparison, we also fit a local extreme value distribution to the discharge series at individual sites, using a threeparameter GEV distribution if more than 50 years of data are available and the simpler two parameter Gumbel distribution if between 20 and 50 years of data are available, following the current guidelines for statistical FFA in Norway (Midttømme et al., 2011).
The remainder of the paper is organized as follows: Section 2 and 3 present the data and the regional GEV model, detailing the Bayesian hierarchical framework and Markov sampling, respectively. Section 4 gives an overview of the regional model currently used in Norway, and Section 5 outlines the setup for model validation. Section 6 presents first the resulting model and discusses model validation in terms of reliability and stability, then explores individual return levels for certain stations, and ends with a comparison of the model to the current regional and local models. The final Section 6 contains some concluding discussion and details of the Bayesian inference algorithm are given in the appendix.
2 Data
The flood data consist of annual maximum floods from 203 streamflow stations of the Norwegian hydrological database “Hydra II” with at least 20 years of quality controlled data for periods with minimal influence from river regulations (see Engeland et al. (2016) for details). For all gauging stations, we extracted a set of catchment properties as listed in Table 1. Climatological temperature and precipitation information are derived from the km daily data product SeNorge available at www.senorge.com. Histograms for record length, catchment areas, lake percentage, mean annual temperature and precipitation, and the contribution of rain to floods is shown in Figure 1. Figure 2 shows the spatial distribution of mean annual precipitation and temperature, mean annual maximum floods and rain contribution.
Explanatory variables  Description 

Longitude  
Latitude  
Effective lake  Percent of effective lake 
Average fraction of rain  Average relative contribution of rain (vs. snowmelt) in the floods 
Catchment area  Total area of catchment, also including parts outside Norway 
Inflow  Average inflow per year 
Average rain in April  Average over the period 1960–1990 
Average rain in August  Average over the period 1960–1990 
Snow melting in March  Average over the period 1960–1990 
Catchment gradient  Difference in height meters between the 20th and 90th percentile of the gradient profile, standardized by the total catchment length 
Exposed bedrock  Percent of mountainous area 
Relative catchment area  Total area divided by catchment length 
The catchment areas range between 50 and 18 110 km with a median of 247 km. The presence of lakes influences flood sizes, and 49.1% of the catchments have more than 1% of the catchment area (effectively) covered by lakes. For these catchments, the median effective lake percentage is 2.73%. The mean annual precipitation ranges from 291 to 3571 mm with a median of 732 mm. We see a strong westeast gradient in the spatial distribution with the highest precipitation on the west coast. The mean annual temperature ranges from 4.6 to 6.0 C with a median of 0.1 C. The temperature is influenced by elevation as well as latitude in that it decreases with both elevation and latitude. The relative contribution of rain is estimated by calculating the ratio of accumulated rain and snowmelt in a time window prior to each flood and then averaging these ratios for all floods, see Engeland et al. (2016) for details. Rainfall gives the major contribution to floods in most coastal catchments, whereas snowmelt is important in inland, northern and high altitude catchments. The typical flood season for catchment dominated by snowmelt floods is spring and early summer while no clear seasonal patterns are seen for the catchments dominated by rain floods.
The flood records and the associated catchment properties (catchment area, record length, mean annual runoff and several other catchment descriptors) are available as supplementary materials.
3 Methods
Extreme value theory provides a framework for modeling the tail of probability distributions. Let
denote continuous, univariate random variables assumed to be independent and identically distributed. If the normalized distribution of the maximum
converges as , then it converges to a generalized extreme value (GEV) distribution (Fisher and Tippett, 1928; Jenkinson, 1955). For this reason, a GEV distribution is commonly used to model block maxima (the maxima over equally sized blocks of data) such as the annual maximum. See Coles (2001) for an introduction to the statistical application of extreme value theory.In Norway, the threeparameter GEV distribution or a special case thereof, the twoparameter Gumbel distribution, are recommended for analyzing long data series from individual stations as these models have been found to provide the best fit for Norwegian data (Midttømme et al., 2011; Castellarin et al., 2012). We have thus chosen to base our regional model on the threeparameter GEV distribution. Alternative models may provide a better fit in other regions. An overview of methods that are used for operational flood frequency analysis in Europe is given in Castellarin et al. (2012).
3.1 Regional GEV model
3.1.1 Model formulation
The data are given by series of annual maximum floods. Denote by the maximum flood in year at station , the set of all stations in the data set, where is the number of annual floods observed at station . We assume the floods to be independent and identically distributed over time with site specific covariates
The three parameter GEV distribution is here parametrized in terms of the location , inverse scale , and shape . The distribution is usually parametrized with the scale parameter , rather than the inverse scale (Coles, 2001), but the current parametrization is common in the Bayesian setting, see e.g. Dyrrdal et al. (2015). The density is given by
(1) 
for with
when . For the density is given by the Gumbel distribution
for with
For the purposes of dam safety, we are interested in estimates of certain high quantiles of the resulting GEV distribution at site . The tail behavior is driven by the value of the shape parameter and generally falls in three classes: The Fréchet type () has a heavy upper tail, the Gumbel type () is characterized by a light upper tail, while the Weibull type () is bounded from above. The shape parameter thus provides vital information on the statistical properties of the variable of interest and is, concurrently, difficult to estimate due to the involved parametric form of the density in (1) as a function of . To estimate the quantile of the resulting GEV distribution function, we employ the GEV quantile function
(2) 
That is, is the return level associated with the return period at site .
3.1.2 Bayesian hierarchical framework
The model in (1) assumes a set of sitespecific parameters at each station . The spatial variability is the result of a number of factors related to the variation in terrain and climate which we aim to capture through the covariates listed in Table 1. Each of the parameters , and is specified by a linear model, e.g. for the location parameter
(3) 
where the first covariate in the vector
is a constant equal to . Here, we assume that for a fixed model in which some of the elements of are assumed to take real values, while others may be restricted to be equal to zero. The constraint implies that the th covariate does not influence the location parameter under the model . In addition to perform inference over the parameter vector , we thus also perform inference over the set of possible models through Bayesian model averaging as described in Dyrrdal et al. (2015, Section 3.3). We can then assess the posterior marginal inclusion probabilities of each covariate.The linear model in (3) assumes that the variability in the GEV parameters across is fully captured by the variability in the covariates, but in practice there may be additional heterogeneity not directly captured by . To account for this, we include a sitespecific random factor for each of the parameters resulting in the model
where the random terms are independent and given by a zero mean Gaussian prior distribution. Our full model can thus be written as
(4)  
with
. We aim to use uninformative priors, with a gamma distribution for the precision
and independent standard normal priors for for .The model in (3.1.2) is a slight simplification of the model discussed by Dyrrdal et al. (2015) as it assumes the random factors for to be independent across the stations . The inference over the parameters can thus be performed as described in Dyrrdal et al. (2015) using an appropriate modification of the associated package SpatGEVBMA (Lenkoski, 2014), available in R (R Core Team, 2016). The model in Dyrrdal et al. (2015) assumes an identity link on the precision parameter
. The extension to a logarithmic link requires the calculation of new proposal distributions for the Markov chain Monte Carlo (MCMC) algorithm. This is described in the appendix.
3.1.3 Posterior return levels
We run a Markov chain to return a collection of samples
(5) 
where is typically in the range of 50 000 to 100 000, with a suitable number of burnin samples removed, i.e. the first 10 000 to 20 000 samples. This yields a Markov sample of the parameter set , including both fixed and random effects. The sample of the corresponding GEV distributions directly yields, by using the GEV quantile function in (2), a sample of quantiles
This sample approximates, , the posterior distribution of the th return level at the site . Here, denotes the vector of maximum floods at site for all years
for which measurements are available. Given this sample, it is straightforward to derive approximations for the posterior mean and median with pointwise posterior credible intervals for the
th return level.The Markov sample also constitutes a mixture distribution based on the GEV density function in (1)
(6) 
approximating the posterior predictive distribution of a future observation
. Such mixture distributions do not have an explicit quantile function and the return level is found by simulating a number of observations from each mixture component, using the empirical quantile of observations pooled over all mixture components as an approximation. Due to the large number of Markov samples, only around 10 to 100 simulated observations from each component are needed to achieve a good approximation of the true return level. The Equation (6) implies BMA, given that the predictive distribution is averaged of different models.Now assume we are interested in estimating the th return level at a new site not used to estimate the model, but for which the covariates listed in Table 1 are available. The th return level of the posterior predictive distribution for site is given by the empirical th quantile found when combining simulated observations from all mixture components, based on the fixed effects , and . The quantile of the predictive distribution will be close, but not identical, to (the approximation of) the median of the posterior distribution of the th return level.
The uncertainty is quantified by the pointwise 80 % posterior credible intervals of the quantiles corresponding to a posterior sample of the three GEV parameters, given as follows

sample ;

set ,
and similar for and . The additional sampling of random effects gives a better (outofsample) calibration. Note that in general there is a higher level of uncertainty for a new site , than a site used to estimate the model, due to the independent sampling step in (i). This additional sampling of the random effects ensures that the uncertainty is not underestimated.
3.2 Local GEV model
We compare the regional model to a local nonhierarchical GEV model for onsite analysis without any spatial structure. In this setting, we assume that the annual maxima are described by Equation (1) with independent parameters, , and , not sharing informative covariates. The unknown parameters are estimated by MCMC within a Bayesian framework such that , and is updated in turn, see Steinbakk et al. (2016) for details on the prior distributions and their updates. For stations with data series less than 50 years, the shape parameter is assumed to be zero and a Gumbel distribution is fitted instead of the GEV distribution according to FFA recommendations for Norway (Midttømme et al., 2011; Castellarin et al., 2012). Figure 3 shows the spatial distribution of the estimated parameters from the local GEV model. The map of the shape parameter reveals that a considerable proportion of sites are estimated using a Gumbel distribution due to short data series.
4 Current regional model for Norway
The Norwegian Water Resources and Energy Directorate (NVE) is responsible for determining guidelines for flood frequency analysis in Norway. The current regional framework for estimating flood return levels were established by Sælthun et al. (1997), see also Castellarin et al. (2012). The return values are found by first estimating the index flood based on geographical and hydrological parameters, and second, using a growth curve to scale up the index flood to a certain design flood. The two steps are determined by different procedures. Both the model for index flood and the growth curve are based on initially dividing sites into distinct flood types (spring, autumn, glacier or all year floods) with geographical regions; four different zones for spring floods and three zones of autumn floods. Each subdivision then has a separate regression model and relevant covariates. An overview of the range of covariates, with descriptions, used in the separate regression models is given in Table 2.
Parametername  Unit 

Catchment area  km 
Mean specific annual runoff  per 
Mean annual precipitation  mm 
Effective lake  % 
Exposed bedrock  % 
Catchment length  km 
Gradient of the main river  m/km 
One difficulty with the current regional model is the lack of a rigorous definition of the flood regions. It can be difficult to determine the appropriate region of a new site, and the estimated return values vary with the choice of region. Hence it will be highly beneficial to model the regional or spatial characteristics of floods as a continuum in a more complex regression model. In addition, the growth curve is only available for certain predetermined return periods.
5 Model validation
To validate the models we follow Renard et al. (2013) assessing reliability, or calibration, and stability. Reliability describes the consistency between validation data (data not used for calibration) and FFA predictions. A reliable, or wellcalibrated, model should yield an estimated distribution close to the unknown true distribution of the data. Stability, on the other hand, quantifies the ability of the model to yield similar estimates when calibration data change.
We assess the predictive power of the regional model through a crossvalidation study, such that reliability is assessed through the consistency between predictions and holdout data. Due to the heavy computational burden of the Markov sampling in the hierarchical model, a smaller number of sites, specifically 27 stations, were selected by experts to represent the range of different sites in Norway. We employ a leaveoneout crossvalidation scheme, where all data for each of the 27 stations, in turn, are left out of the model fitting. The distribution of the random effects gives the main difference between insample and outofsample predictions, as the insample parameter estimates allow for, and usually have, correlated random effects, while the random effects are independently drawn for the outofsample estimates.
5.1 Reliability
The main reliability assessment tool is the probability integral transforms (PITs), displayed graphically by histograms and probabilityprobability (PP) plots. If observations follow the estimated distribution, the PIT will be uniformly distributed
(Dawid, 1984)Histograms of PIT values can be assessed at a local level if the data series are long, whereas histograms with too few observations do not allow for a useful graphical assessment. Instead one should assess regional average reliability by combining PIT values from several or all locations in a single histogram. The reliability of individual stations is assessed by PP plots displaying the observed empirical distribution of PIT values against the theoretical uniform distribution. The values should to the largest degree follow a onetoone line, and deviation will indicate bad reliability such as over or underestimation compared to observed values.
We assess the reliability in the tail of the predictive distribution using proper scoring rules, in particularly the quantile score, see e.g. Friederichs and Hense (2007), Gneiting and Raftery (2007) and references therein. If denoting the predictive distribution by and the realized observation by , the quantile score is given by
for a specific probability level . Alternatively, one could use the weighted continuous ranked probability score integrating over all quantiles greater than some threshold, say the 50year return level (Gneiting and Ranjan, 2011). But as the current regional model used in Norway is only (easily) available for certain predetermined return periods, models are compared using the quantile scores.
While histograms and PP plots of PIT values only assess the reliability of the predictions, scoring rules such as the quantile score simultaneously assess several aspects of the predictive distributions, see e.g. Stephenson et al. (2008) and Bentzien and Friederichs (2014). That is, by conditioning (stratifying) on the predicted probabilities, the scores may be decomposed into the sum of three components: reliability, resolution and uncertainty. The resolution is related to the information content of the prediction model. It describes the variability of the observations under different forecasts and indicates whether the prediction model can discriminate between different outcomes of an observation. In our setting, a prediction model with a good resolution is able to discriminate between locations with different flood characteristics while a prediction model with no resolution would issue the same predictive distribution at all locations. The uncertainty component refers to the variability in the observations and is thus identical for competing models when assessed under the same data set.
5.2 Stability
Within a comparison framework, Renard et al. (2013) advised to first assess reliability, and then use stability to further discriminate between models if several models are equally reliable. The stability quantifies to which degree the statistical model yields similar predictive distributions when trained on different, but identically distributed, data sets. This is a property solely of the statistical model, thus arbitrarily large return periods can be assessed. A general procedure is as follows: First, the statistical model is fitted using all available data, yielding an estimated predictive distribution , followed by estimation on a set of leaveoneout scenarios, yielding estimated predictive distributions . Then each is compared to using some measure of divergence, giving an average or maximum divergence for . We will assess the stability of our model through the variability of parameter estimates, in particular the intercept term, over the leaveoneout crossvalidation scheme.
6 Results
This section shows the results from the aforementioned models to predict annual flood maxima in Norway. We first assess the reliability of the regional model following the crossvalidation study and then present indepth results for certain stations, selected to showcase the range of individual site behavior.
6.1 Bayesian regional model
Relevant covariates for the regional model were first explored by assessing relationships between covariates and the estimated , and from the local model. This revealed that some covariates needed to be transformed, or be combined into variables. To decide on the specific set of covariates, we used stepwise linear regression optimizing AIC (Siotani et al., 1985), with the index (mean level) flood for each station as the response. The resulting best model contained 13 covariates, seen in Table 1, and the selected variables overlap to a large degree with the covariates used by the current regional mode, for details see Table 2. New covariates not considered in the current framework are the average fraction of rain versus snowmelt, the meteorological variables (i.e. rain in April and August and snowmelt in March) and, in particular, longitude and latitude. The regional model was run using 100,000 MCMC iterations and 20,000 burnin samples, with the posterior marginal inclusion probability of each covariate given in Table 3.
Constant  100  100  100 
Longitude  53  99  6 
Latitude  84  100  8 
Percent of effective lake  98  100  2 
Inflow  6  11  9 
Average fraction of rain  3  22  58 
Catchment area  2  5  12 
Average rain in April  42  75  5 
Average rain in August  100  100  5 
Snow melting in March  8  16  4 
Catchment gradient  45  12  2 
Percent of bedrock  22  8  2 
Relative area to length  13  95  11 
It is seen that spatial location of a station, in terms of longitude and latitude, is crucial for all GEV parameters, thereby reflecting the importance of flood regions and the spatial similarity between flood sites. Maps of the median estimate of the fixed effects for , and are shown in Figure 4. There are clear spatial structures with an east to west and north to south gradient in both the location and the scaling parameter, while the shape parameter mainly displays a difference between coast and inland. From Figure 3 one can see that the estimated fixed effects for and agree with the distribution of the parameters from the local model, while the pattern of is not observed in the local model. These results highlight that pooling of information across catchment sites via the linear regression model is highly beneficial.
Further, both the parameters and are explained by the percentage of effective lake, as a dominant lake in the catchment area can dampen both the mean flood level and the yearly variations, and by the average rain in April and August, as more rain will increase both average flood levels and variability. The derived covariate, catchment area divided by the basin length, is seen to significantly affect the scaling , suggesting that “long” catchments (elongated along the length) experience less variability possibly due to a dampening effect. The percent of bedrock and the catchment gradient, on the other hand, influence mainly the location parameter , as a larger degree of mountainous terrain within a catchment and a steeper gradient will increase the average flood level. Lastly, the shape parameter is mainly explained by the average fraction of rain in the flood, and to a smaller degree the catchment area, the relative area to length and the inflow. In areas with a smaller fraction of rain compared to snowmelt, the annual maximum flood is more often a spring flood caused by snowmelt, which will to a larger degree be limited by an upper threshold. This characteristic would be accounted for by a negative shape parameter.
Figure 5 shows the median estimates of random effects for the three parameters, per station, and could reveal whether additional spatial effects need to be accounted for. Overall, the random effects seem to be spatially independent, apart from scattered, but small, clusters present in all three parameters which could be due to river networks or regional catchment areas.
We validate the model following Section 5 and start by comparing histograms of aggregated PIT values for the outofsample and insample regional models, and the local model. Figure 6 shows that the local and insample regional models are well calibrated over the 27 selected stations, while the outofsample model is somewhat overdispersive. The PP plots in Figure 7 show that the regional model will severely over and underestimate the return levels of some outofsample stations, while the predictive distribution of most stations is seen to be well calibrated. We assess the stability of the model through the variability of the fixed effects in each of the three parameters. Figure 8 shows box plots of the mean regression coefficients over the 27 leaveoneout crossvalidation models, for each of the 13 selected covariates and the three parameters. All estimates are seen to be very stable, in particular for and , and only the coefficient estimates of the rain contribution and area for are slightly less stable.
To compare the fit of the different models with the observed data, we explore the results of six selected sites. Figure 9 shows the (outofsample) estimated return level plotted against the return period (on a log scale), for three sites with short (), medium () and long () data series. The figure shows the observed data (black dots), the median estimate of the local model (black lines), predictive distribution (dashed lines) and pointwise 80% credible bands (gray area) of the return level for the regional model and the current regional model (in dotted lines with squares). The displayed stations have been selected to reflect a good fit by the regional model to the observed data, such that the regional predictive distribution gives a 1000 year return level in agreement with the local model. It is seen that for these three selected sites the new regional model greatly improves on the regional model currently used in Norway. In addition, it is also clear that the current regional model stays within the 80 % credible bands of the new regional model.
Figure 10 shows the same outofsample return value plots for three sites with a short (), medium () and long () data series, but selected to reflect a bad agreement between the regional model and observed data. In all three sites the regional model overestimates the return levels compared to the local model, but in all three cases our regional model gives a better fit than the current regional model. Table 4 displays the average quantile scores for the outofsample predictions for the 27 selected stations, comparing the regional model to the local model and the regional model currently used by NVE. For all return periods, , the new regional model performs significantly better than the current regional model, but not as well as the local model.
NVE model  6.71  [6.23, 7.20]  4.04  [3.63, 4.46]  3.17  [2.80, 3.55] 

Regional model  2.94  [2.72, 3.17]  1.01  [0.87, 1.17]  0.62  [0.51, 0.75] 
Local model  1.95  [1.80, 2.11]  0.63  [0.57, 0.70]  0.36  [0.32, 0.40] 
7 Discussion
We have, in accordance with our main objective, developed a regional model for extreme flood estimation to be used when little or no data are available at a catchment site. The model is a Bayesian hierarchical framework with sitespecific GEV parameters based on geographical, hydrological and meteorological covariates, such that information is shared across catchment sites. We have identified 12 important predictors of this type to describe the local variation in the model parameters. By evaluating quantile scores for different return periods, it is seen that the Bayesian regional model gives better predictive performance than the current regional model used by NVE in Norway. Recent work by Yan and Moradkhani (2016) also supports that methods utilizing Bayesian hierarchical models and model averaging are beneficial for analyzing extreme flood data when emphasizing the quantification of extreme flood uncertainty.
While the new regional model significantly improves upon the current NVE model, a reliability analysis at 27 outofsample stations reveals that the model is, overall, somewhat overdispersive. A further analysis of the performance at individual stations indicates that the selected set of covariates may not be able to pick up the locationspecific effects for all locations. Among the selected stations, the regional model tends to more often provide a higher estimates of flood sizes for high return periods than the local model. The Viksvatn station by the Gaular river in Sogn is one example of a catchment site where the regional model highly overestimates the flood distribution compared to the local model which is based on 113 years of observed annual floods, as seen in Figure 10. However, in other unrelated analyzes, we have found that it appears that recorded level of rain at catchment sites in the Sogn area are consistently estimated too high. As the average rain in August is one of the most influential predictors of the overall flood level, any overestimation of this covariate will have an extreme impact on predicted return levels. This highlights the importance of good data quality for achieving reliable return level predictions.
The covariate describing the average contribution of rain and snowmelt in the floods is found to be the best predictor of the shape parameter . However, the exact calculation of this covariate requires some observed flood data. While the covariate may be estimated solely based on available meteorological data products such as the SeNorge data product, it remains to be assessed how much additional uncertainty is introduced.
Acknowledgments
This work was jointly funded by The Research Council of Norway and Energy Norway through grant 235710 (FlomQ). The flood and hydrological data were extracted from the national hydrological database at the Norwegian Water Resources and Energy Directorate and are listed in the supplementary materials. Climatological data, derived from the data product SeNorge, are available at www.senorge.com.The source code for estimating the flood frequency distributions is implemented in the statistical programming language R (http://www.Rproject.org) as a part of the R package SpatGEVBMA
which is available on GitHub at http://github.com/NorskRegnesentral/SpatGEVBMA.
References
 Ball et al. (2016) Ball, J., M. Babister, R. Nathan, W. Weeks, E. Weinmann, M. Retallick, and I. Testoni (2016). Australian rainfall and runoff: A guide to flood estimation. Commonwealth of Australia.
 Bentzien and Friederichs (2014) Bentzien, S. and P. Friederichs (2014). Decomposition and graphical portrayal of the quantile score. Quarterly Journal of the Royal Meteorological Society 140, 1924–1934.
 Castellarin et al. (2012) Castellarin, A., S. Kohnová, L. Gaál, A. Fleig, J. L. Salinas, A. Toumazis, T. R. Kjeldsen, and N. Macdonald (Eds.) (2012). Review of applied statistical methods for flood frequency analysis in Europe. NERC/Centre for Ecology & Hydrology. ISBN:9781906698324.2012.
 Coles (2001) Coles, S. (2001). An introduction to statistical modeling of extreme values. Springer, London.
 Cooley et al. (2007) Cooley, D., D. Nychka, and P. Naveau (2007). Bayesian Spatial Modeling of Extreme Precipitation Return Levels. Journal of the American Statistical Association 102(479), 824–840.
 Dalrymple (1960) Dalrymple, D. (1960). Floodfrequency analysis. Technical Report Supply Paper 1543A, Geological Survey Water. Available at https://pubs.usgs.gov/wsp/1543a/report.pdf.
 Dawid (1984) Dawid, A. P. (1984). Statistical theory: The prequential approach (with discussion and rejoinder). Journal of the Royal Statistical Society Ser. A 147, 278–292.
 Dyrrdal et al. (2015) Dyrrdal, A. V., A. Lenkoski, T. L. Thorarinsdottir, and F. Stordal (2015). Bayesian hierarchical modeling of extreme hourly precipitation in norway. Environmetrics 26, 89–106.
 Engeland et al. (2016) Engeland, K., L. Schlichting, F. Randen, K. S. Nordtun, T. Reitan, T. Wang, E. Holmqvist, A. Voksø, and V. Eiden (2016). Utvalg og kvalitetssikring av flomdata for flomfrekvensanalyser. Technical Report 852016, NVE.
 Fawcett and Walshaw (2016) Fawcett, L. and D. Walshaw (2016). Seasurge and wind speed extremes: optimal estimation strategies for planners and engineers. Stochastic Environmental Research and Risk Assessment 30(2), 463–480.
 Fisher and Tippett (1928) Fisher, R. A. and L. H. C. Tippett (1928). Limiting Forms of the Frequency Distribution of the Largest and Smallest Member of a Sample. Proceedings of the Cambridge Philosophical Society 24, 180–190.
 Friederichs and Hense (2007) Friederichs, P. and A. Hense (2007). Statistical downscaling of extreme precipitation events using censored quantile regression. Monthly weather review 135(6), 2365–2378.
 Glad et al. (2014) Glad, P. A., T. Reitan, and S. Stenius (2014). Regionalt formelverk for flomberegning i små nedbørfelt. Technical Report 622014, NVE, Oslo.
 Gneiting and Raftery (2007) Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378.
 Gneiting and Ranjan (2011) Gneiting, T. and R. Ranjan (2011). Comparing density forecasts using thresholdand quantileweighted scoring rules. Journal of Business & Economic Statistics 29(3), 411–422.
 Hosking and Wallis (1997) Hosking, J. R. M. and J. R. Wallis (1997). Regional frequency analysis. Cambridge University Press, UK.
 Jenkinson (1955) Jenkinson, A. F. (1955). The Frequency Distribution of the Annual Maximum (or Minimum) values of Meteorological elements. Quarterly Journal of the Royal Meteorological Society 81, 158–171.
 Kochanek et al. (2014) Kochanek, K., B. Renard, P. Arnaud, Y. Aubert, M. Lang, T. Cipriani, and E. Sauquet (2014). A databased comparison of flood frequency analysis methods used in france. Natural Hazards and Earth System Sciences 14, 295–308.
 Lenkoski (2014) Lenkoski, A. (2014). SpatGEVBMA: Hierarchical spatial generalized extreme value (GEV) modeling with Bayesian Model Averaging (BMA). R package version 1.0.
 Midttømme et al. (2011) Midttømme, G., L. Pettersson, E. Holmqvist, Ø. Nøtsund, H. Hisdal, and R. Sivertsgård (2011). Guidelines for flood estimation (in Norwegian). Oslo: NVE. Retningslinjer nr. 4/2011.
 Parent and Bernier (2002) Parent, E. and J. Bernier (2002). Bayesian POT modelling for historical data. Journal of Hydrology 274, 95–108.
 R Core Team (2016) R Core Team (2016). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
 Reis and Stedinger (2005) Reis, D. and J. Stedinger (2005). Bayesian MCMC flood frequency analysis with historical information. Journal of Hydrology 313(1–2), 97–116.
 Renard et al. (2013) Renard, B., K. Kochanek, M. Lang, F. Garavaglia, E. Paquet, L. Neppel, K. Najib, J. Carreau, P. Arnaud, Y. Aubert, F. Borchi, J.M. Soubeyroux, S. Jourdain, J.M. Veysseire, E. Sauquet, T. Cipriani, and A. Auffray (2013). Databased comparison of frequency analysis methods: A general framework. Water Resources Research 49, 825–843.
 Rue and Held (2005) Rue, H. and L. Held (2005). Gaussian Markov random fields: theory and applications. CRC press.
 Sælthun et al. (1997) Sælthun, N., O. Tveito, T. Bønsnes, and L. Roald (1997). Regional flomfrekvensanalyse for norske vassdrag. Technical Report 1497, NVE.
 Siotani et al. (1985) Siotani, T., Y. Fujikoshi, and T. Hayakawa (1985). Modern multivariate statistical analysis, a graduate course and handbook. American Sciences Press.
 Steinbakk et al. (2016) Steinbakk, G. H., T. L. Thorarinsdottir, T. Reitan, L. Schlichting, S. Hølleland, and K. Engeland (2016). Propagation of rating curve uncertainty in design flood estimation. Water Resources Research 52(9), 6897–6915.
 Stephenson et al. (2008) Stephenson, D. B., C. A. S. Coelho, and I. T. Jolliffe (2008). Two extra components in the Brier score decomposition. Weather and Forecasting 23, 752–757.
 Wilson et al. (2010) Wilson, D., H. Hisdal, and D. Lawrence (2010). Has streamflow changed in the Nordic countries? – recent trends and comparisons to hydrological projections. Journal of Hydrology 394, 334–346.
 Yan and Moradkhani (2016) Yan, H. and H. Moradkhani (2016). Toward more robust extreme flood prediction by bayesian hierarchical and multimodeling. Natural Hazards 91, 203–225. DOI:10.1007/s1106901520706.
Appendix A Hierarchical model with a link on the precision
This section discusses an extension to the MCMC algorithm described in Dyrrdal et al. (2015) where the regression equation for the precision parameter is defined with
as the response variable. In general, assume we want to update a parameter
in our model, where is the current value. We draw a new value from a proposal distribution and accept the proposal with probability whereHere, denotes the likelihood of the full data set which depends on and potentially other parameters which are kept fixed throughout, and is the prior distribution of which similarly might depend on the other parts of the model. Given the complexity of the model, it is vital to design efficient proposal distributions which return good proposals and are robust in that they do not require finetuning for each individual data set.
For designing the proposal distribution, we employ a Gaussian approximation (Rue and Held, 2005, Ch. 4.4). Assume that the posterior distribution of the parameter is written on the form
for some function . A quadratic Taylor expansion of the logposterior around the value gives
where and . The posterior distribution may now be approximated by
the density of the Gaussian distribution
. We thus choose as our proposal distribution. This implies that in order to update the model in Dyrrdal et al. (2015) to include a logarithmic link for the precision , we need to calculate the first two derivatives of the log likelihood function with respect to the random effect .a.1 The case
We have , where and . Now, fix and set . We then have
where
It follows that
and
a.2 The case
Using the same notation as above, we have
where
Note that
Thus