1 Introduction
In this century, extreme temperatures have affected 97 million people and caused over 40 billion Euro of economic damage (EM-DAT, 2019). In particular, heat stress has a detrimental effect on human health and human activities. In response, dedicated warning systems have been developed (Morabito et al., 2019; McGregor et al., 2015), requiring meteorological forecasts from which relevant indicators of the impact of heat stress on humans are derived (Di Napoli et al., 2019). A large number of indices of varying complexity exists, and even a comparison of 165 of them demonstrated that no single index is universally accepted (de Freitas and Grigorieva, 2017). In this paper, the focus is on two commonly used indices that can be easily derived from standard outputs of weather forecast models: the discomfort index (DI) and the indoor version of the wet-bulb globe temperature (WBGTid). Both are functions of temperature and dew point temperature, but they have been chosen because they differ in formulation and complexity, which may have an impact on the final findings.
In our study, forecasts of DI and WBGTid are derived from the ensemble prediction system run at the European Centre for Medium-Range Weather Forecasts (ECMWF). Ensemble predictions provide different scenarios of the future, recognising the uncertainties inherent in weather forecasting (Leutbecher and Palmer, 2008). The derived probabilistic forecasts are an essential tool to support forecast-based decision making (Fundel et al., 2019). It has already been established that heat index forecasts based on ensembles are skillful with respect to climatology up to 10 days ahead (Pappenberger et al., 2015). However, the value of these forecasts can be increased by correcting them through so-called statistical postprocessing, which accounts for deficiencies in representing the aforementioned uncertainties as well as for systematic biases. Postprocessing also adjusts for the fact that the forecasts represent an average over a spatial area (a grid cell) whilst being used and applied locally at individual locations.
Statistical postprocessing methods include parametric models such as Bayesian model averaging (BMA; Raftery et al., 2005) and ensemble model output statistics (EMOS; Gneiting et al., 2005), which provide full predictive distributions of the weather variables at hand. We focus on the EMOS approach, which specifies the predictive distribution by a single parametric law with parameters connected to the ensemble members via appropriate link functions. Postprocessing methods are often applied to individual weather quantities (Wilks, 2018), although the dependence structure between different weather variables is of crucial importance in the calculation of heat stress indicators. This dependence is acknowledged and handled by bivariate EMOS models (Schuhen et al., 2012; Baran and Möller, 2017) or by other approaches to joint calibration such as the ensemble copula coupling (ECC) method of Schefzik et al. (2013). More advanced methods exist (see e.g. Ben Bouallègue et al., 2016; Barnes et al., 2019), but are not investigated here.
This paper explores the fundamental question of whether it is more efficient to base the calibration on the two input forecasts (temperature and dew point temperature) or to calibrate the end product (the heat index) directly. We apply these two methodological approaches to the calibration of DI and WBGTid ensemble forecasts. On the one hand, we first calibrate ensemble forecast vectors of temperature and dew point temperature, either with a bivariate EMOS model or with ECC combined with independent univariate EMOS models; the calibrated DI and WBGTid forecasts are then derived from these calibrated input forecasts. On the other hand, we develop a new EMOS model for postprocessing DI and WBGTid ensemble forecasts directly, where the predictive PDFs follow a generalized extreme value (GEV) distribution. The benchmark method consists of a simple adjustment of the weather quantities to account for the scale mismatch between model output and point observations. A general overview of the applied procedures is given in Figure 1. The different approaches to postprocessing are ranked in terms of performance; their implementation and limitations are also discussed.
Finally, we envisage the situation where one is interested only in the prediction of the warning level category of a heat index. In this case, statistical calibration can be considered as a classification problem resulting in forecast probabilities for the different categories. Besides obtaining the predictive distributions using the above-mentioned calibration methods, we also test the forecast skill of a general machine-learning-based approach.
The paper is organized as follows. Section 2 provides the fundamental formulae for the calculation of DI and WBGTid, followed by a description of the temperature and dew point temperature data sets. The applied calibration methods (including the novel GEV EMOS approach) are given in Section 3, and the verification procedure is discussed in Section 4. Section 5 presents the results of our study devoting separate sections to bivariate and univariate postprocessing, and to calibration of discrete forecasts. Some figures are placed in the Appendix in order to improve readability. Finally, lessons learned and recommendations can be found in Section 6.
2 Data
2.1 Heat index definitions
We start with the formal definition of the two heat indices of interest, which are both functions of two weather quantities: temperature and dew point (DP) temperature. The discomfort index is calculated from temperature T (°C) and relative humidity RH (%) as
(2.1) DI = T − 0.55 (1 − 0.01 RH)(T − 14.5).
RH can easily be expressed using temperature and DP temperature T_d (°C) via the Magnus formula with the constants suggested by Sonntag (1990):
(2.2) RH = 100 · exp( 17.62 T_d / (243.12 + T_d) ) / exp( 17.62 T / (243.12 + T) ).
The indoor wet-bulb globe temperature is calculated as
(2.3) WBGTid = 0.7 T_pwb + 0.3 T,
where T_pwb denotes the psychrometric wet-bulb (PWB) temperature (°C). One can obtain this quantity using Bernard's formula (Bernard and Pourmoghani, 1999), which provides it as the solution of an implicit equation involving the saturation vapour pressures e(T_d) and e(T_pwb) (hPa) at the DP temperature T_d and the PWB temperature T_pwb, respectively (Lemke and Kjellstrom, 2012). It follows that WBGTid is also a function of temperature and dew point and can be computed using e.g. the HeatStress package of R (Casanueva, 2019).
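To make the definitions above concrete, the following Python sketch computes DI and WBGTid from temperature and dew point. The paper uses the HeatStress package of R; here the classical psychrometric relation, with an assumed pressure of 1013.25 hPa and an assumed psychrometric constant, stands in for Bernard's formula, so the wet-bulb values are indicative only:

```python
import math

def sat_vapour_pressure(t):
    """Saturation vapour pressure (hPa), Magnus formula, Sonntag (1990) constants."""
    return 6.112 * math.exp(17.62 * t / (243.12 + t))

def relative_humidity(t, td):
    """Relative humidity (%) from temperature and dew point (deg C), cf. (2.2)."""
    return 100.0 * sat_vapour_pressure(td) / sat_vapour_pressure(t)

def discomfort_index(t, td):
    """Discomfort index (deg C), cf. (2.1)."""
    rh = relative_humidity(t, td)
    return t - 0.55 * (1.0 - 0.01 * rh) * (t - 14.5)

def psychrometric_wet_bulb(t, td, pressure=1013.25, tol=1e-6):
    """Wet-bulb temperature (deg C) solving e(Td) = e_s(Tw) - 6.6e-4 * P * (T - Tw)
    by bisection; a stand-in for Bernard's formula, not the paper's exact method."""
    e_d = sat_vapour_pressure(td)
    lo, hi = td, t  # the wet bulb lies between dew point and temperature
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sat_vapour_pressure(mid) - 6.6e-4 * pressure * (t - mid) > e_d:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def wbgt_indoor(t, td):
    """Indoor wet-bulb globe temperature (deg C), cf. (2.3)."""
    return 0.7 * psychrometric_wet_bulb(t, td) + 0.3 * t
```

At saturation (T = T_d) both indices reduce to the air temperature, which provides a quick sanity check.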
2.2 Observations
Temperature and dew point observations are available at 1459 surface synoptic observation (SYNOP) stations in the mid-latitudes of Europe (latitudes 35°N – 65°N; longitudes 12.5°W – 42.5°E). The corresponding DI is derived by applying (2.1) and (2.2), whereas WBGTid is obtained from (2.3), where PWB is determined with the help of the HeatStress package. The selected data set covers the period from 1 May to 30 September 2017.
Both DI and WBGTid observations are negatively skewed. This should be taken into account in the choice of the parametric distribution family for EMOS postprocessing. The left-skewed character of both indices is visible in Figure 2, which shows the corresponding climatological histograms together with the PDFs of generalized extreme value (GEV), skew normal and split normal distributions (with matching means, variances and skewnesses). These three parametric distributions are potential candidates for EMOS modeling (see Section 3.3).
2.3 Ensemble forecasts
The operational ensemble run at ECMWF comprises 50 perturbed members. We focus on forecasts initialized at 12 UTC and consider lead times of 1 – 15 days. Each observation is associated with the forecast at the nearest grid point. In order to account for the difference between model and station elevations, an orographic correction proportional to the altitude difference between the station and the model representation is added (in °C) to both raw temperature and dew point forecasts. Temperature and dew point forecasts are then used to derive the corresponding raw ensemble forecasts of DI and WBGTid.
Raw forecasts provide information at the model grid resolution, whereas observations are point measurements. This scale mismatch leads to a representativeness error in the forecast. The raw forecast uncertainty, valid at the model-resolution scale, can be increased in order to capture the temperature variability at smaller spatial scales. The method followed here is a simple downscaling step, inspired by the pioneering work of Saetra et al. (2004): it consists of adding to each member a draw from a centered Gaussian distribution with standard deviation
(2.4)
This formula is derived from the analysis of 2 m temperature measurements of a high-density observation network (Ben Bouallègue, 2020). The spread adjustment is applied independently to the raw temperature and dew point ensemble forecasts, followed by a reordering of the adjusted forecasts according to the ranks of the raw forecasts before the calculation of the DI and WBGTid forecasts (see Figure 1). These corrected forecasts are referred to as the adjusted ensemble and are used as a reference for the computation of skill scores (see Section 4.2).
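A minimal sketch of this adjustment step, with a hypothetical constant standard deviation in place of (2.4):

```python
import numpy as np

def adjust_ensemble(raw, sigma, rng):
    """Add a Gaussian representativeness perturbation to each member, then
    reorder the perturbed values to match the rank order of the raw ensemble."""
    perturbed = raw + rng.normal(0.0, sigma, size=raw.shape)
    ranks = np.argsort(np.argsort(raw))       # rank of each raw member
    return np.sort(perturbed)[ranks]          # perturbed values, raw rank order

rng = np.random.default_rng(0)
raw_t = np.array([21.3, 19.8, 20.5, 22.1, 20.0])
adj_t = adjust_ensemble(raw_t, sigma=0.5, rng=rng)  # sigma: illustrative value
assert (np.argsort(adj_t) == np.argsort(raw_t)).all()  # ordering preserved
```

The reordering keeps each variable's member ranking intact, which is what later allows DI and WBGTid to be computed member by member.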
3 Calibration methods
Our main approach to calibration is the computationally efficient EMOS method, which has demonstrated excellent predictive performance for a wide range of weather quantities. In what follows, f_1, f_2, …, f_M denotes the ensemble forecast of a given weather quantity for a given location, time and lead time, and f̄ denotes the ensemble mean. S² and MD denote the ensemble variance and the ensemble mean absolute difference, respectively, defined as
S² = (1/(M − 1)) Σ_{m=1}^{M} (f_m − f̄)²   and   MD = (1/M²) Σ_{m=1}^{M} Σ_{n=1}^{M} | f_m − f_n |.
Multivariate ensemble forecasts are distinguished by bold notation f_1, f_2, …, f_M, and, in this case, the ensemble variance is replaced by the ensemble covariance matrix
S = (1/(M − 1)) Σ_{m=1}^{M} (f_m − f̄)(f_m − f̄)^⊤.
The 50 members of the operational ECMWF ensemble system are considered statistically indistinguishable and, in this sense, exchangeable. In the following, if the M ensemble members are divided into K exchangeable groups, where the kth group contains M_k ensemble members (M_1 + … + M_K = M), the notation f̄_k is used for the corresponding group mean (vector).
3.1 Normal EMOS model
Calibration of temperature and dew point ensemble forecasts is based on the EMOS model suggested by Gneiting et al. (2005). The predictive distribution is a Gaussian distribution with mean and variance linked to the ensemble members via
(3.1) N( a + b_1 f_1 + … + b_M f_M , c + d S² ),
where a ∈ ℝ and b_1, …, b_M, c, d ≥ 0. When an ensemble contains groups of statistically indistinguishable ensemble members, members within a given group should share the same parameters (Gneiting, 2014; Wilks, 2018), and the equation for the mean in (3.1) is replaced by
(3.2) a + b_1 f̄_1 + … + b_K f̄_K.
3.2 Bivariate normal EMOS model
The separate univariate EMOS calibration of temperature and dew point ensemble forecasts does not take into account the obvious dependence between these two weather quantities. In order to overcome this limitation, one can consider the joint postprocessing of these two variables using an EMOS model with a bivariate normal predictive distribution N₂(μ, Σ), where the mean vector μ and the covariance matrix Σ are expressed as
(3.3) μ = a + B_1 f_1 + … + B_M f_M   and   Σ = D D^⊤,
respectively. Here a ∈ ℝ², whereas B_1, …, B_M and D are two-by-two real parameter matrices, where D is lower triangular. This way, Σ is symmetric and nonnegative definite. In the case of groups of exchangeable ensemble members, the bivariate generalization of (3.2) is
(3.4) μ = a + B_1 f̄_1 + … + B_K f̄_K.
Note that a bivariate EMOS model with a different parametrization of the covariance matrix has already been successfully applied for the calibration of wind vector ensemble forecasts (Schuhen et al., 2012).
3.3 Generalized extreme value EMOS model
In contrast to temperature and dew point, for which the symmetric Gaussian distribution provides a reasonable fit, both DI and WBGTid observations display left-skewed histograms, calling for predictive PDFs with negative skewness. Potential candidates are a skew normal, a split normal or a GEV distribution (see Figure 2). Of these three competing laws, the GEV seems to best capture the tail behaviour of the observations (at least for the DI and WBGTid observations at hand).
A GEV distribution with location μ, scale σ > 0 and shape ξ ≠ 0 can be characterized by its cumulative distribution function (CDF)
(3.5) F(x) = exp{ − [ 1 + ξ (x − μ)/σ ]^{−1/ξ} } if 1 + ξ (x − μ)/σ > 0,
and zero otherwise, which for ξ < 1 has a finite mean. For calibrating ensemble forecasts of DI and WBGTid, we suggest modeling the location and scale parameters by
(3.6) μ = a + b_1 f_1 + … + b_M f_M   and   σ = c + d MD,
with b_1, …, b_M, c, d ≥ 0, whereas the shape parameter ξ is considered to be independent of the ensemble members. Similar to (3.2) and (3.4), the exchangeable version of (3.6) considers the location as an affine function of the group means:
(3.7) μ = a + b_1 f̄_1 + … + b_K f̄_K.
The expression of the mean (or location) as an affine function of the ensemble is a general feature of EMOS postprocessing, regardless of the weather quantity at hand (see e.g. Gneiting et al., 2005; Thorarinsdottir and Gneiting, 2010; Baran and Nemoda, 2016), whereas the dependence of the scale on the ensemble mean absolute difference is similar to the censored GEV model of Scheuerer (2014) for calibrating precipitation ensemble forecasts. Exploratory tests with the data set at hand show that the proposed model significantly outperforms the GEV EMOS model with scale depending on the ensemble mean, that is, with σ = c + d f̄ (see e.g. Lerch and Thorarinsdottir, 2013), and results in slightly (and not always significantly) better predictive performance than its natural modifications involving the ensemble variance S², such as σ = c + d S or σ² = c + d S².
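The predictive GEV of (3.6) can be sketched with SciPy; note that SciPy's genextreme parametrizes the shape as c = −ξ relative to the convention of (3.5), and all parameter values below are illustrative, not fitted values from this study:

```python
import numpy as np
from scipy.stats import genextreme

def ensemble_mean_difference(f):
    """MD: mean absolute difference over all member pairs."""
    f = np.asarray(f)
    return np.abs(f[:, None] - f[None, :]).mean()

def gev_emos_predictive(f, a, b, c, d, xi):
    """Predictive GEV with location a + b*mean and scale c + d*MD, cf. (3.6)."""
    loc = a + b * np.mean(f)
    scale = c + d * ensemble_mean_difference(f)
    return genextreme(c=-xi, loc=loc, scale=scale)  # SciPy shape is -xi

ens = np.array([26.1, 27.4, 25.8, 28.0, 26.9])
dist = gev_emos_predictive(ens, a=0.5, b=1.0, c=0.3, d=0.8, xi=-0.2)
p_exceed = 1.0 - dist.cdf(29.0)  # probability of exceeding a warning level
```

A negative ξ gives an upper-bounded, left-skewed law, in line with the observed index histograms.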
3.4 Parameter estimation
Parameters of the normal and GEV EMOS models described in Sections 3.1 – 3.3 are estimated according to the optimum score estimation principle of Gneiting and Raftery (2007): the estimates of the unknown parameters are obtained by optimizing the mean value of a proper scoring rule over an appropriate training dataset. The most popular proper scoring rules are the logarithmic (or ignorance) score (LogS; Good, 1952), that is, the negative logarithm of the predictive PDF at the verifying observation, and the continuous ranked probability score (CRPS; Gneiting and Raftery, 2007; Wilks, 2011). The CRPS corresponding to a (predictive) CDF F and a real value (observation) y is defined as
(3.8) CRPS(F, y) = ∫_ℝ ( F(x) − 1{x ≥ y} )² dx = E| X − y | − (1/2) E| X − X′ |,
where 1_A denotes the indicator of a set A, whereas X and X′ are independent random variables with CDF F and finite first moment. The last representation in (3.8) implies that the CRPS can be expressed in the same unit as the observation. Both CRPS and LogS are negatively oriented scoring rules (the smaller, the better), and optimization with respect to the LogS gives back the maximum likelihood (ML) estimates of the parameters. Furthermore, the CRPS for both normal and GEV distributions can be expressed in closed form (for the exact formulae see Thorarinsdottir and Gneiting (2010) and Friederichs and Thorarinsdottir (2012), respectively), thereby allowing for efficient parameter estimation procedures that are often more robust than the ML approach.
The logarithmic score is obviously well defined also in the case of multivariate predictive distributions. Further, the direct multivariate extension of the CRPS is the energy score (ES) introduced by Gneiting and Raftery (2007). Given a CDF F on ℝ^d and a d-dimensional vector y, the energy score is defined as
(3.9) ES(F, y) = E‖ X − y ‖ − (1/2) E‖ X − X′ ‖,
where ‖·‖ denotes the Euclidean norm and, similar to the univariate case, X and X′ are independent random vectors with distribution F. In most cases (including the Gaussian model), the ES cannot be given in closed form, so it is replaced by a Monte Carlo approximation based on a large sample drawn from F (Gneiting et al., 2008). In our study, we apply ML estimation for the bivariate models, whereas the parameters of all univariate EMOS models are obtained by optimizing the CRPS. In the latter case, tests using ML estimation have not demonstrated significantly better results.
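As an illustration of CRPS-based estimation, the closed-form Gaussian CRPS can be minimized numerically over a training set. The sketch below uses synthetic data and the model (3.1) for a single exchangeable group; positivity of the variance coefficients is enforced by squaring, and the optimizer choice is an assumption, not the paper's implementation:

```python
import numpy as np
from math import erf, exp, pi, sqrt
from scipy.optimize import minimize

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast (Gneiting et al., 2005)."""
    z = (y - mu) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / sqrt(pi))

def fit_emos(ens_means, ens_vars, obs):
    """Fit (a, b, c, d) of (3.1) by minimizing the mean CRPS over training data."""
    def mean_crps(theta):
        a, b, c, d = theta
        mu = a + b * ens_means
        var = 1e-6 + c ** 2 + d ** 2 * ens_vars  # squared: variance stays positive
        return np.mean([crps_normal(m, sqrt(v), y)
                        for m, v, y in zip(mu, var, obs)])
    return minimize(mean_crps, x0=[0.0, 1.0, 1.0, 1.0], method="Nelder-Mead").x

# synthetic training data: a warm-biased, underdispersive 20-member ensemble
rng = np.random.default_rng(1)
truth = rng.normal(20.0, 3.0, 300)
ens = truth[:, None] + 1.5 + rng.normal(0.0, 0.5, (300, 20))
a, b, c, d = fit_emos(ens.mean(axis=1), ens.var(axis=1, ddof=1), truth)
corrected_bias = np.mean(a + b * ens.mean(axis=1) - truth)
```

After the fit, the affine correction a + b f̄ removes most of the systematic bias that was built into the synthetic ensemble.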
The appropriate choice of training data is also an important issue in statistical postprocessing. In EMOS modelling, the standard approach is the use of rolling training periods, where the parameter estimates are obtained using ensemble forecasts and corresponding validating observations from the preceding calendar days. Once the appropriate training period length is chosen, there are two general approaches to the spatial selection of training data (Thorarinsdottir and Gneiting, 2010). In the local approach, model parameters are estimated using only training data of the given station, resulting in distinct parameter estimates for the different stations. The global (regional) approach uses training data from the whole ensemble domain, and all observation stations share the same set of EMOS parameters. Global modelling requires a far shorter training period but is usually unsuitable for large and heterogeneous datasets like the one at hand. Alternatively, one can combine the advantages of local and global parameter estimation using a semilocal approach (Lerch and Baran, 2017), in which the training data for a given station are enhanced with data from stations with similar characteristics. Following Baran et al. (2019), clustering-based semilocal modelling is considered, where parameters are estimated regionally within each cluster. The clusters are derived using k-means clustering of feature vectors depending both on the station climatology and on the forecast errors of the raw ensemble mean over the training period.
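The clustering-based semilocal selection can be sketched as follows. The feature vectors here are synthetic stand-ins for the climatological and forecast-error features used in the paper, and a basic Lloyd's k-means keeps the example dependency-free:

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Basic Lloyd's k-means; returns a cluster label for each row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(2)
n_stations = 200
features = np.hstack([
    rng.normal(18.0, 4.0, (n_stations, 5)),  # e.g. station climatology quantiles
    rng.normal(0.5, 1.0, (n_stations, 3)),   # e.g. mean raw-ensemble errors
])
z = (features - features.mean(axis=0)) / features.std(axis=0)  # standardize
labels = kmeans(z, k=5)
# training data for a station is then pooled over all stations in its cluster
```

Standardizing the features before clustering keeps climatological and error-based components on a comparable scale.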
3.5 Ensemble copula coupling
The ensemble copula coupling (ECC) approach of Schefzik et al. (2013) is a general multivariate postprocessing method where, after univariate calibration, the rank order information in the raw ensemble is used to restore the correlations between the calibrated forecasts. In our case, we first perform univariate EMOS calibration of temperature and dew point forecasts using the normal EMOS model of Section 3.1. Then we draw random samples of the same size as the raw ensemble from both individual EMOS predictive distributions (ECC-R; see e.g. Schefzik, 2016) and reorder them to follow the same rankings as the corresponding raw temperature and dew point forecasts. The resulting bivariate sample is considered a sample from a calibrated bivariate predictive distribution of temperature and dew point.
In practice, when applying the ECC approach, 20 samples of the size of the raw ensemble (50) are drawn from the EMOS predictive distributions of temperature and dew point. Each sample is then reordered according to the rank structure of the corresponding raw ensemble forecasts resulting in 1000 draws from the ECC predictive distribution, which can be used for forecast evaluation. In a similar way, verification scores of the bivariate EMOS models are also estimated using 1000 samples from the bivariate predictive distributions.
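The ECC-R reordering step can be sketched as follows; the two Gaussian predictive distributions stand in for the fitted univariate EMOS outputs, and their parameters are invented for illustration:

```python
import numpy as np
from scipy.stats import norm

def ecc_reorder(raw, samples):
    """Reorder each row of `samples` so it follows the rank order of the
    corresponding row of the raw ensemble (arrays: n_vars x n_members)."""
    out = np.empty_like(samples)
    for i in range(raw.shape[0]):
        ranks = np.argsort(np.argsort(raw[i]))
        out[i] = np.sort(samples[i])[ranks]
    return out

rng = np.random.default_rng(4)
raw = rng.normal([[20.0], [15.0]], [[1.0], [1.2]], size=(2, 50))  # T, Td members
dists = [norm(20.5, 1.3), norm(15.8, 1.6)]  # hypothetical EMOS predictive laws
samples = np.vstack([d.rvs(50, random_state=rng) for d in dists])
ecc = ecc_reorder(raw, samples)
# the calibrated bivariate sample inherits the raw rank dependence
assert (np.argsort(ecc[0]) == np.argsort(raw[0])).all()
```

Because only ranks are copied, each margin keeps its calibrated distribution while the raw temperature–dew point dependence is restored.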
4 Verification
4.1 Proper scores
For a given lead time, competing probabilistic forecasts are compared with the help of the mean CRPS in the univariate case and the mean ES in the bivariate case, computed over all forecast cases in the verification data. The skill of univariate forecasts for dichotomous events (observation y exceeding a given threshold t) is assessed with the mean Brier score (BS). For a predictive CDF F, the Brier score is defined as
BS(F, y; t) = ( F(t) − 1{t ≥ y} )².
The integral of the BS over all possible thresholds results in the CRPS. The goodness of fit of probabilistic forecasts to extreme values of a univariate weather quantity is evaluated using the threshold-weighted continuous ranked probability score
twCRPS(F, y) = ∫_ℝ w(t) ( F(t) − 1{t ≥ y} )² dt,
where w(t) ≥ 0 is a weight function (Gneiting and Ranjan, 2011). The case w(t) ≡ 1 corresponds to the CRPS defined in (3.8). In the following, we set w(t) = 1{t ≥ r} in order to focus exclusively on values above a given threshold r.
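These scores can be approximated directly for an ensemble forecast by treating its empirical CDF as the predictive CDF; the sketch below uses an arbitrary threshold grid, so the numbers are approximate:

```python
import numpy as np

def brier_score(ens, y, t):
    """BS for the event {y <= t}, using the empirical CDF of the ensemble."""
    p = np.mean(np.asarray(ens) <= t)   # F(t)
    return (p - float(y <= t)) ** 2

def tw_crps(ens, y, r, n_grid=2001):
    """twCRPS with weight w(t) = 1{t >= r}, via trapezoidal integration of the
    weighted Brier scores over a grid covering the ensemble and the observation."""
    ens = np.asarray(ens, dtype=float)
    lo, hi = min(ens.min(), y) - 5.0, max(ens.max(), y) + 5.0
    grid = np.linspace(lo, hi, n_grid)
    vals = np.array([(t >= r) * brier_score(ens, y, t) for t in grid])
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)))

ens = [26.0, 27.5, 25.2, 28.1, 26.7]
crps_approx = tw_crps(ens, y=27.0, r=-1.0e9)   # w = 1 everywhere: plain CRPS
tail_score = tw_crps(ens, y=27.0, r=27.0)      # weight only above 27 deg C
```

With r below the whole grid, the integral reduces to the CRPS, which for an ensemble also equals E|X − y| − ½E|X − X′| computed over the members.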
4.2 Skill scores
The improvement in terms of a given score of a forecast f with respect to a reference forecast can be quantified using the corresponding skill score
(4.1) SS_f = ( S̄_f − S̄_ref ) / ( S̄_perf − S̄_ref ),
where S̄_f and S̄_ref denote the mean score values corresponding to the forecast f and the reference, respectively, and S̄_perf corresponds to the mean score value of a perfect deterministic forecast (here S̄_perf = 0). Based on (4.1), we compute the continuous ranked probability skill score (CRPSS), the energy skill score (ESS), the Brier skill score (BSS) and the threshold-weighted continuous ranked probability skill score (twCRPSS). Skill scores are positively oriented (the larger, the better), and in Section 5 the adjusted ensemble is used as the reference.
4.3 Calibration assessment
A simple graphical tool for assessing the calibration of univariate ensemble forecasts is the verification rank histogram: the histogram of the ranks of the verifying observations with respect to the corresponding ensemble forecasts (see e.g. Wilks, 2011, Section 8.7.2). In the case of a perfectly calibrated K-member ensemble, the ranks follow a uniform distribution on {1, 2, …, K + 1}, and the deviation from uniformity can be quantified by the reliability index defined by
(4.2) Δ = Σ_{j=1}^{K+1} | ρ_j − 1/(K + 1) |,
where ρ_j is the relative frequency of rank j (Delle Monache et al., 2006). The verification rank histogram can also be generalized to multivariate ensemble forecasts: Thorarinsdottir et al. (2016) suggested several options for defining ranks in a multivariate context. We apply the average ranking, given by the average of the univariate ranks, and the multivariate ordering proposed by Gneiting et al. (2008). In the case of a probabilistic forecast, one can plot the verification rank histogram and calculate the corresponding reliability index on the basis of a large number of ensembles sampled from the predictive PDF.
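A sketch of the rank-histogram diagnostic and the reliability index (4.2), on synthetic data where the ensemble and the observations come from the same distribution:

```python
import numpy as np

def observation_ranks(ens, obs, rng):
    """Rank of each observation within its ensemble (0..M), ties broken at random."""
    ranks = np.empty(len(obs), dtype=int)
    for i, (f, y) in enumerate(zip(ens, obs)):
        below = int(np.sum(f < y))
        ties = int(np.sum(f == y))
        ranks[i] = below + rng.integers(0, ties + 1)
    return ranks

def reliability_index(ranks, n_classes):
    """Delta of (4.2): sum of |relative rank frequency - 1/(K+1)|."""
    freq = np.bincount(ranks, minlength=n_classes) / len(ranks)
    return float(np.abs(freq - 1.0 / n_classes).sum())

rng = np.random.default_rng(5)
m, n = 10, 5000
ens = rng.normal(0.0, 1.0, (n, m))
delta_cal = reliability_index(
    observation_ranks(ens, rng.normal(0.0, 1.0, n), rng), m + 1)
delta_under = reliability_index(
    observation_ranks(ens, rng.normal(0.0, 2.0, n), rng), m + 1)
# an underdispersive ensemble yields a clearly larger reliability index
assert delta_under > delta_cal
```

In the calibrated case the rank frequencies stay close to 1/(K + 1), while the underdispersive case inflates the extreme ranks and hence Δ.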
Calibration of univariate predictive distributions is examined with the help of probability integral transform (PIT) histograms, the continuous counterparts of the verification rank histograms. By definition, the PIT is the value of the predictive CDF at the validating observation, with a possible randomization at points of discontinuity (Gneiting and Ranjan, 2013). In the case of perfect calibration, the PIT follows a uniform law on the [0, 1] interval.
4.4 Forecast value
The forecast value compares the mean expenses of users having access to different levels of information; it has the form of a skill score (see Section 4.2). The framework is that of a simple binary decision setting: facing a potentially threatening event (i.e. a heat index exceeding a warning level), a user can take preventive action or not. This protective action has a cost C, whereas the user endures a loss L if the event occurs but no protective action was taken in anticipation. The ratio C/L is called the cost-loss ratio and is considered a characteristic of the user in the decision-making process. The user takes action based on the forecast: if the forecast probability of the event is greater than the user's cost-loss ratio, then preventive action with cost C is taken; otherwise, the user accepts the risk of facing the loss L should the event occur. This rational behaviour is sometimes referred to as the Bayes action (see Rodwell et al. (2019) for a discussion of the concept and applications). When only climatological information is available, the user's cost-loss ratio is simply compared to the event base rate. In this framework, the forecast value is defined as the ratio of two mean expense differences involving the following elements: the mean expense of the user taking the Bayes action with respect to the forecast (Ē_f), the mean expense of a user making decisions based on the observation climatology (Ē_clim), and the mean expense of an omniscient user who takes action only when the event actually occurs (Ē_perf). Formally, the forecast value is defined as
V = ( Ē_clim − Ē_f ) / ( Ē_clim − Ē_perf ),
which can be expressed in a simple form using the elements of a contingency table (see e.g. Wilks, 2001). In our implementation, the mean expenses of the different types of users are computed independently at each station. This means that we consider a climatology (estimated as the sample base rate) varying from one station to another.
5 Results
We consider two different methodological approaches to the calibration of DI and WBGTid. In the first case, we start with joint postprocessing of temperature and dew point ensemble forecasts, either using directly a bivariate normal EMOS model or applying first univariate normal EMOS models to both components and then ECC to restore the dependence structure. Samples from the bivariate predictive distributions are then used to obtain samples from the predictive distributions of DI and WBGTid. The second approach is the direct postprocessing of DI and WBGTid ensemble forecasts with the help of the novel GEV model introduced in Section 3.3 (see Figure 1).
5.1 Bivariate calibration of temperature and dew point forecasts
We start with some considerations about the implementation of the EMOS approaches. The 50 ensemble members are regarded as exchangeable, forming a single group (K = 1); their separate univariate and joint bivariate calibration is performed using normal EMOS models with distribution means linked to the ensemble means via (3.2) and (3.4), respectively. Thus, the EMOS models require the estimation of four free parameters in the univariate case and nine in the bivariate case. A large number of free parameters calls either for long training periods in local estimation or for a semilocal approach. Both local and clustering-based semilocal parameter estimation with 5, 10, 15, 25, 50 and 75 clusters are tested. Clustering of stations is based on 40-dimensional feature vectors equally representing the properties of temperature and dew point forecasts and observations. In accordance with the results of Lerch and Baran (2017), the forecast skill does not really depend on the number of features applied, provided that one works with a sufficiently large number of features. Moreover, the longer the lead time of the postprocessed forecasts, the more training data are needed to outperform the raw ensemble forecasts (see the discussion in Feldmann et al., 2019). As a trade-off, the length of the training period is fixed to 60 days. This choice leaves 79 calendar days (period 14 July – 30 September 2017) as a verification period when taking into account the maximal lead time of 15 days.
First, calibration of temperature and dew point forecasts is performed separately. Figure 3 shows the CRPSS of different semilocal EMOS models with respect to local EMOS. Local postprocessing is optimal up to day 5; beyond that, semilocal approaches exhibit significantly better predictive performance. Hence, we propose to apply different EMOS configurations for different forecast lead times: local EMOS at the beginning of the forecast range and semilocal configuration(s) for longer lead times. Practically, local EMOS is considered for days 1 – 5 for both temperature and dew point, while for days 6 – 10 we use semilocal EMOS with 75 clusters. For dew point, this latter model is also kept for days 11 – 15, whereas for temperature the number of clusters is reduced to 5. The different mixtures of EMOS configurations are reported in Table 1. In the following, the method referred to as ECC relies on the reported mixed approach for the univariate postprocessing step.
Table 1: EMOS configurations (training data selection) as a function of the forecast lead time.

Postprocessing method       | Lead time (days)
Normal EMOS, temperature    | local (1 – 5), 75 clusters (6 – 10), 5 clusters (11 – 15)
Normal EMOS, dew point      | local (1 – 5), 75 clusters (6 – 15)
2D EMOS                     | local (1 – 4), 50 clusters (5 – 7), 5 clusters (8 – 15)
2D EMOS optimal             | local (1 – 5), 75 clusters (6 – 10), 5 clusters (11 – 15)
GEV EMOS, DI and WBGTid     | local (1 – 4), 75 clusters (5 – 8), 5 clusters (9 – 15)
In the same spirit as Figure 3, the optimal training configuration is investigated by inspecting the multivariate performance, in terms of ESS, of the semilocal bivariate EMOS models with respect to the local one. As reported in Table 1, optimal performance in terms of ES is obtained by applying local modelling for days 1 – 5, semilocal modelling with 75 clusters for days 6 – 10, and semilocal modelling with 5 clusters for longer lead times (referred to as 2D EMOS optimal). However, this mixture of EMOS configurations is suboptimal when applied to derive predictive distributions of DI and WBGTid (see Section 5.2). Better forecast skill can be obtained using local EMOS for days 1 – 4, semilocal EMOS with 50 clusters for days 5 – 7, and semilocal EMOS with 5 clusters until the end of the forecast range (referred to as 2D EMOS).
In Figure 4, the different multivariate postprocessing approaches are compared in terms of a multivariate skill score, the ESS. The bivariate EMOS and ECC postprocessing approaches significantly outperform both the raw ensemble and the adjusted ensemble up to day 9, while the advantage of the adjusted ensemble over the raw ensemble disappears only after day 10. All performance differences shrink with forecast lead time. Bivariate EMOS generally outperforms ECC, but the two methods have similar performance between day 5 and day 10. When the difference in mean ES between these two approaches is tested for significance, the proportion of stations with a significant difference (positive or negative) is less than 7% at day 1 but increases to more than 30% at day 10 (not shown).
Reliability of the bivariate forecasts is now investigated. Figure 5 presents reliability indices (corresponding to average ranking and multivariate ordering) as a function of the forecast lead time. The reliability indices are rather consistent with one another: similar patterns are found for the two types of ranking (average and multivariate). One exception is the difference between the raw and adjusted ensemble results, which is much smaller when average ranks are considered. The raw ensemble produces strongly U-shaped histograms at short lead times, while the ensemble adjustment improves reliability, in particular with respect to the multivariate ranking. 2D EMOS exhibits the best reliability (lowest reliability indices) up to day 8 and fairly good results for longer lead times. However, EMOS postprocessed forecasts are slightly biased towards the end of the forecast range, producing low multivariate ranks of the verifying observations (see Figures 10 and 11 of Appendix A). ECC forecasts are strongly hump-shaped at short lead times, rather well calibrated at medium lead times, and U-shaped at long lead times. The postprocessing approaches applied here fail to improve the reliability of the ensemble beyond day 10, which is in line with their predictive performance in terms of the ES.
In a nutshell, bivariate postprocessing can improve multivariate forecasts of temperature and dew point; in particular, reliability in a multivariate sense is increased. However, the postprocessing methods applied here do not explicitly account for the physical relationship between temperature and dew point. This limitation sometimes results in situations where a dew point forecast is greater than the corresponding temperature forecast. In that case, the former is set equal to the latter in order to avoid physical inconsistencies and to allow the computation of the heat indices. This situation occurs in less than 3% of the cases when applying 2D EMOS and can be more frequent when the ensemble forecasts are simply adjusted.
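The consistency fix described above amounts to a member-wise clipping, sketched here on invented values:

```python
import numpy as np

t_members = np.array([24.0, 25.1, 23.6])
td_members = np.array([22.5, 25.6, 23.0])       # second member exceeds temperature
td_members = np.minimum(td_members, t_members)  # set Td = T wherever Td > T
assert (td_members <= t_members).all()
```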
5.2 Postprocessing of discomfort index and indoor wetbulb globe temperature forecasts
Calibrated DI and WBGTid forecasts can be derived from the jointly calibrated temperature and dew point forecasts, but DI and WBGTid ensemble forecasts can also be postprocessed directly using the GEV EMOS model introduced in Section 3.3. The GEV model has only five free parameters in the case of a single group of exchangeable ensemble members. Considering a 60-day rolling training period and the same verification period (14 July – 30 September 2017) as before, we inspect the CRPSS of various semilocal EMOS configurations with respect to local EMOS for DI and WBGTid (not shown). Based on this exploratory analysis, we propose to adapt the EMOS configuration as a function of the forecast lead time as follows: for both indices, local EMOS is applied between days 1 and 4, and semilocal EMOS with 75 and 5 clusters for days 5 – 8 and 9 – 15, respectively. This model is referred to as GEV EMOS (see Table 1).
In Figure 6, results in terms of CRPSS indicate that postprocessing significantly improves both the raw and adjusted ensembles up to day 7 for DI and up to day 6 for WBGTid. For both indices, 2D EMOS postprocessed forecasts have the best predictive skill, followed closely by GEV EMOS forecasts, whereas ECC shows the worst predictive performance among the calibration approaches. However, the differences between the models are minor and often non-significant: for lead times up to day 7, the differences in CRPS between the competing EMOS forecasts are significant for only a small fraction of the stations. This fraction increases to more than 15% when the differences are computed against the adjusted ensemble forecasts (not shown).
Verification rank and PIT histograms help diagnose reliability issues in the forecasts. Results for DI are shown in Figure 12 of Appendix B, while the histograms for WBGTid, which are very similar, are not shown. For both indices, the raw ensemble forecasts are highly underdispersive at day 1; however, the dispersion improves with forecast lead time. Moreover, the raw forecasts exhibit a bimodal character, in line with the bivariate histograms of the raw temperature and dew point forecasts. Only local GEV EMOS calibration removes this bimodality and, in general, as for the bivariate forecasts, the longer the lead time, the smaller the effect of postprocessing.
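The verification ranks underlying such histograms can be computed as in the following sketch, where a deliberately underdispersive synthetic ensemble reproduces the U-shape described above (ties are ignored for brevity):

```python
import numpy as np

def verification_ranks(members, obs):
    """Rank of each observation within the pooled ensemble (1 .. m+1)."""
    # members: (n_cases, n_members), obs: (n_cases,)
    return 1 + np.sum(members < obs[:, None], axis=1)

rng = np.random.default_rng(2)
members = rng.normal(0.0, 0.7, size=(5000, 10))   # underdispersive ensemble
obs = rng.normal(0.0, 1.0, size=5000)
ranks = verification_ranks(members, obs)
hist = np.bincount(ranks, minlength=12)[1:]       # counts for ranks 1..11
print(hist)  # U-shaped: too many observations fall outside the ensemble
```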
We now take another perspective, focusing on forecast skill in the right tail of the observation distributions. Figure 7 (top panel) depicts the BSS and twCRPSS for the event of DI exceeding the threshold at which most of the population suffers from heat-induced discomfort (Stathopoulou et al., 2005). In terms of BSS, all EMOS approaches outperform the raw and adjusted ensembles up to day 10 with comparable skill. In terms of twCRPS, however, the GEV EMOS model clearly outperforms the other postprocessing methods. For lower thresholds, the differences between the EMOS approaches are less pronounced, while for higher thresholds the results are noisier and the significance bars are larger (not shown).
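The threshold-weighted CRPS used here can be obtained from an ensemble by censoring members and observation at the threshold (Gneiting and Ranjan, 2011); a minimal sketch with illustrative values:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Energy-form ensemble CRPS: E|X - y| - 0.5 E|X - X'|."""
    members = np.asarray(members, dtype=float)
    return (np.abs(members - obs).mean()
            - 0.5 * np.abs(members[:, None] - members[None, :]).mean())

def tw_crps(members, obs, threshold):
    """Threshold-weighted CRPS with weight w(x) = 1{x >= threshold},
    computed by censoring forecast and observation at the threshold."""
    members = np.maximum(np.asarray(members, dtype=float), threshold)
    return crps_ensemble(members, max(obs, threshold))

members = [26.5, 27.2, 28.0, 28.8, 29.5]        # made-up member forecasts
print(tw_crps(members, 29.0, threshold=28.0))   # only the tail matters
print(tw_crps(members, 20.0, threshold=28.0))   # observation below threshold
```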
For WBGTid, the threshold associated with the first warning level (green flag) is considered (Patel et al., 2013). Figure 7 (bottom panel) shows the BSS and twCRPSS: postprocessed forecasts significantly outperform the raw and adjusted ensemble forecasts up to at least day 8, whereas ECC exhibits a positive skill score up to day 10. For larger thresholds, small BS and twCRPS values combined with large uncertainty in the scores lead to rather noisy results which are difficult to interpret (not shown). The change in EMOS configuration (from local to semilocal) is clearly visible in Figure 7: jumps appear at days 5 and 8 for 2D EMOS and at days 5 and 9 for GEV EMOS, while the curves corresponding to ECC remain smooth.
We refine the performance analysis one step further by looking at forecast value. For the two events considered above, forecast value is plotted as a function of the user cost-loss ratio in Figure 8. As a general feature, not only the forecast value but also the number of users who can benefit from the forecast decreases with lead time. At day 10, the WBGTid forecast has no value for a user who needs high certainty about the event occurrence (high cost-loss ratios), but it is still valuable for a risk-averse user (low cost-loss ratios). The postprocessed forecasts have value for some users even beyond day 10 (not shown). As a purely statistical characteristic of the score, the forecast value reaches a maximum for a user with cost-loss ratio equal to the climatological base rate. This is true locally, but the maximum varies from one location to another, and the corresponding peak is therefore smoothed out when aggregated over different locations. In any case, the event considered for DI is more probable than the one for WBGTid, which explains why the maximum is reached at different cost-loss ratios in the two cases. The raw ensemble has negative value at extreme cost-loss ratios for both DI and WBGTid forecasts at days 4 and 10, but the ensemble adjustment makes the forecasts valuable for (almost) all types of users. Forecast postprocessing increases the forecast value further in most cases, with slight differences in value between the calibration approaches.
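The forecast value curve can be sketched with the standard expense-based formulation (Wilks, 2001); the hit rate, false alarm rate and base rate below are made-up numbers. The sketch also illustrates that the maximum is attained at a cost-loss ratio equal to the base rate:

```python
import numpy as np

def relative_value(hit_rate, false_alarm_rate, base_rate, cost_loss):
    """Relative economic value for a user with cost-loss ratio alpha:
    V = (E_climate - E_forecast) / (E_climate - E_perfect)."""
    a, s, H, F = cost_loss, base_rate, hit_rate, false_alarm_rate
    e_climate = np.minimum(a, s)                       # protect always or never
    e_forecast = a * F * (1 - s) + a * H * s + s * (1 - H)
    e_perfect = a * s                                  # protect only when needed
    return (e_climate - e_forecast) / (e_climate - e_perfect)

alphas = np.linspace(0.05, 0.95, 19)
v = relative_value(hit_rate=0.8, false_alarm_rate=0.1, base_rate=0.2,
                   cost_loss=alphas)
print("max value %.2f at alpha=%.2f" % (v.max(), alphas[v.argmax()]))
```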
5.3 Predictive distributions of categories of heat indices
Discomfort Index (DI)                                        Wet-bulb globe temperature (WBGTid)

Cat.  Range (°C)  Classification                             Cat.  Range (°C)  Flag col.
1     –           No discomfort                              1     –           No flag
2     –           Under 50% of population feels discomfort   2     –           Green
3     –           Over 50% of population feels discomfort    3     –           Yellow
4     –           Most of population feels discomfort        4     –           Red
5     –           Everyone feels severe stress               5     –           Black
6     –           State of medical emergency
Heat indices are often grouped into categories according to the severity of the effect of heat stress on human activity. Table 2 provides a classification of DI and heat categories of WBGTid based on Stathopoulou et al. (2005) and Patel et al. (2013), respectively. Instead of continuous heat indices, we focus now on the corresponding categories, taking values 1, …, 6 for DI and 1, …, 5 for WBGTid. In this case, a predictive distribution is specified by a probability mass function (PMF) and postprocessing reduces to a classification problem, which is a typical application area of machine learning methods. Here, predictive distributions are obtained with the help of multilayer perceptron neural networks (MLP NN; Hastie et al., 2009). We skip the intermediate step of obtaining DI and WBGTid ensemble forecasts from the corresponding forecasts of temperature and dew point and connect the latter directly to the DI and WBGTid categories. As primary input variables (also called features or covariates), we use the mean, variance and covariance of the temperature and dew point ensemble forecasts, as well as the geographical coordinates (latitude, longitude) and the elevations of the SYNOP stations and of the corresponding model grid points. Instead of applying local or semilocal training, all forecast cases in the training period are regarded as a single group. This nine-dimensional feature vector can be extended further with forecast errors of temperature and dew point: ensemble mean errors of the 1 to 5 days closest to the end of the training period can be added to the input variables, increasing the number of covariates by 2 to 10 features. In this case, the length of the training period is naturally reduced by the number of days for which the forecast errors are considered.

MLP NN configurations          Lead time (day)

DI                             1 – 4      5 – 8      9 – 15
  Net                          single     single     dual
  Neurons in hidden layer      40         15         15 – 15
  Forecast errors (days)       5          0          0

WBGTid                         1 – 8      9 – 12     13 – 15
  Net                          single     single     single
  Neurons in hidden layer      30         35         30
  Forecast errors (days)       5          0          0
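The assignment of continuous index values to the categories of Table 2 amounts to binning; a minimal sketch with illustrative DI category boundaries:

```python
import numpy as np

# Illustrative DI category boundaries (deg C); the operational thresholds
# follow Table 2, based on Stathopoulou et al. (2005).
di_bounds = np.array([21.0, 24.0, 27.0, 29.0, 32.0])

def di_category(di_values):
    """Map continuous DI values to categories 1..6 via the boundaries."""
    return 1 + np.digitize(di_values, di_bounds)

print(di_category(np.array([18.0, 25.5, 33.0])))  # -> [1 3 6]
```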
Predictive PMFs of the DI and WBGTid categories are obtained by applying MLP NNs with a single hidden layer and cross-entropy as loss function. In our case study, the different categories are highly imbalanced: for both indices the first category in Table 2 contains the bulk of the observations, whereas the last category in Table 2 represents only a marginal share in both cases. For this reason, we also consider a two-step procedure (referred to as dual net). In the first step, a binary classification is performed, which for each feature vector yields the probability p1 of belonging to the first (richest) category. Then a second net is trained using only feature vectors coming from the remaining classes 2, …, K, where K = 6 for DI and K = 5 for WBGTid. For each feature vector this step yields a distribution q2, …, qK over the corresponding classes, and the final distribution of index categories is obtained as P(Y = 1) = p1 and P(Y = k) = (1 − p1) qk for k = 2, …, K.

Similar to the EMOS approaches, different training configurations are envisaged for different lead times. We tested the performance of single nets with 15, 20, 25, 30, 35, 40 and 50 neurons in the hidden layer, both with the nine-dimensional primary features and with extended feature vectors containing forecast errors of either 1 or 5 days. In the case of dual nets, we used 15, 25 and 40 neurons in the hidden layer of both components and considered the primary features plus the 1-day forecast error. In general, a large number of neurons leads to overfitting and, according to our observations, for longer lead times the use of forecast errors does not improve the predictive performance, while the dual net better handles the larger uncertainty in the forecasts. For DI categories, the following configuration is selected: a single net with 40 neurons making use of the 5-day forecast errors as additional predictors for days 1 – 4, a single net with 15 neurons for days 5 – 8, and a dual net with 15 – 15 neurons for days 9 – 15 (see Table 3). For WBGTid categories, single networks with a medium number of neurons are preferred, and moreover, the use of forecast errors has a longer-lasting effect.
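The combination step of the dual net can be sketched as follows, with made-up probabilities (p1 from the binary net, q from the second net, K = 6 as for DI):

```python
import numpy as np

p1 = 0.85                                       # P(first category), binary net
q = np.array([0.40, 0.30, 0.15, 0.10, 0.05])    # second net, categories 2..6

# Full predictive PMF over categories 1..6: the first category keeps p1,
# the remaining categories share the complementary probability 1 - p1.
pmf = np.concatenate([[p1], (1.0 - p1) * q])
print(pmf)
print(pmf.sum())   # sums to 1 up to rounding
```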
The selected configuration is the following: for days 1 – 8 we consider 30 neurons and 5-day forecast errors, for days 9 – 12 only the primary features and 35 neurons, whereas for the last 3 days the number of neurons is reduced to 30 (see Table 3). In the following discussion these two blends are referred to as MLP NN.
In Figure 9, the CRPS is applied to forecast and observation categories. All calibration approaches significantly outperform both the raw and adjusted forecasts of DI categories up to day 6. The results for the different EMOS models are in line with those of the continuous forecasts (Figure 6, left panel). The robust and rather generic MLP NN method is outperformed by the parametric models; however, the differences are often nonsignificant. A different ranking of the calibration methods can be observed in the right panel of Figure 9, which shows a high degree of similarity with the BSS and twCRPSS plots in the bottom panel of Figure 7. The GEV and 2D EMOS forecast skill jumps at the lead times where the training configurations are changed. After day 7, 2D EMOS loses its skill against the adjusted ensemble, while the machine learning approach shows reasonable predictive performance up to day 10. Among the different postprocessing approaches, MLP NN provides the sharpest predictive PMFs in terms of the average width of the central prediction intervals, and also in terms of variance for longer lead times (not shown). Due to the extremely imbalanced nature of the WBGTid observation categories, during the training of a neural net one can easily face the situation where some categories are completely missing from the training data. This problem is less pronounced for the other investigated postprocessing methods, as the EMOS predictive PMFs are calculated from continuous forecasts and observations. Thus, the WBGTid results of the different classification approaches should be interpreted cautiously.
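Applied to categories, the CRPS reduces to the (discrete) ranked probability score; a minimal sketch with made-up probabilities:

```python
import numpy as np

def rps(pmf, obs_category):
    """Ranked probability score for a categorical forecast: sum of squared
    differences between cumulative forecast and observation distributions."""
    cdf = np.cumsum(pmf)
    obs_cdf = (np.arange(1, len(pmf) + 1) >= obs_category).astype(float)
    return np.sum((cdf - obs_cdf) ** 2)

print(rps([0.6, 0.3, 0.1, 0.0, 0.0, 0.0], obs_category=2))
```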
6 Conclusion
This study focuses on improving forecasts of heat indices, such as the discomfort index (DI) and the indoor wet-bulb globe temperature (WBGTid), driven by temperature and dew point forecasts as inputs, through statistical calibration. Postprocessing based on ensemble model output statistics (EMOS) approaches enables us to significantly improve continuous forecasts of heat indices, at least up to day 6. In particular, the reliability of the DI and WBGTid forecasts is clearly increased. Among the different flavours of EMOS approaches, the generalized extreme value (GEV) model has the best overall performance, followed by the bivariate EMOS and ensemble copula coupling. GEV EMOS is tailored to the end products (DI and WBGTid), whereas calibration based on joint postprocessing of temperature and dew point forecasts can be applied to any heat index depending on these two weather quantities. For longer forecast ranges (typically beyond the first week), postprocessing does not improve the forecast and can even lead to a deterioration of forecast skill, which is explained by the nonstationarity of the forecast errors.
Moving from continuous forecasts to predicted categories of heat indices, we further compare the traditional postprocessing approach with multilayer perceptron neural networks, demonstrating that both methods have comparable predictive power. Furthermore, we showed that local EMOS postprocessing is outperformed by cluster-based EMOS approaches after day 4 if a limited training period is used. Decreasing the number of clusters with the forecast lead time helps extend the benefit of EMOS postprocessing to longer lead times. In the limit, a single-cluster approach is equivalent to a global training approach, which is not suitable for forecasting over a large domain.
The present work highlights several avenues of potential future research. For example, as an alternative to the training data set used above, one could consider the use of reforecasts. This is not investigated here but should be considered to address efficiently the calibration of medium- to extended-range forecasts. When accessing large training data sets, appropriate predictors could be defined to account for seasonality and/or weather-regime-dependent forecast errors. In this situation, machine learning approaches seem more flexible in dealing with an increased number of predictors.
The assessment of forecast performance spans from the multivariate ES to the forecast value for a range of users defined by their cost-loss ratios, and from the generic univariate CRPS to a focus on extreme events with the BS and twCRPS. This cascade of verification results indicates that the ranking of the postprocessing methods might differ from one score to another: the best method for one aspect of forecast performance is not necessarily the best solution for all applications. While improvement through EMOS postprocessing is expected to be effective up to day 6, the ensemble demonstrates potential skill beyond day 10 in terms of forecast value for some users.
The comparison of bivariate calibration approaches with direct postprocessing leads us to formulate the following recommendations. Direct postprocessing is generally more efficient; targeting one specific event or user could be even more appropriate. However, postprocessing of the input variables appears to be a competitive approach; in that case, calibration of the dependence structure gives the best results. The simplest approach, which consists in calibrating each component separately and combining them using the empirical copula of the ensemble, has reasonable performance as well. For more complex heat indices, calibration of the input variables can be envisaged if no EMOS model exists for the end product but, as the above discussion tends to indicate, there is no royal road to postprocessing of heat indices.
Acknowledgments. Sándor Baran was supported by the National Research, Development and Innovation Office under Grant No. NN125679. Ágnes Baran and Sándor Baran also acknowledge the support of the EFOP-3.6.2-16-2017-00015 project. The project was cofinanced by the Hungarian Government and the European Social Fund. The authors are indebted to Claudia Di Napoli and David S. Richardson for their valuable suggestions and remarks.
References
 Baran et al. (2019) Baran, S., Leutbecher, M., Szabó, M. and Ben Bouallègue, Z. (2019) Statistical postprocessing of dual-resolution ensemble forecasts. Q. J. R. Meteorol. Soc. 145, 1705–1720.
 Baran and Möller (2017) Baran, S. and Möller, A. (2017) Bivariate ensemble model output statistics approach for joint forecasting of wind speed and temperature. Meteorol. Atmos. Phys. 129, 99–112.

 Baran and Nemoda (2016) Baran, S. and Nemoda, D. (2016) Censored and shifted gamma distribution based EMOS model for probabilistic quantitative precipitation forecasting. Environmetrics 27, 280–292.
 Barnes et al. (2019) Barnes, C., Brierley, C. M. and Chandler, R. E. (2019) New approaches to postprocessing of multimodel ensemble forecasts. Q. J. R. Meteorol. Soc. 145, 3479–3498.
 Ben Bouallègue (2020) Ben Bouallègue, Z. (2020) Accounting for representativeness in the verification of ensemble forecasts. ECMWF Technical Memorandum, in preparation.
 Ben Bouallègue et al. (2016) Ben Bouallègue, Z., Heppelmann, T., Theis, S. E. and Pinson, P. (2016) Generation of scenarios from calibrated ensemble forecasts with a dual ensemble copula coupling approach. Mon. Weather Rev. 144, 4737–4750.
 Bernard and Pourmoghani (1999) Bernard, T. E. and Pourmoghani, M. (1999) Prediction of workplace wet bulb global temperature. Appl. Occup. Environ. Hyg. 14, 126–134.
 Casanueva (2019) Casanueva, A. (2019) HeatStress. doi:10.5281/zenodo.3264930. Available at: https://doi.org/10.5281/zenodo.3264929 [Accessed on 21 January 2020]
 de Freitas and Grigorieva (2017) de Freitas, C.R. and Grigorieva, E. A. (2017) A comparison and appraisal of a comprehensive range of human thermal climate indices. Int. J. Biometeorol. 61, 487–512.
 Delle Monache et al. (2006) Delle Monache, L., Hacker, J. P., Zhou, Y., Deng, X. and Stull, R. B. (2006) Probabilistic aspects of meteorological and ozone regional ensemble forecasts. J. Geophys. Res. 111 D24307, doi:10.1029/2005JD006917.
 Di Napoli et al. (2019) Di Napoli, C., Pappenberger, F. and Cloke, H. L. (2019) Verification of heat stress thresholds for a healthbased heatwave definition. J. Appl. Meteorol. Climatol. 58, 1177–1194.
 EMDAT (2019) EMDAT: The International Disaster Database, Centre for Research on the Epidemiology of Disasters (CRED), Université Catholique de Louvain. Available at: http://www.emdat.be [Accessed on 21 January 2020]
 Feldmann et al. (2019) Feldmann, K., Richardson, D. S. and Gneiting, T. (2019) Grid- versus station-based postprocessing of ensemble temperature forecasts. Geophys. Res. Lett. 46, 7744–7751.
 Friederichs and Thorarinsdottir (2012) Friederichs, P. and Thorarinsdottir, T. L. (2012) Forecast verification for extreme value distributions with an application to probabilistic peak wind prediction. Environmetrics 23, 579–594.
 Fundel et al. (2019) Fundel, V. J., Fleischhut, N., Herzog, S. M., Göber, M. and Hagedorn, R. (2019) Promoting the use of probabilistic weather forecasts through a dialogue between scientists, developers and endusers. Q. J. R. Meteorol. Soc. 145, 210–231.
 Gneiting (2014) Gneiting, T. (2014). Calibration of mediumrange weather forecasts. ECMWF Technical Memorandum No. 719. Available at: http://www.ecmwf.int/sites/default/files/elibrary/2014/9607calibrationmediumrangeweatherforecasts.pdf [Accessed on 21 January 2020]
 Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007) Strictly proper scoring rules, prediction and estimation. J. Amer. Statist. Assoc. 102, 359–378.
 Gneiting et al. (2005) Gneiting, T., Raftery, A. E., Westveld, A. H. and Goldman, T. (2005) Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Weather Rev. 133, 1098–1118.

 Gneiting and Ranjan (2011) Gneiting, T. and Ranjan, R. (2011) Comparing density forecasts using threshold- and quantile-weighted scoring rules. J. Bus. Econ. Stat. 29, 411–422.
 Gneiting and Ranjan (2013) Gneiting, T. and Ranjan, R. (2013) Combining predictive distributions. Electron. J. Stat. 7, 1747–1782.
 Gneiting et al. (2008) Gneiting, T., Stanberry, L. I., Grimit, E. P., Held, L. and Johnson, N. A. (2008) Assessing probabilistic forecasts of multivariate quantities, with applications to ensemble predictions of surface winds (with discussion and rejoinder). Test 17, 211–264.
 Good (1952) Good, I. J. (1952) Rational decisions. J. R. Stat. Soc. Ser. B Stat. Methodol. 14, 107–114.
 Hastie et al. (2009) Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Second Edition. Springer, New York.
 Lemke and Kjellstrom (2012) Lemke, B. and Kjellstrom, T. (2012) Calculating workplace WBGT from meteorological data: a tool for climate change assessment. Ind. Health. 50, 267–278.
 Lerch and Baran (2017) Lerch, S. and Baran, S. (2017) Similaritybased semilocal estimation of EMOS models. J. R. Stat. Soc. Ser. C Appl. Stat. 66, 29–51.
 Lerch and Thorarinsdottir (2013) Lerch, S. and Thorarinsdottir, T. L. (2013) Comparison of nonhomogeneous regression models for probabilistic wind speed forecasting. Tellus A 65, 21206.
 Leutbecher and Palmer (2008) Leutbecher, M. and Palmer, T. N. (2008) Ensemble forecasting. J. Comp. Phys. 227, 3515–3539.
 McGregor et al. (2015) McGregor, G., Bessemoulin, P, Ebi, K. and Menne B. (eds.) (2015) Heatwaves and Health: Guidance on Warningsystem Development. World Meteorological Organisation and World Health Organisation, Geneva.
 Morabito et al. (2019) Morabito, M., Messeri, A., Noti, P., Casanueva, A., Crisci, A., Kotlarski, S., Orlandini. S., Schwierz, C., Spirig, C., Kingma, B. R., Flouris, A. D. and Nybo, L. (2019) An occupational heat–health warning system for Europe: The heatshield platform. Int. J. Environ. Res. Public Health 16, paper 2890, doi:10.3390/ijerph16162890.
 Pappenberger et al. (2015) Pappenberger, F., Jendritzky, G., Staiger, H., Dutra, E., Di Giuseppe, F., Richardson, D. S and Cloke, H. L. (2015) Global forecasting of thermal health hazards: the skill of probabilistic predictions of the universal thermal climate index (UTCI). Int. J. Biometeorol. 59, 311–323.
 Patel et al. (2013) Patel, T., Mullen, S. P. and Santee, W. R. (2013) Comparison of methods for estimating wetbulb globe temperature index from standard meteorological measurements. Mil. Med. 178, 926–932.
 Raftery et al. (2005) Raftery, A. E., Gneiting, T., Balabdaoui, F. and Polakowski, M. (2005) Using Bayesian model averaging to calibrate forecast ensembles. Mon. Weather Rev. 133, 1155–1174.
 Rodwell et al. (2019) Rodwell, M. J., Hammond, J., Thornton, S. and Richardson, D. S. (2019) Acting on forecasts: perceptions, problems and progress. Q. J. R. Meteorol. Soc., submitted.
 Saetra et al. (2004) Saetra, Ø., Hersbach, H., Bidlot, J. R. and Richardson, D. S. (2004) Effects of observation errors on the statistics for ensemble spread and reliability. Mon. Weather Rev. 132, 1487–1501.
 Schefzik (2016) Schefzik, R. (2016) Combining parametric lowdimensional ensemble postprocessing with reordering methods. Q. J. R. Meteorol. Soc. 142, 2463–2477.
 Schefzik et al. (2013) Schefzik, R., Thorarinsdottir T. L. and Gneiting, T. (2013) Uncertainty quantification in complex simulation models using ensemble copula coupling. Statist. Sci. 28, 616–640.
 Scheuerer (2014) Scheuerer, M. (2014) Probabilistic quantitative precipitation forecasting using ensemble model output statistics. Q. J. R. Meteorol. Soc. 140, 1086–1096.
 Schuhen et al. (2012) Schuhen, N., Thorarinsdottir, T. L. and Gneiting, T. (2012) Ensemble model output statistics for wind vectors. Mon. Weather Rev. 140, 3204–3219.
 Sonntag (1990) Sonntag, D. (1990) Important new values of the physical constants of 1986, vapour pressure formulations based on the IST90 and psychrometer formulae. Z. Meteorol. 70, 340–344.
 Stathopoulou et al. (2005) Stathopoulou, M. I., Cartalis, C., Keramitsoglou, I. and Santamouris, M. (2005) Thermal remote sensing of Thom’s discomfort index (DI): comparison with insitu measurements. Proc. SPIE 5983, Remote Sensing for Environmental Monitoring, GIS Applications, and Geology V, 59830K (29 October 2005).

 Thorarinsdottir and Gneiting (2010) Thorarinsdottir, T. L. and Gneiting, T. (2010) Probabilistic forecasts of wind speed: ensemble model output statistics by using heteroscedastic censored regression. J. R. Stat. Soc. Ser. A Stat. Soc. 173, 371–388.
 Thorarinsdottir et al. (2016) Thorarinsdottir, T. L., Scheuerer, M. and Heinz, C. (2016) Assessing the calibration of high-dimensional ensemble forecasts using rank histograms. J. Comput. Graph. Stat. 25, 105–122.
 Wilks (2001) Wilks, D. S. (2001) A skill score based on economic value for probability forecasts. Meteorol. Appl. 8, 209–219.
 Wilks (2011) Wilks, D. S. (2011) Statistical Methods in the Atmospheric Sciences. 3rd ed., Elsevier, Amsterdam.
 Wilks (2018) Wilks, D. S. (2018) Chapter 3 – Univariate Ensemble Postprocessing. In Vannitsem, S., Wilks, D. S., Messner, J. W. (eds.), Statistical Postprocessing of Ensemble Forecasts, Elsevier, Amsterdam, pp. 49–89.