1 Introduction
Comparative evaluation of forecasts of statistical functionals is a standard issue in the realm of forecasting (Gneiting, 2011). It relies on comparing expected or averaged losses of competing forecasts after the realization of the quantity on which the functional is based has been observed. The aim of this paper is to investigate how proxies for this quantity, which are observed alongside it, can be utilized to improve forecast comparisons, in the sense that they result in the same ordering of forecasts but increase the power of comparative forecast tests.
The motivation comes mainly from high-frequency finance, where high-frequency data are routinely used to generate forecasts, say of volatilities, also over moderate time horizons such as daily volatilities (Corsi, 2009). Our investigation shows how high-frequency data can be used to obtain sharper forecast evaluation. In comparative forecast evaluation, this means that when comparing two forecasts of daily volatilities in terms of expected values of loss functions, the better forecast can be determined with higher power when using these high-frequency data in the process of forecast evaluation.
Our main point of departure is Patton (2011), who showed that using various noisy volatility proxies, e.g. based on high-frequency data, is valid in comparative forecast evaluation, that is, preserves the order of the expected losses. Hansen and Lunde (2006) have similar results, while Laurent et al. (2013) provide a multivariate generalization of the characterization in Patton (2011), and Koopman et al. (2005) illustrate the use of realized measures for forecast comparisons on various observed high-frequency data sets. We are interested in the comparison of different, possibly misspecified forecasts, a situation in which Patton (2020) shows that the ranking of the forecasts may depend on the loss function.
A very recent, closely related contribution is by Hoga and Dimitriadis (2021), who focus on predicting the mean, and illustrate their methods for GDP forecasts. Our contributions and their relation to the results in Hoga and Dimitriadis (2021) can be summarized as follows.

We extend the analysis of Patton (2011) and Hansen and Lunde (2006) on the validity of using various proxies from volatilities, that is, second moments, to general moments and beyond, to ratios of moments. In the terminology of Hoga and Dimitriadis (2021), we rely on the concept of exact robustness of loss functions instead of the ordering robustness considered in Patton (2011). Hoga and Dimitriadis (2021) assume that the proxy enters the loss difference in the same way as the original observation, and in this setting show that only the mean allows for exactly robust loss functions.

We formally show, in terms of the variance of differences of losses, that using proxies will increase the power in comparative forecast testing by decreasing the variance of loss differences. Hoga and Dimitriadis (2021) have similar results for the mean, and also formally investigate the consequences for the power of Diebold-Mariano tests under local alternatives.

Finally, we illustrate the theoretical results for simulated high-frequency data using the three-zone approach from Fissler et al. (2015) and Nolde and Ziegel (2017). We show that the choice of the proxies, as well as the choice of the loss function, has a pronounced effect on the comparative evaluation of forecasts: using high-frequency data and the QLIKE loss substantially improves the forecast evaluation.
The paper is structured as follows. In Section 2, we start with a motivating example, recall strictly consistent loss functions for statistical functionals, and introduce a dynamic framework for forecast evaluation. Section 3 investigates the use of proxies to improve evaluations of forecasts of moments, and in Section 4 this is further discussed and extended to ratios of moments. Section 5 summarizes the results of a simulation study, where we consider comparing forecasts for second, third, and fourth moments of GARCH-type time series.
2 Motivation and Basic Concepts
2.1 A motivating example
Let us first illustrate the use of different proxies and loss functions in Diebold-Mariano tests for equal forecast performance. Consider the following stylized scenario, where the aim is to distinguish between two competing forecasts. Observations correspond to logarithmic returns, and we assume that the true data-generating process is a simple GARCH(1,1) model. The total length of the time series is $T$, and we consider rolling one-step-ahead forecasts of the conditional variance using a moving time window of length $T/3$, resulting in $2T/3$ forecasts.
There are four forecasters. The first one is lucky enough to use a GARCH(1,1) model for making predictions; forecasters 2, 3, and 4 use ARCH(1), ARCH(2), and ARCH(7) models, respectively. Clearly, we expect the predictions of forecaster 1 to outperform, in some sense, the other predictions. Moreover, we would expect the ARCH(7) model to beat the ARCH(1) and ARCH(2) models, since the former should be a better approximation to a GARCH(1,1) process. Hence, our interest focuses on null hypotheses of the form
$$H_0 \colon \text{forecast } X^{(i)} \text{ predicts at least as well as forecast } X^{(j)},$$
and if $H_0$ is rejected, then $X^{(i)}$ is worse than $X^{(j)}$. To decide for or against $H_0$, we use a Diebold-Mariano test based on the loss differences
$$d_t = L\big(X_t^{(i)}, \hat\sigma_t^2\big) - L\big(X_t^{(j)}, \hat\sigma_t^2\big),$$
where $L$ is some loss function, and the volatility proxy $\hat\sigma_t^2$ materializes at day $t$. Under suitable conditions, the studentized test statistic
$$T_n = \sqrt{n}\, \bar d_n / \hat\omega_n, \qquad \bar d_n = \frac{1}{n} \sum_{t=1}^{n} d_t,$$
where $\hat\omega_n^2$ estimates the long-run variance of the loss differences, has a limiting standard normal distribution, and $H_0$ is rejected for large values of $T_n$.

To evaluate the forecasts, the mean squared error loss $L(x, y) = (x - y)^2$ is used, together with the squared returns as an unbiased proxy for the true volatility. The first line of the following table shows the results of the Diebold-Mariano test when the different predictions are compared to the GARCH(1,1) model. Even though the values are positive (hence, slightly favor the GARCH model), they are nowhere near statistically significant. The comparison of the ARCH(7) model with the other two ARCH models even results in values close to zero.
              GARCH(1,1)   ARCH(1)   ARCH(2)   ARCH(7)
GARCH(1,1)        -         0.788     1.010     0.984
ARCH(7)         0.984       0.193     0.152       -
Next, assume that, besides the daily returns, 5-min returns are also available for predicting the next day’s volatility. Thus, the squared returns are replaced by the realized volatilities $RV_t = \sum_{i=1}^{m} r_{t,i}^2$, where the $r_{t,i}$ are the intraday log returns of day $t$. The outcomes of the Diebold-Mariano tests for equal predictive performance, now using the realized volatilities as proxies, are as follows:
              GARCH(1,1)   ARCH(1)   ARCH(2)   ARCH(7)
GARCH(1,1)        -         3.976     3.924     3.506
ARCH(7)         3.506       2.474     2.352       -
In comparison with the first table, the values in the first line are much larger, now being statistically significant even at the 0.01 level and indicating the dominance of the prediction under the GARCH(1,1) model. The comparison of the ARCH(7) model with the other two ARCH models favors the ARCH(7) model, at least at the 0.05 level.
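The gain from the realized-volatility proxy can be made concrete with a small numerical check (our own sketch in Python; the paper's computations were done in R): under constant intraday volatility, both the squared daily return and the realized volatility are unbiased for the daily variance, but the realized volatility is far less noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, m, n_days = 1.0, 13, 100_000

# Intraday returns for each day; the daily variance sigma2 is split
# evenly over the m intraday periods.
intraday = rng.normal(0.0, np.sqrt(sigma2 / m), size=(n_days, m))

r2 = intraday.sum(axis=1) ** 2        # squared daily return
rv = (intraday ** 2).sum(axis=1)      # realized volatility (variance)

# Both proxies are unbiased for sigma2, but the realized volatility
# is much less noisy: Var(r2) = 2*sigma2^2, Var(rv) = 2*sigma2^2 / m.
print(r2.mean(), rv.mean())
print(r2.var(), rv.var())
```

The factor-$m$ variance reduction of the proxy is what later translates into sharper Diebold-Mariano tests.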
Finally, the evaluator replaces the MSE with the QLIKE loss function $L(x, y) = y/x - \log(y/x) - 1$ and obtains the following results of the corresponding Diebold-Mariano tests.
              GARCH(1,1)   ARCH(1)   ARCH(2)   ARCH(7)
GARCH(1,1)        -         5.589     5.523     4.386
ARCH(7)         4.386       3.642     3.492       -
Now, all entries are even larger in absolute terms, and the ARCH(7) model dominates the competing ARCH models even at the 0.01 level.
Clearly, these results are based on a specific realization of the time series. However, a closer look at this example in Section 5 reveals that this behavior is rather typical.
2.2 Loss functions, statistical functionals and dynamic forecasts
We start by recalling the concept of strictly consistent loss (or scoring) functions; see Gneiting (2011). Let $\mathcal{F}$ be a class of distribution functions on a closed subset $\mathsf{O} \subseteq \mathbb{R}$, which we identify with their associated probability distributions, and let
$$\Gamma \colon \mathcal{F} \to \mathbb{R}$$
be a (one-dimensional) statistical functional (or parameter). A loss function (also scoring function) is a measurable map $L \colon \mathbb{R} \times \mathsf{O} \to [0, \infty)$. It is interpreted as the loss $L(x, y)$ if the forecast $x$ is issued and $y$ materializes. $L$ is consistent for the parameter of the functional $\Gamma$ relative to the class $\mathcal{F}$ if
$$\mathbb{E}_F\big[L(\Gamma(F), Y)\big] \le \mathbb{E}_F\big[L(x, Y)\big] \quad \text{for all } F \in \mathcal{F} \text{ and all } x, \quad (1)$$
where $\mathbb{E}_F$ indicates that the expectation is taken under the distribution $F$ of $Y$, and we assume that the relevant expected values exist and are finite. Thus, the true functional value minimizes the expected loss under $F$. If, in addition, equality in (1) implies $x = \Gamma(F)$, then $L$ is strictly consistent for $\Gamma$. The functional $\Gamma$ is called elicitable relative to the class $\mathcal{F}$ if it admits a strictly consistent loss function. For several functionals such as moments, quantiles, and expectiles, Gneiting (2011) characterizes all strictly consistent loss functions under some smoothness and normalization conditions; see also Steinwart et al. (2014).

When comparing two forecasts $x^{(1)}, x^{(2)}$ for a given $F$ and hence parameter $\Gamma(F)$, we say that $x^{(1)}$ dominates $x^{(2)}$ under $F$ for the loss $L$ if the difference of expected losses satisfies
$$\mathbb{E}_F\big[L(x^{(1)}, Y)\big] - \mathbb{E}_F\big[L(x^{(2)}, Y)\big] \le 0. \quad (2)$$
From (1), for a strictly consistent loss function, the true parameter $\Gamma(F)$ dominates any other forecast.
Now let us consider a forecasting situation. Forecasts are issued on the basis of certain information. Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space, and let $\mathcal{F}_t$ be a sub-$\sigma$-algebra of $\mathcal{A}$, the information set at time $t$, on the basis of which the forecast is issued. In finance, $\mathcal{F}_t$ can include returns (including high-frequency returns) up to time $t$ as well as other covariates observed up to time $t$. The aim is to predict the functional $\Gamma$, say the mean or the volatility, of the random variable $Y_{t+1}$, which will be observed at time $t+1$ (say one day ahead); for example, $Y_{t+1}$ may be the return over one day, from $t$ to $t+1$. More precisely, if $F_{t+1}$ denotes the conditional distribution of $Y_{t+1}$ given $\mathcal{F}_t$, then the parameter of interest is $\Gamma_{t+1} = \Gamma(F_{t+1})$. We note that

the forecast is based on the full information up to time $t$; thus, even if $Y_{t+1}$ is a return over one day, for the forecast we use e.g. high-frequency data up to time $t$ if these are included in $\mathcal{F}_t$,

the observation $Y_{t+1}$ is special, since the parameter is defined via its conditional distribution given $\mathcal{F}_t$.

Thus, to generate the forecast, even if the time horizon for the forecast is, say, one day, it is standard to use high-frequency information contained in $\mathcal{F}_t$ up to time $t$. Now, the issue in this paper is how to use additional information contained in $\mathcal{F}_{t+1}$, available at time $t+1$, to improve forecast evaluation.
2.3 Comparative forecast evaluation
First let us recall the setting for comparative forecast evaluation based on $Y_{t+1}$.
A forecast at time $t$ is, in great generality, an $\mathcal{F}_t$-measurable random variable $X_t$. Now, if $L$ is a strictly consistent loss function for $\Gamma$, then compared to the true parameter $\Gamma_{t+1}$, we have the following:
Conditional dominance. It holds that
$$\mathbb{E}\big[L(\Gamma_{t+1}, Y_{t+1}) \mid \mathcal{F}_t\big] \le \mathbb{E}\big[L(X_t, Y_{t+1}) \mid \mathcal{F}_t\big] \quad \text{a.s.} \quad (3)$$
Unconditional dominance. Further, it holds that
$$\mathbb{E}\big[L(\Gamma_{t+1}, Y_{t+1})\big] \le \mathbb{E}\big[L(X_t, Y_{t+1})\big], \quad (4)$$
with equality in (3) or (4) if and only if $X_t = \Gamma_{t+1}$ almost surely. When comparing with the true conditional value of the functional, $\Gamma_{t+1}$, these two notions coincide.
We shall generally compare two potentially misspecified forecasts, that is, $\mathcal{F}_t$-measurable random variables $X_t^{(1)}$ and $X_t^{(2)}$. Then, by definition, $X_t^{(1)}$ conditionally dominates $X_t^{(2)}$ for the loss function $L$ if
$$\mathbb{E}\big[L(X_t^{(1)}, Y_{t+1}) \mid \mathcal{F}_t\big] \le \mathbb{E}\big[L(X_t^{(2)}, Y_{t+1}) \mid \mathcal{F}_t\big] \quad \text{a.s.}, \quad (5)$$
with strict inequality on a set of positive probability, while $X_t^{(1)}$ unconditionally dominates $X_t^{(2)}$ for the loss function $L$ if
$$\mathbb{E}\big[L(X_t^{(1)}, Y_{t+1})\big] \le \mathbb{E}\big[L(X_t^{(2)}, Y_{t+1})\big], \quad (6)$$
a weaker notion. Now, we consider how additional information contained in $\mathcal{F}_{t+1}$ (apart from $Y_{t+1}$) may be used for forecast evaluation. In the context of high-frequency financial data, apart from using the high-frequency data to generate forecasts over daily time horizons, we shall investigate how these high-frequency data can be used to obtain sharper forecast evaluation.
3 Proxies when Comparing Forecasts of Moments
Suppose that $\mathsf{O} \subseteq \mathbb{R}$ is an interval and $h \colon \mathsf{O} \to \mathbb{R}$ is a measurable function such that $\mathbb{E}_F[|h(Y)|] < \infty$ for all $F \in \mathcal{F}$. Then a classical result by Savage (1971), see also Gneiting (2011), characterizes the strictly consistent scoring functions for the generalized moment $\Gamma(F) = \mathbb{E}_F[h(Y)]$ in the form
$$L(x, y) = \phi(h(y)) - \phi(x) - \phi'(x)\big(h(y) - x\big), \quad (7)$$
where $\phi$ is a strictly convex, differentiable function for which $\mathbb{E}_F[|\phi(h(Y))|] < \infty$ for all $F \in \mathcal{F}$.
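For illustration (a worked example of our own, consistent with the variance losses in Patton (2011)): taking $h(y) = y^2$ and $\phi(x) = x^2$ in (7) gives

```latex
L(x, y) = \phi(h(y)) - \phi(x) - \phi'(x)\,\big(h(y) - x\big)
        = y^4 - x^2 - 2x\,(y^2 - x)
        = (y^2 - x)^2 ,
```

i.e. the MSE loss of a variance forecast $x$ against the squared observation $y^2$, while $\phi(x) = -\log x$ yields $L(x, y) = y^2/x - \log(y^2/x) - 1$, the QLIKE loss.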
First, we formulate the following lemma in the static framework.
Lemma 1.
Consider (7) and forecasts $x_1, x_2$.

(i) The loss difference
$$\Delta(x_1, x_2; y) := L(x_1, y) - L(x_2, y) = \phi(x_2) - \phi(x_1) + \phi'(x_2)\,(h(y) - x_2) - \phi'(x_1)\,(h(y) - x_1) \quad (8)$$
depends on $y$ only through $h(y)$.

(ii) If $Y \sim F$ and $Z$ is a random variable (with a given distribution) such that $\mathbb{E}[h(Z)] = \mathbb{E}[h(Y)]$ (the $h$-moment is the same), then
$$\mathbb{E}[\Delta(x_1, x_2; Z)] = \mathbb{E}[\Delta(x_1, x_2; Y)]. \quad (9)$$

(iii) We have that
$$\operatorname{Var}(\Delta(x_1, x_2; Y)) = \big(\phi'(x_2) - \phi'(x_1)\big)^2 \operatorname{Var}(h(Y)). \quad (10)$$
Consequently, if in addition to (ii) it holds that $\operatorname{Var}(h(Z)) \le \operatorname{Var}(h(Y))$, then
$$\operatorname{Var}(\Delta(x_1, x_2; Z)) \le \operatorname{Var}(\Delta(x_1, x_2; Y)). \quad (11)$$
Here, $Z$ plays the role of the proxy that shall be used to improve forecast evaluation. Part (ii) shows that using $Z$ instead of $Y$ is valid if $\mathbb{E}[h(Z)] = \mathbb{E}[h(Y)]$, in the sense that dominance relations of forecasts are preserved when using $Z$, while (11) shows that evaluation of score differences is actually sharper based on $Z$ instead of $Y$ if $\operatorname{Var}(h(Z)) \le \operatorname{Var}(h(Y))$.

Hoga and Dimitriadis (2021) call the equality of loss differences in (9) exact robustness. When assuming that the proxy enters the loss difference in the same fashion as $Y$, they show that exact robustness can only hold for strictly consistent scoring functions of the mean. In our more flexible approach, we cover general moments and also ratios of moments; see below.
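Parts (ii) and (iii) of Lemma 1 are easy to verify by simulation. The sketch below (an illustrative setup of our own, not the paper's code) takes $h(y) = y^2$ and the Bregman loss with $\phi(x) = x^2$; the proxy enters the loss only through $h(Z)$, which we take to be a realized variance built from $m$ intraday pieces. The mean loss difference is unchanged, while its variance drops by roughly the factor $m$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma2 = 200_000, 13, 1.0

# Daily observation Y; the proxy Z enters the loss only through
# h(Z) = Z^2, which we take to be a realized variance RV with
# E[RV] = E[Y^2] = sigma2 and Var(RV) = Var(Y^2) / m.
Y = rng.normal(0.0, np.sqrt(sigma2), size=n)
intraday = rng.normal(0.0, np.sqrt(sigma2 / m), size=(n, m))
RV = (intraday ** 2).sum(axis=1)

x1, x2 = 0.8, 1.3                        # two competing variance forecasts

def loss(x, h):
    # Bregman loss from (7) with phi(x) = x^2, written as (h - x)^2;
    # the extra terms in (7) cancel in loss differences.
    return (h - x) ** 2

d_Y = loss(x1, Y ** 2) - loss(x2, Y ** 2)
d_Z = loss(x1, RV) - loss(x2, RV)

print(d_Y.mean(), d_Z.mean())            # (9): approximately equal
print(d_Y.var(), d_Z.var())              # (11): variance shrinks by ~1/m
```

The loss difference is linear in $h(y)$, so its variance is proportional to $\operatorname{Var}(h(Y))$ as in (10), which is exactly what the simulation reflects.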
We now turn to the dynamic setting described in Section 2.
Theorem 2 (Forecast dominance testing).
Consider forecasting the conditional moment $\Gamma_{t+1} = \mathbb{E}[h(Y_{t+1}) \mid \mathcal{F}_t]$, and suppose that $Z_{t+1}$ is $\mathcal{F}_{t+1}$-measurable with
$$\mathbb{E}[h(Z_{t+1}) \mid \mathcal{F}_t] = \mathbb{E}[h(Y_{t+1}) \mid \mathcal{F}_t]. \quad (12)$$

(i) For the loss difference (8) and any two forecasts $X_t^{(1)}$ and $X_t^{(2)}$ ($\mathcal{F}_t$-measurable random variables),
$$\mathbb{E}[\Delta(X_t^{(1)}, X_t^{(2)}; Z_{t+1}) \mid \mathcal{F}_t] = \mathbb{E}[\Delta(X_t^{(1)}, X_t^{(2)}; Y_{t+1}) \mid \mathcal{F}_t], \quad (13)$$
and hence in particular
$$\mathbb{E}[\Delta(X_t^{(1)}, X_t^{(2)}; Z_{t+1})] = \mathbb{E}[\Delta(X_t^{(1)}, X_t^{(2)}; Y_{t+1})]. \quad (14)$$
Thus, both conditional and unconditional dominance are preserved when using $Z_{t+1}$ instead of $Y_{t+1}$ in the forecast comparison.
The second part of the theorem shows that a variance reduction is achieved for testing both conditional and unconditional dominance.
4 Ratios of Moments and Further Parameters
Suppose that $\mathsf{O} \subseteq \mathbb{R}$ is an interval, and $h_1, h_2 \colon \mathsf{O} \to \mathbb{R}$ are measurable functions such that $\mathbb{E}_F[|h_1(Y)|] < \infty$ and $\mathbb{E}_F[h_2(Y)] > 0$ for all $F \in \mathcal{F}$. The target parameter is the ratio of generalized moments
$$\Gamma(F) = \frac{\mathbb{E}_F[h_1(Y)]}{\mathbb{E}_F[h_2(Y)]}.$$
Gneiting (2011) shows that strictly consistent loss functions for $\Gamma$ are of the form
$$L(x, y) = \phi'(x)\,\big(x\, h_2(y) - h_1(y)\big) - \phi(x)\, h_2(y) + a(y), \quad (17)$$
where $\phi$ is strictly convex and differentiable, and where it is additionally assumed that the relevant expectations exist and are finite.
Lemma 3.
Consider (17) and forecasts $x_1, x_2$.

(i) The loss difference
$$\Delta(x_1, x_2; y) := L(x_1, y) - L(x_2, y) = \big(\phi'(x_2) - \phi'(x_1)\big)\, h_1(y) + \big(\phi'(x_1)\,x_1 - \phi(x_1) - \phi'(x_2)\,x_2 + \phi(x_2)\big)\, h_2(y) \quad (18)$$
depends on $y$ only through $h_1(y)$ and $h_2(y)$.

(ii) If $Y \sim F$, and $Z$ is a random variable (with a given distribution) such that $\mathbb{E}[h_1(Z)] = \mathbb{E}[h_1(Y)]$ and $\mathbb{E}[h_2(Z)] = \mathbb{E}[h_2(Y)]$ (the moments are the same), then
$$\mathbb{E}[\Delta(x_1, x_2; Z)] = \mathbb{E}[\Delta(x_1, x_2; Y)]. \quad (19)$$

(iii) If, in addition to (ii), we have that $\operatorname{Var}(c_1 h_1(Z) + c_2 h_2(Z)) \le \operatorname{Var}(c_1 h_1(Y) + c_2 h_2(Y))$ for all $c_1, c_2 \in \mathbb{R}$, then
$$\operatorname{Var}(\Delta(x_1, x_2; Z)) \le \operatorname{Var}(\Delta(x_1, x_2; Y)). \quad (20)$$
Proof.
Theorem 4 (Forecast dominance testing: Ratio of moments).
Consider forecasting the ratio of conditional moments $\Gamma_{t+1} = \mathbb{E}[h_1(Y_{t+1}) \mid \mathcal{F}_t] \,/\, \mathbb{E}[h_2(Y_{t+1}) \mid \mathcal{F}_t]$, and suppose that $Z_{t+1}$ is $\mathcal{F}_{t+1}$-measurable with
$$\mathbb{E}[h_i(Z_{t+1}) \mid \mathcal{F}_t] = \mathbb{E}[h_i(Y_{t+1}) \mid \mathcal{F}_t], \quad i = 1, 2. \quad (21)$$
The proof is immediate from Lemma 3, and the final inequality (25) follows as (16) in Theorem 2. The condition for a potential variance reduction in Theorem 4 (ii) is more restrictive than that from Theorem 2 (ii): apart from relating the variances of the two generalized moments of $Y_{t+1}$ to those of the proxy, it also involves conditional covariances.
Theorem 4 does not apply to measures such as skewness and kurtosis, which even for centered distributions are known not to allow for strictly consistent scoring functions. However, the revelation principle (Theorem 4 in Gneiting, 2011) and the elicitability (existence of a strictly consistent scoring function), and hence joint elicitability, of moments imply that for centered distributions these measures are elicitable when considered together with the second moment. Roughly speaking, for the skewness this involves the two-dimensional parameter consisting of the third and second moments, and for the kurtosis the parameter consisting of the fourth and second moments. The analysis of the corresponding loss differences is then similar to that in Theorem 4.
5 Simulations
5.1 General setup
Following Nolde and Ziegel (2017), in comparative backtesting we are interested in the null hypotheses
$$H_0^- \colon \text{the forecast } X^{(1)} \text{ predicts at least as well as } X^{(2)}, \qquad H_0^+ \colon \text{the forecast } X^{(1)} \text{ predicts at most as well as } X^{(2)}.$$
The forecast $X^{(2)}$ is used as a benchmark. If the hypothesis $H_0^-$ is rejected, then $X^{(1)}$ is worse than $X^{(2)}$; if $H_0^+$ is rejected, $X^{(1)}$ is better than $X^{(2)}$. The error of the first kind for rejecting one of the two hypotheses, even though they are true, can be controlled by the level of significance. As in Nolde and Ziegel (2017), we define
$$\lambda = \mathbb{E}\big[L(X_t^{(1)}, Y_{t+1})\big] - \mathbb{E}\big[L(X_t^{(2)}, Y_{t+1})\big]$$
(assuming first-order stationarity). Then, dominance of $X^{(1)}$ over $X^{(2)}$ is equivalent to $\lambda \le 0$, and $X^{(1)}$ predicts at most as well as $X^{(2)}$ if $\lambda \ge 0$. Therefore, the comparative backtesting hypotheses can be reformulated as
$$H_0^- \colon \lambda \le 0 \qquad \text{versus} \qquad H_0^+ \colon \lambda \ge 0.$$
Forecast equality can be tested with the so-called Diebold-Mariano test (Diebold and Mariano, 1995; Giacomini and White, 2006; Diebold, 2015), which is based on normalized loss differences. Here, the test statistic is given by
$$T_n = \sqrt{n}\, \frac{\bar d_n}{\hat\omega_n},$$
where $\bar d_n = n^{-1} \sum_{t=1}^{n} d_t$ is the mean of the loss differences $d_t$ of the two forecasts, and $\hat\omega_n^2$ is an estimator of the long-run asymptotic variance of the loss differences. One possible choice for $\hat\omega_n^2$ is
$$\hat\omega_n^2 = \hat\gamma_0 + 2 \sum_{k=1}^{K} \hat\gamma_k,$$
where $\hat\gamma_k$ denotes the lag-$k$ sample autocovariance of the sequence of loss differences (Gneiting and Ranjan, 2011; Lerch et al., 2017). Another possible choice (Diks et al., 2011) takes the truncation lag $K$ as the largest integer less than or equal to a power of the sample size; as a compromise, we use an intermediate truncation lag. Under the null hypothesis of a vanishing expected loss difference and some further regularity conditions, the test statistic $T_n$ is asymptotically standard normally distributed. Therefore, we obtain an asymptotic level-$\eta$ test of $H_0^-$ by rejecting when $T_n > z_{1-\eta}$, and of $H_0^+$ by rejecting when $T_n < z_\eta$, where $z_\eta$ denotes the $\eta$-quantile of the standard normal distribution.

To evaluate the tests at a fixed significance level $\eta$, we use the following three-zone approach of Fissler et al. (2016). If $H_0^-$ is rejected at level $\eta$, we conclude that the forecast $X^{(1)}$ is worse than $X^{(2)}$, and we mark the result in red; similarly, if $H_0^+$ is rejected at level $\eta$, forecast $X^{(1)}$ is better than $X^{(2)}$, and we mark the result in green. Finally, if neither $H_0^-$ nor $H_0^+$ can be rejected, the marking is yellow.
5.2 Squared returns and realized volatility
Assume that the log returns $r_t = \sigma_t \varepsilon_t$ follow a GARCH(1,1) process defined by
$$\sigma_t^2 = \omega + \alpha\, r_{t-1}^2 + \beta\, \sigma_{t-1}^2, \quad (26)$$
where the innovations $\varepsilon_t$ are i.i.d. with mean zero and unit variance, and all innovations are independent. Assuming that the volatility $\sigma_t$ is constant on day $t$, intraday returns are given by $r_{t,i} = \sigma_t\, \varepsilon_{t,i}$, $i = 1, \ldots, m$, where the i.i.d. intraday innovations $\varepsilon_{t,i}$ have variance $1/m$, so that $r_t = \sum_{i=1}^m r_{t,i}$.
We use two values for the number $m$ of intraday returns; the first is a typical value when using 5-min returns, the latter corresponds to the use of half-hour returns at the New York Stock Exchange (NYSE).
As the total length of the time series, we take two values of $T$.
For this time series, we generate rolling one-step-ahead forecasts of the conditional variance using a moving time window of length $T/3$, refitted every 10 time steps, for GARCH(1,1), ARCH(1), ARCH(2), and ARCH(7) models. Hence, for $T = 1500$, the fit is based on 500 values, and the DM tests use 1000 forecasts of each model.
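The data-generating process can be sketched as follows (a Python rendition with illustrative parameter values of our own; the paper's simulations use R):

```python
import numpy as np

def simulate_garch_intraday(T=1500, m=13, omega=0.05, alpha=0.1, beta=0.85,
                            rng=None):
    """Simulate a GARCH(1,1) series whose daily return is the sum of m
    intraday returns, with the day's volatility held constant intraday.

    Returns (r, sigma2, rv): daily returns, true conditional variances,
    and realized variances (sums of squared intraday returns).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    sigma2 = np.empty(T)
    r = np.empty(T)
    rv = np.empty(T)
    sigma2[0] = omega / (1.0 - alpha - beta)      # unconditional variance
    for t in range(T):
        if t > 0:
            sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
        eps = rng.normal(0.0, 1.0 / np.sqrt(m), size=m)   # intraday innovations
        intraday = np.sqrt(sigma2[t]) * eps
        r[t] = intraday.sum()                     # daily log return
        rv[t] = (intraday ** 2).sum()             # realized variance proxy
    return r, sigma2, rv
```

Both $r_t^2$ and the realized variance are conditionally unbiased for $\sigma_t^2$, but the realized variance tracks the true conditional variance far more closely, which is the source of the power gains reported below.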
All computations are done in R (R Core Team, 2021) using the R packages rugarch
(Ghalanos, 2020) and fGarch
(Wuertz et al., 2020).
To stabilize the results, the following figures show the means of the results of 50 replications (500 replications for Figures 1-3) of the Diebold-Mariano test. All figures use the three-zone approach described in the previous section with the following modification: we simultaneously show rejection of $H_0^-$ at the levels 0.1, 0.05 and 0.01 by marking in light red, red and dark red, respectively. Marking in light green, green and dark green signals rejection of $H_0^+$ at the three levels. Besides the forecasts from the different (G)ARCH models, we show the result for the optimal forecast, given by the true conditional volatilities. Each figure shows four plot matrices: in the left (right) column, the squared returns (realized volatilities) are used as proxies. In the upper row, the loss function is the mean squared error, whereas in the lower row, the QLIKE loss function is used.
Figure 1 shows the results of Diebold-Mariano tests under normal innovations. The left panels show results for the squared returns as proxies, the right panels for the realized volatilities.
Let us first discuss the results shown in the lower-left panel, i.e. for the QLIKE loss and using the squared returns as proxies. The second value in the left column, +1.532, is the average value of the DM test statistic comparing the forecast from the GARCH(1,1) model with the optimal forecast, the true conditional volatilities. The positive value hints at the superiority of the optimal forecast, but it is not statistically significant. The results are significant when comparing the ARCH(1), ARCH(2) and ARCH(7) models with the optimal forecast; here, the red color indicates a significant rejection of $H_0^-$ at the 0.05 level. The light red entries in the second column show that forecasts from the ARCH(1) and ARCH(2) models are worse than the forecasts from the GARCH(1,1) model (which is the true data-generating process) at level 0.1, but not at the 0.05 level. The forecast from the ARCH(7) model is not significantly worse than that from the GARCH(1,1) model.
We now turn to the lower-right panel with the realized volatilities as proxies. Here, all corresponding entries are marked in dark red, signaling the rejection of $H_0^-$ at level 0.01 in all cases. Hence, the power of the DM test is clearly higher when using realized volatilities compared to squared returns. Looking at the upper row, we see that the results for the MSE are qualitatively similar. However, there are fewer statistically significant entries compared to the QLIKE loss function. Hence, the latter allows for sharper forecast evaluation in this example.
Figure 2 shows results from the same setting as Fig. 1, except that we use the smaller number of intraday returns, corresponding to half-hourly returns. Hence, the results in the two left panels are the same as in Fig. 1 (up to simulation error). The right panels are also similar to those in Fig. 1. A closer look shows that all entries have smaller absolute values, reflecting the decreasing power in differentiating forecasts when fewer intraday returns are available.
In Figure 3, the realized volatilities are replaced by the adjusted intradaily log range, given by
$$\frac{1}{4 \log 2} \Big( \log \max_{s \in (t-1, t]} P_s - \log \min_{s \in (t-1, t]} P_s \Big)^2, \quad (27)$$
where $P_s$ denotes the price process. This volatility proxy is unbiased under the above assumptions; details can be found in Patton (2011), p. 250. All in all, forecast evaluation using the adjusted intradaily log range is less sharp compared to the realized volatilities of the preceding figures, but the power is clearly higher than when using squared returns as proxies.
We also replaced the normal distribution of the intraday innovations by centered, skewed and long-tailed distributions. For this, we used the normal inverse Gaussian (nig) distribution (Barndorff-Nielsen, 1997). Since the class of nig distributions with fixed shape parameters $\alpha$ and $\beta$ is closed under convolution, the distribution of the daily innovations is again of nig type, with location and scale parameters obtained by summing those of the intraday innovations.
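The convolution property invoked here can be stated explicitly; in the common parametrization $\mathrm{nig}(\alpha, \beta, \mu, \delta)$, with $\alpha$ and $\beta$ the shape parameters,

```latex
X_i \sim \mathrm{nig}(\alpha, \beta, \mu_i, \delta_i)\ \text{independent}, \; i = 1, \dots, m
\quad \Longrightarrow \quad
\sum_{i=1}^{m} X_i \sim \mathrm{nig}\Big(\alpha,\, \beta,\, \sum_{i=1}^{m} \mu_i,\, \sum_{i=1}^{m} \delta_i\Big),
```

so i.i.d. intraday innovations aggregate to a daily innovation of the same nig type, with location and scale parameters multiplied by $m$.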
Fig. 4 shows the results of the DM tests as in Fig. 1, now using nig instead of normally distributed innovations. Again, the results are qualitatively comparable to those in Fig. 1, but the absolute values of the entries are generally smaller. Hence, the change in the distribution of the innovations has a negative effect on the power of the test. Note that this decrease in power is larger for the realized volatilities than for the squared returns. This can be explained by the fact that the skewness and kurtosis of the daily innovations are rather modest, with values of 1 and 5.67, whereas the skewness and kurtosis of the intraday innovations are 10 and 269.8, respectively.
5.3 Higher moments
The use of realized higher moments, skewness, and kurtosis to estimate and forecast returns has become quite standard in the literature. For example, Neuberger (2012) analyzed realized skewness and showed that high-frequency data can be used to provide more efficient estimates of the skewness in price changes over a period. Amaya et al. (2015) constructed measures of realized daily skewness and kurtosis based on intraday returns, and analyzed moment-based portfolios. Recently, Shen et al. (2018) discussed the explanatory power of higher realized moments.
5.3.1 Third moment
We are now interested in the conditional third moment $\mathbb{E}[r_{t+1}^3 \mid \mathcal{F}_t]$, i.e. $h(y) = y^3$. Possible proxies are the cubed return $r_{t+1}^3$ and the realized third moment $\sum_{i=1}^m r_{t+1,i}^3$. We use the GARCH(1,1) model of Subsection 5.2, with the normal inverse Gaussian distribution for the innovations. Under this model, the conditional third moment equals $\sigma_{t+1}^3$ times the third moment of the daily innovations. Since the intraday innovations are independent and centered, the cross terms in the cubed daily return have zero expectation, so the realized third moment is an unbiased estimator of the conditional third moment. As forecast of the conditional third moment, we use $\hat\sigma_{t+1}^3$ multiplied by the third moment of the daily innovations, where $\hat\sigma_{t+1}$ denotes the one-step-ahead volatility forecast from the different (G)ARCH models.

Figure 5 shows the results of Diebold-Mariano tests under nig innovations, with parameters chosen such that the innovations are markedly skewed. The kurtosis of the intraday innovations is 37.67, compared to the values 1 and 5.67 for the skewness and kurtosis of the daily innovations. Here, we use half-hourly intraday returns. The left panels show the results for the cubed returns, the right panels for the realized third moment.
At first glance, the results seem rather different from the corresponding ones for the volatility, since the number of significant entries is much lower (cf. Fig. 2). But they point in the same direction: use of the realized moments increases the power of the DM test when the optimal forecast competes against the other models, or when the true data-generating process is compared with ARCH models.
We have also considered a variant of this setting in the simulations; the results (not shown) point in the same direction, but none of the values is statistically significant, even at the 0.1 level.
5.3.2 Fourth moment
Here, we are interested in the conditional fourth moment $\mathbb{E}[r_{t+1}^4 \mid \mathcal{F}_t]$. Again, we use the GARCH(1,1) model as in Subsection 5.2, under which the conditional fourth moment equals $\sigma_{t+1}^4$ times the kurtosis of the daily innovations.
Hence, unbiased proxies are the fourth power of the daily return, $r_{t+1}^4$, and the realized corrected fourth moment, i.e. the realized fourth moment $\sum_{i=1}^m r_{t+1,i}^4$ with a correction that accounts for the nonzero expectations of the intraday cross terms.
As forecast of the conditional fourth moment, we use $\hat\sigma_{t+1}^4$ multiplied by the kurtosis of the daily innovations, with $\hat\sigma_{t+1}$ the one-step-ahead volatility forecast.
The left and right panels of Fig. 6 show the results of the DM tests, using the fourth power of the daily returns and the realized corrected fourth moment as proxies, respectively. The innovations are normally distributed. The general picture strongly resembles the results of the volatility forecasts in Fig. 2, and all conclusions also apply here, even though the actual entries are a bit smaller.
When replacing the normal by the nig innovations, the power of the DM test decreases strongly (cf. Fig. 7). On the other hand, the entries are somewhat larger than in forecasting the third moment. Here, at least a few values are significant at the 0.1 level.
5.4 An apARCH model for the fourth moment
Instead of modeling the volatility and computing higher moments under this process, it is also possible to use suitable models for higher moments directly. Harvey and Siddique (1999, 2000), for example, considered an autoregressive model for conditional skewness.
Lambert and Laurent (2002) used the asymmetric power (G)ARCH or apARCH model of Ding et al. (1993) to describe dynamics in skewed location-scale distributions. Brooks et al. (2005) used both separate and joint GARCH models for conditional variance and conditional kurtosis, whereas Lau (2015) modeled (standardized) realized moments by an exponentially weighted moving average.

Hence, in this section, we model the fourth moment directly by an asymmetric power ARCH (apARCH) process (Ding et al., 1993). Specifically, the log returns $r_t = \sigma_t \varepsilon_t$ follow an apARCH(1,1) model with
$$\sigma_t^\delta = \omega + \alpha\, \big( |r_{t-1}| - \gamma\, r_{t-1} \big)^\delta + \beta\, \sigma_{t-1}^\delta,$$
with exponent $\delta = 4$, where the innovations $\varepsilon_t$ are i.i.d. with mean zero and unit variance, $|\gamma| < 1$, and all innovations are independent. Assuming again that $\sigma_t$ is constant on day $t$, intraday returns are defined as in Subsection 5.2. We choose $\omega$ such that the unconditional variance matches the earlier simulations. As in the last section, unbiased proxies for the conditional fourth moment are the fourth power of the daily return and the realized corrected fourth moment. As forecast, we use the one-step-ahead forecast of the conditional fourth moment from the different apARCH models, namely apARCH(1,1), apARCH(1), apARCH(2), and apARCH(3).
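A minimal sketch of the apARCH recursion with exponent $\delta = 4$ (parameter values are our own, chosen only so that the fourth-moment recursion stays stationary):

```python
import numpy as np

def simulate_aparch(T=1500, delta=4.0, omega=0.1, alpha=0.03, beta=0.8,
                    gamma=0.3, rng=None):
    """Simulate log returns r_t = sigma_t * eps_t with the apARCH recursion
    sigma_t^delta = omega + alpha*(|r_{t-1}| - gamma*r_{t-1})^delta
                    + beta*sigma_{t-1}^delta   (Ding et al., 1993).

    Returns (r, s_delta): the returns and the series sigma_t^delta.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    s_delta = np.empty(T)            # sigma_t^delta
    r = np.empty(T)
    s_delta[0] = omega / (1.0 - beta)
    for t in range(T):
        if t > 0:
            shock = (abs(r[t - 1]) - gamma * r[t - 1]) ** delta
            s_delta[t] = omega + alpha * shock + beta * s_delta[t - 1]
        sigma = s_delta[t] ** (1.0 / delta)
        r[t] = sigma * rng.normal()
    return r, s_delta
```

With $|\gamma| < 1$, the term $|r_{t-1}| - \gamma r_{t-1}$ is always nonnegative, and $\gamma > 0$ makes negative returns raise $\sigma_t^\delta$ more than positive ones of the same size, the usual leverage effect.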
The left and right panels of Figure 8 show the results of the DM tests for the apARCH process with exponent 4, using the fourth power of the daily return and the realized corrected fourth moment, respectively, as proxies.
The visual comparison of the upper-left and lower-right panels is striking: in the latter, each result is significant, whereas the former shows no significant entries. Hence, using high-frequency data and a suitable loss function results in a highly improved forecast evaluation. Generally, the results are quite similar to those for the fourth moment based on the GARCH process in Fig. 6.
Finally, we consider again volatility forecasts, but now based on the apARCH process with exponent 4. The results are shown in Figure 9. We see a slight increase in power compared to Fig. 8; similarly to the GARCH process, differentiating between volatility forecasts is easier than differentiating between forecasts of the fourth moment for the apARCH process at hand.
To sum up the results of the simulations: using high-frequency data for the proxies improves the forecast evaluation in each example, and in most cases the effect is substantial. There is also an effect of the choice of the loss function: the power of the DM test improves when using the QLIKE loss compared to the MSE loss function.
Acknowledgements
We would like to thank Andrew Patton and Tilmann Gneiting for pointing out some relevant references, in particular the paper by Hoga and Dimitriadis (2021).
References
Amaya, D., Christoffersen, P., Jacobs, K., and Vasquez, A. (2015). Does realized skewness predict the cross-section of equity returns? Journal of Financial Economics, 118:135–167.
Barndorff-Nielsen, O. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics, 24:1–13.
Brooks, C., Burke, S. P., Heravi, S., and Persand, G. (2005). Autoregressive conditional kurtosis. Journal of Financial Econometrics, 3:399–421.
Corsi, F. (2009). A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics, 7(2):174–196.
Diebold, F. X. (2015). Comparing predictive accuracy, twenty years later: A personal perspective on the use and abuse of Diebold-Mariano tests. Journal of Business and Economic Statistics, 33:1–24.
Diebold, F. X. and Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business and Economic Statistics, 13:253–263.
Diks, C., Panchenko, V., and van Dijk, D. (2011). Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163:215–230.
Ding, Z., Granger, C., and Engle, R. (1993). A long memory property of stock market returns and a new model. Journal of Empirical Finance, 1:83–106.
Fissler, T., Ziegel, J. F., and Gneiting, T. (2015). Expected shortfall is jointly elicitable with value at risk: implications for backtesting. arXiv preprint arXiv:1507.00244.
Fissler, T., Ziegel, J. F., and Gneiting, T. (2016). Expected shortfall is jointly elicitable with value at risk: implications for backtesting. Risk Magazine, January:58–61.
Ghalanos, A. (2020). rugarch: Univariate GARCH models. R package version 1.4-4.
Giacomini, R. and White, H. (2006). Tests of conditional predictive ability. Econometrica, 74:1545–1578.
Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762.
Gneiting, T. and Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business and Economic Statistics, 29:411–422.
Hansen, P. R. and Lunde, A. (2006). Consistent ranking of volatility models. Journal of Econometrics, 131(1-2):97–121.
Harvey, C. R. and Siddique, A. (1999). Autoregressive conditional skewness. The Journal of Financial and Quantitative Analysis, 34:465–487.
Harvey, C. R. and Siddique, A. (2000). Conditional skewness in asset pricing tests. Journal of Finance, 55:1263–1295.
Hoga, Y. and Dimitriadis, T. (2021). On testing equal conditional predictive ability under measurement error. arXiv preprint arXiv:2106.11104.
Koopman, S. J., Jungbacker, B., and Hol, E. (2005). Forecasting daily variability of the S&P 100 stock index using historical, realised and implied volatility measurements. Journal of Empirical Finance, 12(3):445–475.
Lambert, P. and Laurent, S. (2002). Modeling skewness dynamics in series of financial data using skewed location-scale distributions. Working Paper, Université Catholique de Louvain and Université de Liège.
Lau, C. (2015). A simple normal inverse Gaussian-type approach to calculate value-at-risk based on realized moments. Journal of Risk, 17:1–18.
Laurent, S., Rombouts, J. V., and Violante, F. (2013). On loss functions and ranking forecasting performances of multivariate volatility models. Journal of Econometrics, 173(1):1–10.
Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F., and Gneiting, T. (2017). Forecaster's dilemma: Extreme events and forecast evaluation. Statistical Science, 32:106–127.
Neuberger, A. (2012). Realized skewness. Review of Financial Studies, 25:3423–3455.
Nolde, N. and Ziegel, J. F. (2017). Elicitability and backtesting: Perspectives for banking regulation. The Annals of Applied Statistics, 11:1833–1874.
Patton, A. J. (2011). Volatility forecast comparison using imperfect volatility proxies. Journal of Econometrics, 160:246–256.
Patton, A. J. (2020). Comparing possibly misspecified forecasts. Journal of Business & Economic Statistics, 38(4):796–809.
R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801.
Shen, K., Yao, J., and Li, W. K. (2018). On the surprising explanatory power of higher realized moments in practice. Statistics and Its Interface, 11:153–168.
Steinwart, I., Pasin, C., Williamson, R., and Zhang, S. (2014). Elicitation and identification of properties. In Conference on Learning Theory, pages 482–526. PMLR.
Wuertz, D., Setz, T., Chalabi, Y., Boudt, C., Chausse, P., and Miklovac, M. (2020). fGarch: Rmetrics - Autoregressive Conditional Heteroskedastic Modelling. R package version 3042.83.2.