
# Using Proxies to Improve Forecast Evaluation

Comparative evaluation of forecasts of statistical functionals relies on comparing averaged losses of competing forecasts after the realization of the quantity Y, on which the functional is based, has been observed. Motivated by high-frequency finance, in this paper we investigate how proxies Ỹ for Y - say volatility proxies - which are observed together with Y can be utilized to improve forecast comparisons. We extend previous results on the robustness of loss functions from the mean to general moments and ratios of moments, and show, in terms of the variance of loss differences, that using proxies increases the power of comparative forecast tests. These results apply to testing both conditional and unconditional dominance. Finally, we illustrate the theoretical results on simulated high-frequency data.


## 1 Introduction

Comparative evaluation of forecasts of statistical functionals is a standard issue in the realm of forecasting (Gneiting, 2011). It relies on comparing expected or averaged losses of competing forecasts after the realization of the quantity Y, on which the functional is based, has been observed. The aim of this paper is to investigate how proxies Ỹ for Y, which are observed together with Y, can be utilized to improve forecast comparisons in the sense that they result in the same ordering of forecasts but increase the power of comparative forecast tests.

The motivation comes mainly from high-frequency finance, where high-frequency data are routinely used to generate forecasts - say of volatilities - also over moderate time horizons such as daily volatilities (Corsi, 2009). Our investigation shows how high-frequency data can be used to obtain sharper forecast evaluation. In comparative forecast evaluation, this would mean that when comparing two forecasts of daily volatilities in terms of expected values of loss functions, the better forecast can be determined with higher power when using these high-frequency data in the process of forecast evaluation.

Our main point of departure was Patton (2011), who showed that using various noisy volatility proxies - e.g. proxies based on high-frequency data - is valid in comparative forecast evaluation, that is, preserves the order of the expected losses. Hansen and Lunde (2006) have similar results, while Laurent et al. (2013) provide a multivariate generalization of the characterization in Patton (2011), and Koopman et al. (2005) illustrate the use of realized measures for forecast comparisons on various observed high-frequency data sets. We are interested in the comparison of different, possibly misspecified forecasts, a situation in which Patton (2020) shows that the ranking of the forecasts may depend on the loss function.

A very recent, closely related contribution is by Hoga and Dimitriadis (2021), who focus on predicting the mean, and illustrate their methods for GDP forecasts. Our contributions and their relation to the results in Hoga and Dimitriadis (2021) can be summarized as follows.

1. We extend the analysis of Patton (2011) and Hansen and Lunde (2006) on the validity of using various proxies from volatilities, that is, second moments, to general moments, and beyond to ratios of moments. In the terminology of Hoga and Dimitriadis (2021), we rely on the concept of exact robustness of loss functions instead of ordering robustness as considered in Patton (2011). Hoga and Dimitriadis (2021) assume that the proxy enters the loss difference in the same way as the original observation, and in this setting show that only the mean allows for exactly robust loss functions.

2. We formally show in terms of the variance of differences of losses that using proxies will increase the power in comparative forecast testing by decreasing the variance of loss differences. Hoga and Dimitriadis (2021) have similar results for the mean, and also investigate formally the consequence on the power of Diebold-Mariano tests under local alternatives.

3. We show that our results apply both when testing conditional as well as unconditional dominance, see Nolde and Ziegel (2017) for these notions, while Hoga and Dimitriadis (2021) focus on conditional dominance testing.

4. Finally, we illustrate the theoretical results for simulated high-frequency data using the three-zone approach from Fissler et al. (2015) and Nolde and Ziegel (2017). We show that the choice of the proxies, as well as the choice of the loss function, has a pronounced effect on the comparative evaluation of forecasts: using high-frequency proxies and the QLIKE loss substantially improves the forecast evaluation.

The paper is structured as follows. In Section 2, we start with a motivating example; we recall strictly consistent loss functions for statistical functionals, and introduce a dynamic framework for forecast evaluation. Section 3 investigates the use of proxies to improve evaluations of forecasts of moments, and in Section 4 this is further discussed and extended to ratios of moments. Section 5 summarizes the results of a simulation study, where we consider comparing forecasts for second, third, and fourth moments for GARCH-type time series.

## 2 Motivation and Basic Concepts

### 2.1 A motivating example

Let us first illustrate the use of different proxies and loss functions in Diebold-Mariano tests for equal forecast performance. Consider the following stylized scenario, where the aim is to distinguish between two competing forecasts. Observations correspond to logarithmic returns, and we assume that the true data generating process is a simple GARCH(1,1) model. The total length of the time series is T, and we consider rolling one-step-ahead forecasts of the conditional variance using a moving time window with window length T/3, resulting in n = 2T/3 forecasts.

There are four forecasters. The first one is lucky enough to use a GARCH(1,1) model for making predictions; forecasters 2, 3, and 4 use ARCH(1), ARCH(2), and ARCH(7) models, respectively. Clearly, we expect the predictions from forecaster 1 to outperform, in some sense, the other predictions. Moreover, we would expect that the ARCH(7) model beats the ARCH(1) and ARCH(2) models, since the former should be a better approximation to a GARCH(1,1) process. Hence, our interest focuses on null hypotheses of the form

 H₀: Forecast h_{1,t} predicts at least as well as forecast h_{2,t},

and if H₀ is rejected, then h_{1,t} is worse than h_{2,t}. To decide for or against H₀, we use a Diebold-Mariano test based on the average loss difference

 Δ̄_n L = (1/n) ∑_{t=1}^n [L(h_{1,t}, y_{t+1}) − L(h_{2,t}, y_{t+1})],

where L is some loss function, and y_{t+1} materializes at day t+1. Under suitable conditions, the studentized test statistic S = √n Δ̄_n L / τ̂, where τ̂² estimates the asymptotic variance of the loss differences, has a limiting standard normal distribution, and H₀ is rejected for large values of S.
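To make the test statistic concrete, the computation can be sketched in a few lines of Python (an illustration with our own choices, not taken from the paper: a simple lag-0 variance estimator and two constant variance forecasts evaluated against a noisy unbiased proxy):

```python
import numpy as np

def dm_statistic(loss1, loss2):
    """Studentized Diebold-Mariano statistic for the loss differences
    d_t = L(h1_t, y_{t+1}) - L(h2_t, y_{t+1}); large positive values
    speak against forecast h1."""
    d = np.asarray(loss1, dtype=float) - np.asarray(loss2, dtype=float)
    n = d.size
    tau2 = d.var()                       # simple lag-0 variance estimator
    return np.sqrt(n) * d.mean() / np.sqrt(tau2)

# Example: two constant variance forecasts evaluated with the MSE loss
# against a noisy unbiased proxy of a true variance of 1
rng = np.random.default_rng(1)
proxy = rng.chisquare(1, size=1000)      # stand-in for squared returns
s = dm_statistic((1.0 - proxy) ** 2, (1.5 - proxy) ** 2)  # negative: 1.0 is better
```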

To evaluate the forecasts, the mean squared error loss L(x, y) = (x − y)² is used, together with the squared returns r²_{t+1} as an unbiased proxy for the true volatility. The first line of the following table shows the results of the Diebold-Mariano test when the different predictions are compared to the GARCH(1,1) model. Even though the values are positive (hence slightly favor the GARCH model), they are nowhere near statistically significant. The comparison of the ARCH(7) model with the other two ARCH models even results in values close to zero.

|            | GARCH(1,1) | ARCH(1) | ARCH(2) | ARCH(7) |
|------------|------------|---------|---------|---------|
| GARCH(1,1) | -          | 0.788   | 1.010   | 0.984   |
| ARCH(7)    | -0.984     | -0.193  | 0.152   | -       |

Next, assume that, besides the daily returns, also 5-min returns are available for predicting the next day's volatility. Thus, the squared returns are replaced by the realized volatilities RV_{t+1} = ∑_{i=1}^m r²_{t+1,i}, where the r_{t+1,i} are the intraday log returns. The outcomes of the Diebold-Mariano tests for equal predictive performance, now using the realized volatilities as proxies, are as follows:

|            | GARCH(1,1) | ARCH(1) | ARCH(2) | ARCH(7) |
|------------|------------|---------|---------|---------|
| GARCH(1,1) | -          | 3.976   | 3.924   | 3.506   |
| ARCH(7)    | -3.506     | 2.474   | 2.352   | -       |

In comparison with the first table, the values in the first line are much larger, being statistically significant even at the 0.01-level, and indicating the dominance of the prediction under the GARCH(1,1) model. The comparison of the ARCH(7) model with the other two ARCH models favors the ARCH(7) model, at least at the 0.05-level.

Finally, the evaluator decides to replace the MSE with the QLIKE loss function L(x, y) = y/x − log(y/x) − 1, and obtains the following results of the corresponding Diebold-Mariano tests.

|            | GARCH(1,1) | ARCH(1) | ARCH(2) | ARCH(7) |
|------------|------------|---------|---------|---------|
| GARCH(1,1) | -          | 5.589   | 5.523   | 4.386   |
| ARCH(7)    | -4.386     | 3.642   | 3.492   | -       |

Now, all entries are even larger in absolute terms, and the ARCH(7) model dominates the competing ARCH models even at the 0.01-level.

Clearly, these results are based on a specific realization of the time series. However, a closer look at this example in Section 5 reveals that this behavior is rather typical.

### 2.2 Loss functions, statistical functionals and dynamic forecasts

We start by recalling the concept of strictly consistent loss (or scoring) functions, see Gneiting (2011). Let Θ be a class of distribution functions F on a closed subset I ⊂ ℝ, which we identify with their associated probability distributions, and let T: Θ → ℝ be a (one-dimensional) statistical functional (or parameter).

A loss function (also scoring function) is a measurable map L: ℝ × ℝ → ℝ. It is interpreted as the loss L(x, y) if forecast x is issued and y materializes. L is consistent for the parameter or functional T relative to the class Θ, if

 for all x ∈ ℝ, F ∈ Θ: E_F[L(T(F), Y)] ≤ E_F[L(x, Y)], (1)

where E_F indicates that the expectation is taken under the distribution F for Y, and we assume that the relevant expected values exist and are finite. Thus, the true functional value minimizes the expected loss under F. If, in addition,

 E_F[L(T(F), Y)] = E_F[L(x, Y)] implies that x = T(F),

then L is strictly consistent for T. The functional T is called elicitable relative to the class Θ if it admits a strictly consistent loss function. For several functionals such as moments, quantiles, and expectiles, Gneiting (2011) characterizes all strictly consistent loss functions under some smoothness and normalization conditions. See also Steinwart et al. (2014).
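As a numerical illustration of strict consistency, the following sketch checks that Monte Carlo approximations of the expected loss are minimized near the true functional value E[Y], both for the squared-error loss and for the QLIKE loss L(x, y) = y/x − log(y/x) − 1 (both are of Bregman type for the mean); the gamma distribution and the grid are our own illustrative choices:

```python
import numpy as np

# Monte Carlo expected losses over a grid of candidate forecasts x;
# both losses should be minimized at (a grid point close to) E[Y] = 3.0.
rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.5, size=200_000)   # E[Y] = 3.0, Y > 0

grid = np.linspace(1.0, 5.0, 401)
mse = [np.mean((x - y) ** 2) for x in grid]                  # squared error
qlike = [np.mean(y / x - np.log(y / x) - 1.0) for x in grid] # QLIKE

x_star_mse = grid[np.argmin(mse)]       # close to E[Y] = 3.0
x_star_qlike = grid[np.argmin(qlike)]   # also close to E[Y] = 3.0
```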

When comparing two forecasts x and x′ for a given F, and hence for the parameter T(F), we say that x dominates x′ under F for the loss L if the difference of expected losses satisfies

 E_F[L(x, Y)] − E_F[L(x′, Y)] < 0. (2)

By (1), for a strictly consistent loss function, the true parameter value T(F) dominates any other forecast.

Now let us consider a forecasting situation. Forecasts are issued on the basis of certain information. Let (Ω, A, P) be a probability space, and let F_t be a sub-σ-algebra of A, the information set at time t, on the basis of which the forecast is issued. In finance, F_t can include returns (including high-frequency returns) up to time t as well as other covariates observed up to time t.

The aim is to predict the functional T, say the mean or the volatility, of the random variable Y_{t+1}, which will be observed at time t+1 (say one day ahead). For example, Y_{t+1} may be the return from t to t+1 over one day. More precisely, if F_{Y_{t+1}|F_t} denotes the conditional distribution of Y_{t+1} given F_t, then the parameter of interest is

 Ŷ_t = T(F_{Y_{t+1}|F_t}(ω, ⋅)).

We note that

• the forecast is based on the full information F_t up to time t. Thus, even if Y_{t+1} is a return over one day, for the forecast we use e.g. high-frequency data up to time t if these are included in F_t,

• the observation Y_{t+1} is special since the parameter is defined via its conditional distribution given F_t.

Thus, to generate the forecast, even if the time horizon for the forecast is, say, one day, it is standard to use high-frequency information contained in F_t up to time t. Now, the issue in this paper is how to use additional information contained in F_{t+1}, available at time t+1, to improve forecast evaluation.

### 2.3 Comparative forecast evaluation

First let us recall the setting for comparative forecast evaluation based on Y_{t+1}.

A forecast at time t is - in great generality - an F_t-measurable random variable Z_t. Now, if L is a strictly consistent loss function for T, then compared to the true parameter value Ŷ_t, we have the following:

Conditional dominance. It holds that

 E[L(Ŷ_t, Y_{t+1}) | F_t](ω) ≤ E[L(Z_t, Y_{t+1}) | F_t](ω) for P-a.e. ω ∈ Ω, (3)

Unconditional dominance. Further, it holds that

 E[L(Ŷ_t, Y_{t+1})] ≤ E[L(Z_t, Y_{t+1})], (4)

with equality in (3) or (4) if and only if Z_t = Ŷ_t almost surely. When comparing with the true conditional value of the functional, Ŷ_t, these two notions coincide.

Note that Y_{t+1} is used in the comparisons (3) and (4) by default.

We shall generally compare two potentially misspecified forecasts, that is, F_t-measurable random variables Z_t and Z′_t. Then, by definition, Z′_t conditionally dominates Z_t for the loss function L if

 E[L(Z′_t, Y_{t+1}) | F_t](ω) ≤ E[L(Z_t, Y_{t+1}) | F_t](ω) for P-a.e. ω ∈ Ω, (5)

with strict inequality on a set of positive probability, while Z′_t unconditionally dominates Z_t for the loss function L if

 E[L(Z′_t, Y_{t+1})] < E[L(Z_t, Y_{t+1})], (6)

a weaker notion. Now, we consider how additional information contained in F_{t+1} (apart from Y_{t+1}) may be used for forecast evaluation. In the context of high-frequency financial data, apart from using the high-frequency data to generate forecasts over daily time horizons, we shall investigate how to use these high-frequency data to obtain sharper forecast evaluation.

## 3 Proxies when Comparing Forecasts of Moments

Suppose that I ⊂ ℝ is an interval and h: I → ℝ is a measurable function such that T(F) = E_F[h(Y)] ∈ I for all F ∈ Θ. Then a classical result by Savage (1971), see also Gneiting (2011), characterizes the strictly consistent scoring functions for T in the form

 L(x, y) = ϕ(y) − ϕ(x) − ϕ′(x)(h(y) − x),  y, x ∈ I, (7)

where ϕ is a strictly convex differentiable function for which the relevant expectations are finite for all F ∈ Θ.

First, we formulate the following lemma in the static framework.

###### Lemma 1.

Consider (7) and forecasts x, x′ ∈ I.

1. The loss difference

 L(x, y) − L(x′, y) = ϕ(x′) − x′ϕ′(x′) − ϕ(x) + xϕ′(x) + (ϕ′(x′) − ϕ′(x))h(y) (8)
  =: LDiff(x, x′, h(y))

depends on y only through h(y).

2. If Y ∼ F and Ỹ is a random variable (with a given distribution) such that E[Ỹ] = E_F[h(Y)] (the moment is the same), then

 E_F[LDiff(x, x′, h(Y))] = E[LDiff(x, x′, Ỹ)]. (9)

3. We have that

 Var_F(LDiff(x, x′, h(Y))) = (ϕ′(x′) − ϕ′(x))² Var_F(h(Y)). (10)

Consequently, if in addition to ii) it holds that Var(Ỹ) ≤ Var_F(h(Y)), then we have that

 Var(LDiff(x, x′, Ỹ)) ≤ Var_F(LDiff(x, x′, h(Y))). (11)

Here, Ỹ plays the role of the proxy that shall be used to improve forecast evaluation. Part (ii) shows that using Ỹ instead of h(Y) is valid if E[Ỹ] = E_F[h(Y)], in the sense that dominance relations of forecasts are preserved when using Ỹ, while (11) shows that the evaluation of score differences is actually sharper based on Ỹ instead of h(Y) if Var(Ỹ) ≤ Var_F(h(Y)).

Hoga and Dimitriadis (2021) call the equality of loss differences in (9) exact robustness. When assuming that the proxy enters the loss difference in the same fashion as Y, they show that exact robustness can only hold for strictly consistent scoring functions of the mean. In our more flexible approach, we cover general moments and also ratios of moments, see below.

###### Proof.

Part 1. is easily checked by inserting the loss function (7).

Concerning part 2., inserting from (8), we get by assumption

 E_F[LDiff(x, x′, h(Y))] − E[LDiff(x, x′, Ỹ)] = (ϕ′(x′) − ϕ′(x))(E_F[h(Y)] − E[Ỹ]) = 0.

Part 3. follows similarly easily. ∎
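The algebra of Lemma 1 can be checked numerically. The sketch below uses the second moment, h(y) = y², with the illustrative Bregman function ϕ(x) = x² (our choice, not one made in the paper); since the loss difference is affine in h(y), the variance identity (10) holds exactly, sample by sample:

```python
import numpy as np

# Illustrative check of Lemma 1 for the second moment h(y) = y^2
phi = lambda x: x ** 2
dphi = lambda x: 2.0 * x
h = lambda y: y ** 2

def loss(x, y):
    # the Savage form (7)
    return phi(y) - phi(x) - dphi(x) * (h(y) - x)

def ldiff(x, xp, hy):
    # the loss difference (8): affine in h(y)
    return (phi(xp) - xp * dphi(xp) - phi(x) + x * dphi(x)
            + (dphi(xp) - dphi(x)) * hy)

rng = np.random.default_rng(0)
y = rng.standard_t(df=8, size=100_000)
x, xp = 0.9, 1.3

# (8): the loss difference depends on y only through h(y)
assert np.allclose(loss(x, y) - loss(xp, y), ldiff(x, xp, h(y)))

# (10): Var(LDiff) = (phi'(x') - phi'(x))^2 Var(h(Y))
assert np.isclose(np.var(ldiff(x, xp, h(y))),
                  (dphi(xp) - dphi(x)) ** 2 * np.var(h(y)))
```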

Now, let us turn to the dynamic setting described in Section 2.

###### Theorem 2 (Forecast dominance testing).

Consider forecasting the conditional moment E[h(Y_{t+1}) | F_t], and suppose that the proxy Ỹ_{t+1} is F_{t+1}-measurable with

 E[Ỹ_{t+1} | F_t] = E[h(Y_{t+1}) | F_t] a.s. (12)

1. For the loss difference (8), for any two forecasts Z_t and Z′_t (F_t-measurable random variables),

 E[LDiff(Z_t, Z′_t, h(Y_{t+1})) | F_t] = E[LDiff(Z_t, Z′_t, Ỹ_{t+1}) | F_t], (13)

and hence in particular

 E[LDiff(Z_t, Z′_t, h(Y_{t+1}))] = E[LDiff(Z_t, Z′_t, Ỹ_{t+1})]. (14)

Thus, both conditional as well as unconditional dominance are preserved when using Ỹ_{t+1} instead of h(Y_{t+1}) in the forecast comparison.

2. If in addition to (12) we have that

 Var(Ỹ_{t+1} | F_t) ≤ Var(h(Y_{t+1}) | F_t) a.s.,

then

 Var(LDiff(Z_t, Z′_t, Ỹ_{t+1}) | F_t) ≤ Var(LDiff(Z_t, Z′_t, h(Y_{t+1})) | F_t) (15)

as well as

 Var(LDiff(Z_t, Z′_t, Ỹ_{t+1})) ≤ Var(LDiff(Z_t, Z′_t, h(Y_{t+1}))). (16)

The second part of the theorem shows that a variance reduction is achieved both for testing conditional and for testing unconditional dominance.

###### Proof.

(i): (13) is (9) in Lemma 1, conditional on F_t, while (14) follows from (13) by taking expected values.

(ii): (15) is (11) in Lemma 1, (iii), conditional on F_t. As for (16), by the law of total variance,

 Var(LDiff(Z_t, Z′_t, Ỹ_{t+1})) = E[Var(LDiff(Z_t, Z′_t, Ỹ_{t+1}) | F_t)] + Var(E[LDiff(Z_t, Z′_t, Ỹ_{t+1}) | F_t]).

Since by (13),

 Var(E[LDiff(Z_t, Z′_t, Ỹ_{t+1}) | F_t]) = Var(E[LDiff(Z_t, Z′_t, h(Y_{t+1})) | F_t]),

the conclusion follows since

 E[Var(LDiff(Z_t, Z′_t, Ỹ_{t+1}) | F_t)] = E[(ϕ′(Z′_t) − ϕ′(Z_t))² Var(Ỹ_{t+1} | F_t)]
  ≤ E[(ϕ′(Z′_t) − ϕ′(Z_t))² Var(h(Y_{t+1}) | F_t)] = E[Var(LDiff(Z_t, Z′_t, h(Y_{t+1})) | F_t)]. ∎

## 4 Ratios of Moments and Further Parameters

Suppose that I ⊂ ℝ is an interval, and h, s: I → ℝ are measurable functions such that E_F[s(Y)] ≠ 0 and T(F) ∈ I for all F ∈ Θ. The target parameter is

 T(F) = E_F[h(Y)] / E_F[s(Y)].

Gneiting (2011) shows that strictly consistent loss functions for T are of the form

 L(x, y) = s(y)(ϕ(y) − ϕ(x)) − ϕ′(x)(h(y) − x s(y)) − ϕ′(y)(h(y) − y s(y)),  y, x ∈ I, (17)

where it is additionally assumed that

 E_F[|h(Y)| |ϕ′(Y)|] < ∞, E_F[|s(Y)| |ϕ(Y)|] < ∞, E_F[|Y| |s(Y)| |ϕ(Y)|] < ∞, F ∈ Θ.
###### Lemma 3.

Consider (17) and forecasts x, x′ ∈ I.

1. The loss difference

 L(x, y) − L(x′, y) = (ϕ′(x′) − ϕ′(x))h(y) + (xϕ′(x) − ϕ(x) − x′ϕ′(x′) + ϕ(x′))s(y) (18)
  =: LDiff(x, x′, h(y), s(y))

depends on y only through h(y) and s(y).

2. If Y ∼ F, and Ỹ₁, Ỹ₂ are random variables (with given distributions) such that E[Ỹ₁] = E_F[h(Y)] and E[Ỹ₂] = E_F[s(Y)] (the moments are the same), then

 E_F[LDiff(x, x′, h(Y), s(Y))] = E[LDiff(x, x′, Ỹ₁, Ỹ₂)]. (19)

3. If in addition to (ii) we have, with V = ϕ′(x′) − ϕ′(x) and W = xϕ′(x) − ϕ(x) − x′ϕ′(x′) + ϕ(x′), that

 V² Var(Ỹ₁) + W² Var(Ỹ₂) + 2VW Cov(Ỹ₁, Ỹ₂) ≤ V² Var_F(h(Y)) + W² Var_F(s(Y)) + 2VW Cov_F(h(Y), s(Y)),

then

 Var(LDiff(x, x′, Ỹ₁, Ỹ₂)) ≤ Var_F(LDiff(x, x′, h(Y), s(Y))). (20)
###### Proof.

The form (18) of the loss difference follows directly from inserting (17). Then (19) and (20) follow immediately from the form of the loss difference. ∎
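As for Lemma 1, the loss-difference form (18) can be verified numerically; h(y) = y³, s(y) = y² (a ratio of the third to the second moment) and ϕ(x) = x² below are our own illustrative choices, not choices made in the paper:

```python
import numpy as np

# Illustrative check of Lemma 3 for the ratio E_F[h(Y)] / E_F[s(Y)]
phi = lambda x: x ** 2
dphi = lambda x: 2.0 * x
h = lambda y: y ** 3
s = lambda y: y ** 2

def loss(x, y):
    # the characterization (17)
    return (s(y) * (phi(y) - phi(x)) - dphi(x) * (h(y) - x * s(y))
            - dphi(y) * (h(y) - y * s(y)))

def ldiff(x, xp, hy, sy):
    # the loss difference (18): affine in (h(y), s(y))
    return ((dphi(xp) - dphi(x)) * hy
            + (x * dphi(x) - phi(x) - xp * dphi(xp) + phi(xp)) * sy)

rng = np.random.default_rng(2)
y = rng.gamma(2.0, 1.0, size=50_000)
assert np.allclose(loss(1.1, y) - loss(0.7, y), ldiff(1.1, 0.7, h(y), s(y)))
```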

###### Theorem 4 (Forecast dominance testing: Ratio of moments).

Consider forecasting the ratio of conditional moments E[h(Y_{t+1}) | F_t] / E[s(Y_{t+1}) | F_t], and suppose that the proxies Ỹ⁽¹⁾_{t+1}, Ỹ⁽²⁾_{t+1} are F_{t+1}-measurable with

 E[Ỹ⁽¹⁾_{t+1} | F_t] = E[h(Y_{t+1}) | F_t], E[Ỹ⁽²⁾_{t+1} | F_t] = E[s(Y_{t+1}) | F_t] a.s. (21)

1. For the loss difference (18), for any two forecasts Z_t and Z′_t (F_t-measurable random variables),

 E[LDiff(Z_t, Z′_t, h(Y_{t+1}), s(Y_{t+1})) | F_t] = E[LDiff(Z_t, Z′_t, Ỹ⁽¹⁾_{t+1}, Ỹ⁽²⁾_{t+1}) | F_t], (22)

and hence in particular

 E[LDiff(Z_t, Z′_t, h(Y_{t+1}), s(Y_{t+1}))] = E[LDiff(Z_t, Z′_t, Ỹ⁽¹⁾_{t+1}, Ỹ⁽²⁾_{t+1})]. (23)

2. If in addition to (21) we have, with V = ϕ′(Z′_t) − ϕ′(Z_t) and W = Z_tϕ′(Z_t) − ϕ(Z_t) − Z′_tϕ′(Z′_t) + ϕ(Z′_t), that

 V² Var(Ỹ⁽¹⁾_{t+1} | F_t) + W² Var(Ỹ⁽²⁾_{t+1} | F_t) + 2VW Cov(Ỹ⁽¹⁾_{t+1}, Ỹ⁽²⁾_{t+1} | F_t)
 ≤ V² Var(h(Y_{t+1}) | F_t) + W² Var(s(Y_{t+1}) | F_t) + 2VW Cov(h(Y_{t+1}), s(Y_{t+1}) | F_t),

then

 Var(LDiff(Z_t, Z′_t, Ỹ⁽¹⁾_{t+1}, Ỹ⁽²⁾_{t+1}) | F_t) ≤ Var(LDiff(Z_t, Z′_t, h(Y_{t+1}), s(Y_{t+1})) | F_t) (24)

as well as

 Var(LDiff(Z_t, Z′_t, Ỹ⁽¹⁾_{t+1}, Ỹ⁽²⁾_{t+1})) ≤ Var(LDiff(Z_t, Z′_t, h(Y_{t+1}), s(Y_{t+1}))). (25)

The proof is immediate from Lemma 3, and the final inequality (25) follows as (16) in Theorem 2. The condition for a potential variance reduction in Theorem 4, (ii), is more restrictive than that from Theorem 2, (ii): apart from relating the conditional variances of h(Y_{t+1}) and s(Y_{t+1}) to those of the proxies, it also involves conditional covariances.

Theorem 4 does not apply to measures such as skewness and kurtosis, which even for centered distributions are known not to allow for strictly consistent scoring functions. However, the revelation principle, Theorem 4 in Gneiting (2011), and the elicitability (existence of a strictly consistent scoring function) and hence joint elicitability of moments imply that for centered distributions, these measures are elicitable when considered together with the second moment. Roughly speaking, for the skewness this involves the two-dimensional parameter consisting of the third and second moment, and for the kurtosis the parameter consisting of the fourth and second moment. The analysis of the corresponding loss differences is then similar to that in Theorem 4.

## 5 Simulations

### 5.1 General setup

Following Nolde and Ziegel (2017), in comparative backtesting, we are interested in the following null hypotheses

 H₀⁻: Forecast h_{1,t} predicts at least as well as h_{2,t},
 H₀⁺: Forecast h_{1,t} predicts at most as well as h_{2,t}.

The forecast h_{2,t} is used as a benchmark. If the hypothesis H₀⁻ is rejected, then h_{1,t} is worse than h_{2,t}; if H₀⁺ is rejected, h_{1,t} is better than h_{2,t}. The error of the first kind for rejecting one of the two hypotheses, even though it is true, can be controlled by the level of significance. As in Nolde and Ziegel (2017), we define

 λ = lim_{n→∞} (1/n) ∑_{t=1}^n E[L(h_{1,t}, Y_{t+1}) − L(h_{2,t}, Y_{t+1})] = E[L(h_{1,t}, Y_{t+1}) − L(h_{2,t}, Y_{t+1})]

(assuming first-order stationarity). Then, dominance of h_{1,t} over h_{2,t} is equivalent to λ < 0, and h_{1,t} predicts at most as well as h_{2,t} if λ ≥ 0. Therefore, the comparative backtesting hypotheses can be reformulated as

 H₀⁻: λ ≤ 0,  H₀⁺: λ ≥ 0.

Forecast equality can be tested with the so-called Diebold-Mariano test (Diebold and Mariano, 1995; Giacomini and White, 2006; Diebold, 2015), which is based on normalized loss differences. Here, the test statistic is given by

 S = √n Δ̄_n L / τ̂,

where Δ̄_n L is the average loss difference of Section 2.1 and τ̂² is an estimator of the long-run asymptotic variance of the loss differences. One possible choice for τ̂² is

 τ̂² = γ̂₀ + 2 ∑_{k≥1} γ̂_k,

where γ̂_k denotes the lag-k sample autocovariance of the sequence of loss differences and the sum is truncated at a suitable lag (Gneiting and Ranjan, 2011; Lerch et al., 2017). Another possible choice (Diks et al., 2011) truncates the sum at a lag that grows slowly with n. As a compromise, we used an intermediate truncation lag. Under the null hypothesis of a vanishing expected loss difference and some further regularity conditions, the test statistic S is asymptotically standard normally distributed. Therefore, we obtain an asymptotic level-α test of H₀⁻ if we reject the null hypothesis when S > z_{1−α}, and of H₀⁺ if we reject the null hypothesis when S < −z_{1−α}, where z_{1−α} denotes the (1−α)-quantile of the standard normal distribution.

To evaluate the tests for a fixed significance level α, we use the following three-zone approach of Fissler et al. (2016). If H₀⁻ is rejected at level α, we conclude that the forecast h_{1,t} is worse than h_{2,t}, and we mark the result in red; similarly, if H₀⁺ is rejected at level α, forecast h_{1,t} is better than h_{2,t}, and we mark the result in green. Finally, if neither H₀⁻ nor H₀⁺ can be rejected, the marking is yellow.
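A minimal Python sketch of the Diebold-Mariano test with the three-zone verdict might look as follows (the paper's computations are done in R; the function interface, the autocovariance-based variance estimator with a user-chosen truncation lag, and the color labels are our own choices):

```python
import numpy as np
from statistics import NormalDist

def dm_three_zone(d, alpha=0.05, maxlag=0):
    """Diebold-Mariano statistic for loss differences d_t = L(h1,.) - L(h2,.),
    with tau^2 estimated from sample autocovariances up to `maxlag`, and a
    three-zone verdict for forecast h1 against benchmark h2."""
    d = np.asarray(d, dtype=float)
    n = d.size
    dc = d - d.mean()
    gammas = [dc @ dc / n] + [dc[k:] @ dc[:-k] / n for k in range(1, maxlag + 1)]
    tau2 = gammas[0] + 2.0 * sum(gammas[1:])
    stat = np.sqrt(n) * d.mean() / np.sqrt(tau2)
    z = NormalDist().inv_cdf(1.0 - alpha)
    if stat > z:
        verdict = "red"      # H0-: "h1 at least as good" rejected -> h1 worse
    elif stat < -z:
        verdict = "green"    # H0+: "h1 at most as good" rejected -> h1 better
    else:
        verdict = "yellow"   # neither hypothesis rejected
    return stat, verdict
```

Applied to the loss differences of two competing forecasts, the function returns the statistic together with the zone marking used in the figures.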

### 5.2 Squared returns and realized volatility

Assume that the log returns r_t follow a GARCH(1,1) process defined by

 σ²_t = a₀ + a₁ r²_{t−1} + b σ²_{t−1},  r_t = σ_t ε_t,

where

 ε_t = ∑_{i=1}^m ε_{t,i},  ε_{t,i} ∼ N(0, 1/m), i = 1, …, m, (26)

and all ε_{t,i} are independent. Assuming that σ_t is constant over the trading day (t−1, t], the intraday returns are given by r_{t,i} = σ_t ε_{t,i}.

We use m = 100 and m = 13; the first is a typical order of magnitude when using 5-min returns, the latter corresponds to the use of half-hour returns at the New York Stock Exchange (NYSE). As the total length of the time series, we take T = 1500. For this time series, we generate rolling one-step-ahead forecasts of the conditional variance using a moving time window with window length T/3, refitted every 10 time steps, for GARCH(1,1), ARCH(1), ARCH(2), and ARCH(7) models. Hence, for T = 1500, the fit is based on 500 values, and the DM tests use 1000 forecasts of each model. All computations are done in R (R Core Team, 2021) using the R packages rugarch (Ghalanos, 2020) and fGarch (Wuertz et al., 2020).
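Although the computations in the paper are done in R, the data-generating process can be sketched compactly in Python; the GARCH parameter values below are our own illustrative choices, as the paper's values are not reproduced here:

```python
import numpy as np

# GARCH(1,1) daily returns whose innovation is the sum of m i.i.d.
# N(0, 1/m) intraday innovations, so that both squared returns and
# realized volatilities are available as proxies.
def simulate(T=1500, m=13, a0=0.05, a1=0.10, b=0.85, seed=0):
    rng = np.random.default_rng(seed)
    sig2 = np.empty(T)      # conditional variances sigma_t^2
    r = np.empty(T)         # daily log returns r_t
    rv = np.empty(T)        # realized volatilities RV_t
    sig2[0] = a0 / (1.0 - a1 - b)        # start at the unconditional variance
    for t in range(T):
        if t > 0:
            sig2[t] = a0 + a1 * r[t - 1] ** 2 + b * sig2[t - 1]
        eps = rng.normal(0.0, np.sqrt(1.0 / m), size=m)   # intraday innovations
        r_i = np.sqrt(sig2[t]) * eps                      # intraday returns
        r[t] = r_i.sum()                                  # daily return
        rv[t] = (r_i ** 2).sum()                          # realized volatility
    return r, rv, sig2

r, rv, sig2 = simulate()
```

Both r² and RV have the same conditional mean σ²_t, but RV fluctuates much less around it, which is the mechanism exploited in the tests below.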

To stabilize the results, the following figures show the means of the results of 50 replications (500 replications for Figures 1-3) of the Diebold-Mariano test. All figures use the three-zone approach described in the previous section with the following modification: we simultaneously show rejection of H₀⁻ at the levels 0.1, 0.05 and 0.01 by marking in light red, red and dark red, respectively. Marking in light green, green and dark green signals rejection of H₀⁺ at the three levels. Besides the forecasts from the different (G)ARCH models, we show the result for the optimal forecast, given by the true conditional volatilities. Each figure shows four plot matrices: in the left (right) column, the squared returns (realized volatilities) are used as proxies. In the upper row, the loss function is the mean squared error (MSE), whereas in the lower row, the QLIKE loss function is used.

Figure 1 shows the results of Diebold-Mariano tests under normal innovations, with m = 100, i.e. 5-min returns. The left panels show results for the squared returns r²_{t+1}, the right panels for the realized volatilities RV_{t+1}.

Let us first discuss the results shown in the lower-left panel, i.e. for the QLIKE loss and using the squared returns as proxies. The second value in the left column, +1.532, is the average value of the DM test statistic comparing the forecast from the GARCH(1,1) model with the optimal forecast, the true conditional volatilities. The positive value hints at the superiority of the optimal forecast, but it is not statistically significant. The results are significant when comparing the ARCH(1), ARCH(2) and ARCH(7) models with the optimal forecast; here, the red color indicates a significant rejection of H₀⁻ at the 0.05-level. The light red entries in the second column show that forecasts from the ARCH(1) and ARCH(2) models are worse than the forecasts from the GARCH(1,1) model (which is the true data generating process) at level 0.1, but not at the 0.05-level. The forecast from the ARCH(7) model is not significantly worse than that from the GARCH(1,1) model.

Now, let us turn to the lower-right panel with the realized volatilities as proxies. Here, all corresponding entries are marked in dark red, signaling the rejection of H₀⁻ at level 0.01 in all cases. Hence, the power of the DM test is clearly higher when using realized volatilities compared to squared returns. Looking at the upper row, we see that the results for the MSE are similar from a qualitative point of view. However, there are fewer statistically significant entries compared to the QLIKE loss function. Hence, the latter allows for sharper forecast evaluation in this example.

Figure 2 shows results from the same setting as Fig. 1, except that we use m = 13, corresponding to the use of half-hourly returns, instead of m = 100. Hence, the results in the two left panels are the same as in Fig. 1 (up to simulation error). The right panels are similar to those in Fig. 1 as well. A closer look shows that all entries have smaller absolute values, reflecting the decreasing power in differentiating forecasts.

In Figure 3, the realized volatilities are replaced by the adjusted intra-daily log range, given by

 (max_{t−1≤s≤t} log P_s − min_{t−1≤s≤t} log P_s) / (2√(log 2)),

where P_s denotes the price process. This volatility proxy is unbiased under the above assumptions; details can be found in Patton (2011), p. 250. All in all, forecast evaluation using the adjusted intra-daily log range is less sharp compared to the realized volatilities, but the power is clearly higher than when using squared returns as proxies.

We also replaced the normal distribution of the intraday innovations by centered, skewed and long-tailed distributions. For this, we used the normal inverse Gaussian (nig) distribution (Barndorff-Nielsen, 1997) with parameters

 α = 2, β = 1, γ = √(α² − β²), δ = γ³/α²/m, μ = −δβ/γ.

This results in E[ε_{t,i}] = 0 and Var(ε_{t,i}) = 1/m. Since the class of nig distributions with fixed shape parameters α and β is closed under convolution, the distribution of ε_t = ∑_{i=1}^m ε_{t,i} is again nig, with the same α and β, scale parameter mδ, and location parameter mμ.

Fig. 4 shows the results of the DM tests in the setting of Fig. 1, using nig instead of normally distributed innovations. Again, the results are qualitatively comparable to those in Fig. 1, but the absolute values of the entries are generally smaller. Hence, the change in the distribution of the innovations has a negative effect on the power of the test. Note that this decrease in power is larger for the realized volatilities than for the squared returns. This can be explained by the fact that the skewness and kurtosis of the daily innovations are rather modest, with values of 1 and 5.67, whereas the skewness and kurtosis of the intraday innovations are 10 and 269.8, respectively.

### 5.3 Higher moments

The use of realized higher moments, skewness, and kurtosis to estimate and forecast returns has become quite standard in the literature. For example, Neuberger (2012) analyzed realized skewness and showed that high-frequency data can be used to provide more efficient estimates of the skewness in price changes over a period. Amaya et al. (2015) constructed measures of realized daily skewness and kurtosis based on intraday returns, and analyzed moment-based portfolios. Recently, Shen et al. (2018) discussed the explanatory power of higher realized moments.

#### 5.3.1 Third moment

Assuming finite third moments, we are interested in the conditional third moment ρ_t = E[r³_t | F_{t−1}]. Possible proxies for ρ_t are the cubed return r³_t and the realized third moment RM⁽³⁾_t = ∑_{i=1}^m r³_{t,i}. We use the GARCH(1,1) model of subsection 5.2, with the normal inverse Gaussian distribution for the innovations. Under this model, we obtain

 ρ_t = E[r³_t | F_{t−1}] = E[σ³_t ε³_t | F_{t−1}] = σ³_t E[ε³_t],
 E[ε³_t] = E[(∑_{i=1}^m ε_{t,i})³] = ∑_i E[ε³_{t,i}],

since, by independence and zero means, all mixed terms in the expansion of the third power have vanishing expectation. Since

 E[RM⁽³⁾_t | F_{t−1}] = σ³_t ∑_{i=1}^m E[ε³_{t,i}] = σ³_t E[ε³_t],

the realized third moment RM⁽³⁾_t is an unbiased estimator of ρ_t. As forecast of ρ_t, we use σ̂³_t E[ε³_t], where σ̂_t denotes the one-step ahead forecast of σ_t from the different (G)ARCH models.

Figure 5 shows the results of Diebold-Mariano tests under nig innovations with the parameters of subsection 5.2. Skewness and kurtosis of the intraday innovations are 3.61 and 37.67, respectively, compared to the values 1 and 5.67 of the daily innovations. Here, we use m = 13, i.e. half-hourly returns. The left panels show the results for the cubed returns r³_t, the right panels for the realized third moment RM⁽³⁾_t.

At first glance, the results seem to be rather different from the corresponding ones for the volatility, since the number of significant entries is much lower (cp. Fig. 2). But they go in the same direction: use of the realized moments increases the power of the DM test when the optimal forecast competes against the other models, or when the true data generating process is compared with ARCH models.

We have also used m = 100 in the simulations; the results (not shown) go in the same direction, but none of the values is statistically significant, even at the 0.1-level.
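The unbiasedness and the variance reduction behind the realized third moment can be checked by a small Monte Carlo experiment; the centered-exponential intraday innovations (which are skewed) and all parameter values are our own stand-ins, not the paper's nig specification:

```python
import numpy as np

# Both the cubed daily return r^3 and the realized third moment
# RM3 = sum_i r_{t,i}^3 are unbiased for sigma^3 E[eps_t^3]; RM3 has
# much smaller variance. sigma is held fixed for this check.
rng = np.random.default_rng(0)
m, n, sigma = 13, 200_000, 1.2
eps = (rng.exponential(1.0, size=(n, m)) - 1.0) / np.sqrt(m)  # mean 0, var 1/m
r_i = sigma * eps              # intraday returns
r = r_i.sum(axis=1)            # daily returns

rm3 = (r_i ** 3).sum(axis=1)              # realized third moment
target = sigma ** 3 * 2.0 / np.sqrt(m)    # sigma^3 E[eps_t^3]; the third
                                          # central moment of Exp(1) is 2
```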

#### 5.3.2 Fourth moment

Here, we are interested in the conditional fourth moment E[r⁴_t | F_{t−1}] = σ⁴_t E[ε⁴_t]. Again, we use the GARCH(1,1) model as in subsection 5.2, and obtain

 E[ε⁴_t] = ∑_i E[ε⁴_{t,i}] + 6 ∑_{i<j} E[ε²_{t,i}] E[ε²_{t,j}],

since all mixed terms containing odd powers have vanishing expectation. Hence, unbiased proxies for σ⁴_t E[ε⁴_t] are r⁴_t and the realized corrected fourth moment

 cRM⁽⁴⁾_t = ∑_{i=1}^m r⁴_{t,i} + 6 ∑_{i<j} r²_{t,i} r²_{t,j}.

As forecast of the conditional fourth moment, we use σ̂⁴_t E[ε⁴_t], where σ̂_t denotes the one-step ahead forecast of σ_t from the different (G)ARCH models.
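A small Monte Carlo check of the corrected realized fourth moment (with our own illustrative parameter values and normal intraday innovations) confirms unbiasedness and the variance reduction relative to r⁴_t:

```python
import numpy as np

# cRM4 = sum_i r_i^4 + 6 sum_{i<j} r_i^2 r_j^2 is unbiased for
# sigma^4 E[eps_t^4] = 3 sigma^4 under N(0, 1/m) intraday innovations,
# like r^4, but with much smaller variance. sigma is held fixed here.
rng = np.random.default_rng(0)
m, n, sigma = 13, 200_000, 1.1
r_i = sigma * rng.normal(0.0, np.sqrt(1.0 / m), size=(n, m))  # intraday returns
r = r_i.sum(axis=1)                                           # daily returns

sum2 = (r_i ** 2).sum(axis=1)            # realized volatility
sum4 = (r_i ** 4).sum(axis=1)
crm4 = sum4 + 3.0 * (sum2 ** 2 - sum4)   # = sum4 + 6 * sum_{i<j} r_i^2 r_j^2
target = 3.0 * sigma ** 4                # sigma^4 E[eps_t^4], eps_t ~ N(0, 1)
```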

The left and right panels of Fig. 6 show the results of the DM tests, using r⁴_t and the realized corrected fourth moment cRM⁽⁴⁾_t as proxies, respectively. The innovations are normally distributed; the further settings are as in the previous subsections. The general picture strongly resembles the results for the volatility forecasts in Fig. 2, and all conclusions also apply here, even though the actual entries are a bit smaller.

When replacing the normal by the nig innovations, the power of the DM test decreases strongly (cf. Fig. 7). On the other hand, the entries are somewhat larger than in forecasting the third moment; here, at least a few values are significant at the 0.1-level.

### 5.4 An apARCH model for the fourth moment

Instead of modeling the volatility and computing higher moments under this process, it is also possible to use suitable models for higher moments directly. Harvey and Siddique (1999, 2000), for example, considered an autoregressive model for conditional skewness. Lambert and Laurent (2002) used the asymmetric power (G)ARCH or APARCH model of Ding et al. (1993) to describe dynamics in skewed location-scale distributions. Brooks et al. (2005) used both separate and joint GARCH models for conditional variance and conditional kurtosis, whereas Lau (2015) modeled (standardized) realized moments by an exponentially weighted moving average.

Hence, in this section, we model the fourth moment directly by an asymmetric power ARCH (apARCH) process (Ding et al., 1993). Specifically, the log returns follow an apARCH(1,1) model with

    \sigma_t^4 = \omega + \alpha r_{t-1}^4 + \beta \sigma_{t-1}^4, \qquad r_t = \sigma_t \varepsilon_t,

where the innovations $\varepsilon_t$ are independent. Assuming again that the volatility is constant on each day, intraday returns are defined as in subsection 5.2. We choose $\omega$ such that the unconditional variance is

    \sigma^2 = \left(\frac{\omega}{1 - E(\varepsilon_1^4)\,\alpha - \beta}\right)^{2/\delta} = \sqrt{2}.

As in the last section, unbiased proxies for the conditional fourth moment are $r_t^4$ and $cRM^{(4)}_t$. As forecast, we use $E(\varepsilon_1^4)\,\hat\sigma_t^4$, where $\hat\sigma_t^4$ denotes the one-step-ahead forecast of $\sigma_t^4$ from the different apARCH models, namely apARCH(1,1), apARCH(1), apARCH(2), and apARCH(3).
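The apARCH(1,1) recursion with exponent 4 is easy to simulate and to check against the stationarity condition $\alpha E(\varepsilon_1^4) + \beta < 1$. The following sketch uses hypothetical parameter values (not the paper's choices) and standard normal innovations:

```python
import numpy as np

# Illustrative simulation of the apARCH(1,1) recursion with delta = 4:
#   sigma_t^4 = omega + alpha * r_{t-1}^4 + beta * sigma_{t-1}^4
# The parameter values below are hypothetical, not the paper's choices.
rng = np.random.default_rng(0)
omega, alpha, beta = 0.1, 0.02, 0.85
m4 = 3.0  # E[eps^4] for standard normal innovations

T = 200_000
eps = rng.standard_normal(T)
sigma4 = np.empty(T)
r = np.empty(T)
sigma4[0] = omega / (1.0 - alpha * m4 - beta)  # start at the stationary mean
r[0] = sigma4[0] ** 0.25 * eps[0]
for t in range(1, T):
    sigma4[t] = omega + alpha * r[t - 1] ** 4 + beta * sigma4[t - 1]
    r[t] = sigma4[t] ** 0.25 * eps[t]

# Under alpha * E[eps^4] + beta < 1, the sample mean of sigma_t^4 should
# be close to the stationary value omega / (1 - alpha * E[eps^4] - beta).
print(sigma4.mean(), omega / (1.0 - alpha * m4 - beta))
```

Note that the exponent 4 makes the process considerably more heavy-tailed than a standard GARCH, so the stationarity condition involves the fourth moment of the innovations rather than the second.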

The left and right panels of Figure 8 show the results of the DM tests for the apARCH process with exponent 4, with $r_t^4$ and the realized corrected fourth moment, respectively, as proxies.

The visual comparison of the upper-left and lower-right panels is striking: in the latter, every entry is significant, whereas the former shows no significant entries at all. Hence, using high-frequency data together with a suitable loss function yields a markedly improved forecast evaluation. Overall, the results are quite similar to those for the fourth moment based on the GARCH process in Fig. 6.

Finally, we again consider volatility forecasts, but now based on the apARCH process with exponent 4. The results are shown in Figure 9. We see a slight increase in power compared to Fig. 8; as for the GARCH process, differentiating between volatility forecasts is easier than differentiating between forecasts of the fourth moment for the apARCH process at hand.

To sum up the simulation results: using high-frequency data for the proxies improves the forecast evaluation in every example, and in most cases the effect is substantial. The choice of loss function also matters: the DM test has higher power under the QLIKE loss than under the MSE loss.
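The comparison underlying all of these tables can be sketched generically. The code below is an illustrative toy example, not the paper's implementation: the MSE and QLIKE losses follow the standard definitions for variance forecasts (cf. Patton, 2011), the data are made up, and the variance estimator of the loss differences ignores serial correlation (for one-step-ahead forecasts this is the plain Diebold-Mariano statistic):

```python
import numpy as np

def dm_statistic(loss_a, loss_b):
    """Diebold-Mariano statistic for the loss differences d_t = L_a - L_b.

    Plain i.i.d. variance estimate; for multi-step forecasts a HAC
    estimator (as in Diebold and Mariano, 1995) would be used instead.
    """
    d = np.asarray(loss_a) - np.asarray(loss_b)
    return d.mean() / np.sqrt(d.var(ddof=1) / d.size)

def mse(forecast, proxy):
    return (proxy - forecast) ** 2

def qlike(forecast, proxy):
    return proxy / forecast - np.log(proxy / forecast) - 1.0

# Toy example: a noisy but unbiased proxy for the true value 1,
# a correct forecast, and a misspecified one (numbers made up).
rng = np.random.default_rng(1)
proxy = rng.chisquare(5, size=1000) / 5
f_good = np.full(1000, 1.0)
f_bad = np.full(1000, 1.5)
print(dm_statistic(mse(f_good, proxy), mse(f_bad, proxy)))   # strongly negative
print(dm_statistic(qlike(f_good, proxy), qlike(f_bad, proxy)))
```

A negative statistic indicates that the first forecast has smaller average loss; a less noisy proxy shrinks the variance of the loss differences and hence pushes the statistic further from zero, which is exactly the power gain studied in this paper.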

## Acknowledgements

We would like to thank Andrew Patton and Tilmann Gneiting for pointing out some relevant references, in particular the paper by Hoga and Dimitriadis (2021).

## References

• Amaya et al. (2015) Amaya, D., Christoffersen, P., Jacobs, K., and Vasquez, A. (2015). Does realized skewness predict the cross-section of equity returns? Journal of Financial Economics, 118:135–167.
• Barndorff-Nielsen (1997) Barndorff-Nielsen, O. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics, 24:1–13.
• Brooks et al. (2005) Brooks, C., Burke, S. P., Heravi, S., and Persand, G. (2005). Autoregressive conditional kurtosis. Journal of Financial Econometrics, 3:399–421.
• Corsi (2009) Corsi, F. (2009). A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics, 7(2):174–196.
• Diebold (2015) Diebold, F. X. (2015). Comparing predictive accuracy, twenty years later: A personal perspective on the use and abuse of Diebold-Mariano tests. Journal of Business and Economic Statistics, 33:1–24.
• Diebold and Mariano (1995) Diebold, F. X. and Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business and Economic Statistics, 13:253–263.
• Diks et al. (2011) Diks, C., Panchenko, V., and van Dijk, D. (2011). Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163:215–230.
• Ding et al. (1993) Ding, Z., Granger, C., and Engle, R. (1993). A long memory property of stock market returns and a new model. Journal of Empirical Finance, 1:83–106.
• Fissler et al. (2015) Fissler, T., Ziegel, J. F., and Gneiting, T. (2015). Expected shortfall is jointly elicitable with value at risk-implications for backtesting. arXiv preprint arXiv:1507.00244.
• Fissler et al. (2016) Fissler, T., Ziegel, J. F., and Gneiting, T. (2016). Expected shortfall is jointly elicitable with value at risk - implications for backtesting. Risk Magazine, January:58–61.
• Ghalanos (2020) Ghalanos, A. (2020). rugarch: Univariate GARCH models. R package version 1.4-4.
• Giacomini and White (2006) Giacomini, R. and White, H. (2006). Tests of conditional predictive ability. Econometrica, 74:1545–1578.
• Gneiting (2011) Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762.
• Gneiting and Ranjan (2011) Gneiting, T. and Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business and Economic Statistics, 29:411–422.
• Hansen and Lunde (2006) Hansen, P. R. and Lunde, A. (2006). Consistent ranking of volatility models. Journal of Econometrics, 131(1-2):97–121.
• Harvey and Siddique (1999) Harvey, C. R. and Siddique, A. (1999). Autoregressive conditional skewness. The Journal of Financial and Quantitative Analysis, 34:465–487.
• Harvey and Siddique (2000) Harvey, C. R. and Siddique, A. (2000). Conditional skewness in asset pricing tests. Journal of Finance, 55:1263–1295.
• Hoga and Dimitriadis (2021) Hoga, Y. and Dimitriadis, T. (2021). On testing equal conditional predictive ability under measurement error. arXiv preprint arXiv:2106.11104.
• Koopman et al. (2005) Koopman, S. J., Jungbacker, B., and Hol, E. (2005). Forecasting daily variability of the S&P 100 stock index using historical, realised and implied volatility measurements. Journal of Empirical Finance, 12(3):445–475.
• Lambert and Laurent (2002) Lambert, P. and Laurent, S. (2002). Modeling skewness dynamics in series of financial data using skewed location-scale distributions. Working Paper, Université Catholique de Louvain and Université de Liège.
• Lau (2015) Lau, C. (2015). A simple normal inverse Gaussian-type approach to calculate value-at-risk based on realized moments. Journal of Risk, 17:1–18.
• Laurent et al. (2013) Laurent, S., Rombouts, J. V., and Violante, F. (2013). On loss functions and ranking forecasting performances of multivariate volatility models. Journal of Econometrics, 173(1):1–10.
• Lerch et al. (2017) Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F., and Gneiting, T. (2017). Forecaster’s dilemma: Extreme events and forecast evaluation. Statistical Science, 32:106–127.
• Neuberger (2012) Neuberger, A. (2012). Realized skewness. Review of Financial Studies, 25:3423–3455.
• Nolde and Ziegel (2017) Nolde, N. and Ziegel, J. F. (2017). Elicitability and backtesting: Perspectives for banking regulation. The Annals of Applied Statistics, 11:1833–1874.
• Patton (2011) Patton, A. J. (2011). Volatility forecast comparison using imperfect volatility proxies. Journal of Econometrics, 160:246–256.
• Patton (2020) Patton, A. J. (2020). Comparing possibly misspecified forecasts. Journal of Business & Economic Statistics, 38(4):796–809.
• R Core Team (2021) R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
• Savage (1971) Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801.
• Shen et al. (2018) Shen, K., Yao, J., and Li, W. K. (2018). On the surprising explanatory power of higher realized moments in practice. Statistics and Its Interface, 11:153–168.
• Steinwart et al. (2014) Steinwart, I., Pasin, C., Williamson, R., and Zhang, S. (2014). Elicitation and identification of properties. In Conference on Learning Theory, pages 482–526. PMLR.
• Wuertz et al. (2020) Wuertz, D., Setz, T., Chalabi, Y., Boudt, C., Chausse, P., and Miklovac, M. (2020). fGarch: Rmetrics - Autoregressive Conditional Heteroskedastic Modelling. R package version 3042.83.2.