Multivariate financial returns exhibit similar patterns. Moreover, besides time-dependent volatilities, they also display time-dependent correlations (Patton2006)
. Covariances are the product of variances and correlations. Therefore these temporal dependencies are likewise present in covariance matrices. To capture these propertiesengle1995multivariate
proposed a multivariate extension of GARCH called BEKK, based on synthesized work of Baba, Engle, Kraft, and Kroner. An alternative to BEKK are stochastic volatility models, including recent non-parametric models based on generalized Wishart processes(Wilson11; fox2011bayesian). In practice BEKK performs similarly to the generalized Wishart processes for modeling financial data. However, BEKK is known to suffer from the following disadvantages:
The parameters of BEKK are usually fit by maximum likelihood. The large number of parameters in BEKK and local maxima in the likelihood function often lead to overfitting.
The parameter values in BEKK are constant. However, financial markets are dynamic and market conditions change with time. BEKK does not naturally capture these shifts in market conditions.
Finally, maximum likelihood fit of the parameters of BEKK involves solving a non-linear optimization process which is computationally expensive and infeasible in high-dimensions.
To address the difficulties mentioned above, we present a novel dynamic model for describing time-dependent covariance matrices which extends BEKK as follows: Instead of computing point estimates, as in BEKK, we follow a fully Bayesian approach and compute posterior probabilities for the parameter values. This reduces the detrimental effect of multiple maxima in the likelihood function and limits overfitting problems. In addition to this, the new dynamic model incorporates a diffusion process for parameter values. At each point in time, every parameter is slightly modified by a random perturbation. These perturbations allow the model to adapt its parameters to changes in market conditions. Finally, Bayesian inference is performed using a regularized auxiliary particle filter(liu1999combined). This technique is very efficient in terms of computational cost and allows our method to scale up to high dimensions.
The performance of the new dynamic model for time-changing covariance matrices is evaluated in a series of experiments with real financial returns. The proposed model is compared with the standard BEKK model and a variant of BEKK that assumes multivariate Student’s innovations. Finally, we compare the proposed model against the generalized Wishart processes. Overall, the new dynamic model for time-varying covariance matrices obtains the best predictive performance.
The rest of the document is organized as follows: Section 2 describes standard heteroscedastic models such as GARCH and BEKK. Section 3 introduces our novel Bayesian Multivariate Dynamic Covariance model (BMDC). Section 4 reviews current models for dynamic covariances in machine learning. Section 5 describes the particle filter algorithm for making inference in BMDC. Experiments comparing BMDC with BEKK are included in Section 6 and additional experiments comparing BMDC with the generalized Wishart process are included in Section 7. Finally, Section 8 contains the conclusions of the study.
2 Review of GARCH and BEKK
GARCH is the standard time-series model for univariate heteroscedastic data. GARCH assumes Gaussian noise or innovations and produces a sequence of time-varying variances that follow an Autoregressive and Moving Average (ARMA) process with autoregression on previous variance values and moving average on previous squared time-series values:
The generative model is flexible and can produce a variety of clustering behavior of high and low volatility periods for different settings of the model coefficients, and . Maximum likelihood is used to learn these coefficients with and usually set to 1 to reduce overfitting problems.
BEKK (engle1995multivariate) is a popular multivariate extension of GARCH, where the dynamic covariance matrix for the data follows an ARMA process:
where and are coefficient matrices for dimensional data and is a triangular matrix with non-zero entries. Therefore BEKK(,) is a highly parameterized model with a total of parameters.
Restricted versions of BEKK are used in practice to mitigate overfitting problems. The order parameters and are set to 1 and the matrices and are constrained to be diagonal. The expression for is now
This diagonal BEKK model (engle1995multivariate) has only parameters. We will use this model as the baseline for comparison because it often has better predictive performance than other versions of BEKK and its computational cost is also lower. Subsequent references to BEKK will mean the diagonal BEKK model shown in (5).
Standard BEKK has Gaussian innovations. However, there is strong evidence that financial returns are heavy-tailed. harvey1992unobserved and fiorentini2003maximum incorporate heavy-tails in BEKK by modeling using a multivariate Student’s distribution with degrees of freedom, zero mean, and scale matrix :
The degrees of freedom, , should be larger than two for to have a well defined covariance. Finally, the expression for ensures the expected covariance of is . We will refer to the diagonal BEKK model with Student’s innovations as BEKK-T.
3 Bayesian Multivariate Dynamic Covariance Models
A major limitation of BEKK is that the parameter matrices , and
are assumed to be constant. This is unrealistic for financial data, where market fundamentals are expected to change with time. A heuristic solution is to run BEKK over different windows of historical data. The problem is then how to choose the window size, with trade-offs between noisy but reactive estimates of the model parameters for small windows and more stable but constant parameter estimates for large windows.
As a more principled approach to capture changes in market dynamics, we introduce a novel Bayesian Multivariate Dynamic Covariance model (BMDC) with time-varying parameters. For this, we replace (5) by
where , and are time-dependent matrices. Let and . Figure 1 shows the corresponding graphical models for BEKK and BMDC.
In BMDC, the dynamic parameters in follow a diffusion process in which is obtained by adding a small random perturbation to the parameters in . Since and are diagonal, let and denote
-dimensional vectors with the diagonal elements of these matrices and letbe the vector with the upper triangular terms of . Then we specify the following diffusion process for , and :
where the hyper-parameters , and control the amount of drift in the system. In practice, vague priors are chosen for the value of these hyper-parameter, with mean prior drift set to zero, . was set to so that , and can move up to , where is the number of observations. Other values of were tried with little difference in predictive performance. The prior for the initial state , and is also vague, taking into account the constraint so that does not diverge, where calculates the determinant of a matrix.
An important property of BMDC is that its predictive distribution is heavy-tailed. This is desirable as financial time series have fat tails (Cont2001). The heavy-tails in the predictions of BMDC arise because
where the distribution of is obtained by integrating (9) with respect to the posterior for . Since is Gaussian, (14) is an infinite mixture of multivariate Gaussians and it will generally be heavy-tailed. This implies that the predictive density in the BMDC model will also have heavy tails.
In addition to BMDC, we propose a variant of this model with Student’s innovations, which we denote BMDC-T. In this case, a vague log-normal prior is placed on the parameter corresponding to the number of degrees of freedom of the Student’s innovations:
4 Related Work
The GARCH and BEKK models described in Section 2 constitute the most popular family of volatility models. They are characterized by the latent covariance matrix being dependent on both its most recent past value and the previous time-series observation . Another class of models is the family of Stochastic Volatility models (harvey1994multivariate; chib2009multivariate; gourieroux2009wishart; philipov2006factor). In this case, is conditionally independent of given its previous value . That is, graphically, in a stochastic volatility model there will be no arrow from node to node in the graphs shown in Figure 1. However, will still be conditionally independent of for and given and . This latter condition does not hold in recent generalizations of these models based on generalized Wishart processes (GWP) (fox2011bayesian; Wilson11). In this case, there are dependencies between all the latent covariance matrices and the model can produce complex non-linear patterns in the evolution of covariance matrices. However, it is not clear that such flexibility is necessary for successfully modeling financial data. Our experiments show that BMDC outperforms GWP on this task, which confirms that this is not the case.
5 Inference with Particle Filters
Inference for the proposed Bayesian models is performed using particle filters (doucet2001sequential). Particle filters seem a natural choice to do online inference for these non-linear and non-Gaussian sequential models. BMDC has Gaussian likelihoods, but is non-linear in the model parameters , and . Furthermore, particle filters can easily accommodate models with Student’s innovations, such as BMDC-T.
We analyzed several particle filtering methods, including Resample-Move (gilks2001following), Regularized Sequential Importance Sampling (musso2001improving), Regularized Sequential Importance Resampling (gordon1993novel), and Regularized Auxiliary Particle Filter (RAPF) (liu1999combined) to perform inference in the proposed models. RAPF, a hybrid version of regularized (musso2001improving) and auxiliary particle filters (pitt1999filtering), exhibited the best performance in terms of predictiveness. This agrees with results given by casarin2009online. An implementation of the RAPF for BMDC is shown in Algorithm 1. A detailed description of the algorithm is given in the following paragraph.
liu1999combined introduced the RAPF, which combines the Regularized Particle Filter with the Auxiliary Particle Filter. The Regularized Particle Filter allows joint inference of the diffusion hyper-parameters, , , , and the hidden states, , and . This method explores the posterior distribution of the model parameters by taking kernelized steps sequentially, thereby avoiding particle death. The shrinkage kernel, (17) and (19), avoids over-dispersion problems in standard Regularized Particle Filters and improves prediction accuracy. The shrinkage kernel retains the previous mean, while the posterior variance does not diverge:
The Auxiliary Particle Filter part of the algorithm can be viewed as interchanging the importance sampling and resampling steps in traditional particle filters such as Sequential Importance Resampling. The resampling, (18), is performed on , which are predicted estimates of
given particles from the previous time step. Importance samples are then generated from the resampled point estimates. Particle diversity is maintained if the predicted estimates from the previous step are close to the true state. Auxiliary particle filtering is less sensitive to outliers and works well when the predicted estimates are good representations of the unknown states.
There are two algorithm parameters in Algorithm 1. The number of particles and the shrinkage parameter . We followed liu1999combined and fixed . This leads to a little shrinkage and avoids parameter variance exploding. If shrinkage is large () then the particle filter would propose from a less heavy-tailed proposal distribution, with the parameters set to be their empirical means from the previous step. When the empirical means are inaccurate, then the proposed particles will not be representative of the hidden state. The number of particles, , can affect the accuracy of the predictions, and is investigated in Section 6. The computational complexity of the algorithm is at each time step, where is the dimension of the data, since importance sampling at each step requires likelihood computations, each one with cost .
6 Experiments for BMDC vs. BEKK
We evaluated the performance of BMDC, BEKK and their Student’s variants in several experiments with multivariate financial time series. The high computational cost of BEKK limited the maximum number of different financial assets in each of the analyzed series to five. We considered daily foreign exchange (FX) and daily equity returns, as well as intraday FX returns. The daily FX time series contain a total of 780 returns from January 2008 to January 2011. The daily equity time series contain 3000 returns from Jan 2000 to Dec 2011. The intraday FX time series consist of 5000 five-minute returns, which covered approximately the first six trading months of 2008. The price data were pre-processed to eliminate spurious prices. In particular, we eliminated prices corresponding to times when markets were closed or not liquid. All the time series were standardized to have zero mean and unit standard deviation.
To illustrate the usefulness of considering time-varying parameter values, we compared the predictive performance of BMDC and BMDC-T with BEKK and BEKK-T. In addition, we compared the execution times of i) the RAPF method used by the Bayesian models and ii) the maximum likelihood estimation method used by the BEKK models. Finally, we studied the sensitivity of RAPF to the number of particles used.
The performance of each method is measured in terms of the predictive log-likelihood on the first return out of the training set. During the experiments, each method receives an initial time-series of length 50. The different methods are trained on that data and then a one-step forward prediction is made. The predictive log-likelihood is evaluated on the next observation out of the training set. Then the training set is augmented with the new observation and the training and prediction steps are repeated again. The process is repeated sequentially until no further data is received.
6.1 Results for BMDC vs. BEKK
Table 1 shows the average predictive log-likelihood for BEKK, BEKK-T, BMDC and BMDC-T on twenty-one daily FX time series sorted by the dimensionality of the data. The method with the best predictive performance on each time series is highlighted in bold. Corresponding results are shown in tables 2 and 4 for fifteen intraday FX and thirty daily equity time series respectively. Overall, the best performing method is BMDC-T followed by BEKK-T and BMDC. The worst performing method is BEKK.
We perform a statistical test to determine whether differences among BEKK, BEKK-T, BMDC and BMDC-T are significant. These methods are compared to each other using the multiple comparison approach described by demšar2006statistical. In this comparison framework, all the methods are ranked according to their performance on different tasks. Statistical tests are then applied to determine whether the differences among the average ranks of the methods are significant. In our case, each of the datasets analyzed represents a different task. A Friedman rank sum test rejects the hypothesis that all methods have equivalent performance at . Pairwise comparisons between all the methods with a Nemenyi test at a 95% confidence level are summarized in Figure 3. The methods whose average ranks across datasets differ more than a critical distance (segment labeled CD in the figure) show significant differences in performance at this confidence level. The Nemenyi test confirms that BMDC-T is superior to all the other methods. Additionally, the dynamic methods are superior to their static counterparts with BMDC outperforming BEKK and BMDC-T beating BEKK-T. Finally, BMDC is not statistically different from BEKK-T.
The plot in Figure 3
shows that the heavy-tailed models BMDC-T and BEKK-T perform better than the non-heavy-tailed BEKK. Although BMDC assumes a Gaussian likelihood its posterior predictive distribution is heavy-tailed, as discussed in Section3. Figure 2 confirms this by showing the logarithm of the posterior predictive density produced by BMDC on a particular instance of the analyzed time series. To produce this plot, we evaluated on a grid of values for one dimension of , averaging over the available particles which approximate the posterior distribution of and . This is compared to the plot obtained by evaluating on the posterior mean estimate of and , which is approximated by the empirical mean computed across all the particles.
To understand the superior performance of BMDC relative to BEKK, we plot the average predictive log-likelihood of each method against the number of observations in Figure 4. The plot shows typical average predictive log-likelihood for a Daily FX time series. BMDC-T is clearly the most predictive method for any number of observations. BEKK and BEKK-T underperform BMDC and BMDC-T early on, when relatively few observations are available to fit the models. These two methods are susceptible to overfitting at early stages (especially BEKK-T in this case). With more data, BEKK-T is less susceptible to overfitting and outperforms BEKK. However, BEKK and BEKK-T still perform worse than BMDC and BMDC-T when the amount of training data increases. This confirms that BMDC and BMDC-T are better models for dynamic covariances in financial data, not only because BEKK and BEKK-T suffer from initial overfitting problems. As a note of financial interest, the average predictive log-likelihood dips after 150 observations. This corresponds to the highly turbulent period near the end of 2008 with large swings in financial asset prices resulting in lower predictive log-likelihood.
A major advantage of BMDC and BMDC-T with the RAPF method for Bayesian inference is scalability. Table 3 shows prediction sensitivity of the regularized particle filter to different number of particles and total execution times in minutes in parentheses for the BMDC-T model on the daily FX dataset with 780 observations, ordered by data dimension. BEKK-T predictions and execution times are also provided for comparison, except for datasets of ten or higher dimensions where BEKK-T failed to finish in a reasonable amount of time. BEKK-T was implemented using numerical optimization routines provided by Kevin Sheppard 111http:///www.kevinsheppard.com/wiki/UCSD_GARCH/.
Clearly particle filtering is much faster than the numerical optimization of the log-likelihood function used by BEKK. In fact, particle filtering can be used for intraday trading and hedging since each sequential update is on the order of seconds for five dimensional time series with particles. The computational cost of the particle filter is at each step, as described in Section 5. The computational cost for BEKK is at each step, where denotes the average number of numerical iterations needed per cycle and is the length of the training data currently seen. The power in this cost originates because, at each iteration of the optimization process, BEKK has to compute the gradient of parameters, where each gradient computation requires operations. This cost makes BEKK infeasible for large datasets. In contrast, the cost of the RAPF at each step does not depend on and only scales as . This allows the RAPF to analyze long high-dimensional time series.
Table 3 also shows the sensitivity of the predictions of BMDC-T to the number of particles . Increasing improves performance, but has relatively small effect for low dimensional datasets. For higher dimensions, that is, , improvements are more substantial, but do not scale linearly with . Since run times are linear in , but improvements in predictive performance are not, we can choose such that inference and prediction are done in a desired amount of time for high dimensional datasets.
7 Experiments for BMDC vs. GWP
We performed another series of experiments comparing BMDC, BEKK with full parameter matrices (BEKK-Full) and the diagonal version of BEKK with the generalized Wishart process (GWP) proposed by Wilson11. For these experiments, we used the two financial datasets analyzed previously by these authors. The first one corresponds to the daily returns of three currencies with respect to the US dollar: the Canadian dollar, the Euro and the British pound. This three-dimensional time series contains 400 observations from 15/7/2008 to 15/2/2010. This datasets is refered to as FX. The second dataset was generated from the daily returns on five equity indexes: NASDAQ, FTSE, TSE, NIKKEI, and the Dow Jones Composite over the period from 15/2/1990 to 15/2/2010. A sequence of time-varying empirical covariance matrices was obtained from the returns of these indexes and then used to produce a return series by sampling from at each time step for a total of 400 steps. This dataset is refered to as EQUITY. In this case, the time series were not standardized to have zero mean and unit standard deviation to be consistent with (Wilson11).
We followed the same experimental protocol as in the previous section. During the experiments, each method receives an initial time-series of length 200. We then make predictions one step forward for 200 iterations. In these experiments, the predictions for BMDC were generated in the same way as in (Wilson11), that is, instead of averaging over the available particles, we just evaluate on the posterior mean estimate of and , which is approximated by the empirical mean computed across all the particles. The predictions for GWP were done similarly, by evaluating a Gaussian likelihood on the posterior mean of
, as approximated by averaging over the samples produced by a Markov chain Monte Carlo method.
In this section we do not evaluate the performance of BMDC-T. The reason for this is that GWP assumes a Gaussian likelihood for the data and the Student’s likelihood used by BMDC-T would give an advantage to this method with respect to GWP.
7.1 Results for BMDC vs. GWP
Table 5 shows the cumulative predictive log-likelihood obtained by each method on the FX and EQUITY datasets. The method with the best predictive performance is highlighted in bold. In both datasets, BMDC is the best performer. BEKK-Full underperforms diagonal BEKK, which was used as a benchmark throughout this paper. This is likely due to worse overfitting problems in BEKK-Full, which is more highly parameterized. GWP and the different BEKK methods have mixed performance. GWP outperforms BEKK and BEKK-Full on the EQUITY dataset, which was generated from an empirical time-varying estimate of the return covariances. By contrast, BEKK outperforms GWP on the real-world FX dataset. Note the positive predictive log-likelihoods result from keeping the same experimental protocol as the GWP paper, where the return data were not rescaled to have zero mean and unit standard deviation.
We have introduced a novel Bayesian Multivariate Dynamic Covariance model (BMDC) with time-varying parameters that follow a diffusion process. The proposed model can adapt its parameters to changing dynamics in financial markets, which results in significant improvements in prediction performance over standard econometric models such as BEKK and other more recent methods such as the generalized Wishart process. In addition to this, we have presented an inference method based on particle filtering that yields substantial savings in computation time, enabling scalable inference to high-dimensional and high-frequency datasets.
JMHL is supported by Infosys Labs, Infosys Limited.