1 Introduction
Univariate financial returns are heteroscedastic: their volatility, or standard deviation, is not constant but changes over time (Cont2001). Several univariate models have been proposed to capture this property. The best known are the Autoregressive Conditional Heteroscedasticity (ARCH) model (engle1982autoregressive) and its extension, the Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model (bollerslev1986generalized). This topic has recently received attention in the machine learning community, with the development of copula processes (NIPS2010_0784) and heteroscedastic Gaussian processes (Lazaro2011).

Multivariate financial returns exhibit similar patterns. Moreover, besides time-dependent volatilities, they also display time-dependent correlations (Patton2006).
Covariances are the product of variances and correlations, so these temporal dependencies are likewise present in covariance matrices. To capture these properties, engle1995multivariate proposed a multivariate extension of GARCH called BEKK, named after Baba, Engle, Kraft, and Kroner, whose earlier work it synthesizes. An alternative to BEKK are stochastic volatility models, including recent nonparametric models based on generalized Wishart processes (Wilson11; fox2011bayesian). In practice, BEKK performs similarly to the generalized Wishart processes for modeling financial data. However, BEKK is known to suffer from the following disadvantages:
The parameters of BEKK are usually fit by maximum likelihood. The large number of parameters in BEKK and the presence of local maxima in the likelihood function often lead to overfitting.

The parameter values in BEKK are constant. However, financial markets are dynamic and market conditions change with time. BEKK does not naturally capture these shifts in market conditions.

Finally, maximum likelihood estimation of the BEKK parameters involves solving a nonlinear optimization problem which is computationally expensive and infeasible in high dimensions.
To address the difficulties mentioned above, we present a novel dynamic model for describing time-dependent covariance matrices which extends BEKK as follows. Instead of computing point estimates, as in BEKK, we follow a fully Bayesian approach and compute posterior probabilities for the parameter values. This reduces the detrimental effect of multiple maxima in the likelihood function and limits overfitting problems. In addition, the new dynamic model incorporates a diffusion process for the parameter values: at each point in time, every parameter is slightly modified by a random perturbation. These perturbations allow the model to adapt its parameters to changes in market conditions. Finally, Bayesian inference is performed using a regularized auxiliary particle filter (liu1999combined). This technique is very efficient in terms of computational cost and allows our method to scale up to high dimensions.

The performance of the new dynamic model for time-changing covariance matrices is evaluated in a series of experiments with real financial returns. The proposed model is compared with the standard BEKK model and with a variant of BEKK that assumes multivariate Student's t innovations. Finally, we compare the proposed model against the generalized Wishart processes. Overall, the new dynamic model for time-varying covariance matrices obtains the best predictive performance.
The rest of the document is organized as follows: Section 2 describes standard heteroscedastic models such as GARCH and BEKK. Section 3 introduces our novel Bayesian Multivariate Dynamic Covariance model (BMDC). Section 4 reviews current models for dynamic covariances in machine learning. Section 5 describes the particle filter algorithm for performing inference in BMDC. Experiments comparing BMDC with BEKK are included in Section 6, and additional experiments comparing BMDC with the generalized Wishart process are included in Section 7. Finally, Section 8 contains the conclusions of the study.
2 Review of GARCH and BEKK
GARCH is the standard time-series model for univariate heteroscedastic data. GARCH assumes Gaussian noise, or innovations, and produces a sequence of time-varying variances that follow an Autoregressive Moving Average (ARMA) process, with autoregression on previous variance values and a moving average on previous squared time-series values:
(1)  r_t = σ_t ε_t,  ε_t ∼ N(0, 1)
(2)  σ_t² = α₀ + ∑_{i=1}^{q} α_i r²_{t−i} + ∑_{j=1}^{p} β_j σ²_{t−j}
This generative model is flexible and can produce a variety of clustering behaviors of high and low volatility periods for different settings of the model coefficients α₀, α_i and β_j. Maximum likelihood is used to learn these coefficients, with p and q usually set to 1 to reduce overfitting problems.
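As a concrete illustration, the GARCH(1,1) recursion can be sketched in a few lines of Python. Parameter values and variable names here are illustrative, not taken from the paper:

```python
import numpy as np

def garch11_variances(r, alpha0, alpha1, beta1):
    """Filter the GARCH(1,1) conditional variances of a return series r:
    sigma2[t] = alpha0 + alpha1 * r[t-1]**2 + beta1 * sigma2[t-1].
    The recursion starts at the unconditional variance, assuming
    alpha1 + beta1 < 1 so that it exists."""
    sigma2 = np.empty(len(r))
    sigma2[0] = alpha0 / (1.0 - alpha1 - beta1)
    for t in range(1, len(r)):
        sigma2[t] = alpha0 + alpha1 * r[t - 1] ** 2 + beta1 * sigma2[t - 1]
    return sigma2

# Simulate from the generative model: r_t = sigma_t * eps_t, eps_t ~ N(0, 1).
rng = np.random.default_rng(0)
alpha0, alpha1, beta1 = 0.05, 0.10, 0.85
r = np.empty(1000)
s2 = alpha0 / (1.0 - alpha1 - beta1)
for t in range(len(r)):
    r[t] = np.sqrt(s2) * rng.standard_normal()
    s2 = alpha0 + alpha1 * r[t] ** 2 + beta1 * s2

sigma2 = garch11_variances(r, alpha0, alpha1, beta1)
```

Periods where |r_t| is large push the next variance up, producing the volatility clustering described above.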
BEKK (engle1995multivariate) is a popular multivariate extension of GARCH, in which the dynamic covariance matrix Σ_t of the data follows an ARMA process:
(3)  r_t ∼ N(0, Σ_t)
(4)  Σ_t = C Cᵀ + ∑_{i=1}^{q} A_i r_{t−i} rᵀ_{t−i} Aᵀ_i + ∑_{j=1}^{p} B_j Σ_{t−j} Bᵀ_j
where the A_i and B_j are coefficient matrices for the D-dimensional data and C is a triangular matrix with nonzero diagonal entries. Therefore BEKK(p, q) is a highly parameterized model, with a total of D(D+1)/2 + (p+q)D² parameters.
Restricted versions of BEKK are used in practice to mitigate overfitting problems. The order parameters p and q are set to 1, and the matrices A and B are constrained to be diagonal. The expression for Σ_t is now
(5)  Σ_t = C Cᵀ + A r_{t−1} rᵀ_{t−1} A + B Σ_{t−1} B
This diagonal BEKK model (engle1995multivariate) has only D(D+1)/2 + 2D parameters. We will use this model as the baseline for comparison because it often has better predictive performance than other versions of BEKK and its computational cost is also lower. Subsequent references to BEKK will mean the diagonal BEKK model shown in (5).
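The covariance recursion in (5) is straightforward to implement. The sketch below (with made-up parameter values) shows one update step, exploiting the fact that for diagonal A and B the quadratic forms reduce to elementwise products:

```python
import numpy as np

def diag_bekk_step(Sigma_prev, r_prev, C, a, b):
    """One diagonal-BEKK(1,1) update:
        Sigma_t = C C^T + A r r^T A + B Sigma_{t-1} B,
    with A = diag(a) and B = diag(b), so that
        (A r r^T A)_{ij} = a_i r_i r_j a_j  and  (B S B)_{ij} = b_i S_{ij} b_j."""
    return C @ C.T + np.outer(a * r_prev, a * r_prev) + np.outer(b, b) * Sigma_prev

D = 3
rng = np.random.default_rng(1)
C = np.tril(rng.standard_normal((D, D)))
np.fill_diagonal(C, np.abs(np.diag(C)) + 0.5)  # keep the factor well conditioned
a = np.full(D, 0.3)                            # illustrative diagonal of A
b = np.full(D, 0.9)                            # illustrative diagonal of B

Sigma = C @ C.T                                # initialize at the baseline level
r = rng.standard_normal(D)
Sigma = diag_bekk_step(Sigma, r, C, a, b)
```

Because each term of the update is positive semidefinite and C Cᵀ is positive definite, Σ_t remains a valid covariance matrix at every step.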
Standard BEKK has Gaussian innovations. However, there is strong evidence that financial returns are heavy-tailed. harvey1992unobserved and fiorentini2003maximum incorporate heavy tails in BEKK by modeling r_t using a multivariate Student's t distribution with ν degrees of freedom, zero mean, and scale matrix Σ̃_t:
(6)  r_t ∼ T(0, Σ̃_t, ν)
(7)  Σ̃_t = ((ν − 2)/ν) Σ_t
(8)  Σ_t = C Cᵀ + A r_{t−1} rᵀ_{t−1} A + B Σ_{t−1} B
The degrees of freedom ν should be larger than two for r_t to have a well-defined covariance. Finally, the expression for Σ̃_t ensures that the expected covariance of r_t is Σ_t. We will refer to the diagonal BEKK model with Student's t innovations as BEKK-T.
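The scale correction in (7) can be checked numerically. The snippet below (a self-contained sketch, not the paper's code) samples a multivariate Student's t by dividing a Gaussian by a chi-square mixing variable, using scale matrix ((ν − 2)/ν) Σ so that the sample covariance matches Σ:

```python
import numpy as np

def student_t_samples(Sigma, nu, n, rng):
    """Draw n multivariate Student-t vectors whose covariance is Sigma.
    The t distribution inflates its scale matrix by nu / (nu - 2),
    so we use scale (nu - 2) / nu * Sigma. Requires nu > 2."""
    L = np.linalg.cholesky((nu - 2.0) / nu * Sigma)
    z = rng.standard_normal((n, Sigma.shape[0])) @ L.T  # Gaussian part
    g = rng.chisquare(nu, size=(n, 1)) / nu             # mixing variable
    return z / np.sqrt(g)                               # heavy-tailed samples

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
draws = student_t_samples(Sigma, nu=5.0, n=200_000, rng=rng)
emp_cov = np.cov(draws, rowvar=False)  # close to Sigma for large n
```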
3 Bayesian Multivariate Dynamic Covariance Models
A major limitation of BEKK is that the parameter matrices A, B and C are assumed to be constant. This is unrealistic for financial data, where market fundamentals are expected to change with time. A heuristic solution is to run BEKK over different windows of historical data. The problem is then how to choose the window size, with a trade-off between noisy but reactive parameter estimates for small windows and more stable but effectively constant estimates for large windows.
As a more principled approach to capturing changes in market dynamics, we introduce a novel Bayesian Multivariate Dynamic Covariance model (BMDC) with time-varying parameters. For this, we replace (5) by
(9)  Σ_t = C_t Cᵀ_t + A_t r_{t−1} rᵀ_{t−1} A_t + B_t Σ_{t−1} B_t
where A_t, B_t and C_t are time-dependent matrices. Let θ_t = {A_t, B_t, C_t} denote the model parameters at time t. Figure 1 shows the corresponding graphical models for BEKK and BMDC.
In BMDC, the dynamic parameters in θ_t follow a diffusion process in which θ_t is obtained by adding a small random perturbation to the parameters in θ_{t−1}. Since A_t and B_t are diagonal, let a_t and b_t denote the D-dimensional vectors with the diagonal elements of these matrices, and let c_t be the vector with the upper-triangular entries of C_t. Then we specify the following diffusion process for a_t, b_t and c_t:

(10)  a_t = a_{t−1} + ε_tᵃ
(11)  b_t = b_{t−1} + ε_tᵇ
(12)  c_t = c_{t−1} + ε_tᶜ
(13)  ε_tᵃ ∼ N(0, σ_a² I),  ε_tᵇ ∼ N(0, σ_b² I),  ε_tᶜ ∼ N(0, σ_c² I)
where the hyperparameters σ_a, σ_b and σ_c control the amount of drift in the system. In practice, vague priors are chosen for these hyperparameters, with the mean prior drift set to zero. The prior scale of the drift was set so that a_t, b_t and c_t can move by a bounded amount over the T observations in the series; other settings were tried with little difference in predictive performance. The prior for the initial state a_1, b_1 and c_1 is also vague, subject to a constraint that prevents the determinant of Σ_t from diverging.
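The diffusion step (10)-(13) amounts to adding independent Gaussian jitter to each parameter vector. A minimal sketch, with illustrative hyperparameter values:

```python
import numpy as np

def diffuse_parameters(a, b, c, sd_a, sd_b, sd_c, rng):
    """One step of the random-walk diffusion on the BMDC parameters:
    each vector receives an independent zero-mean Gaussian perturbation
    whose standard deviation is a fixed drift hyperparameter."""
    a_next = a + sd_a * rng.standard_normal(a.shape)
    b_next = b + sd_b * rng.standard_normal(b.shape)
    c_next = c + sd_c * rng.standard_normal(c.shape)
    return a_next, b_next, c_next

rng = np.random.default_rng(5)
D = 3
a = np.full(D, 0.3)             # diagonal of A_t
b = np.full(D, 0.9)             # diagonal of B_t
c = np.zeros(D * (D + 1) // 2)  # triangular entries of C_t
a_next, b_next, c_next = diffuse_parameters(a, b, c, 0.01, 0.01, 0.01, rng)
```

After T steps, each parameter has typically drifted by an amount on the order of the step size times √T, which is why the drift scales are tied to the length of the series.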
An important property of BMDC is that its predictive distribution is heavy-tailed. This is desirable, as financial time series have fat tails (Cont2001). The heavy tails in the predictions of BMDC arise because

(14)  p(r_t | r_1, …, r_{t−1}) = ∫ N(r_t | 0, Σ_t) p(Σ_t | r_1, …, r_{t−1}) dΣ_t,

where the distribution p(Σ_t | r_1, …, r_{t−1}) is obtained by integrating (9) with respect to the posterior for the model parameters. Since the likelihood N(r_t | 0, Σ_t) is Gaussian, (14) is an infinite mixture of multivariate Gaussians and will generally be heavy-tailed. This implies that the predictive density of the BMDC model will also have heavy tails.
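In a particle-filter implementation, the integral in (14) is approximated by a finite, weighted mixture of Gaussians, one per particle. The following sketch (with hypothetical variable names) evaluates that mixture density using a numerically stable log-sum-exp:

```python
import numpy as np

def gauss_logpdf(r, Sigma):
    """Log-density of N(0, Sigma) evaluated at r."""
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, r)
    return -0.5 * (len(r) * np.log(2 * np.pi) + z @ z) - np.log(np.diag(L)).sum()

def predictive_logpdf(r, Sigmas, weights):
    """Particle approximation of (14): a weighted mixture of zero-mean
    Gaussians, one per particle covariance, combined via log-sum-exp.
    The mixture is heavier-tailed than any single Gaussian component."""
    logs = np.array([gauss_logpdf(r, S) for S in Sigmas])
    m = logs.max()
    return m + np.log(np.sum(np.asarray(weights) * np.exp(logs - m)))

# Three "particles" with different covariance scales and their weights.
Sigmas = [s * np.eye(2) for s in (0.5, 1.0, 2.0)]
weights = np.array([0.2, 0.5, 0.3])
lp = predictive_logpdf(np.array([0.3, -0.1]), Sigmas, weights)
```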
In addition to BMDC, we propose a variant of this model with Student's t innovations, which we denote BMDC-T. In this case, a vague log-normal prior is placed on the parameter ν corresponding to the number of degrees of freedom of the Student's t innovations.
4 Related Work
The GARCH and BEKK models described in Section 2 constitute the most popular family of volatility models. They are characterized by the latent covariance matrix Σ_t depending both on its most recent past value Σ_{t−1} and on the previous time-series observation r_{t−1}. Another class of models is the family of stochastic volatility models (harvey1994multivariate; chib2009multivariate; gourieroux2009wishart; philipov2006factor). In this case, Σ_t is conditionally independent of r_{t−1} given its previous value Σ_{t−1}. That is, graphically, in a stochastic volatility model there is no arrow from node r_{t−1} to node Σ_t in the graphs shown in Figure 1. However, in both families Σ_t is still conditionally independent of the earlier covariance matrices Σ_s, for s < t − 1, given Σ_{t−1} and r_{t−1}. This latter condition does not hold in recent generalizations of these models based on generalized Wishart processes (GWP) (fox2011bayesian; Wilson11). In this case, there are dependencies between all the latent covariance matrices, and the model can produce complex nonlinear patterns in the evolution of covariance matrices. However, it is not clear that such flexibility is necessary for successfully modeling financial data. Our experiments show that BMDC outperforms GWP on this task, which suggests that it is not.
5 Inference with Particle Filters
Inference for the proposed Bayesian models is performed using particle filters (doucet2001sequential). Particle filters are a natural choice for online inference in these nonlinear and non-Gaussian sequential models. BMDC has a Gaussian likelihood, but it is nonlinear in the model parameters A_t, B_t and C_t. Furthermore, particle filters easily accommodate models with Student's t innovations, such as BMDC-T.
We analyzed several particle filtering methods to perform inference in the proposed models, including Resample-Move (gilks2001following), Regularized Sequential Importance Sampling (musso2001improving), Regularized Sequential Importance Resampling (gordon1993novel), and the Regularized Auxiliary Particle Filter (RAPF) (liu1999combined). RAPF, a hybrid of regularized (musso2001improving) and auxiliary particle filters (pitt1999filtering), exhibited the best predictive performance. This agrees with the results of casarin2009online. An implementation of the RAPF for BMDC is shown in Algorithm 1 and described in detail in the following paragraphs.
liu1999combined introduced the RAPF, which combines the Regularized Particle Filter with the Auxiliary Particle Filter. The Regularized Particle Filter allows joint inference of the diffusion hyperparameters σ_a, σ_b and σ_c and the hidden states a_t, b_t and c_t. This method explores the posterior distribution of the model parameters by taking kernelized steps sequentially, thereby avoiding particle death. The shrinkage kernel, used in steps (17) and (19) of Algorithm 1, avoids the over-dispersion problems of standard Regularized Particle Filters and improves prediction accuracy: it retains the previous mean while preventing the posterior variance from diverging:
(15)  m_t^(i) = λ θ_t^(i) + (1 − λ) θ̄_t
(16)  θ̃_t^(i) ∼ N(m_t^(i), (1 − λ²) V_t)

where θ̄_t and V_t are the empirical mean and covariance of the particles at time t, and λ is the shrinkage parameter.
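A sketch of the shrinkage move of (15) and (16), following the Liu and West construction (variable names are ours, and the particle values are synthetic):

```python
import numpy as np

def shrinkage_move(theta, lam, rng):
    """Shrink each particle toward the particle mean and add jitter:
        m_i      = lam * theta_i + (1 - lam) * theta_bar
        theta_i' ~ N(m_i, (1 - lam**2) * V)
    where theta_bar and V are the empirical particle mean and covariance.
    The lam / (1 - lam**2) coupling leaves the particle mean and covariance
    approximately unchanged, avoiding over-dispersion."""
    theta_bar = theta.mean(axis=0)
    V = np.atleast_2d(np.cov(theta, rowvar=False))
    m = lam * theta + (1.0 - lam) * theta_bar
    L = np.linalg.cholesky((1.0 - lam ** 2) * V + 1e-12 * np.eye(theta.shape[1]))
    return m + rng.standard_normal(theta.shape) @ L.T

rng = np.random.default_rng(3)
theta = rng.standard_normal((5000, 2))           # 5000 particles, 2 parameters
moved = shrinkage_move(theta, lam=0.99, rng=rng)
```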
The Auxiliary Particle Filter part of the algorithm can be viewed as interchanging the importance sampling and resampling steps of traditional particle filters such as Sequential Importance Resampling. The resampling in step (18) is performed on point estimates of the hidden state predicted from the particles of the previous time step. Importance samples are then generated from the resampled point estimates. Particle diversity is maintained if the predicted estimates from the previous step are close to the true state. Auxiliary particle filtering is less sensitive to outliers and works well when the predicted estimates are good representations of the unknown states.
There are two algorithm parameters in Algorithm 1: the number of particles N and the shrinkage parameter λ. We followed liu1999combined and fixed λ close to one. This introduces only a small amount of shrinkage and prevents the parameter variance from exploding. If shrinkage were large (λ far from one), the particle filter would propose from a less heavy-tailed proposal distribution, with the parameters set close to their empirical means from the previous step; when the empirical means are inaccurate, the proposed particles would then not be representative of the hidden state. The number of particles N can affect the accuracy of the predictions and is investigated in Section 6. The computational complexity of the algorithm is O(N D³) at each time step, where D is the dimension of the data, since importance sampling at each step requires N likelihood computations, each with cost O(D³).
6 Experiments for BMDC vs. BEKK
We evaluated the performance of BMDC, BEKK and their Student's t variants in several experiments with multivariate financial time series. The high computational cost of BEKK limited the maximum number of financial assets in each of the analyzed series to five. We considered daily foreign exchange (FX) and daily equity returns, as well as intraday FX returns. The daily FX time series contain a total of 780 returns from January 2008 to January 2011. The daily equity time series contain 3000 returns from January 2000 to December 2011. The intraday FX time series consist of 5000 five-minute returns covering approximately the first six trading months of 2008. The price data were preprocessed to eliminate spurious prices. In particular, we eliminated prices recorded at times when markets were closed or illiquid. All the time series were standardized to have zero mean and unit standard deviation.
To illustrate the usefulness of considering time-varying parameter values, we compared the predictive performance of BMDC and BMDC-T with that of BEKK and BEKK-T. In addition, we compared the execution times of i) the RAPF method used by the Bayesian models and ii) the maximum likelihood estimation method used by the BEKK models. Finally, we studied the sensitivity of RAPF to the number of particles used.
The performance of each method is measured in terms of the predictive log-likelihood on the first return outside the training set. During the experiments, each method receives an initial time series of length 50. The methods are trained on these data and a one-step-ahead prediction is made. The predictive log-likelihood is evaluated on the next observation outside the training set. The training set is then augmented with the new observation, and the training and prediction steps are repeated. This process continues sequentially until no further data are received.
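This rolling one-step-ahead protocol can be expressed generically. The sketch below uses placeholder `fit` and `predict_logpdf` hooks (illustrative, not the paper's code) so that any of the four models could be plugged in:

```python
import numpy as np

def rolling_predictive_loglik(returns, fit, predict_logpdf, warmup=50):
    """Sequential evaluation: train on returns[:t], score returns[t],
    then grow the training set by one observation and repeat.
    Returns the average one-step-ahead predictive log-likelihood."""
    scores = []
    for t in range(warmup, len(returns)):
        model = fit(returns[:t])                          # refit on the history
        scores.append(predict_logpdf(model, returns[t]))  # score the next point
    return float(np.mean(scores))

# Toy stand-in model: a constant-volatility Gaussian fit by its sample std.
rng = np.random.default_rng(4)
r = rng.standard_normal(300)
fit_gauss = lambda history: history.std()
score_gauss = lambda s, x: -0.5 * np.log(2 * np.pi * s * s) - x * x / (2 * s * s)
avg_ll = rolling_predictive_loglik(r, fit_gauss, score_gauss, warmup=50)
```

For a sequential model such as BMDC, the refit step reduces to a single particle-filter update, which is what makes this protocol cheap in that case.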
[Table 1: Average predictive log-likelihood of BEKK, BEKK-T, BMDC and BMDC-T on the twenty-one daily FX time series, from single currencies (EUR, AUD, BRL, GBP, CHF) up to five-currency baskets (e.g. JPY,AUD,GBP,EUR,CHF and NZD,AUD,JPY,EUR,SEK). The numeric entries of this table were not recovered.]
[Table 2: Average predictive log-likelihood of BEKK, BEKK-T, BMDC and BMDC-T on the fifteen intraday FX time series, from single pairs (e.g. AUDJPY, EURCHF) up to three-pair combinations (e.g. USDCAD,USDSEK,USDSGD). The numeric entries of this table were not recovered.]
6.1 Results for BMDC vs. BEKK
Table 1 shows the average predictive log-likelihood for BEKK, BEKK-T, BMDC and BMDC-T on twenty-one daily FX time series, sorted by the dimensionality of the data. The method with the best predictive performance on each time series is highlighted in bold. Corresponding results are shown in Tables 2 and 4 for fifteen intraday FX and thirty daily equity time series, respectively. Overall, the best-performing method is BMDC-T, followed by BEKK-T and BMDC. The worst-performing method is BEKK.
We perform a statistical test to determine whether the differences among BEKK, BEKK-T, BMDC and BMDC-T are significant. These methods are compared with the multiple-comparison approach described by demšar2006statistical. In this framework, all the methods are ranked according to their performance on different tasks, and statistical tests are then applied to determine whether the differences among the average ranks of the methods are significant. In our case, each of the datasets analyzed represents a different task. A Friedman rank sum test rejects the hypothesis that all methods have equivalent performance. Pairwise comparisons between all the methods with a Nemenyi test at a 95% confidence level are summarized in Figure 3. Methods whose average ranks across datasets differ by more than a critical distance (the segment labeled CD in the figure) show significant differences in performance at this confidence level. The Nemenyi test confirms that BMDC-T is superior to all the other methods. Additionally, the dynamic methods are superior to their static counterparts, with BMDC outperforming BEKK and BMDC-T beating BEKK-T. Finally, BMDC is not statistically different from BEKK-T.
The plot in Figure 3 shows that the heavy-tailed models BMDC-T and BEKK-T perform better than the non-heavy-tailed BEKK. Although BMDC assumes a Gaussian likelihood, its posterior predictive distribution is heavy-tailed, as discussed in Section 3. Figure 2 confirms this by showing the logarithm of the posterior predictive density produced by BMDC on a particular instance of the analyzed time series. To produce this plot, we evaluated the predictive density on a grid of values for one dimension of r_t, averaging over the available particles, which approximate the posterior distribution of the covariance matrix and the model parameters. This is compared with the plot obtained by evaluating a Gaussian likelihood at the posterior mean estimates of the covariance matrix and the model parameters, approximated by the empirical mean computed across all the particles.

To understand the superior performance of BMDC relative to BEKK, we plot the average predictive log-likelihood of each method against the number of observations in Figure 4. The plot shows a typical average predictive log-likelihood for a daily FX time series. BMDC-T is clearly the most predictive method for any number of observations. BEKK and BEKK-T underperform BMDC and BMDC-T early on, when relatively few observations are available to fit the models; these two methods are susceptible to overfitting at early stages (especially BEKK-T in this case). With more data, BEKK-T becomes less susceptible to overfitting and outperforms BEKK. However, BEKK and BEKK-T still perform worse than BMDC and BMDC-T as the amount of training data increases. This confirms that BMDC and BMDC-T are better models for dynamic covariances in financial data, and not only because BEKK and BEKK-T suffer from initial overfitting problems. As a note of financial interest, the average predictive log-likelihood dips after 150 observations. This corresponds to the highly turbulent period near the end of 2008, when large swings in financial asset prices resulted in lower predictive log-likelihoods.
[Table 3: Predictive performance of BMDC-T with N = 1000, 4000, 9000 and 25000 particles, and of BEKK-T, on daily FX datasets of dimension 1 to 20, with total execution times in minutes in parentheses. The numeric entries of this table were not recovered.]
A major advantage of BMDC and BMDC-T with the RAPF method for Bayesian inference is scalability. Table 3 shows the sensitivity of the predictions of the regularized particle filter to the number of particles, with total execution times in minutes in parentheses, for the BMDC-T model on the daily FX dataset with 780 observations, ordered by data dimension. BEKK-T predictions and execution times are provided for comparison, except for datasets of dimension ten or higher, where BEKK-T failed to finish in a reasonable amount of time. BEKK-T was implemented using numerical optimization routines provided by Kevin Sheppard (http://www.kevinsheppard.com/wiki/UCSD_GARCH/).
Clearly, particle filtering is much faster than the numerical optimization of the log-likelihood function used by BEKK. In fact, particle filtering can be used for intraday trading and hedging, since each sequential update is on the order of seconds for five-dimensional time series. The computational cost of the particle filter is O(N D³) at each step, as described in Section 5. The computational cost of BEKK is O(I T D⁵) at each step, where I denotes the average number of iterations of the numerical optimizer per cycle and T is the length of the training data seen so far. The fifth power in this cost arises because, at each iteration of the optimization process, BEKK has to compute the gradients of O(D²) parameters, where each gradient computation requires O(T D³) operations. This cost makes BEKK infeasible for large datasets. In contrast, the cost of the RAPF at each step does not depend on T and only scales as O(N D³). This allows the RAPF to analyze long, high-dimensional time series.
Table 3 also shows the sensitivity of the predictions of BMDC-T to the number of particles N. Increasing N improves performance, but the effect is relatively small for low-dimensional datasets. For higher dimensions the improvements are more substantial, but they do not scale linearly with N. Since run times are linear in N but improvements in predictive performance are not, we can choose N such that inference and prediction are done in a desired amount of time for high-dimensional datasets.
[Table 4: Average predictive log-likelihood of BEKK, BEKK-T, BMDC and BMDC-T on the thirty daily equity time series, from single tickers (e.g. A, AA, AAPL) up to three-ticker combinations (e.g. ANF,APH,AVP). The numeric entries of this table were not recovered.]
[Table 5: Cumulative predictive log-likelihood of BEKK-Full, BEKK, GWP and BMDC on the FX (3D) and EQUITY (5D) datasets. The numeric entries of this table were not recovered.]
7 Experiments for BMDC vs. GWP
We performed another series of experiments comparing BMDC, BEKK with full parameter matrices (BEKK-Full) and the diagonal version of BEKK against the generalized Wishart process (GWP) proposed by Wilson11. For these experiments, we used the two financial datasets previously analyzed by these authors. The first corresponds to the daily returns of three currencies with respect to the US dollar: the Canadian dollar, the Euro and the British pound. This three-dimensional time series contains 400 observations from 15/7/2008 to 15/2/2010 and is referred to as FX. The second dataset was generated from the daily returns of five equity indexes: NASDAQ, FTSE, TSE, NIKKEI, and the Dow Jones Composite, over the period from 15/2/1990 to 15/2/2010. A sequence of time-varying empirical covariance matrices was obtained from the returns of these indexes and then used to produce a return series by sampling from a Gaussian with the empirical covariance at each time step, for a total of 400 steps. This dataset is referred to as EQUITY. In this case, the time series were not standardized to have zero mean and unit standard deviation, to be consistent with (Wilson11).
We followed the same experimental protocol as in the previous section. During the experiments, each method receives an initial time series of length 200. We then make one-step-ahead predictions for 200 iterations. In these experiments, the predictions for BMDC were generated in the same way as in (Wilson11): instead of averaging over the available particles, we evaluate a Gaussian likelihood at the posterior mean estimates of the covariance matrix and the model parameters, approximated by the empirical mean computed across all the particles. The predictions for GWP were generated similarly, by evaluating a Gaussian likelihood at the posterior mean of the covariance matrix, as approximated by averaging over the samples produced by a Markov chain Monte Carlo method.
In this section we do not evaluate the performance of BMDC-T. The reason is that GWP assumes a Gaussian likelihood for the data, so the Student's t likelihood used by BMDC-T would give it an unfair advantage over GWP.
7.1 Results for BMDC vs. GWP
Table 5 shows the cumulative predictive log-likelihood obtained by each method on the FX and EQUITY datasets. The method with the best predictive performance is highlighted in bold. On both datasets, BMDC is the best performer. BEKK-Full underperforms diagonal BEKK, which was used as the benchmark throughout this paper. This is likely due to worse overfitting in BEKK-Full, which is more heavily parameterized. GWP and the BEKK variants show mixed relative performance: GWP outperforms BEKK and BEKK-Full on the EQUITY dataset, which was generated from an empirical time-varying estimate of the return covariances, while BEKK outperforms GWP on the real-world FX dataset. Note that the positive predictive log-likelihoods result from keeping the same experimental protocol as the GWP paper, where the return data were not rescaled to have zero mean and unit standard deviation.
8 Conclusion
We have introduced a novel Bayesian Multivariate Dynamic Covariance model (BMDC) with time-varying parameters that follow a diffusion process. The proposed model can adapt its parameters to changing dynamics in financial markets, which results in significant improvements in predictive performance over standard econometric models such as BEKK and more recent methods such as the generalized Wishart process. In addition, we have presented an inference method based on particle filtering that yields substantial savings in computation time, scaling inference to high-dimensional and high-frequency datasets.
Acknowledgements
JMHL is supported by Infosys Labs, Infosys Limited.