Macroeconomic forecasting is one of the central tasks in economics. Broadly speaking, there are two approaches: structural and nonstructural forecasting. Structural forecasting aligns itself with economic theory and hence rises and falls with it; it receded following the decline of Keynesian theory. In recent years, new dynamic stochastic general equilibrium theory has been developed, and structural macroeconomic forecasting is poised for a resurgence. Nonstructural forecasting, in contrast, attempts to exploit the reduced-form correlations in observed macroeconomic time series; it relies little on economic theory, has consistently performed well, and continues to be improved. Various univariate and multivariate time series techniques have been proposed, e.g. the autoregression (AR), moving average (MA), autoregressive moving average (ARMA), generalized autoregressive conditional heteroskedasticity (GARCH) and vector autoregression (VAR) models, among many others. A very challenging issue for this nonstructural approach is to determine which variables and (their) lags are relevant. If we omit some “important” variables by mistake, we potentially create an omitted variable bias with adverse consequences for both structural analysis and forecasting. For example, Christiano et al. (1999) point out that the positive reaction of prices in response to a monetary tightening, the so-called price puzzle, is an artefact resulting from the omission of forward-looking variables, such as the commodity price index. Recently, Bańbura et al. (2010) show that, when shrinkage is applied along the cross-sectional dimension, the forecasting performance of small monetary vector autoregressions can be improved by adding additional macroeconomic variables and sectoral information. To illustrate this, we consider an example of interest rate forecasting. Nowadays people primarily use univariate or multivariate time series models, e.g.
the Vasicek, CIR, jump-diffusion, regime-switching, and time-varying coefficient models, all of which are mostly based on the information contained in the interest rate time series itself. However, in practice, the central bank (the Fed) bases its decisions about interest rate adjustments (as a monetary policy instrument) heavily on the national macroeconomic situation, taking many macro and financial measures into account. Bringing in this additional spatial information (over the space of variables rather than in a geographic sense; we use this term throughout for convenience) will therefore help improve forecasting performance. Another example of the interactions between macroeconomics and finance comes from modeling credit defaults by also using macroeconomic information, since variation in aggregate default rates over time presumably reflects changes in general economic conditions. Figlewski et al. (2006) find that credit events are significantly affected by macroeconomic factors. Not only can macroeconomics affect finance; finance can also affect macroeconomics. For example, economic crises typically start from a stock market crash. All of these call for an integrated analysis of macroeconomics and finance. Thus there has recently been a growing trend of using large panels of macroeconomic and financial time series for forecasting, impulse response studies and structural analysis; see Forni et al. (2000), Stock and Watson (2002a), Stock and Watson (2002b), and also Forni et al. (2005), Stock and Watson (2005b), Giannone et al. (2005), and Bańbura et al. (2010) for the latest advances.
Besides its presence in empirical macroeconomics, high dimensional data, where information often scatters through a large number of interrelated time series, is also attracting increasing attention in many other fields of economics and finance. In neuroeconomics and behavioral finance, one uses high dimensional functional magnetic resonance imaging (fMRI) data to analyze the brain’s response to certain risk related stimuli as well as to identify its activation areas, Worsley et al. (2002) and Myšičková et al. (2011). In quantitative finance, one studies the dynamics of the implied volatility surface for risk management, calibration and pricing purposes, Fengler et al. (2007). Other examples and research fields for very large dimensional time series include mortality analysis, Lee and Carter (1992); bond portfolio risk management and derivative pricing, Nelson and Siegel (1987) and Diebold and Li (2006); international economics (many countries); industrial economics (many firms); and quantitative finance (many assets), among many others.
On the methodology side, if people continue to use either low dimensional (multivariate) time series techniques on a few subjectively selected (or background-knowledge-based) variables, or high dimensional “static” methods originally designed for independent data, they might either disregard potentially relevant information (temporal dynamics and spatial dependence) and produce suboptimal forecasts, or bring in additional risk. Examples include the already mentioned price puzzle and interest rate forecasting problems. The more scattered and dynamic the information is, the more severe this loss becomes. The modeling becomes even more challenging because the macroeconomic data we typically deal with is observed only at low frequencies, e.g. monthly or yearly. For example, the popular dataset introduced by Stock and Watson (2005a) contains monthly macro indicators covering a broad range of categories including income, industrial production, capacity, employment and unemployment, consumer prices, producer prices, wages, housing starts, inventories and orders, stock prices, interest rates for different maturities, exchange rates, money aggregates and so on, while the number of monthly observations $T$ is only moderate. In summary, the challenge of modeling high dimensional time series, especially macroeconomic ones, comes from a mixture of serial correlation (temporal dynamics), a high dimensional (spatial) dependence structure and a moderate sample size (relative to dimensionality and lags). To this end, an integrated solution addressing these three challenges simultaneously is appealing.
To circumvent this problem, dynamic factor models have recently proved quite successful in the analysis of large panels of time series data, Forni et al. (2000), Stock and Watson (2002a), Stock and Watson (2002b); see also Forni et al. (2005), Giannone et al. (2005), Park et al. (2009) and Song et al. (2010) (nonstationary case). They rely on the assumption that the bulk of the dynamic interrelations within a large dataset can be explained and represented by a few common factors (a low dimensional time series). Less general models in the literature include the static factor models proposed by Stock and Watson (2002a), Stock and Watson (2002b) and the exact factor model suggested by Sargent and Sims (1977) and Geweke (1977).
Compared with the well studied dynamic factor models estimated through dynamic principal component analysis, the vector autoregressive (VAR) models have several natural advantages. For example, compared with the dynamic factor models’ typical two-step estimation procedure (dimension reduction first, then low dimensional time series modeling), the VAR approach is able to model the high dimensional time series in one step, which may lead to greater efficiency. It also allows variable-to-variable relationship (impulse response) analysis and facilitates the corresponding interpretation, which is not feasible in the factor modeling setup since the variables are “represented” by the corresponding factors. Historically, VAR models were considered inappropriate for analyzing high dimensional time series because they involve the estimation of too many ($N^2 p$, where $N$ is the dimensionality and $p$ is the number of lags) parameters. Thus they were primarily implemented in relatively low dimensional situations, e.g. the Bayesian VARs (BVAR) of Doan et al. (1984), or still through the idea of factor modeling, e.g. the factor-augmented VAR (FAVAR) of Bernanke et al. (2005). However, based on recent advances in variable selection, shrinkage and regularization theory from Tibshirani (1996), Zou (2006) and Yuan and Lin (2006), large unrestricted vector autoregressions have become an alternative for the analysis of large dynamic systems. Therefore, the VAR framework can also be applied to empirical problems that require the analysis of more than a handful of time series. Mol et al. (2008) and Bańbura et al. (2010) proceed from the Bayesian point of view. Chudik and Pesaran (2007) consider the case where both the dimensionality $N$ and the sample size $T$ are large through some “neighboring” procedure, which can be viewed as a special case of the “segmentized grouping” we study here (details in Subsection 2.4). In the univariate case ($N = 1$), Wang et al. (2007) study regression coefficient and autoregressive order shrinkage and selection via the lasso when the number of parameters is large relative to the sample size.
In this article, we study large vector autoregressions when the dimensionality $N$ is large and the sample size $T$ is moderate (relative to $N$ and $p$). Compared with prior work on (large) vector autoregressions, the novelty of this article lies in the following perspectives. First, from the variable selection and regularization point of view, the theoretical properties of many existing methods have been established under the independent scenario, which is rarely met in practice and contradicts the original time series setup (if the methods are used directly). Disregarding serial correlation in variable selection and regularization can be dangerous in the sense that various risk bounds in fact depend on the degree of time dependence, as we will illustrate later. We propose a new methodology to address this serial correlation (time dependence) issue together with high dimensionality and moderate sample size, which enables us to obtain consistency of variable selection even under the dependent scenario, i.e. to reveal the equilibrium among them. Second, our method is able to do variable selection and lag selection simultaneously. In the previous literature, variable selection is usually carried out first, and then the corresponding estimates’ performances w.r.t. different numbers of lags are compared through some information criterion to select the “optimal” number of lags. By doing so, the “interaction” between variable selection and lag selection is neglected. Additionally, when the number of candidate lags to be searched over is large, this is also computationally inefficient, due to the cost of the repeated variable selection procedures. Third, we differentiate the variable of interest’s own lags (abbreviated as own lags afterwards) from the lags of other variables (abbreviated as others’ lags afterwards). Their relative weights are also allowed to vary when predicting different variables, while in the rest of the literature they are assumed to stay the same.
This is due to the fact that the dynamics of some variables are driven by their own history, while those of other variables might be driven by the dynamics of others. When we include a vast number of macroeconomic and financial time series, assuming the same weight seems too restrictive. Fourth, our method is based on a computationally efficient approach that mostly uses existing packages, e.g. the LARS (least angle regression) package developed by Efron et al. (2004), while most other works in the literature take the Bayesian approach, which requires the choice of priors.
The rest of the article is organized as follows. In the next section, we present the main ingredients of the large vector autoregressive model (Large VAR) together with several corresponding estimation procedures and comparisons among them. The estimates’ properties are presented in Section 3. In Section 4, the method is applied to the motivating macroeconomic forecasting problem, where it outperforms some major existing methods. Section 5 contains concluding remarks with discussions about relaxing some assumptions. All technical proofs are sketched in the appendix.
2 The Large VAR Model and Its Estimation
In this section, we first introduce the model together with three different estimates, then discuss the data driven choice of hyperparameters to optimize the forecasting performance, provide a numerical algorithm, and finally summarize comparisons among the three estimates.
2.1 The Model
Assume that the high dimensional time series $X_t = (X_{1,t}, \dots, X_{N,t})^\top$ is generated from
$$ X_t = \sum_{l=1}^{p} A_l X_{t-l} + \varepsilon_t, \qquad t = 1, \dots, T, $$
where
$X_{t-1}, \dots, X_{t-p}$ (the lags of $X_t$), with $p$ the number of lags;
$A_1, \dots, A_p$ are $N \times N$ autoregressive matrices; $A$ is the matrix containing all coefficients, and $A_{\cdot j}$ and $A_{i \cdot}$ denote the $j$th column and the $i$th row of $A$ respectively;
$\varepsilon_t$ is an $N$-dimensional noise term, independent of $X_{t-1}, X_{t-2}, \dots$.
All $X_t$ and $\varepsilon_t$ are assumed to have mean zero. The covariance matrix of $\varepsilon_t$, $\Sigma_\varepsilon$, is assumed to be independent of $t$. Here we assume $\Sigma_\varepsilon$ to be diagonal, say $\Sigma_\varepsilon = \operatorname{diag}(\sigma_1^2, \dots, \sigma_N^2)$. In our case, this is justified by the fact that the variables in the panel we will consider for estimation are standardized and demeaned. A similar assumption is made in Mol et al. (2008). The relaxation allowing nonzero off-diagonal entries is discussed in Section 5.
We can see that, given large $N$ and $p$, we have to estimate a total of $N^2 p$ parameters, which is much larger than the moderate number of observations $T$, i.e. $N^2 p \gg T$. Consequently, ordinary least squares estimation is not feasible. Additionally, due to structural change points in macro and financial data (although not explored in this paper), the effective number of observations used for estimation could be much smaller than the original. Thus we can see that, on the one hand, we do not want to impose any restrictions on the parameters, so as to retain a general representation; on the other hand, it is known that making the model unnecessarily complex can degrade the efficiency of the resulting parameter estimates and yield less accurate predictions, as well as making interpretation and variable selection difficult. Hence, to avoid overfitting, regularization and variable selection are necessary. In the following, we discuss the estimation procedure with different kinds of regularization (illustrated in Figure 1). Before moving on, we incorporate the following very mild belief, as also considered in Bańbura et al. (2010): the more recent lags should provide more reliable information than the more distant ones. This tries to strike a balance between attaining model simplicity and keeping the historic information.
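To make the dimensionality problem concrete, the following sketch (in Python, with hypothetical sizes $T = 120$, $N = 50$, $p = 4$, not taken from the paper's empirical setup) builds the stacked lag design matrix for a VAR($p$) and compares the number of coefficients in the system with the number of usable observations per equation:

```python
import numpy as np

def build_var_design(X, p):
    """Stack p lags of a (T x N) panel into a regression design.

    Row t of Y is X[t + p]; the matching row of Z concatenates
    X[t + p - 1], ..., X[t], so a VAR(p) reduces to the multivariate
    linear model Y = Z B + E with N * p regressors per equation.
    """
    T, N = X.shape
    Z = np.hstack([X[p - l : T - l] for l in range(1, p + 1)])  # (T - p) x (N * p)
    Y = X[p:]                                                   # (T - p) x N
    return Y, Z

rng = np.random.default_rng(0)
T, N, p = 120, 50, 4   # e.g. ten years of monthly data on 50 series (hypothetical)
X = rng.standard_normal((T, N))
Y, Z = build_var_design(X, p)
print("coefficients in the full VAR:", N * N * p)       # N^2 * p = 10000
print("usable observations per equation:", Y.shape[0])  # T - p = 116
```

With 10,000 coefficients against 116 observations per equation, unrestricted least squares is clearly infeasible, which is exactly the regime the regularized estimates below are designed for.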
2.2 Universal Grouping
Without loss of generality, we start by considering one coefficient matrix, say $A$ with entries $a_{ij}$. Inspired by Bańbura et al. (2010), we note the fact that the dynamics of some variables are driven by their own history, while those of other variables might be driven by the dynamics of others. Consequently, we treat the variables’ own lags (diagonal terms of $A$) differently from others’ lags (off-diagonal terms of $A$) and impose different regularizations on them. We assume that the off-diagonal coefficients of $A$ are not only sparse, but also have the same sparsity pattern across different columns, which we call group sparsity. Thus we base our selection solution on group Lasso techniques (Yuan and Lin (1996)) for the off-diagonal terms and Lasso techniques (Tibshirani (1996)) for the diagonal terms. We use $W$ to denote the diagonal weight matrix whose $j$th diagonal entry $w_j$ is the positive real-valued weight associated with the $j$th variable, $j = 1, \dots, N$. It is included here primarily for practical implementation, since if $w_j$ is chosen as the $j$th variable’s standard deviation, it is equivalent (see subsection 2.6 for details) to standardizing the predictors so that they all have zero mean and unit variance, Tibshirani (1996), which is also preferable for comparisons with prior works.
Specifically, given the above notation, we use a group Lasso type penalty and a Lasso type penalty to impose regularization on other regressors’ lags and on the predicted variables’ own lags respectively, and obtain the following penalty for the coefficient matrix, with some generic constant $c$.
The hyperparameter $\alpha$ controls the extent to which others’ lags are less (or more) “important” than the own lags. When $\alpha$ is large, the penalty assigned to others’ lags is larger than that assigned to the own lags. As a result, the off-diagonal entries are more likely to be shrunk to zero than the diagonal ones, which corresponds to the case where each variable’s dynamics are driven by its own history; vice versa when $\alpha$ is small.
The lag-dependent weight reflects different regularization for different lags (over time). It is chosen so that, consistent with the previous belief that the more recent lags provide more reliable information than the more distant ones, large amounts of shrinkage are applied to the more distant lags, whereas small amounts of shrinkage are applied to the more recent ones. The hyperparameter $\gamma$ governs the relative importance of distant lags w.r.t. the more recent ones. Other functions of the lag with the same qualitative behavior could also be used. However, we do not consider a more general representation (with a corresponding data driven estimate) in order to avoid too many tuning parameters, especially when $N$ is large.
with hyperparameters $\lambda$, $\alpha$ and $\gamma$. We call this estimate the universal grouping estimate. As the number of variables increases, the autoregressive coefficients should be shrunk more in order to avoid overfitting, as already discussed by Mol et al. (2008).
Using the group Lasso type regularization for the off-diagonal terms actually imposes some strong assumptions on the underlying structure, which are not realistic from an economic point of view. Remark 2.2.1 First, we have just one hyperparameter ($\alpha$) to control the relative weights between own lags and others’ lags. This means that the weights between own lags and others’ lags are the same across different dimensions, which is hardly ever met in practice. Correspondingly, when we select the “optimal” $\alpha$ to optimize the forecasting performance, we are actually optimizing the averaged forecasting performance over all variables instead of over the variable of particular interest. This might produce suboptimal forecasts. Bańbura et al. (2010) consider a special case in which own lags are always more “important” than others’ lags, which is less general than our setup. Remark 2.2.2 Second, using the $\ell_2$ norm might shrink all off-diagonal terms in the same row ($A_{i\cdot}$) to zero simultaneously, which implicitly means that the $i$th variable is assumed to be either significant for all the other variables or significant for none of them. This is, again, too strong from an economic point of view.
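To fix ideas, the following sketch (a hypothetical illustration in Python; the exact weighting in the paper's penalty may differ) evaluates a universal-grouping-style penalty: an $\ell_1$ term on the diagonal entries (own lags), a row-wise group $\ell_2$ term on the off-diagonal entries (others' lags), and a lag weight that grows with the lag so that distant lags are shrunk more:

```python
import numpy as np

def universal_grouping_penalty(A_list, lam=0.1, alpha=1.0, gamma=1.0):
    """Penalty value for a list of N x N lag matrices A_1, ..., A_p.

    Diagonal entries (own lags) get a Lasso (l1) penalty scaled by alpha;
    each row of off-diagonal entries (others' lags) is penalized as a
    group via its l2 norm; lag l is weighted by l**gamma so that more
    distant lags receive heavier shrinkage.  The exact weighting scheme
    here is an assumption of this sketch, not the paper's formula.
    """
    total = 0.0
    for l, A in enumerate(A_list, start=1):
        w = l ** gamma
        diag = np.abs(np.diag(A)).sum()                 # own lags: l1
        off = A - np.diag(np.diag(A))
        groups = np.sqrt((off ** 2).sum(axis=1)).sum()  # others' lags: group l2
        total += lam * w * (alpha * diag + groups)
    return total

A1 = np.array([[0.5, 0.0], [0.2, 0.3]])
A2 = np.zeros((2, 2))
print(universal_grouping_penalty([A1, A2], lam=1.0, alpha=1.0, gamma=1.0))  # 1.0
```

Because the off-diagonal entries of each row enter through a single $\ell_2$ norm, the whole row is either kept or killed together, which is precisely the restriction criticized in Remark 2.2.2.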
2.3 No Grouping
To amend the deficiencies of the universal grouping estimate, we estimate the autoregressive coefficient matrix column by column instead of all at once. Without loss of generality, we consider the $j$th column here. Since $A_{\cdot j}$ is a vector, we can use Lasso type penalties for both own lags and others’ lags. Following similar ideas and abbreviations as in subsection 2.2, we have equation (5) to get $\hat{A}_{\cdot j}$:
with hyperparameters $\lambda_j$, $\alpha_j$ and $\gamma$. The subindex $j$ is added to $\lambda$ and $\alpha$ to emphasize that they could vary when estimating different columns $A_{\cdot j}$, $j = 1, \dots, N$. We call this estimate the no grouping estimate.
Remark 2.3.1 Because of the different $\alpha_j$’s ($j = 1, \dots, N$) for the different columns’ estimates $\hat{A}_{\cdot j}$, we allow individualized weights between own lags and others’ lags, and we can tune the $\lambda_j$’s and $\alpha_j$’s to produce optimal forecasting performance for each variable of interest, say the $j$th. Remark 2.3.2 For the same reason, we also get rid of the disadvantage that all off-diagonal terms in one row might be shrunk to zero simultaneously.
For simplicity of notation, we drop the common subindex $j$ from the quantities involved, and (5) becomes:
with hyperparameters $\lambda$ and $\alpha$.
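A minimal sketch of the no grouping estimate for a single equation (Python with scikit-learn; the weight vector encoding the own-lag discount and lag decay is hypothetical): a weighted Lasso is obtained by rescaling the design columns, running a plain Lasso, and mapping the coefficients back:

```python
import numpy as np
from sklearn.linear_model import Lasso

def no_grouping_fit(y, Z, weights, lam):
    """Weighted Lasso for one VAR equation via column rescaling.

    Dividing column k of Z by weights[k] and running a plain Lasso is
    equivalent to penalizing coefficient k by lam * weights[k]; the
    fitted coefficients are then rescaled back to the original scale.
    """
    Z_scaled = Z / weights                      # broadcasts over columns
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Z_scaled, y)
    return fit.coef_ / weights                  # undo the rescaling

rng = np.random.default_rng(1)
T, K = 200, 10
Z = rng.standard_normal((T, K))
beta = np.zeros(K)
beta[0], beta[3] = 0.8, -0.5                    # sparse truth: own lag + one other
y = Z @ beta + 0.1 * rng.standard_normal(T)
weights = np.ones(K)
weights[0] = 0.5                                # own lag penalized less (assumed)
beta_hat = no_grouping_fit(y, Z, weights, lam=0.05)
print("selected:", np.flatnonzero(np.abs(beta_hat) > 1e-8))
```

The rescaling trick is what lets each coordinate carry its own penalty weight while still using an off-the-shelf Lasso solver, in the spirit of the LARS-based computation described later.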
2.4 Segmentized Grouping
Since large panels of macroeconomic and financial data usually have some natural “segment” structure, e.g. multiple interest rate time series w.r.t. different maturities, numbers of employees w.r.t. different industrial sectors, price indices w.r.t. different goods, etc., if we take this information into account, instead of estimating either all at once or column by column, we can also estimate segment by segment. Without loss of generality, we consider the $b$th segment, with $G_b$ the index set for the $b$th segment and $|G_b|$ the cardinality of the set $G_b$. $A_{\cdot G_b}$ is the corresponding part of $A$. We also use corresponding diagonal weight matrices for the variables inside and outside the segment.
In this situation, for the estimation of the $b$th segment’s autoregressive coefficients $A_{\cdot G_b}$, we distinguish three kinds of lags: own lags, others’ lags within the same segment, and others’ lags outside the segment. Again following similar ideas and abbreviations as in subsection 2.2, we have the following estimation equation:
with hyperparameters $\lambda_b$, $\alpha_b$ and $\gamma$, $b = 1, \dots, B$, where $B$ is the overall number of segments. We call this estimate the segmentized grouping estimate.
Chudik and Pesaran (2007) consider the case where both the dimensionality and the sample size are large through some “neighboring” procedure, which can be viewed as a special case of the “segmentized grouping” studied here.
2.5 Forecast Evaluation and Choice of Parameters
The performance of the three penalization methods discussed above in model selection, parameter estimation and prediction accuracy depends critically on the selection of the penalty parameters. The hyperparameters $\lambda, \alpha, \gamma$ (universal grouping), $\lambda_j, \alpha_j, \gamma$ (no grouping) and $\lambda_b, \alpha_b, \gamma$ (segmentized grouping) are chosen via a data driven “rolling scheme”. To simulate real-time forecasting, we conduct an out-of-sample experiment. Let $T_0$ and $T_1$ denote the beginning and the end of the evaluation sample respectively. The point estimate of the $j$th variable’s $h$-step-ahead forecast, based on the information up to time $t$, is denoted by $\hat{X}_{j,t+h|t}$. The point estimate of the one-step-ahead forecast is computed as in equation (5), and the $h$-step-ahead forecasts are computed in a similar spirit. Out-of-sample forecast accuracy is measured in terms of the mean squared forecast error (MSFE):
We report results for the MSFE relative to that of the benchmark model (a random walk with drift), as also considered by Bańbura et al. (2010), i.e.
The parameters are estimated using the observations from the most recent years (rolling scheme), as illustrated in Figure 2. The parameters are set to yield a desired fit for the variable(s) of interest from $T_0$ to $T_1$. In other words, to obtain the desired magnitude of fit, the search is performed over a grid of the $\lambda$’s and $\alpha$’s to minimize the relative MSFE averaged over all variables (universal grouping), for the specific variable (no grouping), or averaged over the variables in the segment (segmentized grouping) respectively. Due to computational cost, we first prefix $\gamma$ at a few values and search for the $\lambda$’s and $\alpha$’s over loose grids; around the well performing $\lambda$’s and $\alpha$’s we then search over denser grids. The parfor command in Matlab is used for parallel computation to speed up this process. Also, using the least angle regression package provided at www-stat.stanford.edu/tibs/glmnet-matlab makes the computation time, together with the parameter selection, very moderate in our experience.
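The rolling evaluation can be sketched as follows (Python; a one-dimensional toy with an AR(1) forecaster standing in for the VAR forecasts — the window length and series are hypothetical). It computes the out-of-sample MSFE relative to the random walk with drift benchmark:

```python
import numpy as np

def rolling_relative_msfe(y, window, forecaster):
    """Out-of-sample MSFE of `forecaster` relative to a random walk
    with drift, using a rolling estimation window.

    `forecaster(history)` returns a one-step-ahead point forecast from
    the most recent `window + 1` observations; the benchmark forecast
    is the last value plus the mean first difference (drift).
    """
    errs_model, errs_rw = [], []
    for t in range(window, len(y) - 1):
        hist = y[t - window : t + 1]
        f_model = forecaster(hist)
        f_rw = hist[-1] + np.mean(np.diff(hist))   # random walk with drift
        errs_model.append((y[t + 1] - f_model) ** 2)
        errs_rw.append((y[t + 1] - f_rw) ** 2)
    return np.mean(errs_model) / np.mean(errs_rw)

# Toy check: on an AR(1) series, the true AR(1) predictor should beat
# the random-walk-with-drift benchmark, giving a ratio below 1.
rng = np.random.default_rng(2)
y = np.zeros(400)
for t in range(1, 400):
    y[t] = 0.5 * y[t - 1] + rng.standard_normal()
ar1 = lambda hist: 0.5 * hist[-1]
print(rolling_relative_msfe(y, window=120, forecaster=ar1))
```

In the paper's application the forecaster would be one of the three penalized VAR estimates refitted on each rolling window, and the hyperparameter grid search would wrap around this evaluation loop.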
Motivated by the adaptive lasso procedure of Zou (2006), if we define
, where $D$ is the diagonal matrix with the corresponding diagonal entries, $\otimes$ denotes the Kronecker product and $I$ is the identity matrix;
and note the fact that the penalty in (2) can be rewritten in this notation, we have the following estimation procedure (the proof is very simple and hence is omitted):
At Step (2), motivated by Wang et al. (2007), since we have more than one penalty term (mixed Lasso and group Lasso), we can iterate between the penalties and solve each subproblem as a standard (group) Lasso problem.
For the “no grouping” estimate, by noting that both penalties are of Lasso type, the estimation procedure above is equivalent to:
Generate the reweighted regressors for estimating $A_{\cdot j}$;
Corresponding to (9), solve:
Output the coefficient vector minimizing (9) (no grouping).
At Step (2), we can avoid iterating between multiple penalties and just solve it as a standard Lasso problem, e.g. by using the least angle regression package provided at www-stat.stanford.edu/tibs/glmnet-matlab.
In “large $N$, small $T$” paradigms, to get parsimonious models, shrinkage with penalization can set insignificant regression coefficients exactly to zero in model selection; but at the same time, significant coefficients are shrunk as well, even though they are retained in the selected working models, Wainwright (2009) and Huang et al. (2008). For this reason, we use our method only for variable (and lag) selection, but not for estimation, Chernozhukov et al. (2011). That is, we implement ordinary least squares estimation on the variables (and lags) selected from (8), (9) and (10) w.r.t. the three different estimates.
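A sketch of this two-stage idea in Python (scikit-learn; the data and penalty level are hypothetical): the Lasso is used only to pick the support, and OLS is refit on the selected columns to undo the shrinkage of the retained coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def post_lasso_ols(y, Z, lam):
    """Lasso for selection, OLS for estimation.

    The Lasso determines which columns of Z enter the model; the
    coefficients of the selected columns are then refit by ordinary
    least squares, removing the shrinkage bias on retained terms.
    """
    sel = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Z, y)
    support = np.flatnonzero(np.abs(sel.coef_) > 1e-10)
    beta = np.zeros(Z.shape[1])
    if support.size:
        ols = LinearRegression(fit_intercept=False).fit(Z[:, support], y)
        beta[support] = ols.coef_
    return beta, support

rng = np.random.default_rng(3)
T, K = 300, 20
Z = rng.standard_normal((T, K))
beta_true = np.zeros(K)
beta_true[2] = 1.0                              # single active regressor
y = Z @ beta_true + 0.1 * rng.standard_normal(T)
beta_hat, support = post_lasso_ols(y, Z, lam=0.1)
print(support, round(beta_hat[2], 2))
```

On this toy example the OLS refit recovers the active coefficient essentially without the downward bias a pure Lasso fit at the same penalty level would exhibit.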
It is now a matter of which of these three regularization techniques to use in practice. First, as already discussed in Remarks 2.2.1 and 2.3.1, from the point of view of allowing individualized weights (for the variable of particular interest) between own lags and others’ lags, and of individualized forecasting performance optimization, the “no grouping” approach is the best, the “universal grouping” one is the worst, and the “segmentized grouping” one is in between. Second, as in Remarks 2.2.2 and 2.3.2, from the point of view of whether all off-diagonal autoregressive coefficients in one row are shrunk to zero simultaneously, the “no grouping” approach is again favored. Third, as in subsection 2.5, the tuning parameters of the universal grouping, no grouping and segmentized grouping estimates are selected to optimize the averaged forecasting performance over all variables, the specific variable’s forecasting performance, and the averaged forecasting performance over the variables in the same segment respectively. When different variables’ time series have very distinct patterns, this individualized optimization is preferred. Fourth, for the estimation of large coefficient matrices, under the strong group-sparsity assumption on the underlying structure mentioned in subsection 2.2, the group lasso type estimator actually has a sharper theoretical risk bound; see Huang and Zhang (2009) for more details. In particular, they show that the group Lasso is more robust to noise, due to the stability associated with the group structure, and thus requires a smaller sample size to satisfy the sparse eigenvalue condition required in modern sparsity analysis. The universal grouping estimate is also more computationally efficient, since the whole autoregressive coefficient matrix is estimated at once. However, note that the statistical error is a combination of modeling error and estimation error. Even though the group Lasso type estimate might have smaller estimation error, the overall risk might not be smaller, due to the strong assumption on the underlying structure discussed in subsection 2.2. Moreover, typical macroeconomic data has low frequency, e.g. monthly, so the computational cost is not a severe problem, since we only need to update the model once per month at most. For all these reasons, we suggest the no grouping estimate for practical implementation as a compromise between flexibility and realism of the assumptions. For this reason, and for technical simplicity, we mainly study the theoretical properties of the no grouping estimate as defined in (6) in what follows.
3 Estimates’ Properties
In this section, we first show that, under the time series setup, if we just use the classic Lasso estimator, the risk bound will depend on the level of time dependence, as in Theorem 3.1. To circumvent this problem, through reweighting over time, our estimate in (6) still produces an estimator which, as shown in Theorems 3.2 and 3.3, is equivalent to an appropriate oracle. The techniques of the proofs are closely built upon those in Lounici et al. (2009), Bickel et al. (2009), Lounici (2008) and Wang et al. (2007).
3.1 Dependence Matters?
Now we illustrate how the temporal dependence level affects the risk bounds of the Lasso type estimator. For technical simplicity, we consider the univariate AR (or MA) model with $N = 1$, i.e. for equations (1) and (2):
with the corresponding regressors, coefficients and error term. We also define the associated Gram matrix from the regressors. The two specifications correspond to the AR and MA models respectively. In this situation, since there are no “others’ lags” ($N = 1$) and the coefficient vector is a plain vector, the standard Lasso estimator is defined through:
We assume there is a true coefficient vector $\beta$ for (12) and define its support $J = \{j : \beta_j \neq 0\}$ and sparsity $s = |J|$. Before moving on, we first recall the fractional cover theory based definitions, which were introduced by Janson (2004) and can be viewed as a generalization of $m$-dependency. Given a set $\mathcal{T}$ and random variables $Y_t$, $t \in \mathcal{T}$, we say:
A subset $\mathcal{T}'$ of $\mathcal{T}$ is independent if the corresponding random variables $\{Y_t\}_{t \in \mathcal{T}'}$ are independent.
A family $\{\mathcal{T}_j\}_j$ of subsets of $\mathcal{T}$ is a cover of $\mathcal{T}$ if $\bigcup_j \mathcal{T}_j = \mathcal{T}$.
A family $\{(\mathcal{T}_j, w_j)\}_j$ of pairs, where $\mathcal{T}_j \subseteq \mathcal{T}$ and $w_j \in (0, \infty)$, is a fractional cover of $\mathcal{T}$ if $\sum_j w_j \mathbf{1}_{\mathcal{T}_j} \ge \mathbf{1}_{\mathcal{T}}$, i.e. $\sum_{j : t \in \mathcal{T}_j} w_j \ge 1$ for each $t \in \mathcal{T}$.
A (fractional) cover is proper if each set $\mathcal{T}_j$ in it is independent.
$\chi(\mathcal{T})$ is the size of the smallest proper cover of $\mathcal{T}$, i.e. the smallest $m$ such that $\mathcal{T}$ is the union of $m$ independent subsets.
$\chi^*(\mathcal{T})$ is the minimum of $\sum_j w_j$ over all proper fractional covers $\{(\mathcal{T}_j, w_j)\}_j$.
Notice that, in the spirit of these notations, $\chi(\mathcal{T})$ and $\chi^*(\mathcal{T})$ depend not only on $\mathcal{T}$ but also on the family $\{Y_t\}_{t \in \mathcal{T}}$. Further note that $\chi^*(\mathcal{T}) \ge 1$ (unless $\mathcal{T} = \emptyset$) and that $\chi^*(\mathcal{T}) = 1$ if and only if the variables are independent, i.e. $\chi^*(\mathcal{T})$ is a measure of the dependence structure of $\{Y_t\}_{t \in \mathcal{T}}$. For example, if each $Y_t$ depends only on $Y_{t-1}$ but is independent of all earlier variables, we can split $\mathcal{T}$ into two independent sets, the even and the odd indices, so that $\chi^*(\mathcal{T}) \le \chi(\mathcal{T}) \le 2$.
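More generally (a standard consequence of the definitions, sketched here in our notation), an $m$-dependent sequence can be covered by $m+1$ independent subsets by grouping indices modulo $m+1$:

```latex
% Partition the index set of an m-dependent sequence by residues mod m+1:
\mathcal{T}_j = \{\, t \in \mathcal{T} : t \equiv j \pmod{m+1} \,\},
\qquad j = 0, 1, \dots, m.
% Indices within each \mathcal{T}_j differ by at least m+1, so each
% \mathcal{T}_j is independent and \{\mathcal{T}_0, \dots, \mathcal{T}_m\}
% is a proper cover of size m+1; hence
\chi^*(\mathcal{T}) \;\le\; \chi(\mathcal{T}) \;\le\; m + 1.
```

For a 1-dependent sequence this reduces to the even/odd split and the bound $\chi^*(\mathcal{T}) \le 2$.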
Before stating the first main result of this section, we make the following two assumptions.
Assumption 3.1 With a high probability, the regressors and the error terms in (12) satisfy the following bounds, for some constants.
Assumption 3.2 There exists a positive number $\kappa$ such that
where $J^c$ denotes the complement of the set of indices $J$, and $\delta_J$ denotes the vector formed by the coordinates of the vector $\delta$ w.r.t. the index set $J$.
Assumption 3.2 is the restricted eigenvalue assumption from Bickel et al. (2009), which is essentially a restriction on the eigenvalues of the Gram matrix as a function of the sparsity $s$. To see this, recall the definitions of restricted eigenvalues and restricted correlations in Bickel et al. (2009):
where $|J|$ denotes the cardinality of $J$ and $X_J$ is the submatrix of $X$ obtained by removing the columns that do not correspond to the indices in $J$. Lemma 4.1 in Bickel et al. (2009) shows that if the restricted eigenvalue of the Gram matrix satisfies a suitable condition for some integer $s$, Assumption 3.2 holds.
We can now state our first main result.
Theorem 3.1 Consider the model (12) and its random variables. Let the random variables satisfy Assumption 3.1, and let all diagonal elements of the Gram matrix equal $1$. Furthermore, let $\kappa$ be defined as in Assumption 3.2, and let $\phi_{\max}$ be the maximum eigenvalue of the Gram matrix. Then, with high probability, for any solution of (13), we have: