Inference in Bayesian Additive Vector Autoregressive Tree Models

by   Florian Huber, et al.

Vector autoregressive (VAR) models assume linearity between the endogenous variables and their lags. This linearity assumption might be overly restrictive and could have a deleterious impact on forecasting accuracy. As a solution, we propose combining VAR with Bayesian additive regression tree (BART) models. The resulting Bayesian additive vector autoregressive tree (BAVART) model is capable of capturing arbitrary non-linear relations between the endogenous variables and the covariates without much input from the researcher. Since controlling for heteroscedasticity is key for producing precise density forecasts, our model allows for stochastic volatility in the errors. Using synthetic and real data, we demonstrate the advantages of our methods. For Eurozone data, we show that our nonparametric approach improves upon commonly used forecasting models and that it produces impulse responses to an uncertainty shock that are consistent with established findings in the literature.



There are no comments yet.


page 1

page 2

page 3

page 4


Nonparametric Tests in Linear Model with Autoregressive Errors

In the linear regression model with possibly autoregressive errors, we p...

Distributionally Robust Optimization with Correlated Data from Vector Autoregressive Processes

We present a distributionally robust formulation of a stochastic optimiz...

Large Order-Invariant Bayesian VARs with Stochastic Volatility

Many popular specifications for Vector Autoregressions (VARs) with multi...

Semiparametric Mixed-Scale Models Using Shared Bayesian Forests

This paper demonstrates the advantages of sharing information about unkn...

A sparse Bayesian hierarchical vector autoregressive model for microbial dynamics in a wastewater treatment plant

Proper function of a wastewater treatment plant (WWTP) relies on maintai...

Variable Grouping Based Bayesian Additive Regression Tree

Using ensemble methods for regression has been a large success in obtain...

Permuted and Augmented Stick-Breaking Bayesian Multinomial Regression

To model categorical response variables given their covariates, we propo...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In macroeconomics and finance, most models commonly employed to study the transmission of economic shocks or produce predictions assume linearity in their parameters and are fully parametric. One prominent example of a frequently used fully parametric and linear model is the vector autoregression (VAR) that is extensively used in central banks and academia (see Sims, 1980; Doan et al., 1984; Litterman, 1986; Sims and Zha, 1998). In normal times, with macroeconomic relations remaining stable, this linearity assumption might describe the data well. In turbulent times, however, key transmission channels often change and more flexibility could be necessary.

The linearity assumption has been questioned several times in the recent literature. For instance, Granger and Terasvirta (1993)

show that macroeconomic quantities might exhibit non-linear relations and thus assuming linearity might be overly restrictive. As a solution, researchers increasingly wish to estimate non-linear models that assume the parameters to change over time and/or shock variances to be time-varying. These time-varying parameter (TVP) models, however, assume that within each point in time, the relationship between the endogenous and explanatory variables is linear. For recent contributions, see

Primiceri (2005); Kalli and Griffin (2014); Bitto and Frühwirth-Schnatter (2019); Huber et al. (2019); Chan et al. (2020).

Another strand of the literature proposes Bayesian nonparametric time series models in order to relax the linearity assumption. Particularly, several recent papers (see, e.g. Bassetti et al., 2014; Kalli and Griffin, 2018; Billio et al., 2019) propose novel methods that assume the transition densities to be nonparametric. These techniques are characterized by featuring an infinite dimensional parameter space that is flexibly adjusted to the complexity of the data at hand. For a review on Bayesian nonparametric methods, see Hjort et al. (2010)

. Within the nonparametric paradigm, there has been a number of popular methods that make use of machine learning techniques such as boosting

(Freund and Schapire, 1997; Friedman, 2001)

, bagging and random forests

(Breiman, 2001), decision trees and Mondrian forests (Roy and Teh, 2009; Lakshminarayanan et al., 2014) and Bayesian tree models (Chipman et al., 2002; Gramacy and Lee, 2008; Chipman et al., 2010; Linero, 2018). In this paper, we focus on Bayesian tree models applied to multivariate time series regressions.

The main contribution of the present paper is to connect the literature on Bayesian additive regression trees (BART, see Chipman et al., 2010)

with the literature on VAR models. BART is a nonparametric regression approach which allows for unveiling non-linear relations between a set of endogenous and explanatory variables without much input needed from the researcher. Intuitively speaking, it models the conditional mean of the regression model by summing over a large number of trees which are, by themselves, constrained through a regularization prior. The resulting individual trees will take a particularly simple form and can thus be classified as ”weak learners”. Each single tree only explains a small fraction of the variation of the response variable while a large sum of them is capable of describing the data extremely well.

It is precisely this intuition on which we build on when we generalize these techniques to a multivariate setting. More precisely, we assume that a potentially large dimensional vector of endogenous variables is driven by the lags of the response variables. As opposed to VAR models which assume linearity between the explanatory and endogenous variables, we capture these relations using BART. The resulting model, labeled the Bayesian Additive Vector Autoregressive tree (BAVART) model, is a highly flexible variant that can be used for forecasting and impulse response analysis.

Estimation and inference is carried out using Markov Chain Monte Carlo (MCMC) techniques. Since the error covariance matrix is non-diagonal, we propose methods that allow for equation-by-equation estimation of the multivariate model. These techniques imply that model estimation scales well in high dimensions and permits estimation of huge dimensional models. Another crucial feature of our approach is that it allows for flexibly handling heteroscedasticity. We control for heteroscedasticity by proposing a stochastic volatility specification. This feature is crucial for producing precise density forecasts.

We illustrate our BAVART model using macroeconomic data for the Eurozone. In this application, we assess the forecasting accuracy of our method relative to competitors that are commonly used in the literature. Since interest not only centers on one-step-ahead predictive distributions but also on multi-step-ahead forecasts, we provide algorithms to simulate from the relevant predictive distributions. These techniques are then used in a second application where we consider the effects of economic uncertainty on the Eurozone. We show how our model can be used to compute generalized impulse response functions (SGIRFs) and find responses that are consistent with established findings in the literature.

The remainder of this paper is organized as follows. Section 2 discusses the BART model in the context of the homoscedastic regression model while Section 3 extends this method to the VAR case and proposes the BAVART specification and how we control for heteroscedasticity. Section 4 presents the results of the forecasting and impulse response exercise. Finally, the last section summarizes our findings and concludes the paper.

2 Bayesian Additive Regression Tree Models

In this section, we briefly review the BART approach for regression on which we build on in later sections. For an extensive introduction, see Hill et al. (2020). Let denote the dimensional response vector and be a -dimensional matrix of exogenous variables with being the covariates in time . We assume the following relationship between and :

where denotes the error variance and is a (potentially) highly non-linear function. The BART model is then obtained by approximating through a sum of many regression trees as follows:


In Eq. (1), corresponds to a single tree model with denoting the tree structure associated with the binary tree and is the vector of terminal node parameters associated with and are the leaves of the tree. The binary trees are constructed by considering splitting rules of the form or with being a partition set. These rules typically depend on selected columns of , denoted as , and a threshold . The set is then defined by a unique partition of the predictor space such that or .

The step function is constant over the elements of :

Hence, the set defines a tree-specific unique partition of the covariate space such that the function returns a specific value for specific values of .

To avoid overfitting, the trees are encouraged to be small (i.e. take a particularly simple form) and the terminal node parameters to be shrunk to zero. If the first tree, , is a weak learner and fitted in a reasonable way, the corresponding tree structure will be very simple and elements in will be pushed towards zero. This implies that the first tree will explain a small fraction of the variation in . Subtracting from yields a new conditional model with transformed and the next tree will be fitted. This procedure is repeated for a sufficiently large number of trees until the fit of the additive model becomes reasonably good.

We illustrate BART using a simple example. In this simple example, we use a single regression tree model (i.e. ). The case of several regression trees is considered afterwards.

Example 1.

Consider the regression tree in Figure 1. We include two covariates . In the figure, each rectangle refers to a splitting rule and henceforth label this a node. At the top node (also called root node or simply root), we have the condition that asks whether . If this condition is not true, then we arrive at the terminal node, and set . By contrast, if the condition is fulfilled we move to the next node. The corresponding splitting rule states that if , we reach a terminal node with whereas if , we set . Using a single tree thus yields three possible values for the conditional mean, and and implies abrupt shifts between them. If is evolving smoothly over time, using a single tree would thus lead to a rather simplistic conditional mean structure.




Figure 1: Example of binary regression tree, , with internal nodes labeled by their splitting rules and leaf nodes labeled with the corresponding parameters , with and .
Example 2.

Since the tree model in the first example is too simplistic, our second example deals with a more flexible conditional mean structure. Let us consider a sum of regression trees with trees and covariates, depicted in Figure 2. Within a given tree, the same intuition as in Example 1 applies. However, since we now use two trees we gain more flexibility in modeling the conditional mean of . This is because the conditional mean at a given point in time is equal to the sum of the terminal node parameters for the two trees and the combination of the two parameters depend on the specific set of decision rules across both trees. This allows for a substantially richer mean structure and increased flexibility as opposed to a single tree model.

Regression tree, Regression tree,








Figure 2: Example of sum of regression tree, , with internal nodes labeled by their splitting rules and leaf nodes labeled with the corresponding parameters , where and .

The second example shows how flexibility is increased by adding more trees and illustrates how BART handles non-linearities in a flexible manner. In particular, each regression tree is a simple step-wise function and when we sum the different regression trees, we gain flexibility. The resulting additive model essentially allows for approximating non-linearities without prior assumptions on the specific form on the non-linearities.

3 BART models for multivariate time series analysis

3.1 Bayesian Additive Vector Autoregressive Tree Models

In this section we generalize the model outlined in the previous section to the multivariate case. Consider a -dimensional vector of endogenous variables . Stacking the rows yields a matrix . We assume that depends on a matrix with each being a -dimensional vector of lagged endogenous variables. The Bayesian Additive Vector Autoregressive Tree (BAVART) model is then given by:


where and

. For the moment, we assume that the error variance-covariance matrix

is time-invariant. In Eq. (2), is defined in terms of equation-specific functions :

Similarly to the standard BART specification, we approximate each through a sum of regression trees:

Here, denotes an equation-specific step function with arguments and . As before, the individual tree structures are associated with a -dimensional vector of terminal node coefficients associated with denoting the number of leaves per tree in equation .

As stated in Section 2, a regression tree creates a partition of the covariate space into subgroups. Internal nodes of regression trees are equipped with splitting rules that are characterized by some column of , labeled , and a specific threshold which can take any value in the range of . Such a decision rule takes the form or and thus allocates observations according to this condition.

We estimate the BAVART model by exploiting its structural form. Using the structural form of the VAR to speed up computation has been done in several recent papers (see, e.g., Carriero et al., 2019; Huber et al., 2020). In our case, Eq. (2) can be written as follows:


whereby is defined similarly to , is a -dimensional lower triangular matrix with and denotes a matrix of orthogonal shocks with with denoting a -dimensional diagonal matrix with the variances on its main diagonal.

Conditional on , this form permits equation-by-equation estimation since the shocks are independent. This leads to enormous computational gains. Notice that the equation can be written as:


Here, we let denote a regression tree while and refers to the column of and , respectively. denotes the element of .

Notice that Eq. (4) is a generalized additive model that consists of a non-parametric part and a regression part . For , the model reduces to a standard BART specification without including the contemporenous values of the other endogenous variables.

The key idea behind this formulation is that, for a sufficiently large number of trees, we approximate non-linear relations between and its lags while allowing for linear relations between the contemporaneous values of . After obtaining estimates of the trees and the free elements of , we can solve Eq. (3) for its reduced form representation as follows:


with , and .

3.2 Allowing for Heteroscedasticity

Up to this point we assumed the error variance to be constant. If the researcher wishes to relax this assumption, several feasible options exist. In a recent paper, Pratola et al. (2019) propose a combination of a standard additive BART model with a multiplicative BART specification to flexibly control for heteroscedasticity. But modeling heteroscedasticity with BART implies that we need some information on how the volatility of economic shocks depends on additional covariates (which potentially differ from ). As a simple yet flexible solution we adopt a standard stochastic volatility (SV) model. SV models are frequently used in macroeconomics and finance and possess a proven track record for forecasting applications (see, e.g., Clark, 2011; Clark and Ravazzolo, 2015). Our SV specification simply assumes that is time-varying:

with the time-varying variances . Their logarithms follow an AR(1) model:


Here, is the unconditional mean, is the persistence parameter and is the error variance of the log-volatility process. In our empirical work, all models feature stochastic volatility of this form.

3.3 The Prior and Posterior Simulation

The Bayesian approach calls for the specification of suitable priors over the parameters of the model. Here we mainly follow the different strands of the literature our approach combines and thus only provide a brief summary. In particular, we focus on the priors associated with the trees and the terminal node parameters for and

. We assume that the hyperparameters across priors are the same for each equation.

For each equation , the joint prior structure is given by:

At this level, the prior implies independence between and the remaining model parameters. Chipman et al. (2010) further introduce the following factorization:

This structure indicates that the tree components are independent of each other and of the remaining parameters while the prior on depends on the tree structure.

The prior on the tree structure is specified along the lines suggested in Chipman et al. (2010). More precisely, we assume a tree generating stochastic process that follows three recursive steps per tree. Assume initially , then we have the following steps:

  1. Start with the trivial tree that consists of a single terminal node (or only its root), which will be denoted by for each equation and .

  2. Split the terminal node

    with probability


    where denotes the depth of the tree, and are prior hyperparameters. In detail, a node at depth spawns children with and as the tree grows,

    increases while the prior probability decreases, making it less likely that a node spawns children. This implies that the probability that a given node becomes a bottom node increases and thus a penalty on tree complexity is introduced.

  3. If the current node is split, we introduce a splitting rule drawn from the distribution function and use this rule to spawn a left and right children node. In particular, the rule is set such that a given ”splitting” covariate is chosen with probability

    . As described above, conditional on a chosen covariate, one needs to estimate a threshold. The prior on this threshold is, in the absence of substantial information, assumed to be uniformly distributed over the range of


  4. Once we obtain all terminal nodes (i.e. no node is split anymore), we will label this new tree and return to step (2).

On the terminal node parameter , we use a conjugate Gaussian prior distribution , where . Recall that is the number of trees and

is the number of prior standard deviations between

and . Hence, larger values of increasingly push towards zero.

For the free elements in , we introduce a Horseshoe prior on each element of for every and

where and

denotes the half-Cauchy distribution. The prior specification on the parameters of the log-volatility equation follows the setup proposed in

Kastner and Frühwirth-Schnatter (2014). In details, we use a zero mean Gaussian prior with variance on the unconditional mean , a Beta prior on a transformation of the persistence parameter, and a Gamma prior on the error variance of the log-volatility process .

These priors can be combined with the likelihood to yield a joint posterior distribution over the coefficients and latent states in the model. To simulate from this joint posterior distribution we adopt a MCMC algorithm. This algorithm simulates all quantities in an equation-specific manner. Since all steps necessary are standard, we only provide an overview. Appendix A provides more details. Here it suffices to say that we sample all quantities related to the trees (i.e. for all ) using the algorithm outlined in Chipman et al. (2010). The latent states and the coefficients associated with the state equation of the log-volatilities are obtained through the efficient algorithm discussed in Kastner and Frühwirth-Schnatter (2014). Conditional on the trees and log-volatilities, the posterior of

is multivariate Gaussian and takes a standard form since the resulting conditional model is a linear regression model with heteroscedastic shocks. Finally, the parameters of the Horseshoe prior are simulated using techniques outlined in

Makalic and Schmidt (2015).

4 Empirical Work

4.1 Data Overview

In this section we briefly discuss the dataset adopted. Instead of focusing on US data, we apply our approach to monthly Eurozone data that ranges from January 1999 to January 2019. Our dataset is a selection of time series from the popular Euro Area Real Time Database (EA-RTD).

For the in-sample results discussed in the next sub-section, we consider a medium-scale model that includes six macroeconomic and financial time series. These six time series are augmented by an uncertainty index (abbreviated as Unc) which is estimated using the approach outlined in Jurado et al. (2015). Apart from the uncertainty indicator, we include the growth rate of the Dow Jones Euro Stoxx 50 price index (DJE50), HICP inflation (C_OV), unemployment rate (UNETO), 3-month Euribor (EUR3M), industrial production (XCONS) and the yield on Euro area 10-year benchmark government bonds (10Y).

In the forecasting application, we consider three different datasets. A small one that includes only HICP inflation, the unemployment rate and the 3-month Euribor. These three variables will consequently form our set of focus variables. Moreover, we consider the medium-sized dataset discussed above but without the uncertainty index as well as a large dataset. The large dataset augments the medium-sized one by adding the real effective exchange rate (ERC_BGR), the M2 money base (M2_V_NC), the one-month forward price of Brent crude oil (OILBR) and the spread between ten and two year government bond yields (YY).

All variables are transformed to be approximately stationary. We include one lag of the endogenous variables.111Using a single lag allows for simple inspection of several features of the model. Including more lags is straightforward but, as discussed in the text, lags larger than one only rarely show up in the splitting rules.

4.2 In-sample Results

In this section, we illustrate key features of the BART approach using a subset of the dataset discussed in the previous section.

Unc 31 10 11 6 9 14 9
DJE50 11 40 9 9 6 10 10
EUR3M 12 12 45 13 9 11 14
C_OV 11 12 13 55 12 14 12
UNETO 13 10 11 9 54 12 11
XCONS 12 10 9 10 10 35 12
10Y 11 10 5 5 7 11 41
Table 1: Number of times a given covariate shows up across splitting rules in a medium-scale VAR with one lag.

Variable importance is gauged by considering the posterior median of the number of times a given quantity shows up in a splitting rule. These frequencies are shown in Table 1. The columns refer to the different equations and the rows to the corresponding covariates. By simply inspecting the main diagonal elements of the table we directly observe that the first, own lag of a given variable within some equation appears to be important. This finding holds for all equations and resembles key results of the literature on Bayesian VARs which states that the AR(1) term explains most variation in . However, here it is worth stressing that our model is far from sparse and also the lags of other variables seem to play a role in determining the dynamics of a given endogenous variable.

In principle, we could also use Table 1 to construct a restricted VAR model. This can be achieved by introducing a threshold (such as including only variables that show up at least times). Using this (arbitrary) threshold would lead to a model that assumes uncertainty to depend only on its lagged value, the short-term interest rate, the unemployment rate and industrial production. Similarly, stock returns only depend on lagged stock returns, short-term interest rates and inflation. Short-term interest rates would only depend on inflation and lagged interest rates while inflation only depends on lagged inflation and the short rate. The unemployment rate is determined exclusively by its own lag and (lagged) inflation, closely resembling a Phillips curve-type relationship. Furthermore, we find that industrial production growth depends on lagged inflation, unemployment, uncertainty and its lag. Finally, 10-year bond yields depend on lagged bond yields, short-term interest rates, inflation and industrial production. This simple example essentially shows that, conditional on choosing a suitable cut-off value, one can investigate the determinants of each equation in a multivariate model and potentially use these to construct a restricted linear VAR model.

4.3 Forecasting results

In this section, we investigate whether our combination of BART and VAR models yields better forecasts of short-term interest rates, the unemployment rate and inflation. Before discussing the precise design of the forecast exercise, the question on how to compute the predictive distribution naturally arises. Producing one-step-ahead forecasts is computationally easy since conditional on the tree structures, splitting values and splitting covariates it is straightforward to compute a prediction for . More precisely (and with a slight abuse of notation), the one-step-ahead predictive distribution is given by:

where denotes the full history of and is a generic notation that summarizes all parameters and latent states (i.e. the tree structures, terminal nodes, log-volatilities etc.). The conditional density is:

is obtained by using (6) to predict the log-volatilities and using these predictions to form . Higher order forecasts are then computed iteratively by first simulating from the one-step-ahead predictive distributions to obtain a prediction for . This prediction is then used to form which is consequently used to predict . We repeat this procedure until we have a prediction with denoting the desired forecast horizon.

The design of the forecasting exercise is the following. Considering a hold-out period that runs from to ( monthly observations), we use an initial estimation period that lasts until to produce the -step-ahead predictive distributions for . After these are obtained, we repeat the procedure until we reach the final point in our sample. The predictions are then evaluated by using relative mean squared forecast errors (MSFEs) and the continuous ranked probability score (CRPS). The MSFEs allow us to gauge the quality of point forecasts while the CRPS is used to evaluate the quality of the predictive distribution.

We compare our BAVART model to several commonly used multivariate time series models in the literature. First, we use a set of VAR models that feature different shrinkage priors. The first is the non-conjugate Minnesota prior that allows for asymmetric treatment of ”own” lags (defined as lags of a given endogenous variable in ) and ”other” lags (defined of the lags of the other variables in ). The corresponding shrinkage hyperparameters are estimated using a random walk Metropolis Hastings step. Since this prior is often used by practitioners, we use it as a baseline model to which we benchmark our models. Second, we use the stochastic search variable selection (SSVS) prior of George et al. (2008). The third prior is the Normal-Gamma prior proposed in Griffin and Brown (2010) that has been applied to VARs in Huber and Feldkircher (2019). Moreover, we use a TVP-VAR with a NG shrinkage prior on the state innovation variances and the initial state of the system. Finally, all models are estimated using a stochastic volatility specification. Recall that all results are relative to the Minnesota VAR.

The results of our forecasting exercise are shown in Table 2. Our forecasting horse race draws a rich picture on relative model performance. We consider models of different sizes, priors, assumptions on the conditional mean and forecast horizons. For all these specifications, we provide a table that summarizes forecasting performance across the three focus variables and for point and density forecasts.

Small (M = 3) Medium (M = 7) Large (M = 10)
EUR3M 1.021 0.977 0.984 1.014 1.033 0.987 0.985 1.009 1.018 0.977 0.977 1.005
(0.924) (0.984) (0.987) (0.997) (0.958) (0.992) (0.99) (0.996) (0.951) (0.983) (0.981) (0.985)
UNETO 0.905 0.86 0.881 0.725 0.991 0.877 0.937 0.838 0.947 0.891 0.929 0.859
(1.053) (0.932) (0.941) (0.849) (1.05) (0.931) (0.958) (0.89) (1.001) (0.926) (0.945) (0.882)
C_OV 0.922 1.006 1.016 1.108 0.932 0.998 1.002 1.034 0.933 1.009 1.007 1.04
(0.878) (1.005) (1.012) (1.102) (0.899) (0.996) (0.997) (1.022) (0.909) (1.007) (1.002) (1.025)
EUR3M 0.942 0.974 0.978 0.996 1.255 0.938 0.887 1.043 1.156 0.962 1.006 1.501
(0.867) (0.984) (0.987) (1.003) (1.014) (0.969) (0.947) (0.971) (0.916) (0.959) (0.971) (1.053)
UNETO 0.858 0.939 0.937 0.757 2.321 1.444 1.439 4.592 1.883 1.786 1.769 3.357
(0.811) (0.965) (0.959) (0.82) (1.55) (1.074) (1.079) (1.547) (1.449) (1.11) (1.13) (1.6)
C_OV 0.745 0.998 1.012 1.177 0.731 0.989 0.995 1.144 0.72 1.024 1.02 1.143
(0.769) (1) (1.008) (1.168) (0.75) (0.988) (0.99) (1.114) (0.755) (1.018) (1.012) (1.121)
Table 2: Relative mean squared forecast errors (MSFEs) and continuous ranked probability scores (CRPS, in parentheses) to the Minnesota VAR.

Starting with the one-step-ahead horizon and point forecasts, we observe that for the interest rate and unemployment, our BAVART setup (irrespective of model size) is dominated by linear VAR models in terms of MSFEs. More precisely, the small-scale SSVS (for interest rates) and TVP-VARs (for unemployment) yield the lowest prediction errors. But notice that most of the ratios are close to one, indicating that the BAVART is never far off and thus highly competitive. Inflation appears to be an exception. Across all model sizes considered, we find that our model generally yields the smallest forecast errors. This suggests that controlling for non-linearities pays off when interest centers on forecasting inflation.

Turning to higher order forecasts provides a relatively similar picture. For longer-run forecasts, we again identify models which improve upon the BAVART setup for interest rate and unemployment forecasts. These improvements, however, are again quite small. One striking difference between the one-step- and the three-steps-ahead case is the strong performance of the Minnesota VAR for larger information sets and especially for unemployment. The results for inflation are again consistent with the findings for the one-step-ahead forecasts but appear to be more pronounced. Across model sizes, BAVART produces multi-step-ahead inflation predictions which feature the lowest forecast errors across all models considered.

Considering only point forecasts imply that we do not factor in how well a given model predicts higher order moments of the predictive distribution. We evaluate density forecasts using CRPSs. These are a generalization of the MSFE and consider the qualities of the full predictive distribution under scrutiny. Similarly to the one-step-ahead point forecasts, we find that most values are below but close to unity. This suggests that using the BAVART model improves upon the Minnesota benchmark in most cases. When compared to the other models, we can again identify a model that improves upon BAVART for interest rates and unemployment. For inflation, the CRPSs corroborate the findings for the point forecasts. For all information sets, the BAVART model produces the most precise predictive densities.

When we consider three-months-ahead predictive densities, the pattern observed above remains mainly in place. While there are models which yield more precise density forecasts than our proposed setup, these differences are often very small. By contrast, our model dominates when we consider higher-order inflation predictions. In this case, the performance gains vis-a-vis the competing models are appreciable.

To sum up, our forecasting exercise shows that the BAVART model works remarkably well for forecasting Eurozone inflation. For unemployment and interest rates, our proposed specification is slightly outperformed but both point and density forecasts are still competitive.

4.4 Impulse Responses to an Uncertainty Shock

In this section we briefly illustrate how the BAVART model can be used to carry out structural inference. Since our model is dynamic, we can consider how a shock at time affects the quantities in the system over time. Because of the highly non-linear nature of our model, we resort to structural generalized impulse response (SGIRF) functions (see Koop et al., 1996).

Let denote the structural shocks and denotes a selection vector that equals in the position. Hence, yields the structural shock. The SGIRF to the structural shock is then defined as the difference between a forecast that assumes (while setting the other shocks to zero) and the unconditional forecast (i.e. with ). The posterior distribution of the SGIRFs are computed by repeating this procedure during MCMC sampling.

Figure 3

shows the dynamic responses to a one standard deviation uncertainty shock. The dashed gray lines represent 16th and 84th posterior credible intervals. For brevity, we only report impulse responses that are significant (i.e. the credible intervals do not cover zero) for at least one month across the impulse response horizon.

Figure 3: Dynamic responses of selected macroeconomic quantities to an uncertainty shock. The dashed gray lines represent th and th posterior credible intervals.

The responses to an uncertainty shock largely mirror the ones reported elsewhere in the literature (see, among many others, Bloom, 2009; Jurado et al., 2015). The uncertainty index displays an immediate increase that fades out after the first five months. By contrast, we observe that inflation falls due to decreasing demand that drives down prices. Notice that prices react with a lag of one month. Since the uncertainty index is ordered first in this is purely driven by our shrinkage prior on the covariance parameters. This sluggish reaction of prices to uncertainty shocks reflects the fact that prices tend to be sticky and it takes some time for companies to adjust due to menu costs.

Turning to stock market reactions shows that equities display a rather short lived decline that fades out after around five months. Finally, real activity measured through industrial production also decreases. Interestingly, we do not observe a rebound in real activity arising from a ”wait-and-see mechanism” reported in, e.g., Bloom (2009).

To sum up, this brief structural exercise shows that BART is capable of producing dynamic responses that are consistent with the ones commonly found in the literature.

5 Conclusions

VAR models assume that the lagged dependent variables influence the contemporaneous values in a linear fashion. In this paper, we relax this by blending the literature on Bayesian additive regression tree (BART) models and VARs. The resulting model, labeled the Bayesian additive vector autoregressive tree (BAVART) model, can handle non-linear relations between the endogenous and the exogenous variables. Our proposed model is, moreover, capable of handling stochastic volatility in the shocks. This feature is crucial for obtaining precise density forecasts. We illustrate our model using synthetic as well as real data. Using data on the Eurozone shows that the BAVART model is capable of producing precise forecasts and that it yields reasonable impulse responses to an uncertainty shock.


  • Bassetti et al. (2014) Bassetti, F., Casarin, R., and Leisen, F. (2014).

    Beta-product dependent Pitman–Yor processes for Bayesian inference.

    Journal of Econometrics, 180(1):49–72.
  • Billio et al. (2019) Billio, M., Casarin, R., and Rossini, L. (2019). Bayesian nonparametric sparse VAR models. Journal of Econometrics, 212(1):97–115.
  • Bitto and Frühwirth-Schnatter (2019) Bitto, A. and Frühwirth-Schnatter, S. (2019). Achieving shrinkage in a time-varying parameter model framework. Journal of Econometrics, 210(1):75–97.
  • Bloom (2009) Bloom, N. (2009). The impact of uncertainty shocks. Econometrica, 77(3):623–685.
  • Breiman (2001) Breiman, L. (2001). Random Forests. Machine Learning, 45(1):5–32.
  • Carriero et al. (2019) Carriero, A., Clark, T. E., and Marcellino, M. (2019).

    Large Bayesian vector autoregressions with stochastic volatility and non-conjugate priors.

    Journal of Econometrics, 212(1):137–154.
  • Chan et al. (2020) Chan, J. C. C., Eisenstat, E., and Strachan, R. W. (2020). Reducing the state space dimension in a large TVP-VAR. Journal of Econometrics.
  • Chipman et al. (1998) Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948.
  • Chipman et al. (2002) Chipman, H. A., George, E. I., and McCulloch, R. E. (2002). Bayesian Treed Models. Machine Learning, 48(1):299–320.
  • Chipman et al. (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Ann. Appl. Stat., 4(1):266–298.
  • Clark (2011) Clark, T. E. (2011). Real-time density forecasts from bayesian vector autoregressions with stochastic volatility. Journal of Business & Economic Statistics, 29(3):327–341.
  • Clark and Ravazzolo (2015) Clark, T. E. and Ravazzolo, F. (2015). Macroeconomic forecasting performance under alternative specifications of time-varying volatility. Journal of Applied Econometrics, 30(4):551–575.
  • Doan et al. (1984) Doan, T., Litterman, R., and Sims, C. (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews, 3(1):1–100.
  • Freund and Schapire (1997) Freund, Y. and Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119–139.
  • Friedman (2001) Friedman, J. H. (2001).

    Greedy function approximation: A gradient boosting machine.

    Ann. Statist., 29(5):1189–1232.
  • George et al. (2008) George, E. I., Sun, D., and Ni, S. (2008). Bayesian stochastic search for VAR model restrictions. Journal of Econometrics, 142(1):553–580.
  • Gramacy and Lee (2008) Gramacy, R. B. and Lee, H. K. H. (2008). Bayesian Treed Gaussian Process Models With an Application to Computer Modeling. Journal of the American Statistical Association, 103(483):1119–1130.
  • Granger and Terasvirta (1993) Granger, C. W. and Terasvirta, T. (1993). Modelling non-linear economic relationships. OUP Catalogue.
  • Griffin and Brown (2010) Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian analysis, 5(1):171–188.
  • Hill et al. (2020) Hill, J., Linero, A., and Murray, J. (2020). Bayesian Additive Regression Trees: A review and look forward. Annual Review of Statistics and Its Application, 7(1):251–278.
  • Hjort et al. (2010) Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G. (2010). Bayesian Nonparametrics. Cambridge University Press, Cambridge.
  • Huber and Feldkircher (2019) Huber, F. and Feldkircher, M. (2019).

    Adaptive shrinkage in Bayesian vector autoregressive models.

    Journal of Business & Economic Statistics, 37(1):27–39.
  • Huber et al. (2019) Huber, F., Kastner, G., and Feldkircher, M. (2019). Should I stay or should I go? A latent threshold approach to large-scale mixture innovation models. Journal of Applied Econometrics, 34(5):621–640.
  • Huber et al. (2020) Huber, F., Koop, G., and Onorante, L. (2020). Inducing sparsity and shrinkage in time-varying parameter models. Journal of Business & Economic Statistics, pages 1–15.
  • Jurado et al. (2015) Jurado, K., Ludvigson, S. C., and Ng, S. (2015). Measuring uncertainty. American Economic Review, 105(3):1177–1216.
  • Kalli and Griffin (2014) Kalli, M. and Griffin, J. E. (2014). Time-varying sparsity in dynamic regression models. Journal of Econometrics, 178(2):779–793.
  • Kalli and Griffin (2018) Kalli, M. and Griffin, J. E. (2018). Bayesian nonparametric vector autoregressive models. Journal of Econometrics, 203(2):267–282.
  • Kastner and Frühwirth-Schnatter (2014) Kastner, G. and Frühwirth-Schnatter, S. (2014). Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics & Data Analysis, 76:408–423.
  • Kastner and Huber (2020) Kastner, G. and Huber, F. (2020). Sparse Bayesian Vector Autoregressions in Huge Dimensions. Journal of Forecasting.
  • Koop et al. (2019) Koop, G., Korobilis, D., and Pettenuzzo, D. (2019). Bayesian compressed vector autoregressions. Journal of Econometrics, 210(1):135–154.
  • Koop et al. (1996) Koop, G., Pesaran, M. H., and Potter, S. M. (1996). Impulse response analysis in nonlinear multivariate models. Journal of econometrics, 74(1):119–147.
  • Lakshminarayanan et al. (2014) Lakshminarayanan, B., Roy, D. M., and Teh, Y. W. (2014). Mondrian Forests: Efficient Online Random Forests. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27, pages 3140–3148. Curran Associates, Inc.
  • Linero (2018) Linero, A. R. (2018). Bayesian Regression Trees for High-Dimensional Prediction and Variable Selection. Journal of the American Statistical Association, 113(522):626–636.
  • Litterman (1986) Litterman, R. B. (1986). Forecasting with Bayesian Vector Autoregressions – Five Years of Experience. Journal of Business & Economic Statistics, 4(1):25–38.
  • Makalic and Schmidt (2015) Makalic, E. and Schmidt, D. F. (2015). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182.
  • Pratola et al. (2019) Pratola, M. T., Chipman, H. A., George, E. I., and McCulloch, R. E. (2019). Heteroscedastic BART via Multiplicative Regression Trees. Journal of Computational and Graphical Statistics, pages 1–13.
  • Primiceri (2005) Primiceri, G. E. (2005). Time Varying Structural Vector Autoregressions and Monetary Policy. The Review of Economic Studies, 72(3):821–852.
  • Roy and Teh (2009) Roy, D. M. and Teh, Y. W. (2009). The Mondrian Process. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1377–1384. Curran Associates, Inc.
  • Sims (1980) Sims, C. A. (1980). Macroeconomics and Reality. Econometrica, 48(1):1–48.
  • Sims and Zha (1998) Sims, C. A. and Zha, T. (1998). Bayesian Methods for Dynamic Multivariate Models. International Economic Review, 39(4):949–968.

Appendix A Posterior Approximation of the Trees

In this appendix, we provide details on the MCMC algorithm used to simulate from the joint posterior distribution of the trees and the terminal node parameters. The remaining quantities take well known forms and are very easy to simulate using textbook results for the Gaussian linear regression model.

As stated in Section (3.3) and following Carriero et al. (2019), Koop et al. (2019) and Kastner and Huber (2020), we rely on a structural representation of the model that entails equation-by-equation estimation.

We simulate from the conditional posterior of by conditioning on all trees except for the tree. This is achieved by using the Bayesian backfitting strategy discussed in Chipman et al. (2010). More precisely, the likelihood function depends on through the partial residuals

The algorithm proposed in Chipman et al. (2010) then proceeds by integrating out and then drawing with Metropolis-Hastings (MH) algorithm detailed in Chipman et al. (1998).

This step is implemented by generating a candidate tree from a proposal distribution and then accept the proposed value with probability:

The first term is the ratio of the probability of how the previous tree moves to the new tree against the probability of how the new tree moves to the previous one. The second term is the likelihood ratio of the new tree against the previous tree and the last term denotes the likelihood ratio of the new against the previous tree under the prior distribution.

The proposal distribution features four local steps (Chipman et al., 1998). The first is step grows the tree by splitting a node into two different nodes. The second step combines two non-terminal nodes into a single terminal node. The third step swaps splitting rules between two terminal nodes and the final step changes the splitting rule of a single non-terminal node.

After sampling the tree structure we can easily simulate from the posterior distribution of the terminal nodes. Since the prior is conjugate the resulting posterior will also be a Gaussian distribution that takes a standard form and does not explicitly depend on the tree structure but only on the subset of elements in the residuals allocated to a specific terminal node.