The novel coronavirus (covid-19) outbreak has drawn attention to the modeling of rare events such as pandemics and disasters. A difficulty in predicting their occurrences and designing polices to mitigate their impact is that there are few such data points even over a long span. After all, the CDC has only documented four influenza pandemics in the U.S. with deaths in excess of 100,000 over a 120 year period starting in 1900.111These are the Spanish flu in 1918 (675,000 US deaths), the H2N2 virus in 1957-58 (116K US deaths), H3N2 virus in 1968 (100,000 US deaths), the H1N1 virus in 2009 (12,500 US deaths). Source https://www.cdc.gov/flu/pandemic-resources/basics/past-pandemics.html. For natural disasters, the 12,000 deaths from the Galveston hurricane of 1900 remains a record, with the 1200 deaths from Katrina coming in a distinct second in terms of casualties. Worldwide, only seven earthquakes since 1500 were larger than 9 in magnitude,222Source: https://en.wikipedia.org/wiki/Lists_of_earthquakes#Largest_earthquakes_by_magnitude. and September 11 was the only terror attack on U.S. soil with more than 300 deaths, let alone 3000. Nonetheless, when a rare disaster strikes, it strikes in a ferocious manner as covid-19 reminds us. To learn from what has happened, researchers often study each event in isolation, or pool information from events occurring at different times and locations under the important but questionable assumption that the parameters of interest are the same across events, time, and space.333For a review of methodologies used, see Botzen, Deschenes, and Sanders (2019). We will apply standard time series methodology to analyze dynamic effects due to rare events, which embodies the use of heavy-tailed primitive shocks and a method for separating disaster-type shocks from the others.
To fix ideas, consider the financial cost and number of deaths due to natural disasters over the sample 1980:1-2019:12 shown in Figure 1.444The raw data files can be downloaded at the noaa website www.iii.org/table-archive/2142. The two series augmented to include 9/11 as in Ludvigson, Ma, and Ng (2020a). As is characteristic of heavy tailed data, the two series are dominated by four events: Hurricane Katrina in 2005, 9/11, Hurricanes Harvey/Maria/Irma in a four week span in the summer of 2017, and superstorm Sandy in 2012. Heavy-tailed data have infinite variance and pack a lot of information in a few observations, so in principle, the dynamic effects of disaster shocks should be consistently estimable. Indeed, if all variables in a multivariate system have heavy tails, we show below that the least squares estimator will converge at a fast rate of where is the index of the heavy-tailed shock and is the sample size. Though the distribution theory is a bit nonstandard, the regression framework is the same as the standard case when all variables have light tails.
But while many macroeconomic time series have excess kurtosis, they do not fit the characterization of heavy tails. For example, unemployment and industrial production have kurtosis of less than 10, while the two disaster series shown in Figure 1 have kurtosis in excess of 70, and the estimated tail index of approximately one suggests a distribution with infinite variance and possibly infinite mean.555The method often used to estimate the tail index is due to Hill (1975). As point of reference, stock returns data up to 1992 have kurtosis around 30. See Mills (1995). Beare and Toda (2020) analyzed covid-19 case across US counties and finds that the right tail of the distribution has a Pareto exponent close to one. This motivates a new multivariate framework in which finite and infinite variance shocks co-exist in such a way that the economic variables can be affected by heavy tailed shocks but not dominated by them. In our HL (’heavy-light’) framework presented in Section 3, the coefficient estimates on the infinite variance regressors are consistent at a rate of , which is still faster than the usual rate of . We then show that the infinite variance shock can be identified from information in the tails and estimated by a distance-covariance based independent component analysis (ICA) that is also valid in the heavy-tailed case as recently shown in Davis and Fernandes (2021). Even though the variance of the primitive shock does not exist, the Choleski decomposition of the sample covariance remains valid and is a useful way to prewhiten the data prior to ICA estimation.
Our assumption is that the primitive shocks
are mutually independent, which is stronger than mutual orthogonality that only rules out correlations in the second moments. The stronger assumption allows us to use independence as a specification test of the SVAR identifying restrictions irrespective of how they are motivated. For example, the ordering of the variables used in Choleski decomposition can be formally tested. The test can be applied to overidentified and exactly identified models and is valid when the shocks have thin or heavy tails. Our implementation is based on a permutation test for independence since its robustness to infinite variance data can be verified. This procedure is similar to the method proposed byMatteson and Tsay (2017).
The rest of the paper is structured as follows. Section 2 summarizes the key properties of heavy-tailed linear processes and discusses the implications for VAR estimation. Section 3 considers a new framework that allows a regressor with infinite variance to meaningfully impact a dependent variable with finite variance. Consistency and limiting behavior of the least squares estimator for parameters in a VAR are shown. Disaster shocks are then identified from the sign and the kurtosis of the estimated independent components analysis via distance covariance. Implementation of an independence test is then discussed. Simulations are presented in Section 4 and two applications are considered. In the first, the independence test is used to shed light on the conflicting evidence regarding the role of uncertainty in economic fluctuations. In the second, the new SVAR framework finds that disaster shocks have short lived but significant economic impact. Finally, the appendix contains background material on distance covariance as well as proofs of the main results in Section 3.
2 Heavy-Tailed Linear Processes
Disaster events are rare and heavy tails is a distinct characterization of their probabilistic structure. Well known heavy-tailed distributions include the Student-, , Fréchet, as well as infinite variance stable and Pareto distributions.
be the distribution of an IID sequence of random variable. Then has Pareto-like tails with tail index if
where is a finite and positive constant and as
. Examples include the Cauchy and Pareto distributions. The Gaussian distribution has ‘thin’ tails that decay faster than an exponential and is not included in this class. The results that follow can be extended to a more general condition oncalled regular variation in which (1) is replaced by
The normalizing constants in such an extension become less explicit so that we stick to the Pareto-like tail assumption for tractability.
Let be the
-th quantile ofand
the corresponding quantile for the joint distribution of the product. Distributions with Pareto-like tails have . Since for continuous , (1) implies
It is possible for the population variance to exist but the population kurtosis to be undefined. But even if the population moments do not exist, the sample moments can still have well defined limits. If has a Pareto-like tails with index , also has Pareto-like tails with index , and it holds that
where is the sample mean, and for , are stable random variables with exponents, respectively. Their joint distributions can be found in Davis and Resnick (1986).
To gain a sense of the tail properties of the data under investigation, we will make use of the fact that if is a IID Pareto process with tail index , then the sample kurtosis has the property (see Cohen, Davis, and Samorodnitsky (2020)) that
The limit of kurtosis, scaled by the sample size, is a random variable between zero and one so the maximum kurtosis that can be observed asymptotically is . Tabulating the distribution for and , we see that the quantiles roughly double with .
As point of reference, the disaster series shown in Figure 1 has and kurtosis of around 70, which is in the lower 10-th percentile.666The distribution of can be approximated by simulating times where is drawn from the exponential distribution.
is drawn from the exponential distribution.The number of death series has kurtosis of 147 and is in the 30-th percentile. In contrast, the kurtosis a typical of macro economic time series is under 10, hence the theory for heavy tails would be inappropriate. A multivariate system of time series with different tail properties thus necessitates a different setup.
There is a large literature on robust and quantile estimation of the parameters in a linear model to guard against extreme values which explicitly down-weights outliers.Blattberg and Sargent (1971) and Kadiyala (1972)
show that the least squares estimator is unbiased when the error in the regression model is drawn from a general symmetric stable Paretian distributions but it is not the best linear unbiased estimator. In the Cauchy case when, the BLUE is defined by the observation of and at which is at a maximum. A different viewpoint, also the one taken in this paper, is that the extreme values are of interest.777See, for example, two special issues on heavy-tailed data, Paolella, Renault, Samorodnitsky, and Varedas (2013) and Dufour and Kims (2014). Under this assumption and fixed regressors, Mikosch and de Vries (2013)
provide a finite sample analysis of the tail probabilities of the single equation CAPM estimates to understand why they vary significantly across reported studies. We are interested in estimating dynamic causal effects in a multivariate setting when the regressors are stochastic, and one of the primitive shocks has heavy tails.
2.1 Implications for VAR Estimation
Consider mean zero variables represented by a VAR(p):
where . Provided that det for all such that , exists, the moving-average representation of the model is where with .
Since the regressors in every equation are the same, systems and equation by equation OLS will yield numerically identical estimates and are characterized by
These errors are mapped to a vector of primitive shocks via a (time invariant) matrix :
where is usually assumed to be mean zero, mutually and serially uncorrelated and with being a diagonal matrix. See, for example, Stock and Watson (2015) and Kilian and Lutkepohl (2017). The reduced form errors are usually assumed to have ‘light tails’ which is possible only if has light tails. A model that satisfies these standard assumptions will be referred to as the LL (light-light) hereafter. Under regularity conditions for least squares estimation, is consistent and asymptotically normal.
The modeling issues that arise when one of the primitive shocks in a SVAR has infinite variance are best understood in the and case. Consider first a HH (heavy-heavy) model in which both shocks have heavy tails.
Let be an IID sequence of random variables with Pareto-like tails (i.e., equation (1)) with index and if . Let
If the sequence of constants are such that for some , then
the process exists with probability one and is strictly stationary.
Let be the sample autocorrelation at lag and suppose that for some . Then for ,
where , ) are stable random variables with indices and , respectively, and whose joint distributions are given in Davis and Resnick (1986). If , then the latter convergence also holds if provided is replaced by its mean-corrected version, .
, where is a stable random variable with index .
By restricting attention to , we only consider processes with infinite variance. Even though is not covariance stationary (since ), part (i) states that the process exists and is strictly stationary. The stated results for the sample covariance and sample autocorrelation are due to Davis and Resnick (1986, Theorem 3.3) and also hold when is centered for . Note that the convergence of is faster than the rate obtained for finite innovation variance. Part (iii) also applies when is replaced by some variable satisfying a strong mixing condition.
For VAR estimation, Lemma 1 then implies that
It then follows from continuous mapping that the least squares estimator is super consistent:
Though the analysis is straightforward, this setup is problematic because if and both have infinite variance, and must also have infinite variance. But a typical economic time series does not resemble the two series shown in Figure 1. Not only are the disaster series much less persistent, their kurtosis (over 70) are orders of magnitude larger than for variables like output growth, inflation, and interest rates.
3 A VAR with Heavy and Light Tailed Shocks
Our goal is a model in which (i) a heavy tail shock co-exists with a light tail shock , and (ii) is influenced by the current and past values of but not dominated by them in a sense to be made precise. We consider the HL (heavy-light) model derived from the SVAR
The sequences of n-dimension random vectors are iid and the components, are also independent. The will have Pareto-like tails with index and , while the remaining shocks will have thin tails with mean zero and variance 1.
The coefficient matrices for and the matrix will satisfy the following conditions.
as , where and are constants with
Assumption HL(ii) corresponds to a model specification for the data that depends on , the sample size. So is in fact a triangular array that depends on . However, since the context will be clear when the assumption HL is in effect, we will suppress this explicit dependence in our notation for . The primitive shocks and are assumed to be independent (over ) to exploit the results stated in Lemma 1. The assumption does not preclude time varying second moments, though it is stronger than mutual orthogonality of typically assumed in SVAR modeling. Assumption HL (i) restricts attention to processes with tail index and thus excludes Cauchy shocks. The assumption that the thin tail shocsk have unit variance is without loss of generality, but it is important that their variances are finite. Since the variations of will dominate those of when both are present, will have heavy tails and exhibit the large spikes originating from .
Specializing to the bivariate case with and , assumption HL(ii) is motivated by the fact that cannot have finite variance unless and for all . But the dynamic effects of on would then be zero at all lags by assumption, rendering the empirical exercise meaningless. Thus, the equation is modified to dampen the influence of on at rate , where is determined by the fact that has a limit. Localizing and to zero effectively makes exogenous to , but only asymptotically.888A similar effect can be achieved by replacing the heavy-tailed shock by its truncated version for some constant . This was the approach taken by Amsler and Schmidt (2012).
We show in the Appendix that under Assumption HL in the bivariate case,
where the limits have a stable distribution with index and , respectively. Thus the sample first and second moments of have (random and possibly constant) limits even though one of its shocks has infinite variance. The implications for least squares estimation of the HL model can be summarized as follows.
Suppose that the data are generated by (5), which for tractability . If Assumption HL holds, then the least squares estimate of satisfies
The convergence rate for is , which is . This is slower than the rate for in the HH model because one of the infinite variance regressors in the HH model is replaced by one that has finite variance. The convergence rate for can be written as which is slower than the rate for in the LL model because the variations in this equation are dominated by those from lags of , hampering identification of . Now the convergence rate for can be written as which appears to be faster than the rate obtained for in the LL model. This implies that that is consistent since . Hence in the HL model, the local parameter is consistently estimable.
3.1 Identification of and the Primative Shocks
The structural moving-average representation of the model is
where and . The effects of on are given by the first column of which depends on and . Hence to estimate the dynamic causal effects of , we also need to consistently estimate when has infinite variance.
As is well known, is not uniquely identified from the second moments of alone even when has finite variance because has the same covariance structure as for any orthonormal matrix . It is necessary to impose restrictions. The validity of an exactly identified model is difficult to test. Though Choleski decomposition is known to depend on the ordering of variables, practitioners can do little more than report informal robustness checks. The following lemma suggests that a test is possible if we are willing to assume that are mutually independent.
Let , where is a vector of mutually independent components of which at most one is Gaussian and is an invertible matrix with inverse . If the components of are pairwise independent, where is an invertible matrix, then where is a permutation matrix and is a diagonal matrix. Further, the components of must be mutually independent.
Proof. The proof of this result follows directly from Skitovich-Darmois theorem as described in the proof of Theorem 10 in Comon (1994). Since , where the components of can be written as
the independence of and components implies that for . That is, the first column of contains at most one nonzero value. A similar conclusion holds for all the columns of . Hence is product of a permutation matrix times a diagonal matrix , i.e., . In other words, or as was to be shown. It follows by the form of that the components of must be mutually independent.
Lemma 2 can also be seen as a statement about conditions for identification as it implies that subject to permutations of rows and changes of scale/sign, there is only one that can be used to recover the shocks from . As discussed in Gouriéroux, Monfort, and Renne (2017), local identification can fail when the shocks are independent because of scale changes, a problem that is usually resolved by normalizing the variances to fix the scale. However, failure of global identification arising from permutation and sign changes requires additional assumptions, and these are presumed to be embodied in the matrix. A test for independence of is thus a test for the validity of the identifying assumptions. A consistent test should reject with probability tending to one as for any modulo permutations and scale/sign changes. For example, if is incorrectly assumed to be lower triangular but an entry in the upper triangular matrix is non-zero,
should fail an independence test. The test is valid for any identification strategy, regardless of whether the model is over or exactly identified models, and applies whether the data have fat or thin tails. We will be using a distance covariance based permutation test which is known to control Type I error and also robust to the possibility of heavy tails. This will be further explained below.
3.2 Independent Components Analysis and Estimation of
The relationship between the vector of primitive shocks and error terms is
where is an -vector of iid random variables with mean zero and is an matrix with inverse . Based on a sample , the idea is to estimate by minimizing a function of the distance covariance function. Independent components analysis (ICA) is widely used to identify a linear mixture of non-Gaussian signals. Whereas PCA uses the sample covariance to find uncorrelated signals, ICA typically uses higher-order moments to find independent signals.999The two will give similar results when the higher-order statistics add little information. For a recent review, see Hyvarinen (2013). In the ICA literature, is known the mixing matrix and the unmixing matrix. ICA has been applied to finite variance SVARs in which global identification is achieved by imposing additional restrictions such as lower triangularity of .101010 See, for example, Moneta, Entner, Hoyer, and Coad (2013); Hyvarinen, Zhang, Shimizu, and Hoyer (2010); Gouriéroux, Monfort, and Renne (2017); Maxand (2020), Lanne, Meitz, and Saikkonen (2017). We will impose restrictions on . Following Chen and Bickel (2006) we assume that the parameter space of unmixing matrices is given by consisting of invertible matrices for which a) each of its rows has norm 1; b) the element with maximal modulus in each row is positive; c) the rows are ordered by ; for , if and only if there exits such that , and . Further it is assumed that the true unmixing matrix .
Our problem is non-standard because the shock of interest has heavy tails. But disasters are extreme adverse events with the distinctive feature of heavy tails. Though the population kurtosis does not exist if the variance is infinite, the sample kurtosis divided by the sample size will converge in distribution to a nondegenerate limit (see (4)) in the infinite variance case. Hence assuming no two components have the same kurtosis, the kurtosis ranking of is necessarily unique. The ranking together with a sign restriction that associates extreme events as ’bad shocks’ give a globally unique solution. The estimate of is then computed as the inverse of the unmixing matrix. Permutation of rows and rescaling can then be applied, if necessary, to compute the desired estimate of . In practice, the estimated residuals from fitting a VAR model to the data will play the role of the . It remains to find an ICA estimator that works in the infinite variance case.
3.3 Distance Covariance
There exist many ICA estimators for identifying the source process, which in our case correspond to the primitive shocks . Some procedures evaluate negative entropy (also known as negentropy) and take as solution the that maximizes non-Gaussianity of , while others maximize an approximate likelihood using, for example, log-concave densities.. The popular fast ICA algorithm of Hyvaninen, Karhunen, and Oja (2001) is a fixed-point algorithm for pseudo maximum-likelihood estimation. A different class of procedures take as the starting point that if the signals are mutually independent at any given
, their joint density factorizes into the product of their marginals. This suggests to evaluate the distance between the joint density (or characteristic function) and the product of the marginals.111111See, for example, Bach and Jordan (2001) and Eriksson and Koivunen (2003), and Hyvaninen and Oja (2000) for an overview of the methods used in signal processing. Statistical procedures include Chen and Bickel (2006), Hastie and Tibshirani (2003), Hastie and Tibshirani (2003), Samworth and Yuan (2012), Gouriéroux, Monfort, and Renne (2017). Chen and Bickel (2006) obtain a non-parametric convergence rate of , the same as the one obtained in Gouriéroux, Monfort, and Renne (2017) for parametric estimation.
As in Matteson and Tsay (2017), we will use ICA in conjunction with distance covariance to extract the independent sources in from the reduced-form errors in the VAR model. The distance covariance between two random vectors and of dimensions and , respectively, is
where is a weight function and denotes the characteristic function for any random vector . (Additional background on distance covariance can be found in Section 7.1.) The most commonly used weight function, which we will also adopt here, is
where , (see Szekeley, Rizzo, and Bakirov (2007)). The integral in (9) is then finite provided . Under this moment assumption, one sees immediately that and are independent if and only if since in this case the joint characteristic function factors into the product of the respective marginal characteristic functions, for all .
Based on data from , the general distance covariance in (9) can be estimated by replacing the characteristic function with their empirical counterparts. Let and be respective empirical characteristic functions, e.g., . Then
We now apply the notion of distance covariance to form an objective function for estimating the unmixing matrix . To this end, the unmixing matrix is the element of such that the components are independent. Now the components of are independent if and only if for where . Following Matteson and Tsay (2017), this is equivalent to , where
with weight function given by (10) with . Now based on a sample , the estimate of the unmixing matrix is found by minimizing the objective function,
subject to , where is the empirical estimate of using . It is shown in Matteson and Tsay (2017) that this procedure produces a consistent estimate of provided the variance of the is finite. More recently, it is shown in Davis and Fernandes (2021) that this procedure is also consistent if just the mean is finite. This result justifies the use of the objective function for estimating the unmixing matrix in the finite mean but infinite variance case. In case the mean is infinite, one can choose a in the weight function to ensure that the moment condition is met.
3.4 Prewhitening and Choleski Decomposition
Regardless of method used, the first step in ICA estimation is prewhitening the
, In the context of a SVAR with finite variance, this means linearly transformingby a matrix so that and . Since and by assumption, it follows that . Orthogonality of reduces the number of unknowns from to , which is computationally appealing. In effect, prewhitening removes second moment correlations prior to estimating the independent components.
In the ICA literature, prewhitening is often achieved by transforming the data using the output from a singular value decomposition of the covariance of. But a Choleski decomposition also produces orthonormal data and is a natural alternative for SVARs since it is already widely used to impose a lower triangular structure. By Lemma 2, we can test validity of the imposed structure via an independence test. However, as the population covariance matrix of does not exist, one might wonder if the Choleski decomposition applied to a sample is valid at all.
Let , and suppose that has larger kurtosis than . Define , where . Under Assumption HL, . Furthermore,
is a Choleski decomposition of satisfying , .
The Lemma says that even though a Choleski decomposition of the population covariance matrix of is not possible, the decomposition exists and has a nice structure when applied to the sample covariance matrix. The standardization of simplifies the calculations but is not essential for existence of the decomposition. It is shown in the Appendix that if is indeed lower triangular, and is correctly ordered,
where is sum of variates. Thus the Choleski decomposition of correctly ordered data can recover the true signals up to scale. Note that the Choleski decomposition is used here only as a prewhitening device but not as a way to achieve identification. Even if the ordering is incorrect, ICA will undo the ordering to find the that satisfy additional identification restrictions.
3.5 An Independence Test
As discussed above, a test for independence of is a test for the validity of the identifying assumptions. It will be helpful to first construct a test of independence for components in a random vector. For example, if the components of were already independent, then by Lemma 2, would be diagonal and no further analysis on the sturucture of would be required. Moreover, there would also be no point in analyzing impulse responses implied by restrictions that fail the independence test.
As reviewed in Josse and Holmes (2016), many independence tests are available. Using the ideas of Section 3.3, one can construct a test using the empirical version of the aggregated distance covariance defined in (12). Even though
has a limit distribution under the null hypothesis of independent components, the limit distribution is generally intractable. Hence direct use of the limit distribution for calculating cutoff values for the test statistic is infeasible.
Alternatively, one can use a test by calculating the test statistic for random permutations of the data. A permutation-based test for independence is founded on the idea that if there is dependence in the components, then the value of should be larger than the corresponding statistics based on random permutations of the components, in which the dependence among the components has been severed by the permutation. If is an iid sample of random vectors of dimension , then the permutation procedure is implemented via the following steps. For ,
For , generate , where is a random permutation of .
Compute using .
The value of the test is constructed as
where is the number of ’s from the permuted samples that exceed . We reject independence of the components in if the value is less than a prescribed nominal size. The test is implemented in the R-package steadyICA with a default value of 199.
In summary, the dynamic effects of a disaster shock can be analyzed as follows. Step 1 estimates the coefficients of a VAR model using least squares. Step 2 prewhitens the VAR residuals. Step 3 applies ICA to obtain independent shocks and associates the shock with the largest kurtosis as the disaster shock. Step 4 estimates the impulse response functions. Their dynamic effects after periods are defined by which can be computed once consistent estimates of and are available. We are primarily interested in the effects of on , so we can also estimate the first column of
by projecting the response variable of interest onon other controls as in Jorda (2005). However, it should be noted that the coefficient estimates from the local projection regressions will have non-standard properties in view of Proposition 1.
To illustrate the effectiveness of this methodolgy, simulations are performed with for four SVAR(1) models based on two specifications of and two sets of primitive shocks, holding the matrix fixed throughout at
In model 1 (labeled NLT), the matrix is Not Lower Triangular. In model 2 (labeled LT), is Lower Triangular.
From a given that is either NLT or LT, its inverse yields a non-normalized from which a normalized is formed by imposing the constraint that each row sums to one. Then is used to simulate data and subsequently estimated.
The first innovation specification (denoted HL) has one heavy-tailed shock while in the second specification (denoted LL), all three shocks have light tails. In both cases, the shocks are ordered such that has the largest kurtosis and has the smallest.
Let , and , denote the sample covariance matrices of and from a sample .
, is the lower triangular matrix from the Choleski decomposition of .
, is the lower triangular matrix from Choleski decomposition of .
, , .
, , .
Panel A: Permutation Test: observed
Fraction of values
Panel B: Permutation Test: estimated from VAR
Fraction of values