Interpretable Proximate Factors for Large Dimensions

05/09/2018 · by Markus Pelger, et al.

This paper approximates latent statistical factors with sparse and easy-to-interpret proximate factors. Latent factors in a large-dimensional factor model can be estimated by principal component analysis, but are usually hard to interpret. By shrinking factor weights, we obtain proximate factors that are easier to interpret. We show that proximate factors consisting of 5-10% of the cross-section observations with the largest absolute loadings are usually sufficient to almost perfectly replicate the population factors, without assuming a sparse structure in loadings. We derive lower bounds for the asymptotic exceedance probability of the generalized correlation between proximate factors and population factors based on extreme value theory, thus providing guidance on how to construct the proximate factors. Simulations and empirical applications to financial single-sorted portfolios and macroeconomic data illustrate that proximate factors approximate latent factors well while being interpretable.


1 Introduction

“Big data” is becoming increasingly popular for studying problems in economics and finance. Large-dimensional datasets contain rich information and open the possibility of new findings and new efficacious models, which is feasible only if we better understand the datasets. Factor modeling (Bai and Ng, 2002; Bai, 2003; Fan et al., 2013) is a method that summarizes information in large-dimensional panel data and is an active area of research. In a large-dimensional factor model, both the time dimension and the cross-section dimension of the dataset are large, and most of the co-movement can be explained by a few factors. Latent factors estimated from the data are particularly appealing as the underlying factor structure is usually not known. These factors are usually estimated by Principal Component Analysis (PCA). Latent PCA factors have been used successfully in economics and finance, for example for prediction and forecasting (Stock and Watson, 2002a, b), asset pricing of many assets (Lettau and Pelger, 2018b; Kelly et al., 2018), high-frequency asset return modeling (Aït-Sahalia and Xiu, 2018; Pelger, 2019), conditional risk-return and term structure analysis (Ludvigson and Ng, 2007, 2009) and optimal portfolio construction (Fan et al., 2013). (Of course, the application of latent factor models goes beyond the applications listed above, e.g. inferring missing values in matrices (Candès and Tao, 2010; Candès et al., 2011).) However, as the latent PCA factors are linear combinations of all cross-sectional units, they are usually hard to interpret. This poses a challenge for modeling and understanding the underlying structure that explains the co-movement in the data. Interpretability becomes particularly relevant for applications in economics and finance where the objective is to understand the underlying economic mechanism.

Practitioners and academics alike have used an intuitive approach to interpret latent statistical factors by focusing on the largest factor weights. A pattern in the largest factor weights suggests an economic interpretation (e.g. Lettau and Pelger (2018b); Pelger (2019)). In this paper we formalize this idea and show that factors based only on the largest factor weights already provide an excellent approximation to the population factors. This step further reduces the dimensionality of the problem, helping to better understand the economic mechanism at work.

We propose easy-to-interpret proximate factors for latent factors. We exploit the insight that cross-section units with larger factor weights have a larger signal-to-noise ratio and hence provide more information about the underlying factors. Our method consists of four simple steps. First, we estimate the underlying factor structure with conventional Principal Component Analysis (PCA), which returns the weights used to construct the latent PCA factors. Second, we set all factor weights to zero except for the largest ones in absolute value. Third, the proximate factors are obtained from a simple regression on the thresholded factor weights. Finally, the loadings are obtained from a regression on the proximate factors. We work under a general scenario where the factor weights and loadings in the true model are not sparse.

We show that one needs to make a clear distinction between factor weights and loadings. In a conventional PCA analysis the loadings serve two purposes: first, they measure the exposure of the cross-sectional data to the factors; second, they correspond to the weights used to construct the latent PCA factors. The same view is taken, for example, in a sparse PCA setup, where the loadings, and hence also the factor weights, are shrunk to a sparse matrix. We show that in an approximate factor model with non-sparse factor weights and loadings, we can construct proximate factors with sparse factor weights that are very close to the population factors. These proximate factors have non-sparse loadings which are consistent estimates of the true population loadings. The proximate factors are considerably easier to interpret as they are based on only a small fraction of the data, while enjoying the same properties as the non-sparse factors. (A common method to interpret low-dimensional factor models is to find a rotation of the common factors with a meaningful interpretation. This approach uses the insight that factor models are only identified up to an invertible transformation and represent the same model after an appropriate rotation. The criterion proposed by Kaiser (1958) is a popular way to select factors whose factor weights have groups of large and negligible coefficients. However, in large-dimensional factor models with a non-sparse factor weight structure, finding a “good” rotation becomes considerably more challenging. It is generally easier to find a rotation with a meaningful interpretation for our sparse proximate factors.)

We develop the statistical arguments that explain why the sparse proximate factors can be used as substitutes for the non-sparse PCA factors. The conventional derivations used in large-dimensional factor modeling to prove consistency of estimated factors do not apply to our proximate factors. It turns out that the proximate factors are in general a biased estimate of the true population factors. We can control this bias with Extreme Value Theory (EVT) and show how to construct the proximate factors such that this bias becomes negligible. The closeness between proximate factors and true factors is measured by the generalized correlation. (Generalized correlation, also called canonical correlation, measures how close two vector spaces are. It has been studied by Anderson (1958) and applied in large-dimensional factor models (Bai and Ng, 2006; Pelger, 2019; Andreou et al., 2017; Pelger and Xiong, 2018).) We provide an asymptotic probabilistic lower bound for the generalized correlation based on EVT. (In simulations, we verify that the lower bound has good finite sample properties.) The lower bound is easy to calculate and depends on the tail distribution of the population factor weights and the number of nonzero elements in the sparse factor weights. The lower bound provides guidance on constructing proximate factors that are guaranteed to have a high correlation with the true factors. Moreover, when the factor weights have unbounded support, we show that proximate factors asymptotically span the same space as the latent factors. Importantly, the estimated loadings of the proxy factors converge to the true population loadings up to the usual rotation. This surprising result is due to the two-stage procedure for estimating the loadings, which averages out the idiosyncratic noise. Hence, regressions based on the more interpretable proxy factors will asymptotically yield the same results as regressions using the harder-to-interpret PCA factors.

Our results are of practical and theoretical importance. First, we provide a simple and easy-to-implement method to approximate latent factors by a small number of observations. These sparse proximate factors usually permit a much simpler economic interpretation of the model, either by directly analyzing the sparse composition or after rotating them appropriately. Second, we show in empirical and simulation studies that this approximation works surprisingly well and that almost no information is lost by working with the sparse factors. Third, our asymptotic bounds for the correlations between the proximate factors and the population factors provide the theoretical reasoning for why our method works so well. In particular, they clarify under which assumptions the proximate factors are a good approximation or even converge to the population factors. As mentioned before, the idea of analyzing the largest factor weights is not new, but we are the first to provide the theoretical arguments for why and when it is a reasonable approach. Fourth, the asymptotic bounds provide guidance on how to select the key tuning parameter for our estimator, i.e. the number of non-zero elements. (Obviously, the degree of sparsity could be chosen to obtain a sufficiently high generalized correlation with the estimated PCA factors. However, this would be an in-sample choice of the tuning parameter with no guarantee for its out-of-sample performance. Our theoretical bound provides an alternative way to select the tuning parameter based on arguments that should also hold out-of-sample.)

We need to overcome three major challenges when deriving the probabilistic lower bound for the generalized correlation. First, we need to show that the estimated loadings converge uniformly to some rotation of the true loadings under general approximate factor model assumptions. (We impose assumptions similar to Bai and Ng (2002). In contrast to Fan et al. (2013), we do not assume the loadings to be fixed and bounded, which would restrict the distributions from which the loadings can be sampled.) Uniform consistency of the estimated loadings is necessary for our argument that the cross-section units with the largest estimated loadings carry the largest probabilistic “signal” for the underlying factors. Second, the hard-thresholding procedure that sets most loadings to zero to obtain the sparse factor weights has the downside that we lose some of the large-sample properties in the cross-section dimension. We need to take into account the cross-sectional dependence structure in the errors among a few cross-section units, which directly affects the noise level in calculating the generalized correlation. Third, the sparse factor weights, as well as the proximate factors, are in general not orthogonal to one another, in contrast to their non-sparse estimated versions. A proximate factor for a specific population factor can also be correlated with other latent factors, which is reflected in the generalized correlation. Our assumptions and results need to take these complex inter-factor correlations into account.

In two empirical applications, we apply our method to a large number of financial characteristic-sorted portfolios and a large-dimensional macroeconomic dataset. We find that in both datasets, proximate factors based on around 5-10% of the cross-sectional units approximate the non-sparse PCA factors very well, with average correlations of around 97.5%. The proximate factors explain almost the same amount of variation as the non-sparse PCA factors. The sparse factors have an economically meaningful interpretation which would be hard to obtain from the non-sparse representation.

Sparse loadings have already been employed to reduce the composition of factors and to make latent factors more interpretable. Most work formulates the estimation of sparse loadings as a regularized optimization problem that estimates principal components with a soft-thresholding term, such as an ℓ1 penalty or an elastic net penalty (Zou et al., 2006; Mairal et al., 2010; Bai and Ng, 2008; Bai and Liao, 2016). (Choi et al. (2010), Lan et al. (2014), and Kawano et al. (2015) estimate the sparse loadings by minimizing the sum of the negative log-likelihood of the data and a soft-thresholding term.) An alternative approach is to take a Bayesian perspective, specifying sparse priors for the factor loadings and using Bayesian updating to obtain posteriors for the sparse loadings (Lucas et al., 2006; Bhattacharya and Dunson, 2011; Kaufmann and Schumacher, 2013; Pati et al., 2014). All these approaches assume that the loadings are sparse in the population model, which allows them to develop an asymptotic inferential theory. However, the assumption of sparse population loadings may not be satisfied in many datasets. For example, the exposure to a market factor is universal and non-sparse in equity data. It is important to understand that this line of work typically does not distinguish between factor weights and loadings, i.e. it uses the shrunk loadings both as factor weights and as exposures. In contrast, we estimate only sparse factor weights, but non-sparse loadings. Additionally, we do not assume that the population factor weights are sparse. Of course, it is possible to adjust the sparse PCA method to make this important distinction between factor weights and loadings, i.e. to obtain the loadings from a second-stage regression. However, in contrast to our approach, there is no statistical theory explaining and justifying this procedure in the same general setup that we are using. Furthermore, lasso-type estimators have the well-known shortcoming of producing biased estimates through the way they shrink large elements, and a similar non-optimal shrinkage occurs in this case. It turns out that even the modified sparse PCA performs worse than our method in simulations and empirical applications.

Another method to increase the understandability and interpretability of factors is to associate latent factors or factor loadings with observed variables. Some latent factors can be approximated well by observed economic factors, such as Fama-French factors for equity data (Fama and French, 1992) or level, slope, and curvature factors for bond data (Diebold and Li, 2006). (Our proximate factors also provide a justification for the construction of Fama-French-style factors. Lettau and Pelger (2018b), among others, show that PCA-type factors explain double-sorted portfolio data well. The largest loadings (factor weights) for these PCA factors are exactly the extreme quantiles, and our proximate factors essentially coincide with the long-short Fama-French-type factors.) Fan et al. (2016a) propose robust factor models to exploit the explanatory power of observed proxies on latent factors. Another approach is to model how the factor loadings relate to observable variables. Connor and Linton (2007), Connor et al. (2012), and Fan et al. (2016b) at least partially employ subject-specific covariates to explain factor loadings, such as market capitalization, price-earnings ratios, and other firm characteristics. However, in order to explain latent factors by observed variables, it is necessary to include all the relevant variables, some of which might not be known. Our sparse proximate factors can provide discipline on which assets and covariates to focus.

The rest of the paper is structured as follows. Section 2 introduces the model and the estimator. Section 3 presents the consistency result for the estimated loadings and the asymptotic probabilistic lower bound for the generalized correlation. Section 4 presents simulation results. In Section 5 we apply our approach to financial and macroeconomic datasets. Section 6 concludes. All proofs and additional empirical results are relegated to the Appendix.

2 Model Setup

2.1 Estimator

We assume that a large-dimensional panel data set \(X \in \mathbb{R}^{T \times N}\) with \(T\) time-series and \(N\) cross-sectional observations has a factor structure. There are \(K\) common factors in the data and both \(N\) and \(T\) are large:

\(X = F \Lambda^\top + e\)  (1)

where \(F \in \mathbb{R}^{T \times K}\), \(\Lambda \in \mathbb{R}^{N \times K}\), and \(e \in \mathbb{R}^{T \times N}\) are the common factors, factor loadings, and idiosyncratic components, respectively. Factors and loadings are unobserved. (We assume that we have consistently estimated the number of factors \(K\).) A \(K\)-factor model of the panel data can be estimated by Principal Component Analysis (PCA). The PCA factors \(\hat F\) and loadings \(\hat\Lambda\) minimize a quadratic loss function

\(\min_{F, \Lambda} \frac{1}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N} \big(X_{it} - F_t^\top \Lambda_i\big)^2\)  (2)

where \(X_{it}\) is the observation of the \(i\)-th cross-section unit at time \(t\), \(F_t\) is the factor value at time \(t\), and \(\Lambda_i\) is the factor loading of the \(i\)-th cross-section unit. Factors and loadings are only identified up to an invertible transformation, as for any invertible \(H \in \mathbb{R}^{K \times K}\), \(FH\) and \(\Lambda (H^\top)^{-1}\) also minimize the objective function (2). Under the standard identification assumption that \(\hat\Lambda^\top \hat\Lambda / N\) is an identity matrix and \(\hat F^\top \hat F / T\) is a diagonal matrix, the solution to equation (2) is

\(\hat\Lambda = \sqrt{N}\,\Gamma\)  (3)

\(\hat F = X \hat\Lambda \big(\hat\Lambda^\top \hat\Lambda\big)^{-1} = \frac{1}{N} X \hat\Lambda\)  (4)

where the columns of \(\Gamma\) are the eigenvectors of the \(K\) largest eigenvalues of \(X^\top X / (NT)\), i.e. \(\hat\Lambda\) equals these eigenvectors multiplied by \(\sqrt{N}\); \(V_K = \frac{1}{N}\hat\Lambda^\top \big(X^\top X / (NT)\big) \hat\Lambda\) is a diagonal matrix with the largest eigenvalues in descending order, and \(\hat F\) contains the coefficients from regressing \(X\) on \(\hat\Lambda\). Note that \(\hat\Lambda\) serves two purposes here: on the one hand, it provides the cross-sectional weights used to construct the estimated factors; on the other hand, it is the estimated exposure to the factors. Bai (2003) shows that in an approximate factor model \(\hat F\) and \(\hat\Lambda\) are consistent estimators of \(F\) and \(\Lambda\) up to an invertible transformation.

The general approximate factor model framework assumes that the loadings \(\Lambda\), and thus their consistent estimator \(\hat\Lambda\), are not sparse, which is a necessary assumption to allow for cross-sectional correlation in the idiosyncratic component. As the estimated factors are linear combinations of the cross-section units weighted by a non-sparse \(\hat\Lambda\), they are composed of almost all cross-section units and are hence hard to interpret.

We propose a method to estimate proximate factors that are sparse and hence more interpretable. The method is based on the following steps (a code sketch follows the list):

  1. Sparse factor weights: \(\hat\Lambda\) are the standard PCA estimates of the loadings. The proximate factor weights \(\tilde\Lambda\) keep only the largest elements of \(\hat\Lambda\) and are obtained as follows: we shrink the \(N - m\) smallest entries in absolute value in each estimated loading vector to 0 and keep only the \(m\) largest elements to get the sparse weight vector \(\tilde\Lambda_k\) for each factor \(k = 1, \dots, K\). We standardize each sparse weight vector to have length one. (Formally, denote by \(M \in \{0,1\}^{N \times K}\) a mask matrix, with entries 1 and 0, indicating which factor weights are set to zero. The sparse factor weights can be written as

    \(\tilde\Lambda = M \odot \hat\Lambda\)  (5)

    where \(\hat\Lambda_k\) is the \(k\)-th estimated loading vector. The vector \(M_k\) has the element 1 at the positions of the \(m\) largest absolute loadings of \(\hat\Lambda_k\) and zero otherwise. \(\odot\) denotes the Hadamard product for element-by-element multiplication of matrices.)

  2. Proximate factors: We regress \(X\) on \(\tilde\Lambda\) to obtain the proximate factors \(\tilde F\):

    \(\tilde F = X \tilde\Lambda \big(\tilde\Lambda^\top \tilde\Lambda\big)^{-1}\)  (6)

  3. Loadings of proximate factors: We regress \(X\) on \(\tilde F\) to obtain the loadings \(\check\Lambda\) of the proximate factors:

    \(\check\Lambda = X^\top \tilde F \big(\tilde F^\top \tilde F\big)^{-1}\)  (7)
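To make the four steps concrete, the following minimal sketch implements them with plain numpy. It is our own illustration, not code from the paper: the function name proximate_factors and all variable names are ours, and the identification convention follows equations (3)-(4) above.

```python
import numpy as np

def proximate_factors(X, K, m):
    """Four-step proximate factor estimator (illustrative sketch).

    X : T x N panel, K : number of factors, m : nonzero weights per factor.
    """
    T, N = X.shape
    # Step 1: PCA loadings, scaled so that Lambda_hat' Lambda_hat / N = I_K,
    # i.e. sqrt(N) times the top-K eigenvectors of X'X/(NT), as in equation (3)
    _, eigvec = np.linalg.eigh(X.T @ X / (N * T))
    Lambda_hat = np.sqrt(N) * eigvec[:, ::-1][:, :K]
    # Step 2: keep only the m largest absolute entries of each loading vector
    # and standardize each sparse weight vector to length one
    Lambda_tilde = np.zeros_like(Lambda_hat)
    for k in range(K):
        keep = np.argsort(np.abs(Lambda_hat[:, k]))[-m:]
        Lambda_tilde[keep, k] = Lambda_hat[keep, k]
        Lambda_tilde[:, k] /= np.linalg.norm(Lambda_tilde[:, k])
    # Step 3: proximate factors from a regression of X on the sparse weights,
    # equation (6)
    F_tilde = X @ Lambda_tilde @ np.linalg.inv(Lambda_tilde.T @ Lambda_tilde)
    # Step 4: non-sparse loadings from a regression of X on the proximate
    # factors, equation (7)
    Lambda_check = X.T @ F_tilde @ np.linalg.inv(F_tilde.T @ F_tilde)
    return F_tilde, Lambda_check, Lambda_tilde
```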

Proximate factors \(\tilde F\) approximate the latent factors \(F\) well, as measured by the generalized correlation. The generalized correlation equals the correlation between appropriately rotated proximate and population factors, as defined in the next section. It is natural to measure the distance between two factors by their correlation. If two factors are perfectly correlated, they explain the same variation in the data and provide the same results in linear regressions. (Obviously, perfectly correlated factors do not necessarily have the same mean. However, in our empirical study the first and second moment properties of the proximate factors coincide with those of the non-sparse PCA factors.)

We illustrate the intuition in a one-factor model. In this case, the generalized correlation is equal to the squared correlation between \(\tilde F\) and \(F\). For simplicity, assume that the factor and the idiosyncratic components are i.i.d. over time: \(F_t \sim (0, \sigma_F^2)\) and \(e_{it} \sim (0, \sigma_e^2)\). Furthermore, our proximate factor consists of only one cross-sectional observation, i.e. \(m = 1\). Without loss of generality, the nonzero entry in \(\tilde\Lambda\) is the first one, \(\Lambda_1\), whose weight is normalized to one, so that \(\tilde F_t = X_{1t} = \Lambda_1 F_t + e_{1t}\). The squared correlation between \(\tilde F\) and \(F\) equals

\(\mathrm{Corr}(\tilde F, F)^2 = \frac{\Lambda_1^2 \sigma_F^2}{\Lambda_1^2 \sigma_F^2 + \sigma_e^2} = \frac{\Lambda_1^2}{\Lambda_1^2 + 1/S},\)

where \(S = \sigma_F^2 / \sigma_e^2\) denotes the signal-to-noise ratio. The second equality follows from dividing the numerator and the denominator by \(\sigma_e^2\). The correlation increases with the size of \(\Lambda_1\) and the signal-to-noise ratio \(S\). If the largest population loading is sufficiently large, the correlation will be close to one. In the rest of the paper, we formalize this idea under a general setup.
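This back-of-the-envelope formula is easy to verify numerically. The simulation below is our own illustration (not from the paper): with \(\Lambda_1 = 3\) and \(\sigma_F = \sigma_e = 1\), both the sample and the theoretical squared correlation are close to 0.9.

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma_F, sigma_e, Lambda_1 = 100_000, 1.0, 1.0, 3.0

F = rng.normal(0.0, sigma_F, T)        # factor, i.i.d. over time
e = rng.normal(0.0, sigma_e, T)        # idiosyncratic noise of the chosen unit
F_tilde = Lambda_1 * F + e             # proximate factor = one observation (m = 1)

rho_sim = np.corrcoef(F_tilde, F)[0, 1] ** 2
rho_theory = Lambda_1**2 * sigma_F**2 / (Lambda_1**2 * sigma_F**2 + sigma_e**2)
print(rho_sim, rho_theory)             # both approximately 0.9
```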

2.2 Assumptions

We impose several assumptions that are close to, but slightly stronger than, those in Bai and Ng (2002). In order to show that \(\tilde F\) is “close” to \(F\) as measured by the generalized correlation, \(\hat\Lambda\) needs to be a uniformly consistent estimator of \(\Lambda\) up to some invertible transformation. Furthermore, the largest elements in \(\hat\Lambda\) have to almost coincide with the largest elements in the rotated \(\Lambda\), which requires uniform consistency. The following assumptions are necessary for all theorems in this paper. We assume that there exists a positive constant \(M < \infty\) that can be used in all the assumptions.

Assumption 1.

Factors: \(E\|F_t\|^4 \le M\) and \(\frac{1}{T}\sum_{t=1}^{T} F_t F_t^\top \xrightarrow{p} \Sigma_F\) for some positive definite matrix \(\Sigma_F\).

Assumption 2.

Factor loadings: \(E\|\Lambda_i\|^4 \le M\) and \(\frac{1}{N}\Lambda^\top \Lambda \xrightarrow{p} \Sigma_\Lambda\) for some positive definite matrix \(\Sigma_\Lambda\). Loadings are independent of factors and errors.

Assumption 3.

Time and Cross-Section Dependence and Heteroskedasticity: Denote \(\tau_{ij} = E[e_{it}e_{jt}]\) and \(\gamma(s,t) = E\big[\frac{1}{N}\sum_{i=1}^{N} e_{is} e_{it}\big]\). Then for all \(i, j\) and \(t, s\),

  1. \(E[e_{it}] = 0\), \(E|e_{it}|^8 \le M\);

  2. \(e_t\) is stationary, \(\Sigma_e = E[e_t e_t^\top]\) is the covariance matrix of \(e_t\), \(\|\Sigma_e\|_1 \le M\);

  3. \(|\gamma(s,t)| \le \bar\tau_{st}\) for some \(\bar\tau_{st}\), with \(\sum_{s=1}^{T} \bar\tau_{st} \le M\) for all \(t\);

  4. \(E\big|\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\big(e_{is}e_{it} - E[e_{is}e_{it}]\big)\big|^4 \le M\).

Assumption 4.

For all \(t\), \(E\big\|\frac{1}{\sqrt{N}}\sum_{i=1}^{N} \Lambda_i e_{it}\big\|^4 \le M\).

These are standard assumptions for the general approximate factor model. (Assumption 1 about the population factors is the same as in Bai and Ng (2002). Assumption 2 allows the loadings to be random. Since the loadings are independent of factors and errors, all results in Bai and Ng (2002) hold. Assumption 3.1 imposes moment conditions on the errors, the same as Assumption C.1 in Bai and Ng (2002). This assumption implies that \(E[e_{it}^2]\) is bounded. Assumption 3.2 is close to Assumption 3.2 (i) and (ii) in Fan et al. (2013). This assumption restricts the cross-sectional dependence of the errors and is standard in the literature on approximate factor models. Since \(\Sigma_e\) is symmetric, \(\|\Sigma_e\|_1 \le M\) is equivalent to \(\|\Sigma_e\|_\infty \le M\), which implies that \(\frac{1}{N}\sum_{i,j}|\tau_{ij}| \le M\). Together with \(E[e_{it}^2]\) being the same for all \(t\) by the stationarity of \(e_t\), this assumption implies Assumption C.3 in Bai and Ng (2002). Assumption 3.3 allows for weak time-series dependence of the errors, which is slightly stronger than Assumption C.2 in Bai and Ng (2002). Assumption 3.4 is the time-average counterpart of Assumption C.5 in Bai and Ng (2002). Assumption 4 implies Assumption D in Bai and Ng (2002). The fourth-moment conditions in Assumptions 3.4 and 4, together with Boole’s inequality (the union bound), are used to show the uniform convergence of the loadings without assuming boundedness of the loadings. Since Assumptions 1-4 imply Assumptions A-D in Bai and Ng (2002), all results in Bai and Ng (2002) hold.)

Since each proximate factor is a linear combination of only \(m\) cross-section units, the estimation error of the proximate factor is determined by the errors of these cross-section units. We use \(\rho_e = \max_i \big(\sum_{j=1}^{m} \mathrm{Corr}(e_{it}, e_{jt})^2\big)^{1/2}\) to denote the maximum of the square root of the total pairwise squared correlations among the \(m\) cross-section units. Note that if \(e\) is i.i.d., \(\rho_e = 1\). If the errors of the \(m\) cross-section units are perfectly dependent, \(\rho_e = \sqrt{m}\). In almost all cases, \(1 \le \rho_e < \sqrt{m}\).

3 Theoretical Results

3.1 Consistency

Bai and Ng (2002) and Bai (2003) show that factors and loadings can be estimated consistently by PCA under Assumptions 1-4 when \(N, T \to \infty\). More precisely, there exists an invertible matrix \(H\) such that \(\frac{1}{T}\sum_{t=1}^{T}\|\hat F_t - H^\top F_t\|^2 = o_p(1)\) and \(\frac{1}{N}\sum_{i=1}^{N}\|\hat\Lambda_i - H^{-1}\Lambda_i\|^2 = o_p(1)\). However, in general, consistency does not hold for proximate factors for the following reason: the \(k\)-th proximate factor equals \(\tilde F_{tk} = \sum_{i \in S_k} w_{ik} X_{it}\) with noise component \(\sum_{i \in S_k} w_{ik} e_{it}\), where \(S_k\) is the set of the \(m\) nonzero elements in \(\tilde\Lambda_k\) and \(w_{ik}\) are the corresponding regression weights, i.e. the sums are only taken over the non-zero weight entries. Even in the special case when the idiosyncratic components at time \(t\) are i.i.d. with variance \(\sigma_e^2\), the variance of the noise component of the \(k\)-th entry of \(\tilde F_t\) is \(\sigma_e^2 \sum_{i \in S_k} w_{ik}^2\), which does not vanish as \(N, T \to \infty\) for fixed \(m\). For the PCA factors, averaging the idiosyncratic components over the non-sparse loadings leads to a law of large numbers that diversifies away the idiosyncratic components. For proximate factors this average is taken over a finite number of observations, resulting in the loss of diversification.

Although proximate factors are in general inconsistent, they can still be very close to the population factors. We will study the correlation (respectively, the generalized correlation) between the population and proximate factors. Our results depend on the following theorem, which shows the uniform consistency of the estimated loadings.

Theorem 1.

Under Assumptions 1-4, there exists an invertible matrix \(H\) such that

\(\max_{i \le N} \big\|\hat\Lambda_i - H^\top \Lambda_i\big\| = O_p\Big(\frac{1}{\sqrt{N}} + \frac{N^{1/4}}{\sqrt{T}}\Big).\)

Theorem 1 states that the maximum difference between the estimated loading and some rotation of the true loading over all cross-section units converges to 0 at a specific rate. Compared to the average rate \(1/\delta_{NT}\) with \(\delta_{NT} = \min(\sqrt{N}, \sqrt{T})\), the uniform convergence rate of \(\hat\Lambda\) is slower if \(T = o(N^{3/2})\). (Bai (2003) showed in Proposition 2 that \(\hat\Lambda_i - H^\top\Lambda_i\) converges pointwise at rate \(O_p(1/\sqrt{T} + 1/N)\). Fan et al. (2013) showed in Theorem 4 that \(\max_i \|\hat\Lambda_i - H^\top\Lambda_i\| = O_p(\sqrt{\log N}/\sqrt{T} + 1/\sqrt{N})\) under a boundedness assumption on the loadings, and \(\max_t \|\hat F_t - H^{-1}F_t\| = O_p(1/\sqrt{T} + T^{1/4}/\sqrt{N})\). Our results relax the boundedness assumption on the loadings and show a uniform consistency rate for the loadings that is the counterpart of the rate for the factors in Fan et al. (2013), but, as expected, is slower than the rate for the loadings in Fan et al. (2013).) Theorem 1 implies that a large \(\hat\Lambda_i\) corresponds to a large \(H^\top\Lambda_i\). As a result, the \(m\)-th largest value of \(\hat\Lambda_k\) is close to the \(m\)-th largest value of \((\Lambda H)_k\), which is formally stated in Lemma 3 and proven in the Appendix. Hence, we can derive the distribution of the correlation based on the largest population loadings instead of the estimated loadings.
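A quick Monte Carlo check of this uniform convergence is shown below; it is our own illustration with a one-factor design, and the rotation \(H\) is obtained by a least-squares fit of the estimated loadings on the true ones.

```python
import numpy as np

rng = np.random.default_rng(0)
for N, T in [(100, 100), (400, 400), (1600, 1600)]:
    F = rng.normal(size=(T, 1))
    Lam = rng.normal(size=(N, 1))
    X = F @ Lam.T + rng.normal(size=(T, N))
    # PCA loadings as in equation (3)
    _, eigvec = np.linalg.eigh(X.T @ X / (N * T))
    Lam_hat = np.sqrt(N) * eigvec[:, -1:]
    # best (here scalar) rotation of the true loadings onto the estimates
    H = np.linalg.lstsq(Lam, Lam_hat, rcond=None)[0]
    print(N, T, np.abs(Lam_hat - Lam @ H).max())  # max error shrinks with N, T
```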

3.2 Loadings of Proximate Factors

The estimated loadings of the proximate factors asymptotically span the same space as the population loadings and hence lead to the same regression or projection results. In other words, up to an invertible transformation, the loadings of the proximate factors are consistent. (Note that our notion of consistency does not imply pointwise consistency, i.e. for a finite number of cross-section units the loadings can be distinct.) This result actually does not depend on the degree of sparsity \(m\). The key element is that the loadings are obtained in a second-stage regression that diversifies away the idiosyncratic noise. Hence, it is important to distinguish between loadings and factor weights, as sparse loadings are in general not consistent estimators of the population loadings.

One of the major problems when comparing two different sets of loadings is that a factor model is only identified up to invertible linear transformations. Two sets of loadings represent the same model if the loadings span the same vector space. As proposed by Bai and Ng (2006), the generalized correlation is a natural candidate measure for how close two vector spaces are to each other. Intuitively, we calculate the correlation between the loadings of the proximate and the population factors after rotating them appropriately. The generalized correlation measures how many loading vectors two sets have in common. The generalized correlation between the loadings of the proximate and population factors is defined as

\(\hat\rho_\Lambda = \mathrm{tr}\Big(\big(\Lambda^\top\Lambda\big)^{-1}\big(\Lambda^\top\check\Lambda\big)\big(\check\Lambda^\top\check\Lambda\big)^{-1}\big(\check\Lambda^\top\Lambda\big)\Big).\)

Here, the generalized correlation \(\hat\rho_\Lambda\), ranging from 0 to \(K\) (the number of factors), measures how close \(\check\Lambda\) and \(\Lambda\) are. If \(\check\Lambda\) lies in the space spanned by \(\Lambda\), then \(\hat\rho_\Lambda = K\). Otherwise, if the space spanned by \(\check\Lambda\) is orthogonal to the space spanned by \(\Lambda\), then \(\hat\rho_\Lambda = 0\). (Our generalized correlation measure is the sum of the squared individual generalized correlations. The first individual generalized correlation is the highest correlation that can be achieved through linear combinations of the proximate loadings \(\check\Lambda\) and the population loadings \(\Lambda\). For the second generalized correlation, we first project out the subspace that spans the linear combination for the first generalized correlation and then determine the highest possible correlation that can be achieved through linear combinations of the remaining \(K-1\)-dimensional subspaces. This procedure continues until we have calculated all \(K\) individual generalized correlations. Mathematically, the individual generalized correlations are the square roots of the eigenvalues of the matrix \((\Lambda^\top\Lambda)^{-1}(\Lambda^\top\check\Lambda)(\check\Lambda^\top\check\Lambda)^{-1}(\check\Lambda^\top\Lambda)\).)
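The trace formula is straightforward to evaluate; the helper below is our own sketch (names are ours) and returns both the total measure and the individual generalized correlations.

```python
import numpy as np

def generalized_correlation(A, B):
    """Total and individual generalized correlations between the column
    spaces of two N x K matrices A and B (e.g. two sets of loadings)."""
    M = (np.linalg.inv(A.T @ A) @ (A.T @ B)
         @ np.linalg.inv(B.T @ B) @ (B.T @ A))
    eigvals = np.clip(np.linalg.eigvals(M).real, 0.0, 1.0)
    # total measure in [0, K]; individual correlations sorted in descending order
    return eigvals.sum(), np.sqrt(np.sort(eigvals)[::-1])
```

For identical column spaces the first return value equals \(K\); for orthogonal spaces it equals 0.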

We need to impose one additional weak assumption on the residuals:

Assumption 5.

The largest eigenvalue of \(e e^\top\) is \(O_p(\max(N, T))\).

Assumption 5 is very weak and essentially satisfied in any sensible approximate factor model. It is standard and has been imposed in many related papers, e.g. Fan et al. (2013). It is slightly stronger than Assumption 3.3, which implies that the largest eigenvalue of the population residual auto-covariance matrix is bounded. Under suitable additional assumptions on the tail behavior of the residuals, Assumption 3.3 implies that the largest eigenvalue of \(ee^\top\) is \(O_p(\max(N,T))\), which would be sufficient. We denote by \(\tilde\Lambda^\ast\) the matrix of the \(m\) largest factor weights based on the population loadings, i.e. it follows the same definition as \(\tilde\Lambda\) but applied to the rotated population loadings \(\Lambda H\).

Theorem 2.

Under Assumptions 1-5, and if the smallest singular value of \(\tilde\Lambda^{\ast\top}\Lambda\) is bounded away from zero, it holds that

\(\hat\rho_\Lambda \xrightarrow{p} K.\)  (8)

The assumption on the smallest singular value is essentially a full-rank assumption on \(\tilde\Lambda^{\ast\top}\Lambda\). It requires that the sparse set of cross-section units that we use to construct the proximate factors is affected by all factors in a non-redundant way. In the case of only one factor, i.e. \(K = 1\), it is trivially satisfied. In the case of multiple factors, we need to rule out that the largest elements of two loading vectors are identical. This is an assumption on the tail dependency of the loadings. If, for example, the loading vectors are independent of each other and continuously distributed, then this condition is satisfied.

Our notion of consistency does not imply pointwise consistency of the loadings, i.e. a finite number of loading elements \(\check\Lambda_i\) can differ from \(H^\top\Lambda_i\), where \(H\) is an invertible matrix. Our notion of consistency measures the asymptotic difference between vectors whose length goes to infinity, and hence a finite number of elements has a negligible effect. The generalized correlation is the appropriate measure if we intend to use the loadings for projections or in a cross-sectional regression. The strong result in Theorem 2 states that cross-sectional regressions with the loadings of the proxy factors asymptotically yield the same results as using the population loadings up to an invertible matrix \(H\). Note that this theorem has broader implications that go beyond proximate factors. For example, it justifies why an iterative procedure to estimate latent factors leads to a consistent estimator after a few iterations. (Instead of applying PCA, latent factors can also be estimated by an iterative procedure in which, for a set of candidate factors, a first-stage set of loadings is estimated with time-series regressions, which is then used in a second stage to obtain factors from cross-sectional regressions. This procedure is iterated until convergence. For example, Bai and Ng (2017) use a variation of this approach.) Next, we will show that the proximate factors themselves are also very close to the population factors.

3.3 One-Factor Case

We start with the one-factor model and derive two characterizations of the correlation between the population and the proximate factor. The first characterization is based on a counting statistic, while the second one uses extreme value theory. In both cases, we derive analytical expressions for the lower bound. We use the results for comparative statics and as preparation for the more general case.

If the population loadings are i.i.d., we have a closed-form lower bound for the asymptotic exceedance probability of the squared correlation \(\hat\rho = \mathrm{Corr}(\tilde F, F)^2\).

Proposition 1.

Assume Assumptions 1-4 hold and that the true loadings \(\Lambda_i\) are i.i.d. with a cumulative distribution function \(G(\cdot)\) that is continuous for all \(i\). Then for a given threshold \(\bar r \in (0,1)\) and a fixed \(m\), we have

\(\lim_{T \to \infty} P\big(\hat\rho > \bar r\big) \ \ge\ 1 - \sum_{k=0}^{m-1} \binom{N}{k}\,\big(1 - G(v)\big)^{k}\,G(v)^{N-k}\)  (9)

where the right-hand side is the probability that at least \(m\) of the \(N\) loadings exceed a loading threshold \(v\); \(v\) is an explicit increasing function of \(\bar r\) that decreases with the signal-to-noise ratio \(S = \sigma_F^2/\sigma_e^2\) and increases with the error cross-dependence measure \(\rho_e\). For \(m = 1\) and i.i.d. errors, \(v = \sqrt{\bar r / (S(1 - \bar r))}\).

Proposition 2.

Under the assumptions of Proposition 1, if \(G(v) < 1\) for every finite \(v\), i.e. the loadings have unbounded support, then as \(N, T \to \infty\), \(\hat\rho \xrightarrow{p} 1\).

Proposition 2 states that if the loadings have unbounded support, the correlation between the proximate and population factor converges to one. At first glance this seems to be at odds with our previous observation that the idiosyncratic component in a proximate factor cannot be diversified away. The intuition behind this consistency result is based on the growing signal-to-noise ratio. If loadings are sampled with unbounded support independently of the idiosyncratic component, then for growing \(N\) the largest loadings are unbounded and their signal-to-noise ratio explodes. Hence, with high probability the largest loadings do not coincide with large idiosyncratic movements, and the variation of these cross-section units is essentially explained only by the factor. Hence, selecting the cross-section unit with the largest loading is close to picking the factor itself.

Note that, after proper rescaling, loadings with unbounded support can also be interpreted as approximately sparse population loadings. Hence, a sparse estimator is consistent if the true population model is sparse itself. The important contribution of this paper is that we can also characterize the asymptotic properties of the sparse estimator if the population model is not sparse.

Proposition 1 allows us to derive comparative statics. Denote the right-hand side of inequality (9) by \(\bar p\) and the loading threshold by \(v\). Taking the partial derivative of \(\bar p\) with respect to \(v\) shows that \(G(v)\) is increasing in \(v\), so \(\bar p\) is decreasing in \(v\). If \(G(v) = 1\), the support of \(\Lambda_i\) lies below the threshold \(v\) with probability 1, and therefore \(\bar p = 0\). If \(G(v) < 1\), then as \(N \to \infty\), \(\Lambda\) will have at least \(m\) entries greater than \(v\), as stated in Proposition 2. Thus, we have:

  1. The larger \(\bar r\), the larger \(v\) and the smaller the exceedance probability \(\bar p\);

  2. The larger the signal-to-noise ratio \(S\), the smaller \(v\) and the larger \(\bar p\);

  3. The more dispersed the distribution of \(\Lambda_i\), the larger \(\bar p\);

  4. The larger the cross-sectional dependence \(\rho_e\) of the errors, the smaller \(\bar p\);

  5. The number of nonzero elements \(m\) affects \(\bar p\) in two ways: First, in most cases, \(v\) decreases with \(m\), which raises \(\bar p\). Second, a larger \(m\) results in more subtraction terms in \(\bar p\), with the opposite effect, leading to a trade-off. A numerical illustration of the bound follows this list.
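The counting bound is simple to evaluate numerically. The snippet below is our own illustration of the reconstructed form of inequality (9), with standard normal loadings as an assumed example; the names p_bar and v are our notation.

```python
from math import comb
from scipy.stats import norm

def p_bar(v, N, m):
    """Probability that at least m of N i.i.d. standard normal loadings
    exceed the threshold v in absolute value (counting form of the bound)."""
    tail = 2 * norm.sf(v)                       # P(|Lambda_i| > v)
    return 1.0 - sum(comb(N, k) * tail**k * (1 - tail)**(N - k)
                     for k in range(m))

# the bound falls as the threshold v rises and rises with the cross-section N
print(p_bar(2.5, N=100, m=5), p_bar(3.0, N=100, m=5), p_bar(2.5, N=500, m=5))
```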

Another way to obtain a lower bound for \(\bar p\) employs EVT, which relaxes the i.i.d. assumption on \(\Lambda_i\). The lower bound depends only on the distribution of the extreme values of the loadings, which can be modeled by extreme value theory under general conditions (e.g. Leadbetter and Nandagopalan, 1989; Hsing, 1988).

We denote the \(m\)-th largest loading by \(\Lambda_{(m)}\). Under the assumptions of EVT there exist sequences of constants \(a_N > 0\) and \(b_N\) such that \(P\big(\frac{\Lambda_{(1)} - b_N}{a_N} \le z\big) \to G(z)^\theta\), where \(G\) is a generalized extreme value (GEV) distribution. The location, scale, and shape parameters \(\mu\), \(\sigma\), and \(\xi\) of the GEV distribution characterize the tail distribution of i.i.d. random variables with the same marginal distributions as \(\Lambda_i\), and \(\theta \in (0,1]\) is an extremal index measuring the auto-dependence of \(\Lambda_i\) in the tails. The result for the \(m\)-th largest loading (extreme order statistic) in the case of dependent loadings is more complex. In Lemma 1 in the Appendix we provide the limiting distribution for the extreme order statistic of a strictly stationary sequence satisfying the strong mixing condition. (\(\Lambda_i\) is indexed by the cross-section units. We assume that the \(\Lambda_i\) are exchangeable and can be properly reshuffled to satisfy the strong mixing condition. Lemma 1 is adapted from Theorem 3.3 in Hsing (1988). It provides the necessary and sufficient condition such that there exist sequences \(a_N, b_N\) and a function \(G_{(m)}\) with \(P\big(\frac{\Lambda_{(m)} - b_N}{a_N} \le z\big) \to G_{(m)}(z)\).)

Theorem 3.

Suppose Assumptions 1-4 hold and, in addition, the assumptions in Lemma 1 are satisfied, such that \(P\big(\frac{\Lambda_{(m)} - b_N}{a_N} \le z\big) \to G_{(m)}(z)\) for some sequences \(a_N, b_N\) and function \(G_{(m)}\). For a given probability level \(q \in (0,1)\) we have

\(\lim_{N,T \to \infty} P\big(\hat\rho > \bar r_N(q)\big) \ge 1 - q\)  (10)

where the threshold \(\bar r_N(q)\) is constructed from the \(q\)-quantile of \(G_{(m)}\), i.e. from the loading level \(a_N\,G_{(m)}^{-1}(q) + b_N\).

Theorem 3 provides a threshold such that the asymptotic probability of exceeding this threshold is larger than \(1 - q\). This probabilistic lower bound is characterized by the tail distribution of \(\Lambda_i\) and the dependence structure in \(\Lambda_i\). A special case is \(m = 1\), for which \(G_{(1)} = G^\theta\). If the tail distribution of \(\Lambda_i\) follows an extreme value distribution with parameters \((\mu, \sigma, \xi)\) and a nonzero extremal index \(\theta\) (the extremal index can be interpreted as the reciprocal of the limiting mean cluster size; Smith and Weissman (1994), Ancona-Navarrete and Tawn (2000) and others have studied methods to estimate \(\theta\); a common method is the blocks method, which divides the \(N\) observations into approximately \(N/b\) blocks of length \(b\), where \(b = o(N)\), and estimates \(\theta\) as the ratio of the number of blocks in which there is at least one exceedance to the total number of exceedances in all observations), then there exist sequences \(a_N\) and \(b_N\) such that \(P\big(\frac{\Lambda_{(1)} - b_N}{a_N} \le z\big) \to G^\theta(z)\) (see Theorem 5.2 in Coles et al. (2001)). Then we have \(G^\theta(z) = G^\ast(z)\), where \(G^\ast\) is a GEV distribution with adjusted location and scale parameters, or equivalently, dependence in the loadings only shifts and rescales the limiting distribution.

Another special case is that of independent \(\Lambda_i\), which yields \(\theta = 1\). We immediately obtain the following corollary for independent loadings:

Corollary 1.

Assume Assumptions 1-4 hold, the loadings are i.i.d., and the largest absolute loading element satisfies \(P\big(\frac{\Lambda_{(1)} - b_N}{a_N} \le z\big) \to G(z)\) for some sequences of constants \(a_N > 0\) and \(b_N\) and a GEV distribution \(G\). Then, the \(m\)-th largest absolute loading element satisfies

\(P\Big(\frac{\Lambda_{(m)} - b_N}{a_N} \le z\Big) \to G(z) \sum_{k=0}^{m-1} \frac{\big(-\log G(z)\big)^{k}}{k!}.\)

For fixed \(m\) it holds that

\(\lim_{N,T \to \infty} P\big(\hat\rho > \bar r_N(q)\big) \ge 1 - q\)  (11)

with \(\bar r_N(q)\) now constructed from the \(q\)-quantile of \(G_{(m)}(z) = G(z)\sum_{k=0}^{m-1}(-\log G(z))^{k}/k!\).

The sequences \(a_N\) and \(b_N\) determine to which of the Gumbel, Fréchet, and Weibull families the tail distribution of \(\Lambda_{(1)}\) belongs. Here are some examples (a numerical check follows the list):

  1. Gumbel distribution (\(\xi = 0\), \(G(z) = \exp(-e^{-z})\)):

    1. Exponential distribution (\(\Lambda_i \sim \mathrm{Exp}(1)\)): \(a_N = 1\), \(b_N = \log N\).

    2. Standard normal distribution: \(b_N = \Phi^{-1}(1 - 1/N)\) and \(a_N = 1/(N \phi(b_N))\), where \(\phi\) and \(\Phi\) are the pdf and cdf of a standard normal variable.

  2. Fréchet distribution (\(\xi > 0\); e.g. Pareto-type tails \(P(\Lambda_i > z) = z^{-\alpha}\)): \(a_N = N^{1/\alpha}\), \(b_N = 0\).

  3. Weibull distribution (\(\xi < 0\); e.g. \(\Lambda_i\) uniform on \([0,1]\)): \(a_N = 1/N\), \(b_N = 1\).
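As a sanity check of the standard normal example (our own illustration, not from the paper), one can simulate maxima of \(N\) standard normals and compare the normalized distribution with the Gumbel limit:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, n_sim = 2_000, 5_000

b_N = norm.ppf(1 - 1 / N)              # location constant Phi^{-1}(1 - 1/N)
a_N = 1 / (N * norm.pdf(b_N))          # scale constant 1/(N phi(b_N))

maxima = rng.normal(size=(n_sim, N)).max(axis=1)
z = (maxima - b_N) / a_N               # approximately Gumbel for large N

# empirical CDF versus the Gumbel CDF exp(-exp(-z)); convergence is slow
# in the normal case, so the match is only approximate
for q in (-1.0, 0.0, 1.0, 2.0):
    print(q, (z <= q).mean(), np.exp(-np.exp(-q)))
```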

We denote the threshold in Corollary 1 by \(\bar r^\ast\). Given \(a_N\) and \(b_N\), the more dependent the \(\Lambda_i\), the smaller the extremal index \(\theta\) and the smaller \(\bar r^\ast\). Thus, given a threshold \(\bar r\), the probabilistic lower bound decreases with the dependence level of \(\Lambda_i\). Moreover, \(\bar r^\ast\) increases with \(\sigma_F\) and decreases with \(\sigma_e\), which implies that \(\bar r^\ast\) tends to be larger with a larger signal-to-noise ratio \(S\). It is straightforward to verify that \(\bar r^\ast\) is non-decreasing in \(N\) if \(a_N\) and \(b_N\) are among the previously listed examples of extreme value distributions. These findings are aligned with those implied by Proposition 1.

3.4 Multi-Factor Case

The arguments for the one-factor model extend to a model with multiple factors. As our simulations and empirical results illustrate, the simple thresholding method provides proximate factors that explain the non-sparse PCA factors very well. However, formalizing the properties of the lower bound is more challenging. First, we have to work with the generalized correlation instead of the simple correlation between the proximate and population factors. Second, the sparse factor weight vectors are in general not orthogonal to each other, in contrast to the PCA loadings. In order to derive sharp theoretical bounds, we impose additional assumptions. However, these assumptions are only necessary for deriving analytical asymptotic results, not for our estimator to work, as verified with simulated and empirical data.

One of the major problems when comparing two different sets of factors is that a factor model is only identified up to invertible linear transformations. We will again use the generalized correlation to measure how many factors two sets have in common. The generalized correlation between the proximate factors and the population factors is defined as

\(\hat\rho = \mathrm{tr}\Big(\big(F^\top F\big)^{-1}\big(F^\top \tilde F\big)\big(\tilde F^\top \tilde F\big)^{-1}\big(\tilde F^\top F\big)\Big).\)

Here, the generalized correlation \(\hat\rho\), ranging from 0 to \(K\) (the number of factors), measures how close \(\tilde F\) and \(F\) are. If \(\tilde F\) lies in the space spanned by \(F\), then \(\hat\rho = K\). (The individual generalized correlations are the square roots of the eigenvalues of the matrix \((F^\top F)^{-1}(F^\top\tilde F)(\tilde F^\top\tilde F)^{-1}(\tilde F^\top F)\).)

We study two cases: First, the sparse weight vectors are orthogonal to each other, which allows us to directly extend the one-factor results to the multi-factor case; Second, we first find an appropriate rotation of the estimated loadings before thresholding them. We only assume that the sparse rotated factor weights are orthogonal, which is weaker. In our empirical examples we observe that several proximate factors are composed of the same small number of cross-section units but with different weights. In this case it is possible to find a rotation of the factors such that the proximate factors are composed of a disjoint set of cross-sectional units.

For the first case we assume that the sparse factor weights are “non-overlapping”. Formally, we define “non-overlapping” as (\(\mathbb{1}(\cdot)\) denotes an indicator function, which is one if the condition is satisfied and zero otherwise)

\(\sum_{k=1}^{K} \mathbb{1}\big(\tilde\Lambda_{ik} \neq 0\big) \le 1 \quad \text{for every cross-section unit } i,\)

which means that at most one factor weight is nonzero for every cross-section unit in the sparse factor weights. Then, the results from the one-factor model directly generalize to the multi-factor case. In this case the sparse factor weights are orthonormal, i.e. \(\tilde\Lambda^\top\tilde\Lambda = I_K\), similar to the non-sparse \(\hat\Lambda\). The generalized correlation equals the sum of squared correlations between each proximate factor and the corresponding rotated true factor.

We need to impose additional assumptions on the population model to obtain the non-overlapping sparse factor weights.

Assumption 6.

For the rotated population loadings \(\Lambda H\) and a given finite \(m\), we denote the \(m\)-dimensional vector of elements of \((\Lambda H)_k\) with the largest absolute values by \((\Lambda H)_{k,(1:m)}\). We assume that the cumulative distribution function of \((\Lambda H)_{k,(1:m)}\) is continuous and that \((\Lambda H)_{k,(1:m)}\) and \((\Lambda H)_{l,(1:m)}\) are asymptotically independent for \(k \neq l\). Furthermore, for each loading vector \((\Lambda H)_k\), the entry with the \(m\)-th largest absolute value, \((\Lambda H)_{k,(m)}\), satisfies the assumptions in Lemma 1, yielding

\(P\Big(\frac{(\Lambda H)_{k,(m)} - b_N}{a_N} \le z\Big) \to G_{(m)}^{k}(z)\)  (12)

where \(G_{(m)}^{k}\) is defined analogously to (20) in Lemma 1.

Under Assumption 6, the largest rotated population loadings are asymptotically “non-overlapping.” Theorem 1 then implies that the sparse factor weights are also “non-overlapping” with high probability. Furthermore, Assumption 6 implies a joint distribution for the largest elements in \(\Lambda H\):

\(P\Big(\frac{(\Lambda H)_{1,(m)} - b_N}{a_N} \le z_1, \dots, \frac{(\Lambda H)_{K,(m)} - b_N}{a_N} \le z_K\Big) \to \prod_{k=1}^{K} G_{(m)}^{k}(z_k),\)

as the extreme values for different columns of \(\Lambda H\) are independently distributed.

An additional complication in the multi-factor case is the relationship between the sparse and non-sparse eigenvectors. We show that an asymptotic lower bound for the generalized correlation \(\hat\rho\) depends on the smallest singular value \(\sigma_{\min}\) of the normalized matrix \(\tilde\Lambda^\top \Lambda H\). The complication in the multi-factor case arises from the fact that \(\tilde\Lambda^\top \Lambda H\) is in general not a diagonal matrix. In order to illustrate this point, we consider the simple example of \(K = 2\) and \(m = 1\), i.e. a two-factor model where the proximate factors only take the largest loading values. Without loss of generality, we assume the first element is the largest element of the first loading vector \((\Lambda H)_1\) and the second element is the largest element of \((\Lambda H)_2\). Then, we have

\(\tilde\Lambda^\top \Lambda H = \begin{pmatrix} (\Lambda H)_{11} & (\Lambda H)_{12} \\ (\Lambda H)_{21} & (\Lambda H)_{22} \end{pmatrix}.\)

Normalizing this matrix by the largest (diagonal) elements, we obtain

\(\begin{pmatrix} 1 & (\Lambda H)_{12}/(\Lambda H)_{22} \\ (\Lambda H)_{21}/(\Lambda H)_{11} & 1 \end{pmatrix}.\)

The smallest singular value \(\sigma_{\min}\) of this matrix is a measure of how large the off-diagonal elements are. In the special case of a diagonal matrix, the smallest singular value equals 1. Our lower bound will depend on a probabilistic bound for \(\sigma_{\min}\), which depends on the distribution of the loading vectors. For example, in the special case of i.i.d. normally distributed loading elements, a random element of the loading vector is \(O_p(1)\), while the largest element is unbounded in the limit. Hence, the ratio of off-diagonal to diagonal elements is \(o_p(1)\). For this special case, the multi-factor case is a direct extension of the one-factor model, i.e. we can apply the one-factor result to each factor in the multi-factor model individually. Another special case arises when the population loadings have a sparse structure themselves. In both special cases we have \(\sigma_{\min} \xrightarrow{p} 1\). However, in general we need to take the effect of \(\sigma_{\min}\) into account.
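The behavior of \(\sigma_{\min}\) in the normal special case can be illustrated numerically (our own example, with hypothetical variable names): the diagonal entries grow like the maximum of \(N\) normal draws while the off-diagonal entries stay \(O_p(1)\), so the normalized matrix approaches the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000
Lam = rng.normal(size=(N, 2))              # i.i.d. normal population loadings

i1 = np.argmax(np.abs(Lam[:, 0]))          # unit with the largest loading on factor 1
i2 = np.argmax(np.abs(Lam[:, 1]))          # unit with the largest loading on factor 2
M = Lam[[i1, i2], :]                       # 2 x 2 matrix of the selected loadings
M_norm = M / np.diag(M)                    # normalize each column by its diagonal

sigma_min = np.linalg.svd(M_norm, compute_uv=False).min()
print(M_norm, sigma_min)                   # off-diagonals small, sigma_min near 1
```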

Under Assumption 6, we can state the multi-factor counterpart of Theorem 3:

Theorem 4.

Under Assumptions 1-6, the asymptotic lower bound for the generalized correlation \(\hat\rho\) equals

(13)