1 Introduction
“Big data” is becoming increasingly popular for studying problems in economics and finance. Large-dimensional datasets contain rich information and open the possibility of new findings and new efficacious models, which is feasible only if we better understand the datasets. Factor modeling (Bai and Ng, 2002; Bai, 2003; Fan et al., 2013) is a method that summarizes the information in large-dimensional panel data and is an active area of research. In a large-dimensional factor model, both the time dimension and the cross-section dimension of the dataset are large, and most of the co-movement can be explained by a few factors. Latent factors estimated from the data are particularly appealing as the underlying factor structure is usually not known. These factors are usually estimated by Principal Component Analysis (PCA). Latent PCA factors have been used successfully in economics and finance, for example for prediction and forecasting (Stock and Watson, 2002a, b), asset pricing of many assets (Lettau and Pelger, 2018b; Kelly et al., 2018), high-frequency asset return modeling (Aït-Sahalia and Xiu, 2018; Pelger, 2019), conditional risk-return and term structure analysis (Ludvigson and Ng, 2007, 2009) and optimal portfolio construction (Fan et al., 2013). [Footnote 3: Of course, the application of latent factor models goes beyond the applications listed above, e.g. inferring missing values in matrices (Candès and Tao, 2010; Candès et al., 2011).] However, as the latent PCA factors are linear combinations of all cross-sectional units, they are usually hard to interpret. This poses a challenge for modeling and understanding the underlying structure that explains the co-movement in the data. It becomes particularly relevant for applications in economics and finance where the objective is to understand the underlying economic mechanism.
Practitioners and academics alike have used an intuitive approach to interpret latent statistical factors by focusing on the largest factor weights. A pattern in the largest factor weights suggests an economic interpretation (e.g. Lettau and Pelger (2018b); Pelger (2019)). In this paper we formalize this idea and show that factors based only on the largest factor weights already provide an excellent approximation to the population factors. This step further reduces the dimensionality of the problem, which helps to better understand the underlying economic mechanism.
We propose easy-to-interpret proximate factors for latent factors. We exploit the insight that cross-section units with larger factor weights have a larger signal-to-noise ratio and hence provide more information about the underlying factors. Our method consists of four simple steps. First, we estimate the underlying factor structure with conventional Principal Component Analysis (PCA), which returns the weights to construct the latent PCA factors. Second, we set all factor weights to zero except for the largest ones in absolute value. Third, the proximate factors are obtained from a simple regression on the thresholded factor weights. Finally, the loadings are obtained from a regression on the proximate factors. We work under a general scenario where the factor weights and loadings in the true model are not sparse.
We show that one needs to make a clear distinction between factor weights and loadings. In a conventional PCA analysis the loadings serve two purposes: first, they measure the exposure of the cross-sectional data to the factors; second, they correspond to the weights used to construct the latent PCA factors. The same view is, for example, taken in a sparse PCA setup, where the loadings, and hence also the factor weights, are shrunk to a sparse matrix. We show that in an approximate factor model with non-sparse factor weights and loadings, we can construct proximate factors with sparse factor weights that are very close to the population factors. These proximate factors have non-sparse loadings which are consistent estimates of the true population loadings. The proximate factors are considerably easier to interpret as they are based on only a small fraction of the data, while enjoying the same properties as the non-sparse factors. [Footnote 4: A common method to interpret low-dimensional factor models is to find a rotation of the common factors with a meaningful interpretation. This approach uses the insight that factor models are only identified up to an invertible transformation and represent the same model after an appropriate rotation. The criterion proposed by Kaiser (1958) is a popular way to select factors whose factor weights have groups of large and negligible coefficients. However, in large-dimensional factor models with a non-sparse factor weight structure, finding a “good” rotation becomes considerably more challenging. It is generally easier with our sparse proximate factors to find a rotation that has a meaningful interpretation.]
We develop the statistical arguments that explain why the sparse proximate factors can be used as substitutes for the non-sparse PCA factors. The conventional derivations used in large-dimensional factor modeling to prove consistency of estimated factors do not apply to our proximate factors. It turns out that the proximate factors are in general a biased estimate of the true population factors. We can control this bias with Extreme Value Theory (EVT) and show how to construct the proximate factors such that this bias becomes negligible. The closeness between proximate factors and true factors is measured by the generalized correlation. [Footnote 5: Generalized correlation (also called canonical correlation) measures how close two vector spaces are. It has been studied by Anderson (1958) and applied in large-dimensional factor models (Bai and Ng, 2006; Pelger, 2019; Andreou et al., 2017; Pelger and Xiong, 2018).] We provide an asymptotic probabilistic lower bound for the generalized correlation based on EVT. [Footnote 6: In simulations, we verify that the lower bound has good finite-sample properties.] The lower bound is easy to calculate and depends on the tail distribution of the population factor weights and the number of nonzero elements in the sparse factor weights. The lower bound provides guidance on constructing proximate factors that are guaranteed to have a high correlation with the true factors. Moreover, when the factor weights have unbounded support, we show that proximate factors asymptotically span the same space as the latent factors. Importantly, the estimated loadings of the proxy factors converge to the true population loadings up to the usual rotation. This surprising result is due to the two-stage procedure for estimating the loadings, which averages out the idiosyncratic noise. Hence, regressions based on the more interpretable proxy factors will asymptotically yield the same results as using the harder-to-interpret PCA factors.

Our results are of practical and theoretical importance. First, we provide a simple and easy-to-implement method to approximate latent factors by a small number of observations. These sparse proximate factors usually provide a much simpler economic interpretation of the model, either by directly analyzing the sparse composition or after rotating them appropriately. Second, we show in empirical and simulation studies that this approximation works surprisingly well and almost no information is lost by working with the sparse factors. Third, our asymptotic bounds for the correlations between the proximate factors and the population factors provide the theoretical reasoning for why our method works so well.
In particular, it clarifies under which assumptions the proximate factors are a good approximation to, or even converge to, the population factors. As mentioned before, the idea of analyzing the largest factor weights is not new, but we are the first to provide the theoretical arguments for why and when it is a reasonable approach. Fourth, the asymptotic bounds provide guidance on how to select the key tuning parameter of our estimator, i.e. the number of nonzero elements. [Footnote 7: Obviously, the degree of sparsity can be chosen to obtain a sufficiently high generalized correlation with the estimated PCA factors. However, this would be an in-sample choice of the tuning parameter with no guarantee for its out-of-sample performance. Our theoretical bound provides an alternative way to select the tuning parameter based on arguments that should also hold out-of-sample.]
We need to overcome three major challenges when proving the probabilistic lower bound for the generalized correlation. First, we need to show that the estimated loadings converge uniformly to some rotation of the true loadings under the general approximate factor model assumptions. [Footnote 8: We impose assumptions similar to Bai and Ng (2002). In contrast to Fan et al. (2013), we do not assume the loadings to be fixed and bounded, which would restrict the distributions from which the loadings can be sampled.] Uniform consistency of the estimated loadings is necessary for our argument that the cross-section units with the largest estimated loadings have the largest probabilistic “signal” for the underlying factors. Second, the hard-thresholding procedure that sets most loadings to zero to obtain the sparse factor weights has the downside that we lose some of the large-sample properties in the cross-section dimension. We need to take into account the cross-sectional dependence structure in the errors among a few cross-section units, which directly affects the noise level in calculating the generalized correlation. Third, the sparse factor weights, as well as the proximate factors, are in general not orthogonal to one another, in contrast to their non-sparse estimated versions. A proximate factor for a specific population factor can also be correlated with other latent factors, which is reflected in the generalized correlation. Our assumptions and results need to take these complex inter-factor correlations into account.
In two empirical applications, we apply our method to a large number of financial characteristic-sorted portfolios and a large-dimensional macroeconomic dataset. We find that in both datasets, proximate factors based on around 5-10% of the cross-sectional units approximate the non-sparse PCA factors very well, with average correlations of around 97.5%. The proximate factors explain almost the same amount of variation as the non-sparse PCA factors. The sparse factors have an economically meaningful interpretation which would be hard to obtain from the non-sparse representation.
Sparse loadings have already been employed to reduce the composition of factors and to make latent factors more interpretable. Most work formulates the estimation of sparse loadings as a regularized optimization problem that estimates principal components with a soft-thresholding term, such as an $\ell_1$ penalty term or an elastic-net penalty term (Zou et al., 2006; Mairal et al., 2010; Bai and Ng, 2008; Bai and Liao, 2016). [Footnote 9: Choi et al. (2010), Lan et al. (2014), and Kawano et al. (2015) estimate sparse loadings by minimizing the sum of the negative log-likelihood of the data and a soft-thresholding term.] An alternative approach is to take a Bayesian perspective, specify sparse priors for the factor loadings, and use Bayesian updating to obtain posteriors for the sparse loadings (Lucas et al., 2006; Bhattacharya and Dunson, 2011; Kaufmann and Schumacher, 2013; Pati et al., 2014). All these approaches assume that the loadings are sparse in the population model, which allows them to develop an asymptotic inferential theory. Nevertheless, the assumption of sparse population loadings may not be satisfied in many datasets. For example, the exposure to a market factor is universal and non-sparse in equity data. It is important to understand that this line of work typically does not distinguish between factor weights and loadings, i.e. it uses the shrunk loadings both as factor weights and as exposures. In contrast, we estimate only sparse factor weights, but non-sparse loadings. Additionally, we do not assume that the population factor weights are sparse. Of course, it is possible to adjust the sparse PCA method to make this important distinction between factor weights and loadings, i.e. to obtain the loadings from a second-stage regression. However, in contrast to our approach, there is no statistical theory explaining and justifying this approach in the same general setup that we are using.
Furthermore, lasso-type estimators have the well-known shortcoming of producing biased estimates through the way they shrink large elements, and a similar non-optimal shrinkage happens in our case. It turns out that even the modified sparse PCA performs worse than our method in simulations and empirical applications.
Another method to increase the understandability and interpretability of factors is to associate latent factors or factor loadings with observed variables. Some latent factors can be approximated well by observed economic factors, such as Fama-French factors for equity data (Fama and French, 1992) or level, slope, and curvature factors for bond data (Diebold and Li, 2006). [Footnote 10: Our proximate factors also provide a justification for the construction of Fama-French-style factors. Lettau and Pelger (2018b), among others, show that PCA-type factors explain double-sorted portfolio data well. The largest loadings (factor weights) for these PCA factors are exactly the extreme quantiles, and our proximate factors essentially coincide with the long-short Fama-French-type factors.] Fan et al. (2016a) propose robust factor models to exploit the explanatory power of observed proxies for latent factors. Another approach is to model how the factor loadings relate to observable variables. Connor and Linton (2007), Connor et al. (2012), and Fan et al. (2016b) at least partially employ subject-specific covariates to explain factor loadings, such as market capitalization, price-earnings ratios, and other firm characteristics. However, in order to explain latent factors by observed variables, it is necessary to include all the relevant variables, some of which might not be known. Our sparse proximate factors can provide discipline on which assets and covariates to focus.

The rest of the paper is structured as follows. Section 2 introduces the model and the estimator. Section 3 shows the consistency result for the estimated loadings and the asymptotic probabilistic lower bound for the generalized correlation. Section 4 presents simulation results. In Section 5 we apply our approach to financial and macroeconomic datasets. Section 6 concludes the paper. All proofs and additional empirical results are delegated to the Appendix.
2 Model Setup
2.1 Estimator
We assume that a large-dimensional panel data set $X \in \mathbb{R}^{T \times N}$ with $T$ time-series and $N$ cross-sectional observations has a factor structure. There are $r$ common factors and both $N$ and $T$ are large:

(1) $X = F \Lambda^\top + e,$

where $F \in \mathbb{R}^{T \times r}$, $\Lambda \in \mathbb{R}^{N \times r}$, and $e \in \mathbb{R}^{T \times N}$ are common factors, factor loadings, and idiosyncratic components. Factors and loadings are unobserved. [Footnote 11: We assume that we have consistently estimated the number of factors $r$.] A factor model for the panel data can be estimated by Principal Component Analysis (PCA). The PCA factors $\hat F$ and loadings $\hat \Lambda$ minimize a quadratic loss function

(2) $\min_{F, \Lambda} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( X_{it} - F_t^\top \lambda_i \right)^2,$

where $X_{it}$ is the observation of the $i$-th cross-section unit at time $t$, $F_t$ is the factor value at time $t$, and $\lambda_i$ is the factor loading of the $i$-th cross-section unit. Factors and loadings are identified only up to an invertible transformation, as for any invertible $H \in \mathbb{R}^{r \times r}$, $FH$ and $\Lambda (H^{-1})^\top$ also minimize the objective function (2). Under the standard identification assumption that $F^\top F / T = I_r$ is an identity matrix and $\Lambda^\top \Lambda$ is a diagonal matrix, the solution to equation (2) is

(3) $\hat F = \sqrt{T} \cdot \left( \text{eigenvectors of the } r \text{ largest eigenvalues of } X X^\top \right),$
(4) $\hat \Lambda = X^\top \hat F / T,$

where the columns of $\hat F$ are the eigenvectors of the $r$ largest eigenvalues of $X X^\top$ multiplied by $\sqrt{T}$, the corresponding largest eigenvalues are collected in a diagonal matrix in descending order, and $\hat \Lambda$ contains the coefficients from regressing $X$ on $\hat F$. Note that $\hat \Lambda$ serves two purposes here: on the one hand, it provides the cross-sectional weights to construct the estimated factors; on the other hand, it is the estimated exposure to the factors. Bai (2003) shows that in an approximate factor model $\hat F$ and $\hat \Lambda$ are consistent estimators of $F$ and $\Lambda$ up to an invertible transformation.

The general approximate factor model framework assumes that the loadings $\Lambda$, and thus their consistent estimator $\hat \Lambda$, are not sparse, which is a necessary assumption to allow for cross-sectional correlation in the idiosyncratic component. As the estimated factors are linear combinations of the cross-section units weighted by a non-sparse $\hat \Lambda$, they are composed of almost all cross-section units and are therefore hard to interpret.
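The PCA estimation step described above can be sketched in numpy. This is a minimal illustration, not the paper's code; the function name and the simulated noiseless one-factor panel are mine.

```python
import numpy as np

def pca_factors(X, r):
    """PCA factor estimation as described in the text:
    Fhat = sqrt(T) * eigenvectors of X X' for the r largest eigenvalues,
    so that Fhat'Fhat/T = I_r; Lhat = X'Fhat/T are the regression
    coefficients of X on Fhat."""
    T, N = X.shape
    eigval, eigvec = np.linalg.eigh(X @ X.T)       # eigenvalues in ascending order
    Fhat = np.sqrt(T) * eigvec[:, -r:][:, ::-1]    # r largest, descending order
    Lhat = X.T @ Fhat / T
    return Fhat, Lhat

# Usage: a noiseless rank-one panel is recovered exactly (up to sign)
rng = np.random.default_rng(0)
T, N = 100, 50
F = rng.standard_normal((T, 1))
Lam = rng.standard_normal((N, 1))
X = F @ Lam.T
Fhat, Lhat = pca_factors(X, 1)
# Fhat @ Lhat.T reproduces X, and Fhat'Fhat/T = I_1
```

Since the data here are exactly rank one, the projection of $X$ onto the estimated factor reproduces $X$, which makes the normalization easy to verify.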
We propose a method to estimate proximate factors that are sparse and hence more interpretable. The method is based on the following steps:

Sparse factor weights: $\hat \Lambda$ are the standard PCA estimates of the loadings. The sparse factor weights $\tilde W$ keep only the $m$ largest elements of $\hat \Lambda$ and are obtained as follows: We shrink the smallest entries in absolute value in each estimated loading vector to 0 and keep only the $m$ largest elements to get the sparse weight vector for each factor $k = 1, \dots, r$. We standardize each sparse weight vector to have length one. [Footnote 12: Formally, denote by $M \in \{0, 1\}^{N \times r}$ a mask matrix indicating which factor weights are set to zero. Each column of $M$ has $m$ ones and $N - m$ zeros. Up to the normalization to unit length, the sparse factor weights can be written as

(5) $\tilde W_k = M_k \odot \hat \lambda_k, \quad k = 1, \dots, r,$

where $\hat \lambda_k$ is the $k$-th estimated loading vector, the vector $M_k$ has the element 1 at the positions of the $m$ largest absolute loadings of $\hat \lambda_k$ and zero otherwise, and $\odot$ denotes the Hadamard product for element-by-element multiplication of matrices.]
Proximate factors: We regress $X$ on $\tilde W$ to obtain the proximate factors $\tilde F$:

(6) $\tilde F = X \tilde W \left( \tilde W^\top \tilde W \right)^{-1}.$
Loadings of proximate factors: We regress $X$ on $\tilde F$ to obtain the loadings $\tilde \Lambda$ of the proximate factors:

(7) $\tilde \Lambda = X^\top \tilde F \left( \tilde F^\top \tilde F \right)^{-1}.$
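The estimation steps above can be sketched end-to-end in numpy. This is a minimal sketch under simplifying assumptions (the function name, the sparsity parameter `m`, and the simulated data are mine, not the paper's):

```python
import numpy as np

def proximate_factors(X, r, m):
    """Sparse proximate factors as described in the text:
    (1) PCA loadings, (2) hard-threshold each loading vector to its m
    largest absolute entries and normalize to length one, (3) regress X
    on the sparse weights to get the proximate factors, (4) regress X on
    the proximate factors to get the (non-sparse) loadings."""
    T, N = X.shape
    eigval, eigvec = np.linalg.eigh(X @ X.T)
    Fhat = np.sqrt(T) * eigvec[:, -r:][:, ::-1]     # PCA factors
    Lhat = X.T @ Fhat / T                           # PCA loadings (N x r)

    W = np.zeros_like(Lhat)
    for k in range(r):
        keep = np.argsort(np.abs(Lhat[:, k]))[-m:]  # indices of m largest |loadings|
        W[keep, k] = Lhat[keep, k]
        W[:, k] /= np.linalg.norm(W[:, k])          # unit-length sparse weights
    Ftilde = X @ W @ np.linalg.inv(W.T @ W)         # regression of X on W
    Ltilde = X.T @ Ftilde @ np.linalg.inv(Ftilde.T @ Ftilde)
    return Ftilde, Ltilde, W

# Usage: a one-factor panel with noise; the proximate factor built from
# only m = 10 of N = 200 units tracks the true factor closely
rng = np.random.default_rng(0)
T, N = 300, 200
F = rng.standard_normal((T, 1))
Lam = rng.standard_normal((N, 1))
X = F @ Lam.T + 0.5 * rng.standard_normal((T, N))
Ftilde, Ltilde, W = proximate_factors(X, r=1, m=10)
corr = abs(np.corrcoef(Ftilde[:, 0], F[:, 0])[0, 1])
```

The key design choice mirrors the text: sparsity enters only through the factor weights `W`, while the loadings `Ltilde` come from a second regression and remain non-sparse.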
Proximate factors approximate latent factors well, as measured by the generalized correlation. The generalized correlation equals the correlation between appropriately rotated proximate and population factors, as defined in the next section. It is natural to measure the distance between two factors by their correlation: if two factors are perfectly correlated, they explain the same variation in the data and provide the same results in linear regressions. [Footnote 13: Obviously, perfectly correlated factors do not necessarily have the same mean. However, in our empirical study the first and second moment properties of the proximate factors coincide with those of the non-sparse PCA factors.]
We illustrate the intuition in a one-factor model. In this case, the generalized correlation is equal to the squared correlation between $\tilde F$ and $F$. For simplicity, assume that the factors and idiosyncratic components are i.i.d. over time with variances $\sigma_F^2$ and $\sigma_e^2$. Furthermore, our proximate factor consists of only one cross-sectional observation, i.e. $m = 1$. Without loss of generality, the nonzero entry in $\tilde W$ is the first one, which is normalized to one, so that $\tilde F_t = X_{t1} = \lambda_1 F_t + e_{t1}$. The squared correlation between $\tilde F$ and $F$ equals

$\mathrm{Corr}(\tilde F, F)^2 = \frac{\lambda_1^2 \sigma_F^2}{\lambda_1^2 \sigma_F^2 + \sigma_e^2} = \frac{\lambda_1^2 S}{\lambda_1^2 S + 1},$

where $S = \sigma_F^2 / \sigma_e^2$ denotes the signal-to-noise ratio. The second equation follows from dividing the numerator and denominator by $\sigma_e^2$. The correlation increases with the size of $\lambda_1$ and the signal-to-noise ratio $S$. If the largest population loading is sufficiently large, the correlation will be close to one. In the rest of the paper, we formalize this idea under a general setup.
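This one-factor intuition can be checked with a quick Monte Carlo simulation. The parameter values below are arbitrary illustrative choices of mine, not from the paper:

```python
import numpy as np

# One-factor illustration: the proximate factor is the single series
# X_{t1} = lam1 * F_t + e_{t1}, so its squared correlation with F should
# equal lam1^2 * sF^2 / (lam1^2 * sF^2 + se^2).
rng = np.random.default_rng(1)
T, lam1, sF, se = 200_000, 2.0, 1.0, 1.0      # arbitrary illustrative values
F = sF * rng.standard_normal(T)
e = se * rng.standard_normal(T)
F_tilde = lam1 * F + e                         # the selected cross-section unit
emp = np.corrcoef(F_tilde, F)[0, 1] ** 2
theory = lam1**2 * sF**2 / (lam1**2 * sF**2 + se**2)   # = 0.8 for these values
# emp ≈ theory
```

With a large $T$ the empirical squared correlation matches the closed-form value closely, illustrating that a single well-chosen unit can carry most of the factor's signal.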
2.2 Assumptions
We impose several assumptions that are close to, but slightly stronger than, those in Bai and Ng (2002). In order to show that $\tilde F$ is “close” to $F$ as measured by the generalized correlation, $\hat \Lambda$ needs to be a uniformly consistent estimator of $\Lambda$ up to some invertible transformation. Furthermore, the largest elements in $\hat \Lambda$ have to almost coincide with the largest elements in $\Lambda$, which requires uniform consistency. The following assumptions are necessary for all theorems in this paper. We assume that there exists a positive constant $M$ that can be used in all the assumptions.
Assumption 1.
Factors: $E \| F_t \|^4 \le M$ and $\frac{1}{T} \sum_{t=1}^{T} F_t F_t^\top \xrightarrow{p} \Sigma_F$ for some positive definite matrix $\Sigma_F$.
Assumption 2.
Factor loadings: $E \| \lambda_i \|^4 \le M$ and $\frac{1}{N} \Lambda^\top \Lambda \xrightarrow{p} \Sigma_\Lambda$ for some positive definite matrix $\Sigma_\Lambda$. Loadings are independent of factors and errors.
Assumption 3.
Time and CrossSection Dependence and Heteroskedasticity: Denote , . Then for all and ,

, ;

is stationary, is the covariance matrix of , ;

, for some , ;

.
Assumption 4.
For all , .
These are standard assumptions for the general approximate factor model. [Footnote 14: Assumption 1 about the population factors is the same as in Bai and Ng (2002). Assumption 2 allows the loadings to be random. Since the loadings are independent of factors and errors, all results in Bai and Ng (2002) hold. Assumption 3.1 imposes moment conditions on the errors, which is the same as Assumption C.1 in Bai and Ng (2002). This assumption implies that is bounded. Assumption 3.2 is close to Assumptions 3.2 (i) and (ii) in Fan et al. (2013). This assumption restricts the cross-sectional dependence of the errors and is standard in the literature on approximate factor models. Since is symmetric, is equivalent to . implies that . Together with being the same for all from the stationarity of , this assumption implies Assumption C.3 in Bai and Ng (2002). Assumption 3.3 allows for weak time-series dependence of the errors, which is slightly stronger than Assumption C.2 in Bai and Ng (2002). Assumption 3.6 is the time-average counterpart of Assumption C.5 in Bai and Ng (2002). Assumption 4 implies Assumption D in Bai and Ng (2002). The fourth-moment conditions in Assumptions 3.6 and 4, together with Boole's inequality (the union bound), are used to show the uniform convergence of the loadings without assuming boundedness of the loadings. Since Assumptions 1-4 imply Assumptions A-D in Bai and Ng (2002), all results in Bai and Ng (2002) hold.]
Since each proximate factor is a linear combination of cross-section units, the estimation error of the proximate factor is determined by the errors of these cross-section units. We use to denote the maximum of the square root of the total pairwise squared correlations among cross-section units. Note that if is i.i.d., . If the errors of the cross-section units are perfectly dependent, . In almost all cases, .
3 Theoretical Results
3.1 Consistency
Bai and Ng (2002) and Bai (2003) show that factors and loadings can be estimated consistently with PCA under Assumptions 1-4 when . More precisely, there exists an invertible , such that and . However, in general, consistency does not hold for proximate factors for the following reason: and where the are the nonzero elements in , i.e. the sums are taken only over the nonzero weight entries. Even in the special case when the idiosyncratic components at time are i.i.d. with variance , the variance of the th entry in is , which does not vanish as $N, T \to \infty$. The average of the idiosyncratic components over the non-sparse loadings leads to a law of large numbers that diversifies away the idiosyncratic components. For proximate factors this average is taken over only a finite number of observations, resulting in the loss of diversification.
Although proximate factors are in general inconsistent, they can still be very close to the population factors. We will study the correlation (respectively, the generalized correlation) between the population and proximate factors. Our results depend on the following theorem, which shows the uniform consistency of the estimated loadings.
Theorem 1 states that the maximum difference between the estimated loading and some rotation of the true loading over all cross-section units converges to 0 at a specific rate. Compared to , the uniform convergence rate of is slower if . [Footnote 15: Bai (2003) showed in Proposition 2 that . Fan et al. (2013) showed in Theorem 4 that under a boundedness assumption on the loadings and . Our results relax the boundedness assumption on the loadings and show a uniform consistency rate of the loadings that is the counterpart of that of the factors in Fan et al. (2013), but is slower than that of the loadings in Fan et al. (2013), as expected.] Theorem 1 states that a large implies a large . As a result, the th largest values of are close to the th largest , which is formally stated in Lemma 3 and proven in the Appendix. Hence, we can derive the distribution of the correlation based on the largest population loadings instead of the estimated loadings.
3.2 Loadings of Proximate Factors
The estimated loadings of the proximate factors asymptotically span the same space as the population loadings and hence will lead to the same regression or projection results. Hence, up to an invertible transformation, the loadings of the proximate factors are consistent. [Footnote 16: Note that our notion of consistency does not imply pointwise consistency, i.e. for a finite number of cross-section units the loadings can be distinct.] This result actually does not depend on the degree of sparsity. The key element is that the loadings are obtained in a second-stage regression that diversifies away the idiosyncratic noise. Hence, it is important to distinguish between loadings and factor weights, as sparse loadings are in general not consistent estimators of the population loadings.
One of the major problems when comparing two different sets of loadings is that a factor model is only identified up to invertible linear transformations. Two sets of loadings represent the same model if the loadings span the same vector space. As proposed by Bai and Ng (2006), the generalized correlation is a natural candidate measure to describe how close two vector spaces are to each other. Intuitively, we calculate the correlation between the loadings of the proximate and the population factors after rotating them appropriately. The generalized correlation measures how many loading vectors two sets have in common. The generalized correlation between the loadings of proximate and population factors is defined as

$\rho = \mathrm{tr}\left( \left( \Lambda^\top \Lambda \right)^{-1} \Lambda^\top \tilde \Lambda \left( \tilde \Lambda^\top \tilde \Lambda \right)^{-1} \tilde \Lambda^\top \Lambda \right).$

Here, the generalized correlation $\rho$, ranging from 0 to $r$ (the number of factors), measures how close $\tilde \Lambda$ and $\Lambda$ are. If $\tilde \Lambda$ lies in the space spanned by $\Lambda$, then $\rho = r$. Otherwise, if the space spanned by $\tilde \Lambda$ is orthogonal to the space spanned by $\Lambda$, then $\rho = 0$. [Footnote 17: Our generalized correlation measure is the sum of the squared individual generalized correlations. The first individual generalized correlation is the highest correlation that can be achieved through a linear combination of the proximate loadings $\tilde \Lambda$ and the population loadings $\Lambda$. For the second generalized correlation, we first project out the subspace that spans the linear combination for the first generalized correlation and then determine the highest possible correlation that can be achieved through linear combinations of the remaining subspaces. This procedure continues until we have calculated all individual generalized correlations. Mathematically, the individual generalized correlations are the square roots of the eigenvalues of the matrix $\left( \Lambda^\top \Lambda \right)^{-1} \Lambda^\top \tilde \Lambda \left( \tilde \Lambda^\top \tilde \Lambda \right)^{-1} \tilde \Lambda^\top \Lambda$.]
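The generalized correlation between two sets of loadings can be computed directly. This is a sketch (the function name is mine); it verifies the two limiting cases described above:

```python
import numpy as np

def generalized_correlation(L1, L2):
    """Sum of the squared individual generalized (canonical) correlations
    between the column spaces of L1 and L2 (both N x r):
    trace of (L1'L1)^{-1} L1'L2 (L2'L2)^{-1} L2'L1.
    Equals r for identical spans and 0 for orthogonal spans."""
    A = np.linalg.solve(L1.T @ L1, L1.T @ L2)
    B = np.linalg.solve(L2.T @ L2, L2.T @ L1)
    return float(np.trace(A @ B))

# Usage: an invertible rotation of the loadings leaves the span unchanged,
# so the generalized correlation equals r = 2; orthogonal spans give 0
rng = np.random.default_rng(0)
L1 = rng.standard_normal((50, 2))
H = np.array([[2.0, 1.0], [0.0, 1.0]])      # invertible transformation
same_span = generalized_correlation(L1, L1 @ H)
E = np.eye(4)
orthogonal = generalized_correlation(E[:, :2], E[:, 2:])
```

The rotation-invariance shown here is exactly why this measure is suitable for comparing factor models that are only identified up to invertible transformations.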
We need to impose one additional weak assumption on the residuals:
Assumption 5.
The largest eigenvalue of is .
Assumption 5 is very weak and essentially satisfied in any sensible approximate factor model. It is standard and has been imposed in many related papers, e.g. Fan et al. (2013). It is slightly stronger than Assumption 3.3, which implies that the largest eigenvalue of the population residual autocovariance matrix is bounded. Under suitable additional assumptions on the tail behavior of the residuals, Assumption 3.3 implies that is , which would be sufficient. We denote by the matrix of the largest factor weights based on the population loadings, i.e. it follows the same definition as but applied to the population loadings .
The assumption that is essentially a full-rank assumption on . It requires that the sparse set of cross-section units that we use to construct the proximate factors is affected by all factors in a non-redundant way. In the case of only one factor, i.e. $r = 1$, it is trivially satisfied. In the case of multiple factors, we need to rule out that the largest elements of two loading vectors are identical. This is an assumption on the tail dependency of the loadings. If, for example, the loading vectors are independent and , then this condition is satisfied.
Our notion of consistency does not imply pointwise consistency of the loadings, i.e. for a finite number of loading elements, can be different from , where is an invertible matrix. Our notion of consistency measures the asymptotic difference between vectors whose length goes to infinity, and hence a finite number of elements has a negligible effect. The generalized correlation measure is the appropriate measure if we intend to use the loadings for projections or in a cross-sectional regression. The strong result in Theorem 2 states that cross-sectional regressions with the loadings of proxy factors yield the same results as using the population loadings up to an invertible matrix. Note that this theorem has broader implications that go beyond proximate factors. For example, it justifies why an iterative procedure to estimate latent factors leads to a consistent estimator after a few iterations. [Footnote 18: Instead of applying PCA, latent factors can also be estimated by an iterative procedure where, for a set of candidate factors, a first-stage set of loadings is estimated with a time-series regression, which is then used in a second stage to obtain factors from a cross-sectional regression. This procedure is iterated until convergence. For example, Bai and Ng (2017) use a variation of this approach.] Next, we will show that the proximate factors themselves are also very close to the population factors.

3.3 One-Factor Case
We start with the one-factor model and derive two characterizations for the correlation between the population and the proximate factor. The first characterization is based on a counting statistic, while the second one uses extreme value theory. In both cases, we derive analytical solutions for the lower bound. We use these results for comparative statics and to prepare for the more general case.
If the population loadings are i.i.d., we have a closed-form lower bound for the asymptotic exceedance probability of the squared correlation.
Proposition 1.
Proposition 2.
In Proposition 1, if , then as ,
Proposition 2
states that if the loadings have unbounded support, the correlation between the proximate and population factor converges to one. At first glance, this seems to be at odds with our previous observation that the idiosyncratic component in a proximate factor cannot be diversified away. The intuition behind this consistency result is based on the growing signal-to-noise ratio. If the loadings are sampled from an unbounded support independently of the idiosyncratic component, then for growing $N$ the largest loadings are unbounded and their signal-to-noise ratio explodes. Hence, with high probability, the largest loadings do not coincide with large idiosyncratic movements, and the variation of these cross-section units is essentially explained only by the factor. Hence, selecting the cross-section unit with the largest loading is close to picking the factor itself.

Note that after proper rescaling, loadings with unbounded support can also be interpreted as approximately sparse population loadings. Hence, a sparse estimator is consistent if the true population model is sparse itself. The important contribution of this paper is that we can also characterize the asymptotic properties of the sparse estimator if the population model is not sparse.
Proposition 1 allows us to derive comparative statics. Denote the right-hand side of inequality (9) as and . We take the partial derivative of with respect to ,
 is increasing in , so is decreasing in . If , it implies that the support of is smaller than the threshold with probability 1, therefore . If , it implies that as , will have at least entries greater than , as stated in Proposition 2. Thus, we have:

The larger , the larger and the smaller the exceedance probability ;

The larger the signaltonoise ratio , the smaller and the larger ;

The more dispersed the distribution of , the larger ;

The larger the crosssection dependence of errors , the smaller ;

The number of nonzero elements affects the bound in two ways: first, in most cases, decreases with , which raises . Second, a larger results in more subtraction terms in , with the opposite effect, leading to a trade-off.
Another perspective that provides a lower bound for employs EVT, which relaxes the i.i.d. assumption on . The lower bound depends only on the distribution of the extreme values of the loadings, which can be modeled by extreme value theory under general conditions (e.g. Leadbetter and Nandagopalan, 1989; Hsing, 1988).
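For reference, the limiting family that EVT delivers in this setting is the generalized extreme value (GEV) distribution. A standard statement (a sketch using generic normalizing sequences $a_N, b_N$ for the loadings, which are notational assumptions of this sketch rather than the paper's exact notation) is

```latex
\Pr\left(\frac{\max_{i \le N} \lambda_i - b_N}{a_N} \le z\right)
\;\longrightarrow\;
G(z) = \exp\left\{ -\left[\, 1 + \xi \left(\frac{z - \mu}{\sigma}\right) \right]_{+}^{-1/\xi} \right\},
```

where $\mu$, $\sigma > 0$, and $\xi$ are the location, scale, and shape parameters, and $\xi \to 0$ gives the Gumbel limit $G(z) = \exp\{-e^{-(z - \mu)/\sigma}\}$. For a stationary dependent sequence with extremal index $\theta \in (0, 1]$, the limit is $G(z)^{\theta}$ instead.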
We denote the largest by . Under the assumptions of EVT there exist sequences of constants , such that , where , , .
are the parameters of the GEV distribution characterizing the tail distribution of i.i.d. random variables with the same marginal distributions as ,
and is an extremal index measuring the auto-dependence of in the tails. The result for the th-largest loading (extreme order statistic) in the case of dependent loadings is more complex. In Lemma 1 in the Appendix we provide the limiting distribution for the extreme order statistic of a strictly stationary sequence satisfying the strong mixing condition.^{19} is indexed by the cross-section units. We assume that are exchangeable and can be properly reshuffled to satisfy the strong mixing condition. Lemma 1 is adapted from Theorem 3.3 in Hsing (1988). It provides the necessary and sufficient condition such that there exists a sequence and a function with .
Theorem 3.
Theorem 3 provides a threshold such that the asymptotic probability of exceeding this threshold is larger than . This probabilistic lower bound is characterized by the tail distribution of and the dependence structure in . A special case is , for which . If the tail distribution of follows an extreme value distribution with parameters and a nonzero extremal index ,^{20}The extremal index can be interpreted as the reciprocal of the limiting mean cluster size. Smith and Weissman (1994), Ancona-Navarrete and Tawn (2000) and others have studied methods to estimate . A common method is the blocks method, which divides the observations into approximately blocks of length , where . The estimated is the ratio of the number of blocks in which there is at least one exceedance to the total number of exceedances in all observations. then there exist sequences and such that with and ^{21}See Theorem 5.2 in Coles et al. (2001). Then, we have or equivalently .
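The blocks method described in the footnote can be sketched as follows (our own implementation and variable names, not the paper's code): split the sample into blocks and divide the number of blocks containing at least one exceedance of a high threshold by the total number of exceedances.

```python
import numpy as np

# Sketch of the blocks estimator of the extremal index (our own code):
# theta_hat = (# blocks with at least one exceedance of u)
#             / (total # exceedances of u in the sample).
def extremal_index_blocks(x, u, block_len):
    n_blocks = len(x) // block_len
    exceed = x[: n_blocks * block_len] > u
    blocks_with_exceedance = exceed.reshape(n_blocks, block_len).any(axis=1).sum()
    total_exceedances = exceed.sum()
    return blocks_with_exceedance / total_exceedances

# For i.i.d. data the extremal index is 1 (no clustering in the tails),
# so for a sufficiently high threshold the estimate should be close to one.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
theta_hat = extremal_index_blocks(x, u=np.quantile(x, 0.999), block_len=50)
```

For serially dependent data whose extremes cluster, exceedances concentrate in fewer blocks and the estimate falls below one.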
Another special case is being independent, which yields . We immediately obtain the following corollary for independent loadings:
Corollary 1.
The sequences and determine to which of the Gumbel, Fréchet and Weibull families the tail distribution of belongs. Here are some examples:
- Gumbel distribution:
  - Exponential distribution (): .
- Fréchet distribution (): .
- Weibull distribution (): .
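As a numerical sanity check of the exponential example (our own sketch, not part of the paper): for i.i.d. Exp(1) variables the normalizing sequences are a unit scale and a log(n) location shift, and the normalized maximum is approximately standard Gumbel, whose mean is the Euler-Mascheroni constant (about 0.5772) and whose variance is pi^2/6 (about 1.645).

```python
import numpy as np

# Numerical check of the exponential example (our own sketch): for i.i.d.
# Exp(1) variables the maximum M_n satisfies M_n - log(n) -> Gumbel, i.e.
# the normalizing sequences are a_n = 1 and b_n = log(n). The standard
# Gumbel has mean ~0.5772 (Euler-Mascheroni) and variance pi^2/6 ~ 1.645.
rng = np.random.default_rng(0)
n, reps = 1000, 5000
maxima = rng.exponential(size=(reps, n)).max(axis=1) - np.log(n)
mean_hat, var_hat = maxima.mean(), maxima.var()
```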
We denote the threshold in Corollary 1 by . Given and , the more dependent the , the smaller the extremal index and the smaller . Thus, given a threshold , the probabilistic lower bound decreases with the dependence level of . Moreover, increases with and decreases with , which implies that tends to be larger with a larger signal-to-noise ratio . It is straightforward to verify that is non-decreasing in if and are among the previously listed examples of extreme value distributions. These findings align with those implied by Proposition 1.
3.4 Multi-Factor Case
The arguments for the one-factor model extend to a model with multiple factors. As our simulations and empirical results illustrate, the simple thresholding method provides proximate factors that explain the non-sparse PCA factors very well. However, formalizing the properties of the lower bound is more challenging. First, we have to work with the generalized correlation instead of the simple correlation between the proximate and population factors. Second, the sparse factor weight vectors are in general not orthogonal to each other, in contrast to the PCA loadings. In order to derive sharp theoretical bounds, we impose additional assumptions. These assumptions are only necessary for deriving analytical asymptotic results, not for our estimator to work, as verified with simulated and empirical data.
One of the major problems when comparing two different sets of factors is that a factor model is only identified up to invertible linear transformations. We will again use the generalized correlation to measure how many factors two sets have in common. The generalized correlation between the proximate factors and the population factors is defined as
Here, the generalized correlation , ranging from 0 to the number of factors, measures how close and are. If lies in the space spanned by , then .^{22}The individual generalized correlations are the square roots of the eigenvalues of the matrix .
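The generalized correlation can be computed as follows (our own sketch; the function and variable names are assumptions). It is the trace of the matrix referenced in the footnote, whose eigenvalues' square roots give the individual generalized correlations.

```python
import numpy as np

# Sketch of the generalized correlation between two factor sets F and G
# (T x K matrices): the trace of (F'F)^{-1} F'G (G'G)^{-1} G'F. It ranges
# from 0 to K and equals K when G spans the same space as F.
def generalized_correlation(F, G):
    proj_F = np.linalg.inv(F.T @ F) @ F.T @ G
    proj_G = np.linalg.inv(G.T @ G) @ G.T @ F
    return np.trace(proj_F @ proj_G)

rng = np.random.default_rng(0)
T, K = 500, 2
F = rng.standard_normal((T, K))
G_rotated = F @ np.array([[2.0, 1.0], [0.0, 1.0]])  # invertible transform of F
G_unrelated = rng.standard_normal((T, K))
```

The measure is invariant to invertible linear transformations of either factor set, which is exactly what the identification problem requires: `G_rotated` attains the maximal value, while `G_unrelated` yields a value near zero.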
We study two cases. First, the sparse weight vectors are orthogonal to each other, which allows us to directly extend the one-factor results to the multi-factor case. Second, we first find an appropriate rotation of the estimated loadings before thresholding them and only assume that the sparse rotated factor weights are orthogonal, which is weaker. In our empirical examples we observe that several proximate factors are composed of the same small number of cross-section units but with different weights. In this case it is possible to find a rotation of the factors such that the proximate factors are composed of disjoint sets of cross-sectional units.
For the first case we assume that the sparse factor weights are "non-overlapping". Formally, we define "non-overlapping" as^{23} denotes an indicator function, which is one if the condition is satisfied and zero otherwise.
which means that at most one factor weight is nonzero for every cross-section unit in the sparse factor weights. Then, the results from the one-factor model directly generalize to the multi-factor case. In this case the sparse factor weights are orthonormal, i.e. , similar to the non-sparse . The generalized correlation equals the sum of squared correlations between each proximate factor and the corresponding rotated true factor.
We need to impose additional assumptions on the population model to obtain non-overlapping sparse factor weights.
Assumption 6.
For the rotated population loadings and a given finite , we denote the -dimensional vector of the elements in with the largest absolute values as . We assume that the cumulative distribution function of is continuous and that and are asymptotically independent for . Furthermore, for each loading vector , the entry with the th-largest absolute value, , satisfies the assumptions in Lemma 1, yielding
(12) 
Under Assumption 6, the largest rotated population loadings are asymptotically "non-overlapping". Then Theorem 1 implies that the sparse factor weights are also "non-overlapping" with high probability. Furthermore, Assumption 6 implies a joint distribution for the largest elements in , as the extreme values for different columns of are independently distributed.
An additional complication in the multi-factor case is the relationship between the sparse and non-sparse eigenvectors. We show that an asymptotic lower bound for is . The complication in the multi-factor case arises from the fact that is in general not a diagonal matrix. To illustrate this point, we consider the simple example of and , i.e. a two-factor model where the proximate factors take only the largest loading values. Without loss of generality we assume that the first element is the largest element of the first loading vector and the second element is the largest element of . Then, we have
Normalizing this matrix by the largest diagonal elements we obtain
The smallest singular value
of the matrix is a measure of how large the off-diagonal elements are. In the special case of a diagonal matrix , the smallest singular value equals 1. Our lower bound will depend on a probabilistic bound for , which depends on the distribution of the loading vectors. For example, in the special case of i.i.d. normally distributed loading elements, a random element of the loading vector is , while the largest element is unbounded in the limit. Hence, the ratio is . For this special case the multi-factor case is a direct extension of the one-factor model, i.e. we can apply the one-factor result to each factor in the multi-factor model individually. Another special case is when the population loadings themselves have a sparse structure. In both special cases we have . However, in general we need to take the effect of into account.
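The normal special case can be illustrated numerically (our own sketch under the i.i.d. Gaussian loading assumption): the largest of N standard normal loadings grows like sqrt(2 log N) while a typical loading is O(1), so the off-diagonal entries of the normalized 2x2 matrix shrink and its smallest singular value drifts toward 1 as N grows.

```python
import numpy as np

# Our own illustration of the normal special case: with i.i.d. N(0,1)
# loadings, the off-diagonal entries of the normalized 2x2 matrix are
# ratios of a typical loading to the largest loading, which shrink like
# 1/sqrt(2 log N), so the smallest singular value approaches 1.
rng = np.random.default_rng(0)

def sigma_min(N):
    lam1, lam2 = rng.standard_normal(N), rng.standard_normal(N)
    i = np.argmax(np.abs(lam1))   # unit with the largest loading on factor 1
    j = np.argmax(np.abs(lam2))   # unit with the largest loading on factor 2
    # loadings of the two selected units, normalized to a unit diagonal
    # (for large N the two selected units differ with high probability)
    M = np.array([[1.0, lam2[i] / lam2[j]],
                  [lam1[j] / lam1[i], 1.0]])
    return np.linalg.svd(M, compute_uv=False).min()

def avg_sigma_min(N, reps=500):
    return np.mean([sigma_min(N) for _ in range(reps)])
```

Averaging over many draws, the smallest singular value is closer to 1 for larger cross-sections, consistent with the off-diagonal ratio vanishing in the limit.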