Relationships between the components of a random vectorare of prime interest in many fields where statistical methods are used. Traditionally, this dependence is summarized through a correlation matrix. When is multivariate Normal, the classical choice is the linear correlation matrix. When multivariate Normality fails, as is frequent, e.g., in risk management, linear correlation can be grossly misleading and may not even exist (Embrechts et al., 2002). For this reason, it is safer to use a rank correlation matrix such as the matrix of pair-wise Kendall’s taus or Spearman’s rhos.
In high dimensions, empirical correlation matrices typically conceal underlying dependence patterns. This is due to their sheer size and to the inherent imprecision of the estimates, especially when the sample size is small compared to dimension . For example, consider the log-returns of stocks included in the NASDAQ100 index from January 1 to September 30, 2017, giving observations. Hardly any pattern is visible in the left panel of Figure 1, which shows the empirical Kendall rank correlation matrix based on residuals from a fitted stochastic volatility model.
Noisiness of sample correlation matrices is well documented. Several strategies have been proposed to remedy for it, most notably shrinkage (Ledoit and Wolf, 2004; Schäfer and Strimmer, 2005). Alternative procedures developed in the context of graphical models consist of decomposing a noisy inverse covariance matrix into a low-rank matrix and a sparse matrix (Chandrasekaran et al., 2012; Ma et al., 2013; Agarwal et al., 2012).
We follow a different path in this article. Motivated by the above NASDAQ example and many others, we focus on applications in which it makes sense to assume that the correlation matrix has a block structure. By this we mean that the variables can be grouped into disjoint clusters in such a way that for any two clusters and , and any and , the correlation between and satisfies . In other words, all variables within each cluster are equicorrelated and the between-cluster correlation depends only on the clusters but not the representatives. This assumption is a way to reduce the number of unknown pair-wise correlations from to at most . Correlation matrices with a block structure occur in portfolio and credit risk modeling, where variables can be grouped by industry classifications or risk types; see, e.g., the block-structured DECO model (Engle and Kelly, 2012). They also arise in the modeling of categorical, clustered, or gene expression data, and in genomics studies. In the NASDAQ100 example, a block structure emerges upon relabeling of the variables, as shown in the middle panel of Figure 1, though it is still noisy.
This article describes a technique for learning the cluster structure from data and shows how to use the latter to devise a more efficient estimator of the correlation matrix. No prior knowledge of the clusters, their number or composition is assumed. We only require that the dependence within each cluster is exchangeable. The procedure we propose is an iterative algorithm that is similar to, but different from, agglomerative clustering. In contrast to model-based clustering which aims to cluster together observations from the same subpopulation of a multivariate mixture distribution, the current proposal aims at identifying elements of a correlation matrix that are equal. The algorithm also outputs an improved estimate of the correlation matrix which has a block structure, and an estimate of its asymptotic covariance matrix. In the above example of stock returns, the relabeling in the middle panel was done using the clusters identified through the proposed algorithm; the improved estimate of the correlation matrix is displayed in the right panel. As we prove asymptotically and illustrate via simulations, the improvement of the estimator can be substantial, in particular for . Even in the unstructured case when
and there is no gain asymptotically, the new estimator can perform substantially better in finite samples due to a bias-variance tradeoff, particularly whenis small compared to .
All procedures developed in this paper are based on the matrix of pair-wise Kendall rank correlations, which turned out to be slightly more convenient than Spearman’s rank correlation matrix, for reasons stated in Section 2. A clear advantage of this approach over the linear correlation matrix is that Kendall’s tau is margin-free, well-defined and well-behaved irrespective of the distribution of , making our methodology nonparametric and margin-free. In particular, it is not assumed that the variables in the same cluster are equally distributed; we only require that the marginal distributions are continuous. No Normality assumption is imposed either. However, when is multivariate Normal, or more generally elliptical, there is a one-to-one relationship between and the linear correlation matrix, and the improved estimator of developed in this paper may be used to obtain more efficient estimators of the linear correlation matrix and its inverse.
Beyond the estimation of correlation itself, our procedure can be used as a first step in building complex dependence models. When is large, a model for the distribution of needs to be both flexible and parsimonious. Within the Normal or elliptical model, this means that the number of free parameters in the correlation matrix needs to be reduced, and the block structure identified through our algorithm can serve precisely this purpose. Outside the Normal model, dependence in
can be conveniently described through copulas, for joint distribution ofcan be rewritten, for all , as
where are the univariate marginal distributions of and is a copula, i.e., a joint distribution function with standard uniform marginals (Sklar, 1959). To achieve flexibility and parsimony when is large, the complexity of the problem needs to be reduced through an ingenious construction of ; examples are vines (Kurowicka and Joe, 2011), factor models (Krupskii and Joe, 2013; Hua and Joe, 2017), or hierarchical constructions (Mai and Scherer, 2012; Brechmann, 2014; Joe, 2015). The cluster algorithm proposed in this paper is particularly well suited for such approaches: Equicorrelated clusters can first be identified through it and modeled by exchangeable lower-dimensional copulas. Dependence between clusters can then be achieved subsequently through vines or factors. As such, this paper contributes to the emerging literature on structure learning for copula models; clustering algorithms have been employed very recently to learn the structure of nested (also called hierarchical) Archimedean copulas Okhrin et al. (2013); Górecki et al. (2016, 2017a, 2017b).
The article is organized as follows. In Section 2, we discuss the partial exchangeability assumption and its implications. In Section 3, we construct an improved estimator of assuming a known cluster structure, derive its asymptotic distribution, and show that it has smaller asymptotic variance than the empirical Kendall rank correlation matrix when . The algorithm through which and the cluster structure can be learned from data, and which outputs an improved estimator of , is then introduced in Section 4. In Section 5, we discuss implications for the estimation of the linear correlation matrix. The new methodology is studied through simulations in Section 6 and illustrated on NASDAQ100 stock returns in Section 7. Section 8 concludes the paper. The estimation of the covariance matrix of the estimator of and proofs are relegated to the Appendix. Additional material from the simulation study and the data illustration as well as the link to the R-code to implement the method may be found in the Online Supplement.
2 Partial exchangeability assumption
Throughout, let be a random vector with continuous univariate marginals, denoted . In this case, the copula in Sklar’s decomposition (1) is unique; in fact, it is the joint distribution of the vector . The following partial exchangeability assumption plays a central role in this paper.
Partial Exchangeability Assumption (PEA).
For , let . A partition of satisfies the Partial Exchangeability Assumption (PEA) if for any and any permutation of such that for all and all , if and only if , one has
or equivalently, , where denotes equality in distribution.
To understand the 2, note first that the partition with satisfies the 2 only if is fully exchangeable, meaning that for all and any permutation of , ; examples of fully exchangeable copulas are Gaussian or Student with an equicorrelation matrix, and all Archimedean copulas. When , the 2 is a weaker version of full exchangeability. A partition for which 2 holds divides into clusters such that the copula is invariant under within-cluster permutations. In particular, for any , the copula of is fully exchangeable. In contrast to many standard clustering contexts, the 2 does not imply that variables in the same cluster are equally distributed, because it only concerns the copula of and not the marginals . As explained in Section 4, while the 2 may hold for several partitions, the coarsest partition that satisfies the 2 is unique, viz.
Finally, observe that the 2 is not restrictive in any way. In the completely unstructured case, in Eq. (2) is . However, as we demonstrate in Section 6, there may still be a substantial advantage in considering a coarser partition for inference purposes in finite samples. When the partition in Eq. (2) is such that , the number of distinct rank correlations is reduced from to
Several commonly used models make an implicit or explicit use of this kind of complexity reduction; examples include latent variable models such as frailty or random effects models, Markov random fields or graphical models, hierarchical copula models (Brechmann, 2014; Mai and Scherer, 2012), factor copulas (Gregory and Laurent, 2004; Krupskii and Joe, 2015), or nested (hierarchical) Archimedean copulas (Joe, 2015).
For a partition that satisfies the 2, we write whenever for some . Furthermore, the cluster membership matrix is a matrix whose th entry is given, for all , by .
Next, let be the matrix of pair-wise Kendall correlation coefficients. Specifically, for all , the th entry of is the population version of Kendall’s tau between and , viz.
i.e., the difference between the probabilities of concordance and discordance ofand its independent copy . Because depends only on the copula of , viz.
Suppose that the partition of satisfies the 2 and that and , where and . Then the copulas and are identical and, consequently, .
It follows from Proposition 1 that whenever and where and , any copula-based measure of association will satisfy ; examples of include Spearman’s rho or Gini’s gamma (Nelsen, 2006). The reason why we focus on Kendall’s tau in this paper is that the asymptotic and finite-sample variance of the empirical Kendall’s tau matrix have a tractable form, more so for instance than that of Spearman’s rho (Borkowf, 2002). Having said that, the procedures proposed here could in principle be extended to other measures of association.
Suppose now that a partition with satisfies the 2. Proposition 1 then implies that when for some , the th and th rows and columns in are identical, once the diagonal entries are aligned. Consequently, if the variables are relabeled so that the clusters are contiguous, then the cluster membership matrix is block-diagonal and is a block matrix. As an illustration, consider the following example, which we shall use throughout the paper.
Consider a random vector of dimension such that the partition of size given by , , and satisfies the 2. The corresponding cluster membership matrix and the matrix of pair-wise Kendall correlations are shown in Figure 2; the exact distribution of does not matter at this point. In this case, the clusters are not contiguous, i.e., is not block diagonal, and the block structure of is not easily seen. Once the variables are relabeled as , the partition that satisfies the 2 becomes , where
The clusters are now contiguous, is block-diagonal, and has an apparent block structure; see Figure 2. Every time we revisit this example, we work with the relabeled vector .
Although we use examples in which the matrix is block-diagonal for illustrative purposes, contiguity of the clusters is not required. In fact, given that is unknown, the variables are unlikely to be labeled so that has an apparent block structure. To describe the latter, we first need additional notation. To this end, let be an arbitrary symmetric matrix. The entries above the main diagonal can be stacked in a vector, say , of length . Note that the diagonal elements of play no role at this point. The particular way the vectorization is done is irrelevant, as long as it is the same throughout. For example, one may use the lexicographical ordering viz.
For arbitrary , refers to the pair of indices such that . Now any partition of induces a partition of the elements of , or, equivalently, of . For any , let
Note that the total number of nonempty blocks is given in Eq. (3) because when , is nonempty only if . Referring to the sets using a single index, the partition of is then given by
In analogy to the cluster membership matrix , we define a block membership matrix ; for all and , its th entry is given by . Finally, define the set of all symmetric matrices with a block structure given by , viz.
Note that only the elements of that are above the main diagonal enter the definition of in Eq. (9).
Now suppose that is such that the 2 holds and that the elements above the main diagonal of are stacked in . By Proposition 1, for any and , , or, equivalently, . This means that ; when no confusion can arise, we will also write . Consequently, there are only distinct elements in . Storing these in a vector , we thus have . This means that when 2 holds, the number of free parameters in is reduced from to given in Eq. (3). We revisit Example 1 to illustrate.
Consider the matrix corresponding to in Example 1, and stack it in a vector of length as in Eq. (6). Because there are clusters given in Eq. (5), has length and the cluster structure reduces the number of free parameters in from to . The six distinct blocks are visible in the right panel of Figure 2.
3 Improved estimation of
Suppose that is a random sample from and, for each , set . The classical nonparametric estimator of is ; for , its th entry is given by
As explained in Section 2, if the 2 holds for some partition with , the number of free parameters in reduces from to . We now show that an a priori knowledge of leads to a more efficient estimator of .
Recall first that for all , is a -statistic and thus unbiased and asymptotically Normal (Hoeffding, 1948). The behavior of was studied in El Maache and Lepage (2003) and Genest et al. (2011); results pertaining to the closely related coefficient of agreement appear in Ehrenberg (1952). If and denote the vectorized versions of and respectively, one has, as ,
where denotes convergence in distribution and is the -dimensional vector of zeros. Expressions for the asymptotic variance as well as the finite-sample variance of have been derived in Genest et al. (2011).
The asymptotic Normality of
suggests to base inference on the following loss function,
is the Mahalanobis distance between and that accounts for the heterogeneous variability of the entries of . The fact that the finite-sample variance is unknown is irrelevant for now; it will only become a concern in Section 4.
Considering an arbitrary , it is obvious that attains its minimum at since . Now suppose that is a partition of such that the 2 holds. Unless , it is extremely unlikely that . However, we can introduce these structural constraints implied by into the estimation procedure.
The block-wise averages that appear in are akin to the pair-wise linear correlation averaging from Elton and Gruber (1973) and Ledoit and Wolf (2003), and more particularly the block DECO in Engle and Kelly (2012), which uses block-wise averages of linear correlations. Averaging of pair-wise Kendall’s tau in order to fit nested Archimedean copula models has recently been employed in Okhrin and Tetereva (2017). Also note that the 2 plays a crucial role in the proof of Theorem 1. It is needed to invoke Proposition A2 and Lemma B4 and obtain a finite-sample variance matrix with a structure that will lead to the required simplifications. In particular, it is not enough to assume that for some partition .
When it introduces no confusion, we refer to as and to its matrix version as . What is crucial in Theorem 1 is that consists of the cluster averages of the elements of and as such does not involve the unknown finite-sample variance of , so an estimator of is not needed to compute . The information contained in is introduced by projecting onto . The resulting estimator is expected to be closer to the original matrix because the entries that estimate a same value are averaged over, thus reducing the estimation variance. In fact, for any , the asymptotic variance of is less than or equal to that of as a result of the following theorem.
To conclude this section, we illustrate using the setup in Example 1.
4 Learning the structure
Because the cluster structure is typically unknown, the improved estimator derived in Section 3 cannot be directly used. We now propose a way to learn from data and to obtain an improved estimator of as a by-product. To do so, we first identify candidate structures in Section 4.1 and then choose one among them in Section 4.2.
4.1 Creating a set of candidate structures
If the 2 holds for some partition , it holds for any refinement thereof, defined below.
Let be a partition of . A refinement of is a partition of such that and for any there exists such that .
The block structure implied by a refinement of is consistent with the block structure implied by . This is formalized in the next proposition, which follows easily from Eq. (9).
For partitions , of , is a refinement of if and only if .
Proposition 2 implies that if for some partition , then for any refinement thereof, .
Consider the partition given in Eq. (5) in Example 1. The partition with , , , is a refinement of since , and . Consequently, satisfies the 2 as well. Figure 4 shows the cluster membership matrices and corresponding to and , respectively. Also displayed are the estimates and ; one can see that the block structure of the former is embedded in the latter but not conversely.
While the partition that satisfies the 2 may not be unique, the following holds.
The fact that any refinement of in Eq. (2) also satisfies the 2 motivates us to start with the finest possible partition for which the 2 always holds, and to merge the clusters one at a time in a way that resembles hierarchical agglomerative clustering. Specifically, we will create a path through the set of all possible partitions of with for , with the aim that given in Eq. (2) is an element of . The construction of is motivated by the following result.
Let be an arbitrary partition of , be the block membership matrix corresponding to in Eq. (8), and , as in Theorem 1. If is positive definite, and is an estimator of such that exists and, as , element-wise in probability, the following holds.
The construction of relies on slowly introducing information through constraints under which the loss function in Eq. (11) is minimized. To do this, the estimation of the unknown finite-sample variance of needs to be considered. While does not appear in the estimator in Theorem 1, it is relevant for the construction of . In A, we detail how to obtain an estimator of for a given partition : We first compute the plug-in estimator of following Genest et al. (2011) and then average out certain entries of using the block structure of induced by . The resulting estimator, which also depends on , is denoted or simply when no confusion can arise. When is small compared to or when is large, not enough averaging is employed and may be too noisy. In such cases, we apply Steinian shrinkage following Devlin et al. (1975), viz.
where is a diagonal matrix whose diagonal entries are those of and is the shrinkage intensity parameter. As shown in A.3, element-wise in probability if as . When is large, we suggest using a value of close to so that a lot of shrinkage is applied. The extreme case is recommended when is large, as the estimation and storage of becomes virtually impossible otherwise.
Now suppose that the th partition has been selected; let denote the corresponding estimate of . To select the st cluster structure , merge two clusters at a time and choose the optimal merger, in the sense that
for given in Eq. (12). The minimization in Eq. (13) is done by simply going through all possible mergers; indicates that must be a refinement of , so that the previously introduced equality constraints are carried. We then update the estimate of to as in Theorem 1, the estimate of to and iterate the above steps until . The entire procedure is formalized in Algorithm 1 below.
Algorithm 1 returns a sequence of decreasingly complex structures; note that is a refinement of for all . The partitions and are inevitable outputs of the algorithm; this is why we refer to as a path we took to go from to in the space of all partitions. If present on the path, in Eq. (2) is always , where . An illustration is provided in Example 5 below.
Figure 5 shows the application of Algorithm 1 on constructed from the random sample from in Example 3; the true cluster structure in Eq. (2) is given by Eq. (5). It indeed lies on the path; it corresponds to .
From each , we can compute the cluster membership matrix , the improved estimate as in Theorem 1, its matrix version , and an estimate of upon setting . Letting be the block membership matrix corresponding to in Eq. (8) and , a consistent estimator of the covariance matrix of is then , which simplifies to by Lemma B6.
Consider a partition for which the 2 holds, and let be a refinement thereof. Let and be the block membership matrices derived from Eq. (8) corresponding to and , respectively, and set and . Then because , where and are as defined in A.2, Lemma B6 applies and . Furthermore, for and , . Hence,
Now set . If is a sequence of partitions such that , , and for each , is a refinement of . A successive application of Eq. (14) then gives
In particular, for any , . This motivates that in Algorithm 1, only two clusters are merged at a time.
4.2 Structure selection
Proposition 4 suggests that the loss will increase sharply when the clustering has become too coarse. The following result offers a way to determine when this sharp increase may have occurred.
where is the number of blocks given in Eq. (3) corresponding to the th partition . For large enough, we expect that a sharp decrease in will occur at the first such that , i.e., when the matrix corresponding to becomes inadmissible. We do not use the criterion (15) as a formal -value, but rather as a tool that can help with structure selection. In Appendix C.2 in the Online Supplement, we present a naive automated selection procedure based on Eq. (15), which we refer to as Algorithm C1.