1 Introduction
1.20
Canonical correlation analysis (cca) (Jordan1875; Hotelling1936)
is a multivariate method that aims at reducing the correlation structure between two sets of variables to the simplest possible form (hence the name “canonical”) through linear transformations of the variables within each set. Put simply, given two sets of variables, the method seeks linear mixtures within each set, such that each resulting mixture from one set is maximally correlated with a corresponding mixture from the other set, but uncorrelated with all other mixtures in either set.
From a peak use through from the late 1970’s until mid1980’s, the method has recently regained popularity, presumably thanks to its ability to uncover latent, common factors underlying association between multiple measurements obtained, something relevant in recent research that uses high dimensional phenotyping and investigates betweensubject variability across multiple domains. This is in contrast to initial studies that introduced cca to the imaging field (Friston1995; Friston1996; Worsley1997; Friman2001; Friman2002; Friman2003) for investigation of signal variation in functional magnetic resonance imaging (fmri) time series. For example, Smith2015 used cca to identify underlying factors associating brain connectivity features to various demographic, psychometric, and lifestyle measures; Drysdale2017 used cca to investigate associations between brain connectivity and clinical assessments, and found two canonical variables that would allow classification of participants into distinct categories (but see Dinga2019); Bijsterbosch2018 and Xia2018 likewise used cca to identify associations between functional connectivity and various indices of behaviour and psychopathology, whereas Alnaes2019 used cca to investigate the association between imaging measurements and cognitive, behavioral, psychosocial and socioeconomic indices; Li2019 used cca to investigate, among subjects, the topography of the global fmri signal and its relationship with a number of cognitive and behavioral measurements; Clemens2020 used a combination of pattern classification algorithms and cca to study imaging and behavioral correlates of the subjective perception of oneself belonging to a particular gender. All these betweensubject, group level studies used some form of permutation to assess the significance of the results, and most of them regressed out putative nuisance variables or confounds from the data before proceeding to inference.
Permutation tests are well known and widely used. Among their many benefits, these tests lead to valid inferences while requiring assumptions that are commonly satisfied in betweensubject analyses, such that of exchangeability of observations under the null hypothesis. We could not find, however, studies that explicitly verified whether the performance of permutation tests for
cca would be valid. Investigating such validity, we found that trivial implementations of permutation inference for ccaare inadequate on four different grounds. First, simple, uncorrected permutation pvalues are not guaranteed to be monotonically related to the canonical correlations, leading to inadmissible results; for the same reason, multiple testing correction using false discovery rate is also inadmissible. Second, except for the highest canonical correlation, a simple onestep estimation of all others without considering the variability already explained by previous canonical variables in relation to each of them also leads to inflated per comparison error rates and thus, invalid results. Third, regressing out nuisance variables without consideration to the introduction of dependencies among observations caused by residualisation lead to an invalid test, with excess false positives. Fourth, multiple testing correction using the distribution of the maximum test statistic leads to conservative results, except for the highest canonical correlation.
In this paper we explain and discuss in detail each of these problems, and offer solutions that address each of them. In particular, we propose a novel stepwise estimation method for the canonical correlations and canonical variables that remains valid even when the number of variables is not the same for both sides of cca. We propose a method that transforms residualised data to a lower dimensional basis where exchangeability — as required for the validity of permutation tests — holds. We also propose that inference that considers multiple canonical correlations should use a closed testing procedure that is more powerful than the usual correction method used in permutation tests that use the distribution of the maximum statistic; the procedure also ensures a monotonic relationship between pvalues and canonical correlations. Finally, we provide a complete, general algorithm for valid inferences for cca.
2 Theory
2.1 Notation and general aspects
Thorough definition and derivation of cca
is available in many classical textbooks on multivariate analysis
(e.g., Kendall1975; Mardia1979; Brillinger1981; Muirhead1982; Seber1984; Krzanowski1988; Anderson2003); the reader is referred to these for a comprehensive overview. Here we present concisely and only allude to the distinction between population () and sample () canonical correlations where strictly needed. Let and be each one a collection of, respectively, and variables observed from subjects, . Without loss of generality, assume that , that the columns of and are meancentered, that these matrices are of full rank, and define , , and . The goal of cca is to identify canonical coefficients or canonical weights and , , such that the pairs of canonical variables, defined as:(1) 
have correlations that are maximal, under the constraint that . Estimation of and amounts to finding the solutions to:
(2) 
where the unknowns are , and ; are the sample canonical correlations, i.e., the correlations between the estimated canonical variables and . The coefficients
are eigenvectors of
, whereas are eigenvectors of; the respective eigenvalues (for either
or , as these eigenvalues are the same) are . For convenience, we call canonical component the ensemble formed by the th canonical correlation, its corresponding pair of canonical variables, and associated pair of canonical coefficients; canonical variables may also be termed modes of variation.The typical method for estimation involves an iterative procedure that finds one and at a time, with computed as a function of these. However, the method proposed by Bjorck1973 is more concise and numerically more stable; it is described in the Appendix (Algorithm 3); the canonical correlations are then produced in descending order, . This positiveness of all canonical correlations is a consequence of these values being explicitly maximised during estimation. Reversal of the sign of the coefficients can always be accompanied by the reversal of the sign of the corresponding coefficients in the other side, to no net effect on . That is, the signs of the canonical variables and coefficients are indeterminate, and any solution is arbitrary; nothing can be concluded about the specific direction of effects with cca.
2.2 Parametric inference
The distribution of the sample canonical correlations is intractable, even under assumptions of normality and independence among subjects, and is a function of the population correlations (Constantine1963; James1964). This difficulty led to the development of a rich asymptotic theory (Fisher1939; Hsu1941; Lawley1959; Fujikosh1977; Glynn1978). However, these approximations have been shown to be extremely sensitive to departures from normality, or require additional terms that further complicate their use (Muirhead1980); Brillinger1981 recommended resampling methods to estimate parameters used by normal approximations, which otherwise can be biased (Anderson2003). These difficulties pose challenges for inference. Even though some computationally efficient algorithms have been proposed (Koev2006), these tests continue to be rarely used.
Instead, a test based on whether a certain number of correlations are equal to zero has been proposed. The null hypothesis is , i.e., , for , that is, the null is that population canonical correlations (the smaller ones) are zero (Bartlett1938; Bartlett1947; Marriott1952; Lawley1959; Fujikoshi1974), versus the alternative that at least one is not, i.e., . The test is based on the statistic proposed by Wilks1935, as:
(3) 
where the constant if there are no nuisance variables (Section 2.6). Under the null hypothesis, follows an approximate
distribution with degrees of freedom
if each of the columns of andhave values that are independent and identically distributed following a normal distribution
(but see Glynn1978, for a different expression). Unfortunately, this test is sensitive to departures from normality, particularly in the presence of outliers, and its use has been questioned
(Seber1984; Harris1976).Another test statistic is based on Roy1953^{1}^{1}1Roy1953 proposed two distinct but related test statistics; these are both known as “Roy’s largest root”. Here we use the one that is interpreted as a coefficient of determination, and not the other that is interpreted as an statistic. See Kuhfeld1986 for a complete discussion., and is simply:
(4) 
The critical values for the corresponding parametric distribution at a given test level can be found in the charts provided by Heck1960, using as parameters , , and (Lee1978), where the constant if there are no nuisance variables (Section 2.6), or in tables provided by Kres1975; more recent approximations for normally distributed data can be found in Chiani2016 and Johnstone2017. Some approximations, however, are considered conservative (Harris1976; Harris2013), even when assumptions are met. Note that, while Roy1953 proposed the use of the largest value as test statistic, which would then be , any given null at position must have already considered the previous canonical components, from until , such that the maximum statistic, after the previous canonical correlations have been removed from the model, is always the current one. A similar reasoning holds for the smallest canonical correlations in the test proposed by Wilks1935. This feature is exploited in the stepwise rejective procedure proposed in Section 2.5.
2.3 Permutation inference
The above problems can be eschewed with the use of resampling methods, such as permutation. An intuitive (but inadequate) permutation test for cca could be constructed by randomly permuting the rows of or . For each shuffling of the data, indicated by , a new set of canonical correlations and associated statistics would be computed. A pvalue would be obtained as , where is the indicator (Kronecker) function, which evaluates as 1 if the condition inside the brackets is true, or 0 otherwise, and the index corresponds to the unpermuted data (i.e., no permutation, with the data in their original ordering).
Such a simple procedure, however, would ignore the fact that, this resampling scheme matches the first null hypothesis , but not the subsequent ones. For a given canonical correlation at position being tested, , one must generate a permutation that satisfy the corresponding null , but not necessarily . Otherwise, latent effects represented by the corresponding earlier canonical variables and would, in the procedure above, remain in the and at the time these are permuted. However, the variance associated with these earlier canonical variables would have already been explained through the rejection of their respective null hypothesis up to . This variance is now a nuisance for the positions from (inclusive) onward. It contains information that are not pertinent to position and subsequent ones, which therefore should not be used to build the null distribution, i.e., variance should not be reused in the shufflings that lead to a given or subsequent correlations.
Fortunately, cca is invariant to linear transformations of that mix the variables in or in . Since the canonical variables are themselves linear transformations of these input variables (Equation 1), a second cca using and in place of the initial and lead to the same solutions. Yet, unless , will not span the same space as . In principle, this would be inconsequential as far as the canonical variables are concerned. However, ignoring the variability in not contained in would again affect the pvalues should and be used in a permutation test, as the permuted data would not be representative of the original (unpermuted) that led to these initial canonical variables. To mitigate the problem, include into the matrix of canonical coefficients their orthogonal complement, i.e., compute , then use instead of as a replacement for . In this paper we adopted the convention that , but the same procedure works in reverse and, algorithmically, it might as well be simpler to compute also a and use it instead of as a replacement for . If , then .
While these transformations do not change in any way the canonical components, they allow the construction of an improved algorithm that addresses the issue of variability already explained by canonical variables of lower rank (i.e., the ones with order indices smaller than that of a given one). It consists of running an initial cca using and to obtain and , then subject these to a second cca and permutation testing while, crucially, at each permutation, iteratively repeating cca times, each using not the whole and , but only from the th component onwards, i.e., and for the test about the th canonical correlation. Of note, a specific test level need not be specified at the time in which the above iterative (stepwise) procedure is performed; instead, and in combination with the multiple testing procedure described below, adjusted pvalues are are computed, which then are used to accept or reject the null for the th correlation. Algorithm 1 (Section 2.8) shows the procedure in detail (the algorithm contains other details that are discussed in the next sections).
A number of further aspects need be considered in permutation tests: the number of possible reorderings of the data, the need for permutations that break the association between the variables being tested, the random selection of permutations from the permutation set when not all possible permutations can be used, the choice of the test statistic, how to correct for the multiplicity of tests, the number of permutations to allow narrow confidence intervals around pvalues, among others. Most of these topics have been discussed in
Winkler2014 and references therein and will not be repeated here. However, for cca, some aspects deserve special treatment and are considered below.2.4 Choice of the statistic
Asymptotically, using Wilks’ statistic or Roy’s are expected to lead to the same conclusion since all correlations are sorted in descending order: if , then all subsequent ones must be zero; likewise, if , then clearly at least one correlation between and is larger than zero, which has to include itself. Moreover, permutation under the null is justifiable in the complete absence of association between the two sets, which implies that, under the null , all correlations are equal to zero. With finite data, however, one statistic can be more powerful than the other in different settings. Their relative performance is studied in Sections 3 and 4.
Computationally, Wilks’ requires more operations to be performed compared to Roy’s statistic. Since the relationship between is is monotonic, the two are permutationally equivalent, using alone is sufficient, which makes Roy’s the absolute fastest. However, even for Wilks’, the amount of computation required is negligible compared to the overall number of operations needed for estimation of the canonical coefficients, such that in practice, the choice between the two should be in terms of power.
In either case, while inference refers to the respective null hypothesis at position , it is not to be understood as inference on the index . Rather, the null is merely conditional on the nulls for all previous correlations from to having been rejected. Rejecting the null implies that the correlation observed at position is unlikely high if there is no association between the two variable sets after all previous (from to ) canonical variables have been sequentially removed, as described in Section 2.3. In Algorithm 1 (Section 2.8) this is done in line 29, that uses as inputs to cca the precomputed canonical variables only from position onwards, as opposed to all of them.
2.5 Multiplicity of tests
For either of these two test statistics, the ordering of the canonical correlations from larger (farther from zero) to smaller (closer to zero) imply that rejection of the null hypothesis at each must happen sequentially, starting from , using the respective test statistic and associated pvalue until the the null , for some , is not rejected at a predefined test level . Then, at that position , the procedure stops, and the null is retained from that position (inclusive) onward until the final index .
The ordering of the canonical correlations brings additional consequences. First, because rejection of implies rejection of all joint (intersection) hypotheses that include , that is , such sequentially rejective procedure is also a closed testing procedure (ctp), which controls the amount of any type i error across all tests, i.e., the familywise error rate (fwer) in the strong sense (Marcus1976; Hochberg1987). Thus, there is no need for further correction for multiple testing. Another way of stating the same is that the test for a given , , has been “protected” by the test at the position at the level . Adjusted pvalues (in the fwer sense) can be computed as , that is, the fweradjusted pvalue for the canonical component is the cumulative maximum pvalue up to position . Such adjusted pvalues can be considered significant if their value is below the desired test level .
The second consequence is that fwer adjustment of pvalues using the distribution of the maximum statistic (not to be confused with the cumulative maximum described in the above paragraph) will be conservative for all canonical components except the first. The reason is that the maximum statistic is always the distribution of the first, which is stochastically dominant over all others. The distribution of the maximum is usually sought as a shortcut to a ctp when the condition of subset pivotality holds (Westfall1993), as that reduces the computational burden from tests to only tests. Interestingly, the ordering of the canonical correlations from largest to smallest leads to a ctp that does not use the distribution of the maximum, and yet requires only tests, regardless of whether subset pivotality holds.
A third consequence is that using permutation pvalues outside the above sequentially rejective procedure that controls fwer is not appropriate since these simple, uncorrected pvalues are not guaranteed to be monotonically related to the canonical correlations . Attempts to use these uncorrected pvalues outside a ctp would lead to paradoxical results whereby higher, stronger canonical correlations could not be considered significant, yet later ones, smaller, weaker, could be so; that is, it would make the test inadmissible (Lehmann2005, p. 232). For the same reason, such simpl pvalues should not be subjected to correction using false discovery rate (fdr; Benjamini1995), because the ordering of pvalues for fdr, from smallest to largest, is not guaranteed to match the ordering of the canonical correlations, leading similarly to an inadmissible test and nonsensical results.
2.6 Nuisance variables
Few authors discussed nuisance variables or confounds in canonical correlation analysis, e.g., Roy1957; Rao1969; Timm1976; Lee1978; Sato2010. Let the be a matrix of nuisance variables, including an intercept. Partial cca consists of regressing out from both and , then subjecting the residuals to the usual cca, under the assumption (or knowledge) that is a confound to both and . This is distinct from part cca, which consists of regressing out from either or , under the assumption that affects only one of the two sets, then proceeding with the usual cca using the respective residuals in the place of the original set. Finally, one can define bipartial cca, where is regressed out from , while another set of variables , of size , is regressed out from ; as in the previous cases, the respective residuals are used in an otherwise usual cca (Table 1). In the parametric case, inference can proceed using the distribution of or (Equations 3 and 4) using for part, and for bipartial cca (Timm1976; Lee1978).
Name  Left set  Right set 
cca (“full”, no nuisance)  
Partial cca  
Part cca  
Bipartial cca 
is a residual forming matrix that considers the nuisance variables in , and is computed as , where the symbol represents a pseudoinverse. is computed similarly, considering the nuisance variables in . The choice which set is on left or right side is arbitrary.
Permutation inference, however, requires further considerations, otherwise, as shown in Section 4, results will be invalid. Consider first the case without nuisance variables. Let be the horizontal concatenation of the two sets of variables whose association is being investigated. Both and occupy an dimensional space and, so does, therefore, . A random permutation of the rows of either of the two sets of variables will not affect their dimensionalities. For example continues to occupy the same dimensional space as .
However, residualisation changes this scenario. Let be the residual forming matrix associated with the nuisance variables , with the symbol representing the Moore–Penrose pseudoinverse. has the following interesting properties: (symmetry) and (idempotency), which will be exploited later. In partial cca, can be regressed out from and by computing and . Let be the concatenation of the residualised sets and with respect to . While occupies an dimensional space, occupies a smaller one; its dimensions are, at most, of a size given by the rank of , which is assuming and are of full rank. The same holds for and and, therefore, for and .
Permutation affects these relations: while still occupies a space of dimensions as the unpermuted , , differently than , may now occupy a space with dimensions anywhere between and , depending on a given random permutation. With fewer effective observations determined by this lower space after residualisation, and the same number of variables, the canonical correlations in the unpermuted case are stochastically larger than in the permuted, which in turn leads to an excess of spuriously small pvalues. For not occupying the same space as the original, the permuted data are no longer a similar realisation of the unpermuted, thus violating exchangeability, and specifically causing the distribution of the test statistics to be unduly shifted to the left.
Here the following solution is proposed: using the results from Huh2001, let be a semiorthogonal basis (Abadir2005, p. 84) for the column space of constructed via, e.g., spectral or Schur decomposition, such that , , where , and . Then cca on and lead to the same solutions as on and . The reason is that, from Section 2.1, , which is the same as , since, as discussed earlier, is symmetric and idempotent. In a similar manner, , and likewise, . While premultiplication by does not affect the cca results, it changes the dependence structure among the rows of the data: occupies a dimensional space, and so does , for a permutation matrix of size , such that exchangeability holds, thus allowing a valid permutation test to be conducted successfully.
The treatment of partial cca, as described above, can be seen as a particular case of bipartial cca in which , that is, the set of nuisance variables in both sides is the same. Of course, for bipartial cca proper, this equality not necessarily holds, and the two sets may be different in different ways: may be entirely orthogonal to , or some or all variables from one set may be fully represented in the other, either directly (e.g., some of the variables present in both sets), or as linear combinations of one set in the other, or it may be that these two sets are simply not orthogonal. The direct strategy of computing
and its respective semiorthogonal matrix
lead to difficulties because, unless , the products and will not have the same number of rows: has , whereas has rows, thus preventing the computation of cca.A more general solution, that accommodates bipartial and, therefore, is a generalisation for all cases of nuisance variables in cca, consists of randomly permuting rows of and/or using, respectively, permutation matrices and of respective sizes and , therefore permuting in the lower dimensional space where and are exchangeable, then, crucially, reestablishing the original number of rows using the property that the transpose of a semiorthogonal matrix is the same as its inverse (), to only then perform cca. Therefore cca is computed using and . Left and right sides will continue to have rank and respectively, will have already been permuted, and will both have rows. The procedure is fully symmetric in that, when the permutation matrices and are both identity matrices (of sizes and , respectively), which is equivalent to no permutation, the expression for both sides reduce to the residualised data and , as desired. The concatenation has the same rank as that of , thus addressing the above problem of the unpermuted test statistic having a different and stochastically dominant distribution over that of the permuted data. Table 2 summarises the proposed solution for all cases, including part cca.
Name  Left set  Right set 
cca (“full”, no nuisance)  
Partial cca  
Part cca  
Bipartial cca 
is a semiorthogonal basis for the column space of , such that , , where , and . is a similarly defined matrix for the column space of , . The bipartial cca case generalizes all others: for “full” cca, , and so, ; for partial cca, ; for part cca , and so, . For full and partial, premultiplication by can be omitted since , such that results do not change. Once these simplifications are considered, the general bipartial cca case reduces to the other three as shown in the Table. Full and partial have matching number of rows in both sides, such that only one side needs be permuted; part and bipartial, however, have at the time of the permutation a different number of rows in each side, such that both can be permuted separately through the use of suitably sized permutation matrices and ; is size for full cca, and for the three other cases; is size for full and for part cca, and for the two other cases.
2.7 Restricted exchangeability
The above method uses the Huh–Jhun semiorthogonal matrix applied to cca and leads to a valid permutation test provided that there are no dependencies among the rows of . That is, the method takes into account dependencies introduced by the regression of and/or out from and/or , but not dependencies that might already exist in the data, and which generally preclude direct use of permutation tests. However, structured dependencies, such as those that may exist, for instance, in studies that involve repeated measurements, or for those in which participants do not constitute independent observations, e.g., sibpairs, as in the Human Connectome Project (hcp; VanEssen2012), can be treated by allowing only those permutations that respect such dependency structure (Winkler2015). Unfortunately, the Huh–Jhun semiorthogonal matrix does not respect such structure, blurring information from observations across blocks, and preventing the definition of a meaningful mapping from the original observations that define the block structure to the or observations that are ultimately permuted.^{2}^{2}2There is an exception: if has a block diagonal structure and the observations encompassed by such blocks coincide with the exchangeability blocks, then an algorithm that uses Huh–Jhun and block permutation can be constructed.
Such mapping, whereby each one of the and rows of, respectively, and corresponds uniquely to one of the rows of the original data and , can be obtained using a different method, due to Theil1965; Theil1968, and reviewed in detail by Magnus2005. Consider first the case of . In the Theil method, that here is adapted for cca, , where the exponent represents the positive definite matrix square root, and is a selection matrix,
, that is, an identity matrix from which some
rows have been removed. Premultiplication of any matrix by a selection matrix deletes specific rows, i.e., the ones that correspond to columns that are all zero in the selection matrix (Figure 1). The thus computed are the best linear unbiased residuals with scalar covariance (blus), in that they are unbiased estimates of
, where are the (unknown) true errors after the nuisance effects of have been removed from ; for cca using observed data, contains the variance of interest, which may be shared among linear combinations of variables in both sides, and is therefore what is subjected to cca and statistical testing. For partial cca, is the same for both sides; for bipartial cca, similar computations hold for the other side, i.e., . Table 3 summarises the two methods.Method  Matrix 
Theil1965  
Huh2001  (via svd or Schur) 
is the residualforming matrix ( or , for the respective set of nuisance variables, subscript dropped). In Theil, is a (for ) or (for ) selection matrix; the matrix square root (the exponent ) is the positive definite solution. In Huh–Jhun, after Schur or svd factorisations are computed, the or columns of that have corresponding zero eigenvalues in the diagonal of are removed. For both methods, , , and . Both methods aim at obtaining residuals with a scalar covariance matrix . Theil explictly seeks blus residuals. However, strictly, does not need to be a selection matrix: choose to be , using computed with the Huh–Jhun approach. Then, following Magnus2005, it can be shown that Huh–Jhun also provides blus residuals.
To construct a permutation procedure for cca that respects the block structure, the Theil method can be used to compute instead of using the Huh–Jhun approach. Choose observations to be removed from both sides (for partial cca, since ). Construct the selection matrix of size , define the exchangeability blocks based on observations, compute and using the same for both (for part cca, use the same strategy as for bipartial, replacing for ), residualise (in the blus sense) the input variables by computing and . These have the same number of rows, and the dependencies among these rows is the same for both sides; thus, only one side needs be subjected to random permutations that respect such existing dependencies. Optionally, after permutation, the number of observations may be reestablished by premultiplication by and . Finally, cca is performed, with observation to the aspects discussed in Sections 2.3 and 2.5. A detailed algorithm is presented in Section 2.8.
It remains to be decided how to select the observations to be dropped. In principle, any set could be considered for removal, provided that and continue to be of full rank. Some informed choices, however, could be more powerful than others. One of the main conclusions from Winkler2015 is that the complexity of the dependence structure and the ensuing restrictions on exchangeability lead to reductions in power. Thus, natural candidates for removal are observations that, once removed, cause the overall dependence structure to be simpler. For example, it is sometimes the case that some observations are so uniquely related to all others that there are no other observations like them in the sample. These observations, therefore, cannot be permuted with any other, or perhaps with only a few. Their contribution to hypothesis testing in the permutation case is minimal, and their removal are less likely to affect a decision on rejection of the null hypothesis. Consider for example a design that has many monozygotic, dizygotic, and nontwin pairs of subjects, and that in the sample, there happens to be a single pair of halfsiblings. It is well known that, for heritable traits, genetic resemblance depends on the kinship among individuals; halfsiblings are expected to have a different degree of statistical dependency among each other compared to each one of the other types of sibships in this sample. Thus, in there being just one such pair, it would be reasonable to prioritise it for exclusion, while keeping others.
2.8 General algorithm
A set of steps for permutation inference for cca is described in Algorithm 1. In it, input variables and will have been meancentered before the algorithm begins, or an intercept will have been included as nuisance variable in both and . is a set containing pairs of permutation matrices indexed by . In this set, the first permutation is always “no permutation”, i.e., , such that , for all . For the cases in which only one side of cca needs be permuted (Table 2), or for the cases in which , or when there are dependencies among the data such that the Theil method is used to construct (Table 3), then can be set as for all . Details on how is defined in observance to the null hypothesis and respecting structured dependencies among the data have been discussed in Winkler2014; Winkler2015. In the algorithm, can be larger, equal, or smaller than . Optional input arguments are the matrices with nuisance variables and , and the selection matrix . If is supplied but not , then the algorithm performs part or partial cca, depending on the Boolean argument partial; if both and are supplied, the algorithm performs bipartial cca; if neither is supplied, then “full” cca is performed. If is supplied, then the blus residuals based on Theil are used; otherwise, Huh–Jhun residuals are used. For either of these two cases, the semiorthogonal matrix is computed using a separate, ancillary function named “semiortho”, described in the Appendix.
An initial cca using residualised data is done in line 19; this uses another ancillary function, named “cca”, and also described in the Appendix; this function returns three results: the canonical coefficients and , and the canonical correlations . The canonical coefficients are used to compute the canonical variables and , augmented by their orthogonal complement needed to ensure that they span the same space as the variables subjected to this initial cca; the canonical correlations are ignored at this point and not stored (hence the placeholder “_”). A counter for each canonical component is initialised as 0.
The core part of the algorithm are the two loops that run over the permutations in and the canonical components (between lines 25 and 35). At each permutation , cca is executed times. In each, the columns of and that precede preceding the current are removed, such that their respective variances are not allowed to influence the canonical correlations at position . At each permutation, the canonical correlations are obtained (the third output from the function “cca”) and used to compute the associated test statistic. As shown, Wilks’ statistic, , is used, simplified by the removal of the constant term, which does not affect permutation pvalues. For numerical stability, sum of logs is favoured over the log of a product (compare line 30 with Equation 3). For inference using Roy’s statistic, replace the condition for in line 31; this modification alone is sufficient as is permutationally equivalent to . In that case, computations indicated in line 30 are no longer needed and can be removed to save computational time.
Whenever the statistic for the correlation at position in a given permutation is higher or equal than that for the unpermuted data, the counter is incremented (line 32). After the loop, the counter is converted into a pvalue for each . These simple, uncorrected pvalues, however, are not useful. Instead, fweradjusted pvalues are computed under closure using the cumulative maximum, i.e., the pvalue for is the largest (least significant) uncorrected pvalue up to position . The algorithm returns then these adjusted pvalues, which can be compared to a predefined test level to establish significance. Note that itself is never used in the algorithm.
As presented, the algorithm does not cover dimensionality reduction or any penalty to enforce sparse solutions for cca
. Dimensionality reduction using methods such as principal component analysis (
pca) or independent component analysis (
ica) with selection of components, if included, would be performed after residualisation, but before cca. Thus, in the algorithm, pca or ica, if executed, would be done between lines 18 and 19. As for the many forms of sparse or penalised cca (Nielsen2002; Waaijenborg2007; Parkhomenko2009; Witten2009; Hardoon2011; Gao2017; Ma2018; Tan2018), these can be incorporated into the algorithm through the replacement of the classical cca in lines 19 and 29 for one of these methods.3 Evaluation Methods
In this section we describe the synthetic data and methods used to investigate error rates and power under the different choices for the various aspects presented in Section 2 at each stage of a permutation test for cca, providing empirical evidence for the approach proposed. An overview of these aspects and choices at each stage is shown in Table 4
. For each case, we use a series of simulation scenarios: each consists of a set of synthetic variables constructed using random values drawn from a normal or a nonnormal (kurtotic or binary) probability distribution, sometimes with or without dimensionality reduction using principal components analysis
(pca; Hotelling1933; Jolliffe2002), sometimes with or without signal, and sometimes with or without nuisance variables. We also consider cases with large sample sizes and large number of variables. An overview of these scenarios (there are twenty of them) is in Table 5.Step  Possible strategies studied  Use  Theory  Scenarios 
Estimation of the canonical components  (a) All in a single step.  ✗  Section 2.3.  i–vi. 
(b) Stepwise; variance already explained removed.  ✓  
Inclusion of the complement of the canonical coefficients  (a) Null space not included.  ✗  Section 2.3.  i–vi. 
(b) Null space included.  ✓  
Correction for multiple testing  (a) Uncorrected, simple pvalues, .  ✗  Section 2.5.  i–vi, xviii. 
(b) Corrected, cumulative maximum, .  ✓  
(c) Corrected, distribution of the maximum, .  
Treatment of nuisance variables  (a) Simple residualisation ().  ✗  Sections 2.6 and 2.7.  vii–xviii. 
(b) Huh–Jhun method.  ✓  
(c) Theil method.  ✓  
Choice of the test statistic  (a) Wilks’ .  ✓  Sections 2.2 and 2.4  xvii, xviii. 
(b) Roy’s largest root, .  ✓  
✓ Can or should be used.  Can but should not be used.  ✗ Cannot or should not be used. 
Scenarios  #(pca)  Distribution  Signals  #Perms.  #Reps.  
Without nuisance  i  100  16  20  0  0  —  normal  —  2000  2000 
ii  100  16  20  0  0  10  normal  —  2000  2000  
iii  100  16  20  0  0  —  kurtotic  —  2000  2000  
iv  100  16  20  0  0  10  kurtotic  —  2000  2000  
v  100  16  20  0  0  —  binary  —  2000  2000  
vi  100  16  20  0  0  10  binary  —  2000  2000  
Partial cca  vii  100  16  20  15  —  normal  —  2000  2000  
viii  100  16  20  15  10  normal  —  2000  2000  
ix  100  16  20  15  —  kurtotic  —  2000  2000  
x  100  16  20  15  10  kurtotic  —  2000  2000  
xi  100  16  20  15  —  binary  —  2000  2000  
xii  100  16  20  15  10  binary  —  2000  2000  
Bipartial cca  xiii  100  16  20  15  15  —  normal  —  2000  2000 
xiv  100  16  20  15  15  10  normal  —  2000  2000  
Larger samples  xv  16  20  15  —  normal  —  500  500  
xvi  16  20  15  10  normal  —  500  500  
With signal  xvii  100  16  20  0  0  —  normal  sparse  2000  2000 
xviii  100  16  20  0  0  —  normal  dense  2000  2000 
* For scenarios xv and xvi, sample size varied, . In the table, and refer to the number of nuisance variables other than the intercept, which is always included (so the number of nuisance variables in left and right sides for all the simulation scenarios was always, respectively and ). For partial cca, the number of nuisance variables on one side is always the same as in the other, i.e., , but that does not have to be for bipartial cca, even though here the same size was used. The case with larger samples was used for investigation of partial cca.
We start by investigating aspects related to the estimation of the canonical components at each permutation. Specifically, we consider (a) a onestep estimation of all canonical components, from to , versus (b) sequential estimation that removes, for the th canonical component in a given permutation, the variance already explained by the previous ones, as described in Section 2.3. With respect to the inclusion of the complement of the canonical coefficients, we consider (a) without the inclusion of the null space of the canonical coefficients, versus (b) with its inclusion so as to ensure that all variance from the original data not explained in the initial cca is considered in the estimation at every permutation, as described in Section 2.3. With respect to multiple testing, and consider the following strategies: (a) simple, uncorrected pvalues, , (b) corrected under closure, , and (c) corrected using the distribution of the maximum statistic ; both and offer fwer control, as discussed in Section 2.5. Keeping the same notation, we define scenarios i–vi consisting of observations, with variables on the left side () of cca and variables on the right side () (the procedure is symmetric; the choice of sides is arbitrary and do not affect results); for these six scenarios, data are drawn from one of three possible distributions: a normal distribution with zero mean and unit variance, a Student’s distribution with variable degrees of freedom
(kurtotic), or a Bernoulli distribution with parameter
(binary). Analyses with and without dimensionality reduction to 10 variables using pca are considered. The number of permutations used to compute pvalues was set as , with realisations (repetitions), thus allowing the computation of error rates.We then turn our attention to aspects related to nuisance and residualisation discussed in Sections 2.6 and 2.7. We consider (a) simple residualisation, (b) residualisation using the Huh–Jhun method, and (c) residualisation using the Theil method. For this purpose, scenarios vii–xvii are constructed similarly as i–vi, except that a third set of variables is used as nuisance for partial cca, whereas two other scenarios, xiii and xiv, a fourth set of variables of variables is used as nuisance for bipartial cca.
The impact of ignoring, in samples substantially larger than the number of variables, the dependencies introduced by the residualisation of both sides of cca is studied with scenarios xv and xvi, that consider samples progressively larger, , while keeping the other parameters similar as in scenarios vii and viii. Finally, we briefly investigate power and the choice of the test statistic: we consider (a) Wilks’ statistic (), as well as (b) Roy’s largest root (), as discussed in Sections 2.2 and 2.4. We define scenarios xvii and xviii similarly as i, this time including a strong, true signal in one canonical component, thus named “sparse”, or multiple, weaker signals shared across multiple (half of the smaller set, thus, “dense”). For all scenarios, an intercept is always included as nuisance variable in both sides such that the actual number of nuisance variables is and for each side, respectively. To report confidence intervals (95%), the Wilson1927 method is used.
4 Results
In the results below, the Sections 4.1 and 4.2 establish empirically that with an estimation method (i) that includes the null space of the canonical coefficients, (ii) that finds the canonical correlations in an iterative manner, and (iii) that after computing pvalues through a closed testing procedure, the error rates are controlled. The subsequent results, from Section 4.3 onwards, consider then only this valid approach.
4.1 Estimation strategies
Not including the complement of the canonical coefficients (null space) caused error rates to be dramatically inflated, well above the expected test level (5%), regardless of whether the estimation used the single step or the stepwise procedure, and regardless of any of the multiple testing correction strategies discussed; these results are shown in Table 6.
Null space not included  Null space included  
Single step  Stepwise  Single step  Stepwise  
(a) Uncorrected, simple pvalues, .  
91.35
(90.04–92.50) 
91.35
(90.04–92.50) 
4.70
(3.86–5.72) 
4.70
(3.86–5.72) 

93.50
(92.33–94.50) 
60.40
(58.24–62.52) 
4.60
(3.77–5.61) 
0.25
(0.11–0.58) 

94.70
(93.63–95.60) 
26.70
(24.81–28.68) 
4.60
(3.77–5.61) 
0.00
(0.00–0.19) 

95.55
(94.56–96.37) 
7.25
(6.19–8.47) 
4.85
(3.99–5.88) 
0.00
(0.00–0.19) 

96.10
(95.16–96.86) 
1.45
(1.01–2.07) 
4.40
(3.59–5.39) 
0.00
(0.00–0.19) 

96.75
(95.88–97.44) 
0.25
(0.11–0.58) 
4.30
(3.50–5.28) 
0.00
(0.00–0.19) 

fwer  99.90
(99.64–99.97) 
91.35
(90.04–92.50) 
18.30
(16.67–20.05) 
4.70
(3.86–5.72) 
(b) Corrected, cumulative maximum, .  
91.35
(90.04–92.50) 
91.35
(90.04–92.50) 
4.70
(3.86–5.72) 
4.70
(3.86–5.72) 

90.35
(88.98–91.57) 
60.40
(58.24–62.52) 
3.40
(2.69–4.29) 
0.25
(0.11–0.58) 

89.80
(88.40–91.05) 
26.70
(24.81–28.68) 
2.75
(2.12–3.56) 
0.00
(0.00–0.19) 

89.55
(88.13–90.82) 
7.25
(6.19–8.47) 
2.40
(1.81–3.17) 
0.00
(0.00–0.19) 

89.30
(87.87–90.58) 
1.45
(1.01–2.07) 
1.75
(1.26–2.42) 
0.00
(0.00–0.19) 

88.85
(87.40–90.16) 
0.25
(0.11–0.58) 
1.45
(1.01–2.07) 
0.00
(0.00–0.19) 

fwer  91.35
(90.04–92.50) 
91.35
(90.04–92.50) 
4.70
(3.86–5.72) 
4.70
(3.86–5.72) 
(c) Corrected, distribution of the maximum, .  
91.35
(90.04–92.50) 
91.35
(90.04–92.50) 
4.70
(3.86–5.72) 
4.70
(3.86–5.72) 

7.95
(6.84–9.22) 
10.95
(9.66–12.39) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

fwer  91.35
(90.04–92.50) 
91.35
(90.04–92.50) 
4.70
(3.86–5.72) 
4.70
(3.86–5.72) 
Using the Roy’s statistic led to similar results as with Wilks (not shown). Dimensionality reduction with pca led to similar results for the case in which the null space is included (not shown). For the case in which the null space is not included, results are not comparable with the ones above because, after pca, in the simulations, such that there is no null space to be considered as the matrices with canonical coefficients in both sides are then square.
Even when the null space of the canonical coefficients is included, a single step procedure is never satisfactory. To understand this, consider the following consequence of the theory presented in Sections 2.2 and 2.5: for a valid, exact test in cca, the expected error rate for each , i.e., the per comparison error rate (pcer; Hochberg1987) is for , but for it is , since the null can only be rejected if the previous one has also been declared significant at . More generally, the pcer for a valid test is for the th test, i.e., for the th canonical correlation. If the test level is set at 5%, then the pcer is 5% for , 0.25% for , 0.0125% for , and so forth. Error rates above this expectation render the test invalid; below render it conservative. In the simulations, a single step procedure never led to an exact test, with or without consideration to multiple testing, as shown in the first two columns of Table 6.
4.2 Multiple testing
As with the pcer, it is worth mentioning what the expected fwer for a valid, exact test is. That expectation is the test level itself, i.e., . Any higher error rate renders a test invalid; lower error rate renders it conservative, even if still valid. Table 6 shows the fwer for the three different correction methods considered.
If the null space was not included, since the pcer was not controlled, the fwer could not be controlled either (first two columns). If the null space of the canonical coefficients was included (last two columns), even though the single step estimation controlled the pcer, the fwer was not controlled for the simple, uncorrected pvalues (third column, upper panel), which is not surprising. It should be emphasised, however, that these simple pvalues have another problem: they are not guaranteed to be monotonically related to the respective canonical correlations, such that it is possible that, using these pvalues, the null hypothesis could be rejected for some canonical correlation, but retained for another that happens to be larger than the former. The use of such uncorrected, simple pvalues, therefore, constitutes a test that is inadmissible. The problem with lack of monotonicity with uncorrected pvalues is less severe if estimation is done in a stepwise manner (fourth column, upper panel), but is nonetheless still present, as shown in Figure 2, and has potential to lead to an excess fwer, even though that did not occur in these simulations.
For the other two correction methods, when the null space of the canonical coefficients was included in the estimation process, fwer was controlled (third and fourth columns of Table 6, middle and lower panels), but there are particularities. Using the distribution of the maximum (lower panel) led to very conservative pcer, for both single step or stepwise estimation, whereas correction with closure led to invalid pcer for single step estimation (third column, middle panel).
The only configuration that led to exact (not conservative, nor invalid) control over pcer and fwer, and a monotonic relationship between canonical correlations and associated pvalues, is the one in which a stepwise estimation was performed, with the null space of the canonical coefficients included, and with correction using a closed testing procedure (fourth column, middle panel of Table 6). Moreover, the fwer, when controlled using the cumulative maximum or the distribution of the maximum statistic, is guaranteed to match the pcer for : in the former case, any further rejection of the null is conditional on the first one having been rejected; in the latter, the distribution of the maximum coincides with the distribution of the first as the canonical correlations are ranked from largest to smallest.
4.3 Nuisance variables
For partial cca, simple residualisation, even using the above procedure, resulted in the error rates being dramatically inflated, as shown in Table 7. The Huh–Jhun and the Theil residualisation methods, in contrast, resulted in the error rates being controlled at the nominal test level, with no excess false positives. For bipartial cca, the problem did not happen in the simulation settings: simple residualisation of both sides by entirely different sets of variables did not cause the error rates to be inflated; yet, using HuhJhun or Theil also produced nominal error rates, suggesting that these could be used in any configuration of nuisance variables, regardless of whether those in one side are not independent from those in the other.
Simple residuals  Huh–Jhun  Theil  
(a) Partial cca  
(fwer)  83.85
(82.17–85.40) 
5.15
(4.26–6.21) 
4.85
(3.99–5.88) 
44.15
(41.99–46.34) 
0.35
(0.17–0.72) 
0.35
(0.17–0.72) 

12.75
(11.36–14.28) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

1.75
(1.26–2.42) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

0.20
(0.08–0.51) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

(b) Bipartial cca  
(fwer)  5.55
(4.63–6.64) 
5.20
(4.31–6.26) 
4.45
(3.63–5.44) 
0.10
(0.03–0.36) 
0.30
(0.14–0.65) 
0.20
(0.08–0.51) 

0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 

0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
0.00
(0.00–0.19) 
Estimation included the null space of the canonical coefficients and a stepwise procedure, assessed using the Wilks’ statistic, and corrected using a closed testing procedure (ctp). The ctp guarantees that the familywise error rate (fwer) matches the pcer for the first canonical correlation (i.e., for ). Using the Roy’s statistic led to similar results as with Wilks’; likewise, dimensionality reduction with pca led to similar results (not shown).
4.4 Large samples
Increasing the sample size while keeping the number of variables fixed progressively reduced the amount of errors for the simple residualisation method to treat nuisance variables, as shown in Table 8; the reduction did not affect Huh–Jhun or Theil methods, for which error rates were already controled even with a relatively smaller sample compared to the number of variables.
Simple residuals  Huh–Jhun  Theil  
100  82.60
(79.03–85.67) 
4.80
(3.25–7.04) 
4.20
(2.76–6.34) 
200  28.00
(24.24–32.09) 
6.00
(4.23–8.44) 
4.80
(3.25–7.04) 
300  20.00
(16.73–23.73) 
5.60
(3.90–7.97) 
5.40
(3.74–7.74) 
400  12.80
(10.15–16.01) 
4.40
(2.92–6.57) 
4.20
(2.76–6.34) 
500  9.40
(7.14–12.28) 
5.20
(3.57–7.51) 
6.20
(4.40–8.67) 
600  9.20
(6.97–12.05) 
4.20
(2.76–6.34) 
3.20
(1.98–5.13) 
700  10.40
(8.02–13.38) 
4.20
(2.76–6.34) 
3.80
(2.45–5.86) 
800  8.60
(6.45–11.38) 
6.00
(4.23–8.44) 
5.20
(3.57–7.51) 
900  7.40
(5.42–10.03) 
5.00
(3.41–7.28) 
5.00
(3.41–7.28) 
1000  8.20
(6.10–10.94) 
4.20
(2.76–6.34) 
4.40
(2.92–6.57) 
Estimation included the null space of the canonical coefficients and a stepwise procedure, assessed using the Wilks’ statistic, and corrected using a closed testing procedure (ctp). The ctp guarantees that the familywise error rate (fwer) matches the pcer for the first canonical correlation (i.e., for ). Using the Roy’s statistic led to similar results as with Wilks’; likewise, dimensionality reduction with pca led to similar results (not shown). The confidence intervals are wider than for other tables because the number of realisations (and also of permutations) was smaller (Table 5)
4.5 Nonnormality
Without nuisance variables and with kurtotic data simulated using a Student’s distribution with a small number of degrees of freedom, , as well as with binary data simulated using a Bernoulli distribution with parameter , error rates were controlled nominally, as shown in Table 9. In partial cca, however, even using the Huh–Jhun method, highly kurtotic data led to excess error rates. In particular, for the simulated data using a Student’s distribution with degrees of freedom of only , the observed error rate was 14.7%, for a test level of 5%; using the Theil method led to also inflated error rate in this case, with 10.7% (95% confidence interval: 8.28–13.72, not shown in the table). For , error rates were controlled at the nominal level, for both Huh–Jhun (Table 9) and Theil (not shown).
Distribution  Without nuisance  Partial cca  
Normal  4.70
(3.86–5.72) 
5.15
(4.26–6.21) 

Student  3.95
(3.18–4.90) 
14.70
(13.22–16.32) 

5.45
(4.54–6.53) 
5.40
(4.49–6.48) 

4.15
(3.36–5.12) 
5.40
(4.49–6.48) 

4.70
(3.86–5.72) 
5.00
(4.13–6.04) 

3.85
(3.09–4.79) 
5.10
(4.22–6.15) 

Bernoulli  5.30
(4.40–6.37) 
5.30
(4.40–6.37) 
: Degrees of freedom of the Student’s distribution used to simulate data; : Parameter of the Bernoulli distribution used to simulate data. Estimation used the null space of the canonical coefficients and a stepwise procedure, assessed using the Wilks’ statistic, and corrected using a closed testing procedure (ctp). The ctp guarantees that the familywise error rate (fwer) matches the pcer for the first canonical correlation (i.e., for ). Using the Roy’s statistic led to similar results as with Wilks’; using Theil led to similar results as Huh–Jhun; likewise, dimensionality reduction with pca led to similar results (not shown).
4.6 Choice of the statistic
The results above, that consider solely the error rates, and are shown based on results found with the Wilks’ statistic (), are essentially the same for Roy’s largest root (). That is, results regarding the estimation strategies, multiple testing, nuisance variables, nonnormality, behaviour with large samples, and dimensionality reduction with pca, are virtually the same for Wilks’ and Roy’s statistics. In the presence of synthetic signal, however, the two test statistics diverged. Table 10 shows that, with signal spread ac
Comments
There are no comments yet.