# Permutation inference for Canonical Correlation Analysis

Canonical correlation analysis (CCA) has become a key tool for population neuroimaging for allowing investigation of association between many imaging and non-imaging variables. As age, sex and other variables are often a source of variability not of direct interest, previous work has used CCA on residuals from a model that removes these effects, then proceeded directly to permutation inference. We show that a simple permutation test, as typically used to identify significant modes of shared variation on such data adjusted for nuisance variables, produces inflated error rates. The reason is that residualisation introduces dependencies among the observations that violate the exchangeability assumption. Even in the absence of nuisance variables, however, a simple permutation test for CCA also leads to excess error rates for all canonical correlations other than the first. The reason is that a simple permutation scheme does not ignore the variability already explained by canonical variables of lower rank. Here we propose solutions for both problems: in the case of nuisance variables, we show that projecting the residuals to a lower dimensional space where exchangeability holds results in a valid permutation test; for more general cases, with or without nuisance variables, we propose estimating the canonical correlations in a stepwise manner, removing at each iteration the variance already explained. We also discuss how to address the multiplicity of tests via closure, which leads to an admissible test that is not conservative. We also provide a complete algorithm for permutation inference for CCA.

## Authors

• 1 publication
• 3 publications
• 2 publications
• 6 publications
09/17/2012

### Generalized Canonical Correlation Analysis for Disparate Data Fusion

Manifold matching works to identify embeddings of multiple disparate dat...
12/05/2019

### Another look at the Lady Tasting Tea and permutation-based randomization tests

Fisher's famous Lady Tasting Tea experiment is often referred to as the ...
04/28/2022

### Generalized permutation tests

Permutation tests are an immensely popular statistical tool, used for te...
11/23/2020

### Conditional canonical correlation estimation based on covariates with random forests

Investigating the relationships between two sets of variables helps to u...
05/12/2016

### Subspace Perspective on Canonical Correlation Analysis: Dimension Reduction and Minimax Rates

Canonical correlation analysis (CCA) is a fundamental statistical tool f...
06/10/2013

### Discriminative extended canonical correlation analysis for pattern set matching

In this paper we address the problem of matching sets of vectors embedde...
11/02/2021

### Regularization for Shuffled Data Problems via Exponential Family Priors on the Permutation Group

In the analysis of data sets consisting of (X, Y)-pairs, a tacit assumpt...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

1.20

Canonical correlation analysis (cca) (Jordan1875; Hotelling1936)

is a multivariate method that aims at reducing the correlation structure between two sets of variables to the simplest possible form (hence the name “canonical”) through linear transformations of the variables within each set. Put simply, given two sets of variables, the method seeks linear mixtures within each set, such that each resulting mixture from one set is maximally correlated with a corresponding mixture from the other set, but uncorrelated with all other mixtures in either set.

From a peak use through from the late 1970’s until mid-1980’s, the method has recently regained popularity, presumably thanks to its ability to uncover latent, common factors underlying association between multiple measurements obtained, something relevant in recent research that uses high dimensional phenotyping and investigates between-subject variability across multiple domains. This is in contrast to initial studies that introduced cca to the imaging field (Friston1995; Friston1996; Worsley1997; Friman2001; Friman2002; Friman2003) for investigation of signal variation in functional magnetic resonance imaging (fmri) time series. For example, Smith2015 used cca to identify underlying factors associating brain connectivity features to various demographic, psychometric, and lifestyle measures; Drysdale2017 used cca to investigate associations between brain connectivity and clinical assessments, and found two canonical variables that would allow classification of participants into distinct categories (but see Dinga2019); Bijsterbosch2018 and Xia2018 likewise used cca to identify associations between functional connectivity and various indices of behaviour and psychopathology, whereas Alnaes2019 used cca to investigate the association between imaging measurements and cognitive, behavioral, psychosocial and socioeconomic indices; Li2019 used cca to investigate, among subjects, the topography of the global fmri signal and its relationship with a number of cognitive and behavioral measurements; Clemens2020 used a combination of pattern classification algorithms and cca to study imaging and behavioral correlates of the subjective perception of oneself belonging to a particular gender. All these between-subject, group level studies used some form of permutation to assess the significance of the results, and most of them regressed out putative nuisance variables or confounds from the data before proceeding to inference.

Permutation tests are well known and widely used. Among their many benefits, these tests lead to valid inferences while requiring assumptions that are commonly satisfied in between-subject analyses, such that of exchangeability of observations under the null hypothesis. We could not find, however, studies that explicitly verified whether the performance of permutation tests for

cca would be valid. Investigating such validity, we found that trivial implementations of permutation inference for cca

are inadequate on four different grounds. First, simple, uncorrected permutation p-values are not guaranteed to be monotonically related to the canonical correlations, leading to inadmissible results; for the same reason, multiple testing correction using false discovery rate is also inadmissible. Second, except for the highest canonical correlation, a simple one-step estimation of all others without considering the variability already explained by previous canonical variables in relation to each of them also leads to inflated per comparison error rates and thus, invalid results. Third, regressing out nuisance variables without consideration to the introduction of dependencies among observations caused by residualisation lead to an invalid test, with excess false positives. Fourth, multiple testing correction using the distribution of the maximum test statistic leads to conservative results, except for the highest canonical correlation.

In this paper we explain and discuss in detail each of these problems, and offer solutions that address each of them. In particular, we propose a novel stepwise estimation method for the canonical correlations and canonical variables that remains valid even when the number of variables is not the same for both sides of cca. We propose a method that transforms residualised data to a lower dimensional basis where exchangeability — as required for the validity of permutation tests — holds. We also propose that inference that considers multiple canonical correlations should use a closed testing procedure that is more powerful than the usual correction method used in permutation tests that use the distribution of the maximum statistic; the procedure also ensures a monotonic relationship between p-values and canonical correlations. Finally, we provide a complete, general algorithm for valid inferences for cca.

## 2 Theory

### 2.1 Notation and general aspects

Thorough definition and derivation of cca

is available in many classical textbooks on multivariate analysis

(e.g., Kendall1975; Mardia1979; Brillinger1981; Muirhead1982; Seber1984; Krzanowski1988; Anderson2003); the reader is referred to these for a comprehensive overview. Here we present concisely and only allude to the distinction between population () and sample () canonical correlations where strictly needed. Let and be each one a collection of, respectively, and variables observed from subjects, . Without loss of generality, assume that , that the columns of and are mean-centered, that these matrices are of full rank, and define , , and . The goal of cca is to identify canonical coefficients or canonical weights and , , such that the pairs of canonical variables, defined as:

 [u1,…,uK]=UN×K=YA[v1,…,vK]=VN×K=YB (1)

have correlations that are maximal, under the constraint that . Estimation of and amounts to finding the solutions to:

 [−rkΣYYΣYXΣXY−rkΣXX]⋅[akbk]=0 (2)

where the unknowns are , and ; are the sample canonical correlations, i.e., the correlations between the estimated canonical variables and . The coefficients

are eigenvectors of

, whereas are eigenvectors of

; the respective eigenvalues (for either

or , as these eigenvalues are the same) are . For convenience, we call canonical component the ensemble formed by the -th canonical correlation, its corresponding pair of canonical variables, and associated pair of canonical coefficients; canonical variables may also be termed modes of variation.

The typical method for estimation involves an iterative procedure that finds one and at a time, with computed as a function of these. However, the method proposed by Bjorck1973 is more concise and numerically more stable; it is described in the Appendix (Algorithm 3); the canonical correlations are then produced in descending order, . This positiveness of all canonical correlations is a consequence of these values being explicitly maximised during estimation. Reversal of the sign of the coefficients can always be accompanied by the reversal of the sign of the corresponding coefficients in the other side, to no net effect on . That is, the signs of the canonical variables and coefficients are indeterminate, and any solution is arbitrary; nothing can be concluded about the specific direction of effects with cca.

### 2.2 Parametric inference

The distribution of the sample canonical correlations is intractable, even under assumptions of normality and independence among subjects, and is a function of the population correlations (Constantine1963; James1964). This difficulty led to the development of a rich asymptotic theory (Fisher1939; Hsu1941; Lawley1959; Fujikosh1977; Glynn1978). However, these approximations have been shown to be extremely sensitive to departures from normality, or require additional terms that further complicate their use (Muirhead1980); Brillinger1981 recommended resampling methods to estimate parameters used by normal approximations, which otherwise can be biased (Anderson2003). These difficulties pose challenges for inference. Even though some computationally efficient algorithms have been proposed (Koev2006), these tests continue to be rarely used.

Instead, a test based on whether a certain number of correlations are equal to zero has been proposed. The null hypothesis is , i.e., , for , that is, the null is that population canonical correlations (the smaller ones) are zero (Bartlett1938; Bartlett1947; Marriott1952; Lawley1959; Fujikoshi1974), versus the alternative that at least one is not, i.e., . The test is based on the statistic proposed by Wilks1935, as:

 λk=−(N−C−P+Q+32)ln(K∏i=k(1−r2i)) (3)

where the constant if there are no nuisance variables (Section 2.6). Under the null hypothesis, follows an approximate

distribution with degrees of freedom

if each of the columns of and

have values that are independent and identically distributed following a normal distribution

(but see Glynn1978, for a different expression)

. Unfortunately, this test is sensitive to departures from normality, particularly in the presence of outliers, and its use has been questioned

(Seber1984; Harris1976).

Another test statistic is based on Roy1953111Roy1953 proposed two distinct but related test statistics; these are both known as “Roy’s largest root”. Here we use the one that is interpreted as a coefficient of determination, and not the other that is interpreted as an -statistic. See Kuhfeld1986 for a complete discussion., and is simply:

 θk=r2k (4)

The critical values for the corresponding parametric distribution at a given test level can be found in the charts provided by Heck1960, using as parameters , , and (Lee1978), where the constant if there are no nuisance variables (Section 2.6), or in tables provided by Kres1975; more recent approximations for normally distributed data can be found in Chiani2016 and Johnstone2017. Some approximations, however, are considered conservative (Harris1976; Harris2013), even when assumptions are met. Note that, while Roy1953 proposed the use of the largest value as test statistic, which would then be , any given null at position must have already considered the previous canonical components, from until , such that the maximum statistic, after the previous canonical correlations have been removed from the model, is always the current one. A similar reasoning holds for the smallest canonical correlations in the test proposed by Wilks1935. This feature is exploited in the stepwise rejective procedure proposed in Section 2.5.

### 2.3 Permutation inference

The above problems can be eschewed with the use of resampling methods, such as permutation. An intuitive (but inadequate) permutation test for cca could be constructed by randomly permuting the rows of or . For each shuffling of the data, indicated by , a new set of canonical correlations and associated statistics would be computed. A p-value would be obtained as , where is the indicator (Kronecker) function, which evaluates as 1 if the condition inside the brackets is true, or 0 otherwise, and the index corresponds to the unpermuted data (i.e., no permutation, with the data in their original ordering).

Such a simple procedure, however, would ignore the fact that, this resampling scheme matches the first null hypothesis , but not the subsequent ones. For a given canonical correlation at position being tested, , one must generate a permutation that satisfy the corresponding null , but not necessarily . Otherwise, latent effects represented by the corresponding earlier canonical variables and would, in the procedure above, remain in the and at the time these are permuted. However, the variance associated with these earlier canonical variables would have already been explained through the rejection of their respective null hypothesis up to . This variance is now a nuisance for the positions from (inclusive) onward. It contains information that are not pertinent to position and subsequent ones, which therefore should not be used to build the null distribution, i.e., variance should not be re-used in the shufflings that lead to a given or subsequent correlations.

Fortunately, cca is invariant to linear transformations of that mix the variables in or in . Since the canonical variables are themselves linear transformations of these input variables (Equation 1), a second cca using and in place of the initial and lead to the same solutions. Yet, unless , will not span the same space as . In principle, this would be inconsequential as far as the canonical variables are concerned. However, ignoring the variability in not contained in would again affect the p-values should and be used in a permutation test, as the permuted data would not be representative of the original (unpermuted) that led to these initial canonical variables. To mitigate the problem, include into the matrix of canonical coefficients their orthogonal complement, i.e., compute , then use instead of as a replacement for . In this paper we adopted the convention that , but the same procedure works in reverse and, algorithmically, it might as well be simpler to compute also a and use it instead of as a replacement for . If , then .

While these transformations do not change in any way the canonical components, they allow the construction of an improved algorithm that addresses the issue of variability already explained by canonical variables of lower rank (i.e., the ones with order indices smaller than that of a given one). It consists of running an initial cca using and to obtain and , then subject these to a second cca and permutation testing while, crucially, at each permutation, iteratively repeating cca times, each using not the whole and , but only from the -th component onwards, i.e., and for the test about the -th canonical correlation. Of note, a specific test level need not be specified at the time in which the above iterative (stepwise) procedure is performed; instead, and in combination with the multiple testing procedure described below, adjusted p-values are are computed, which then are used to accept or reject the null for the -th correlation. Algorithm 1 (Section 2.8) shows the procedure in detail (the algorithm contains other details that are discussed in the next sections).

A number of further aspects need be considered in permutation tests: the number of possible reorderings of the data, the need for permutations that break the association between the variables being tested, the random selection of permutations from the permutation set when not all possible permutations can be used, the choice of the test statistic, how to correct for the multiplicity of tests, the number of permutations to allow narrow confidence intervals around p-values, among others. Most of these topics have been discussed in

Winkler2014 and references therein and will not be repeated here. However, for cca, some aspects deserve special treatment and are considered below.

### 2.4 Choice of the statistic

Asymptotically, using Wilks’ statistic or Roy’s are expected to lead to the same conclusion since all correlations are sorted in descending order: if , then all subsequent ones must be zero; likewise, if , then clearly at least one correlation between and is larger than zero, which has to include itself. Moreover, permutation under the null is justifiable in the complete absence of association between the two sets, which implies that, under the null , all correlations are equal to zero. With finite data, however, one statistic can be more powerful than the other in different settings. Their relative performance is studied in Sections 3 and 4.

Computationally, Wilks’ requires more operations to be performed compared to Roy’s statistic. Since the relationship between is is monotonic, the two are permutationally equivalent, using alone is sufficient, which makes Roy’s the absolute fastest. However, even for Wilks’, the amount of computation required is negligible compared to the overall number of operations needed for estimation of the canonical coefficients, such that in practice, the choice between the two should be in terms of power.

In either case, while inference refers to the respective null hypothesis at position , it is not to be understood as inference on the index . Rather, the null is merely conditional on the nulls for all previous correlations from to having been rejected. Rejecting the null implies that the correlation observed at position is unlikely high if there is no association between the two variable sets after all previous (from to ) canonical variables have been sequentially removed, as described in Section 2.3. In Algorithm 1 (Section 2.8) this is done in line 29, that uses as inputs to cca the precomputed canonical variables only from position onwards, as opposed to all of them.

### 2.5 Multiplicity of tests

For either of these two test statistics, the ordering of the canonical correlations from larger (farther from zero) to smaller (closer to zero) imply that rejection of the null hypothesis at each must happen sequentially, starting from , using the respective test statistic and associated p-value until the the null , for some , is not rejected at a predefined test level . Then, at that position , the procedure stops, and the null is retained from that position (inclusive) onward until the final index .

The ordering of the canonical correlations brings additional consequences. First, because rejection of implies rejection of all joint (intersection) hypotheses that include , that is , such sequentially rejective procedure is also a closed testing procedure (ctp), which controls the amount of any type i error across all tests, i.e., the familywise error rate (fwer) in the strong sense (Marcus1976; Hochberg1987). Thus, there is no need for further correction for multiple testing. Another way of stating the same is that the test for a given , , has been “protected” by the test at the position at the level . Adjusted p-values (in the fwer sense) can be computed as , that is, the fwer-adjusted p-value for the canonical component is the cumulative maximum p-value up to position . Such adjusted p-values can be considered significant if their value is below the desired test level .

The second consequence is that fwer adjustment of p-values using the distribution of the maximum statistic (not to be confused with the cumulative maximum described in the above paragraph) will be conservative for all canonical components except the first. The reason is that the maximum statistic is always the distribution of the first, which is stochastically dominant over all others. The distribution of the maximum is usually sought as a shortcut to a ctp when the condition of subset pivotality holds (Westfall1993), as that reduces the computational burden from tests to only tests. Interestingly, the ordering of the canonical correlations from largest to smallest leads to a ctp that does not use the distribution of the maximum, and yet requires only tests, regardless of whether subset pivotality holds.

A third consequence is that using permutation p-values outside the above sequentially rejective procedure that controls fwer is not appropriate since these simple, uncorrected p-values are not guaranteed to be monotonically related to the canonical correlations . Attempts to use these uncorrected p-values outside a ctp would lead to paradoxical results whereby higher, stronger canonical correlations could not be considered significant, yet later ones, smaller, weaker, could be so; that is, it would make the test inadmissible (Lehmann2005, p. 232). For the same reason, such simpl p-values should not be subjected to correction using false discovery rate (fdr; Benjamini1995), because the ordering of p-values for fdr, from smallest to largest, is not guaranteed to match the ordering of the canonical correlations, leading similarly to an inadmissible test and nonsensical results.

### 2.6 Nuisance variables

Few authors discussed nuisance variables or confounds in canonical correlation analysis, e.g., Roy1957; Rao1969; Timm1976; Lee1978; Sato2010. Let the be a matrix of nuisance variables, including an intercept. Partial cca consists of regressing out from both and , then subjecting the residuals to the usual cca, under the assumption (or knowledge) that is a confound to both and . This is distinct from part cca, which consists of regressing out from either or , under the assumption that affects only one of the two sets, then proceeding with the usual cca using the respective residuals in the place of the original set. Finally, one can define bipartial cca, where is regressed out from , while another set of variables , of size , is regressed out from ; as in the previous cases, the respective residuals are used in an otherwise usual cca (Table 1). In the parametric case, inference can proceed using the distribution of or (Equations 3 and 4) using for part, and for bipartial cca (Timm1976; Lee1978).

Permutation inference, however, requires further considerations, otherwise, as shown in Section 4, results will be invalid. Consider first the case without nuisance variables. Let be the horizontal concatenation of the two sets of variables whose association is being investigated. Both and occupy an -dimensional space and, so does, therefore, . A random permutation of the rows of either of the two sets of variables will not affect their dimensionalities. For example continues to occupy the same -dimensional space as .

However, residualisation changes this scenario. Let be the residual forming matrix associated with the nuisance variables , with the symbol representing the Moore–Penrose pseudo-inverse. has the following interesting properties: (symmetry) and (idempotency), which will be exploited later. In partial cca, can be regressed out from and by computing and . Let be the concatenation of the residualised sets and with respect to . While occupies an -dimensional space, occupies a smaller one; its dimensions are, at most, of a size given by the rank of , which is assuming and are of full rank. The same holds for and and, therefore, for and .

Permutation affects these relations: while still occupies a space of dimensions as the unpermuted , , differently than , may now occupy a space with dimensions anywhere between and , depending on a given random permutation. With fewer effective observations determined by this lower space after residualisation, and the same number of variables, the canonical correlations in the unpermuted case are stochastically larger than in the permuted, which in turn leads to an excess of spuriously small p-values. For not occupying the same space as the original, the permuted data are no longer a similar realisation of the unpermuted, thus violating exchangeability, and specifically causing the distribution of the test statistics to be unduly shifted to the left.

Here the following solution is proposed: using the results from Huh2001, let be a semi-orthogonal basis (Abadir2005, p. 84) for the column space of constructed via, e.g., spectral or Schur decomposition, such that , , where , and . Then cca on and lead to the same solutions as on and . The reason is that, from Section 2.1, , which is the same as , since, as discussed earlier, is symmetric and idempotent. In a similar manner, , and likewise, . While pre-multiplication by does not affect the cca results, it changes the dependence structure among the rows of the data: occupies a -dimensional space, and so does , for a permutation matrix of size , such that exchangeability holds, thus allowing a valid permutation test to be conducted successfully.

The treatment of partial cca, as described above, can be seen as a particular case of bipartial cca in which , that is, the set of nuisance variables in both sides is the same. Of course, for bipartial cca proper, this equality not necessarily holds, and the two sets may be different in different ways: may be entirely orthogonal to , or some or all variables from one set may be fully represented in the other, either directly (e.g., some of the variables present in both sets), or as linear combinations of one set in the other, or it may be that these two sets are simply not orthogonal. The direct strategy of computing

and its respective semi-orthogonal matrix

lead to difficulties because, unless , the products and will not have the same number of rows: has , whereas has rows, thus preventing the computation of cca.

A more general solution, that accommodates bipartial and, therefore, is a generalisation for all cases of nuisance variables in cca, consists of randomly permuting rows of and/or using, respectively, permutation matrices and of respective sizes and , therefore permuting in the lower dimensional space where and are exchangeable, then, crucially, reestablishing the original number of rows using the property that the transpose of a semi-orthogonal matrix is the same as its inverse (), to only then perform cca. Therefore cca is computed using and . Left and right sides will continue to have rank and respectively, will have already been permuted, and will both have rows. The procedure is fully symmetric in that, when the permutation matrices and are both identity matrices (of sizes and , respectively), which is equivalent to no permutation, the expression for both sides reduce to the residualised data and , as desired. The concatenation has the same rank as that of , thus addressing the above problem of the unpermuted test statistic having a different and stochastically dominant distribution over that of the permuted data. Table 2 summarises the proposed solution for all cases, including part cca.

### 2.7 Restricted exchangeability

The above method uses the Huh–Jhun semi-orthogonal matrix applied to cca and leads to a valid permutation test provided that there are no dependencies among the rows of . That is, the method takes into account dependencies introduced by the regression of and/or out from and/or , but not dependencies that might already exist in the data, and which generally preclude direct use of permutation tests. However, structured dependencies, such as those that may exist, for instance, in studies that involve repeated measurements, or for those in which participants do not constitute independent observations, e.g., sib-pairs, as in the Human Connectome Project (hcp; VanEssen2012), can be treated by allowing only those permutations that respect such dependency structure (Winkler2015). Unfortunately, the Huh–Jhun semi-orthogonal matrix does not respect such structure, blurring information from observations across blocks, and preventing the definition of a meaningful mapping from the original observations that define the block structure to the or observations that are ultimately permuted.222There is an exception: if has a block diagonal structure and the observations encompassed by such blocks coincide with the exchangeability blocks, then an algorithm that uses Huh–Jhun and block permutation can be constructed.

Such mapping, whereby each one of the and rows of, respectively, and corresponds uniquely to one of the rows of the original data and , can be obtained using a different method, due to Theil1965; Theil1968, and reviewed in detail by Magnus2005. Consider first the case of . In the Theil method, that here is adapted for cca, , where the exponent represents the positive definite matrix square root, and is a selection matrix,

, that is, an identity matrix from which some

rows have been removed. Pre-multiplication of any matrix by a selection matrix deletes specific rows, i.e., the ones that correspond to columns that are all zero in the selection matrix (Figure 1). The thus computed are the best linear unbiased residuals with scalar covariance (blus

), in that they are unbiased estimates of

, where are the (unknown) true errors after the nuisance effects of have been removed from ; for cca using observed data, contains the variance of interest, which may be shared among linear combinations of variables in both sides, and is therefore what is subjected to cca and statistical testing. For partial cca, is the same for both sides; for bipartial cca, similar computations hold for the other side, i.e., . Table 3 summarises the two methods.

To construct a permutation procedure for cca that respects the block structure, the Theil method can be used to compute instead of using the Huh–Jhun approach. Choose observations to be removed from both sides (for partial cca, since ). Construct the selection matrix of size , define the exchangeability blocks based on observations, compute and using the same for both (for part cca, use the same strategy as for bipartial, replacing for ), residualise (in the blus sense) the input variables by computing and . These have the same number of rows, and the dependencies among these rows is the same for both sides; thus, only one side needs be subjected to random permutations that respect such existing dependencies. Optionally, after permutation, the number of observations may be reestablished by pre-multiplication by and . Finally, cca is performed, with observation to the aspects discussed in Sections 2.3 and 2.5. A detailed algorithm is presented in Section 2.8.

It remains to be decided how to select the observations to be dropped. In principle, any set could be considered for removal, provided that and continue to be of full rank. Some informed choices, however, could be more powerful than others. One of the main conclusions from Winkler2015 is that the complexity of the dependence structure and the ensuing restrictions on exchangeability lead to reductions in power. Thus, natural candidates for removal are observations that, once removed, cause the overall dependence structure to be simpler. For example, it is sometimes the case that some observations are so uniquely related to all others that there are no other observations like them in the sample. These observations, therefore, cannot be permuted with any other, or perhaps with only a few. Their contribution to hypothesis testing in the permutation case is minimal, and their removal are less likely to affect a decision on rejection of the null hypothesis. Consider for example a design that has many monozygotic, dizygotic, and non-twin pairs of subjects, and that in the sample, there happens to be a single pair of half-siblings. It is well known that, for heritable traits, genetic resemblance depends on the kinship among individuals; half-siblings are expected to have a different degree of statistical dependency among each other compared to each one of the other types of sibships in this sample. Thus, in there being just one such pair, it would be reasonable to prioritise it for exclusion, while keeping others.

### 2.8 General algorithm

A set of steps for permutation inference for cca is described in Algorithm 1. In it, input variables and will have been mean-centered before the algorithm begins, or an intercept will have been included as nuisance variable in both and . is a set containing pairs of permutation matrices indexed by . In this set, the first permutation is always “no permutation”, i.e., , such that , for all . For the cases in which only one side of cca needs be permuted (Table 2), or for the cases in which , or when there are dependencies among the data such that the Theil method is used to construct (Table 3), then can be set as for all . Details on how is defined in observance to the null hypothesis and respecting structured dependencies among the data have been discussed in Winkler2014; Winkler2015. In the algorithm, can be larger, equal, or smaller than . Optional input arguments are the matrices with nuisance variables and , and the selection matrix . If is supplied but not , then the algorithm performs part or partial cca, depending on the Boolean argument partial; if both and are supplied, the algorithm performs bipartial cca; if neither is supplied, then “full” cca is performed. If is supplied, then the blus residuals based on Theil are used; otherwise, Huh–Jhun residuals are used. For either of these two cases, the semi-orthogonal matrix is computed using a separate, ancillary function named “semiortho”, described in the Appendix.

An initial cca using residualised data is done in line 19; this uses another ancillary function, named “cca”, and also described in the Appendix; this function returns three results: the canonical coefficients and , and the canonical correlations . The canonical coefficients are used to compute the canonical variables and , augmented by their orthogonal complement needed to ensure that they span the same space as the variables subjected to this initial cca; the canonical correlations are ignored at this point and not stored (hence the placeholder “_”). A counter for each canonical component is initialised as 0.

The core part of the algorithm are the two loops that run over the permutations in and the canonical components (between lines 25 and 35). At each permutation , cca is executed times. In each, the columns of and that precede preceding the current are removed, such that their respective variances are not allowed to influence the canonical correlations at position . At each permutation, the canonical correlations are obtained (the third output from the function “cca”) and used to compute the associated test statistic. As shown, Wilks’ statistic, , is used, simplified by the removal of the constant term, which does not affect permutation p-values. For numerical stability, sum of logs is favoured over the log of a product (compare line 30 with Equation 3). For inference using Roy’s statistic, replace the condition for in line 31; this modification alone is sufficient as is permutationally equivalent to . In that case, computations indicated in line 30 are no longer needed and can be removed to save computational time.

Whenever the statistic for the correlation at position in a given permutation is higher or equal than that for the unpermuted data, the counter is incremented (line 32). After the loop, the counter is converted into a p-value for each . These simple, uncorrected p-values, however, are not useful. Instead, fwer-adjusted p-values are computed under closure using the cumulative maximum, i.e., the p-value for is the largest (least significant) uncorrected p-value up to position . The algorithm returns then these adjusted p-values, which can be compared to a predefined test level to establish significance. Note that itself is never used in the algorithm.

As presented, the algorithm does not cover dimensionality reduction or any penalty to enforce sparse solutions for cca

. Dimensionality reduction using methods such as principal component analysis (

pcaica) with selection of components, if included, would be performed after residualisation, but before cca. Thus, in the algorithm, pca or ica, if executed, would be done between lines 18 and 19. As for the many forms of sparse or penalised cca (Nielsen2002; Waaijenborg2007; Parkhomenko2009; Witten2009; Hardoon2011; Gao2017; Ma2018; Tan2018), these can be incorporated into the algorithm through the replacement of the classical cca in lines 19 and 29 for one of these methods.

## 3 Evaluation Methods

In this section we describe the synthetic data and methods used to investigate error rates and power under the different choices for the various aspects presented in Section 2 at each stage of a permutation test for cca, providing empirical evidence for the approach proposed. An overview of these aspects and choices at each stage is shown in Table 4

. For each case, we use a series of simulation scenarios: each consists of a set of synthetic variables constructed using random values drawn from a normal or a non-normal (kurtotic or binary) probability distribution, sometimes with or without dimensionality reduction using principal components analysis

(pca; Hotelling1933; Jolliffe2002), sometimes with or without signal, and sometimes with or without nuisance variables. We also consider cases with large sample sizes and large number of variables. An overview of these scenarios (there are twenty of them) is in Table 5.

We start by investigating aspects related to the estimation of the canonical components at each permutation. Specifically, we consider (a) a one-step estimation of all canonical components, from to , versus (b) sequential estimation that removes, for the -th canonical component in a given permutation, the variance already explained by the previous ones, as described in Section 2.3. With respect to the inclusion of the complement of the canonical coefficients, we consider (a) without the inclusion of the null space of the canonical coefficients, versus (b) with its inclusion so as to ensure that all variance from the original data not explained in the initial cca is considered in the estimation at every permutation, as described in Section 2.3. With respect to multiple testing, and consider the following strategies: (a) simple, uncorrected p-values, , (b) corrected under closure, , and (c) corrected using the distribution of the maximum statistic ; both and offer fwer control, as discussed in Section 2.5. Keeping the same notation, we define scenarios ivi consisting of observations, with variables on the left side () of cca and variables on the right side () (the procedure is symmetric; the choice of sides is arbitrary and do not affect results); for these six scenarios, data are drawn from one of three possible distributions: a normal distribution with zero mean and unit variance, a Student’s distribution with variable degrees of freedom

(kurtotic), or a Bernoulli distribution with parameter

(binary). Analyses with and without dimensionality reduction to 10 variables using pca are considered. The number of permutations used to compute p-values was set as , with realisations (repetitions), thus allowing the computation of error rates.

We then turn our attention to aspects related to nuisance and residualisation discussed in Sections 2.6 and 2.7. We consider (a) simple residualisation, (b) residualisation using the Huh–Jhun method, and (c) residualisation using the Theil method. For this purpose, scenarios viixvii are constructed similarly as ivi, except that a third set of variables is used as nuisance for partial cca, whereas two other scenarios, xiii and xiv, a fourth set of variables of variables is used as nuisance for bipartial cca.

The impact of ignoring, in samples substantially larger than the number of variables, the dependencies introduced by the residualisation of both sides of cca is studied with scenarios xv and xvi, that consider samples progressively larger, , while keeping the other parameters similar as in scenarios vii and viii. Finally, we briefly investigate power and the choice of the test statistic: we consider (a) Wilks’ statistic (), as well as (b) Roy’s largest root (), as discussed in Sections 2.2 and 2.4. We define scenarios xvii and xviii similarly as i, this time including a strong, true signal in one canonical component, thus named “sparse”, or multiple, weaker signals shared across multiple (half of the smaller set, thus, “dense”). For all scenarios, an intercept is always included as nuisance variable in both sides such that the actual number of nuisance variables is and for each side, respectively. To report confidence intervals (95%), the Wilson1927 method is used.

## 4 Results

In the results below, the Sections 4.1 and 4.2 establish empirically that with an estimation method (i) that includes the null space of the canonical coefficients, (ii) that finds the canonical correlations in an iterative manner, and (iii) that after computing p-values through a closed testing procedure, the error rates are controlled. The subsequent results, from Section 4.3 onwards, consider then only this valid approach.

### 4.1 Estimation strategies

Not including the complement of the canonical coefficients (null space) caused error rates to be dramatically inflated, well above the expected test level (5%), regardless of whether the estimation used the single step or the stepwise procedure, and regardless of any of the multiple testing correction strategies discussed; these results are shown in Table 6.

Even when the null space of the canonical coefficients is included, a single step procedure is never satisfactory. To understand this, consider the following consequence of the theory presented in Sections 2.2 and 2.5: for a valid, exact test in cca, the expected error rate for each , i.e., the per comparison error rate (pcer; Hochberg1987) is for , but for it is , since the null can only be rejected if the previous one has also been declared significant at . More generally, the pcer for a valid test is for the -th test, i.e., for the -th canonical correlation. If the test level is set at 5%, then the pcer is 5% for , 0.25% for , 0.0125% for , and so forth. Error rates above this expectation render the test invalid; below render it conservative. In the simulations, a single step procedure never led to an exact test, with or without consideration to multiple testing, as shown in the first two columns of Table 6.

### 4.2 Multiple testing

As with the pcer, it is worth mentioning what the expected fwer for a valid, exact test is. That expectation is the test level itself, i.e., . Any higher error rate renders a test invalid; lower error rate renders it conservative, even if still valid. Table 6 shows the fwer for the three different correction methods considered.

If the null space was not included, since the pcer was not controlled, the fwer could not be controlled either (first two columns). If the null space of the canonical coefficients was included (last two columns), even though the single step estimation controlled the pcer, the fwer was not controlled for the simple, uncorrected p-values (third column, upper panel), which is not surprising. It should be emphasised, however, that these simple p-values have another problem: they are not guaranteed to be monotonically related to the respective canonical correlations, such that it is possible that, using these p-values, the null hypothesis could be rejected for some canonical correlation, but retained for another that happens to be larger than the former. The use of such uncorrected, simple p-values, therefore, constitutes a test that is inadmissible. The problem with lack of monotonicity with uncorrected p-values is less severe if estimation is done in a stepwise manner (fourth column, upper panel), but is nonetheless still present, as shown in Figure 2, and has potential to lead to an excess fwer, even though that did not occur in these simulations.

For the other two correction methods, when the null space of the canonical coefficients was included in the estimation process, fwer was controlled (third and fourth columns of Table 6, middle and lower panels), but there are particularities. Using the distribution of the maximum (lower panel) led to very conservative pcer, for both single step or stepwise estimation, whereas correction with closure led to invalid pcer for single step estimation (third column, middle panel).

The only configuration that led to exact (not conservative, nor invalid) control over pcer and fwer, and a monotonic relationship between canonical correlations and associated p-values, is the one in which a stepwise estimation was performed, with the null space of the canonical coefficients included, and with correction using a closed testing procedure (fourth column, middle panel of Table 6). Moreover, the fwer, when controlled using the cumulative maximum or the distribution of the maximum statistic, is guaranteed to match the pcer for : in the former case, any further rejection of the null is conditional on the first one having been rejected; in the latter, the distribution of the maximum coincides with the distribution of the first as the canonical correlations are ranked from largest to smallest.

### 4.3 Nuisance variables

For partial cca, simple residualisation, even using the above procedure, resulted in the error rates being dramatically inflated, as shown in Table 7. The Huh–Jhun and the Theil residualisation methods, in contrast, resulted in the error rates being controlled at the nominal test level, with no excess false positives. For bipartial cca, the problem did not happen in the simulation settings: simple residualisation of both sides by entirely different sets of variables did not cause the error rates to be inflated; yet, using Huh-Jhun or Theil also produced nominal error rates, suggesting that these could be used in any configuration of nuisance variables, regardless of whether those in one side are not independent from those in the other.

### 4.4 Large samples

Increasing the sample size while keeping the number of variables fixed progressively reduced the amount of errors for the simple residualisation method to treat nuisance variables, as shown in Table 8; the reduction did not affect Huh–Jhun or Theil methods, for which error rates were already controled even with a relatively smaller sample compared to the number of variables.

### 4.5 Non-normality

Without nuisance variables and with kurtotic data simulated using a Student’s distribution with a small number of degrees of freedom, , as well as with binary data simulated using a Bernoulli distribution with parameter , error rates were controlled nominally, as shown in Table 9. In partial cca, however, even using the Huh–Jhun method, highly kurtotic data led to excess error rates. In particular, for the simulated data using a Student’s distribution with degrees of freedom of only , the observed error rate was 14.7%, for a test level of 5%; using the Theil method led to also inflated error rate in this case, with 10.7% (95% confidence interval: 8.28–13.72, not shown in the table). For , error rates were controlled at the nominal level, for both Huh–Jhun (Table 9) and Theil (not shown).

### 4.6 Choice of the statistic

The results above, that consider solely the error rates, and are shown based on results found with the Wilks’ statistic (), are essentially the same for Roy’s largest root (). That is, results regarding the estimation strategies, multiple testing, nuisance variables, non-normality, behaviour with large samples, and dimensionality reduction with pca, are virtually the same for Wilks’ and Roy’s statistics. In the presence of synthetic signal, however, the two test statistics diverged. Table 10 shows that, with signal spread ac