## 1 Introduction

When a process can be described by two sets of variables corresponding to two different aspects, or views, analysing the relations between these two views may improve the understanding of the underlying system. In this context, a relation is a mapping of the observations corresponding to a variable of one view to the observations corresponding to a variable of the other view. For example, in the field of medicine, one view could comprise variables corresponding to the symptoms of a disease and the other to the risk factors that can have an effect on the disease incidence. Identifying the relations between the symptoms and the risk factors can improve the understanding of the disease exposure and give indications for prevention and treatment. Examples of this kind of two-view setting, where the analysis of the relations could provide new information about the functioning of the system, occur in several other fields of science. These relations can be determined by means of canonical correlation methods that have been developed specifically for this purpose.

Since the proposition of canonical correlation analysis (CCA) by H. Hotelling [Hotelling (1935), Hotelling (1936)], relations between variables have been explored in various fields of science. CCA was first applied to examine the relation of wheat characteristics to flour characteristics in an economics study by F. Waugh in 1942 [Waugh (1942)]. Since then, studies in the fields of psychology [Hopkins (1969), Dunham and Kravetz (1975)], geography [Monmonier and Finn (1973)], medicine [Lindsey et al. (1985)], physics [Wong et al. (1980)], chemistry [Tu et al. (1989)], biology [Sullivan (1982)], time-series modeling [Heij and Roorda (1991)], and signal processing [Schell and Gardner (1995)] constitute examples of the early application fields of CCA.

In the beginning of the 21st century, the applicability of CCA has been demonstrated in modern fields of science such as neuroscience, machine learning, and bioinformatics. Relations have been explored for developing brain-computer interfaces [Cao et al. (2015), Nakanishi et al. (2015)] and in the field of imaging genetics [Fang et al. (2016)]. CCA has also been applied for feature selection [Ogura et al. (2013)], feature extraction and fusion [Shen et al. (2013)], and dimension reduction [Wang et al. (2013)]. Examples of application studies conducted in the fields of bioinformatics and computational biology include [Rousu et al. (2013), Seoane et al. (2014), Baur and Bozdag (2015), Sarkar and Chakraborty (2015), Cichonska et al. (2016)]. The vast range of application domains emphasises the utility of CCA in extracting relations between variables.

Originally, CCA was developed to extract linear relations in overdetermined settings, that is, when the number of observations exceeds the number of variables in either view. To extend CCA to the underdetermined settings that often occur in modern data analysis, methods of regularisation have been proposed. When the sample size is small, Bayesian CCA also provides an alternative way to perform CCA. The applicability of CCA to underdetermined settings has been further improved through sparsity-inducing norms that facilitate the interpretation of the final result. Kernel methods and neural networks have been introduced for uncovering non-linear relations. At present, canonical correlation methods can be used to extract linear and non-linear relations in both over- and underdetermined settings.

In addition to the already described variants of CCA, alternative extensions have been proposed, such as semi-paired and multi-view CCA. In general, CCA algorithms assume a one-to-one correspondence between the observations in the views; in other words, the data is assumed to be paired. However, in real datasets some of the observations may be missing in either view, which means that the observations are semi-paired. Examples of semi-paired CCA algorithms include [Blaschko et al. (2008)], [Kimura et al. (2013)], [Chen et al. (2012)], and [Zhang et al. (2014)]. CCA has also been extended to more than two views by [Horst (1961)], [Carroll (1968)], [Kettenring (1971)], and [Van de Geer (1984)]. In multi-view CCA the relations are sought among more than two views. Some of the modern extensions of multi-view CCA comprise its regularised [Tenenhaus and Tenenhaus (2011)], kernelised [Tenenhaus et al. (2015)], and sparse [Tenenhaus et al. (2014)] variants. Application studies of multi-view CCA and its modern variants can be found in neuroscience [Kang et al. (2013)], [Chen et al. (2014)], feature fusion [Yuan et al. (2011)], and dimensionality reduction [Yuan et al. (2014)]. However, both semi-paired and multi-view CCA are beyond the scope of this tutorial.

This tutorial begins with an introduction to the original formulation of CCA. The basic framework and statistical assumptions are presented. The techniques for solving the CCA optimisation problem are discussed. After solving the CCA problem, the approaches to interpret and evaluate the result are explained. The variants of CCA are illustrated using worked examples. Of the extended versions of CCA, the tutorial concentrates on the topics of regularised, kernel, and sparse CCA. Additionally, the deep and Bayesian CCA variants are briefly reviewed. This tutorial acquaints the reader with canonical correlation methods and discusses where they are applicable and what kind of information they can extract.

## 2 Canonical Correlation Analysis

### 2.1 The Basic Principles of CCA

CCA is a two-view multivariate statistical method. In multivariate statistical analysis, the data comprises multiple variables measured on a set of observations or individuals. In the case of CCA, the variables of an observation can be partitioned into two sets that can be seen as the two views of the data. This can be illustrated using the following notation. Let the views be denoted by the matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$. The row vectors $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}^q$ for $i = 1, \dots, n$ denote the sets of empirical multivariate observations in $X$ and $Y$ respectively. The observations are assumed to be jointly sampled from a multivariate normal distribution. A reason for this is that the multivariate normal model approximates well the distribution of continuous measurements in several sampled distributions [Anderson (2003)]. The column vectors $x^{(j)}$ for $j = 1, \dots, p$ and $y^{(k)}$ for $k = 1, \dots, q$ denote the variable vectors of the observations respectively. The inner product between two vectors $a$ and $b$ is denoted by $\langle a, b \rangle = a^\top b$. Throughout this tutorial, we assume that the variables are standardised to zero mean and unit variance. In CCA, the aim is to extract the linear relations between the variables of $X$ and $Y$.

CCA is based on linear transformations. We consider the following transformations

$$ z_x = X w_x, \qquad z_y = Y w_y, $$

where $w_x \in \mathbb{R}^p$, $w_y \in \mathbb{R}^q$, and $z_x, z_y \in \mathbb{R}^n$. The data matrices $X$ and $Y$ represent linear transformations of the positions $w_x$ and $w_y$ onto the images $z_x$ and $z_y$ in the space $\mathbb{R}^n$. The positions $w_x$ and $w_y$ are often referred to as canonical weight vectors, and the images $z_x$ and $z_y$ are also termed canonical variates or scores. The constraints of CCA on the mappings are that the images $z_x$ and $z_y$ are unit-norm vectors and that the enclosing angle $\theta$ [Golub and Zha (1995), Dauxois and Nkiet (1997)] between $z_x$ and $z_y$ is minimised. The cosine of the angle, also referred to as the canonical correlation, between the images $z_x$ and $z_y$ is given by the formula

$$ \cos\theta = \frac{\langle z_x, z_y \rangle}{\lVert z_x \rVert \, \lVert z_y \rVert}, $$

and due to the unit-norm constraint, $\cos\theta = \langle z_x, z_y \rangle$. Hence the basic principle of CCA is to find two positions $w_x$ and $w_y$ that, after the linear transformations $z_x = X w_x$ and $z_y = Y w_y$, are mapped onto an $n$-dimensional unit ball and located in such a way that the cosine of the angle between their images $z_x$ and $z_y$ is maximised.

The images $z_x$ and $z_y$ of the positions $w_x$ and $w_y$ that result in the smallest angle $\theta_1$ determine the first canonical correlation, which equals $\cos\theta_1$ [Björck and Golub (1973)]. The smallest angle is given by

$$ \cos\theta_1 = \max_{w_x, w_y} \langle X w_x, Y w_y \rangle \quad \text{subject to} \quad \lVert X w_x \rVert = \lVert Y w_y \rVert = 1. \qquad (1) $$

Let the maximum be obtained by $w_x^1$ and $w_y^1$. The pair of images $z_x^2$ and $z_y^2$ that has the second smallest enclosing angle $\theta_2$ is found in the orthogonal complements of $z_x^1$ and $z_y^1$. The procedure is continued until no more pairs are found. Hence the angles $\theta_m$ for $m = 2, \dots, \min(p, q)$ that can be found are recursively defined by

$$ \cos\theta_m = \max_{w_x, w_y} \langle X w_x, Y w_y \rangle \quad \text{subject to} \quad \lVert X w_x \rVert = \lVert Y w_y \rVert = 1, \;\; \langle X w_x, X w_x^i \rangle = \langle Y w_y, Y w_y^i \rangle = 0, \;\; i = 1, \dots, m-1. $$

The number of canonical correlations, $\min(p, q)$, corresponds to the dimensionality of CCA. Qualitatively, the dimensionality of CCA can also be seen as the number of patterns that can be extracted from the data.

When the dimensionality of CCA is large, it may not be relevant to solve all of the positions $w_x$ and $w_y$ and images $z_x$ and $z_y$. In general, the value of the canonical correlation and its statistical significance are considered to convey the importance of a pattern. The first estimation strategy for finding the number of statistically significant canonical correlation coefficients was proposed in [Bartlett (1941)]. The techniques have been further developed in [Fujikoshi and Veitch (1979), Tu (1991), Gunderson and Muirhead (1997), Yamada and Sugiyama (2006), Lee (2007), Sakurai (2009)].

In summary, the principle behind CCA is to find two positions in the two data spaces that have images on a unit ball such that the angle between the images is minimised and consequently the canonical correlation is maximised. The linear transformations of the positions are given by the data matrices. The number of relevant positions can be determined by analysing the values of the canonical correlations or by applying statistical significance tests.

### 2.2 Finding the positions and the images in CCA

The position vectors $w_x$ and $w_y$, whose images $z_x$ and $z_y$ in the new coordinate system of a unit ball have a maximum cosine of the enclosing angle, can be obtained using techniques of functional analysis. The eigenvalue-based methods comprise solving a standard eigenvalue problem, as originally proposed by Hotelling in [Hotelling (1936)], or a generalised eigenvalue problem [Bach and Jordan (2002), Hardoon et al. (2004)]. Alternatively, the positions and the images can be found using the singular value decomposition (SVD), as introduced in [Healy (1957)]. These techniques can be considered standard ways of solving the CCA problem.

#### Solving CCA Through the Standard Eigenvalue Problem

In the technique of Hotelling, both the positions $w_x$ and $w_y$ and the images $z_x$ and $z_y$ are obtained by solving a standard eigenvalue problem. The Lagrange multiplier technique [Hotelling (1936), Hooper (1959)] is employed to obtain the characteristic equation. Let $X$ and $Y$ denote the data matrices of sizes $n \times p$ and $n \times q$ respectively. The sample covariance matrix between the variable column vectors in $X$ and $Y$ is $C_{xy} = \frac{1}{n} X^\top Y$. The empirical variance matrices of the variables in $X$ and $Y$ are given by $C_{xx} = \frac{1}{n} X^\top X$ and $C_{yy} = \frac{1}{n} Y^\top Y$ respectively. The joint covariance matrix is then

$$ C = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}. \qquad (2) $$

The first and greatest canonical correlation, corresponding to the smallest angle, is attained between the first pair of images $z_x$ and $z_y$. Since the correlation between $z_x$ and $z_y$ does not change with the scaling of $w_x$ and $w_y$, we can constrain $w_x$ and $w_y$ to be such that $z_x$ and $z_y$ have unit variance. This is given by

$$ \operatorname{var}(z_x) = w_x^\top C_{xx} w_x = 1, \qquad (3) $$

$$ \operatorname{var}(z_y) = w_y^\top C_{yy} w_y = 1. \qquad (4) $$

Due to the normality assumption and comparability, the variables of $X$ and $Y$ should be centered such that they have zero means. In this case, the covariance between $z_x$ and $z_y$ is given by

$$ \operatorname{cov}(z_x, z_y) = w_x^\top C_{xy} w_y. \qquad (5) $$

Substituting (5), (3) and (4) into the problem in Equation (1), we obtain:

$$ \max_{w_x, w_y} w_x^\top C_{xy} w_y \quad \text{subject to} \quad w_x^\top C_{xx} w_x = 1, \;\; w_y^\top C_{yy} w_y = 1. $$

In general, the constraints (3) and (4) are expressed in squared rather than square-root form. The problem can be solved using the Lagrange multiplier technique. Let

$$ \mathcal{L}(w_x, w_y) = w_x^\top C_{xy} w_y - \frac{\lambda_x}{2} \left( w_x^\top C_{xx} w_x - 1 \right) - \frac{\lambda_y}{2} \left( w_y^\top C_{yy} w_y - 1 \right), \qquad (6) $$

where $\lambda_x$ and $\lambda_y$ denote the Lagrange multipliers. Differentiating $\mathcal{L}$ with respect to $w_x$ and $w_y$ gives

$$ \frac{\partial \mathcal{L}}{\partial w_x} = C_{xy} w_y - \lambda_x C_{xx} w_x = 0, \qquad (7) $$

$$ \frac{\partial \mathcal{L}}{\partial w_y} = C_{yx} w_x - \lambda_y C_{yy} w_y = 0. \qquad (8) $$

Multiplying (7) from the left by $w_x^\top$ and (8) from the left by $w_y^\top$ gives

$$ w_x^\top C_{xy} w_y - \lambda_x w_x^\top C_{xx} w_x = 0, \qquad w_y^\top C_{yx} w_x - \lambda_y w_y^\top C_{yy} w_y = 0. $$

Since $w_x^\top C_{xx} w_x = 1$ and $w_y^\top C_{yy} w_y = 1$, we obtain that

$$ \lambda_x = \lambda_y = w_x^\top C_{xy} w_y =: \lambda. \qquad (9) $$

Substituting (9) into Equation (7) we obtain

$$ w_x = \frac{1}{\lambda} C_{xx}^{-1} C_{xy} w_y. \qquad (10) $$

Substituting (10) into (8) we obtain

$$ C_{yx} C_{xx}^{-1} C_{xy} w_y = \lambda^2 C_{yy} w_y, $$

which is equivalent to a generalised eigenvalue problem of the form $A w = \lambda^2 B w$. If $C_{yy}$ is invertible, the problem reduces to a standard eigenvalue problem of the form

$$ C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy} w_y = \lambda^2 w_y. $$

The eigenvalues $\lambda^2$ of the matrix $C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy}$ are found by solving the characteristic equation

$$ \det\!\left( C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy} - \lambda^2 I \right) = 0. $$

The square roots of the eigenvalues correspond to the canonical correlations. The technique of solving the standard eigenvalue problem is shown in Example 2.2.
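As a concrete sketch of this technique, the following Python fragment (our own illustration, not code from the original tutorial; the data and variable names are assumptions) builds the covariance matrices of two synthetic views and recovers the canonical correlations as the square roots of the eigenvalues of $C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two views sharing one latent signal in their first variables.
latent = rng.normal(size=n)
X = np.column_stack([latent + 0.1 * rng.normal(size=n),
                     rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([latent + 0.1 * rng.normal(size=n),
                     rng.normal(size=n), rng.normal(size=n)])

# Standardise every variable to zero mean and unit variance.
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

# Empirical variance and covariance matrices.
Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

# Standard eigenvalue problem: the eigenvalues of
# Cyy^{-1} Cyx Cxx^{-1} Cxy are the squared canonical correlations.
M = np.linalg.solve(Cyy, Cxy.T) @ np.linalg.solve(Cxx, Cxy)
eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]
rhos = np.sqrt(np.clip(eigvals, 0.0, 1.0))
print(rhos)  # the first value is close to the planted correlation
```

Only the first canonical correlation should be large here, since the two views share a single latent variable; the remaining two reflect sampling noise.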

We generate two data matrices $X$ and $Y$ of sizes $n \times p$ and $n \times q$ respectively as follows. The variables of $X$ are generated from a random univariate normal distribution. Selected variables of $Y$ are then generated as linear functions of the variables of $X$ with added vectors of normal noise. The data is standardised such that every variable has zero mean and unit variance, and the joint covariance matrix in (2) of the generated data is computed.

Now we compute the eigenvalues of the characteristic equation of the matrix $C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy}$. The square roots of the eigenvalues give the canonical correlations $\rho_1 > \rho_2 > \rho_3$. The eigenvectors $w_y$ satisfy the standard eigenvalue equation, and the corresponding vectors $w_x$ are obtained from (10). The resulting vectors form the pairs of positions $(w_x, w_y)$ that have the images $(z_x, z_y)$. In linear CCA, the canonical correlations equal the square roots of the eigenvalues.

#### Solving CCA Through the Generalised Eigenvalue Problem

The positions $w_x$ and $w_y$ and their images $z_x$ and $z_y$ can also be solved through a generalised eigenvalue problem [Bach and Jordan (2002), Hardoon et al. (2004)]. The equations in (7) and (8) can be represented as the simultaneous equations

$$ C_{xy} w_y = \lambda C_{xx} w_x, \qquad C_{yx} w_x = \lambda C_{yy} w_y, $$

that are equivalent to

$$ \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \lambda \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix}. \qquad (11) $$

Equation (11) represents a generalised eigenvalue problem of the form $A w = \lambda B w$, where $(\lambda, w)$ is an eigenpair of the pair $(A, B)$ [Saad (2011), Golub and Van Loan (2012)]. The pair of matrices $A$ and $B$ is also referred to as a matrix pencil. In particular, $A$ is symmetric and $B$ is symmetric positive definite; the pair $(A, B)$ is then called a symmetric pair. As shown in [Watkins (2004)], a symmetric pair has real eigenvalues and linearly independent eigenvectors. When the problem is expressed in the equivalent form $B^{-1} A w = \lambda w$, the generalised eigenvalue is given by the Rayleigh quotient $\lambda = \frac{w^\top A w}{w^\top B w}$. Since the generalised eigenvalues come in pairs $(\lambda, -\lambda)$, the positive generalised eigenvalues correspond to the canonical correlations.
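A sketch of this formulation (our own illustrative code, on synthetic data of our choosing) assembles the pencil $(A, B)$ and solves the symmetric-definite generalised eigenvalue problem with `scipy.linalg.eigh`; the positive eigenvalues are the canonical correlations:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, p, q = 500, 3, 3

# Two views sharing one latent signal in their first variables.
latent = rng.normal(size=n)
X = np.column_stack([latent + 0.1 * rng.normal(size=n),
                     rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([latent + 0.1 * rng.normal(size=n),
                     rng.normal(size=n), rng.normal(size=n)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

# Matrix pencil (A, B): A symmetric, B symmetric positive definite.
A = np.block([[np.zeros((p, p)), Cxy],
              [Cxy.T, np.zeros((q, q))]])
B = np.block([[Cxx, np.zeros((p, q))],
              [np.zeros((q, p)), Cyy]])

# Generalised eigenvalues come in +/- pairs; the positive ones
# are the canonical correlations.
evals, evecs = eigh(A, B)
order = np.argsort(evals)[::-1][:min(p, q)]
rhos = evals[order]
Wx = evecs[:p, order]   # positions w_x, one per column
Wy = evecs[p:, order]   # positions w_y
print(rhos)
```

The eigenvectors returned by `eigh` are $B$-orthonormal, so each stacked vector $(w_x, w_y)$ is determined only up to this joint scaling and an overall sign.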

Using the data in Example 2.2, we apply the formulation of the generalised eigenvalue problem to obtain the positions $w_x$ and $w_y$. The resulting generalised eigenvalues come in positive and negative pairs. The generalised eigenvectors that correspond to the positive generalised eigenvalues, taken in descending order, give the pairs of positions $(w_x, w_y)$, and the positive generalised eigenvalues give the canonical correlations.

The entries of the position pairs may differ to some extent from the solutions of the standard eigenvalue problem in Example 2.2. This is due to the numerical algorithms that are applied to solve for the eigenvalues and eigenvectors. Additionally, the signs may be opposite, as can be seen when comparing the second pairs of positions with Example 2.2: a position pair can be negated without changing the canonical correlation. This results from the symmetric nature of CCA.

#### Solving CCA Using the SVD

The technique of applying the SVD to solve the CCA problem was first introduced by [Healy (1957)] and described by [Ewerbring and Luk (1989)] as follows. First, the variance matrices $C_{xx}$ and $C_{yy}$ are transformed into identity form. Due to the symmetric positive definite property, the square-root factors of the matrices can be found using a Cholesky or eigenvalue decomposition:

$$ C_{xx} = C_{xx}^{1/2} C_{xx}^{1/2}, \qquad C_{yy} = C_{yy}^{1/2} C_{yy}^{1/2}. $$

Applying the inverses of the square-root factors symmetrically on the joint covariance matrix in (2) we obtain

$$ \begin{pmatrix} C_{xx}^{-1/2} & 0 \\ 0 & C_{yy}^{-1/2} \end{pmatrix} \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix} \begin{pmatrix} C_{xx}^{-1/2} & 0 \\ 0 & C_{yy}^{-1/2} \end{pmatrix} = \begin{pmatrix} I & C_{xx}^{-1/2} C_{xy} C_{yy}^{-1/2} \\ C_{yy}^{-1/2} C_{yx} C_{xx}^{-1/2} & I \end{pmatrix}. $$

The position vectors $w_x$ and $w_y$ can hence be obtained by solving the following SVD

$$ C_{xx}^{-1/2} C_{xy} C_{yy}^{-1/2} = U \Sigma V^\top, \qquad (12) $$

where the columns of the matrices $U$ and $V$ correspond to the sets of orthonormal left and right singular vectors respectively. The singular values of the matrix correspond to the canonical correlations. The positions $w_x$ and $w_y$ are obtained from

$$ w_x = C_{xx}^{-1/2} u, \qquad w_y = C_{yy}^{-1/2} v. $$

The method is shown in Example 2.2.

The method of solving CCA using the SVD is demonstrated using the data of Example 2.2. We compute the matrix $C_{xx}^{-1/2} C_{xy} C_{yy}^{-1/2}$ and its SVD. The singular values of the matrix correspond to the canonical correlations. The positions are given by $w_x = C_{xx}^{-1/2} u$ and $w_y = C_{yy}^{-1/2} v$, where $u$ and $v$ correspond to the left and right singular vectors. The resulting pairs of positions $(w_x, w_y)$ and canonical correlations agree with those obtained in Example 2.2.
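The SVD route can be sketched as follows (our own illustration on synthetic data). A Cholesky factor is used in place of the symmetric square root; this changes the factors but not the singular values, so the canonical correlations are unaffected:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
X = np.column_stack([latent + 0.1 * rng.normal(size=n),
                     rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([latent + 0.1 * rng.normal(size=n),
                     rng.normal(size=n), rng.normal(size=n)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

# Square-root factors via Cholesky: Cxx = Lx Lx^T, Cyy = Ly Ly^T.
Lx = np.linalg.cholesky(Cxx)
Ly = np.linalg.cholesky(Cyy)

# Whitened cross-covariance and its SVD; the singular values
# are the canonical correlations.
K = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
U, s, Vt = np.linalg.svd(K)

# Back-transform the singular vectors into the positions.
Wx = np.linalg.solve(Lx.T, U)
Wy = np.linalg.solve(Ly.T, Vt.T)
print(s)
```

Note that each recovered position satisfies the unit-variance constraint, e.g. $w_x^\top C_{xx} w_x = 1$, because the whitening makes the singular vectors orthonormal in the transformed coordinates.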

The main motivation for improving on the eigenvalue-based technique was computational complexity. The standard and generalised eigenvalue methods scale with the cube of the input matrix dimension; in other words, the time complexity is $O(n^3)$ for a matrix of size $n \times n$. The input matrix in the SVD-based technique is rectangular, which gives a time complexity of $O(mn^2)$ for a matrix of size $m \times n$ with $m \geq n$. Hence the SVD-based technique is computationally more tractable for very large datasets.

To recapitulate, the images $z_x$ and $z_y$ of the positions $w_x$ and $w_y$ that successively maximise the canonical correlation can be obtained by solving a standard [Hotelling (1936)] or a generalised eigenvalue problem [Bach and Jordan (2002), Hardoon et al. (2004)], or by applying the SVD [Healy (1957), Ewerbring and Luk (1989)]. The CCA problem can also be solved using alternative techniques. The only requirements are that the successive images on the unit ball are orthogonal and that the enclosing angle between paired images is minimised.

### 2.3 Evaluating the Canonical Correlation Model

The pair of position vectors that have images on the unit ball with a minimum enclosing angle corresponds to the canonical correlation model obtained from the training data. The entries of these position vectors convey the relations between the variables obtained from the sampling distribution. In general, a statistical model is validated in terms of statistical significance and generalisability. To assess the statistical significance of the relations obtained from the training data, Bartlett's sequential test procedure [Bartlett (1941)] can be applied. Although the technique was presented in 1941, it is still applied in recent CCA application studies such as [Marttinen et al. (2013), Kabir et al. (2014), Song et al. (2016)]. The generalisability of the canonical correlation model determines whether the relations obtained from the training data can be considered to represent general patterns occurring in the sampling distribution. The methods of testing the statistical significance and generalisability of the extracted relations represent standard ways to evaluate the canonical correlation model.

The entries of the position vectors $w_x$ and $w_y$ can be used as a means to analyse the linear relations between the variables. The linear relation corresponding to the value of the canonical correlation is found between the entries that are of the greatest absolute value. The values of the entries of the position vectors are visualised in Figure 1. When the signs of the two large entries agree, the relation between the corresponding variables is positive; when the signs differ, the relation is negative. In this manner, each of the three pairs of positions conveys one of the linear relations generated in the data.

In [Meredith (1964)], structure correlations were introduced as a means to analyse the relations between the variables. Structure correlations are the correlations of the original variables, $x^{(j)}$ for $j = 1, \dots, p$ and $y^{(k)}$ for $k = 1, \dots, q$, with the images $z_x$ or $z_y$. In general, the structure correlations convey how the images $z_x$ and $z_y$ are aligned in the space $\mathbb{R}^n$ in relation to the variable axes.

In [Ter Braak (1990)], the structure correlations were visualised on a biplot to facilitate the interpretation of the relations. To plot the variables on the biplot, the correlations of the original variables of both sets with two successive images of one of the sets, for example $z_x^1$ and $z_x^2$, are computed. The plot is interpreted through the cosines of the angles between the variable vectors: a positive linear relation is shown by an acute angle, an obtuse angle depicts a negative linear relation, and a right angle corresponds to a zero correlation.

Three biplots of the data and results of Example 2.2 are shown in Figure 2. In each of the biplots, the same relations that were identified in Figure 1 can be found by analysing the angles between the variable vectors. The extraction of the relations can be enhanced by changing the pairs of images with which the correlations are computed.
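Computing structure correlations is straightforward once a canonical pair is available. The sketch below (our own illustration on synthetic data, using the whitened-SVD solution; plotting is omitted) correlates every original variable of both views with the first image of one view:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
X = np.column_stack([latent + 0.1 * rng.normal(size=n),
                     rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([latent + 0.1 * rng.normal(size=n),
                     rng.normal(size=n), rng.normal(size=n)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

# First canonical pair via the whitened SVD.
Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
U, s, Vt = np.linalg.svd(np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T)
zx = X @ np.linalg.solve(Lx.T, U[:, 0])   # first image of view X
zy = Y @ np.linalg.solve(Ly.T, Vt[0])     # first image of view Y

# Structure correlations: each original variable against the image zx.
struct_x = np.array([np.corrcoef(X[:, j], zx)[0, 1] for j in range(X.shape[1])])
struct_y = np.array([np.corrcoef(Y[:, k], zx)[0, 1] for k in range(Y.shape[1])])
print(struct_x.round(2), struct_y.round(2))
```

The variables that carry the shared signal show structure correlations close to one in absolute value, while the noise variables stay near zero; plotting `struct_x` and `struct_y` against two successive images yields the biplot described above.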

The statistical significance tests of the canonical correlations evaluate whether the obtained pattern can be considered to occur non-randomly. The sequential test procedure of Bartlett [Bartlett (1938)] determines the number of statistically significant canonical correlations in the data. The procedure to evaluate the statistical significance of the canonical correlations is described in [Fujikoshi and Veitch (1979)]. We test the hypothesis

$$ H_k : \rho_1 \neq 0, \; \dots, \; \rho_k \neq 0, \; \rho_{k+1} = \dots = \rho_{\min(p,q)} = 0, \qquad (13) $$

that is, $\rho_i = 0$ when $i > k$. If the hypothesis is rejected for $H_0, \dots, H_{k-1}$ but accepted for $H_k$, the number of statistically significant canonical correlations can be estimated as $k$. For the test, the Bartlett–Lawley statistic $\Lambda_k$ is applied

$$ \Lambda_k = -\left( n - k - \frac{p + q + 1}{2} + \sum_{i=1}^{k} \hat{\rho}_i^{-2} \right) \ln \prod_{i=k+1}^{\min(p,q)} \left( 1 - \hat{\rho}_i^{\,2} \right), \qquad (14) $$

where $\hat{\rho}_i$ denotes the $i$th sample canonical correlation. The asymptotic null distribution of $\Lambda_k$ is the chi-squared distribution with $(p - k)(q - k)$ degrees of freedom. Hence we first test that no canonical relation exists between the two views ($H_0$). If we reject the hypothesis, we continue to test that exactly one canonical relation exists ($H_1$). If all the canonical patterns are statistically significant, even the hypothesis $H_{\min(p,q)-1}$ is rejected.
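The sequential procedure can be sketched in a few lines (our own illustration; the function names are our own, and the statistic follows the Bartlett–Lawley form given above):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_lawley(rhos, n, p, q, k):
    """Statistic and degrees of freedom for H_k: only the first k
    canonical correlations are non-zero."""
    rhos = np.asarray(rhos, dtype=float)
    factor = n - k - (p + q + 1) / 2 + np.sum(1.0 / rhos[:k] ** 2)
    stat = -factor * np.sum(np.log(1.0 - rhos[k:] ** 2))
    return stat, (p - k) * (q - k)

def n_significant(rhos, n, p, q, alpha=0.05):
    """Sequentially test H_0, H_1, ... and return the estimated number
    of statistically significant canonical correlations."""
    for k in range(min(p, q)):
        stat, df = bartlett_lawley(rhos, n, p, q, k)
        if stat < chi2.ppf(1 - alpha, df):   # H_k accepted
            return k
    return min(p, q)                          # every H_k rejected

# One strong and two negligible sample canonical correlations.
print(n_significant([0.95, 0.10, 0.05], n=500, p=3, q=3))  # prints 1
```

With one strong correlation and two near-zero ones, $H_0$ is rejected while $H_1$ is accepted, so the procedure estimates one significant canonical correlation.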

We demonstrate the sequential test procedure of Bartlett using the simulated setting of Examples 2.2. First, we test that there are no canonical correlations:

$$ H_0 : \rho_1 = \rho_2 = \rho_3 = 0. \qquad (15) $$

Since the Bartlett–Lawley statistic $\Lambda_0$ exceeds the chi-squared critical value at the $0.05$ significance level, the hypothesis is rejected. Next we test that there is one canonical correlation:

$$ H_1 : \rho_1 \neq 0, \; \rho_2 = \rho_3 = 0. \qquad (16) $$

The statistic $\Lambda_1$ also exceeds its critical value, so the hypothesis is rejected. We continue to test that there are two canonical correlations:

$$ H_2 : \rho_1 \neq 0, \; \rho_2 \neq 0, \; \rho_3 = 0. \qquad (17) $$

Since $\Lambda_2$ again exceeds its critical value, $H_2$ is rejected. Hence all three canonical patterns are statistically significant.

To determine whether the extracted relations can be considered generalisable, or in other words general patterns in the sampling distribution, the linear transformations of the position vectors $w_x$ and $w_y$ need to be performed using test data. Unlike training data, test data originates from the sampling distribution but was not used in the model computation. Let the matrices $X_{test}$ and $Y_{test}$ denote the test data of $n_{test}$ observations. The linear transformations of the position vectors $w_x$ and $w_y$ are then

$$ z_x^{test} = X_{test} w_x, \qquad z_y^{test} = Y_{test} w_y, $$

where the images $z_x^{test}$ and $z_y^{test}$ are in the space $\mathbb{R}^{n_{test}}$. The cosine of the angle between the test images implies the generalisability. If the canonical correlations computed from test data also result in high correlation values, we can deduce that the relations can generally be found in the particular sampling distribution.
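As a sketch (our own illustration, reusing the whitened-SVD solution on synthetic data), the generalisability check amounts to projecting held-out observations with the training positions and correlating the resulting test images:

```python
import numpy as np

def make_views(n, rng):
    """Two views that share one latent signal in their first variables."""
    latent = rng.normal(size=n)
    X = np.column_stack([latent + 0.1 * rng.normal(size=n),
                         rng.normal(size=n), rng.normal(size=n)])
    Y = np.column_stack([latent + 0.1 * rng.normal(size=n),
                         rng.normal(size=n), rng.normal(size=n)])
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return X, Y

rng = np.random.default_rng(0)
X, Y = make_views(500, rng)      # training data
Xt, Yt = make_views(200, rng)    # test data from the same distribution

# Fit CCA on the training data via the whitened SVD.
n = len(X)
Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
U, s, Vt = np.linalg.svd(np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T)
wx = np.linalg.solve(Lx.T, U[:, 0])  # first pair of positions
wy = np.linalg.solve(Ly.T, Vt[0])

# Project the test data and measure the correlation of the test images.
zx, zy = Xt @ wx, Yt @ wy
rho_test = np.corrcoef(zx, zy)[0, 1]
print(rho_test)  # a high value indicates the first relation generalises
```

A test correlation close to the training canonical correlation suggests the extracted relation is a property of the sampling distribution rather than an artefact of the training sample.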

We evaluate the generalisability of the canonical correlation model obtained in Example 2.2. The test data matrices $X_{test}$ and $Y_{test}$ are drawn from the same distributions as described in Example 2.2, and the test observations were not included in the computation of the model. The test canonical correlations corresponding to the positions $w_x$ and $w_y$ remain high, which indicates that the extracted relations can be considered generalisable.

The canonical correlation model can be evaluated by assessing the statistical significance and testing the generalisability of the relations. The statistical significance of the model can be determined by testing whether the extracted canonical correlations could be non-zero merely by chance. The generalisability of the relations can be assessed using new observations from the sampling distribution. These evaluation methods can generally be applied to test the validity of the extracted relations obtained using any variant of CCA.

## 3 Extensions of Canonical Correlation Analysis

### 3.1 Regularisation Techniques in Underdetermined Systems

CCA finds linear relations in the data when the number of observations exceeds the number of variables in either view, since a sufficient sample size generally guarantees the non-singularity of the variance matrices $C_{xx}$ and $C_{yy}$ when solving the CCA problem. In the case of the standard eigenvalue problem, the matrices $C_{xx}$ and $C_{yy}$ should be non-singular so that they can be inverted. In the case of the SVD method, singular $C_{xx}$ and $C_{yy}$ do not have invertible square-root factors. If the number of observations is less than the number of variables, it is likely that some of the variables are collinear. Hence a sufficient sample size reduces the collinearity of the variables and guarantees the non-singularity of the variance matrices. The first proposition to solve the problem of insufficient sample size was presented in [Vinod (1976)]. A more recent technique to regularise CCA has been proposed in [Cruz-Cano and Lee (2014)]. In the following, we present the original method of regularisation [Vinod (1976)] due to its popularity in CCA applications [González et al. (2009)], [Yamamoto et al. (2008)], and [Soneson et al. (2010)].

In the work of [Vinod (1976)], the singularity problem was proposed to be solved by regularisation. In general, the idea is to improve the invertibility of the variance matrices $C_{xx}$ and $C_{yy}$ by adding constants $c_x > 0$ and $c_y > 0$ to the diagonals: $C_{xx} + c_x I$ and $C_{yy} + c_y I$. The constraints of CCA become

$$ w_x^\top (C_{xx} + c_x I) w_x = 1, \qquad w_y^\top (C_{yy} + c_y I) w_y = 1, $$

and hence the magnitudes of the position vectors $w_x$ and $w_y$ are smaller when regularisation, $c_x > 0$ and $c_y > 0$, is applied. The regularised CCA optimisation problem is given by

$$ \max_{w_x, w_y} w_x^\top C_{xy} w_y \quad \text{subject to} \quad w_x^\top (C_{xx} + c_x I) w_x = 1, \;\; w_y^\top (C_{yy} + c_y I) w_y = 1. $$

The positions $w_x$ and $w_y$ can be found by solving the standard eigenvalue problem

$$ (C_{yy} + c_y I)^{-1} C_{yx} (C_{xx} + c_x I)^{-1} C_{xy} w_y = \lambda^2 w_y, $$

or the generalised eigenvalue problem

$$ \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \lambda \begin{pmatrix} C_{xx} + c_x I & 0 \\ 0 & C_{yy} + c_y I \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix}. $$

As in the case of linear CCA, the canonical correlations correspond to the inner products between the consecutive image pairs $z_x = X w_x$ and $z_y = Y w_y$.
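A minimal sketch of this ridge-style regularisation (our own illustration, on an underdetermined synthetic dataset where plain CCA fails because $C_{xx}$ is singular):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 30, 50, 40          # fewer observations than variables
latent = rng.normal(size=n)
X = rng.normal(size=(n, p)); X[:, 0] = latent + 0.1 * rng.normal(size=n)
Y = rng.normal(size=(n, q)); Y[:, 0] = latent + 0.1 * rng.normal(size=n)
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
cx = cy = 0.1                 # regularisation parameters (arbitrary here)

# The regularised variance matrices are invertible even though n < p, q.
Rxx = Cxx + cx * np.eye(p)
Ryy = Cyy + cy * np.eye(q)
M = np.linalg.solve(Ryy, Cxy.T) @ np.linalg.solve(Rxx, Cxy)
eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]
rhos = np.sqrt(np.clip(eigvals, 0.0, None))
print(rhos[:3])
```

Without the added diagonal terms, `np.linalg.solve(Cxx, ...)` would fail (or be numerically meaningless), since the rank of $C_{xx}$ is at most $n < p$; with them, the regularised canonical correlations remain bounded by one.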

The regularisation proposed by [Vinod (1976)] makes the CCA problem solvable but introduces new parameters $c_x$ and $c_y$ that have to be chosen. The first proposition of applying a leave-one-out cross-validation procedure to automatically select the regularisation parameters was presented in [Leurgans et al. (1993)]. Cross-validation is a well-established nonparametric model selection procedure to evaluate the validity of statistical predictions; one of its earliest applications was presented in [Larson (1931)]. A cross-validation procedure entails partitioning the observations into subsamples, selecting and estimating a statistic which is first measured on one subsample, and then validating it on the other, held-out subsample. The method of cross-validation is discussed in detail for example in [Stone (1974)], [Efron (1979)], [Browne (2000)], and more recently in [Arlot et al. (2010)]. The cross-validation approach specifically developed for CCA has been further extended in [Waaijenborg et al. (2008), Yamamoto et al. (2008), González et al. (2009), Soneson et al. (2010)].

In cross-validation, the size of the hold-out subsample varies depending on the size of the dataset. A leave-one-out cross-validation procedure is an option when the sample size is small and partitioning of the data into several folds, as is done in $k$-fold cross-validation, is not feasible. $k$-fold cross-validation saves computation time in relation to leave-one-out cross-validation if the sample size is large enough to partition the observations into $k$ folds, where each fold is used as a test set in turn.

In general, as demonstrated for example in [Krstajic et al. (2014)], a $k$-fold cross-validation procedure should be repeated when an optimal set of parameters is searched for. Repetitions decrease the variance of the average values measured across the test folds. Algorithm 1 outlines an approach to determine the optimal regularisation parameters in CCA.
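Algorithm 1 itself is not reproduced here, but the overall approach can be sketched as follows (our own illustrative code with our own function names, assuming a repeated $k$-fold grid search and the held-out correlation of the first canonical pair as the selection criterion):

```python
import numpy as np

def first_pair(X, Y, cx, cy):
    """First position pair of regularised (ridge-form) CCA."""
    n, p, q = len(X), X.shape[1], Y.shape[1]
    Rxx = X.T @ X / n + cx * np.eye(p)
    Ryy = Y.T @ Y / n + cy * np.eye(q)
    Cxy = X.T @ Y / n
    M = np.linalg.solve(Rxx, Cxy) @ np.linalg.solve(Ryy, Cxy.T)
    vals, vecs = np.linalg.eig(M)
    wx = vecs[:, np.argmax(vals.real)].real
    wy = np.linalg.solve(Ryy, Cxy.T @ wx)   # matching w_y, up to scale
    return wx, wy

def cv_score(X, Y, cx, cy, folds=5, repeats=3, seed=0):
    """Average held-out correlation of the first canonical pair."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        for fold in np.array_split(rng.permutation(len(X)), folds):
            mask = np.ones(len(X), dtype=bool)
            mask[fold] = False
            wx, wy = first_pair(X[mask], Y[mask], cx, cy)
            scores.append(np.corrcoef(X[fold] @ wx, Y[fold] @ wy)[0, 1])
    return float(np.mean(scores))

# Underdetermined-ish data: the shared signal spans five variables.
rng = np.random.default_rng(1)
n, p, q = 60, 40, 40
latent = rng.normal(size=n)
X = rng.normal(size=(n, p)); X[:, :5] = latent[:, None] + 0.3 * rng.normal(size=(n, 5))
Y = rng.normal(size=(n, q)); Y[:, :5] = latent[:, None] + 0.3 * rng.normal(size=(n, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

grid = [1e-3, 1e-1, 1e1]                 # candidate values, cx = cy
best = max((cv_score(X, Y, c, c), c) for c in grid)
print(best)   # (best held-out correlation, chosen parameter)
```

The grid, fold count, and scoring rule are all choices of this sketch; in practice $c_x$ and $c_y$ can be searched over separate grids, and the average held-out correlation guards against the severe overfitting that unregularised CCA exhibits when the variable count approaches the sample size.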
