## 1 Introduction

Kernel methods are powerful tools to capture the nonlinear representation of data by mapping the dataset to a high-dimensional feature space. Despite their tremendous success in various machine learning problems, kernel methods suffer from massive computational cost on large datasets. The time cost of computing the kernel matrix alone scales quadratically with data, and if the learning method involves inverting the matrix (e.g., kernel ridge regression), the cost would increase to cubic. This computational bottleneck motivated a great deal of research on kernel approximation, where the seminal work of

[1] on random featuresis a prominent point in case. For the class of shift-invariant kernels, they showed that one can approximate the kernel by Monte-Carlo sampling from the inverse Fourier transform of the kernel.

Due to the practical success of random features, the idea was later used for one of the ubiquitous problems in statistics and machine learning, namely Canonical Correlation Analysis (CCA). CCA derives a pair of linear mappings of two datasets, such that the correlation between the projected datasets is maximized. Similar to other machine learning methods, CCA also has a nonlinear counterpart called Kernel Canonical Correlation Analysis (KCCA) [2], which provides a more flexible framework for maximizing the correlation. Due to the prohibitive computational cost of KCCA, Randomized Canonical Correlation Analysis (RCCA) was introduced [3, 4] to serve as a surrogate for KCCA. RCCA uses random features for transformation of the two datasets. Therefore, it provides the flexibility of nonlinear mappings with a moderate computational cost.

On the other hand, more recently, data-dependent sampling of random features has been an intense focus of research in the machine learning community. The main objective is to modify the stochastic oracle from which random features are sampled to improve a certain performance metric. Examples include [5, 6] with a focus only on kernel approximation as well as [7, 8, 9]

with the goal of better generalization in supervised learning. While the proposed techniques in this realm improve their respective learning tasks, they are not necessarily suitable for other learning tasks.

In this paper, we propose a general scoring rule for sampling random features, which can be employed for various applications with some adjustments. In particular, our scoring rule depends on a positive definite matrix that can be adjusted based on the application. We first observe that a number of data-dependent sampling methods (e.g., leverage scores in [8] and energy-based sampling in [10]) can be recovered by our scoring rule using specific choices of the matrix. Then, we restrict our attention to CCA and provide a principled guide for finding the distribution maximizing the canonical correlations. Our result reveals a novel data-dependent method for sampling features, called Optimal Randomized Canonical Correlation Analysis (ORCCA). This suggests that prior data-dependent methods are not necessarily optimal for the CCA task. We conduct extensive numerical experiments verifying that ORCCA indeed introduces significant improvement over the state-of-the-art in random features for CCA.

## 2 Preliminaries and Problem Setting

Notation: We denote by the set of positive integers , by the trace operator, by the standard inner product, by

the spectral (respectively, Euclidean) norm of a matrix (respectively, vector), and by

the expectation operator. Boldface lowercase variables (e.g., ) are used for vectors, and boldface uppercase variables (e.g., ) are used for matrices. denotes the -th entry of matrix . The vectors are all in column form.### 2.1 Random features and Kernel Approximation

Kernel methods are powerful tools for data representation, commonly used in various machine learning problems. Let be a set of given points where for any , and consider a symmetric positive-definite function such that for . Then, is called a positive (semi-)definite kernel, serving as a similarity measure between any pair of vectors . This class of kernels can be thought as inner product of two vectors that map the points from a

-dimensional space to a higher dimensional space (and potentially infinite-dimensional space). The benefit is that in the new space the points may be separable, providing more flexibility compared to the original space. Examples of classical machine learning methods using kernels include kernel ridge regression, kernel support vector machine, and kernel clustering.

Despite the widespread use of kernel methods in machine learning, there is an evident computational issue in the implementing. Computing the kernel for every pair of points costs , and if the learning method requires inverting that matrix (e.g., kernel ridge regression), the cost would increase to . This particular disadvantage makes kernel method impractical for large-scale machine learning.

An elegant method to address this issue was the use of random Fourier features for kernel approximation [1]. Let

be a probability density with support

. Consider any kernel function in the following form(1) |

where is a feature map parameterized by . Following (1), the kernel function can be approximated as

(2) |

are independent samples from , called random features. Examples of kernels taking the form (1) include shift-invariant kernels [1] or dot product (e.g., polynomial) kernels [11] (see Table 1 in [12] for an exhaustive list). Let us now define

(3) |

Then, the kernel matrix can be approximated with where is defined as

(4) |

The low-rank approximation above can save significant computational cost when . As an example, for kernel ridge regression the time cost would reduce from to .

While [1] prove uniform convergence of random features for approximation of kernel functions of form (1), such bounds do not directly translate to statistical guarantees on the corresponding kernel matrix. The recent work of [8] tackles this important problem using the notion of -spectral approximation.

###### Theorem 1

[8] Let and be fixed constants. Assume that . If we use random Fourier features (sampled from ), the following identity

(5) |

holds with probability at least , where .

In view of (5), is dubbed a -spectral approximation of . Theorem 1 characterizes the number of random features required to achieve a -spectral approximation of the kernel matrix given a specific . We can observe that for a small , the number of random features dominates even the number of data points . Since the main motivation of using randomized features is to reduce the computational cost of kernel methods (with ), this observation will naturally raise the following question:

###### Problem 1

Can we develop a sampling (or re-sampling) mechanism that reduces the number of random features required for a learning task involving kernel matrix?

Next two subsections will shed light on Problem 1.

### 2.2 A General Scoring Rule for Sampling random features

Several recent works have answered to Problem 1 in the affirmative; however, quite interestingly, there is so much difference in adopted strategies given the learning task. For example, a sampling scheme that improves kernel approximation (e.g., orthogonal random features [13]) will not necessarily be competitive for supervised learning [10]. In other words, Problem 1 has been addressed in a task-specific fashion. In this paper, we propose a general scoring rule for sampling random features that lends itself to several important tasks in machine learning.

Let be a positive definite matrix and define the following score function for any

(6) |

where is the original probability density of random features. can be thought as an easy prior to sample from. Notice that the score function is also a probability density with support . The key advantage of the score function is that can be designed to improve sampling depending on the learning task. We will elaborate on this choice in Subsection 2.3, where we view some of recent sampling techniques in the literature with this lens.

Observe that given availability of the probability density , one can either sample from it to form a different kernel replacing by in (1), or recover the same kernel function using importance sampling

(7) |

where the approximated form is as follows

(8) |

with being independent samples from . We can then have the transformed matrix similar to (4), where

(9) |

In the following theorem, we present the spectral approximation guarantee for .

###### Theorem 2

### 2.3 Relation to Data-Dependent Sampling

A number of recent works have proposed the idea of sampling random features based on data-dependent distributions, mostly focusing on improving generalization in supervised learning. In this section, we show that the score function (6) will bring some of these methods under the same umbrella. More specifically, given a particular choice of the center matrix , we can recover a number of data-dependent sampling schemes, such as Leverage Scores (LS) [8, 14, 15, 16] and Energy-based Exploration of random features (EERF) [10].

Leverage Scores: Following the framework of [8]

, LS sampling is according to the following probability density function

(10) |

which can be recovered precisely when in (6). Using this sampling strategy, [8] proved that is a -spectral approximation of in the sense of . Importantly, the number of random features required for this approximation reduces from to , where we used to hide poly-log factors. It can be then observed that if the eigen-decay of is fast enough, the number of required features shrinks dramatically compared to applying vanilla random features. The major issue is that computing needs calculations, which is as costly as using the original version of the desired kernel method (e.g., ridge regression).

A practical implementation of LS was proposed in [14] and later used in the experiments of [16] for SVM. The algorithm is based on the observation that as in (4), which (after simple algebra) can be shown to reduce the time cost to linear with respect to data. The generalization properties of this practical algorithm was later analyzed by [17] for the case of ridge regression and by [18] for the case of Lipschitz losses, respectively.

Energy-Based Exploration of random features: The EERF algorithm was proposed in [10] for improving generalization. In supervised learning, the goal is to map input vectors to output variables , where for . The EERF algorithm employs the following scoring rule for random features

(11) |

where the score is calculated for a large pool of random features, and subset with largest score will be used for the supervised learning problem. Note that the algorithm greedily chooses the best features in the pool in lieu of sampling according to . Now, if we let , we can observe that is equivalent to (6) with the center matrix , because ordering the pool of features according to is equivalent to given above. [10] showed in their numerical experiments that EERF consistently outperforms plain random features and other data-independent methods in terms of generalization.

We remark that the kernel alignment method in [7] is also in a similar spirit. Instead of choosing features with largest scores, an optimization algorithm is proposed to re-weight the features such that the transformed input is correlated enough with output variable.

Given the success of algorithms like LS and EERF, we can hope that the scoring rule (6) has the potential to be adopted in various learning tasks. Indeed, the center matrix should be chosen based on the objective function that needs to be optimized in the learning task at hand.

## 3 Canonical Correlation Analysis with Score-Based Sampling

In this section, we study the application of the scoring rule (6) to nonlinear Canonical Correlation Analysis (CCA), after a brief review of CCA basics.

### 3.1 Overview of Canonical Correlation Analysis

Linear CCA was introduced in [19]

as a method of correlating linear relationships between two multi-dimensional random variables

and . This problem is often formulated as finding a pair of canonical bases and such that is minimized, where and is the Frobenius norm. The problem has a well-known closed-form solution (see e.g., [20]), relating canonical correlations and canonical pairs to the eigen-system of the following matrix(12) |

where , and

are regularization parameters to avoid singularity. In particular, the eigenvalues correspond to the canonical correlations and the eigenvectors correspond to the canonical pairs. The kernel version of CCA, called KCCA

[2, 21], investigates the correlation analysis using the eigen-system of another matrix, in which covariance matrices of (12) are replaced with kernel matrices, i.e.,(13) |

where and . As the inversion of kernel matrices involves time cost, [4] adopted the idea of kernel approximation with random features, introducing Randomized Canonical Correlation Analysis (RCCA). RCCA uses approximations and in (13), where and are the transformed matrices using random features as in (4). In other words,

### 3.2 Optimal Randomized Canonical Correlation Analysis

We now propose an adaptation of the scoring rule (6) for CCA, where the center matrix is selected particularly for maximizing the total canonical correlations. We start with an important special case of due to the natural connection to supervised learning. We will use index for any quantity in relation to , and for any quantity in relation to .

Optimal Randomized Canonical Correlation Analysis 1 ( and linear ): We first consider the scenario where is mapped into a nonlinear space (using random features) following (4). On the other hand, remains in its original space (with ). It is well-known that if for some , perfect (linear) correlation is achieved between and in view of (12) (with and ), simply because is a linear combination of the columns of . This motivates the idea that sampling schemes that are good for supervised learning may be natural candidates for CCA in that with we can achieve perfect correlation. The following proposition finds the optimal scoring rule of form (6) that maximizes the total canonical correlations.

###### Proposition 3

Interestingly, the principled way of selecting that maximizes total CCA leads to a sampling rule that was not previously investigated. It is clear that the density function in (14) is different from LS (10) and EERF (11). While the scoring rule (14) optimizes canonical correlations in view of Proposition 1, calculating would cost , which is not scalable to large datasets. The following corollary offers an approximated solution to avoid this issue.

###### Corollary 4

For any finite pool of random features , instead of sampling (or selecting) according to the scoring rule (14), we can approximate the scoring rule with the following empirical score

(15) |

for any and , where is formed with random features as in (4) and denotes the empirical score of the -th random features in the pool of features.

Observe that sampling according to the score rule above will reduce the computational cost from to , which is a significant improvement when . After constructing (15), we have two options in using the empirical score. We can either sample features out of the pool according to (15), or greedily select the features with highest empirical scores. In any case, instead of approximating the default kernel (using importance sampling), we obtain a “better” by using the data information. In this paper, we will use the greedy algorithm, called ORCCA1 presented in Algorithm 1.

Input: ,, the feature map , an integer , an integer , the prior distribution , the parameter .

(17) |

Output: Linear canonical correlations between and (with regularization parameter ) as in (12).

Optimal Randomized Canonical Correlation Analysis 2 ( or nonlinear ): We now follow the idea of KCCA with both views of data mapped to a nonlinear space. More specifically, is mapped to and is mapped to following (4). For this set up, we provide below the optimal scoring rule of form (6) that maximizes the total canonical correlations.

###### Proposition 5

Consider KCCA in (13) with , a nonlinear kernel matrix , and a nonlinear kernel . If we alternatively approximate and only in the right block matrix of (13), the optimal scoring rule maximizing the total canonical correlations can be expressed as

(18) |

for any and

(19) |

for any , respectively. The probability densities and are the priors defining the default kernel functions in the space of and according to (1).

We can associate the scoring rules above to the general scoring rule (6) as well. Indeed, for sampling the random features from to transform , the center matrix is , and for sampling the random features from to transform , the center matrix is . However, the same computational issue as (14) motivates the use of empirical version of this score, presented in the sequel.

###### Corollary 6

For any finite pool of random features and (sampled from priors and , respectively), instead of sampling (or selecting) according to the scoring rules (18) and (19), we can approximate them using the following empirical versions

for any and

for any , respectively. and are the transformed matrices of and as in (4) using random features. and denote the scores of the -th random features in the pools corresponding to and , respectively.

As we can observe, the computational cost is reduced from to similar to Corollary 4. Following the greedy fashion of ORCCA1, we now present ORCCA2 in Algorithm 2 in which both and are transformed nonlinearly.

Input: ,, the feature map , an integer , an integer , the prior densities and , the parameter .

(20) |

(21) |

Output: Linear canonical correlations between and (with regularization parameter ) as in (12).

## 4 Related Literature

Random Features: As discussed in Section 2.1, kernels of form (1) can be approximated using random features (e.g., shift-invariant kernels using Monte Carlo [1] or Quasi Monte Carlo [22] sampling, and dot product kernels [11]. A number of methods have been proposed to improve the time cost, decreasing it by a linear factor of the input dimension (see e.g., Fast-food [23, 5]). The generalization properties of random features have been studied for -regularized risk minimization [24] and ridge regression [15], both improving the early generalization bound of [25]. Also, [13]

develop orthogonal random features (ORF) to improve kernel approximation variance. It turns out that ORF provides optimal kernel estimator in terms of mean-squared error

[26]. A number of recent works have focused on kernel approximation techniques based on data-dependent sampling of random features. Examples include [27] on compact nonlinear feature maps, [5, 28] on approximation of shift-invariant/translation-invariant kernels, [6] on Stein effect in kernel approximation, and [29] on data-dependent approximation using greedy approaches (e.g., Frank-Wolfe). On the other hand, another line of research has focused on generalization properties of data-dependent sampling. In addition to works mentioned in Section 2.3, [9] also study data-dependent approximation of translation-invariant/rotation-invariant kernels for improving generalization in SVM. [30] recently propose a hybrid approach (based on importance sampling) to re-weight random features with application to both kernel approximation and supervised learning.Canonical Correlation Analysis: As discussed in Section 3.1, the computational cost of KCCA [2] motivated a great deal of research on kernel approximation for CCA in large-scale learning. Several methods tackle this issue by explicitly transforming datasets (e.g., randomized canonical correlation analysis (RCCA) [3, 4] and deep canonical correlation analysis (DCCA) [31]

). RCCA focuses on transformation using randomized 1-hidden layer neural networks, whereas DCCA considers deep neural networks. Perhaps not surprisingly, the time cost of RCCA is significantly smaller than DCCA

[4]. There exists other non-parametric approaches such as non-parametric canonical correlation analysis (NCCA) [32], which estimates the density of training data to provide a practical solution to Lancaster’s theory for CCA [33]. Also, more recently, a method is proposed in [34] for sparsifying KCCA through regularization. A different (but relevant) literature has focused on addressing the optimization problem in CCA. [35, 36] have discussed this problem by developing novel techniques, such as alternating least squares, shift-and-invert preconditioning, and inexact matrix stochastic gradient. In a similar spirit is [37], which presents a memory-efficient stochastic optimization algorithm for RCCA.Our work is radically different from previous literature in that we propose a general sampling rule for random features, which can be adopted for different learning tasks including CCA.

## 5 Numerical Experiments

We now investigate the empirical performance of ORCCA2 using four datasets from the UCI Machine Learning Repository. The datasets are MNIST, Adult, Seizure Detection, and Energy Use. Due to space limitations, experiments related to ORCCA1 are reported in the supplementary material (Section 6.6). We have also included the R code for implementation of ORCCA2 versus other benchmarks in the supplementary.

Dataset | ORCCA2 mean | Competitive mean | ORCCA2 standard error |
Competitive standard error |
---|---|---|---|---|

MNIST | 1.8 | 1.579 (LS) | 0.011 | 0.013 |

Adult | 0.609 | 0.605 (ORF) | 0.00063 | 0.00089 |

Seizure Detection | 1.284 | 1.102 (RFF) | 0.0106 | 0.0088 |

Energy Use | 5.61 | 5.32 (LS) | 0.028 | 0.027 |

Benchmark Algorithms: We compare ORCCA1 & ORCCA2 to four random feature based benchmark algorithms that have shown good performance in supervised learning and/or kernel approximation. The first one is plain random Fourier features (RFF) [1]. Next is orthogonal random features (ORF) [13], which improves the variance of kernel approximation. Then, we have two data-dependent sampling methods, LS [8, 14] and EERF [10], due to their success in supervised learning as mentioned in Section 2.3.

1) RFF [1] with as the feature map to approximate the Gaussian kernel.

are sampled from Gaussian distribution

andare sampled from uniform distribution

. We use this to transform to . The same procedure applies to to map it to .2) ORF [13] with as the feature map. are sampled from a Gaussian distribution

and then modified based on a QR decomposition step. The transformed matrices

and for ORF are different from other algorithms in that and . Given that the feature map is 2-dimensional here, to keep the comparison fair, the number of random features used for ORF will be half of other algorithms.3) LS [8, 14] with as the feature map. are sampled from Gaussian distribution and are sampled from uniform distribution . features are sampled from the pool of random Fourier features according to the scoring rule of LS (10). The transformed matrices and (as in (9)) are then used for the experiments.

4) EERF: [10] with as the feature map. are sampled from Gaussian distribution and are sampled from uniform distribution . random features are selected according to the scoring rule of EERF (11) to form . EERF is not included here because it only works for supervised learning and is suitable for comparison with ORCCA1. The results related to ORCCA1 is presented in the supplementary material.

Practical Considerations: Following [3], we work with empirical copula transformation of datasets to achieve invariance with respect to marginal distributions. For domain, the variance of random features is set to be the inverse of mean-distance of -th nearest neighbour (in Euclidean distance), following [13]. For EERF, LS, and ORCCA2, the pool size is when random features are used in CCA calculation. The regularization parameter for LS is chosen through grid search. After performing a grid search, the variance of random features for is set to be the same as , producing the best results for all algorithms. The regularization parameter is set to be small enough to make its effect on CCA negligible while avoiding numerical errors caused by singularity.

Performance: Our empirical results are reported in Figure 1 and Table 1. All the results are averaged over simulations. Figure 1 represents the total canonical correlations versus the number of random features for ORCCA2 (this work), RFF, ORF, and LS. We observe that ORCCA2 is superior compared to other benchmarks, and only for the Adult dataset ORF is initially on par with our algorithm. Given the dominance of ORCCA2 over LS, we can clearly conclude that data-dependent sampling methods that improve supervised learning are not necessarily best choices for CCA. Moreover, Table 1 tabulates the statistics of the results when averaged over 30 simulations (for ). The averaged total canonical correlations of ORCCA2 is compared with the most competitive algorithm (second best result). We can see that though ORF yields a similar result as ORCCA2 in Adult dataset, there is still significant statistical difference between the two algorithms according to standard errors.

## 6 Supplementary Material

###### Lemma 7

[8] Let be a fixed symmetrical matrix and be a set of symmetrical random matrices where for . If the following bounds hold

(22) |

and

(23) |

then, we define the following values

(24) |

and

(25) |

For the expectation estimator , we have the following inequality for all where

(26) |

### 6.1 Proof of Theorem 2

To prove Theorem 2, we represent the eigen-decomposition of the matrix with . Then, it is easy to see that proving a -spectral approximation for is equivalent to showing the following inequality:

(27) |

Now, consider the following sets of random matrices

(28) |

and recall from (6) that

We know that . Therefore, to apply Lemma 7, we need to calculate the relevant bounds. Since , we can use above to get

where the last line follows from . Letting , we have

following the calculations for . Therefore, we now obtain

given the fact that . Therefore, we can apply Lemma 7 with and , and derive

Due to the assumption that , we can see that

(29) |

The quantity in (25) will be

(30) |

where from Theorem 1. Finally, noting (29)-(30) and , we apply Lemma 7 to get

(31) | ||||

Therefore, when , is a spectral approximation of with probability at least .