## I Introduction

Supervised learning is a fundamental machine learning problem, where a learner is given input-output data samples (from an unknown distribution), and the objective is to find a mapping from inputs to outputs [1]. Kernel methods are powerful tools for capturing the nonlinear relationship between inputs and outputs. These methods implicitly map the inputs (features) to a high-dimensional space, without the need for knowledge of the feature map, an idea known as the kernel trick. While kernel methods are theoretically well-justified, their practical applicability to large datasets is limited in that their memory (respectively, time) complexity can scale quadratically (respectively, cubically) with the number of data samples. In the past few years, this computational bottleneck has motivated a large body of research on (low-rank) kernel approximation [2, 3, 4] for efficient learning. In these scenarios, training can scale linearly with the data, a dramatic decrease in computational cost. In this line of work, an elegant idea has been the use of so-called random features for kernel approximation [4] as well as for training shallow networks [5]. In this approach, random features are sampled from a stochastic oracle to form the nonlinear basis functions used to describe the input-output relationship. By replacing optimization with randomization, this approach circumvents the non-convexity of training and comes with a theoretical generalization guarantee [5].
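As a concrete illustration of the random-feature idea, the following self-contained sketch (not from the paper; the Gaussian kernel, bandwidth, and sizes are illustrative choices) approximates a Gaussian kernel with Monte Carlo random Fourier features and measures the entrywise approximation error:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Exact Gaussian (RBF) kernel matrix.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def random_fourier_features(X, M, sigma=1.0, seed=None):
    # Monte Carlo random features for the Gaussian kernel:
    # z(x) = sqrt(2/M) * cos(W^T x + b), with W ~ N(0, 1/sigma^2), b ~ U[0, 2*pi].
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, M))
    b = rng.uniform(0, 2 * np.pi, size=M)
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Z = random_fourier_features(X, M=5000, sigma=1.0, seed=1)
K_exact = gaussian_kernel(X, X)
K_approx = Z @ Z.T        # inner products of features approximate the kernel
err = np.abs(K_exact - K_approx).max()
```

Increasing the number of features M shrinks the error at the usual Monte Carlo rate of roughly 1/sqrt(M).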

Since its development, the randomized-feature approach has been successfully used for a wide range of problems (see e.g. [6] for matrix completion, [7] for the correlation analysis of random variables, and [8] for non-parametric statistical learning), but as pointed out in [9], since the basis functions are sampled from a distribution that is independent of the data, the number of features required to learn the data subspace may be large. Therefore, a natural question is whether a data-dependent stochastic oracle can prove useful in improving the out-of-sample performance. Recently, a number of works have developed supervised data-dependent methods for sampling random features with the goal of improving generalization [10, 11, 12]. This objective is achieved by pre-processing the random features (e.g. via optimizing a metric) and focusing on promising features, which amounts to learning a “good” kernel based on input-output pairs. We provide a comprehensive review of these works in Section IV, but the focus of this work is on an unsupervised data-dependent method relying on leverage scores, calculated based only on inputs [13, 14, 15]. [14] discussed the impact of leverage scores for ridge regression, [15] addressed the problem in the case of SVM, and [13] established theoretical guarantees for Lipschitz continuous loss functions. Common to all these results is the fact that using leverage scores for sampling random features can significantly reduce the number of features required to achieve optimal learning bounds. The bounds are particularly useful when the eigenvalues of the integral operator corresponding to the underlying kernel decay fast enough. Nevertheless, these works do not aim to change the underlying base kernel.

There are two practical hurdles in using leverage scores: (i) they introduce an optimized distribution for re-sampling features, which is based on the infinite-dimensional integral operator associated with the underlying kernel, and (ii) the support set (domain of random features) is infinite-dimensional, making the optimized distribution impractical to sample from. An empirical sampling scheme is proposed in the experiments of [15] without theoretical analysis, and as noted in [15], a result in this theoretical direction would be useful for guiding practitioners.

In this paper, we aim to address the problem above using empirical leverage scores. In this scenario, we must construct a finite counterpart of the optimized distribution to use for training. The out-of-sample performance of the algorithm (Theorem 1) reveals an interesting trade-off between two errors: (i) the approximation error of the kernel caused by the finiteness of random features, and (ii) the eigenvalue decay of another kernel, defined in the domain of random features based on the data distribution. The proof of our main result combines the approximation error result of [13] with the spectral approximation result of [16] for ridge leverage functions (which builds on recent works on matrix concentration inequalities [17]). We also verify with numerical experiments (on practical datasets) that the empirical leverage score idea consistently outperforms vanilla Monte Carlo sampling [5], and that with a minor modification in the sampling scheme, it can even be competitive with supervised data-dependent methods, without using outputs (labels).

## II Problem Formulation

#### Preliminaries:

We denote by $\mathbb{N}$ the set of positive integers, by $\mathrm{Tr}[\cdot]$ the trace operator, by $\Vert \cdot \Vert$ the spectral (respectively, Euclidean) norm of a matrix (respectively, vector), by $\mathbb{E}[\cdot]$ the expectation operator, and by $\mathbb{V}[\cdot]$ the variance operator. Boldface lowercase variables (e.g. $\mathbf{a}$) are used for vectors, and boldface uppercase variables (e.g. $\mathbf{A}$) are used for matrices; $[\mathbf{A}]_{ij}$ denotes the $(i,j)$-th entry of matrix $\mathbf{A}$. The vectors are all in column form. $L_2(\mathcal{X}, P)$ represents the set of square-integrable functions with respect to the Borel probability measure $P$ on the domain $\mathcal{X}$. We use $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ to denote the inner product associated with an inner product space $\mathcal{H}$, and $\Vert \cdot \Vert_{\mathcal{H}}$ for its corresponding norm; the subscript may be dropped when it is clear from the context (e.g. for the Euclidean space). For a positive semi-definite linear operator $T$, the sequence $\{\mu_i(T)\}_i$ denotes the set of (non-negative) eigenvalues in descending order. The sequence is finite if $T$ is finite-dimensional.

### II-A Supervised Learning

In the supervised learning problem, we are given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ of input-output pairs, which are i.i.d. samples from an unknown distribution. The input feature space is $d$-dimensional, i.e., for each $i$, we have $\mathbf{x}_i \in \mathcal{X} \subseteq \mathbb{R}^d$, where $\mathcal{X}$ is closed and convex. For regression, we assume $y_i \in \mathbb{R}$, whereas for classification we have $y_i \in \{-1, +1\}$. The goal of supervised learning is to find a function $f$ based on the training set, which can generalize well, i.e., it can accurately predict the outputs of previously unseen inputs.

The problem above can be formulated as minimizing a risk functional $R(f)$, defined as

$$R(f) := \mathbb{E}\left[\ell\big(f(\mathbf{x}), y\big)\right],$$

where $\ell$ is a task-dependent loss function (e.g. the hinge loss for SVM), and the expectation is taken with respect to the data distribution. As this distribution is unknown, we can only minimize the empirical risk $\hat{R}(f) := \frac{1}{n}\sum_{i=1}^{n} \ell(f(\mathbf{x}_i), y_i)$ instead of the true risk $R(f)$, and calculate the gap between the two using standard arguments from measures of function space complexity (e.g. VC dimension, Rademacher complexity, etc.). We will discuss two related function classes in the next section.
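For concreteness, here is a minimal sketch of evaluating an empirical risk; the predictor, losses, and data below are illustrative stand-ins, not the paper's:

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    # Empirical risk: the sample average (1/n) * sum_i loss(f(x_i), y_i).
    return float(np.mean([loss(f(x), yi) for x, yi in zip(X, y)]))

hinge = lambda pred, y: max(0.0, 1.0 - y * pred)   # classification loss
squared = lambda pred, y: (pred - y) ** 2          # regression loss

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ w_true)
f = lambda x: x @ w_true                           # a fixed linear predictor

risk_hinge = empirical_risk(f, X, y, hinge)
risk_sq = empirical_risk(f, X, X @ w_true, squared)  # zero: predictions match targets
```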

### II-B Kernels and Random Features

To minimize the risk functional, we need to focus on a function class for $f$. Let us consider a symmetric positive-definite function $k(\cdot, \cdot)$ such that $\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$ for any $\{\alpha_i\}_{i=1}^{n} \subset \mathbb{R}$ and $\{\mathbf{x}_i\}_{i=1}^{n} \subset \mathcal{X}$. $k$ is then called a positive (semi-)definite kernel, and a possible class to consider is the Reproducing Kernel Hilbert Space (RKHS) associated with $k$, defined as follows

$$\mathcal{H} := \overline{\left\{ f(\cdot) = \sum\nolimits_{i} \alpha_i\, k(\cdot, \mathbf{x}_i) \;:\; \alpha_i \in \mathbb{R},\; \mathbf{x}_i \in \mathcal{X} \right\}}. \tag{1}$$

Minimizing the empirical risk over this class of functions by optimizing over the coefficients is theoretically well-understood and justified; however, since this approach requires $O(n^2)$ space and $O(n^3)$ time (e.g. training ridge regression with naive matrix inversion), the practical applicability of kernel methods to large datasets is limited.

It is often useful to study the RKHS through the following integral operator

$$(Tf)(\mathbf{x}) := \int_{\mathcal{X}} k(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')\, dP_X(\mathbf{x}'). \tag{2}$$

The spectral properties of the kernel matrix are related to those of $T$ (see e.g. [18]). When $\int_{\mathcal{X}} k(\mathbf{x}, \mathbf{x})\, dP_X(\mathbf{x}) < \infty$, $T$ is self-adjoint, positive semi-definite, and trace-class (note that this is a sufficient, but not necessary, condition; a weaker integrability condition yields the same properties [13]).

Let us now restrict our attention to kernels that can be written as

$$k(\mathbf{x}, \mathbf{x}') = \int_{\Omega} \phi(\mathbf{x}, \boldsymbol{\omega})\, \phi(\mathbf{x}', \boldsymbol{\omega})\, dP_{\Omega}(\boldsymbol{\omega}), \tag{3}$$

for all $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$, where $\phi$ is a feature map and $P_{\Omega}$ is a measure on $\Omega$. Many common kernels take the form above; examples include shift-invariant kernels [4] and dot product (e.g. polynomial) kernels [19] (we refer the reader to Table 1 in [20] as well as Table 1 in [21] for an exhaustive list).

The integral form (3) can be approximated using Monte Carlo sampling of so-called random features $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$, which are i.i.d. vectors generated from $P_{\Omega}$. Then,

$$k(\mathbf{x}, \mathbf{x}') \approx \frac{1}{M} \sum_{m=1}^{M} \phi(\mathbf{x}, \boldsymbol{\omega}_m)\, \phi(\mathbf{x}', \boldsymbol{\omega}_m) = \mathbf{z}^{\top}(\mathbf{x})\, \mathbf{z}(\mathbf{x}'), \tag{4}$$

where

$$\mathbf{z}(\mathbf{x}) := \frac{1}{\sqrt{M}} \left[ \phi(\mathbf{x}, \boldsymbol{\omega}_1), \ldots, \phi(\mathbf{x}, \boldsymbol{\omega}_M) \right]^{\top}, \tag{5}$$

naturally leading to the function class

$$\hat{\mathcal{F}} := \left\{ f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{z}(\mathbf{x}) \;:\; \boldsymbol{\theta} \in \mathbb{R}^{M} \right\}. \tag{6}$$

The advantage of optimizing the risk over $\hat{\mathcal{F}}$ (rather than the RKHS) is that training can be considerably more efficient if we keep $M \ll n$. For example, in the case of ridge regression, the training time reduces from $O(n^3)$ to $O(nM^2)$. In fact, [14] recently showed that to achieve the same statistical accuracy as kernel ridge regression (i.e., $O(1/\sqrt{n})$ risk error), we only require $O(\sqrt{n}\log n)$ random features using vanilla Monte Carlo sampling. Note that the randomized-feature approach also reduces the computation time of the test phase, since each prediction costs $O(M)$ feature evaluations rather than $O(n)$ kernel evaluations.

### II-C Leverage Scores and Data-Dependent Sampling

The function class (6) can also be viewed as a one-hidden-layer neural network (i.e., a perceptron) with activation function $\phi$. To minimize the empirical risk over (6), we can (in general) consider three possible paths: (1) joint optimization over $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$ and $\boldsymbol{\theta}$, which fully trains the neural network by solving a non-convex optimization problem; (2) Monte Carlo sampling of $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$ and optimizing over $\boldsymbol{\theta}$ [5], which was discussed in the previous section; (3) data-dependent sampling of $\{\boldsymbol{\omega}_m\}_{m=1}^{M}$ and optimizing over $\boldsymbol{\theta}$. Though (1) seems to be the most powerful technique, the main advantage of (2) and (3) is dealing with a convex problem that avoids (potentially bad) local minima. Another potential advantage is that we do not require the gradient of the activation function for training, which broadens the scope of applicability. Recently, a number of works have proposed supervised data-dependent sampling of random features to enhance generalization [10, 11, 12]. This objective is achieved by pre-processing the random features (e.g. via optimizing a metric) and focusing on “good” ones (for generalization purposes). We provide a comprehensive review of these works in Section IV, and here, we focus on presenting a promising unsupervised data-dependent method that relies upon leverage scores [13, 14, 15].

[14, 15] have discussed the impact of leverage scores for ridge regression and SVM. For $\lambda > 0$, the leverage score is defined as [13]

$$l_{\lambda}(\boldsymbol{\omega}) := \left\langle \phi(\cdot, \boldsymbol{\omega}),\, (T + \lambda I)^{-1} \phi(\cdot, \boldsymbol{\omega}) \right\rangle_{L_2(\mathcal{X}, P_X)}, \tag{7}$$

where $T$ is the integral operator in (2). In turn, the optimized distribution for random features is derived as follows

$$q^{\star}(\boldsymbol{\omega}) := \frac{l_{\lambda}(\boldsymbol{\omega})\, p(\boldsymbol{\omega})}{\int_{\Omega} l_{\lambda}(\boldsymbol{\omega}')\, p(\boldsymbol{\omega}')\, d\boldsymbol{\omega}'}, \tag{8}$$

where $p$ denotes the density of $P_{\Omega}$. Notice that if we have access to $q^{\star}$, the unbiased approximation of the kernel takes the form

$$k(\mathbf{x}, \mathbf{x}') \approx \frac{1}{M} \sum_{m=1}^{M} \frac{p(\boldsymbol{\omega}_m)}{q^{\star}(\boldsymbol{\omega}_m)}\, \phi(\mathbf{x}, \boldsymbol{\omega}_m)\, \phi(\mathbf{x}', \boldsymbol{\omega}_m), \qquad \boldsymbol{\omega}_m \sim q^{\star}, \tag{9}$$

with respect to the new measure $q^{\star}$. All of the aforementioned works have established theoretical results showing that if the eigenvalues of $T$ decay fast enough, the number of random features needed to achieve a given error can decrease significantly, potentially down to a constant. There are, however, practical challenges to consider.

Practical challenges: We can observe that sampling random features according to $q^{\star}$ gives rise to two issues [13]: (i) we require knowledge of the infinite-dimensional operator $T$ (which is not available), and (ii) the set $\Omega$ might be large and impractical to sample from. An empirical sampling mechanism has been proposed in the experiments of [15] without theoretical analysis. As noted in [15], a result in this theoretical direction would be extremely useful for guiding practitioners, and we will discuss that in Section III after outlining the empirical leverage scores next.

### II-D Sampling Based on Empirical Leverage Scores

To start, let us first define the matrix

(10)

where $\mathbf{z}(\cdot)$ is given in (5). Observe that this matrix can be related to the kernel function as follows,

(11)

Now, consider another kernel, defined on the domain of random features, which measures the similarity between random features. Then, the following relationship holds

(12)

It is shown in [13] that sampling random features using leverage scores (7) corresponds to selecting (re-weighting) them according to the diagonal elements of the matrix

(13)

Though the dependence on the operator $T$ is relaxed, the dimension of this matrix still grows with the number of random features, which suggests that the more features we evaluate (from the set $\Omega$), the more computational cost we incur. More importantly, another hurdle is that the data distribution is unknown, so the matrix cannot be calculated exactly. Therefore, appealing to the empirical data distribution seems to be a natural solution in practice. The outline of such a method is given in Algorithm 1 (this algorithm was also suggested in the experiments of [15]).

Input: A sub-sample of inputs, the feature map $\phi$, an integer $M$, the sampling distribution $P_{\Omega}$, and the parameter $\lambda$.

(15)

Output: The new weights $\hat{q}$.
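A minimal numerical sketch in the spirit of Algorithm 1 (the exact normalization of (14) is not reproduced here; the ridge-leverage form below, built from an M-by-M feature-similarity matrix over a data subsample, is our assumption based on the description in the text):

```python
import numpy as np

def empirical_leverage_weights(Phi, lam):
    # Phi: M x n matrix with Phi[m, i] = phi(x_i, w_m) on a subsample of inputs.
    # K measures the similarity between random features over the data sample;
    # the ridge leverage of feature m is the m-th diagonal entry of K (K + lam I)^{-1}.
    M, n = Phi.shape
    K = (Phi @ Phi.T) / n
    L = K @ np.linalg.inv(K + lam * np.eye(M))
    scores = np.clip(np.diag(L), 0.0, None)
    return scores / scores.sum()          # normalize to a probability vector

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))             # subsample of inputs
M = 100
W = rng.normal(size=(5, M))
b = rng.uniform(0, 2 * np.pi, size=M)
Phi = (np.sqrt(2.0 / M) * np.cos(X @ W + b)).T   # M x n feature evaluations

q = empirical_leverage_weights(Phi, lam=1e-2)
```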

After running ELSS, we can use the new weights as a discrete probability distribution from which to draw samples, and minimize the empirical risk over the function class

(16)

The function class is defined consistently with the choice of approximated kernel given in (9). Note that (assuming we can calculate the inverse in (14)) ELSS precisely recovers Random Kitchen Sinks (RKS) [5] in the limiting regime of $\lambda$, corresponding to uniform sampling. Figure 1 shows the histogram of weights for the Year Prediction dataset. As expected, for one choice of $\lambda$ the measure is uniform on all samples (bottom left), which translates to a delta plot for the histogram (top left). For another choice (right), the empirical leverage scores transform the distribution of weights into a completely non-uniform measure.
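The resampling step can be sketched as follows. The importance-weight normalization 1/sqrt(M_new * M * q[j]) is our assumption for keeping the kernel approximation unbiased with respect to the empirical uniform measure, in the spirit of (9); the placeholder distribution q stands in for the output of Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, M_new = 5, 100, 20
W = rng.normal(size=(d, M))
b = rng.uniform(0, 2 * np.pi, size=M)
# q would come from the empirical leverage scores of Algorithm 1;
# a Dirichlet draw is only a stand-in for illustration.
q = rng.dirichlet(np.ones(M))

# Draw M_new feature indices from the discrete distribution q (with replacement)
idx = rng.choice(M, size=M_new, replace=True, p=q)
scale = 1.0 / np.sqrt(M_new * M * q[idx])   # importance weights (assumed form)

X = rng.normal(size=(50, d))
# Re-weighted random features used to build the new function class.
Z_new = np.sqrt(2.0) * np.cos(X @ W[:, idx] + b[idx]) * scale
```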

The algorithm requires forming and inverting the matrix in (14) to calculate the empirical leverage scores (assuming naive matrix inversion). The parameters of the algorithm can be selected using rules of thumb (without exhaustive tuning); we elaborate on this in the numerical experiments (Section V). Furthermore, the choice of the initial distribution and the feature map depends on the kernel that we want to use for training. For instance, cosine feature maps and a Gaussian distribution can be used for approximating a Gaussian kernel [4].

###### Remark 1.

(Column Sampling) To improve efficiency, column sampling ideas have previously been used in the approximation of large kernel matrices [22, 23] to deal with the ridge-type matrix in (13); however, those approaches are useful when the closed form of the kernel matrix is readily available, which is not the case in our setup: the data distribution is unknown, and the kernel function (in the domain of random features) must be approximated, i.e., we need to deal with (14).

###### Remark 2.

(Block Diagonal Approximation) Following Remark 1, another technique to improve the time cost of (13) is block kernel approximation. It has been shown in [24] that for shift-invariant kernels, block kernel approximation can improve the approximation error (depending on the kernel parameter). For our setup, we still face the same problem as in Remark 1 (the data distribution is unknown).

## III Theoretical Guarantees

We now provide the generalization guarantees of ELSS. The following assumptions are used for the derivation of our result.

###### Assumption 1.

The loss function is uniformly G-Lipschitz-continuous in the first argument.

A number of commonly used loss functions satisfy the assumption above. Examples include the logistic loss and hinge loss for classification, and the quadratic loss for regression.

###### Assumption 2.

The feature map is bounded, i.e., $\sup_{\mathbf{x} \in \mathcal{X},\, \boldsymbol{\omega} \in \Omega} |\phi(\mathbf{x}, \boldsymbol{\omega})|$ is finite. This also implies boundedness of the kernel due to (3).

The boundedness assumption is also standard (see e.g. [5]). For example, cosine or sigmoidal feature maps (activation functions) satisfy the assumption. In general, when $\mathcal{X}$ and $\Omega$ are compact, the feature map can be normalized to satisfy Assumption 2.

Algorithm 1 aims to approximate a distribution on an infinite-dimensional set ($\Omega$) that depends on an infinite-dimensional operator ($T$). The two main challenges in analyzing ELSS are that we must construct such a distribution with finite data and finite random features. After the following definition, we state our main result, in which poly-log factors are hidden.

###### Definition 1.

###### Theorem 1.

Let Assumptions 1-2 hold. For a fixed parameter , let , , and in Algorithm 1. Let random features sampled from (the output of Algorithm 1) define the class in (16), and let be the minimizer of the empirical risk over . If , for , we have

where the expectation is over data and random features, is the integral operator defined with respect to in (4), is a constant factor, and

Interpretation: The bound in Theorem 1 consists of three terms. The first term appears from calculating the Rademacher complexity (estimation error due to the finite sample size) and the choice of regularization parameter (which turns out to be optimal in view of (28)). The second term depends on the variance of the kernel approximation: not only does it scale inversely with the number of random features, but it also depends on the feature map. The third term captures the impact of the minimum eigenvalue of (12). This quantity is a random number based on the choice of random features (but the bound takes an expectation over them). In general, the trade-off between the last two terms is structure-dependent and non-trivial: increasing the number of random features can improve one at the cost of making the other looser. As defined in Section II-D, both terms depend on the feature map $\phi$, but the important insight is that

“the former is characterized by the data distribution $P_X$, whereas the latter is characterized by the distribution of random features $P_{\Omega}$.”

###### Example 1.

For Gaussian kernel using Monte Carlo sampling (see Lemma 2 in [25]), the variance is

for small , where . If (small variance for random features), then . If , by choosing for , and letting , we can maintain the optimal bound as . Similarly, if , can guarantee .

The detailed proof of the theorem is in the supplementary material. It combines several ideas with prior results in the literature. To analyze the difference between the optimized and empirical distributions, we use the spectral approximation result of [16] for ridge leverage functions (based on recent works on matrix concentration inequalities [17]). We further employ the approximation error bound of [13] for bounding the error caused by selecting a finite number of random features out of the initial pool.

###### Remark 3.

Notice that the bound in Theorem 1 is for a Lipschitz continuous loss (similar to that of [13]), whereas the results in [14] and [15] focus on ridge regression and SVM, respectively. On the other hand, our bound is in expectation, whereas the results in [14, 15] hold with high probability. The major difference (our contribution) is that all three prior works assumed (i) the availability of (8) and (ii) the possibility of sampling from it. Our work relaxes both by constructing and sampling from an empirical counterpart (Algorithm 1).

###### Remark 4.

Given that Theorem 1 relates the generalization bound to the variance of the kernel approximation, methods for variance reduction in sampling the initial random features may be more effective than Monte Carlo. For example, Orthogonal Random Features (ORF) [25] is a potential technique for variance reduction in the approximation of the Gaussian kernel. In general, assuming more structure on the kernel (beyond the integral form (3)) can result in a more explicit error term in the generalization bound, but pursuing this direction is outside the scope of this work.

## IV Related Literature

Random features: The idea of randomized features was proposed as an elegant technique for improving the computational efficiency of kernel methods [4]. As previously mentioned, a wide variety of kernels (of the form (3)) can be approximated using random features (e.g. shift-invariant kernels using Monte Carlo [4] or Quasi-Monte Carlo [26] sampling, and dot product kernels [19]). To further increase efficiency with respect to the input dimension, a number of methods have been developed based on the properties of dense Gaussian random matrices (see e.g. Fast-food [27] and Structured Orthogonal Random Features [25]); these methods can decrease the time complexity of feature generation. To study supervised learning, [28] showed that using sparsity-inducing regularization combined with randomized coordinate descent, random features can be made more efficient: fewer random features are required to achieve a given risk error than in the early work of [5]. In a similar spirit and more recently, [14] showed that to achieve the optimal learning error in ridge regression, only $O(\sqrt{n}\log n)$ random features are required.

Data-dependent random features: A number of recent works have focused on kernel approximation techniques based on data-dependent sampling of random features. Examples include [29] on compact nonlinear feature maps, [30, 31] on approximation of shift-invariant/translation-invariant kernels, [32] on Stein effect in kernel approximation, and [33] on data-dependent approximation using greedy approaches (e.g. Frank-Wolfe).

Another line of research has focused on generalization properties of data-dependent sampling. We discussed the unsupervised techniques based on leverage scores in the Introduction (e.g. [13, 14, 15]). On the other hand, there are supervised methods [10, 11, 12] aiming to improve the test error. [10] develops an optimization-based method to re-weight random features and sample the important ones for better generalization. This method outperforms [5] in experiments, but the theoretical bound still indicates that many features are needed to achieve a given learning error. In a similar fashion, [11] proposes a (supervised) score function for re-sampling random features. While effective in practice compared to prior work, the method does not have a theoretical generalization guarantee. [12] studies data-dependent approximation of translation-invariant/rotation-invariant kernels with a focus on SVM. Their technique works by maximizing the kernel alignment in the Fourier domain, and the theoretical bound is derived by applying no-regret learning to solve the SVM dual. We finally remark that recently [34] provided an analysis of ELSS in the case of ridge regression. However, our results are valid for Lipschitz losses and the generalization bound is different: in particular, our bound depends on the eigenvalue decay of the (random) feature Gram matrix, highlighting the role of the data distribution.

Taylor (explicit) features: Besides random features, explicit feature maps have also been used to speed up kernel methods. Cotter et al. [35] discuss the Taylor approximation of the Gaussian kernel in training SVM and provide empirical comparisons with random features. Low-dimensional Taylor approximation has also been addressed in [36, 37] for the Gaussian kernel, as well as in [38] for other practical kernels. Furthermore, the authors of [39] quantify the approximation error of additive homogeneous kernels. Finally, greedy approximation using explicit features has been discussed in [40]. In general, the experiments of [35] for the Gaussian kernel suggest that between Taylor and random features, neither clearly dominates the other, as the structure of the data indeed plays an important role in which gives the better fit.

Nyström method: This work is also relevant to the Nyström method, which offers a data-dependent sampling scheme for kernel approximation [41, 42]. In this approach, a subset of the training data is used to approximate a surrogate kernel matrix, and the data points are then transformed using the approximated kernel. Though also data-dependent, this line of research differs from our work in that we are concerned with learning good features for generalization.

## V Empirical Evaluations

In this section, we provide numerical experiments on four datasets from the UCI Machine Learning Repository.

Benchmark algorithms: We use the following methods in randomized kernel approximation as baselines:

1) RKS [5], with approximated Gaussian kernel: the random features in (4) are sampled from a Gaussian distribution, and the phase shifts are sampled from the uniform distribution on $[0, 2\pi]$.

2) LKRF [10], with approximated Gaussian kernel: a pool of initial random features is sampled and re-weighted by solving a kernel alignment optimization. The top features are used in the training.

3) EERF [11], with approximated Gaussian kernel: similarly to LKRF, a number of initial random features are sampled and then re-weighted according to a score function. The top random features are used in the training.

The selection of the baselines above allows us to evaluate the generalization performance of one *data-independent* method ([5]) and two *supervised* data-dependent methods ([10, 11]) for sampling random features, and compare them to ELSS, which is an *unsupervised* data-dependent method. It should be noted that LKRF and EERF learn a new kernel based on input-output pairs, but in view of (9), ELSS only performs importance sampling and does not change the kernel. To change the kernel (still in an unsupervised manner), we modify ELSS (Algorithm 1) to choose the top features (out of the initial pool) that have the largest weights, without actually sampling them. We present this variant as ELSS (Algorithm 2) in the experiments.
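The top-feature variant (Algorithm 2) can be sketched as a deterministic selection of the highest-weight features; the weight vector below is only a stand-in for the empirical leverage weights:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the empirical leverage weights produced by Algorithm 1.
q = rng.dirichlet(np.ones(100))

M = 20
# Deterministically keep the M features with the largest weights
# (this changes the kernel, unlike importance sampling).
top = np.argsort(q)[-M:]
```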

Practical considerations: The Python code of our paper is available on GitHub (link suppressed for double-blind review). Grid search was performed to obtain the optimal hyper-parameters of each method for each dataset. For instance, the width of the Gaussian kernel was obtained for each dataset using grid search; notice that for randomized approaches, this amounts to sampling random features from the corresponding Gaussian distribution. Following the work of [11] and as a rule of thumb, we set the size of the initial feature pool identically for all algorithms. Theorem 1 suggests a value of the regularization parameter that is negligible at the sample sizes in our experiments, so we simply use a small fixed value for each dataset. The hyper-parameters of the optimization step in LKRF [10] are tuned and the best results are reported.

Datasets: In this work we use four datasets from the UCI Machine Learning Repository, namely the Year Prediction, Online News Popularity, Adult, and Epileptic Seizure Recognition datasets, where the former two are regression tasks and the latter two are binary classification tasks. Table I tabulates the information for each dataset. If the training and test samples are not provided separately for a dataset, we split it randomly. We standardize the data by scaling the features to have zero mean and unit variance, and by scaling the regression responses to a bounded interval.
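The standardization step can be sketched as follows (using training statistics only, so no test-set information leaks into preprocessing):

```python
import numpy as np

def standardize(X_train, X_test):
    # Scale features to zero mean and unit variance using *training* statistics.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0] = 1.0          # guard against constant features
    return (X_train - mu) / sd, (X_test - mu) / sd

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=2.0, scale=3.0, size=(200, 4))
X_te = rng.normal(loc=2.0, scale=3.0, size=(50, 4))
Z_tr, Z_te = standardize(X_tr, X_te)
```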

Dataset | Task | d | # Training | # Test
---|---|---|---|---
Year prediction | Regression | 90 | 46371 | 5163
Online news popularity | Regression | 58 | 26561 | 13083
Adult | Classification | 122 | 32561 | 16281
Epileptic seizure recognition | Classification | 178 | 8625 | 2875

Performance: The results on the datasets in Table I are reported in Figure 2. Each experiment was repeated several times and the average generalization performance (i.e., test accuracy/error) of the methods is reported. It can be seen that ELSS Algorithm 1 performs better than RKS, which is data-independent. ELSS Algorithm 2 boosts the performance even further and brings it closer to that of the supervised data-dependent methods, i.e., EERF and LKRF. It is interesting to observe that on Year Prediction and Seizure Detection, ELSS Algorithm 2, which changes the kernel in an unsupervised fashion, outperforms LKRF.

Sensitivity to $\lambda$: A natural question concerns the sensitivity of the generalization performance of ELSS with respect to $\lambda$. From a theoretical point of view, and for shift-invariant kernels, one expects to see uniformly distributed weights (i.e., equivalent to RFF) for too large and too small values of $\lambda$, and a sweet spot in between. To confirm this, we calculated the train and test error of ELSS for the Year Prediction dataset and report the results in Figure 3. The top and bottom rows correspond to ELSS Algorithm 1 and Algorithm 2, respectively. Each experiment for each $\lambda$ was repeated 50 times and the means and standard deviations are reported.

Learning with Less Labels (LwLL):

The existing state-of-the-art machine learning models, and specifically deep learning architectures, are data hungry and require a large number of labeled samples for training. Learning with few labels has been a long-standing goal in the ML community. Semi-supervised learning, active learning, and more recently zero-shot, one-shot, and few-shot learning paradigms study different aspects of this problem. Here, we show that the unsupervised nature of ELSS enables us to perform efficient LwLL.

We consider the scenario in which we have a large amount of unlabeled data and only a few labeled samples as our training set. To that end, for the Seizure Detection and Adult datasets we use only a few labeled samples per class for training. We then perform classification using Logistic Regression (LR), ELSS+LR, and a neural network (i.e. a perceptron) with one layer of latent neurons. We repeated the experiments for each classifier several times (with randomized sets of training samples) and measured the test accuracy. The means and standard deviations of the test accuracy of these models for different numbers of labeled samples are depicted in Figure 4. Note that the comparison of ELSS+LR with LR serves as an ablation study and shows the benefit of our proposed approach. In addition, the comparison of ELSS+LR with the perceptron shows the benefit of our proposed method compared to neural networks in the LwLL setting.

Concluding remarks: A main distinction between leverage scores and the existing literature on data-dependent random feature generation is the *unsupervised* nature of ELSS. More interestingly, ELSS can provide generalization performance that is on par with *supervised* methods, which use input-output pairs for random feature generation, e.g., [10] and [11]. But why is it important to have an unsupervised data-dependent feature generator, specifically when the final task is supervised learning? The answer lies in the realm of learning with less labels (LwLL). In supervised LwLL, one cannot afford to train complex classifiers due to the lack of enough labeled data, while linear classifiers generally perform poorly on “complex” datasets. In such scenarios, ELSS can leverage a large number of unlabeled samples and extract features that provide generalization performance similar to that of features extracted with full supervision. A linear classifier can then be trained on the extracted features with few labels.

## VI Appendix

### VI-A Estimation error

We start with the definition of Rademacher complexity, which is quite standard, but we provide it for completeness.

###### Definition 2.

(Rademacher complexity [43]) For a finite sample set $S = \{\mathbf{x}_i\}_{i=1}^{n}$, the empirical Rademacher complexity of a class $\mathcal{F}$ is defined as

$$\hat{\mathfrak{R}}_S(\mathcal{F}) := \mathbb{E}_{\boldsymbol{\sigma}}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(\mathbf{x}_i) \right],$$

where the expectation is taken over $\{\sigma_i\}_{i=1}^{n}$, which are independent samples uniformly distributed on the set $\{-1, +1\}$. The Rademacher complexity is then $\mathfrak{R}_n(\mathcal{F}) := \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})]$.

### VI-B The error of approximation with finite data

Next, we have the notion of spectral approximation, which will be used in the proof of our main result.

###### Definition 3.

($\Delta$-spectral approximation [16]) A matrix $\mathbf{K}_1$ is a $\Delta$-spectral approximation of another matrix $\mathbf{K}_2$, if the following relationship holds

$$(1 - \Delta)\, \mathbf{K}_2 \preceq \mathbf{K}_1 \preceq (1 + \Delta)\, \mathbf{K}_2.$$

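Assuming the Loewner-order definition of $\Delta$-spectral approximation, i.e. $(1-\Delta)\mathbf{K}_2 \preceq \mathbf{K}_1 \preceq (1+\Delta)\mathbf{K}_2$, the condition can be checked numerically by testing that both gap matrices are positive semi-definite:

```python
import numpy as np

def is_spectral_approx(K1, K2, delta):
    # Checks (1 - delta) K2 <= K1 <= (1 + delta) K2 in the PSD (Loewner) order,
    # i.e. both K1 - (1 - delta) K2 and (1 + delta) K2 - K1 are PSD.
    lo = np.linalg.eigvalsh(K1 - (1 - delta) * K2).min()
    hi = np.linalg.eigvalsh((1 + delta) * K2 - K1).min()
    return bool(lo >= -1e-10 and hi >= -1e-10)

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # a positive-definite test matrix
ok = is_spectral_approx(1.05 * A, A, delta=0.1)    # small perturbation: passes
bad = is_spectral_approx(2.0 * A, A, delta=0.1)    # doubling: fails
```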
We now provide the following theorem by [16] on spectral approximation. To avoid confusion, we restate the theorem in the notation of this work. In particular, observe that in [16] the kernel matrix is defined for data with respect to random features (similar to (11)), whereas we state the result for the matrix defined for random features with respect to data (the roles of random features and inputs are interchanged). For the sake of simplicity in presentation, we use a shorthand for this matrix.

###### Theorem 2.

[16] Let and . Assume that . If we use random samples from , then is a -spectral approximation of with probability of at least .

We dropped a term inside the logarithm argument above (which does not affect our result). We now use Theorem 2 to obtain the spectral approximation of the kernel matrix in (12). Observe that this matrix is normalized by its dimension (unlike [16]), which necessitates refining some parameters in the theorem above before applying it.

###### Lemma 3.

Let and . Given a fixed scalar , let . Then, with probability at least , we have that is a -spectral approximation of , when and .

###### Proof.

Observe that

so we should apply Theorem 2 for . First, we should check the condition . Since

and the right-hand side is the gram matrix, which has a positive norm independent of . Then, we should verify

which holds since . Now with , we need samples to have

which by simple algebra implies

when . ∎

###### Lemma 4.

###### Proof.

Let us start with the fact that

and

Therefore, since

due to Lemma 3, we derive

(17)

entailing

which finishes the proof. ∎

The above lemma is used to bound the total variation distance between the optimized and empirical distributions, as probability mass functions over the initial features. Note that the form of the leverage score in this case follows from Section 4.2 of [13]. As a result, the optimized distribution with respect to the uniform measure on the initial features is

(18)

and from Algorithm 1 recall that

###### Corollary 5.

Let and . Given a fixed scalar , let . Recall the definition of in (12). Then, with probability at least (over data points), we have that

as long as and .

###### Proof.

Let us start with

Since , the second term in the bound above simplifies to

which is smaller than the first term. Thus, we get

where we applied Lemma 4. Observing that

and plugging it in the bound completes the proof. ∎

### VI-C The error of approximation using a subset of the random features

To bound the error caused by selecting a subset of random features out of the initial pool, we use the approximation error result of [13], adapted to our notation via the following definitions. For any probability mass function $q$, we can define the following class of functions:

(19)

Also, let the approximated kernel be constructed using random features sampled from $q$. Then,

(20)

The following result [13] characterizes the (minimum) distance between these two classes.

###### Proposition 6.

(Approximation of the unit ball of the RKHS for the optimized distribution [13]) Consider the distribution with density defined in equation (18) with respect to the uniform measure on the initial features. Let the random features be sampled i.i.d. from this density, defining the approximated kernel and its associated RKHS in (19). Then, with high probability with respect to the feature samples, we have,

(21)

provided the number of random features is sufficiently large, where the relevant quantity is given in Definition 1, and the integral operator is defined with respect to the corresponding distribution.

We remark that in [13], the result above is stated under a denseness assumption, to avoid zero eigenvalues of the operator. However, as mentioned in Section 2.1 of [13], this assumption can be relaxed (one can generate a sequence of nonzero positive numbers that sum to an infinitesimal number). Since our base class is derived from the uniform measure on the initial features (rather than the whole domain), we can only compare against the corresponding class using [13]. We carry out the remaining comparison in the next section.

### VI-D The approximation error of sampling random features from the empirical distribution

###### Lemma 7.

###### Proof.

For any and , we have