## 1 Introduction

The kernel mean or the mean element, which corresponds to the mean of the kernel function in a reproducing kernel Hilbert space (RKHS) computed w.r.t. some distribution $\mathbb{P}$, has played a fundamental role as a basic building block of many kernel-based learning algorithms [Scholkopf98:NCA, Shawe04:KMPA, Scholkopf01:LKS], and has recently gained increasing attention through the notion of embedding distributions in an RKHS [Berlinet04:RKHS, Smola07Hilbert, Gretton07:MMD, Muandet12:SMM, Song10:KCOND, Muandet13:DG, Muandet13:OCSMM, Sriperumbudur10:Metrics, Fukumizu13a:KBR]. Estimating the kernel mean remains an important problem as the underlying distribution is usually unknown and we must rely entirely on the sample drawn according to $\mathbb{P}$.

Given a random sample $X_1,\dots,X_n$ drawn independently and identically (i.i.d.) from $\mathbb{P}$, the most common way to estimate the kernel mean is by replacing $\mathbb{P}$ by the empirical measure $\mathbb{P}_n := \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}$, where $\delta_x$ is a Dirac measure at $x$ [Berlinet04:RKHS, Smola07Hilbert]. Without any prior knowledge about $\mathbb{P}$, the empirical estimator is possibly the best one can do. However, [Muandet14:KMSE] showed that this estimator can be “improved” by constructing a shrinkage estimator which is a combination of a model with low bias and high variance, and a model with high bias but low variance. Interestingly, significant improvement is in fact possible if the trade-off between these two models is chosen appropriately. The shrinkage estimator proposed in [Muandet14:KMSE], which is motivated by the classical James-Stein shrinkage estimator [Stein81:Multivariate] for the estimation of the mean of a normal distribution, is shown to have a smaller mean-squared error than that of the empirical estimator. These findings provide some support for the conceptual premise that we might be somewhat pessimistic in using the empirical estimator of the kernel mean and there is abundant room for further progress.

In this work, we adopt a spectral filtering approach to obtain shrinkage estimators of kernel mean that improve on the empirical estimator. The motivation behind our approach stems from the idea presented in [Muandet14:KMSE] where the kernel mean estimation is reformulated as an empirical risk minimization (ERM) problem, with the shrinkage estimator being then obtained through penalized ERM. It is important to note that this motivation differs fundamentally from the typical supervised learning as the goal of regularization here is to get the James-Stein-like shrinkage estimators [Stein81:Multivariate] rather than to prevent overfitting. By looking at regularization from a filter function perspective, in this paper, we show that a wide class of shrinkage estimators for kernel mean can be obtained and that these estimators are consistent for an appropriate choice of the regularization/shrinkage parameter.

Unlike in earlier works [Engl96:Reg, Vito061:Spectral, Vito05:RegInv, Baldassarre10:VFL] where the spectral filtering approach has been used in supervised learning problems, we here deal with unsupervised setting and only leverage spectral filtering as a way to construct a shrinkage estimator of the kernel mean. One of the advantages of this approach is that it allows us to incorporate meaningful prior knowledge. The resultant estimators are characterized by the filter function, which can be chosen according to the relevant prior knowledge. Moreover, the spectral filtering gives rise to a broader interpretation of shrinkage through, for example, the notion of early stopping and dimension reduction. Our estimators not only outperform the empirical estimator, but are also simple to implement and computationally efficient.

The paper is organized as follows. In Section 2, we introduce the problem of shrinkage estimation and present a new result that theoretically justifies the shrinkage estimator over the empirical estimator for kernel mean, which improves on the work of [Muandet14:KMSE] while removing some of its drawbacks. Motivated by this result, we consider a general class of shrinkage estimators obtained via spectral filtering in Section 3, whose theoretical properties are presented in Section 4. The empirical performance of the proposed estimators is presented in Section 5. The missing proofs of the results are given in the appendix.

## 2 Kernel mean shrinkage estimator

In this section, we present preliminaries on the problem of shrinkage estimation in the context of estimating the kernel mean [Muandet14:KMSE] and then present a theoretical justification (see Theorem 1) for shrinkage estimators that improves our understanding of the kernel mean estimation problem, while alleviating some of the issues inherent in the estimator proposed in [Muandet14:KMSE].

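Preliminaries: Let $\mathcal{H}$ be an RKHS of functions on a separable topological space $\mathcal{X}$. The space $\mathcal{H}$ is endowed with inner product $\langle\cdot,\cdot\rangle$, associated norm $\|\cdot\|$, and reproducing kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, which we assume to be continuous and bounded, i.e., $\sup_{x\in\mathcal{X}} k(x,x) < \infty$. The kernel mean of some unknown distribution $\mathbb{P}$ on $\mathcal{X}$ and its empirical estimate (we refer to this as *kernel mean estimator*, KME) from i.i.d. sample $X_1,\dots,X_n$ are given by

$$\mu_{\mathbb{P}} := \int_{\mathcal{X}} k(x,\cdot)\,d\mathbb{P}(x) \qquad\text{and}\qquad \hat{\mu}_{\mathbb{P}} := \frac{1}{n}\sum_{i=1}^{n} k(X_i,\cdot), \tag{1}$$

respectively. As mentioned before, $\hat{\mu}_{\mathbb{P}}$ is the “best” possible estimator to estimate $\mu_{\mathbb{P}}$ if nothing is known about $\mathbb{P}$. However, depending on the information that is available about $\mathbb{P}$, one can construct various estimators of $\mu_{\mathbb{P}}$ that perform “better” than $\hat{\mu}_{\mathbb{P}}$. Usually, the performance measure that is used for comparison is the mean-squared error, though alternate measures can be used. Therefore, our main objective is to improve upon KME in terms of the mean-squared error, i.e., construct $\tilde{\mu}$ such that $\mathbb{E}\|\tilde{\mu}-\mu_{\mathbb{P}}\|^2 \le \mathbb{E}\|\hat{\mu}_{\mathbb{P}}-\mu_{\mathbb{P}}\|^2$ for all $\mathbb{P}\in\mathcal{P}$, with strict inequality holding for at least one element in $\mathcal{P}$, where $\mathcal{P}$ is a suitably large class of Borel probability measures on $\mathcal{X}$. Such an estimator $\tilde{\mu}$ is said to be *admissible* w.r.t. $\mathcal{P}$. If $\mathcal{P}$ is the set of all Borel probability measures on $\mathcal{X}$, then $\tilde{\mu}$ satisfying the above conditions may not exist and in that sense, $\hat{\mu}_{\mathbb{P}}$ is possibly the best estimator of $\mu_{\mathbb{P}}$ that one can have.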
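To make the empirical estimator concrete, here is a minimal NumPy sketch that evaluates the empirical kernel mean at query points and computes its squared RKHS norm from the Gram matrix. The Gaussian kernel, the bandwidth `sigma`, and the function names are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Gram matrix with entries exp(-||x - y||^2 / (2 sigma^2))."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kme_evaluate(X, queries, sigma):
    """Evaluate (1/n) sum_i k(X_i, x) at each query point x."""
    return gaussian_gram(queries, X, sigma).mean(axis=1)

def kme_sq_norm(X, sigma):
    """Squared RKHS norm of the empirical mean: (1/n^2) sum_{i,j} k(X_i, X_j)."""
    return gaussian_gram(X, X, sigma).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # sample from a standard normal
vals = kme_evaluate(X, np.array([[0.0, 0.0], [5.0, 5.0]]), sigma=1.0)
```

The estimate is larger at the origin, where the sample concentrates, than at the remote point `(5, 5)`. All quantities needed later (norms, inner products, losses between estimators in the span of the sample) reduce to such Gram-matrix expressions.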

Admissibility of shrinkage estimator: To improve upon KME, motivated by the James-Stein estimator, , [Muandet14:KMSE] proposed a shrinkage estimator where is the shrinkage parameter that balances the low-bias, high-variance model () with the high-bias, low-variance model (). Assuming for simplicity , [Muandet14:KMSE] showed that if and only if where . While this is an interesting result, the resultant estimator is strictly not a “statistical estimator” as it depends on quantities that need to be estimated, i.e., it depends on whose choice requires the knowledge of , which is the quantity to be estimated. We would like to mention that [Muandet14:KMSE] handles the general case with being not necessarily zero, wherein the range for then depends on as well. But for the purposes of simplicity and ease of understanding, for the rest of this paper we assume . Since is not practically interesting, [Muandet14:KMSE] resorted to the following representation of and as solutions to the minimization problems [Muandet14:KMSE, Kim12:RKDE]:

(2) |

using which is shown to be the solution to the regularized empirical risk minimization problem:

(3) |

where and , i.e., . It is interesting to note that unlike in supervised learning (e.g., least squares regression), the empirical minimization problem in (2) is not ill-posed and therefore does not require a regularization term although it is used in (3) to obtain a shrinkage estimator of . [Muandet14:KMSE] then obtained a value for through cross-validation and used it to construct as an estimator of , which is then shown to perform empirically better than . However, no theoretical guarantees including the basic requirement of being consistent are provided. In fact, because is data-dependent, the above mentioned result about the improved performance of over a range of does not hold as such a result is proved assuming is a constant and does not depend on the data. While it is clear that the regularizer in (3) is not needed to make (2) well-posed, the role of is not clear from the point of view of being consistent and better than . The following result provides a theoretical understanding of from these viewpoints.
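As a sanity check on the claim that penalized ERM produces a shrinkage of the empirical estimator, the following short derivation works out the minimizer under the assumption that (3) has the usual ridge-type form, namely the empirical objective of (2) plus a squared-norm penalty. The symbols $J_\lambda$, $g$ and $\lambda$ are notation introduced here for illustration, and $\hat{\mu}$ denotes the empirical kernel mean $\frac{1}{n}\sum_{i} k(X_i,\cdot)$.

$$
\begin{aligned}
J_\lambda(g) &= \frac{1}{n}\sum_{i=1}^{n}\|k(X_i,\cdot)-g\|^2 + \lambda\|g\|^2
 = \frac{1}{n}\sum_{i=1}^{n}k(X_i,X_i) - 2\langle \hat{\mu}, g\rangle + (1+\lambda)\|g\|^2,\\
\nabla J_\lambda(g) &= -2\hat{\mu} + 2(1+\lambda)\,g = 0
 \quad\Longrightarrow\quad
 g = \frac{1}{1+\lambda}\,\hat{\mu} = \Bigl(1-\frac{\lambda}{1+\lambda}\Bigr)\hat{\mu}.
\end{aligned}
$$

Under this assumed form, setting $\lambda = 0$ recovers $\hat{\mu}$ itself, which agrees with the remark that (2) is well-posed without a regularizer, while any $\lambda > 0$ scales the empirical estimator toward zero by a factor strictly between 0 and 1.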

###### Theorem 1.

Let be constructed as in (3). Then the following hold.

(i) as and . In addition, if for some , then .

(ii) For with and , define where . Then and , we have .

###### Remark.

(ii) Suppose for some and , we choose , which means the resultant estimator is a proper estimator as it does not depend on any unknown quantities. Theorem 1(ii) shows that for any and , is a “better” estimator than . Note that for any , . This means is admissible if we restrict to which considers only those distributions for which is strictly less than a constant, . It is obvious to note that if is very small or is very large, then gets closer to one and behaves almost like , thereby matching with our intuition.

(iii) A nice interpretation for can be obtained as in Theorem 1(ii) when $k$ is a translation invariant kernel on $\mathbb{R}^d$. It can be shown that contains the class of all probability measures whose characteristic function has an $L^2$ norm (and therefore is the set of square integrable probability densities if has a density w.r.t. the Lebesgue measure) bounded by a constant that depends on , and (see §B in the appendix).∎

## 3 Spectral kernel mean shrinkage estimator

Let us return to the shrinkage estimator considered in [Muandet14:KMSE], i.e., , where are the countable orthonormal basis (ONB) of —countable ONB exist since is separable which follows from being separable and being continuous [Steinwart:08, Lemma 4.33]. This estimator can be generalized by considering the shrinkage estimator where is a sequence of shrinkage parameters. If is the risk of this estimator, the following theorem gives an optimality condition on for which .

###### Theorem 2.

For some ONB , where and denote the risk of the th component of and , respectively. Then, if

(4) |

where and denote the Fourier coefficients of and , respectively.

The condition in (4) is a component-wise version of the condition given in [Muandet14:KMSE, Theorem 1] for a class of estimators which may be expressed here by assuming that we have a constant shrinkage parameter for all coordinates. Clearly, as the optimal range of the shrinkage parameter may vary across coordinates, the class of estimators in [Muandet14:KMSE] does not allow us to adjust it accordingly. To understand why this property is important, let us consider the problem of estimating the mean of a Gaussian distribution illustrated in Figure 1. For a correlated random variable, a natural choice of basis is the set of orthonormal eigenvectors which diagonalize its covariance matrix. Clearly, the optimal range of the shrinkage parameter depends on the corresponding eigenvalues. Allowing for different bases and shrinkage parameters opens up a wide range of strategies that can be used to construct “better” estimators.

A natural strategy under this representation is as follows: (i) we specify the ONB and project the empirical estimator onto this basis; (ii) we shrink each component independently according to a pre-defined shrinkage rule; (iii) the shrinkage estimate is reconstructed as a superposition of the resulting components. In other words, an ideal shrinkage estimator can be defined formally as a non-linear mapping:

(5) |

where is a shrinkage rule. Since we make no reference to any particular basis , nor to any particular shrinkage rule , a wide range of strategies can be adopted here. For example, we can view *whitening* as a special case in which is the data average and where and are the th eigenvalue and eigenvector of the covariance matrix, respectively.

Inspired by Theorem 2, we adopt the spectral filtering approach as one of the strategies to construct the estimators of the form (5). To this end, owing to the regularization interpretation in (3), we consider estimators of the form for some —looking for such an estimator is equivalent to learning a *signed measure* that is supported on . Since is a minimizer of (3), should satisfy where is an Gram matrix and .
Here the solution is trivially the coefficient vector of the standard estimator if the Gram matrix is invertible. Since its inverse may not exist and, even if it exists, computing it can be numerically unstable, the idea of spectral filtering (which is quite popular in the theory of inverse problems [Engl96:Reg] and has been used in kernel least squares [Vito05:RegInv]) is to replace the inverse by a regularized matrix that approximates it as the regularization parameter goes to zero. Note that unlike in (3), the regularization is quite important here (i.e., for estimators of this form), since without it the linear system is underdetermined. Therefore, we propose the following class of estimators:

(6) |

where is a filter function and is referred to as a shrinkage parameter. The matrix-valued function can be described by a scalar function on the spectrum of . That is, if is the eigen-decomposition of where , we have and . For example, the scalar filter function of Tikhonov regularization is . In the sequel, we call this class of estimators a *spectral kernel mean shrinkage estimator* (Spectral-KMSE).
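The description above (a scalar filter applied to the spectrum of the Gram matrix) can be sketched in a few lines of NumPy. Because the display (6) is not legible in this copy, the sketch assumes that the coefficients take the form `beta = g(K) K 1_n` with `1_n = [1/n, ..., 1/n]`, which is the regularized counterpart of solving `K beta = K 1_n`; the function names and the Gaussian Gram matrix are illustrative only.

```python
import numpy as np

def spectral_kmse_weights(K, scalar_filter):
    """Return beta = g(K) K 1_n, with g applied to the eigenvalues of K."""
    n = K.shape[0]
    ones_n = np.full(n, 1.0 / n)          # assumed: 1_n = [1/n, ..., 1/n]^T
    gamma, U = np.linalg.eigh(K)          # K = U diag(gamma) U^T
    gamma = np.clip(gamma, 0.0, None)     # remove tiny negative round-off
    # g(K) K 1_n = U diag(g(gamma) * gamma) U^T 1_n
    return U @ (scalar_filter(gamma) * gamma * (U.T @ ones_n))

def tikhonov(lam):
    """Scalar Tikhonov filter g(gamma) = 1 / (gamma + lam)."""
    return lambda gamma: 1.0 / (gamma + lam)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)                     # Gaussian Gram matrix, unit bandwidth

beta = spectral_kmse_weights(K, tikhonov(0.1))
```

With the Tikhonov filter this agrees with solving `(K + lam * I) beta = K 1_n` directly, and the weights sum to less than one, reflecting the shrinkage. Swapping in a different scalar filter changes the shrinkage rule without touching the rest of the code.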

###### Proposition 3.

The Spectral-KMSE satisfies , where are eigenvalue and eigenfunction pairs of the empirical covariance operator defined as .

By virtue of Proposition 3, if we choose , the Spectral-KMSE is indeed in the form of (5) when and is the kernel PCA (KPCA) basis, with the filter function determining the shrinkage rule. Since by definition approaches the function as goes to 0, the function approaches 1 (no shrinkage). As the value of increases, we have more shrinkage because the value of deviates from 1, and the behavior of this deviation depends on the filter function . For example, we can see that Proposition 3 generalizes Theorem 2 in [Muandet14:KMSE] where the filter function is , i.e., . That is, we have , implying that the effect of shrinkage is relatively larger in the low-variance direction. In the following, we discuss well-known examples of spectral filtering algorithms obtained by various choices of . Update equations for and corresponding filter functions are summarized in Table 3. Figure 3 illustrates the behavior of these filter functions.

#### L2 Boosting.

This algorithm, also known as gradient descent or Landweber iteration, finds a weight by performing a gradient descent iteratively. Thus, we can interpret *early stopping* as shrinkage and the reciprocal of iteration number as shrinkage parameter, i.e., . The step-size does not play any role for shrinkage [Vito061:Spectral], so we use the fixed step-size throughout.
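The early-stopping view can be illustrated with a small NumPy sketch of the Landweber iteration. It assumes the standard update `beta <- beta + step * (K 1_n - K beta)` started from zero, with `1_n = [1/n, ..., 1/n]`; the paper's own update table is not legible in this copy, so the exact form, the default step size `1 / lambda_max(K)`, and the names below are assumptions.

```python
import numpy as np

def landweber_weights(K, n_iter, step=None):
    """Return the list of iterates beta_1, ..., beta_{n_iter} (the regularization path)."""
    n = K.shape[0]
    ones_n = np.full(n, 1.0 / n)
    if step is None:
        step = 1.0 / np.linalg.eigvalsh(K)[-1]   # 1 / largest eigenvalue of K
    target = K @ ones_n
    beta = np.zeros(n)                           # shrinkage target: the zero function
    path = []
    for _ in range(n_iter):
        beta = beta + step * (target - K @ beta) # gradient step on the residual
        path.append(beta.copy())
    return path

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)

path = landweber_weights(K, 25)
sq_norms = [b @ K @ b for b in path]             # squared RKHS norm of each iterate
```

The squared norms in `sq_norms` grow with the iteration count toward the norm of the empirical estimator, so stopping early yields a shrunk estimate, and the whole path is obtained at the cost of one matrix-vector product per iteration.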

#### Accelerated L2 Boosting.

This algorithm, also known as the $\nu$-method, uses an accelerated gradient descent step, which is faster than L2 Boosting because we only need $\sqrt{t}$ iterations to get the same solution as L2 Boosting would get after $t$ iterations. Consequently, we have $\lambda = t^{-2}$.

#### Iterated Tikhonov.

This algorithm can be viewed as a combination of Tikhonov regularization and gradient descent. Both parameters and play the role of shrinkage parameter.

#### Truncated Singular Value Decomposition.

This algorithm can be interpreted as a projection onto the first principal components of the KPCA basis. Hence, we may interpret *dimensionality reduction* as shrinkage and the size of reduced dimension as shrinkage parameter. This approach has been used in [Song13:LowRank] to improve the kernel mean estimation under the low-rank assumption.
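The projection view of truncated SVD can be sketched as follows. The code assumes, as in the other examples, coefficients of the form `g(K) K 1_n` with `1_n = [1/n, ..., 1/n]`; for TSVD the product `g(gamma) * gamma` equals 1 on the retained eigenvalues and 0 elsewhere, so the weights are the projection of `1_n` onto the leading eigenvectors of `K`. The function name and the `rank` parameter are illustrative.

```python
import numpy as np

def tsvd_weights(K, rank):
    """Project 1_n onto the span of the top-`rank` eigenvectors of K."""
    n = K.shape[0]
    ones_n = np.full(n, 1.0 / n)
    gamma, U = np.linalg.eigh(K)                 # eigenvalues in ascending order
    top = U[:, np.argsort(gamma)[::-1][:rank]]   # eigenvectors of the largest eigenvalues
    return top @ (top.T @ ones_n)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)

beta_low_rank = tsvd_weights(K, rank=3)
beta_full = tsvd_weights(K, rank=K.shape[0])     # no truncation
```

Keeping every component returns the uniform weights of the empirical estimator, while a smaller `rank` discards the low-variance directions entirely rather than damping them smoothly.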

Most of the above spectral filtering algorithms allow us to compute the coefficients without explicitly computing the eigen-decomposition of the Gram matrix, as we can see in Table 3, and some of them may have no natural interpretation in terms of regularized risk minimization. Lastly, an initialization of the coefficients corresponds to the target of shrinkage. In this work, we assume that throughout.

## 4 Theoretical properties of Spectral-KMSE

This section presents some theoretical properties for the proposed Spectral-KMSE in (6). To this end, we first present a regularization interpretation that is different from the one in (3) which involves learning a smooth operator from to [GrunewalderAS13:SO]. This will be helpful to investigate the consistency of the Spectral-KMSE. Let us consider the following regularized risk minimization problem,

(7) |

where is a Hilbert-Schmidt operator from to . Essentially, we are seeking a smooth operator that maps to itself, where (7) is an instance of the regression framework in [GrunewalderAS13:SO]. The formulation of shrinkage as the solution of a smooth operator regression, and the empirical solution (8) and in the lines below, were given in a personal communication by Arthur Gretton. It can be shown that the solution to (7) is given by where is a covariance operator in defined as (see §E of the appendix for a proof). Define . Since is bounded, it is easy to verify that is Hilbert-Schmidt and therefore compact. Hence by the Hilbert-Schmidt theorem, where are the positive eigenvalues and are the corresponding eigenvectors that form an ONB for the range space of denoted as . This implies can be decomposed as . We can observe that the filter function corresponding to the problem (7) is . By extending this approach to other filter functions, we obtain which is equivalent to .

Since is a compact operator, the role of filter function is to regularize the inverse of . In the standard supervised setting, the explicit form of the solution is where is the integral operator of kernel acting in and is the expected solution given by [Vito061:Spectral]. It is interesting to see that admits a similar form to that of , but it is written in terms of the covariance operator instead of the integral operator . Moreover, the solution to (7) is also in a similar form to the regularized conditional embedding [Song10:KCOND]. This connection implies that the spectral filtering may be applied more broadly to improve the estimation of conditional mean embedding, i.e., .

The empirical counterpart of (7) is given by

(8) |

resulting in where , which matches with the one in (6) with . Note that this is exactly the F-KMSE proposed in [Muandet14:KMSE]. Based on which depends on , an empirical version of it can be obtained by replacing and with their empirical estimators leading to . The following result shows that , which means the Spectral-KMSE proposed in (6) is equivalent to solving (8).

###### Proposition 4.

Let and be the sample counterparts of and given by and , respectively. Then, we have that , where is defined in (6).

Having established a regularization interpretation for , it is of interest to study the consistency and convergence rate of similar to KMSE in Theorem 1. Our main goal here is to derive convergence rates for a broad class of algorithms given a set of sufficient conditions on the filter function. We believe that for some algorithms it is possible to derive the best achievable bounds, which requires ad-hoc proofs for each algorithm. To this end, we provide a set of conditions that any *admissible* filter function must satisfy.

###### Definition 1.

A family of filter functions is said to be admissible if there exist finite positive constants , , , and (all independent of ) such that and hold, where .

These conditions are quite standard in the theory of inverse problems [Engl96:Reg, Gerfo08:Spectral]. The constant is called the *qualification* of and is a crucial factor that determines the rate of convergence in inverse problems. As we will see below, the rate of convergence of depends on two factors: (a) smoothness of which is usually unknown as it depends on the unknown and (b) qualification of which determines how well the smoothness of is captured by the spectral filter, .

###### Theorem 5.

Suppose is admissible in the sense of Definition 1. Let . If for some , then for any , with probability at least ,

where denotes the range space of and is some universal constant that does not depend on and . Therefore, with .

Theorem 5 shows that the convergence rate depends on the smoothness of which is imposed through the range space condition that for some . Note that this is in contrast to the estimator in Theorem 1 which does not require any smoothness assumptions on . It can be shown that the smoothness of increases with increase in . This means, irrespective of the smoothness of for , the best possible convergence rate is which matches with that of KMSE in Theorem 1. While the qualification does not seem to directly affect the rates, it controls the rate at which converges to zero. For example, if which corresponds to Tikhonov regularization, it can be shown that which means for , implying that cannot decay to zero slower than . Ideally, one would require a larger (preferably infinity which is the case with truncated SVD) so that the convergence of to zero can be made arbitrarily slow if is large. This way, both and control the behavior of the estimator.

In fact, Theorem 5 provides a choice for —which is what we used in Theorem 1 to study the admissibility of to —to construct the Spectral-KMSE. However, this choice of depends on which is not known in practice (although is known as it is determined by the choice of ). Therefore, is usually learnt from data through cross-validation or through Lepski’s method [Lepski-97] for which guarantees similar to the one presented in Theorem 5 can be provided. However, irrespective of the data-dependent/independent choice for , checking for the admissibility of Spectral-KMSE (similar to the one in Theorem 1) is very difficult and we intend to consider it in future work.

## 5 Empirical studies

#### Synthetic data.

Given the i.i.d. sample from where , we evaluate different estimators using the loss function . The risk of the estimator is subsequently approximated by averaging over independent copies of . In this experiment, we set , , and . Throughout, we use the Gaussian RBF kernel whose bandwidth parameter is calculated using the median heuristic, i.e., . To allow for an analytic calculation of the loss , we assume that the distribution is a -dimensional mixture of Gaussians [Muandet12:SMM, Muandet14:KMSE]. Specifically, the data are generated as follows: where and are the uniform distribution and Wishart distribution, respectively. As in [Muandet14:KMSE], we set .

A natural approach for choosing is a cross-validation procedure, which can be performed efficiently for the iterative methods such as Landweber and accelerated Landweber. For these two algorithms, we evaluate the leave-one-out score and select at the iteration that minimizes this score (see, e.g., Figure 2(a)). Note that these methods have the built-in property of computing the whole *regularization path* efficiently. Since each iteration of the iterated Tikhonov is in fact equivalent to the F-KMSE, we assume for simplicity and use the efficient LOOCV procedure proposed in [Muandet14:KMSE] to find at each iteration. Lastly, the truncation limit of TSVD can be identified efficiently by means of the generalized cross-validation (GCV) procedure [Golub79:GCV]. To allow for an efficient calculation of GCV score, we resort to the alternative loss function .
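For the median heuristic mentioned in the setup above, a minimal NumPy sketch is given below. It assumes the common convention of taking the squared bandwidth to be the median of the squared pairwise distances over distinct pairs; the exact formula used in the experiments is not legible in this copy, and other variants (for example, the median of unsquared distances) are also in use.

```python
import numpy as np

def median_heuristic_sq_bandwidth(X):
    """Median of ||x_i - x_j||^2 over all pairs i < j (assumed convention)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(X.shape[0], k=1)    # strictly upper triangle: distinct pairs
    return float(np.median(sq[iu]))

# Three points on a line: squared distances are 1, 9 and 4.
sigma_sq = median_heuristic_sq_bandwidth(np.array([[0.0], [1.0], [3.0]]))
```

The returned value would then be plugged in as the squared bandwidth of the Gaussian RBF kernel.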

Figure 2 reveals interesting aspects of the Spectral-KMSE. Firstly, as we can see in Figure 2(a), the number of iterations acts as shrinkage parameter whose optimal value can be attained within just a few iterations. Moreover, these methods do not suffer from “over-shrinking” because as . In other words, if the chosen happens to be too large, the worst we can get is the standard empirical estimator. Secondly, Figure 2(b) demonstrates that both Landweber and accelerated Landweber are more computationally efficient than the F-KMSE. Lastly, Figure 2(c) suggests that the improvement of shrinkage estimators becomes increasingly remarkable in a high-dimensional setting. Interestingly, we can observe that most Spectral-KMSE algorithms outperform the S-KMSE, which supports our hypothesis on the importance of the geometric information of RKHS mentioned in Section 3. In addition, although the TSVD still gains from shrinkage, the improvement is smaller than for the other algorithms. This highlights the importance of filter functions and associated parameters.

#### Real data.

We apply Spectral-KMSE to the density estimation problem via kernel mean matching [Song08:TDE, Muandet14:KMSE]. The datasets were taken from the UCI repository (http://archive.ics.uci.edu/ml/) and pre-processed by standardizing each feature. Then, we fit a mixture model to the pre-processed dataset by minimizing subject to the constraint . Here is the mean embedding of the mixture model and is the empirical mean embedding obtained from . Based on different estimators of , we evaluate the resultant model by the negative log-likelihood score on the test data. The parameters are initialized by the best one obtained from the $K$-means algorithm with 50 initializations. Throughout, we set and use 25% of each dataset as a test set.

Dataset | KME | S-KMSE | F-KMSE | Landweber | Acc Land | Iter Tik | TSVD |
---|---|---|---|---|---|---|---|
ionosphere | 36.1769 | 36.1402 | 36.1622 | 36.1204 | 36.1554 | 36.1334 | 36.1442 |
glass | 10.7855 | 10.7403 | 10.7448 | 10.7099 | 10.7541 | 10.9078 | 10.7791 |
bodyfat | 18.1964 | 18.1158 | 18.1810 | 18.1607 | 18.1941 | 18.1267 | 18.1061 |
housing | 14.3016 | 14.2195 | 14.0409 | 14.2499 | 14.1983 | 14.2868 | 14.3129 |
vowel | 13.9253 | 13.8426 | 13.8817 | 13.8337 | 14.1368 | 13.8633 | 13.8375 |
svmguide2 | 28.1091 | 28.0546 | 27.9640 | 28.1052 | 27.9693 | 28.0417 | 28.1128 |
vehicle | 18.5295 | 18.3693 | 18.2547 | 18.4873 | 18.3124 | 18.4128 | 18.3910 |
wine | 16.7668 | 16.7548 | 16.7457 | 16.7596 | 16.6790 | 16.6954 | 16.5719 |
wdbc | 35.1916 | 35.1814 | 35.0023 | 35.1402 | 35.1366 | 35.1881 | 35.1850 |

Table 1 reports the results on real data. In general, the mixture models obtained from the proposed shrinkage estimators tend to achieve a lower negative log-likelihood score than that obtained from the standard empirical estimator. Moreover, we can observe that the relative performance of different filter functions varies across datasets, suggesting that, in addition to potential gain from shrinkage, incorporating prior knowledge through the choice of filter function could lead to further improvement.

## 6 Conclusion

We show that several shrinkage strategies can be adopted to improve the kernel mean estimation. This paper considers the spectral filtering approach as one such strategy. Compared to previous work [Muandet14:KMSE], our estimators take into account the specifics of kernel methods and meaningful prior knowledge through the choice of filter functions, resulting in a wider class of shrinkage estimators. The theoretical analysis also reveals a fundamental similarity to the standard supervised setting. Our estimators are simple to implement and work well in practice, as evidenced by the empirical results.

### Acknowledgments

The first author thanks Ingo Steinwart for pointing out existing works along the line of spectral filtering, and Arthur Gretton for suggesting the connection of shrinkage to smooth operator framework. This work was carried out when the second author was a Research Fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge.

## Appendix A Proof of Theorem 1

(i) Since , we have

From [Gretton07:MMD], we have that and therefore the result follows.

(ii) Define . Consider

Substituting for in the r.h.s. of the above equation, we have

It is easy to verify that if

###### Remark.

If , then it is easy to check that where and represent the mean vector and covariance matrix. Note that this choice of kernel yields a setting similar to classical James-Stein estimation, wherein for all and all , is admissible for any , where . On the other hand, the James-Stein estimator is admissible for only but for any .

## Appendix B Consequence of Theorem 1 if $k$ is translation invariant

Claim: Let where is a bounded continuous positive definite function with . For with and , define

where is the characteristic function of . Then and , we have