Multiple Kernel Learning from U-Statistics of Empirical Measures in the Feature Space

02/27/2019 ∙ by Masoud Badiei Khuzani, et al. ∙ Stanford University

We propose a novel data-driven method to learn multiple kernels in kernel methods of statistical machine learning from training samples. The proposed kernel learning algorithm is based on a U-statistic of the empirical marginal distributions of features in the feature space given their class labels. We prove the consistency of the U-statistic estimate using the empirical distributions for kernel learning. In particular, we show that the empirical estimate of the U-statistic converges to its population value with respect to all admissible distributions as the number of training samples increases. We also prove the sample optimality of the estimate by establishing a minimax lower bound via Fano's method. In addition, we establish generalization bounds for the proposed kernel learning approach by computing novel upper bounds on the Rademacher and Gaussian complexities using the concentration of measure for quadratic matrix forms. We apply the proposed kernel learning approach to classification of real-world data-sets using the kernel SVM and compare the results with 5-fold cross-validation for the kernel model selection problem. We also apply the proposed kernel learning approach to devise novel architectures for the semantic segmentation of biomedical images. The proposed segmentation networks are suited for training on small data-sets and employ new mechanisms to generate representations from input images.


1. Introduction

Kernel methods are one of the most pervasive techniques to capture the non-linear relationship between the representations of input data and labels in statistical machine learning algorithms. Kernel methods circumvent the explicit feature mapping that is required to learn a non-linear function or decision boundary in linear learning algorithms. Instead, they rely only on the inner product of feature maps in the feature space, which is often known as the “kernel trick” in the machine learning literature. For large-scale classification problems, however, the implicit lifting provided by the kernel trick comes at the cost of prohibitive computational and memory complexities, as the kernel Gram matrix must be generated by evaluating the kernel function across all pairs of datapoints. As a result, large training sets incur large computational and storage costs. To alleviate this issue, Rahimi and Recht proposed random Fourier features, which aim to approximate the kernel function via explicit random feature maps [2, 3].

Although kernel SVMs with random features are typically formulated as convex optimization problems, which have a single global optimum and thus do not require heuristic choices of learning rates, starting configurations, or other hyper-parameters, the kernel approach still involves a statistical model selection problem due to the choice of the kernel. In practice, it can be difficult for the practitioner to find prior justification for the use of one kernel instead of another to generate the random feature maps. It would be more desirable to explore data-driven mechanisms that allow kernels to be chosen automatically.

While sophisticated model selection methods such as cross-validation, the jackknife, or their approximate surrogates [4, 5, 6] can address those model selection issues, they often slow down the training process as they repeatedly re-fit the model. Therefore, to facilitate kernel model selection, Sinha and Duchi [7] proposed a novel scheme to learn the kernel from the training samples via robust optimization of the kernel-target alignment [8]. Cortes, et al. [9] also present a kernel learning algorithm based on the notion of centered alignment, a similarity measure between kernels or kernel matrices. In a separate study, [10] also studied the generalization bounds for learning kernels from a mixture class of base kernels. The same multiple kernel learning scheme has recently been revisited by Shahrampour and Tarokh [11].

An alternative approach to kernel learning is to sample random features from an arbitrary distribution with arbitrary hyper-parameters (i.e., no tuning) and then apply a supervised feature screening method to distill random features with high discriminative power from redundant random features. This approach has been adopted by Shahrampour et al. [12], where an energy-based exploration of random features is proposed. The feature selection algorithm in [12] employs a score function based on the kernel polarization techniques of [13] to explore the domain of random features and retains the samples with the highest scores. Therefore, in the context of feature selection, the scoring rule of [12] belongs to the class of filter methods [14], i.e., it is based on intrinsic properties of the data, independent of the choice of the classifier (e.g., logistic regression or SVM). In addition, the number of selected random features in [12] is a hyper-parameter of the feature selection algorithm which must be determined by the user or via cross-validation.

In this paper, we propose a novel method for kernel learning from the mixture class of a set of known base kernels. In contrast to previous kernel learning approaches, which either use a scoring rule or optimize the kernel-target alignment, our approach is based on the maximal separation between the empirical marginal distributions of features given their class labels. More specifically, we select the mixing coefficients of the kernel so as to maximize the distance between the empirical distributions as measured by the maximum mean discrepancy (MMD). We analyze the consistency of the MMD estimate based on the empirical distributions. In particular, we prove that a biased MMD estimator based on the empirical distributions converges to the MMD of the population measures. We also prove minimax lower bounds for the MMD estimators, i.e., we prove a lower bound on the difference between the population MMD and its estimator that holds for all admissible MMD estimators. In addition, we prove novel generalization bounds based on the notions of Rademacher and Gaussian complexities of the class of functions defined by random features, which improve upon the previous bounds of Cortes, et al. [10].

We leverage our kernel learning approach to develop two novel architectures for the semantic segmentation of biomedical images using kernel SVMs. Specifically, in the first architecture, we leverage the VGG network that is pre-trained on natural images in conjunction with a kernel feature selection method to extract semantic features. In the second architecture, we extract features via Mallat’s scattering convolution neural networks (SCNNs) [15]. The scattering network uses a tree-like feature extractor in conjunction with pre-specified wavelet transforms. Each node in the tree is a filter that is obtained by scaling and rotating an atom, so as to construct translation-invariant representations from the input images. We then validate our kernel learning algorithm and evaluate the performance of our proposed segmentation networks for Angiodysplasia segmentation from colonoscopy images of the GIANA challenge dataset. Angiodysplasia is an abnormality of the blood vessels in the gastrointestinal (GI) tract. The GI tract includes the mouth, esophagus, small and large intestines, and stomach. This condition causes swollen or enlarged blood vessels, as well as the formation of bleeding lesions in the colon and stomach. The gold-standard examination for angiodysplasia detection and localization in the small bowel is Wireless Capsule Endoscopy (WCE). As a benchmark for comparison, we also evaluate the segmentation performance of the fully convolutional network (FCN) [16], a popular architecture for the semantic segmentation of natural images. We compare our segmentation results with those of the FCN in terms of the intersection-over-union (IoU) metric. We show that while the FCN over-fits the small dataset, our proposed segmentation architecture provides accurate segmentation results.
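For reference, the IoU metric used above can be computed for a pair of binary segmentation masks as in the following minimal sketch; the array names and toy masks are illustrative and not part of the original evaluation pipeline.

```python
import numpy as np

def iou(pred_mask: np.ndarray, true_mask: np.ndarray, eps: float = 1e-8) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return float(intersection / (union + eps))

# Toy example: two overlapping square "lesions" on a 100x100 image.
pred = np.zeros((100, 100)); pred[20:60, 20:60] = 1
true = np.zeros((100, 100)); true[30:70, 30:70] = 1
print(f"IoU = {iou(pred, true):.3f}")
```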

1.1. Prior Works on Kernel Learning Methods

There exists a copious theoretical literature on kernel learning methods [17], [10]. Cortes, et al. [10] studied a kernel learning procedure from the class of mixtures of base kernels. They also studied the generalization bounds of the proposed methods via the notion of Rademacher complexity. The same authors studied a two-stage kernel learning procedure in [18] based on a notion of alignment. The first stage of this technique consists of learning a kernel that is a convex combination of a set of base kernels. The second stage consists of using the learned kernel with a standard kernel-based learning algorithm such as SVMs to select a prediction hypothesis. In [19], the authors proposed a semi-definite program for the kernel-target alignment problem. An alternative approach to kernel learning is to sample random features from an arbitrary distribution (without tuning its hyper-parameters) and then apply a supervised feature screening method to distill random features with high discriminative power from redundant random features. This approach has been adopted by Shahrampour et al. [12], where an energy-based exploration of random features is proposed. The feature selection algorithm in [12] employs a score function based on the kernel polarization techniques of [13] to explore the domain of random features and retains the samples with the highest scores.

This paper is closely related to the work of Sinha and Duchi [7]. Therein, the authors study a kernel learning method via maximizing the kernel-target alignment with respect to non-parametric distributions in the ball of a (user-defined) base distribution defined by the $f$-divergence distance. Therefore, the proposed kernel alignment in [7] has two hyper-parameters, namely, the radius of the ball, which determines the size of the distribution class, and the base distribution, which determines the center of the ball.

The rest of this paper is organized as follows. In Section 2, we present some background on kernel methods in classification and regression problems. In Section 3, we review the notion of the kernel-target alignment. In Section 4, we relate the kernel-target alignment to the notion of a two-sample test for discriminating between two distributions. We also provide a kernel selection procedure. In Section 5, we state the main assumptions and provide the generalization bounds and performance guarantees, while deferring the proofs to the appendices. In Section 7, we leverage the kernel learning algorithm to devise novel architectures for semantic segmentation. In Section 8, we evaluate the performance of the proposed segmentation networks for localization and delineation of Angiodysplasia from colonoscopy images. In Section 9, we discuss future work and conclude the paper.

2. Preliminaries

In this section, we review preliminaries of kernel methods in classification and regression problems.

Notation and Definitions. We denote vectors by bold lower-case letters, e.g., $\boldsymbol{x} = (x_1, \ldots, x_d)$, and matrices by bold capital letters, e.g., $\boldsymbol{M} = [M_{ij}]$. The unit sphere in $d$ dimensions centered at the origin is denoted by $\mathbb{S}^{d-1}$. We denote the $n$-by-$n$ identity matrix by $\boldsymbol{I}_{n}$, and the vector of all ones by $\boldsymbol{1}_{n}$. For a symmetric matrix $\boldsymbol{M} \in \mathbb{R}^{n \times n}$, let $\|\boldsymbol{M}\|_{2}$ denote its spectral norm, and $\|\boldsymbol{M}\|_{F}$ denote its Frobenius norm. The eigenvalues of the matrix $\boldsymbol{M}$ are ordered and denoted by $\lambda_{1}(\boldsymbol{M}) \geq \lambda_{2}(\boldsymbol{M}) \geq \cdots \geq \lambda_{n}(\boldsymbol{M})$. We alternatively write $\lambda_{\min}(\boldsymbol{M})$ and $\lambda_{\max}(\boldsymbol{M})$ for the minimum and maximum eigenvalues of the matrix $\boldsymbol{M}$, respectively. The empirical spectral measure of the matrix $\boldsymbol{M}$ is denoted by

$\mu_{n} \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} \delta_{\lambda_{i}(\boldsymbol{M})},$   (2.1)

where $\delta_{x}$ is Dirac’s measure concentrated at $x$. We denote the limiting spectral distribution of $(\mu_{n})_{n \in \mathbb{N}}$ by $\mu_{\infty}$, and its support by $\mathrm{supp}(\mu_{\infty})$.

The Wigner semi-circle law is denoted by $\mu_{\mathrm{sc}}$ and is defined as below

$\mu_{\mathrm{sc}}(\mathrm{d}x) \overset{\text{def}}{=} \frac{1}{2\pi}\sqrt{4 - x^{2}}\, \mathbb{1}_{\{|x| \leq 2\}}\, \mathrm{d}x,$   (2.2)

where $\mathbb{1}_{A}$ is the indicator function of the set $A$. Let $\boldsymbol{X} \in \mathbb{R}^{n \times d}$ be a random matrix whose entries are independent identically distributed random variables with zero mean and variance $\sigma^{2}$. Let $c \overset{\text{def}}{=} d/n$ and $c \in (0, 1]$. The limiting spectral distribution of the covariance matrix $\frac{1}{n}\boldsymbol{X}^{\mathsf{T}}\boldsymbol{X}$ is given by the Marčenko–Pastur law [20],

$\mu_{\mathrm{MP}}(\mathrm{d}x) \overset{\text{def}}{=} \frac{1}{2\pi \sigma^{2} c\, x}\sqrt{(b_{+} - x)(x - b_{-})}\, \mathbb{1}_{[b_{-}, b_{+}]}(x)\, \mathrm{d}x,$   (2.3)

where

$b_{\pm} \overset{\text{def}}{=} \sigma^{2}\big(1 \pm \sqrt{c}\big)^{2},$   (2.4)

and $b_{-} \leq x \leq b_{+}$. The free additive convolution between two laws $\mu$ and $\nu$ in the sense of Voiculescu [21] is denoted by $\mu \boxplus \nu$. Similarly, the free multiplicative convolution is denoted by $\mu \boxtimes \nu$.

Definition 2.1.

(Orlicz Norm) The Young–Orlicz modulus is a convex non-decreasing function $\psi : \mathbb{R}_{+} \to \mathbb{R}_{+}$ such that $\psi(0) = 0$ and $\psi(x) \to \infty$ when $x \to \infty$. Accordingly, the Orlicz norm of an integrable random variable $X$ with respect to the modulus $\psi$ is defined as

$\|X\|_{\psi} \overset{\text{def}}{=} \inf\big\{ \beta > 0 : \mathbb{E}[\psi(|X| / \beta)] \leq 1 \big\}.$   (2.5)

In the sequel, we consider the Orlicz modulus $\psi_{\nu}(x) \overset{\text{def}}{=} \exp(x^{\nu}) - 1$. Accordingly, the cases of $\psi_{2}$ and $\psi_{1}$ norms are called the sub-Gaussian and the sub-exponential norms, respectively, and have the following alternative definitions:

Definition 2.2.

(Sub-Gaussian Norm) The sub-Gaussian norm of a random variable $X$, denoted by $\|X\|_{\psi_{2}}$, is defined as

$\|X\|_{\psi_{2}} \overset{\text{def}}{=} \sup_{q \geq 1} \frac{1}{\sqrt{q}}\big(\mathbb{E}|X|^{q}\big)^{1/q}.$   (2.6)

For a random vector $\boldsymbol{X} \in \mathbb{R}^{d}$, its sub-Gaussian norm is defined as

$\|\boldsymbol{X}\|_{\psi_{2}} \overset{\text{def}}{=} \sup_{\boldsymbol{x} \in \mathbb{S}^{d-1}} \|\langle \boldsymbol{X}, \boldsymbol{x}\rangle\|_{\psi_{2}}.$   (2.7)
Definition 2.3.

(Sub-exponential Norm) The sub-exponential norm of a random variable $X$, denoted by $\|X\|_{\psi_{1}}$, is defined as follows

$\|X\|_{\psi_{1}} \overset{\text{def}}{=} \sup_{q \geq 1} \frac{1}{q}\big(\mathbb{E}|X|^{q}\big)^{1/q}.$   (2.8)

For a random vector $\boldsymbol{X} \in \mathbb{R}^{d}$, its sub-exponential norm is defined as

$\|\boldsymbol{X}\|_{\psi_{1}} \overset{\text{def}}{=} \sup_{\boldsymbol{x} \in \mathbb{S}^{d-1}} \|\langle \boldsymbol{X}, \boldsymbol{x}\rangle\|_{\psi_{1}}.$   (2.9)

For a given convex lower semi-continuous function $f : \mathbb{R}_{+} \to \mathbb{R}$ with $f(1) = 0$, the $f$-divergence between two Borel probability distributions $P, Q \in \mathcal{B}(\mathcal{X})$ with $P \ll Q$ is defined as follows

$D_{f}(P \,\|\, Q) \overset{\text{def}}{=} \int_{\mathcal{X}} f\Big(\frac{\mathrm{d}P}{\mathrm{d}Q}\Big)\, \mathrm{d}Q.$

We say $P$ is absolutely continuous w.r.t. $Q$, denoted by $P \ll Q$, if for all the subsets $A \subseteq \mathcal{X}$ satisfying $Q(A) = 0$, we have $P(A) = 0$. Here, $\mathrm{d}P / \mathrm{d}Q$ is the Radon–Nikodym derivative of $P$ with respect to $Q$. In the case that $f(x) = x\log x$, the $f$-divergence corresponds to the Kullback–Leibler (KL) divergence $D_{\mathrm{KL}}(P \,\|\, Q)$. For $f(x) = (x - 1)^{2}$, the $f$-divergence is equivalent to the $\chi^{2}$-divergence $\chi^{2}(P \,\|\, Q)$. Lastly, $f(x) = \frac{1}{2}|x - 1|$ corresponds to the total variation distance $\mathrm{TV}(P, Q)$.

We use asymptotic notations throughout the paper. We use the standard asymptotic notation for sequences. If $(a_n)$ and $(b_n)$ are positive sequences, then $a_n = \mathcal{O}(b_n)$ means that $\limsup_{n \to \infty} a_n / b_n < \infty$, whereas $a_n = \Omega(b_n)$ means that $\liminf_{n \to \infty} a_n / b_n > 0$. Furthermore, $a_n = o(b_n)$ implies $\lim_{n \to \infty} a_n / b_n = 0$. Moreover, $a_n = \widetilde{\mathcal{O}}(b_n)$ means that $a_n = \mathcal{O}(b_n)$ up to poly-logarithmic factors in $n$, and $a_n = \widetilde{\Omega}(b_n)$ means that $a_n = \Omega(b_n)$ up to poly-logarithmic factors in $n$. Lastly, we have $a_n = \Theta(b_n)$ if $a_n = \mathcal{O}(b_n)$ and $a_n = \Omega(b_n)$. Finally, for positive scalars $a$ and $b$, we denote $a \lesssim b$ if $a / b$ is at most some universal constant.

2.1. Regularized Risk Minimization in Reproducing Kernel Hilbert Spaces

In this paper we focus on the classical setting of supervised learning, whereby we are given $n$ feature vectors $\boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{n} \in \mathcal{X} \subseteq \mathbb{R}^{d}$ and their corresponding univariate class labels $y_{1}, \ldots, y_{n} \in \mathcal{Y}$. We are concerned with binary classification; therefore, the target space is given by $\mathcal{Y} = \{-1, +1\}$.

In the following definition, we characterize the notion of kernel functions and reproducing kernel Hilbert spaces:

Definition 2.4.

(Kernel Function [22]) Let $\mathcal{X}$ be a non-empty set. Then a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a kernel function on $\mathcal{X}$ if there exists a Hilbert space $\mathcal{H}$ over $\mathbb{R}$, and a map $\varphi : \mathcal{X} \to \mathcal{H}$ such that for all $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{X}$, we have

$K(\boldsymbol{x}, \boldsymbol{y}) = \langle \varphi(\boldsymbol{x}), \varphi(\boldsymbol{y}) \rangle_{\mathcal{H}},$   (2.10)

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product in the Hilbert space $\mathcal{H}$.

Some examples of reproducing kernels on $\mathcal{X} \subseteq \mathbb{R}^{d}$ (in fact all of these are radial) that appear throughout the paper are listed below; a small numerical sketch follows the list:

  • Gaussian: The Gaussian kernel is given by $K(\boldsymbol{x}, \boldsymbol{y}) = \exp\big(-\|\boldsymbol{x} - \boldsymbol{y}\|_{2}^{2} / 2\sigma^{2}\big)$, where $\sigma > 0$ is the bandwidth parameter.

  • ANOVA: The ANOVA kernel is defined by $K(\boldsymbol{x}, \boldsymbol{y}) = \sum_{k=1}^{d} \exp\big(-\sigma (x_{k} - y_{k})^{2}\big)^{p}$ and performs well in multi-dimensional regression problems.

  • Laplacian: The Laplacian kernel is similar to the Gaussian kernel, except that it is less sensitive to the bandwidth parameter. In particular, $K(\boldsymbol{x}, \boldsymbol{y}) = \exp\big(-\|\boldsymbol{x} - \boldsymbol{y}\|_{2} / \sigma\big)$.
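The following minimal sketch evaluates the three kernels above on a pair of points; the bandwidth $\sigma$ and the ANOVA exponent $p$ are illustrative hyper-parameter choices, not values prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def anova_kernel(x, y, sigma=1.0, p=2):
    # K(x, y) = sum_k exp(-sigma (x_k - y_k)^2)^p, a sum over coordinates
    return np.sum(np.exp(-sigma * (x - y) ** 2) ** p)

def laplacian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||_2 / sigma)
    return np.exp(-np.linalg.norm(x - y) / sigma)

x = np.array([0.2, -1.0, 0.5])
y = np.array([0.0, -0.8, 0.9])
print(gaussian_kernel(x, y), anova_kernel(x, y), laplacian_kernel(x, y))
```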

Definition 2.5.

(Characteristic Kernel [22]) We say that an RKHS $\mathcal{H}$ with kernel $K$ is characteristic if the mean embedding map $P \mapsto \mathbb{E}_{\boldsymbol{x} \sim P}[K(\boldsymbol{x}, \cdot)] \in \mathcal{H}$, defined on the set of Borel probability measures, is injective.

If the kernel is bounded, Definition 2.5 is equivalent to saying that $\mathcal{H}$ is dense in $L^{2}(P)$ for any Borel probability measure $P$ [23]. Two examples of characteristic kernels are the Gaussian and Laplacian kernels.

Recall that a positive definite kernel is a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for any $n \in \mathbb{N}$, for any finite set of points $\boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{n}$ in $\mathcal{X}$ and real numbers $c_{1}, \ldots, c_{n}$, we have $\sum_{i=1}^{n}\sum_{j=1}^{n} c_{i}\, c_{j}\, K(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}) \geq 0$. For any symmetric and positive semi-definite kernel $K$, by Mercer’s theorem [24], there exists: (i) a unique functional Hilbert space $\mathcal{H}$ on $\mathcal{X}$, referred to as the reproducing kernel Hilbert space (see Definition 2.6), such that $K(\cdot, \cdot)$ is the inner product in the space, and (ii) a map $\varphi : \mathcal{X} \to \mathcal{H}$ defined as $\varphi(\boldsymbol{x}) \overset{\text{def}}{=} K(\boldsymbol{x}, \cdot)$ that satisfies Definition 2.4. The function $\varphi$ is called the feature map and the Hilbert space $\mathcal{H}$ is often called the feature space.

Definition 2.6.

(Reproducing Kernel Hilbert Space [22]) A kernel $K(\cdot, \cdot)$ is a reproducing kernel of a Hilbert space $\mathcal{H}$ if $f(\boldsymbol{x}) = \langle f, K(\boldsymbol{x}, \cdot) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$ and $\boldsymbol{x} \in \mathcal{X}$. For a (compact) subset $\mathcal{X} \subseteq \mathbb{R}^{d}$, and a Hilbert space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$, we say $\mathcal{H}$ is a reproducing kernel Hilbert space if there exists a kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $K$ has the reproducing property, and $\mathcal{H} = \overline{\mathrm{span}\{K(\boldsymbol{x}, \cdot) : \boldsymbol{x} \in \mathcal{X}\}}$, where $\overline{A}$ denotes the completion of a set $A$.

Given a loss function $L : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_{+}$ (e.g., the hinge loss for SVM, or the quadratic loss for linear regression), and a reproducing kernel Hilbert space $\mathcal{H}$, a classifier is selected by minimizing the empirical risk with a quadratic regularization,

$\widehat{f} \in \arg\min_{f \in \mathcal{H}} \ \mathbb{E}_{\widehat{P}^{n}}\big[L(f(\boldsymbol{x}), y)\big] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^{2},$   (2.11)

where $\|\cdot\|_{\mathcal{H}}$ is the RKHS norm, $\lambda > 0$ is a regularization parameter, and $\widehat{P}^{n}$ is the joint empirical measure, i.e.,

$\widehat{P}^{n}(\boldsymbol{x}, y) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} \delta_{\boldsymbol{x}_{i}}(\boldsymbol{x})\, \delta_{y_{i}}(y),$   (2.12)

where $\delta_{\boldsymbol{x}_{i}}$ is Dirac’s delta function concentrated at $\boldsymbol{x}_{i}$. The Representer Theorem, due to Wahba [25] for quadratic loss functions, and its generalization due to Schölkopf, Herbrich, and Smola [26] for general losses, states that the optimal solution of the empirical loss minimization in Eq. (2.11) admits the following representation

$\widehat{f}(\boldsymbol{x}) = \sum_{i=1}^{n} \alpha_{i}\, K(\boldsymbol{x}, \boldsymbol{x}_{i}) + b,$   (2.13)

where $\alpha_{i} \in \mathbb{R}$ for all $i = 1, \ldots, n$, and $b \in \mathbb{R}$ is an offset term. Moreover, we have that $\|\widehat{f}\|_{\mathcal{H}}^{2} = \boldsymbol{\alpha}^{\mathsf{T}}\boldsymbol{K}\boldsymbol{\alpha}$, where $\boldsymbol{K} \in \mathbb{R}^{n \times n}$ is the kernel matrix with columns $\boldsymbol{K}_{i} = \big(K(\boldsymbol{x}_{1}, \boldsymbol{x}_{i}), \ldots, K(\boldsymbol{x}_{n}, \boldsymbol{x}_{i})\big)$, $i = 1, \ldots, n$. The key insight provided by the Representer Theorem is that even when the Hilbert space $\mathcal{H}$ is high-dimensional or infinite-dimensional, we can solve (2.11) in polynomial time via reducing the empirical loss minimization in Eq. (2.11) to a finite-dimensional optimization over the spanning vector $\boldsymbol{\alpha} = (\alpha_{1}, \ldots, \alpha_{n})$, namely,

$\min_{\boldsymbol{\alpha} \in \mathbb{R}^{n},\, b \in \mathbb{R}} \ \frac{1}{n}\sum_{i=1}^{n} L\Big(\sum_{j=1}^{n} \alpha_{j}\, K(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}) + b,\ y_{i}\Big) + \frac{\lambda}{2}\boldsymbol{\alpha}^{\mathsf{T}}\boldsymbol{K}\boldsymbol{\alpha}.$   (2.14)

For small training data-sets, the optimization problem in Eq. (2.14) is solvable in polynomial time using stochastic or batch gradient descent. However, for large training data-sets, solving the empirical risk minimization in Eq. (2.14) is cumbersome since evaluating the kernel expansion (2.13) at a test point requires $\mathcal{O}(n)$ operations. Moreover, the computational complexity of preparing the kernel matrix $\boldsymbol{K}$ during the training phase increases quadratically with the sample size $n$.
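As a concrete illustration of these costs (not part of the paper's algorithm), the following sketch solves a hinge-loss instance of Eq. (2.14) with scikit-learn's SVC on a precomputed Gram matrix; the data set, bandwidth, and regularization constant are placeholder choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=n))

# Forming the n-by-n Gram matrix costs O(n^2 d) time and O(n^2) memory.
K_train = rbf_kernel(X, X, gamma=0.1)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y)

# Predicting at a test point requires kernel evaluations against the full
# training set, i.e. O(n) evaluations per point (the expansion in Eq. (2.13)).
X_test = rng.normal(size=(20, d))
K_test = rbf_kernel(X_test, X)      # shape (n_test, n_train)
print(clf.predict(K_test))
```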

To address this problem, Rahimi and Recht proposed random Fourier features [2] to generate low-dimensional embeddings of shift-invariant kernels using explicit feature maps. An alternative approach to develop scalable kernel methods is the so-called Nyström approximation [27], which aims to construct a small sketch of the kernel matrix and use it in lieu of the full matrix to solve Problem (2.14). In particular, let $\varphi : \mathcal{X} \times \Omega \to \mathbb{R}$ be the explicit feature map, where $\Omega$ is the support set of the random features. Then, the kernel has the following representation

$K(\boldsymbol{x}, \boldsymbol{y}) = \int_{\Omega} \varphi(\boldsymbol{x}; \boldsymbol{\xi})\, \varphi(\boldsymbol{y}; \boldsymbol{\xi})\, \mu(\mathrm{d}\boldsymbol{\xi})$   (2.15)
$= \mathbb{E}_{\boldsymbol{\xi} \sim \mu}\big[\varphi(\boldsymbol{x}; \boldsymbol{\xi})\, \varphi(\boldsymbol{y}; \boldsymbol{\xi})\big],$   (2.16)

where $\mu \in \mathcal{B}(\Omega)$ is a probability measure, and $\mathcal{B}(\Omega)$ is the set of Borel measures with the support set $\Omega$. In the standard framework of random Fourier features proposed by Rahimi and Recht [2], $\varphi(\boldsymbol{x}; \boldsymbol{\xi}) = \sqrt{2}\cos(\langle \boldsymbol{w}, \boldsymbol{x}\rangle + b)$, where $\boldsymbol{\xi} = (\boldsymbol{w}, b)$, $\boldsymbol{w} \sim \mu_{\boldsymbol{w}}$, and $b \sim \mathrm{Uniform}[0, 2\pi]$. In this case, by Bochner’s Theorem [28], $\mu_{\boldsymbol{w}}$ is indeed the Fourier transform of the shift-invariant kernel $K(\boldsymbol{x} - \boldsymbol{y})$.

For training purposes, the expression in Eq. (2.15) is approximated using Monte Carlo sampling. In particular, let $\boldsymbol{\xi}_{1}, \ldots, \boldsymbol{\xi}_{D} \sim \mu$ be i.i.d. samples. Then, the kernel function can be approximated by the sample average of the expectation in Eq. (2.16). Specifically, the following point-wise estimate has been shown in [2]:

$\mathbb{P}\bigg[\sup_{\boldsymbol{x}, \boldsymbol{y} \in \mathcal{X}} \Big|\frac{1}{D}\sum_{k=1}^{D} \varphi(\boldsymbol{x}; \boldsymbol{\xi}_{k})\, \varphi(\boldsymbol{y}; \boldsymbol{\xi}_{k}) - K(\boldsymbol{x}, \boldsymbol{y})\Big| \geq \epsilon\bigg] \leq 2^{8}\Big(\frac{\sigma_{\mu}\, \mathrm{diam}(\mathcal{X})}{\epsilon}\Big)^{2}\exp\Big(-\frac{D\epsilon^{2}}{4(d + 2)}\Big),$   (2.17)

where typically $D \ll n$. In the preceding inequality, $\mathcal{X} \subset \mathbb{R}^{d}$ is assumed to be a compact subset with the diameter $\mathrm{diam}(\mathcal{X})$, and $\sigma_{\mu}^{2} \overset{\text{def}}{=} \mathbb{E}_{\mu}[\|\boldsymbol{w}\|_{2}^{2}]$ is the second moment of the random frequencies.
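The sample-average approximation appearing in Eq. (2.17) can be sketched as follows for the Gaussian kernel, for which Bochner's theorem yields a Gaussian spectral measure; the bandwidth and number of features are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 2000, 1.0

# For the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)), the spectral
# measure is N(0, sigma^{-2} I), and phi(x; w, b) = sqrt(2) cos(<w, x> + b).
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def rff(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))
approx = rff(x) @ rff(y)
print(f"exact = {exact:.4f}, random-feature estimate = {approx:.4f}")
```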

Using the random Fourier features $\boldsymbol{\varphi}(\boldsymbol{x}_{i}) \overset{\text{def}}{=} \frac{1}{\sqrt{D}}\big(\varphi(\boldsymbol{x}_{i}; \boldsymbol{\xi}_{1}), \ldots, \varphi(\boldsymbol{x}_{i}; \boldsymbol{\xi}_{D})\big)$, the following empirical loss minimization is solved:

$\min_{\boldsymbol{\theta} \in \mathbb{R}^{D}} \ \frac{1}{n}\sum_{i=1}^{n} L\big(\langle \boldsymbol{\theta}, \boldsymbol{\varphi}(\boldsymbol{x}_{i})\rangle,\ y_{i}\big)$   (2.18a)
$\text{subject to:} \ \|\boldsymbol{\theta}\|_{\infty} \leq \frac{C}{D},$   (2.18b)

for some constant $C > 0$, where $\boldsymbol{\theta} = (\theta_{1}, \ldots, \theta_{D})$, and $\boldsymbol{\xi}_{1}, \ldots, \boldsymbol{\xi}_{D} \overset{\text{i.i.d.}}{\sim} \mu$. The approach of Rahimi and Recht [2] is appealing due to its computational tractability. In particular, preparing the feature matrix during training requires $\mathcal{O}(nD)$ computations, while evaluating a test sample needs $\mathcal{O}(D)$ computations, which significantly outperforms the $\mathcal{O}(n^{2})$ and $\mathcal{O}(n)$ complexities of the traditional kernel methods.
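A minimal sketch of this training pipeline is shown below: the random-feature matrix is formed once and a linear SVM is fit on it. Scikit-learn's LinearSVC is used as a stand-in solver for the constrained problem (2.18); the data set and hyper-parameters are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
n, d, D, sigma = 2000, 10, 300, 1.0
X = rng.normal(size=(n, d))
y = np.sign(np.sin(X[:, 0]) + 0.3 * rng.normal(size=n))

# Random Fourier features for the Gaussian kernel with bandwidth sigma.
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)   # n-by-D feature matrix

clf = LinearSVC(C=1.0, max_iter=10000)
clf.fit(Phi, y)
print("training accuracy:", clf.score(Phi, y))
```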

3. Data-Dependent Kernel Selection via Kernel-Target Alignment

In this paper, we focus on the hinge loss function associated with support vector machines (SVMs) for classification problems. To describe a kernel selection strategy for SVMs, we recall the definition of margin-based classifiers. Given a training dataset of $n$ points of the form $(\boldsymbol{x}_{i}, y_{i})$, $i = 1, \ldots, n$, and a feature map $\varphi : \mathcal{X} \to \mathcal{H}$, the margin-based classifier is given by

$\min_{\boldsymbol{w}, b} \ \frac{1}{2}\|\boldsymbol{w}\|_{2}^{2}$   (3.1a)
$\text{subject to:} \ y_{i}\big(\langle \boldsymbol{w}, \varphi(\boldsymbol{x}_{i})\rangle + b\big) \geq 1, \quad i = 1, \ldots, n.$   (3.1b)

Soft-margin SVM corresponds to the penalty function method for the optimization in Eq. (3.1), namely

$\min_{\boldsymbol{w}, b} \ \frac{1}{2}\|\boldsymbol{w}\|_{2}^{2} + \frac{c}{n}\sum_{i=1}^{n}\max\big\{0,\ 1 - y_{i}\big(\langle \boldsymbol{w}, \varphi(\boldsymbol{x}_{i})\rangle + b\big)\big\},$   (3.2)

where $c > 0$ is a penalty parameter. Alternatively, by transforming Eq. (3.1) into its corresponding Lagrangian dual problem, the solution is given by

$\max_{\boldsymbol{\alpha} \in \mathbb{R}^{n}} \ \sum_{i=1}^{n}\alpha_{i} - \frac{1}{2}\boldsymbol{\alpha}^{\mathsf{T}}\boldsymbol{G}\boldsymbol{\alpha}$   (3.3a)
$\text{subject to:} \ \alpha_{i} \geq 0, \quad i = 1, \ldots, n, \quad \sum_{i=1}^{n}\alpha_{i}\, y_{i} = 0,$   (3.3b)

where $\boldsymbol{G}$ is an $n$-by-$n$ matrix with elements $G_{ij} = y_{i}\, y_{j}\, K(\boldsymbol{x}_{i}, \boldsymbol{x}_{j})$. Moreover, there is the following connection between the primal and dual variables

$\boldsymbol{w}^{\star} = \sum_{i=1}^{n}\alpha_{i}^{\star}\, y_{i}\, \varphi(\boldsymbol{x}_{i}).$   (3.4)

Therefore, the optimal margin $\gamma$ of the classifier is related to the matrix $\boldsymbol{G}$ via

$\frac{1}{\gamma^{2}} = \|\boldsymbol{w}^{\star}\|_{2}^{2}$   (3.5)
$= \boldsymbol{\alpha}^{\star\mathsf{T}}\boldsymbol{G}\boldsymbol{\alpha}^{\star}.$   (3.6)

To gain further insight about the matrix $\boldsymbol{G}$, we need to define the notion of kernel alignment as a similarity measure between two kernel functions over a set of data-points:

Definition 3.1.

(Kernel Alignment, Cortes, et al. [9]) The empirical alignment of a kernel $K_{1}$ with a kernel $K_{2}$ with respect to the (unlabeled) samples $\boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{n}$ is defined by

$\widehat{A}(\boldsymbol{K}_{1}, \boldsymbol{K}_{2}) \overset{\text{def}}{=} \frac{\langle \boldsymbol{K}_{1}, \boldsymbol{K}_{2}\rangle_{F}}{\|\boldsymbol{K}_{1}\|_{F}\, \|\boldsymbol{K}_{2}\|_{F}},$   (3.7)

where $\boldsymbol{K}_{1} = [K_{1}(\boldsymbol{x}_{i}, \boldsymbol{x}_{j})]_{i,j=1}^{n}$, $\boldsymbol{K}_{2} = [K_{2}(\boldsymbol{x}_{i}, \boldsymbol{x}_{j})]_{i,j=1}^{n}$, and $\langle \boldsymbol{K}_{1}, \boldsymbol{K}_{2}\rangle_{F} \overset{\text{def}}{=} \sum_{i=1}^{n}\sum_{j=1}^{n} K_{1}(\boldsymbol{x}_{i}, \boldsymbol{x}_{j})\, K_{2}(\boldsymbol{x}_{i}, \boldsymbol{x}_{j})$.

Define the ideal kernel $K^{\star}(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}) \overset{\text{def}}{=} y_{i}\, y_{j}$, defined by the labels $y_{1}, \ldots, y_{n}$. Notice that the kernel $K^{\star}$ is optimal since it creates optimal separation between feature vectors, i.e., $K^{\star}(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}) = 1$ when $y_{i} = y_{j}$, and $K^{\star}(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}) = -1$ when $y_{i} \neq y_{j}$. From Definition 3.1, we now observe that the matrix $\boldsymbol{G}$ captures the alignment between the kernel matrix $\boldsymbol{K}$ and the optimal kernel matrix $\boldsymbol{K}^{\star} = \boldsymbol{y}\boldsymbol{y}^{\mathsf{T}}$. More specifically, since $\|\boldsymbol{y}\boldsymbol{y}^{\mathsf{T}}\|_{F} = n$, the kernel alignment metric in Eq. (3.7) can be expressed in terms of the matrix $\boldsymbol{G}$ as follows

$\widehat{A}(\boldsymbol{K}, \boldsymbol{y}\boldsymbol{y}^{\mathsf{T}}) = \frac{\boldsymbol{y}^{\mathsf{T}}\boldsymbol{K}\boldsymbol{y}}{n\|\boldsymbol{K}\|_{F}} = \frac{\boldsymbol{1}_{n}^{\mathsf{T}}\boldsymbol{G}\boldsymbol{1}_{n}}{n\|\boldsymbol{K}\|_{F}},$   (3.8)

where $\|\cdot\|_{F}$ is the Frobenius norm, and $\boldsymbol{1}_{n}$ is the vector of all ones. Therefore, the matrix $\boldsymbol{G}$ captures the alignment of the kernel $K$ with the optimal kernel $K^{\star}$.
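The empirical alignment in Eqs. (3.7)–(3.8) between a Gram matrix and the ideal kernel can be computed as in the following minimal sketch (the synthetic data and Gaussian kernel bandwidth are illustrative).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_target_alignment(K: np.ndarray, y: np.ndarray) -> float:
    """Empirical alignment <K, y y^T>_F / (||K||_F ||y y^T||_F), cf. Eq. (3.7)."""
    n = len(y)                                   # ||y y^T||_F = n for labels in {-1, +1}
    return float(y @ K @ y) / (n * np.linalg.norm(K, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0])
K = rbf_kernel(X, gamma=0.5)
print("alignment:", kernel_target_alignment(K, y))
```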

4. Kernel Alignment as an unbiased Two-Sample Test

In this section, we propose a scoring rule method for selection of random features. To specify the proposed method, we first review the notion of the Maximum Mean Discrepancy (MMD):

Definition 4.1.

(Maximum Mean Discrepancy [29]) Let $(\mathcal{X}, d)$ be a metric space. Let $\mathcal{F}$ be a class of functions $f : \mathcal{X} \to \mathbb{R}$. Let $P, Q \in \mathcal{B}(\mathcal{X})$ be two Borel probability measures from the set of Borel probability measures on $\mathcal{X}$. We define the maximum mean discrepancy (MMD) between the two distributions $P$ and $Q$ with respect to the function class $\mathcal{F}$ as follows

$\mathrm{MMD}_{\mathcal{F}}[P, Q] \overset{\text{def}}{=} \sup_{f \in \mathcal{F}} \Big(\mathbb{E}_{\boldsymbol{x} \sim P}[f(\boldsymbol{x})] - \mathbb{E}_{\boldsymbol{y} \sim Q}[f(\boldsymbol{y})]\Big).$   (4.1)

As shown by Zhang, et al. [30], the kernel-target alignment in Eq. (3.7) is closely related to the notion of the maximum mean discrepancy (MMD) between the distributions of the two classes, measured by functions from the RKHS, i.e., when $\mathcal{F} = \{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \leq 1\}$ in Definition 4.1. To see this, define the modified labels as $\widetilde{y}_{i} \overset{\text{def}}{=} 1/n_{+}$ for $y_{i} = +1$, and $\widetilde{y}_{i} \overset{\text{def}}{=} -1/n_{-}$ when $y_{i} = -1$. Here, $n_{+}$ and $n_{-}$ denote the number of positive and negative labels, respectively, such that $n_{+} + n_{-} = n$. Let $\widetilde{\boldsymbol{y}} \overset{\text{def}}{=} (\widetilde{y}_{1}, \ldots, \widetilde{y}_{n})$. Then, the kernel alignment can be reformulated as follows

$\widetilde{\boldsymbol{y}}^{\mathsf{T}}\boldsymbol{K}\widetilde{\boldsymbol{y}} = \sum_{i=1}^{n}\sum_{j=1}^{n}\widetilde{y}_{i}\,\widetilde{y}_{j}\, K(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}).$   (4.2)

The formula in Eq. (4.2) is an estimator of the (squared) MMD between the class-conditional distributions with respect to functions from the RKHS,

$\mathrm{MMD}_{\mathcal{H}}^{2}[P_{+}, P_{-}] = \mathbb{E}_{\boldsymbol{x}, \boldsymbol{x}' \sim P_{+}}[K(\boldsymbol{x}, \boldsymbol{x}')] - 2\,\mathbb{E}_{\boldsymbol{x} \sim P_{+},\, \boldsymbol{y} \sim P_{-}}[K(\boldsymbol{x}, \boldsymbol{y})] + \mathbb{E}_{\boldsymbol{y}, \boldsymbol{y}' \sim P_{-}}[K(\boldsymbol{y}, \boldsymbol{y}')],$   (4.3)

where the equality follows by the reproducing property $f(\boldsymbol{x}) = \langle f, K(\boldsymbol{x}, \cdot)\rangle_{\mathcal{H}}$. Here, $P_{+}$ and $P_{-}$ are the marginal distributions from which the features with the positive and negative labels are drawn, respectively.

For characteristic kernels (cf. Definition 2.5), the discrepancy $\mathrm{MMD}_{\mathcal{H}}$ indeed defines a valid metric on the space of Borel probability measures, in the sense that for any $P, Q \in \mathcal{B}(\mathcal{X})$, $\mathrm{MMD}_{\mathcal{H}}[P, Q] = 0$ if and only if $P = Q$; see [31].

Let $\mathcal{I}_{+}$ and $\mathcal{I}_{-}$ denote the sets of samples with positive and negative labels, respectively. We define the empirical marginal distributions corresponding to the observed samples

$\widehat{P}_{+}^{n_{+}} \overset{\text{def}}{=} \frac{1}{n_{+}}\sum_{i \in \mathcal{I}_{+}} \delta_{\boldsymbol{x}_{i}},$   (4.4a)
$\widehat{P}_{-}^{n_{-}} \overset{\text{def}}{=} \frac{1}{n_{-}}\sum_{i \in \mathcal{I}_{-}} \delta_{\boldsymbol{x}_{i}}.$   (4.4b)

Instead of the estimator in Eq. (4.2), we leverage the following biased empirical estimate of Eq. (4.3), which is the sum of two V-statistics and a cross-term sample average (see [29]):

$\widehat{\mathrm{MMD}}_{b}^{2}\big[\widehat{P}_{+}^{n_{+}}, \widehat{P}_{-}^{n_{-}}\big] \overset{\text{def}}{=} \frac{1}{n_{+}^{2}}\sum_{i, j \in \mathcal{I}_{+}} K(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}) + \frac{1}{n_{-}^{2}}\sum_{i, j \in \mathcal{I}_{-}} K(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}) - \frac{2}{n_{+} n_{-}}\sum_{i \in \mathcal{I}_{+}}\sum_{j \in \mathcal{I}_{-}} K(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}).$   (4.5)

For a balanced training data-set, $n_{+} = n_{-} = m$, an unbiased empirical estimate of Eq. (4.3) can be computed as one $U$-statistic

$\widehat{\mathrm{MMD}}_{u}^{2}\big[\widehat{P}_{+}^{m}, \widehat{P}_{-}^{m}\big] \overset{\text{def}}{=} \frac{1}{m(m - 1)}\sum_{i \neq j} h(\boldsymbol{z}_{i}, \boldsymbol{z}_{j}),$   (4.6)

where $\boldsymbol{z}_{i} \overset{\text{def}}{=} (\boldsymbol{x}_{i}^{+}, \boldsymbol{x}_{i}^{-})$ pairs the $i$-th positive sample $\boldsymbol{x}_{i}^{+}$ with the $i$-th negative sample $\boldsymbol{x}_{i}^{-}$, and

$h(\boldsymbol{z}_{i}, \boldsymbol{z}_{j}) \overset{\text{def}}{=} K(\boldsymbol{x}_{i}^{+}, \boldsymbol{x}_{j}^{+}) + K(\boldsymbol{x}_{i}^{-}, \boldsymbol{x}_{j}^{-}) - K(\boldsymbol{x}_{i}^{+}, \boldsymbol{x}_{j}^{-}) - K(\boldsymbol{x}_{i}^{-}, \boldsymbol{x}_{j}^{+}).$
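A numpy sketch of the one-sample $U$-statistic in Eq. (4.6) for a balanced data set follows; the pairing of positive and negative samples and the Gaussian base kernel are illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2_unbiased(X_pos, X_neg, gamma=0.5):
    """Unbiased MMD^2 U-statistic for equally sized samples, cf. Eq. (4.6)."""
    m = X_pos.shape[0]
    assert X_neg.shape[0] == m, "balanced classes assumed"
    Kpp = rbf_kernel(X_pos, X_pos, gamma=gamma)
    Knn = rbf_kernel(X_neg, X_neg, gamma=gamma)
    Kpn = rbf_kernel(X_pos, X_neg, gamma=gamma)
    # h(z_i, z_j) = K(x_i^+, x_j^+) + K(x_i^-, x_j^-) - K(x_i^+, x_j^-) - K(x_i^-, x_j^+)
    H = Kpp + Knn - Kpn - Kpn.T
    np.fill_diagonal(H, 0.0)          # U-statistic: exclude i = j terms
    return H.sum() / (m * (m - 1))

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=0.0, size=(100, 3))
X_neg = rng.normal(loc=1.0, size=(100, 3))
print("MMD^2 estimate:", mmd2_unbiased(X_pos, X_neg))
```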

To simplify the analysis, we assume $n_{+} = n_{-} = m$ and consider the $U$-statistic in Eq. (4.6). We also consider a class of mixture kernels built from a set of known base kernels $K_{1}, \ldots, K_{N}$, i.e., we consider

$K_{\boldsymbol{\omega}}(\boldsymbol{x}, \boldsymbol{y}) \overset{\text{def}}{=} \sum_{k=1}^{N} \omega_{k}\, K_{k}(\boldsymbol{x}, \boldsymbol{y}), \quad \boldsymbol{\omega} = (\omega_{1}, \ldots, \omega_{N}) \in \Delta_{N},$   (4.7)

where the mixing coefficients $\omega_{1}, \ldots, \omega_{N}$ are determined based on the MMD scores of the base kernels,

$s_{k} \overset{\text{def}}{=} \widehat{\mathrm{MMD}}_{u}^{2}\big[\widehat{P}_{+}^{m}, \widehat{P}_{-}^{m}; K_{k}\big], \quad k = 1, \ldots, N.$   (4.8)
Computing efficient mixture coefficients for multiple kernel learning has been studied extensively over the past few years [32]. However, those approaches propose to solve an optimization problem to maximize the kernel alignment, which can be computationally prohibitive when the number of base kernels is very large.

Associated with the mixture model of Eq. (4.7), we consider the following class of functions parameterized by $\boldsymbol{\omega} \in \Delta_{N}$:

$\mathcal{F}_{\boldsymbol{\omega}} \overset{\text{def}}{=} \Big\{ f(\boldsymbol{x}) = \langle \boldsymbol{\theta}, \boldsymbol{\varphi}_{\boldsymbol{\omega}}(\boldsymbol{x})\rangle : \boldsymbol{\theta} \in \mathbb{R}^{ND} \Big\},$   (4.9)

where $D$ is the number of random features drawn for each base kernel, and the vector $\boldsymbol{\varphi}_{\boldsymbol{\omega}}(\boldsymbol{x})$ is defined via concatenation of the random features associated with each base kernel, i.e.,

$\boldsymbol{\varphi}_{\boldsymbol{\omega}}(\boldsymbol{x}) \overset{\text{def}}{=} \big(\sqrt{\omega_{1}}\,\boldsymbol{\varphi}_{1}(\boldsymbol{x}), \ldots, \sqrt{\omega_{N}}\,\boldsymbol{\varphi}_{N}(\boldsymbol{x})\big),$   (4.10)

where $\boldsymbol{\varphi}_{k}(\boldsymbol{x}) \overset{\text{def}}{=} \frac{1}{\sqrt{D}}\big(\varphi(\boldsymbol{x}; \boldsymbol{\xi}_{1}^{k}), \ldots, \varphi(\boldsymbol{x}; \boldsymbol{\xi}_{D}^{k})\big)$ and $\boldsymbol{\xi}_{1}^{k}, \ldots, \boldsymbol{\xi}_{D}^{k} \overset{\text{i.i.d.}}{\sim} \mu_{k}$ are the random features of the base kernel $K_{k}$. Furthermore, $\Delta_{N} \overset{\text{def}}{=} \{\boldsymbol{\omega} \in \mathbb{R}_{+}^{N} : \sum_{k=1}^{N}\omega_{k} = 1\}$ is the simplex of probability distributions with $N$ mass functions. Algorithm 1 outlines the procedure for SVMs with the multiple kernel learning algorithm proposed in this paper.
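Algorithm 1 is not reproduced here; the following is only a rough sketch of the kind of procedure this section describes, under the illustrative assumption that the mixing weights are taken proportional to the (non-negative parts of the) per-kernel MMD scores of Eq. (4.8). The Gaussian bandwidth grid, per-kernel feature dimension, and final linear SVM are placeholder choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, d, D = 200, 5, 100                        # per-class size, dimension, features per base kernel
X_pos = rng.normal(loc=0.0, size=(n, d))
X_neg = rng.normal(loc=0.7, size=(n, d))
gammas = [0.01, 0.1, 1.0, 10.0]              # base Gaussian kernels exp(-g ||x - y||^2)

def mmd2_u(Kpp, Knn, Kpn):
    m = Kpp.shape[0]
    H = Kpp + Knn - Kpn - Kpn.T
    np.fill_diagonal(H, 0.0)
    return H.sum() / (m * (m - 1))

# Step 1: MMD score of each base kernel (cf. Eq. (4.8)).
scores = np.array([
    mmd2_u(rbf_kernel(X_pos, X_pos, gamma=g),
           rbf_kernel(X_neg, X_neg, gamma=g),
           rbf_kernel(X_pos, X_neg, gamma=g)) for g in gammas
])

# Step 2: mixing coefficients on the simplex (assumed proportional to positive scores).
w = np.maximum(scores, 0.0)
w = w / w.sum() if w.sum() > 0 else np.full(len(gammas), 1.0 / len(gammas))

# Step 3: concatenated random Fourier features, scaled by sqrt(w_k) (cf. Eq. (4.10)).
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n), -np.ones(n)])
feats = []
for g, wk in zip(gammas, w):
    W = rng.normal(scale=np.sqrt(2.0 * g), size=(D, d))   # spectral measure of exp(-g||x-y||^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    feats.append(np.sqrt(wk) * np.sqrt(2.0 / D) * np.cos(X @ W.T + b))
Phi = np.hstack(feats)

# Step 4: linear SVM on the learned feature map.
clf = LinearSVC(C=1.0, max_iter=10000).fit(Phi, y)
print("mixing weights:", np.round(w, 3), " training accuracy:", clf.score(Phi, y))
```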

Remark 4.2.

Alternatively, the statistic in Eq. (4.5) can be rewritten in terms of random Fourier features. In particular, due to the bijection between the kernel and the generative distribution of its random features, we can write

$\widehat{\mathrm{MMD}}_{b}^{2}\big[\widehat{P}_{+}^{n_{+}}, \widehat{P}_{-}^{n_{-}}\big] = \mathbb{E}_{\boldsymbol{\xi} \sim \mu}\bigg[\Big(\frac{1}{n_{+}}\sum_{i \in \mathcal{I}_{+}}\varphi(\boldsymbol{x}_{i}; \boldsymbol{\xi}) - \frac{1}{n_{-}}\sum_{j \in \mathcal{I}_{-}}\varphi(\boldsymbol{x}_{j}; \boldsymbol{\xi})\Big)^{2}\bigg].$   (4.11)

The correspondence between the base kernels $K_{1}, \ldots, K_{N}$ and their generative distributions $\mu_{1}, \ldots, \mu_{N}$ provides the following alternative mixture model for the distribution of random features instead of kernels,

$\mu_{\boldsymbol{\omega}} \overset{\text{def}}{=} \sum_{k=1}^{N}\omega_{k}\, \mu_{k}.$   (4.12)

Using the mixture model in Eq. (4.12), we draw random feature maps $\varphi(\cdot\,; \boldsymbol{\xi}_{1}), \ldots, \varphi(\cdot\,; \boldsymbol{\xi}_{D})$, where $\boldsymbol{\xi}_{1}, \ldots, \boldsymbol{\xi}_{D} \overset{\text{i.i.d.}}{\sim} \mu_{\boldsymbol{\omega}}$, and let $\boldsymbol{\varphi}(\boldsymbol{x}) \overset{\text{def}}{=} \frac{1}{\sqrt{D}}\big(\varphi(\boldsymbol{x}; \boldsymbol{\xi}_{1}), \ldots, \varphi(\boldsymbol{x}; \boldsymbol{\xi}_{D})\big)$. In this paper, we only analyze the mixture kernel model in Eq. (4.7).
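For the alternative view in Eq. (4.12), a brief sketch of sampling random Fourier features directly from the mixture of spectral measures is given below, again assuming Gaussian base kernels on an illustrative bandwidth grid with illustrative mixing weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 500
gammas = np.array([0.01, 0.1, 1.0, 10.0])      # base kernels exp(-g ||x - y||^2)
w = np.array([0.1, 0.2, 0.6, 0.1])             # mixing coefficients on the simplex

# Draw, for each random feature, a base-kernel index from Multinomial(w),
# then a frequency from that kernel's spectral measure N(0, 2 g I).
idx = rng.choice(len(gammas), size=D, p=w)
W = rng.normal(size=(D, d)) * np.sqrt(2.0 * gammas[idx])[:, None]
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi(x):
    # Feature map whose inner products approximate the mixture kernel in Eq. (4.7).
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.sum(w * np.exp(-gammas * np.linalg.norm(x - y) ** 2))
print(f"mixture kernel = {exact:.4f}, random-feature estimate = {phi(x) @ phi(y):.4f}")
```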

5. Theoretical Results: Consistency, Minimax Rate, and Generalization Bounds

In this section, we state our main theoretical results regarding the performance guarantees of the proposed kernel learning procedure in Section 4. The proofs of the theoretical results are presented in the appendices.

5.1. Technical Assumptions

We first state the technical assumptions under which we prove our theoretical results:

  • The kernel $K$ is Hilbert–Schmidt, i.e., $\int_{\mathcal{X}}\int_{\mathcal{X}} K^{2}(\boldsymbol{x}, \boldsymbol{y})\, \mathrm{d}P(\boldsymbol{x})\, \mathrm{d}P(\boldsymbol{y}) < \infty$.

  • The base kernels are shift invariant, i.e., $K_{k}(\boldsymbol{x}, \boldsymbol{y}) = K_{k}(\boldsymbol{x} - \boldsymbol{y})$ for all $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{X}$ and $k = 1, \ldots, N$.

  • Given a loss function $L$, we assume there exists a function $\widetilde{L}$ dominating $L$ in the sense that $L(t, y) \leq \widetilde{L}(t, y)$ for all $t \in \mathbb{R}$ and $y \in \mathcal{Y}$. Further, we suppose $\widetilde{L}$ is a Lipschitz function with respect to the Euclidean distance on $\mathbb{R}$ with constant $C_{L}$, i.e., $|\widetilde{L}(t, y) - \widetilde{L}(s, y)| \leq C_{L}\,|t - s|$ for all $t, s \in \mathbb{R}$ and $y \in \mathcal{Y}$.