
Support Vector Machines for Additive Models: Consistency and Robustness

07/23/2010
by   Andreas Christmann, et al.

Support vector machines (SVMs) are special kernel-based methods and have been among the most successful learning methods for more than a decade. SVMs can informally be described as a kind of regularized M-estimators for functions and have demonstrated their usefulness in many complicated real-life problems. During the last years, a great part of the statistical research on SVMs has concentrated on the question of how to design SVMs such that they are universally consistent and statistically robust for nonparametric classification or nonparametric regression purposes. In many applications, some qualitative prior knowledge of the distribution P or of the unknown function f to be estimated is present, or a prediction function with good interpretability is desired, such that a semiparametric model or an additive model is of interest. In this paper we mainly address the question of how to design SVMs by choosing the reproducing kernel Hilbert space (RKHS) or its corresponding kernel to obtain consistent and statistically robust estimators in additive models. We give an explicit construction of kernels - and thus of their RKHSs - which, in combination with a Lipschitz continuous loss function, leads to consistent and statistically robust SVMs for additive models. Examples are quantile regression based on the pinball loss function, regression based on the epsilon-insensitive loss function, and classification based on the hinge loss function.


1 Introduction

Kernel methods such as support vector machines have been among the most successful learning methods for more than a decade, see Schölkopf and Smola (2002). Examples are classification or regression models where we have an input space $\mathcal{X}$, an output space $\mathcal{Y} \subset \mathbb{R}$, some unknown probability measure $P$ on $\mathcal{X} \times \mathcal{Y}$, and an unknown function $f : \mathcal{X} \to \mathbb{R}$ which describes the quantity of interest, e.g. the conditional quantile curve, of the conditional distribution of $Y$ given $X = x$. Support vector machines can informally be described as a kind of regularized M-estimators for functions and have demonstrated their usefulness in many complicated high-dimensional real-life problems. Besides several other nice features, one key argument for using SVMs has been the so-called “kernel trick” (Schölkopf et al., 1998), which decouples the SVM optimization problem from the domain of the samples, thus making it possible to use SVMs on virtually any input space $\mathcal{X}$.

This flexibility is in strong contrast to more classical learning methods from both machine learning and non-parametric statistics, which almost always require input spaces $\mathcal{X} \subset \mathbb{R}^d$.

As a result, kernel methods have been successfully used in various application areas that were previously infeasible for machine learning methods. As examples we refer to (i) SVMs using probability measures, e.g. histograms, as input samples, which have been used to analyze histogram data and coloured images (Hein and Bousquet, 2005, Sriperumbudur et al., 2009), (ii) SVMs for text classification and web mining (Joachims, 2002, Lafferty and Lebanon, 2005), and (iii) SVMs with kernels from computational biology, e.g. kernels for trees and graphs (Schölkopf et al., 2004).

For a data set $D_n = \bigl((x_1,y_1),\ldots,(x_n,y_n)\bigr) \in (\mathcal{X}\times\mathcal{Y})^n$, the empirical SVM is defined as

$$f_{L,D_n,\lambda} = \arg\min_{f \in H}\; \frac{1}{n}\sum_{i=1}^{n} L\bigl(x_i, y_i, f(x_i)\bigr) + \lambda \|f\|_H^2. \qquad (1)$$

That is, SVMs are based on three key components: (i) a convex loss function $L$ used to measure the quality of a prediction $f(x_i)$, (ii) a reproducing kernel Hilbert space (RKHS) $H$ of functions $f : \mathcal{X} \to \mathbb{R}$ to specify the set of functions over which the expected loss is minimized, and (iii) the regularization term $\lambda \|f\|_H^2$ with $\lambda > 0$ to reduce the danger of overfitting and to guarantee the existence of a unique SVM even if $L$ is not strictly convex. The RKHS $H$ is often implicitly defined by specifying a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Details about the definition of SVMs and some examples will be given in Section 2.
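To make these three components concrete, the following minimal sketch (in Python with NumPy; all names and parameter values are our own illustrative choices and are not taken from the paper) evaluates the regularized empirical risk of a candidate function that is represented, via the representer theorem, as a kernel expansion over the data points.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    """Gram matrix of the Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / gamma^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / gamma ** 2)

def regularized_empirical_risk(alpha, K, y, loss, lam):
    """(1/n) sum_i L(y_i, f(x_i)) + lam * ||f||_H^2 for f = sum_j alpha_j k(., x_j).

    By the reproducing property, f(x_i) = (K alpha)_i and ||f||_H^2 = alpha' K alpha.
    """
    f_values = K @ alpha                      # (i) predictions at the training points
    data_fit = np.mean(loss(y, f_values))     # empirical risk term
    penalty = lam * alpha @ K @ alpha         # (iii) RKHS norm penalty
    return data_fit + penalty

# Example with the hinge loss (binary classification, y in {-1, +1}).
hinge = lambda y, t: np.maximum(0.0, 1.0 - y * t)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=50) > 0, 1.0, -1.0)
K = gaussian_kernel_matrix(X, gamma=1.0)      # (ii) the RKHS enters via its kernel
alpha = np.zeros(50)                          # a (trivial) candidate coefficient vector
print(regularized_empirical_risk(alpha, K, y, hinge, lam=0.1))
```

An actual SVM solver would minimize this objective over the coefficient vector; the point of the sketch is only to show how the loss, the kernel, and the regularization term enter the optimization problem.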

During the last years, a great part of the statistical research on SVMs has concentrated on the central question of how to choose the loss function $L$, the RKHS $H$ or its kernel $k$, and a sequence of regularization parameters $(\lambda_n)_{n\in\mathbb{N}}$ to guarantee that SVMs are universally consistent and statistically robust for classification and regression purposes. In a nutshell, it turned out in a purely non-parametric setup that SVMs based on the combination of a Lipschitz continuous loss function and a bounded continuous kernel with a dense and separable RKHS are universally consistent with desirable statistical robustness properties for any probability measure $P$ from which the data set was observed, see, e.g., Steinwart and Christmann (2008) and Christmann et al. (2009) for details. Examples are the combination of the Gaussian RBF-kernel with the pinball loss function for nonparametric quantile regression, with the $\epsilon$-insensitive loss function for nonparametric regression, or with the hinge loss function for nonparametric classification, see Section 2.

Although a nonparametric approach is often the best choice in practice due to the lack of prior knowledge on $P$, a semiparametric approach or an additive model (Friedman and Stuetzle, 1981, Hastie and Tibshirani, 1990) can also be valuable. For example, we may be interested for practical reasons only in functions which offer a nice interpretation, because an interpretable prediction function can be crucial if the prediction has to be explainable to clients. This can be the case if the prediction is the expected claim amount of a client and these predictions are the basis for the construction of an insurance tariff. Here we will mainly consider additive models, although models with a multiplicative structure can also be of interest. More precisely, for some $m \in \mathbb{N}$, the input space $\mathcal{X}$ is split up into $m$ non-empty spaces $\mathcal{X}_1, \ldots, \mathcal{X}_m$ according to

$$\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m, \qquad (2)$$

and only additive functions of the form

$$f = f_1 + \cdots + f_m$$

are considered, where $f_j : \mathcal{X}_j \to \mathbb{R}$ for $j = 1, \ldots, m$.

To the best of our knowledge, there are currently no published results on consistency and statistical robustness of SVMs based on kernels designed for additive models. Of course, one can use one of the purely nonparametric SVMs described above, but the hope is that SVMs based on kernels especially designed for such situations may offer better results.

In this paper we address the question of how to design specific SVMs for additive models. The main goal of this paper is to give an explicit construction principle for kernels (and thus for their RKHSs) which, in combination with a Lipschitz continuous loss function, leads to consistent and statistically robust SVMs for additive models. Examples are SVMs in additive models for quantile regression based on the pinball loss function, for regression based on the $\epsilon$-insensitive loss function, and for classification based on the hinge loss function.

The rest of the paper is organized as follows. In Section 2 we collect some known results on loss functions, kernels and their RKHSs, and on support vector machines. These results are needed to state our results on consistency and statistical robustness of SVMs for additive models in Section 3. Although we do not yet have results on rates of convergence, our numerical examples in Section 4 will demonstrate that SVMs based on kernels designed for additive models can easily outperform standard nonparametric SVMs if the assumption of an additive model is valid. Section 5 contains the discussion. All proofs are given in the Appendix.

2 Background on support vector machines

Let $\mathcal{X}$ be a complete separable metric space and let $\mathcal{Y}$ be a closed subset of $\mathbb{R}$. We will always use the respective Borel-$\sigma$-algebras. The set of all probability measures on the Borel-$\sigma$-algebra of $\mathcal{X}\times\mathcal{Y}$ is denoted by $\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$. The random input variables $X_1,\ldots,X_n$ take their values in $\mathcal{X}$ and the random output variables $Y_1,\ldots,Y_n$ take their values in $\mathcal{Y}$. It is assumed that $(X_1,Y_1),\ldots,(X_n,Y_n)$ are independent and identically distributed according to some unknown probability measure $P \in \mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$. Since $\mathcal{Y}$ is closed, $P$ can be split into the marginal distribution $P_X$ on $\mathcal{X}$ and the conditional distribution $P(\cdot\,|\,x)$ of $Y$ given $X = x$.

The goal is to find a good predictor which predicts the value of an output variable after observing the value of the corresponding input variable. The quality of a prediction is measured by a loss function

$$L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty).$$

It is assumed that $L$ is measurable and $L(x,y,y) = 0$ for every $(x,y) \in \mathcal{X}\times\mathcal{Y}$ – that is, the loss is zero if the prediction equals the actual value of the output variable. In addition, we make the standard assumption that

$$L(x,y,\cdot) : \mathbb{R} \to [0,\infty)$$

is convex for every $(x,y) \in \mathcal{X}\times\mathcal{Y}$ and that additionally the following uniform Lipschitz property is fulfilled for some real number $|L|_1 \in (0,\infty)$:

$$|L(x,y,t) - L(x,y,t')| \le |L|_1 \cdot |t - t'| \quad \text{for all } (x,y) \in \mathcal{X}\times\mathcal{Y} \text{ and all } t, t' \in \mathbb{R}. \qquad (3)$$

We restrict our attention to Lipschitz continuous loss functions because the use of loss functions which are not Lipschitz continuous (such as the least squares loss which is only locally Lipschitz continuous on unbounded domains) usually conflicts with robustness; see, e.g., Steinwart and Christmann (2008, § 10.4).

The quality of a (measurable) predictor $f : \mathcal{X} \to \mathbb{R}$ is measured by the risk

$$\mathcal{R}_{L,P}(f) = \int_{\mathcal{X}\times\mathcal{Y}} L\bigl(x, y, f(x)\bigr)\, dP(x,y).$$

By different choices of $\mathcal{Y}$ and the loss function $L$, different purposes are covered by this setup – e.g. binary classification for $\mathcal{Y} = \{-1, +1\}$ and the hinge loss

$$L(x,y,t) = \max\{0,\, 1 - yt\},$$

regression for $\mathcal{Y} = \mathbb{R}$ and the $\epsilon$-insensitive loss

$$L(x,y,t) = \max\{0,\, |y - t| - \epsilon\},$$

where $\epsilon > 0$, and quantile regression for $\mathcal{Y} = \mathbb{R}$ and the pinball loss

$$L(x,y,t) = \begin{cases} (\tau - 1)\,(y - t), & \text{if } y - t < 0, \\ \tau\,(y - t), & \text{if } y - t \ge 0, \end{cases} \qquad (4)$$

where $\tau \in (0,1)$.
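For concreteness, these three loss functions can be written down directly; the vectorized NumPy form below is our own sketch.

```python
import numpy as np

def hinge_loss(y, t):
    """Hinge loss for binary classification with y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * t)

def eps_insensitive_loss(y, t, eps=0.1):
    """epsilon-insensitive loss for regression: zero inside a tube of half-width eps."""
    return np.maximum(0.0, np.abs(y - t) - eps)

def pinball_loss(y, t, tau=0.5):
    """Pinball loss for quantile regression at quantile level tau in (0, 1)."""
    r = y - t
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)
```

All three losses are convex in the prediction t and uniformly Lipschitz continuous in t, with Lipschitz constants 1, 1, and max(tau, 1 - tau), respectively, so they satisfy (3).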

An optimal predictor is a measurable function $f^*_{L,P} : \mathcal{X} \to \mathbb{R}$ which attains the minimal risk, called the Bayes risk,

$$\mathcal{R}^*_{L,P} = \inf\bigl\{ \mathcal{R}_{L,P}(f) \,\big|\, f : \mathcal{X} \to \mathbb{R} \text{ measurable} \bigr\}.$$

The optimal predictor in a set $\mathcal{F}$ of measurable functions $f : \mathcal{X} \to \mathbb{R}$ is an $f^*_{L,P,\mathcal{F}} \in \mathcal{F}$ which attains the minimal risk

$$\mathcal{R}^*_{L,P,\mathcal{F}} = \inf\bigl\{ \mathcal{R}_{L,P}(f) \,\big|\, f \in \mathcal{F} \bigr\}.$$

For example, the goal of quantile regression is to estimate a conditional quantile function, i.e., a function $f^*_{\tau,P} : \mathcal{X} \to \mathbb{R}$ such that

$$P\bigl( (-\infty, f^*_{\tau,P}(x)] \,\big|\, x \bigr) \ge \tau \quad \text{and} \quad P\bigl( [f^*_{\tau,P}(x), \infty) \,\big|\, x \bigr) \ge 1 - \tau \quad \text{for } P_X\text{-almost all } x \in \mathcal{X}$$

for the quantile level $\tau \in (0,1)$. If $f^*_{\tau,P} \in \mathcal{F}$, then the conditional quantile function attains the minimal risk $\mathcal{R}^*_{L,P,\mathcal{F}}$ for the pinball loss (with parameter $\tau$), so that quantile regression can be done by trying to minimize the risk in $\mathcal{F}$.

One way to build a non-parametric predictor is to use a support vector machine

$$f_{L,P,\lambda} = \arg\min_{f \in H}\; \int_{\mathcal{X}\times\mathcal{Y}} L\bigl(x, y, f(x)\bigr)\, dP(x,y) + \lambda \|f\|_H^2, \qquad (5)$$

where $H$ is a reproducing kernel Hilbert space (RKHS) of a measurable kernel $k : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$, and $\lambda > 0$ is a regularization parameter to reduce the danger of overfitting, see e.g., Vapnik (1998), Schölkopf and Smola (2002) or Steinwart and Christmann (2008) for details. The reproducing property of $H$ states that, for all $f \in H$ and all $x \in \mathcal{X}$,

$$f(x) = \langle f, \Phi(x) \rangle_H,$$

where $\Phi : \mathcal{X} \to H$, $\Phi(x) = k(\cdot, x)$, denotes the canonical feature map. A kernel $k$ is called bounded if

$$\|k\|_\infty := \sup_{x \in \mathcal{X}} \sqrt{k(x,x)} < \infty.$$

Using the reproducing property and $\|\Phi(x)\|_H = \sqrt{k(x,x)}$, we obtain the well-known inequalities

$$|f(x)| \le \sqrt{k(x,x)}\, \|f\|_H \qquad (6)$$

and

$$\|f\|_\infty \le \|k\|_\infty \|f\|_H \qquad (7)$$

for all $f \in H$ and all $x \in \mathcal{X}$. As an example of a bounded kernel, we mention the popular Gaussian radial basis function (GRBF) kernel defined by

$$k_{\mathrm{RBF}}(x, x') = \exp\bigl( -\gamma^{-2} \|x - x'\|_2^2 \bigr), \qquad (8)$$

where $\gamma$ is some positive constant and $x, x' \in \mathcal{X} \subset \mathbb{R}^d$. This kernel leads to a large RKHS which is dense in $L_1(\mu)$ for all probability measures $\mu$ on $\mathcal{X}$. We will also consider the polynomial kernel

$$k(x, x') = \bigl( \langle x, x' \rangle + c \bigr)^p,$$

where $p \in \mathbb{N}$, $c \ge 0$, and $x, x' \in \mathcal{X} \subset \mathbb{R}^d$. The dot kernel is a special polynomial kernel with $p = 1$ and $c = 0$. The polynomial kernel is bounded if and only if $\mathcal{X}$ is bounded.
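The two kernels can be sketched as follows (our own minimal implementations; the parametrization of the Gaussian kernel with a width parameter gamma follows the convention used in (8)).

```python
import numpy as np

def gaussian_rbf_kernel(x, xp, gamma=1.0):
    """GRBF kernel exp(-||x - x'||_2^2 / gamma^2); bounded, since k(x, x) = 1 for all x."""
    x, xp = np.asarray(x, dtype=float), np.asarray(xp, dtype=float)
    return np.exp(-np.sum((x - xp) ** 2) / gamma ** 2)

def polynomial_kernel(x, xp, p=2, c=1.0):
    """Polynomial kernel (<x, x'> + c)^p; bounded only if the input space is bounded."""
    x, xp = np.asarray(x, dtype=float), np.asarray(xp, dtype=float)
    return (np.dot(x, xp) + c) ** p

# The dot kernel is the special case p = 1, c = 0.
dot_kernel = lambda x, xp: polynomial_kernel(x, xp, p=1, c=0.0)
```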

Of course, the regularized risk

$$\mathcal{R}^{\mathrm{reg}}_{L,P,\lambda}(f) := \int_{\mathcal{X}\times\mathcal{Y}} L\bigl(x,y,f(x)\bigr)\, dP(x,y) + \lambda \|f\|_H^2, \qquad f \in H,$$

is in general not computable, because $P$ is unknown. However, the empirical distribution

$$D_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$$

corresponding to the data set $(x_1,y_1),\ldots,(x_n,y_n)$ can be used as an estimator of $P$. Here $\delta_{(x_i,y_i)}$ denotes the Dirac distribution in $(x_i,y_i)$. If we replace $P$ by $D_n$ in (5), we obtain the regularized empirical risk $\mathcal{R}^{\mathrm{reg}}_{L,D_n,\lambda}$ and the empirical SVM $f_{L,D_n,\lambda}$. Furthermore, we need analogous notions where the data set $(x_1,y_1),\ldots,(x_n,y_n)$ is replaced by the random variables $(X_1,Y_1),\ldots,(X_n,Y_n)$. Thus, we define

$$\mathbb{D}_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}.$$

Then, for every $\omega$ in the underlying sample space, $\mathbb{D}_n(\omega)$ is the empirical distribution corresponding to the data set $(X_1(\omega),Y_1(\omega)),\ldots,(X_n(\omega),Y_n(\omega))$ and, accordingly, $\mathcal{R}^{\mathrm{reg}}_{L,\mathbb{D}_n,\lambda}(f)$ denotes the mapping $\omega \mapsto \mathcal{R}^{\mathrm{reg}}_{L,\mathbb{D}_n(\omega),\lambda}(f)$, and $f_{L,\mathbb{D}_n,\lambda}$ denotes the mapping $\omega \mapsto f_{L,\mathbb{D}_n(\omega),\lambda}$.

Support vector machines need not exist for every probability measure $P$; for Lipschitz continuous loss functions, it is sufficient for the existence of $f_{L,P,\lambda}$ that $\int_{\mathcal{X}\times\mathcal{Y}} L(x,y,0)\, dP(x,y) < \infty$. This condition may be violated by heavy-tailed distributions and, in this case, it is possible that $\mathcal{R}_{L,P}(f) = \infty$ for every $f \in H$.

In order to enlarge the applicability of support vector machines to heavy-tailed distributions, the following extension has been developed in Christmann et al. (2009). Following an idea already used by Huber (1967) for M-estimates in parametric models, a shifted loss function $L^\star$ is defined by

$$L^\star(x, y, t) := L(x, y, t) - L(x, y, 0).$$

Then, similar to the original loss function $L$, define the $L^\star$-risk by

$$\mathcal{R}_{L^\star,P}(f) = \int_{\mathcal{X}\times\mathcal{Y}} L^\star\bigl(x,y,f(x)\bigr)\, dP(x,y)$$

and the regularized $L^\star$-risk by

$$\mathcal{R}^{\mathrm{reg}}_{L^\star,P,\lambda}(f) = \mathcal{R}_{L^\star,P}(f) + \lambda \|f\|_H^2$$

for every $f \in H$. In complete analogy to (5), we define the support vector machine based on the shifted loss function by

$$f_{L^\star,P,\lambda} = \arg\min_{f \in H}\; \mathcal{R}_{L^\star,P}(f) + \lambda \|f\|_H^2. \qquad (9)$$

If the support vector machine $f_{L,P,\lambda}$ defined by (5) exists, we have seemingly defined the SVM in two different ways now. However, the two definitions coincide in this case and the following theorem summarizes some basic results of Christmann et al. (2009).

Theorem 1.

Let $L$ be a convex and Lipschitz continuous loss function and let $k$ be a bounded kernel. Then, for every $\lambda > 0$ and every $P \in \mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$, there exists a unique SVM $f_{L^\star,P,\lambda}$ which minimizes the regularized $L^\star$-risk, i.e.

$$f_{L^\star,P,\lambda} = \arg\min_{f \in H}\; \mathcal{R}_{L^\star,P}(f) + \lambda \|f\|_H^2.$$

If the support vector machine defined by (5) exists, then the two definitions (5) and (9) coincide.
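The effect of the shift can be illustrated numerically. The sketch below (our own example) compares Monte Carlo estimates of the pinball risk and of the shifted pinball risk for a fixed prediction t when the output follows a standard Cauchy distribution: the unshifted risk is infinite, so its sample mean does not stabilize, whereas the shifted loss is bounded in absolute value by max(tau, 1 - tau) * |t| for every y and is therefore always integrable.

```python
import numpy as np

def pinball(y, t, tau=0.5):
    r = y - t
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

def pinball_shifted(y, t, tau=0.5):
    # L*(y, t) = L(y, t) - L(y, 0); bounded in absolute value by max(tau, 1 - tau) * |t|.
    return pinball(y, t, tau) - pinball(y, 0.0, tau)

rng = np.random.default_rng(1)
t = 2.0                                    # a fixed candidate prediction
for n in (10**3, 10**5, 10**7):
    y = rng.standard_cauchy(size=n)        # heavy-tailed outputs: E|Y| does not exist
    print(n,
          np.mean(pinball(y, t)),          # unstable: the theoretical risk is infinite
          np.mean(pinball_shifted(y, t)))  # stabilizes near the finite L*-risk
```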

3 Support vector machines for additive models

3.1 Model and assumptions

As described in the previous section, the goal is to minimize the risk in a set $\mathcal{F}$ of functions $f : \mathcal{X} \to \mathbb{R}$. In this article, we assume an additive model. Accordingly, let

$$\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m,$$

where $\mathcal{X}_1,\ldots,\mathcal{X}_m$ are non-empty sets. For every $j \in \{1,\ldots,m\}$, let $\mathcal{F}_j$ be a set of functions $f_j : \mathcal{X}_j \to \mathbb{R}$. Then, we only consider functions of the form

$$f = f_1 + \cdots + f_m$$

for $f_j \in \mathcal{F}_j$, $j = 1,\ldots,m$. Thus,

$$\mathcal{F} = \bigl\{ f_1 + \cdots + f_m \,\big|\, f_j \in \mathcal{F}_j,\ j = 1,\ldots,m \bigr\}. \qquad (10)$$

In (10), we have identified $f_j$ with the map $(x_1,\ldots,x_m) \mapsto f_j(x_j)$.

Such additive models can be treated by support vector machines in a very natural way. For every $j \in \{1,\ldots,m\}$, choose a kernel $k_j$ on $\mathcal{X}_j$ with RKHS $H_j$. Then, the space of functions

$$H = \bigl\{ f_1 + \cdots + f_m \,\big|\, f_j \in H_j,\ j = 1,\ldots,m \bigr\}$$

is an RKHS on $\mathcal{X}$ with kernel $k = k_1 + \cdots + k_m$; see Theorem 2 below. In this way, SVMs can be used to fit additive models, and such SVMs enjoy at least three appealing features: First, it is guaranteed that the predictor has the assumed additive structure $f = f_1 + \cdots + f_m$. Second, it is possible to still use the standard SVM machinery including the kernel trick (Schölkopf and Smola, 2002, § 2) and implementations of SVMs – just by selecting the kernel $k = k_1 + \cdots + k_m$. Third, the possibility to choose different kernels $k_1,\ldots,k_m$ offers a great flexibility. For example, take $m = 2$ and let $k_1$ be a GRBF kernel on $\mathcal{X}_1$ and $k_2$ be a GRBF kernel on $\mathcal{X}_2$. Since the RKHS of a Gaussian kernel is an infinite dimensional function space, we get non-parametric estimates of $f_1$ and $f_2$. As a second example, consider a semiparametric model with $f = f_1 + f_2$, where $f_1$ is assumed to be a polynomial function of order at most $p$ and $f_2$ may be some complicated function. Then, this semiparametric model can be treated by simply taking a polynomial kernel on $\mathcal{X}_1$ for $k_1$ and a GRBF kernel on $\mathcal{X}_2$ for $k_2$. This can be used, for example, in order to model changes in space (with $x_2$ specifying the location) or in time (with $x_2$ specifying the point in time).
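A minimal sketch of this construction (all names are ours): each kernel k_j only sees its own block of coordinates of the product space, and the kernel for the additive model is simply the sum, as in Theorem 2 below.

```python
import numpy as np

def grbf(u, v, gamma=1.0):
    u, v = np.atleast_1d(u).astype(float), np.atleast_1d(v).astype(float)
    return np.exp(-np.sum((u - v) ** 2) / gamma ** 2)

def poly(u, v, p=2, c=1.0):
    u, v = np.atleast_1d(u).astype(float), np.atleast_1d(v).astype(float)
    return (np.dot(u, v) + c) ** p

def additive_kernel(x, xp, kernels, blocks):
    """k(x, x') = sum_j k_j(x_j, x'_j), where x_j is the j-th coordinate block of x."""
    x, xp = np.asarray(x, dtype=float), np.asarray(xp, dtype=float)
    return sum(k(x[b], xp[b]) for k, b in zip(kernels, blocks))

# Semiparametric example from the text: X = X_1 x X_2, both one-dimensional, with a
# polynomial kernel of order 2 on X_1 and a GRBF kernel on X_2.
kernels = [lambda u, v: poly(u, v, p=2), lambda u, v: grbf(u, v, gamma=0.5)]
blocks = [slice(0, 1), slice(1, 2)]
x, xp = np.array([0.3, -1.2]), np.array([0.1, 0.7])
print(additive_kernel(x, xp, kernels, blocks))
```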

Theorem 2.

For every $j \in \{1,\ldots,m\}$, let $\mathcal{X}_j$ be a non-empty set and

$$k_j : \mathcal{X}_j \times \mathcal{X}_j \to \mathbb{R}$$

be a kernel with corresponding RKHS $H_j$. Define $k := k_1 + \cdots + k_m$ on $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$. That is,

$$k(x, x') = \sum_{j=1}^{m} k_j(x_j, x'_j)$$

for every $x = (x_1,\ldots,x_m)$, $x' = (x'_1,\ldots,x'_m) \in \mathcal{X}$. Then, $k$ is a kernel on $\mathcal{X}$ with RKHS

$$H = \bigl\{ f_1 + \cdots + f_m \,\big|\, f_j \in H_j,\ j = 1,\ldots,m \bigr\},$$

and the norm of $f \in H$ fulfills

$$\|f\|_H^2 = \min\Bigl\{ \sum_{j=1}^{m} \|f_j\|_{H_j}^2 \,\Big|\, f = f_1 + \cdots + f_m,\ f_j \in H_j \Bigr\}. \qquad (11)$$

If not otherwise stated, we make the following assumptions throughout the rest of the paper although some of the results are also valid under more general conditions.

Main assumptions

  1. For every $j \in \{1,\ldots,m\}$, the set $\mathcal{X}_j$ is a complete, separable metric space; $k_j$ is a continuous and bounded kernel on $\mathcal{X}_j$ with RKHS $H_j$. Furthermore, $k = k_1 + \cdots + k_m$ denotes the kernel on $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$ defined in Theorem 2 and $H$ denotes its RKHS.

  2. The subset $\mathcal{Y} \subset \mathbb{R}$ is closed.

  3. The loss function $L$ is convex and fulfills the uniform Lipschitz continuity (3) with Lipschitz constant $|L|_1 \in (0,\infty)$. In addition, $L(x,y,y) = 0$ for every $(x,y) \in \mathcal{X}\times\mathcal{Y}$.

Note that every closed subset of $\mathbb{R}^d$ is a complete, separable metric space. We restrict ourselves to Lipschitz continuous loss functions and continuous and bounded kernels because it has been shown earlier that these assumptions are necessary in order to ensure good robustness properties; see e.g. Steinwart and Christmann (2008, § 10). The condition $L(x,y,y) = 0$ is quite natural and practically always fulfilled – it means that the loss of a correct prediction is 0. Our assumptions cover many of the most interesting cases. In particular, the hinge loss (classification), the $\epsilon$-insensitive loss (regression) and the pinball loss (quantile regression) fulfill all assumptions. Many commonly used kernels are continuous. In addition, the Gaussian kernel is always bounded, while the linear kernel and all polynomial kernels are bounded if and only if $\mathcal{X}$ is bounded. From the assumption that the kernels $k_j$ are continuous and bounded on $\mathcal{X}_j$, it follows that the kernel $k$ is continuous and bounded on $\mathcal{X}$.

3.2 Consistency

SVMs are called universally consistent if the risk of the SVM estimator converges, for all probability measures $P \in \mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$, in probability to the Bayes risk, i.e.

$$\mathcal{R}_{L,P}\bigl(f_{L^\star,\mathbb{D}_n,\lambda_n}\bigr) \;\longrightarrow\; \mathcal{R}^*_{L,P} \qquad (n \to \infty). \qquad (12)$$

In order to obtain universal consistency of SVMs, it is necessary to choose a kernel with a large RKHS. Accordingly, most known results about universal consistency of SVMs assume that the RKHS is dense in the space $C(\mathcal{X})$ of continuous functions, where $\mathcal{X}$ is a compact metric space (see e.g. Steinwart (2001)), or, at least, that the RKHS is dense in $L_p(P_X)$ for some $p \in [1,\infty)$. In this paper, we consider an additive model where the goal is to minimize the risk in the set $\mathcal{F}$ of additive functions of the form $f = f_1 + \cdots + f_m$ with $f_j : \mathcal{X}_j \to \mathbb{R}$.

For the consistency of SVMs in an additive model, we do not need that the RKHS $H$ is dense in the whole space $L_1(P_X)$; instead, we only assume that each $H_j$ is dense in $L_1(P_{X_j})$, where $P_{X_j}$ denotes the $j$-th marginal distribution of $P_X$. As usual, $\mathcal{L}_1(\mu)$ denotes the set of all absolutely integrable real-valued functions with respect to some measure $\mu$ and $L_1(\mu)$ denotes the set of all equivalence classes in $\mathcal{L}_1(\mu)$. Theorem 3 shows consistency of SVMs in additive models. That is, the $L^\star$-risk of $f_{L^\star,\mathbb{D}_n,\lambda_n}$ converges in probability to the smallest possible risk in $\mathcal{F}$.

Theorem 3.

Let the main assumptions (p. 3.1) be valid. Let $P \in \mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ be such that

and let each $H_j$ be dense in $L_1(P_{X_j})$ with respect to $\|\cdot\|_{L_1(P_{X_j})}$. Then, for every sequence of regularization parameters $(\lambda_n)_{n\in\mathbb{N}} \subset (0,\infty)$ such that $\lambda_n \to 0$ and $\lambda_n^2 n \to \infty$,

$$\mathcal{R}_{L^\star,P}\bigl(f_{L^\star,\mathbb{D}_n,\lambda_n}\bigr) \;\longrightarrow\; \inf_{f \in \mathcal{F}} \mathcal{R}_{L^\star,P}(f) \qquad (n \to \infty)$$

in probability.

In general, it is not clear whether convergence of the risks implies convergence of the SVM $f_{L^\star,\mathbb{D}_n,\lambda_n}$ itself. However, the following theorem will show such a convergence for quantile regression in an additive model – under the condition that the quantile function actually lies in $\mathcal{F}$. In order to formulate this result, we define

$$d_{P_X}(f, g) := \inf\bigl\{ \varepsilon > 0 \,:\, P_X\bigl(\{ x \in \mathcal{X} : |f(x) - g(x)| > \varepsilon \}\bigr) \le \varepsilon \bigr\},$$

where $f, g : \mathcal{X} \to \mathbb{R}$ are arbitrary measurable functions. It is known that $d_{P_X}$ (the Ky Fan metric) is a metric describing convergence in probability.
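Assuming the metric displayed above is the usual Ky Fan metric for convergence in $P_X$-probability, it can be approximated from a sample of the input variable as follows (a sketch with our own names and a crude grid over the observed differences).

```python
import numpy as np

def ky_fan_distance(f, g, x_sample):
    """Approximate the smallest eps with empirical P(|f(X) - g(X)| > eps) <= eps."""
    diffs = np.abs(f(x_sample) - g(x_sample))
    for eps in np.concatenate(([0.0], np.sort(diffs))):
        if np.mean(diffs > eps) <= eps:
            return eps
    return float(np.max(diffs))

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=(2000, 1))
f = lambda x: np.sin(3.0 * x[:, 0])
g = lambda x: np.sin(3.0 * x[:, 0]) + 0.05 * np.cos(10.0 * x[:, 0])  # a small perturbation
print(ky_fan_distance(f, g, x))   # roughly 0.05, the size of the perturbation
```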

Theorem 4.

Let the main assumptions (p. 3.1) be valid. Let $P \in \mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ be such that

and each $H_j$ is dense in $L_1(P_{X_j})$ with respect to $\|\cdot\|_{L_1(P_{X_j})}$. Let $\tau \in (0,1)$ and assume that the conditional $\tau$-quantile function $f^*_{\tau,P}$ is $P_X$-almost surely unique and that $f^*_{\tau,P} \in \mathcal{F}$. Then, for the pinball loss function with parameter $\tau$ and for every sequence of regularization parameters $(\lambda_n)_{n\in\mathbb{N}} \subset (0,\infty)$ such that $\lambda_n \to 0$ and $\lambda_n^2 n \to \infty$,

$$d_{P_X}\bigl(f_{L^\star,\mathbb{D}_n,\lambda_n},\, f^*_{\tau,P}\bigr) \;\longrightarrow\; 0 \qquad (n \to \infty)$$

in probability.

3.3 Robustness

During the last years, some general results on the statistical robustness properties of SVMs have been shown. Many of these results are directly applicable to SVMs for additive models if the kernel is bounded and continuous (or at least measurable) and the loss function is Lipschitz continuous. For brevity we only give upper bounds for the bias and the Bouligand influence function of SVMs, which are both applicable even for non-smooth loss functions like the pinball loss for quantile regression, and refer to Christmann et al. (2009) and Steinwart and Christmann (2008, Chap. 10) for results on the classical influence function proposed by Hampel (1968, 1974) and to Hable and Christmann (2009) for qualitative robustness of SVMs.

Define the function

$$S : \mathcal{M}_1(\mathcal{X}\times\mathcal{Y}) \to H, \qquad S(P) := f_{L^\star,P,\lambda}, \qquad (13)$$

which maps each probability distribution $P$ to its SVM. In robust statistics we are interested in smooth and bounded functions $S$, because this will give us stable SVMs within small neighborhoods of $P$. If an appropriately chosen derivative of $S$ is bounded, then we expect the value of $S(Q)$ to be close to the value of $S(P)$ for distributions $Q$ in a small neighborhood of $P$.

The next result shows that the $H$-norm of the difference of two SVMs increases at most linearly with respect to the mixture proportion $\varepsilon$ in gross-error neighborhoods. The norm of total variation of a signed measure $\mu$ is denoted by $\|\mu\|_{TV}$.

Theorem 5 (Bounds for bias).

If the main assumptions (p. 3.1) are valid, then we have, for all $\lambda > 0$, all $\varepsilon \in [0,1]$, and all probability measures $P$ and $Q$ on $\mathcal{X}\times\mathcal{Y}$, that

(14)
(15)

where $P_\varepsilon := (1-\varepsilon)P + \varepsilon Q$.

Because of (7), there are analogous bias bounds for SVMs with respect to the supremum norm on $\mathcal{X}$, if $\|k\|_\infty$ is replaced by $\|k\|_\infty^2$.

While F.R. Hampel’s influence function is related to a Gâteaux derivative, which is linear, the Bouligand influence function is related to the Bouligand derivative, which only needs to be positive homogeneous. Because this weak derivative is less well known in statistics, we recall its definition. Let $E_1$ and $E_2$ be normed linear spaces. A function $h : E_1 \to E_2$ is called positive homogeneous if $h(\alpha x) = \alpha\, h(x)$ for all $\alpha \ge 0$ and for all $x \in E_1$. If $U$ is an open subset of $E_1$, then a function $G : U \to E_2$ is called Bouligand-differentiable at a point $x_0 \in U$ if there exists a positive homogeneous function $\nabla^B G(x_0) : E_1 \to E_2$ such that

$$\lim_{h \to 0} \frac{\bigl\| G(x_0 + h) - G(x_0) - \nabla^B G(x_0)(h) \bigr\|_{E_2}}{\|h\|_{E_1}} = 0;$$

see Robinson (1991).

The Bouligand influence function (BIF) of the map $S : \mathcal{M}_1(\mathcal{X}\times\mathcal{Y}) \to H$ for a distribution $P$ in the direction of a distribution $Q \ne P$ was defined by Christmann and Van Messem (2008) as the function $\mathrm{BIF}(Q; S, P) \in H$ satisfying

$$\lim_{\varepsilon \downarrow 0} \frac{\bigl\| S\bigl((1-\varepsilon)P + \varepsilon Q\bigr) - S(P) - \varepsilon\, \mathrm{BIF}(Q; S, P) \bigr\|_H}{\varepsilon} = 0. \qquad (16)$$

Note that the BIF is a special Bouligand derivative due to the fact that $P$ and $Q$ are fixed, and it is independent of the norm used on the set of probability measures. The partial Bouligand derivative with respect to the third argument of $L$ is denoted by $\nabla^B_3 L$. The BIF shares with F.R. Hampel’s influence function the interpretation that it measures the impact of an infinitesimally small amount of contamination of the original distribution $P$ in the direction of $Q$ on the quantity of interest $S(P)$. It is thus desirable that the function $S$ has a bounded BIF. It is known that existence of the BIF implies existence of the IF and that, in this case, they are equal. The next result shows that, under some conditions, the Bouligand influence function of SVMs exists and is bounded; see Christmann et al. (2009) for more related results.

Theorem 6 (Bouligand influence function).

Let the main assumptions (p. 3.1) be valid, but assume that $\mathcal{X}$ is a complete separable normed linear space, e.g. a linear subspace of $\mathbb{R}^d$ (Bouligand derivatives are only defined in normed linear spaces). Let $\lambda > 0$ and let $P$ and $Q$ be probability measures on $\mathcal{X}\times\mathcal{Y}$. Let $L$ be the pinball loss function with $\tau \in (0,1)$ or let $L$ be the $\epsilon$-insensitive loss function with $\epsilon > 0$. Assume that, for suitable positive constants, the following inequalities hold for the conditional distributions $P(\cdot\,|\,x)$:

(17)

Then the Bouligand influence function of $S(P) = f_{L^\star,P,\lambda}$ exists, is bounded, and equals

(18)

Note that the Bouligand influence function of the SVM only depends on the contaminating distribution $Q$ via the second term in (18). The interpretation of condition (17) is that the conditional probability that $Y$, given $X = x$, lies in some small interval around the SVM $f_{L^\star,P,\lambda}(x)$ is essentially at most proportional to the length of the interval raised to some power greater than one.

For the pinball loss function, the BIF given in (18) simplifies to

(19)

The BIF of the SVM based on the pinball loss function can hence be interpreted as the difference of two integrals of the canonical feature map, with respect to $P$ and to $Q$, weighted by the difference between the estimated quantile level and the desired quantile level $\tau$.

Recall that the BIF is a special Bouligand derivative and is thus positive homogeneous in the direction of contamination. If the BIF exists, we then immediately obtain the approximation

$$S\bigl((1-\varepsilon)P + \varepsilon Q\bigr) - S(P) \;\approx\; \varepsilon \cdot \mathrm{BIF}(Q; S, P)$$

for all sufficiently small $\varepsilon \in (0,1)$. This approximation describes the asymptotic bias term $S\bigl((1-\varepsilon)P + \varepsilon Q\bigr) - S(P)$ in terms of the amount of contamination $\varepsilon$.

4 Examples

In this section we illustrate our theoretical results on SVMs for additive models with a few finite sample examples. The goals of this short section are twofold. We would like to get some preliminary insight into how SVMs based on kernels designed for additive models behave for finite sample sizes when compared to the standard GRBF kernel defined on the whole input space, and to get some ideas for further research on this topic. We would also like to apply support vector machines based on the additive kernels treated in this paper to a real-life data set.

4.1 Simulated example

Let us consider the following situation of median regression. We have two independent input variables $X_1$ and $X_2$, each with a uniform distribution on a bounded interval, and the output variable $Y$ given $(x_1, x_2)$ has a Cauchy distribution (so that not even the first moment exists) with center $f(x_1,x_2) = f_1(x_1) + f_2(x_2)$. Hence the true function $f$ we would like to estimate with SVMs has an additive structure, where the first function $f_1$ is a polynomial of order two and the second function $f_2$ is a smooth and bounded function but not a polynomial. Please note that here the input space $\mathcal{X}$ is bounded whereas the output space $\mathcal{Y} = \mathbb{R}$ is unbounded. As $\mathcal{X}$ is bounded, even a polynomial kernel on $\mathcal{X}$ is bounded, which is not true for unbounded input spaces. We simulated three data sets of this type with increasing sample sizes. We compare the exact function $f$ with three SVMs fitted to the three data sets, where we use the pinball loss function with $\tau = 0.5$ because we are interested in median regression.

  • Nonparametric SVM. We use an SVM based on the 2-dimensional GRBF kernel defined in (8) to fit $f$ in a totally nonparametric manner.

  • Nonparametric additive SVM. We use an SVM based on the kernel $k = k_1 + k_2$, where $k_1$ and $k_2$ are 1-dimensional GRBF kernels.

  • Semiparametric additive SVM. We use an SVM based on the kernel $k = k_1 + k_2$, where $k_1$ is a polynomial kernel of order 2 to fit the function $f_1$ and $k_2$ is a 1-dimensional GRBF kernel to fit the function $f_2$.

Our interest in these examples is to check how well SVMs using kernels designed for additive models perform in these situations. No attempt was made to find optimal values of the regularization parameter and the kernel parameter by a grid search or cross-validation, because we did not want to mix the quality of such optimization strategies with the choice of the kernels. We therefore fixed the kernel parameter and used a simple non-stochastic specification of the regularization parameter $\lambda_n$ which guarantees that our consistency result from Section 3 is applicable.
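The following sketch mimics the simulation in spirit. Everything the text does not pin down is our own illustrative choice: the concrete component functions (a quadratic for f1 and a bounded sine for f2), the sample size, the kernel and regularization parameters, and the plain subgradient-descent solver for the pinball objective; none of these are the paper's actual specifications.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: additive median structure with Cauchy distributed outputs.
n = 500
x1 = rng.uniform(-1.0, 1.0, size=n)
x2 = rng.uniform(-1.0, 1.0, size=n)
f1 = lambda u: 2.0 * u ** 2 - 1.0            # illustrative polynomial of order two
f2 = lambda u: np.sin(np.pi * u)             # illustrative smooth, bounded, non-polynomial
y = f1(x1) + f2(x2) + rng.standard_cauchy(size=n)   # Cauchy noise: no first moment

# Semiparametric additive kernel: polynomial (order 2) Gram matrix on x1 + 1d GRBF on x2.
K_poly = (np.outer(x1, x1) + 1.0) ** 2
K_rbf = np.exp(-((x2[:, None] - x2[None, :]) ** 2) / 0.5 ** 2)
K = K_poly + K_rbf

# Median regression: minimize (1/n) sum_i pinball_0.5(y_i - (K a)_i) + lam * a' K a
# by plain subgradient descent on the expansion coefficients a (slow but simple).
tau, lam, eta = 0.5, 1e-3, 1e-3
alpha = np.zeros(n)
for _ in range(2000):
    t = K @ alpha
    g_loss = np.where(y > t, -tau, 1.0 - tau)   # subgradient of the pinball loss in t
    alpha -= eta * (K @ g_loss / n + 2.0 * lam * (K @ alpha))

fit = K @ alpha                                  # fitted conditional median at the data
truth = f1(x1) + f2(x2)
print("median absolute error of the fit:", np.median(np.abs(fit - truth)))
```

A real experiment would of course use a proper convex solver and tuned parameters; the sketch only shows how the additive kernel plugs into the standard pinball-loss SVM machinery.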

From Figures 1 to 3 we can draw the following conclusions for this special situation.

  1. If the additive model is valid, all three SVMs give comparable and reasonable results if the sample size is large enough even for Cauchy distributed error terms, see Figure 1. This is in good agreement with the theoretical results derived in Section 3.

  2. If the sample size is small to moderate and if the assumed additive model is valid, then both SVMs based on kernels especially designed for additive models show better results than the standard 2-dimensional GRBF kernel, see Figures 2 and 3.

  3. The difference between the nonparametric additive SVM and the semiparametric additive SVM was somewhat surprisingly small for all three sample sizes, although the true function had the very special structure which favours the semiparametric additive SVM.

Figure 1: Quantile regression using SVMs and the pinball loss function with $\tau = 0.5$ for the largest simulated sample size. The output $Y$ given $(x_1,x_2)$ is Cauchy distributed with center $f(x_1,x_2) = f_1(x_1) + f_2(x_2)$, where $x_1$ and $x_2$ are observations of independent, identically uniformly distributed random variables on a bounded interval. Upper left subplot: true function $f$. Upper right subplot: SVM fit based on the 2-dimensional GRBF kernel (nonparametric SVM). Lower left subplot: SVM fit based on the sum of two 1-dimensional GRBF kernels (nonparametric additive SVM). Lower right subplot: SVM fit based on the sum of a 1-dimensional polynomial kernel and a 1-dimensional GRBF kernel (semiparametric additive SVM).

Figure 2: Same model and subplot layout as in Figure 1, for a smaller sample size.

Figure 3: Same model and subplot layout as in Figure 1, for a smaller sample size.

4.2 Example: additive model using SVMs for rent standard

Let us now consider a real-life example of the rent standard for dwellings in a large city in Germany. Many German cities compose so-called rent standards to make a decision-making instrument available to tenants, landlords, renting advisory boards, and experts. Such rent standards can in particular be used for the determination of the local comparative rent, i.e. the net rent as a function of the dwelling size, the year of construction of the house, geographical information, etc. For the construction of a rent standard, a representative random sample is drawn from all households, and trained interviewers use questionnaires to determine the relevant information. Fahrmeir et al. (2007) described such a data set consisting of rent prices in Munich, which is one of the largest cities in Germany. They fitted the following additive model