1 Introduction
Kernel methods such as support vector machines have been among the most successful learning methods for more than a decade, see Schölkopf and Smola (2002). Examples are classification or regression models where we have an input space $\mathcal{X}$, an output space $\mathcal{Y}\subseteq\mathbb{R}$, some unknown probability measure $P$ on $\mathcal{X}\times\mathcal{Y}$, and an unknown function $f\colon\mathcal{X}\to\mathbb{R}$ which describes the quantity of interest, e.g. the conditional quantile curve, of the conditional distribution of $Y$ given $X=x$. Support vector machines can informally be described as a kind of regularized M-estimators for functions and have demonstrated their usefulness in many complicated high-dimensional real-life problems. Besides several other nice features, one key argument for using SVMs has been the so-called “kernel trick” (Schölkopf et al., 1998), which decouples the SVM optimization problem from the domain of the samples, thus making it possible to use SVMs on virtually any input space. This flexibility is in strong contrast to more classical learning methods from both machine learning and nonparametric statistics, which almost always require input spaces $\mathcal{X}\subseteq\mathbb{R}^d$.
As a result, kernel methods have been successfully used in various application areas that were previously infeasible for machine learning methods. As examples we refer to (i) SVMs using probability measures, e.g. histograms, as input samples, which have been used to analyze histogram data and coloured images (Hein and Bousquet, 2005, Sriperumbudur et al., 2009), (ii) SVMs for text classification and web mining (Joachims, 2002, Lafferty and Lebanon, 2005), and (iii) SVMs with kernels from computational biology, e.g. kernels for trees and graphs (Schölkopf et al., 2004). For a data set $D = \big((x_1,y_1),\ldots,(x_n,y_n)\big) \in (\mathcal{X}\times\mathcal{Y})^n$, the empirical SVM is defined as
(1) $\displaystyle f_{D,\lambda} = \arg\min_{f\in H}\; \frac{1}{n}\sum_{i=1}^{n} L\big(x_i, y_i, f(x_i)\big) + \lambda\,\lVert f\rVert_H^2.$
That is, SVMs are based on three key components: (i) a convex loss function $L$ used to measure the quality of the prediction $f(x)$, (ii) a reproducing kernel Hilbert space (RKHS) $H$ of functions $f\colon\mathcal{X}\to\mathbb{R}$ to specify the set of functions over which the expected loss is minimized, and (iii) the regularization term $\lambda\,\lVert f\rVert_H^2$ to reduce the danger of overfitting and to guarantee the existence of a unique SVM even if $L$ is not strictly convex. The RKHS $H$ is often implicitly defined by specifying a kernel $k$. Details about the definition of SVMs and some examples will be given in Section 2.
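By the representer theorem, the minimizer of (1) has a finite expansion $f = \sum_{j=1}^{n} \alpha_j\, k(x_j,\cdot)$, so the objective can be evaluated from the Gram matrix alone. The following Python sketch is our own illustration, not part of the original analysis; the function names and the choice of a Gaussian kernel are assumptions:

```python
import numpy as np

def grbf_kernel(x, xp, gamma=1.0):
    # Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / gamma^2)
    x, xp = np.asarray(x, dtype=float), np.asarray(xp, dtype=float)
    return np.exp(-np.sum((x - xp) ** 2) / gamma ** 2)

def regularized_empirical_risk(alpha, X, y, loss, lam, gamma=1.0):
    """Objective of the empirical SVM for f = sum_j alpha_j * k(x_j, .):
    (1/n) * sum_i loss(y_i, f(x_i)) + lam * ||f||_H^2, with ||f||_H^2 = a'Ka."""
    n = len(X)
    K = np.array([[grbf_kernel(X[i], X[j], gamma) for j in range(n)]
                  for i in range(n)])
    f_vals = K @ alpha                       # f(x_i) via the reproducing property
    data_fit = np.mean([loss(y[i], f_vals[i]) for i in range(n)])
    return data_fit + lam * float(alpha @ K @ alpha)
```

Minimizing this convex objective over the coefficient vector alpha, e.g. with any convex solver, yields the empirical SVM.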
During the last years, a great part of the statistical research on SVMs has concentrated on the central question of how to choose the loss function $L$, the RKHS $H$ or its kernel $k$, and sequences of regularization parameters $(\lambda_n)$ to guarantee that SVMs are universally consistent and statistically robust for classification and regression purposes. In a nutshell, it turned out in a purely nonparametric setup that SVMs based on the combination of a Lipschitz continuous loss function and a bounded continuous kernel with a dense and separable RKHS are universally consistent with desirable statistical robustness properties for any probability measure from which we observed the data set, see, e.g., Steinwart and Christmann (2008) and Christmann et al. (2009) for details. Examples are the combination of the Gaussian RBF kernel with the pinball loss function for nonparametric quantile regression, with the $\varepsilon$-insensitive loss function for nonparametric regression, or with the hinge loss function for nonparametric classification, see Section 2.
Although a nonparametric approach is often the best choice in practice due to the lack of prior knowledge on $f$, a semiparametric approach or an additive model (Friedman and Stuetzle, 1981, Hastie and Tibshirani, 1990) can also be valuable. For example, we may for practical reasons only be interested in functions which offer a nice interpretation, because an interpretable prediction function can be crucial if the prediction has to be explainable to clients. This can be the case if the prediction is the expected claim amount of a client and these predictions are the basis for the construction of an insurance tariff. Here we will mainly consider additive models, although models with a multiplicative structure can also be of interest. More precisely, for some $m\in\mathbb{N}$, the input space $\mathcal{X}$ is split up into nonempty spaces $\mathcal{X}_1,\ldots,\mathcal{X}_m$ according to
(2) $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$
and only additive functions of the form
$f(x) = f_1\big(x^{(1)}\big) + \cdots + f_m\big(x^{(m)}\big), \qquad x = \big(x^{(1)},\ldots,x^{(m)}\big) \in \mathcal{X},$
are considered, where $f_j\colon\mathcal{X}_j\to\mathbb{R}$ for $j=1,\ldots,m$.
To the best of our knowledge, there are currently no published results on consistency and statistical robustness of SVMs based on kernels designed for additive models. Of course, one can use one of the purely nonparametric SVMs described above, but the hope is that SVMs based on kernels especially designed for such situations may offer better results.
In this paper we address the question of how to design specific SVMs for additive models. The main goal of this paper is to give an explicit construction principle for kernels, and thus for their RKHSs, which in combination with a Lipschitz continuous loss function leads to consistent and statistically robust SVMs for additive models. Examples are SVMs in additive models for quantile regression based on the pinball loss function, for regression based on the $\varepsilon$-insensitive loss function, and for classification based on the hinge loss function.
The rest of the paper is organized as follows. In Section 2 we collect some known results on loss functions, kernels and their RKHSs, and on support vector machines. These results are needed to state our results on consistency and statistical robustness of SVMs for additive models in Section 3. Although we so far have no results on rates of convergence, our numerical examples given in Section 4 will demonstrate that SVMs based on kernels designed for additive models can easily outperform standard nonparametric SVMs if the assumption of an additive model is valid. Section 5 contains the discussion. All proofs are given in the Appendix.
2 Background on support vector machines
Let $\mathcal{X}$ be a complete separable metric space and let $\mathcal{Y}$ be a closed subset of $\mathbb{R}$. We will always use the respective Borel $\sigma$-algebras. The set of all probability measures on the Borel $\sigma$-algebra of $\mathcal{X}\times\mathcal{Y}$ is denoted by $\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$. The random input variables $X_1,\ldots,X_n$ take their values in $\mathcal{X}$ and the random output variables $Y_1,\ldots,Y_n$ take their values in $\mathcal{Y}$. It is assumed that $(X_1,Y_1),\ldots,(X_n,Y_n)$ are independent and identically distributed according to some unknown probability measure $P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$. Since $\mathcal{Y}$ is closed, $P$ can be split into the marginal distribution $P^X$ on $\mathcal{X}$ and the conditional distribution $P(\,\cdot\,|x)$ of $Y$ given $X=x$.
The goal is to find a good predictor which predicts the value of an output variable after observing the value of the corresponding input variable. The quality of a prediction is measured by a loss function
$L\colon \mathcal{X}\times\mathcal{Y}\times\mathbb{R} \to [0,\infty).$
It is assumed that $L$ is measurable and $L(x,y,y)=0$ for every $(x,y)\in\mathcal{X}\times\mathcal{Y}$ – that is, the loss is zero if the prediction equals the actual value of the output variable. In addition, we make the standard assumption that
$t \mapsto L(x,y,t)$
is convex for every $(x,y)\in\mathcal{X}\times\mathcal{Y}$ and that additionally the following uniform Lipschitz property is fulfilled for some real number $|L|_1 \in (0,\infty)$:
(3) $\big|L(x,y,t) - L(x,y,t')\big| \le |L|_1\,|t-t'| \qquad \text{for all } (x,y)\in\mathcal{X}\times\mathcal{Y} \text{ and all } t,t'\in\mathbb{R}.$
We restrict our attention to Lipschitz continuous loss functions because the use of loss functions which are not Lipschitz continuous (such as the least squares loss which is only locally Lipschitz continuous on unbounded domains) usually conflicts with robustness; see, e.g., Steinwart and Christmann (2008, § 10.4).
The quality of a (measurable) predictor $f\colon\mathcal{X}\to\mathbb{R}$ is measured by the risk
$\mathcal{R}_{L,P}(f) = \int_{\mathcal{X}\times\mathcal{Y}} L\big(x,y,f(x)\big)\, dP(x,y).$
By different choices of $\mathcal{Y}$ and the loss function $L$, different purposes are covered by this setup – e.g. binary classification for $\mathcal{Y}=\{-1,+1\}$ and the hinge loss
$L(y,t) = \max\{0,\, 1 - yt\},$
regression for $\mathcal{Y}\subseteq\mathbb{R}$ and the $\varepsilon$-insensitive loss
$L(y,t) = \max\{0,\, |y-t| - \varepsilon\},$
where $\varepsilon>0$, and quantile regression for $\mathcal{Y}\subseteq\mathbb{R}$ and the pinball loss
(4) $L_\tau(y,t) = \begin{cases} (\tau-1)(y-t), & y-t<0,\\ \tau\,(y-t), & y-t\ge 0,\end{cases}$
where $\tau\in(0,1)$.
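As a minimal illustration, the three loss functions just mentioned can be written down directly. This is our own sketch with hypothetical function names, using the standard textbook formulas:

```python
def hinge_loss(y, t):
    # Hinge loss for binary classification, y in {-1, +1}
    return max(0.0, 1.0 - y * t)

def eps_insensitive_loss(y, t, eps=0.1):
    # epsilon-insensitive loss for regression: deviations below eps cost nothing
    return max(0.0, abs(y - t) - eps)

def pinball_loss(y, t, tau=0.5):
    # Pinball loss of (4) for tau-quantile regression
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r
```

All three are convex in the prediction t and Lipschitz continuous with constant 1 (hinge, epsilon-insensitive) or max(tau, 1 - tau) (pinball), so they satisfy (3).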
An optimal predictor is a measurable function which attains the minimal risk, called the Bayes risk,
$\mathcal{R}^*_{L,P} = \inf\big\{\mathcal{R}_{L,P}(f)\;;\; f\colon\mathcal{X}\to\mathbb{R} \text{ measurable}\big\}.$
The optimal predictor in a set $\mathcal{F}$ of measurable functions is an $f^*\in\mathcal{F}$ which attains the minimal risk
$\inf\big\{\mathcal{R}_{L,P}(f)\;;\; f\in\mathcal{F}\big\}.$
For example, the goal of quantile regression is to estimate a conditional quantile function, i.e., a function $f^*_{\tau,P}\colon\mathcal{X}\to\mathbb{R}$ such that
$P\big(Y \le f^*_{\tau,P}(x)\,\big|\,X=x\big) \ge \tau \quad\text{and}\quad P\big(Y \ge f^*_{\tau,P}(x)\,\big|\,X=x\big) \ge 1-\tau$
for the quantile level $\tau\in(0,1)$. If $\mathbb{E}_P|Y|<\infty$, then the conditional quantile function attains the minimal risk for the pinball loss (with parameter $\tau$), so that quantile regression can be done by trying to minimize the risk in $\mathcal{F}$.
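The connection between the pinball loss and quantiles can be checked numerically: minimizing the empirical pinball risk over constant predictions recovers the empirical $\tau$-quantile. The following brute-force sketch is our own illustration with hypothetical names:

```python
import numpy as np

def pinball_loss(y, t, tau):
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r

def best_constant_predictor(y_sample, tau, grid):
    # Brute-force search for the constant t minimizing the empirical pinball risk
    risks = [np.mean([pinball_loss(y, t, tau) for y in y_sample]) for t in grid]
    return grid[int(np.argmin(risks))]

# For tau = 0.25 and these ten points, the unique risk minimizer is the
# empirical 25%-quantile y_(3) = 3.0.
y_sample = np.arange(1.0, 11.0)           # 1.0, 2.0, ..., 10.0
grid = np.linspace(0.0, 11.0, 1101)       # grid step 0.01
t_star = best_constant_predictor(y_sample, 0.25, grid)
```

For this sample, the empirical risk is piecewise linear in t, strictly decreasing below 3.0 and strictly increasing above it, which is why the grid search lands on the quantile.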
One way to build a nonparametric predictor is to use a support vector machine
(5) $\displaystyle f_{P,\lambda} = \arg\min_{f\in H}\; \mathcal{R}_{L,P}(f) + \lambda\,\lVert f\rVert_H^2,$
where $H$ is a reproducing kernel Hilbert space (RKHS) of a measurable kernel $k\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, and $\lambda>0$ is a regularization parameter to reduce the danger of overfitting, see e.g., Vapnik (1998), Schölkopf and Smola (2002) or Steinwart and Christmann (2008) for details. The reproducing property of $H$ states that, for all $f\in H$ and all $x\in\mathcal{X}$,
$f(x) = \langle f, \Phi(x)\rangle_H,$
where $\Phi\colon\mathcal{X}\to H$, $\Phi(x) = k(\,\cdot\,,x)$, denotes the canonical feature map. A kernel $k$ is called bounded, if
$\lVert k\rVert_\infty := \sup_{x\in\mathcal{X}} \sqrt{k(x,x)} < \infty.$
Using the reproducing property and the Cauchy–Schwarz inequality, we obtain the well-known inequalities
(6) $\lVert \Phi(x)\rVert_H = \sqrt{k(x,x)} \le \lVert k\rVert_\infty$
and
(7) $\lVert f\rVert_\infty \le \lVert k\rVert_\infty\, \lVert f\rVert_H$
for all $f\in H$ and all $x\in\mathcal{X}$. As an example of a bounded kernel, we mention the popular Gaussian radial basis function (GRBF) kernel defined by
(8) $k_\gamma(x,x') = \exp\big(-\gamma^{-2}\,\lVert x - x'\rVert_2^2\big), \qquad x,x'\in\mathcal{X},$
where $\gamma$ is some positive constant and $\mathcal{X}\subseteq\mathbb{R}^d$. This kernel leads to a large RKHS which is dense in $L_1(\mu)$ for all probability measures $\mu$ on $\mathcal{X}$. We will also consider the polynomial kernel
$k(x,x') = \big(\langle x, x'\rangle + c\big)^{p},$
where $p\in\mathbb{N}$, $c\ge 0$, and $\mathcal{X}\subseteq\mathbb{R}^d$. The dot kernel is a special polynomial kernel with $p=1$ and $c=0$. The polynomial kernel is bounded if and only if $\mathcal{X}$ is bounded.
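Both kernels are easy to write down. The following Python sketch is our own illustration with assumed parameter names, implementing the GRBF kernel of (8) and the polynomial kernel:

```python
import numpy as np

def grbf_kernel(x, xp, gamma=1.0):
    # GRBF kernel of (8): exp(-||x - x'||^2 / gamma^2)
    x, xp = np.atleast_1d(x).astype(float), np.atleast_1d(xp).astype(float)
    return np.exp(-np.sum((x - xp) ** 2) / gamma ** 2)

def polynomial_kernel(x, xp, degree=2, c=1.0):
    # Polynomial kernel (<x, x'> + c)^degree; degree=1 and c=0 give the dot kernel
    x, xp = np.atleast_1d(x).astype(float), np.atleast_1d(xp).astype(float)
    return (np.dot(x, xp) + c) ** degree
```

Note that the GRBF kernel is bounded by 1 on any input space, while the polynomial kernel grows with the norm of its arguments and is therefore bounded only on bounded input spaces, as stated above.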
Of course, the regularized risk
$\mathcal{R}_{L,P}(f) + \lambda\,\lVert f\rVert_H^2$
is in general not computable, because $P$ is unknown. However, the empirical distribution
$D = \frac{1}{n}\sum_{i=1}^{n} \delta_{(x_i,y_i)}$
corresponding to the data set can be used as an estimator of $P$. Here $\delta_{(x_i,y_i)}$ denotes the Dirac distribution in $(x_i,y_i)$. If we replace $P$ by $D$ in (5), we obtain the regularized empirical risk and the empirical SVM $f_{D,\lambda}$. Furthermore, we need analogous notions where the fixed observations $(x_i,y_i)$ are replaced by the random variables $(X_i,Y_i)$. Thus, we define
$D_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(X_i,Y_i)}.$
Then, for every $\omega$ in the underlying probability space, $D_n(\omega)$ is the empirical distribution corresponding to the data set $\big((X_1(\omega),Y_1(\omega)),\ldots,(X_n(\omega),Y_n(\omega))\big)$ and, accordingly, $f_{D_n,\lambda}$ denotes the mapping $\omega \mapsto f_{D_n(\omega),\lambda}$, and $\mathcal{R}_{L,D_n}(f)$ denotes the mapping $\omega \mapsto \mathcal{R}_{L,D_n(\omega)}(f)$.
Support vector machines need not exist for every probability measure $P$; for Lipschitz continuous loss functions it is sufficient for the existence of $f_{P,\lambda}$ that $\mathbb{E}_P|Y| < \infty$. This condition may be violated by heavy-tailed distributions such as the Cauchy distribution and, in this case, it is possible that $\mathcal{R}_{L,P}(f) = \infty$ for every $f\in H$.
In order to enlarge the applicability of support vector machines to heavy-tailed distributions, the following extension has been developed in Christmann et al. (2009). Following an idea already used by Huber (1967) for M-estimates in parametric models, a shifted loss function $L^*$ is defined by
$L^*(x,y,t) := L(x,y,t) - L(x,y,0).$
Then, similar to the original loss function $L$, define the $L^*$-risk by
$\mathcal{R}_{L^*,P}(f) = \int_{\mathcal{X}\times\mathcal{Y}} L^*\big(x,y,f(x)\big)\, dP(x,y)$
and the regularized $L^*$-risk by
$\mathcal{R}_{L^*,P}(f) + \lambda\,\lVert f\rVert_H^2$
for every $f\in H$. In complete analogy to (5), we define the support vector machine based on the shifted loss function by
(9) $\displaystyle f_{L^*,P,\lambda} = \arg\min_{f\in H}\; \mathcal{R}_{L^*,P}(f) + \lambda\,\lVert f\rVert_H^2.$
If the support vector machine $f_{P,\lambda}$ defined by (5) exists, we have seemingly defined it in two different ways now. However, the two definitions coincide in this case; see Christmann et al. (2009) for this and for further basic properties of SVMs based on the shifted loss function.
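A small sketch illustrates why the shift enlarges the applicability: Lipschitz continuity of $L$ gives $|L^*(x,y,t)| \le |L|_1\,|t|$ uniformly in $y$, so the $L^*$-risk is finite even for heavy-tailed conditional distributions. The following Python example is our own illustration with hypothetical names, checking this bound for the shifted pinball loss on Cauchy-distributed outputs:

```python
import numpy as np

def pinball_loss(y, t, tau=0.5):
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r

def shifted_pinball_loss(y, t, tau=0.5):
    # L*(y, t) = L(y, t) - L(y, 0); Lipschitz continuity of L gives
    # |L*(y, t)| <= |L|_1 * |t| uniformly in y, with |L|_1 = max(tau, 1 - tau)
    return pinball_loss(y, t, tau) - pinball_loss(y, 0.0, tau)

# Even for Cauchy-distributed outputs (no first moment), the shifted loss
# stays uniformly bounded for a fixed prediction t, so its risk is finite.
rng = np.random.default_rng(0)
y_cauchy = rng.standard_cauchy(10000)
t = 2.0
vals = np.array([shifted_pinball_loss(y, t, tau=0.5) for y in y_cauchy])
```

The original pinball loss has infinite expectation under a Cauchy distribution, whereas the shifted values above all lie in the interval [-|L|_1 |t|, |L|_1 |t|].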
3 Support vector machines for additive models
3.1 Model and assumptions
As described in the previous section, the goal is to minimize the risk in a set $\mathcal{F}$ of functions $f\colon\mathcal{X}\to\mathbb{R}$. In this article, we assume an additive model. Accordingly, let
$\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m,$
where $\mathcal{X}_1,\ldots,\mathcal{X}_m$ are nonempty sets. For every $j\in\{1,\ldots,m\}$, let $\mathcal{F}_j$ be a set of functions $f_j\colon\mathcal{X}_j\to\mathbb{R}$. Then, we only consider functions of the form
$f = f_1 + \cdots + f_m$
for $f_j\in\mathcal{F}_j$. Thus,
(10) $\mathcal{F} = \big\{\, f_1 + \cdots + f_m \;;\; f_j\in\mathcal{F}_j,\ j=1,\ldots,m \,\big\}.$
In (10), we have identified $f_j$ with the map $\mathcal{X}\to\mathbb{R}$, $x = \big(x^{(1)},\ldots,x^{(m)}\big) \mapsto f_j\big(x^{(j)}\big)$.
Such additive models can be treated by support vector machines in a very natural way. For every $j\in\{1,\ldots,m\}$, choose a kernel $k_j$ on $\mathcal{X}_j$ with RKHS $H_j$. Then, the space of functions
$H = \big\{\, f_1 + \cdots + f_m \;;\; f_j\in H_j \,\big\}$
is an RKHS on $\mathcal{X}$ with kernel $k = k_1\oplus\cdots\oplus k_m$ given by $k(x,x') = k_1\big(x^{(1)},x'^{(1)}\big) + \cdots + k_m\big(x^{(m)},x'^{(m)}\big)$; see Theorem 2 below. In this way, SVMs can be used to fit additive models, and such SVMs enjoy at least three appealing features: First, it is guaranteed that the predictor has the assumed additive structure $f = f_1 + \cdots + f_m$. Second, it is possible to still use the standard SVM machinery including the kernel trick (Schölkopf and Smola, 2002, § 2) and implementations of SVMs, just by selecting the kernel $k = k_1\oplus\cdots\oplus k_m$. Third, the possibility to choose different kernels $k_j$ offers a great flexibility. For example, take $m=2$, let $k_1$ be a GRBF kernel on $\mathcal{X}_1$ and let $k_2$ be a GRBF kernel on $\mathcal{X}_2$. Since the RKHS of a Gaussian kernel is an infinite dimensional function space, we get nonparametric estimates of $f_1$ and $f_2$. As a second example, consider a semiparametric model with $f = f_1 + f_2$ where $f_1$ is assumed to be a polynomial function of order at most $p$ and $f_2$ may be some complicated function. Then, this semiparametric model can be treated by simply taking a polynomial kernel on $\mathcal{X}_1$ for $f_1$ and a GRBF kernel on $\mathcal{X}_2$ for $f_2$. This can be used, for example, in order to model changes in space (with $\mathcal{X}_2$ specifying the location) or in time (with $\mathcal{X}_2$ specifying the point in time).
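The kernel construction described above can be sketched in a few lines of Python (our own illustration; the function names and the splitting convention are assumptions). The component kernels are simply added, and the resulting Gram matrix stays symmetric and positive semi-definite, i.e., the sum is again a kernel:

```python
import numpy as np

def grbf(u, v, gamma=1.0):
    u, v = np.atleast_1d(u).astype(float), np.atleast_1d(v).astype(float)
    return np.exp(-np.sum((u - v) ** 2) / gamma ** 2)

def additive_kernel(x, xp, kernels, split_points):
    # k(x, x') = k_1(x^(1), x'^(1)) + ... + k_m(x^(m), x'^(m)); split_points
    # cuts the coordinates of x into the m component blocks
    parts = np.split(np.asarray(x, dtype=float), split_points)
    parts_p = np.split(np.asarray(xp, dtype=float), split_points)
    return sum(k(u, v) for k, u, v in zip(kernels, parts, parts_p))

# A sum of kernels is again a kernel: the Gram matrix of the additive
# kernel remains symmetric and positive semi-definite.
X = np.random.default_rng(1).normal(size=(6, 2))
K = np.array([[additive_kernel(x, xp, [grbf, grbf], [1]) for xp in X] for x in X])
```

Any SVM implementation that accepts a user-defined kernel or a precomputed Gram matrix can therefore fit the additive model without further changes.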
Theorem 2.
For every $j\in\{1,\ldots,m\}$, let $\mathcal{X}_j$ be a nonempty set and let
$k_j\colon \mathcal{X}_j\times\mathcal{X}_j\to\mathbb{R}$
be a kernel with corresponding RKHS $H_j$. Define $k = k_1\oplus\cdots\oplus k_m$. That is,
$k(x,x') = k_1\big(x^{(1)},x'^{(1)}\big) + \cdots + k_m\big(x^{(m)},x'^{(m)}\big)$
for every $x, x' \in \mathcal{X} = \mathcal{X}_1\times\cdots\times\mathcal{X}_m$. Then, $k$ is a kernel on $\mathcal{X}$ with RKHS
$H = \big\{\, f_1 + \cdots + f_m \;;\; f_j\in H_j \,\big\}$
and the norm of $f\in H$ fulfills
(11) $\displaystyle \lVert f\rVert_H^2 = \min\Big\{\, \sum_{j=1}^{m} \lVert f_j\rVert_{H_j}^2 \;;\; f = f_1+\cdots+f_m,\ f_j\in H_j \,\Big\}.$
If not otherwise stated, we make the following assumptions throughout the rest of the paper although some of the results are also valid under more general conditions.
Main assumptions

For every $j\in\{1,\ldots,m\}$, the set $\mathcal{X}_j$ is a complete, separable metric space; $k_j$ is a continuous and bounded kernel on $\mathcal{X}_j$ with RKHS $H_j$. Furthermore, $k = k_1\oplus\cdots\oplus k_m$ denotes the kernel on $\mathcal{X} = \mathcal{X}_1\times\cdots\times\mathcal{X}_m$ defined in Theorem 2 and $H$ denotes its RKHS.

The subset $\mathcal{Y}\subseteq\mathbb{R}$ is closed.

The loss function $L\colon\mathcal{X}\times\mathcal{Y}\times\mathbb{R}\to[0,\infty)$ is convex and fulfills the uniform Lipschitz continuity (3) with Lipschitz constant $|L|_1\in(0,\infty)$. In addition, $L(x,y,y)=0$ for every $(x,y)\in\mathcal{X}\times\mathcal{Y}$.
Note that every closed subset of $\mathbb{R}$ is a complete, separable metric space. We restrict ourselves to Lipschitz continuous loss functions and continuous and bounded kernels because it has been shown earlier that these assumptions are necessary in order to ensure good robustness properties; see e.g. Steinwart and Christmann (2008, § 10). The condition $L(x,y,y)=0$ is quite natural and practically always fulfilled – it means that the loss of a correct prediction is 0. Our assumptions cover many of the most interesting cases. In particular, the hinge loss (classification), the $\varepsilon$-insensitive loss (regression) and the pinball loss (quantile regression) fulfill all assumptions. Many commonly used kernels are continuous. In addition, the Gaussian kernel is always bounded, while the linear kernel and all polynomial kernels are bounded if and only if $\mathcal{X}$ is bounded. From the assumption that the kernels $k_j$ are continuous and bounded on $\mathcal{X}_j$, it follows that the kernel $k$ is continuous and bounded on $\mathcal{X}$.
3.2 Consistency
SVMs are called universally consistent if the risk of the SVM estimator converges, for all probability measures $P$, in probability to the Bayes risk, i.e.
(12) $\mathcal{R}_{L,P}\big(f_{D_n,\lambda_n}\big) \;\longrightarrow\; \mathcal{R}^*_{L,P} \qquad (n\to\infty) \quad \text{in probability}.$
In order to obtain universal consistency of SVMs, it is necessary to choose a kernel with a large RKHS. Accordingly, most known results about universal consistency of SVMs assume that the RKHS is dense in $C(\mathcal{X})$ where $\mathcal{X}$ is a compact metric space (see e.g. Steinwart (2001)) or, at least, that the RKHS is dense in $L_1(\mu)$ for every probability measure $\mu$. In this paper, we consider an additive model where the goal is to minimize the risk in the set $\mathcal{F}$ of additive functions defined in (10).
For the consistency of SVMs in an additive model, we do not need that the RKHS $H$ is dense in the whole space $L_1(P^X)$; instead, we only assume that each $H_j$ is dense in $L_1\big(P^{X^{(j)}}\big)$, where $P^{X^{(j)}}$ denotes the $j$-th marginal distribution of $P^X$. As usual, $\mathcal{L}_1(\mu)$ denotes the set of all integrable real-valued functions with respect to some measure $\mu$ and $L_1(\mu)$ denotes the set of all equivalence classes in $\mathcal{L}_1(\mu)$. Theorem 3 shows consistency of SVMs in additive models. That is, the risk of $f_{L^*,D_n,\lambda_n}$ converges in probability to the smallest possible risk in $\mathcal{F}$.
Theorem 3.
Let the main assumptions (Section 3.1) be valid. Let $P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ be such that
$\mathbb{E}_P|Y| < \infty,$
and let each $H_j$ be dense in $L_1\big(P^{X^{(j)}}\big)$. Then, for every sequence of regularization parameters $(\lambda_n)\subset(0,\infty)$ such that $\lambda_n\to 0$ and $\lambda_n^2 n\to\infty$,
$\mathcal{R}_{L^*,P}\big(f_{L^*,D_n,\lambda_n}\big) \;\longrightarrow\; \inf_{f\in\mathcal{F}} \mathcal{R}_{L^*,P}(f)$
in probability.
In general, it is not clear whether convergence of the risks implies convergence of the SVM $f_{L^*,D_n,\lambda_n}$ itself. However, the following theorem will show such a convergence for quantile regression in an additive model – under the condition that the quantile function actually lies in $\mathcal{F}$. In order to formulate this result, we define the Ky Fan metric
$d_P(f,g) = \inf\big\{\, \varepsilon > 0 \;;\; P^X\big(|f - g| > \varepsilon\big) \le \varepsilon \,\big\},$
where $f,g\colon\mathcal{X}\to\mathbb{R}$ are arbitrary measurable functions. It is known that $d_P$ is a metric describing convergence in probability.
Theorem 4.
Let the main assumptions (Section 3.1) be valid. Let $P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ be such that
$\mathbb{E}_P|Y| < \infty,$
and let each $H_j$ be dense in $L_1\big(P^{X^{(j)}}\big)$. Let $\tau\in(0,1)$ and assume that the conditional $\tau$-quantile function $f^*_{\tau,P}$ is $P^X$-almost surely unique and that $f^*_{\tau,P}\in\mathcal{F}$.
Then, for the pinball loss function and for every sequence of regularization parameters $(\lambda_n)\subset(0,\infty)$ such that $\lambda_n\to 0$ and $\lambda_n^2 n\to\infty$,
$d_P\big(f_{L^*,D_n,\lambda_n},\, f^*_{\tau,P}\big) \;\longrightarrow\; 0$
in probability.
3.3 Robustness
In recent years, some general results on the statistical robustness properties of SVMs have been shown. Many of these results are directly applicable to SVMs for additive models if the kernel is bounded and continuous (or at least measurable) and the loss function is Lipschitz continuous. For brevity, we only give upper bounds for the bias and the Bouligand influence function of SVMs, which are both applicable even for nonsmooth loss functions like the pinball loss for quantile regression, and refer to Christmann et al. (2009) and Steinwart and Christmann (2008, Chap. 10) for results on the classical influence function proposed by Hampel (1968, 1974) and to Hable and Christmann (2009) for qualitative robustness of SVMs.
Define the function
(13) $T\colon \mathcal{M}_1(\mathcal{X}\times\mathcal{Y}) \to H, \qquad T(P) := f_{L^*,P,\lambda},$
which maps each probability distribution to its SVM. In robust statistics we are interested in smooth and bounded functions $T$, because this will give us stable SVMs within small neighborhoods of $P$. If an appropriately chosen derivative of $T$ is bounded, then we expect the value of $T(Q)$ to be close to the value of $T(P)$ for distributions $Q$ in a small neighborhood of $P$. The next result shows that the norm of the difference of two SVMs increases at most linearly with respect to the mixture proportion $\varepsilon$ in gross-error neighborhoods. The norm of total variation of a signed measure is denoted by $\lVert\cdot\rVert_{TV}$.
Theorem 5 (Bounds for bias).
If the main assumptions (Section 3.1) are valid, then we have, for all $\lambda>0$, all $\varepsilon\in[0,1]$, and all probability measures $P$ and $Q$ on $\mathcal{X}\times\mathcal{Y}$, that
(14)  
(15) 
where .
Because of (7), there are analogous bias bounds for SVMs with respect to the supremum norm $\lVert\cdot\rVert_\infty$, obtained by multiplying the bounds by $\lVert k\rVert_\infty$.
While F.R. Hampel’s influence function is related to a Gâteaux derivative, which is linear, the Bouligand influence function is related to the Bouligand derivative, which only needs to be positive homogeneous. Because this weak derivative is less known in statistics, we would like to recall its definition. Let $E_1$ and $E_2$ be normed linear spaces. A function $h\colon E_1\to E_2$ is called positive homogeneous if $h(\alpha x) = \alpha\, h(x)$ for all $\alpha\ge 0$ and all $x\in E_1$. If $U$ is an open subset of $E_1$, then a function $F\colon U\to E_2$ is called Bouligand differentiable at a point $x_0\in U$ if there exists a positive homogeneous function $\nabla^B F(x_0)\colon E_1\to E_2$ such that
$\lim_{x\to x_0}\; \frac{\big\lVert F(x) - F(x_0) - \nabla^B F(x_0)(x - x_0) \big\rVert_{E_2}}{\lVert x - x_0\rVert_{E_1}} \;=\; 0;$
see Robinson (1991).
The Bouligand influence function (BIF) of the map $T\colon\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})\to H$ for a distribution $P$ in the direction of a distribution $Q\ne P$ was defined by Christmann and Van Messem (2008) as
(16) $\displaystyle \mathrm{BIF}(Q;T,P) \;=\; \lim_{\varepsilon\downarrow 0}\; \frac{T\big((1-\varepsilon)P + \varepsilon Q\big) - T(P)}{\varepsilon}.$
Note that the BIF is a special Bouligand derivative
due to the fact that $P$ and $Q$ are fixed, and it is independent of the chosen norm. The partial Bouligand derivative with respect to the third argument of $L^*$ is denoted by $\nabla^B_3 L^*$. The BIF shares with F.R. Hampel’s influence function the interpretation that it measures the impact of an infinitesimally small amount of contamination of the original distribution $P$ in the direction of $Q$ on the quantity of interest $T(P)$. It is thus desirable that the function $T$ has a bounded BIF. It is known that the existence of the BIF implies the existence of the IF and that, in this case, they are equal. The next result shows that, under some conditions, the Bouligand influence function of SVMs exists and is bounded; see Christmann et al. (2009) for more related results.
Theorem 6 (Bouligand influence function).
Let the main assumptions (Section 3.1) be valid, but assume that $\mathcal{X}$ is a complete separable normed linear space, e.g. a closed linear subspace of a separable Banach space; this is needed because Bouligand derivatives are only defined in normed linear spaces. Let $L$ be the pinball loss function with $\tau\in(0,1)$ or let $L$ be the $\varepsilon$-insensitive loss function with $\varepsilon>0$. Assume that for all $\delta>0$ there exist positive constants such that the following inequalities hold for all $x\in\mathcal{X}$ and all sufficiently small interval lengths:
(17) 
Then the Bouligand influence function of exists, is bounded, and equals
(18) 
Note that the Bouligand influence function of the SVM depends on $Q$ only via the second term in (18). The interpretation of condition (17) is that the probability that $Y$, given $X=x$, lies in some small interval around the value of the SVM is essentially at most proportional to the length of the interval raised to some power greater than one.
For the pinball loss function, the BIF given in (18) simplifies to
(19) 
The BIF of the SVM based on the pinball loss function can hence be interpreted in terms of the difference between the estimated quantile level and the desired quantile level $\tau$, integrated with respect to $P$ and $Q$, respectively.
Recall that the BIF is a special Bouligand derivative and thus positive homogeneous in its direction. If the BIF exists, we then immediately obtain
$T\big((1-\varepsilon)P + \varepsilon Q\big) - T(P) \;\approx\; \varepsilon\,\mathrm{BIF}(Q;T,P)$
for small $\varepsilon\in(0,1)$. This equation gives us a nice approximation of the asymptotic bias term for small amounts $\varepsilon$ of contamination.
4 Examples
In this section we would like to illustrate our theoretical results on SVMs for additive models with a few finite sample examples. The goals of this short section are twofold. First, we would like to get some preliminary insight into how SVMs based on kernels designed for additive models work for finite sample sizes when compared to the standard GRBF kernel defined on the whole input space, and to get some ideas for further research on this topic. Second, we would like to apply support vector machines based on the additive kernels treated in this paper to a real-life data set.
4.1 Simulated example
Let us consider the following situation of median regression. We have two independent input variables $X^{(1)}$ and $X^{(2)}$, each with a uniform distribution on a bounded interval, and the output variable $Y$, given $\big(X^{(1)},X^{(2)}\big) = \big(x^{(1)},x^{(2)}\big)$, has a Cauchy distribution (and thus not even the first moment does exist) with center
$f\big(x^{(1)},x^{(2)}\big) = f_1\big(x^{(1)}\big) + f_2\big(x^{(2)}\big).$
Hence the true function $f$ we would like to estimate with SVMs has an additive structure, where the first function $f_1$ is a polynomial of order two and the second function $f_2$ is a smooth and bounded function but no polynomial. Please note that here $\mathcal{X}$ is bounded whereas $\mathcal{Y}=\mathbb{R}$ is unbounded. As $\mathcal{X}$ is bounded, even a polynomial kernel on $\mathcal{X}$ is bounded, which is not true for unbounded input spaces. We simulated three data sets of this type with increasing sample sizes. We compare the exact function $f$ with three SVMs fitted on these data sets, where we use the pinball loss function with $\tau=0.5$ because we are interested in median regression.
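Data of this type can be generated as follows. The concrete choices of $f_1$ and $f_2$, the interval of the uniform distribution, and the seed in this Python sketch are hypothetical stand-ins, since the exact specifications are not reproduced here; only the qualitative structure (a polynomial of order two plus a smooth bounded non-polynomial, with Cauchy errors) follows the text:

```python
import numpy as np

def simulate_data(n, seed=0):
    """Median-regression example with additive structure and Cauchy noise.
    f1 and f2 below are hypothetical stand-ins: f1 is a polynomial of order
    two, f2 is smooth, bounded and not a polynomial, as described in the text."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-1.0, 1.0, n)        # assumed interval, not from the paper
    x2 = rng.uniform(-1.0, 1.0, n)
    f1 = x1 ** 2                          # polynomial of order two (assumed form)
    f2 = np.sin(np.pi * x2)               # smooth, bounded, no polynomial (assumed)
    y = f1 + f2 + rng.standard_cauchy(n)  # Cauchy errors centered at f1 + f2
    return np.column_stack([x1, x2]), y
```

Because the conditional distribution of the output is Cauchy, its median equals the center f1 + f2, so median regression targets exactly the additive function of interest.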
Nonparametric SVM. We use an SVM based on the 2-dimensional GRBF kernel defined in (8) to fit $f$ in a totally nonparametric manner.

Nonparametric additive SVM. We use an SVM based on the kernel $k = k_1 \oplus k_2$, where $k_1$ and $k_2$ are 1-dimensional GRBF kernels.

Semiparametric additive SVM. We use an SVM based on the kernel $k = k_1 \oplus k_2$, where $k_1$ is a polynomial kernel of order 2 to fit the function $f_1$ and $k_2$ is a 1-dimensional GRBF kernel to fit the function $f_2$.
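The three kernel choices above can be sketched as follows (our own Python illustration with assumed names and default parameter values; the kernel parameters actually used in the experiments are not reproduced here):

```python
import numpy as np

def grbf(u, v, gamma=1.0):
    u, v = np.atleast_1d(u).astype(float), np.atleast_1d(v).astype(float)
    return np.exp(-np.sum((u - v) ** 2) / gamma ** 2)

def poly2(u, v, c=1.0):
    u, v = np.atleast_1d(u).astype(float), np.atleast_1d(v).astype(float)
    return (np.dot(u, v) + c) ** 2

def nonparametric_svm_kernel(x, xp):
    # 2-dimensional GRBF kernel on the whole input space
    return grbf(x, xp)

def nonparametric_additive_kernel(x, xp):
    # Sum of two 1-dimensional GRBF kernels, one per input coordinate
    return grbf(x[0], xp[0]) + grbf(x[1], xp[1])

def semiparametric_additive_kernel(x, xp):
    # Polynomial kernel of order 2 for f1 plus a 1-dimensional GRBF for f2
    return poly2(x[0], xp[0]) + grbf(x[1], xp[1])
```

All three variants can be plugged into the same SVM solver; only the Gram matrix changes.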
Our interest in these examples is to check how well SVMs using kernels designed for additive models perform in these situations. No attempt was made to find optimal values of the regularization parameter $\lambda$ and the kernel parameter $\gamma$ by using a grid search or cross-validation, because we did not want to mix the quality of such optimization strategies with the choice of the kernels. We therefore fixed $\gamma$ and used a simple nonstochastic specification of the regularization parameter $\lambda_n$ which guarantees that our consistency result from Section 3 is applicable.
From Figures 1 to 3 we can draw the following conclusions for this special situation.

The difference between the nonparametric additive SVM and the semiparametric additive SVM was, somewhat surprisingly, small for all three sample sizes, although the true function $f$ had the very special structure which favours the semiparametric additive SVM.
[Figures 1 to 3: for each of the three sample sizes, the four panels show the true function, the nonparametric SVM, the nonparametric additive SVM, and the semiparametric additive SVM.]
4.2 Example: additive model using SVMs for rent standard
Let us now consider a real-life example of the rent standard for dwellings in a large city in Germany. Many German cities compile so-called rent standards to make a decision-making instrument available to tenants, landlords, renting advisory boards, and experts. Such rent standards can in particular be used for the determination of the local comparative rent, i.e. the net rent as a function of the dwelling size, year of construction of the house, geographical information, etc. For the construction of a rent standard, a representative random sample is drawn from all households, and trained interviewers use questionnaires to determine the relevant information. Fahrmeir et al. (2007) described such a data set consisting of rent prices in Munich, which is one of the largest cities in Germany. They fitted the following additive model