## 1 Introduction

The support vector machine (SVM), originally introduced by Vapnik (1995), is a popular and powerful technique in machine learning, owing mainly to its strong practical performance and solid theoretical foundations. For supervised classification problems, SVM is based on the margin-maximization principle endowed with a specified kernel, which is induced by a nonlinear map from the input space to the feature space.

Denote $\mathcal{X}$ and $\mathcal{Y}=\{-1,1\}$ as the input space and the corresponding output space, respectively. Let $(X,Y)$ be a random vector drawn from an unknown joint distribution $\rho$ on $\mathcal{X}\times\mathcal{Y}$. Suppose that all the observations $\{(x_i,y_i)\}_{i=1}^n$ are available from $\rho$. In empirical risk minimization, the standard $L_2$-norm SVM has the widely used hinge loss plus $L_2$-norm penalty formulation. Recall that the empirical hinge loss function is defined by

$$\mathcal{E}_n(f)=\frac{1}{n}\sum_{i=1}^n \phi\big(y_i f(x_i)\big),$$

where the hinge loss is $\phi(u)=(1-u)_+$, with $(u)_+$ denoting the positive part of $u$. The standard SVM can be expressed as the following regularization problem

$$\min_{f\in\mathcal{H}_K}\ \mathcal{E}_n(f)+\lambda\|f\|_K^2,$$

where $\lambda>0$ is the regularization parameter controlling the functional complexity of $f$. Note that $\mathcal{H}_K$ refers to a reproducing kernel Hilbert space (RKHS), often specified in advance. See Section 2 for more details on RKHS. The book by Steinwart and Christmann (2008) contains a good overview of SVMs and the related learning theory.
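As a minimal illustration of the hinge loss and the regularized empirical risk described above, the following Python sketch evaluates both for a linear decision function. The function names, the toy data, and the restriction to a linear kernel (for which the RKHS norm of $f(x)=\langle w,x\rangle$ equals $\|w\|_2$) are assumptions made for this example only.

```python
import numpy as np

def hinge(u):
    """Hinge loss phi(u) = (1 - u)_+, applied elementwise."""
    return np.maximum(0.0, 1.0 - u)

def empirical_hinge_risk(w, X, y):
    """Average hinge loss of the linear rule f(x) = <w, x> over the sample."""
    return hinge(y * (X @ w)).mean()

def svm_objective(w, X, y, lam):
    """Empirical hinge risk plus lam * ||w||_2^2 (linear-kernel assumption)."""
    return empirical_hinge_risk(w, X, y) + lam * np.dot(w, w)

# Toy data (hypothetical): three points in the plane with labels in {-1, +1}.
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, 0.5])

risk = empirical_hinge_risk(w, X, y)   # only the second point has margin < 1
obj = svm_objective(w, X, y, lam=0.1)
print(risk, obj)
```

Only the second observation violates the unit margin here, so it is the only one contributing to the empirical risk; the penalty term then trades this fit against the size of `w`.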

Among kernel-based learning schemes, including SVM, selecting a suitable kernel remains challenging, and no fully satisfactory answer is available to date; see related work on kernel learning (Lanckriet et al., 2004; Micchelli and Pontil, 2005; Wu et al., 2007; Kloft et al., 2011; Micchelli et al., 2016) for instance. In this paper, we consider a semi-parametric SVM problem based on the linear kernel plus a general nonlinear kernel. Partial linear models have received a great deal of attention in statistics over the last several decades; see (Muller and van de Geer, 2015; Hardle and Liang, 2007; Speckman, 1988). In particular, the linear part of a partial linear model aims at model interpretation, and the nonlinear part is used to enhance model flexibility. As a concrete example from the stock market, the future return of a stock may depend on several company management indexes (e.g., shareholder structure) that are homogeneous across companies, and for these we allow a linear relation with the response. However, the other features (e.g., those from financial statements) should relate nonlinearly to the response, since a company has a complex operation or profit pattern. In practice, load forecasting using a semi-parametric SVM yields better predictions than the conventional approach (Jacobus and Abhisek, 2009). Semi-parametric SVMs have also been applied successfully to the analysis of pharmacokinetic and pharmacodynamic data (Seok et al., 2011). However, to the best of our knowledge, theoretical research on semi-parametric support vector machines is still lacking, and this paper focuses on this topic in the high dimensional setting.

The high dimensional case refers to the setting where the ambient dimension of the covariates is very large (e.g. ), but only a small subset of the covariates is significantly relevant to the response. High dimensional estimation and inference for various models have been investigated in recent years, and interested readers can refer to the two related books by Buhlmann (2019) and Giraud (2014). In particular, high dimensional inference for the linear (or additive) SVM has been widely studied in recent years; see (Tarigan and Geer, 2006; Zhao and Liu, 2012; Zhang et al., 2016; Peng et al., 2016). More precisely, Tarigan and Geer (2006) consider an $\ell_1$-penalized parametric estimation in high dimensions for SVM and prove convergence rates of the excess risk under regularity conditions. Similarly, Zhao and Liu (2012) propose a group-Lasso type regularized approach for the nonparametric additive SVM, establish the oracle properties of the estimator, and develop an efficient numerical algorithm to compute it. For the high dimensional linear SVM, Zhang et al. (2016) and Peng et al. (2016) explicitly investigate the statistical performance, such as variable selection consistency, of the $L_1$-norm and non-convex-penalized SVM. However, all the aforementioned works consider only a single kernel in high dimensions. By contrast, a partially linear SVM has to account for the mutual correlation between two kernels with different structures, as well as the mutual effects between sparsity and nonlinear functional complexity. The non-asymptotic analysis of such semi-parametric models in high-dimensional SVM is therefore considerably more complicated than that based on a single kernel.

Under our partial linear setting, the whole input feature consists of two parts: , where has a linear relation to the response, while the sub-feature has a nonlinear effect on the response. Given all the observations with the sample size , we consider a two-fold regularized learning scheme for the high dimensional partially linear SVM, and the semi-parametric estimator pair is the unique solution of the following unconstrained optimization problem

(1.1) |

where are two regularization hyper-parameters controlling the sparsity of the coefficients and the functional complexity, respectively. In the partial linear setting, the adopted hypothesis space for SVM is induced by the sum of the linear kernel and the general nonlinear kernel. More precisely,

To investigate the statistical performance of the proposed semi-parametric estimator (1.1), we introduce a population target function for the partial linear SVM within . In this paper, the target function we focus on is a global solution of the following population minimization on ,

(1.2) |

Under the partial linear framework, can be written as: , where is the nonparametric component, belonging to a specific RKHS that will be defined in Section 2. For the parametric part, one often assumes that the structure of is sparse in the high dimensional setting, in the sense that the cardinality of is far less than the ambient dimension . Note that the target function is quite different from the Bayes rule, since the latter is an optimal decision function taken over all measurable functions. We can treat as a sparse approximation to the Bayes rule within , particularly when the true function is not sparse. In the present paper, we are not concerned with any approximation error induced by sparse approximation or kernel misspecification.
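The two-fold regularized scheme behind (1.1) can be sketched numerically. The Python block below is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the objective is the empirical hinge loss plus an $\ell_1$ penalty on the linear coefficients and a squared RKHS-norm penalty on the nonparametric component, it parametrizes that component through kernel expansion coefficients `alpha`, and it uses a Gaussian kernel. All function names, parameter names, and data are hypothetical.

```python
import numpy as np

def gaussian_gram(Z, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on the rows of Z (kernel choice is an assumption)."""
    sq_norms = np.sum(Z ** 2, axis=1)
    sq_dists = sq_norms[:, None] - 2.0 * Z @ Z.T + sq_norms[None, :]
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def pl_svm_objective(beta, alpha, X, K, y, lam, mu):
    """Assumed objective: hinge risk + lam * ||beta||_1 + mu * ||g||_K^2.

    The nonparametric part is g = sum_i alpha_i K(z_i, .), so that
    g(z_j) = (K @ alpha)_j and ||g||_K^2 = alpha^T K alpha.
    """
    decision = X @ beta + K @ alpha
    hinge_risk = np.maximum(0.0, 1.0 - y * decision).mean()
    return hinge_risk + lam * np.abs(beta).sum() + mu * (alpha @ K @ alpha)

# Hypothetical toy data: p > n, one active linear covariate, one nonlinear feature.
rng = np.random.default_rng(0)
n, p = 30, 100
X = rng.normal(size=(n, p))
Z = rng.uniform(-1.0, 1.0, size=(n, 1))
y = np.where(X[:, 0] + np.sin(np.pi * Z[:, 0]) > 0, 1.0, -1.0)
K = gaussian_gram(Z)

obj_zero = pl_svm_objective(np.zeros(p), np.zeros(n), X, K, y, lam=0.1, mu=0.01)
beta_sparse = np.zeros(p)
beta_sparse[0] = 1.0
obj_sparse = pl_svm_objective(beta_sparse, np.zeros(n), X, K, y, lam=0.1, mu=0.01)
print(obj_zero, obj_sparse)
```

Since this assumed objective is convex in `(beta, alpha)`, any convex solver could be used to minimize it. The sketch only evaluates it, to make the roles of the two penalties explicit.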

In this paper, we are primarily concerned with learning rates of the excess risk and the estimation errors of the parametric estimator and the nonparametric estimator for the high dimensional SVM. Interestingly, the theoretical results reveal that our derived rate for the parametric estimator depends not only on the sample size and the sparsity parameter, but also on the functional complexity generated by the non-parametric component, and vice versa. As a byproduct, we develop a new weighted empirical process to refine our analysis. Such a process is one of the key theoretical tools in the high dimensional literature on semi-parametric estimation.

The rest of this article is organized as follows. In Section 2, we introduce some basic notation on RKHS, which is used to characterize the functional complexity. Then we impose some regularity assumptions required to establish the convergence rates. At the end of Section 2, we state our main theoretical results in terms of the excess risk and estimation errors. Section 3 is devoted to a detailed proof of the main theorem, and also proves some useful lemmas associated with the weighted empirical process. Section 4 concludes this paper with discussions and possible future research.

Notations. We use to denote the set . For a vector , the -norm is defined as . For two sequences of numbers and , we use an to denote that for some finite positive constant for all . If both and , we use the notation . We also use an for . For a function, we denote the -norm of by with some distribution .

## 2 Conditions and Main Theorems

We begin with the background and notation required for the main statements of our problem. First of all, we introduce the notion of an RKHS. An RKHS can be defined by any symmetric and positive semidefinite kernel function . For each , the function is contained in the Hilbert space ; moreover, the Hilbert space is endowed with an inner product such that acts as the representer of evaluation. In particular, the reproducing property of an RKHS plays an important role in the theoretical analysis and numerical optimization of any kernel-based method,

(2.1) |

This property also implies that with . Moreover, by Mercer’s theorem, a kernel defined on a compact subset of admits the following eigen-decomposition,

(2.2) |

where are the eigenvalues and is an orthonormal basis in . The decay rate of completely characterizes the complexity of RKHS induced by a kernel , and generally it has equivalent relationships with various entropy numbers, see Steinwart and Christmann (2008) for details. With these preparations, we define the quantity,

(2.3) |

Let be the smallest positive solution to the inequality, where is only a technical constant.
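To make the critical quantity concrete, the following Python sketch computes it numerically. Since the exact form of (2.3) is not reproduced here, the sketch assumes the kernel complexity function commonly used in this literature, $Q_n(t)=\big[\tfrac{1}{n}\sum_{j}\min(t^2,\mu_j)\big]^{1/2}$, and defines the critical radius as the smallest positive $t$ with $Q_n(t)\le c\,t^2$. The constant `c = 1`, the eigenvalue sequence $\mu_j=j^{-2}$, the truncation at 5000 eigenvalues, and all function names are assumptions for illustration.

```python
import numpy as np

def kernel_complexity(t, eigvals, n):
    """Assumed form: Q_n(t) = sqrt( (1/n) * sum_j min(t^2, mu_j) )."""
    return np.sqrt(np.minimum(t ** 2, eigvals).sum() / n)

def critical_radius(eigvals, n, c=1.0, tol=1e-10):
    """Smallest positive t with Q_n(t) <= c * t^2, found by bisection.

    Q_n(t) / t is nonincreasing in t while c * t is increasing, so the set of
    positive solutions is an interval [t*, infinity) and bisection applies.
    """
    lo, hi = 0.0, 1.0
    while kernel_complexity(hi, eigvals, n) > c * hi ** 2:
        hi *= 2.0          # enlarge the bracket until hi is a solution
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kernel_complexity(mid, eigvals, n) <= c * mid ** 2:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical polynomially decaying eigenvalues mu_j = j^(-2), truncated at 5000 terms.
eigvals = np.arange(1, 5001, dtype=float) ** (-2.0)
radii = {n: critical_radius(eigvals, n) for n in (100, 1000, 10000)}
for n, r in radii.items():
    print(n, r)
```

Under these assumptions the computed radius shrinks as the sample size grows, roughly like $n^{-1/3}$ for this eigenvalue sequence.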

Then, due to the mutual effects between the high dimensional parametric component and the nonparametric one, we introduce the following quantity related to the convergence rates of the semi-parametric estimate, as illustrated in (2.4),

(2.4) |

We now describe our main assumptions. Our first assumption deals with the tail behavior of the covariate of the linear part.

Assumption A. (i) For simplicity, we assume that with some positive constant ; (ii) The largest eigenvalue of is finite, denoted by .

It appears that a bound on the -values is a restrictive assumption, ruling out the standard sub-gaussian covariates. However, we can usually approximate an unbounded distribution by its truncated version. Imposing such an assumption is only for technical simplicity, and it may be relaxed to general light-tailed random variables. Assumption A(ii) is fairly standard in the literature to identify the coefficients associated with .

Assumption B. There exist constants and such that, for all , the equation (2.5) holds,

(2.5) |

The parameter is called the Bernstein parameter, introduced by Bartlett and Mendelson (2006); Pierre et al. (2019). Fast rates will usually be derived when . This condition is essentially a quantification of the identifiability of the objective function at its minimum . Note that the Bernstein parameter is slightly different from the classical margin parameter adopted by Tarigan and Geer (2006); Chen et al. (2004).

To estimate the parametric and nonparametric parts respectively, some conditions concerning correlations between and are required. For each , let be the projection of onto . To be precise, with (2.6),

(2.6) |

Let and . Each function can be viewed as the best approximation of within . In the extreme case ( is uncorrelated with ), . The following condition is quite common in the semi-parametric estimation (Muller and van de Geer, 2015), ensuring that there is enough information in the data to identify the parametric coefficients.

Assumption C. The smallest eigenvalue of is bounded below by a constant .

Note that, the equation (2.7) always holds with the definition of projection on the -norm,

(2.7) |

This equality ensures that the parametric estimation can be separated from the total estimation, which is very useful in our proof.
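The separation provided by the projection can be checked numerically. The sketch below is an empirical stand-in under stated assumptions: it replaces the RKHS by the span of a small hypothetical dictionary of functions of the nonlinear feature, replaces the population norm by the empirical one, and assumes that (2.7) is a Pythagorean-type identity between the residual part and the projected part. All names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
z = rng.uniform(-1.0, 1.0, size=n)

# Hypothetical finite dictionary standing in for the function space of g.
Phi = np.column_stack([np.ones(n), z, z ** 2, np.sin(np.pi * z)])

# Linear covariates that are correlated with z, plus independent noise.
X = np.outer(np.cos(z), np.ones(p)) + rng.normal(size=(n, p))

# Empirical projection of each column of X onto span(Phi) by least squares.
C, *_ = np.linalg.lstsq(Phi, X, rcond=None)
X_proj = Phi @ C          # best approximation of each covariate within the span
X_res = X - X_proj        # residual part, orthogonal to the span

beta = rng.normal(size=p)
g = Phi @ rng.normal(size=Phi.shape[1])   # an arbitrary element of the span

# Pythagorean decomposition of the squared empirical norm of x^T beta + g.
lhs = np.mean((X @ beta + g) ** 2)
rhs = np.mean((X_res @ beta) ** 2) + np.mean((X_proj @ beta + g) ** 2)
cross = Phi.T @ X_res / n  # should vanish up to rounding error
print(lhs, rhs)
```

The cross term vanishes because least squares residuals are orthogonal to the dictionary, which is what allows the parametric part to be isolated from the total error.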

We are now in a position to derive the learning rate of the estimator defined by the minimization (1.1). We allow the ambient dimension and the number of active covariates to increase with the sample size , while and the dimension of are fixed.

###### Theorem 1.

Let be the proposed semi-parametric estimator for SVM defined in (1.1), with the regularization parameters and . If Assumptions A, B, and C hold, then (2.8) holds with probability at least with some ,

(2.8) |

and at the same time the estimation error has the form (2.9),

(2.9) |

Note that this rate may be interpreted as the sum of a subset selection term () for the linear part and a fixed dimensional non-parametric estimation term (). Depending on the scaling of the triple and the smoothness of the RKHS , either the subset selection term or the non-parametric estimation term may dominate the estimation. In general, if , the -dimensional parametric term dominates the estimation, and vice versa otherwise. At the boundary, the scalings of the two terms are equivalent. In the best situation (), our derived rate of the excess risk is the same as the optimal rate achieved by least squares approaches, see (Koltchinskii and Yuan, 2010; Muller and van de Geer, 2015) for details.

Note also that it is easy to check that Theorem 1 still holds if in the confidence probability is replaced by an arbitrary such that . In this case, the divergence of is not needed and the probability bound in the theorem becomes .

A number of corollaries of Theorem 1 can be obtained with particular choices of different kernels. First of all, we present finite-dimensional -rank operators, i.e., the kernel function can be expressed in terms of eigenfunctions. These eigenfunctions include the linear functions, polynomial functions, as well as the function class based on finite dictionary expansions.

###### Corollary 1.

For a finite rank kernel and for any , we have , which follows from Theorem 1.

Corollary 1 corresponds to the linear case for SVM when . The existing theory in the literature on the linear SVM has paid constant attention to the analysis of the generalization error and variable selection consistency. Zhang et al. (2016) consider the non-convex penalized SVM in terms of variable selection consistency and the oracle property in high dimensions; however, their results rely on a restrictive condition in the case of , so the ultra-high dimensional cases ( with ) are excluded. Under a constrained eigenvalue condition, Peng et al. (2016) provide a tight upper bound for the linear SVM estimator in the norm, of order , which is the same as our rate in Corollary 1 when .

Secondly, we state a result for the RKHS with countably many eigenvalues, decaying at a rate for some smoothness parameter . In fact, this type of scaling covers the Sobolev spaces, consisting of derivative functions with .

###### Corollary 2.

In the previous corollary, we need to compute the critical univariate rate . Given the assumption of polynomial eigenvalue decay, a truncation argument shows that , i.e., . As opposed to Corollary 1, we now discuss the special case where the functional complexity dominates the estimation, that is, the rate of the excess risk is of order . This is a better rate compared with those in Chen et al. (2004) and Wu et al. (2007). The learning rate in Chen et al. (2004) is derived as , where is a separation parameter corresponding to , and is a power appearing in the covering number, satisfying . Our rate can be shown to be better than that of Chen et al. (2004) in the best case with . Similar arguments also hold for the result in Wu et al. (2007).
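The truncation argument mentioned above can be written out as follows. This is a sketch under assumptions that are not confirmed by the text reproduced here: the kernel complexity function is taken to be $Q_n(t)=\big[\tfrac{1}{n}\sum_j \min(t^2,\mu_j)\big]^{1/2}$, the eigenvalues satisfy $\mu_j \asymp j^{-2\alpha}$ with $\alpha>1/2$, and the critical rate $\nu_n$ solves $Q_n(\nu_n)\asymp\nu_n^2$. The symbols $Q_n$, $\mu_j$, $\alpha$, $\nu_n$ and the truncation level $J$ are notation introduced for this illustration.

```latex
% Split the sum at a truncation level J and bound the tail by an integral:
\begin{aligned}
Q_n(t)^2
  &= \frac{1}{n}\sum_{j\ge 1}\min\{t^2,\mu_j\}
   \;\lesssim\; \frac{1}{n}\Big( J\,t^2 + \sum_{j>J} j^{-2\alpha} \Big)
   \;\lesssim\; \frac{1}{n}\Big( J\,t^2 + J^{\,1-2\alpha} \Big). \\
% Balance the two terms with J \asymp t^{-1/\alpha}:
Q_n(t)^2 &\;\lesssim\; \frac{t^{\,2-1/\alpha}}{n}. \\
% Solve Q_n(\nu_n) \asymp \nu_n^2:
\frac{\nu_n^{\,2-1/\alpha}}{n} \asymp \nu_n^{4}
  &\;\Longrightarrow\; \nu_n^{\,2+1/\alpha} \asymp \frac{1}{n}
   \;\Longrightarrow\; \nu_n \asymp n^{-\frac{\alpha}{2\alpha+1}},
   \qquad \nu_n^{2} \asymp n^{-\frac{2\alpha}{2\alpha+1}}.
\end{aligned}
```

Under these assumptions, the squared critical rate matches the familiar nonparametric rate for a smoothness-$\alpha$ function class.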

## 3 Proofs

In this section, we provide the proof of our main theorem (Theorem 1). At a high level, Theorem 1 is based on an appropriate adaptation to the semi-parametric setting of various techniques developed for sparse linear regression or additive non-parametric estimation in high dimensions (Buhlmann, 2019). In contrast to the parametric or additive setting, it requires additional structural arguments from empirical process theory to control the error terms in the semi-parametric case. In particular, we make use of several concentration theorems for empirical processes (Geer, 2000), as well as results on the Rademacher complexity of kernel classes (Bartlett et al., 2005).

### 3.1 Proof for Theorem 1

We write the total empirical objective as the equation (3.1),

(3.1) |

The population risk for partial linear SVM is defined by (3.2),

(3.2) |

According to the definition of , it holds that . That means,

(3.3) |

The inequality (3.3) can be rewritten in the form (3.4),

(3.4) |

For simplicity, we denote,

(3.5) |

In order to derive the upper bound of in (3.5), a new weighted empirical process is proposed in our semi-parametric high dimensional setting. The process is related to the uniform law of large numbers in a mixed function space. Lemma 1 can be derived via the peeling device, which is often used in probability theory.

###### Lemma 1.

We continue our proof (3.1) with the results established in Lemma 1. Applying the weighted empirical process, we obtain (3.8) from (3.1),

(3.8) |

Therefore, when the conditions and are both satisfied, (3.9) holds after the inequality (3.1),

(3.9) |

where we use the basic inequality . Since implies that , (3.10) can be derived from Assumptions B and C and the equality (2.7),

(3.10) |

Moreover, (3.11) follows by Assumption A after some simple computations,

(3.11) |

where the last inequality follows from the Cauchy-Schwarz inequality. Substituting (3.11) into (3.1), we obtain the inequality (3.12),

(3.12) |

For any , we can then derive the inequality (3.13) with the Young inequality ( with ),

(3.13) | |||

(3.14) |

where . The inequality (3.15) holds if is small enough to satisfy ,

(3.15) |

Furthermore, combining (3.10) and (3.13) with (3.1), we conclude the inequality (3.16),

(3.16) |

where the condition is additionally required so that is negligible. In this case, we can derive (3.17),

(3.17) | |||

(3.18) |

Moreover, we will obtain (3.19) based on (3.17) and (3.1),

(3.19) |

where we choose and . Therefore, it is concluded that (3.20) holds by the triangle inequality and Assumption C,

(3.20) |

Finally, plugging the derived upper bounds into (3.1), we obtain the desired upper bound of the excess risk . This completes the proof.

### 3.2 The semi-parametric weighted empirical process

In order to prove Lemma 1, we require an auxiliary result of uniform-law-of-large-numbers type, namely a concentration inequality, stated as Lemma 2 (Massart, 2000).

###### Lemma 2.

Let be independent and identically distributed copies of a random variable . Let be a class of real-valued functions on satisfying for all . Define

and

Then there exists a universal constant such that

Proof for Lemma 1. For any , we define (3.21) to apply Lemma 2,

(3.21) |

Based on (3.21), a bounded set of functions is introduced,

where we write the triplet . Since in (3.21) is Lipschitz with constant , the inequality (3.22) holds for any ,

(3.22) |

(3.22) implies that if we take (3.23) in Lemma 2,

(3.23) |

and (3.24) is also derived by the Lipschitz property,

(3.24) |

we can plug (3.23) and (3.25) into Lemma 2 to yield (3.26),

(3.25) |

(3.26) |

It remains to provide the upper bound of the term . Let be a Rademacher sequence independent of . The inequality (3.27) can be obtained by symmetrization and the contraction inequality,

(3.27) |
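The symmetrization step replaces the centered empirical process by a Rademacher average, which can be estimated by simulation. The Python sketch below is an illustration under assumptions: it treats only a linear class indexed by the unit $\ell_1$ ball (a common normalization in Lasso-type analyses, assumed here rather than taken from the text), uses bounded covariates in the spirit of Assumption A, and exploits the duality between the $\ell_1$ and $\ell_\infty$ norms to evaluate the supremum in closed form. All names and data are hypothetical.

```python
import numpy as np

def rademacher_l1_ball(X, n_draws=200, seed=0):
    """Monte Carlo estimate of E sup_{||b||_1 <= 1} | (1/n) sum_i sigma_i <x_i, b> |.

    By l1/l-infinity duality the supremum equals || (1/n) X^T sigma ||_infinity,
    so no optimization over b is required.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    sup_values = np.abs(sigma @ X / n).max(axis=1)        # one supremum per draw
    return sup_values.mean()

# Hypothetical bounded design with p > n.
rng = np.random.default_rng(1)
n, p = 400, 1000
X = rng.uniform(-1.0, 1.0, size=(n, p))

est = rademacher_l1_ball(X)
est_small = rademacher_l1_ball(X[:100])
finite_class_bound = np.sqrt(2.0 * np.log(2 * p) / n)    # Massart's finite class lemma
print(est, est_small, finite_class_bound)
```

The estimate stays below the finite class bound of order $\sqrt{\log p / n}$ and decreases as the sample size grows, which is the behavior a Bernstein inequality combined with a union bound over the coordinates delivers.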

By the Bernstein inequality and the union bound, we obtain the inequality (3.28),

(3.28) |

Moreover, applying Talagrand’s concentration inequality once again, we get (3.29) with the probability at least

(3.29) |

Moreover, by the result in Koltchinskii and Yuan (2010) that,

the inequality (3.30) holds with the probability at least ,

(3.30) |

where is some constant in Section 5 in Koltchinskii and Yuan (2010). Thus, combining (3.2), (3.30) with (3.26), we get (3.31) on an event of the probability at least ,

(3.31) |

where is some constant depending on
