1 Some questions regarding machine learning
Kernels, and in particular Mercer or reproducing kernels, play a crucial role in statistical learning theory and functional estimation. But very little is known about the associated hypothesis set, the underlying functional space where learning machines look for the solution. How to choose it? How to build it? What is its relationship with regularization? The machine learning community has been interested in tackling the problem the other way round. For a given learning task, and therefore for a given hypothesis set, is there a learning machine capable of learning it? The answer to such a question allows one to distinguish between learnable and non-learnable problems. The remaining question is: is there a learning machine capable of learning any learnable set?
We have known since [14] that learning is closely related to approximation theory, to generalized spline theory, to regularization and, beyond, to the notion of reproducing kernel Hilbert space (RKHS). This framework is based on the minimization of the empirical cost plus a stabilizer (i.e. a norm in some Hilbert space). Under these conditions, the solution to the learning task is a linear combination of some positive kernel whose shape depends on the nature of the stabilizer. This solution is characterized by strong and useful properties such as universal consistency.
But within this framework there remains a gap between theory and the practical solutions implemented by practitioners. For instance, in an RKHS, kernels are positive. Some practitioners use the hyperbolic tangent kernel although it is not a positive kernel: but it works. Another example is given by practitioners using a nonhilbertian framework. Advocates of sparsity use absolute values such as $\int |f| \, d\mu$ or $\sum_j |\alpha_j|$: these are $L^1$ norms. They are not hilbertian. Others escape the hilbertian approximation orthodoxy by introducing prior knowledge (i.e. a stabilizer) through information type criteria that are not norms.
This paper aims at revealing some underlying hypotheses of the learning task, extending the reproducing kernel Hilbert space framework. To do so we begin by reviewing some learning principles. We will stress that the hilbertian nature of the hypothesis set is not necessary while the reproducing property is. This leads us to define a non hilbertian framework for reproducing kernels allowing non positive kernels, nonhilbertian norms and other kinds of stabilizers.
The paper is organized as follows. The first point is to establish the three basic principles of learning. Based on these principles and before entering the nonhilbertian framework, it appears necessary to recall some basic elements of the theory of reproducing kernel Hilbert spaces and how to build them from non reproducing Hilbert spaces. Then the construction of nonhilbertian reproducing spaces is presented by replacing the dot (or inner) product by a more general duality map. This implies distinguishing between two different sets put in duality, one for hypotheses and the other one for measuring. In the hilbertian framework these two sets are merged in a single Hilbert space.
But before going into technical details we think it advisable to review the use of RKHS in the machine learning community.
2 RKHS perspective
2.1 Positive kernels
The interest of RKHS arises from its associated kernel. An RKHS is a set of functions entirely defined by a kernel function. A kernel may be characterized as a function from $X \times X$ to $\mathbb{R}$ (usually $X \subseteq \mathbb{R}^d$). Mercer [12] first established some remarkable properties of a particular class of kernels: positive kernels defining an integral operator. These kernels have to belong to some functional space (typically $L^2(X \times X)$, the set of square integrable functions on $X \times X$) so that the associated integral operator is compact. The positivity of kernel $K$ is defined as follows:
$$K(x,y) \text{ positive} \iff \forall f \in L^2, \ \langle \langle K, f \rangle_{L^2}, f \rangle_{L^2} \geq 0$$
where $\langle \cdot, \cdot \rangle_{L^2}$ denotes the dot product in $L^2$. Then, because it is compact, the kernel operator admits a countable spectrum and thus the kernel can be decomposed. Based on that, the work by Aronszajn [3] can be presented as follows. Instead of defining the kernel operator from $L^2$ to $L^2$, Aronszajn focuses on the RKHS $H$ endowed with its dot product $\langle \cdot, \cdot \rangle_H$. In this framework the kernel has to be a pointwise defined function. The positivity of kernel $K$ is then defined as follows:
$$K(x,y) \text{ positive} \iff \forall g \in H, \ \langle \langle K, g \rangle_H, g \rangle_H \geq 0 \qquad (1)$$
Aronszajn first establishes a bijection between kernels and RKHS. Then L. Schwartz [17] shows that this was a particular case of a more general situation. The kernel doesn't have to be a genuine function. He generalizes the notion of positive kernels to weakly continuous linear applications from the dual set $E'$ of a vector space $E$ to $E$ itself. To share interesting properties the kernel has to be positive in the following sense:
$$K \text{ positive} \iff \forall h \in E', \ (K(h), h)_{E,E'} \geq 0$$
where $(\cdot, \cdot)_{E,E'}$ denotes the duality product between $E$ and its dual set $E'$. The positivity is no longer defined in terms of scalar product. But there is still a bijection between positive Schwartz kernels and Hilbert spaces.
Of course this is only a short part of the story. For a detailed review on RKHS and a complete literature survey see [4, 15]. Moreover some authors consider nonpositive kernels. A generalization to Banach spaces has been introduced [5] within the framework of approximation theory. Nonpositive kernels have also been introduced in Kreĭn spaces as the difference between two positive ones ([2] and [17] section 12).
2.2 RKHS and learning in the literature
The first contribution of RKHS to statistical learning theory is the regression spline algorithm. For an overview of this method see Wahba's book [21]. In this book two important hypotheses regarding the application of the RKHS theory to statistics are stressed. These are the nature of pointwise defined functions and the continuity of the evaluation functional (these definitions are formally given in section 3.5, definition 3.1 and equation (3)). An important and general result in this framework is the so-called representer theorem [10]. This theorem states that the solution of some class of approximation problems is a linear combination of a kernel evaluated at the training points. But only applications in one or two dimensions are given. This is due to the fact that, in that work, the way to build RKHS was based on some derivative properties. For practical reasons only low dimension regressors were considered by this means.
Poggio and Girosi extended the framework to large input dimension by introducing radial functions through a regularization operator [14]. They show how to build such a kernel as the Green function of a differential operator defined by its Fourier transform.
Support vector machines (SVM) provide another important link between kernels, sparsity and bounds on the generalization error [20]. This algorithm is based on Mercer's theorem and on the relationship between kernel and dot product. It relies on the ability of a positive kernel to be separated and decomposed according to some generating functions. But to use Mercer's theorem the kernel has to define a compact operator. This is the case for instance when it belongs to the $L^2$ functions defined on a compact domain.
Links between Green functions, SVM and reproducing kernel Hilbert spaces were introduced in [9] and [18]. The link between RKHS and bounds on a compact learning domain has been presented in a mathematical way by Cucker and Smale [6].
Another important application of RKHS to learning machines comes from the Bayesian learning community. This is due to the fact that, in a probabilistic framework, a positive kernel is seen as a covariance function associated with a Gaussian process.
3 Three principles on the nature of the hypothesis set
3.1 The learning problem
A supervised learning problem is defined by a learning domain $X \subseteq \mathbb{R}^d$ where $d$ denotes the number of explicative variables, the learning codomain $Y \subseteq \mathbb{R}$ and an $n$ dimensional sample $\{(x_i, y_i), i = 1, \dots, n\}$: the training set.
The mainstream formulation of the learning problem considers the loading of a learning machine based on empirical data as the minimization of a given criterion with respect to some hypothesis $f$ lying in a hypothesis set $H$. In this framework hypotheses are functions from $X$ to $Y$ and the hypothesis space $H$ is a functional space.
Technically a convergence criterion is needed in $H$, i.e. $H$ has to be endowed with a topology. In the remainder, we will always assume $H$ to be a convex topological vector space.
Learning is also the minimization of some criterion. Very often the criterion to be minimized contains two terms. The first one, $C$, represents the fidelity of the hypothesis with respect to data while $\Omega$, the second one, represents the compression required to make a difference between memorizing and learning. Thus the learning machine solves the following minimization problem:
$$\min_{f \in H} \ \sum_{i=1}^{n} C(x_i, y_i, f(x_i)) + \Omega(f) \qquad (2)$$
The fact is, while writing this cost function, we implicitly assume that the value of function $f$ at any point $x_i$ is known. We will now discuss the important consequences this assumption has on the nature of the hypothesis space $H$.
3.2 The evaluation functional
By writing $f(x_i)$ we are assuming that function $f$ can be evaluated at this point. Furthermore if we want to be able to use our learning machine to make a prediction for a given input $x$, $f(x)$ has to exist for all $x \in X$: we want pointwise defined functions. This property is far from being shared by all functions. For instance the function $x \mapsto 1/x$ is not defined at 0. The Hilbert space $L^2$ of square integrable functions is a quotient space of functions defined only almost everywhere (i.e. not on the singletons $\{x\}$, $x \in X$). $L^2$ functions are not pointwise defined because the $L^2$ elements are equivalence classes.
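As a small illustration of the last point (an added worked example, with $X = [0,1]$ and the Lebesgue measure chosen only for concreteness), two functions that differ at a single point are the same element of $L^2$, so "the value at 0" of that element is not defined:

```latex
% Two functions on X = [0,1] that differ only at the point 0
f(x) = 0 \quad \forall x \in [0,1], \qquad
g(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{if } x \in (0,1] \end{cases}

% Their L^2 distance is zero, so they define the same equivalence class:
\| f - g \|_{L^2}^2 = \int_0^1 |f(x) - g(x)|^2 \, dx = 0

% yet their pointwise values at 0 disagree:
f(0) = 0 \neq 1 = g(0)
```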
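To formalize our point of view we need to define $\mathbb{R}^X$ as the set of all pointwise defined functions from $X$ to $\mathbb{R}$. For instance when $X = \mathbb{R}$ all finite polynomials (including constant functions) belong to $\mathbb{R}^X$. We can lay down our second principle: the hypothesis set $H$ is a set of pointwise defined functions, i.e. a subset of $\mathbb{R}^X$.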
Of course this is not enough to define a hypothesis set properly and at least another fundamental property is required.
3.3 Continuity of the evaluation functional
The pointwise evaluation of the hypothesis function is not enough. We also want the pointwise convergence of the hypothesis. If two functions are close in some sense we don't want them to disagree at any point. Assume $f$ is our unknown target function to be learned. For a given sample of size $n$ a learning algorithm provides a hypothesis $f_n$. Assume this hypothesis converges in some sense to the target hypothesis. Actually the reason for hypothesis $f_n$ is that it will be used to predict the value of $f$ at a given $x$. For any $x$ we want $f_n(x)$ to converge to $f(x)$ as follows:
$$f_n \xrightarrow{\;H\;} f \implies \forall x \in X, \ f_n(x) \xrightarrow{\;\mathbb{R}\;} f(x)$$
We are not interested in global convergence properties but in local convergence properties. Note that it may be rather dangerous to define a learning machine without this property. Usually the topology on $H$ is defined by a norm. Then the pointwise convergence can be restated as follows:
$$\forall x \in X, \ \exists M_x \in \mathbb{R}^+ \text{ such that } |f(x) - f_n(x)| \leq M_x \, \| f - f_n \|_H \qquad (3)$$
At any point $x$, the error can be controlled.
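It is interesting to restate this hypothesis with the evaluation functional.
Definition 3.1
The evaluation functional $\delta_x$ at point $x \in X$ is the map
$$\delta_x : H \to \mathbb{R}, \qquad f \mapsto \delta_x f = f(x)$$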
Applied to the evaluation functional our prerequisite of pointwise convergence is equivalent to its continuity.
Since the evaluation functional is linear and continuous, it belongs to the topological dual of $H$. We will see that this is the key point to get the reproducing property.
Note that the continuity of the evaluation functional does not necessarily imply uniform convergence. But in many practical cases it does. To get it, one additional hypothesis is needed: the constants $M_x$ have to be bounded, $\sup_{x \in X} M_x < \infty$. For instance this is the case when the learning domain is bounded. The difference between uniform convergence and evaluation functional continuity is a deep and important topic for learning machines but is out of the scope of this paper.
3.4 Important consequence
To build a learning machine we do need to choose our hypothesis set as a reproducing space to get the pointwise evaluation property and the continuity of this evaluation functional. But the Hilbertian structure is not necessary. Endowing a set of functions with the property of continuity of the evaluation functional has many interesting consequences. The most useful one in the field of learning machines is the existence of a kernel $K$, a two-variable function with the generation property (this property means that the set of all finite linear combinations of the kernel is dense in $H$; see proposition 4.1 for a more precise statement):
$$\forall f \in H, \ \exists (\alpha_i)_{i \in I} \text{ such that } f(x) \approx \sum_{i \in I} \alpha_i K(x, x_i)$$
$I$ being a finite set of indices. Note that for practical reasons $f$ may have a different representation.
If the evaluation set is also a Hilbert space (a vector space endowed with a dot product) it is a reproducing kernel Hilbert space (RKHS). Although not necessary, RKHS are widely used for learning because they have a lot of nice practical properties. Before moving on to more general reproducing sets, let's review the most important properties of RKHS for learning.
3.5 $\mathbb{R}^X$ the set of the pointwise defined functions on $X$
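In the following, the function space $\mathbb{R}^X$ of the pointwise defined functions will be seen as a topological vector space endowed with the topology of simple convergence.
$\mathbb{R}^X$ will be put in duality with $\mathbb{R}^{[X]}$, the set of all functions on $X$ equal to zero everywhere except on a finite subset $\{x_i, i \in I\}$ of $X$. Thus all functions belonging to $\mathbb{R}^{[X]}$ can be written in the following way:
$$g \in \mathbb{R}^{[X]} \iff \exists \{\alpha_i\}, i \in I, \text{ such that } g(x) = \sum_{i \in I} \alpha_i \mathbb{1}_{x_i}(x)$$
where the indicator function $\mathbb{1}_{x_i}(x)$ is null everywhere except at $x_i$ where it is equal to one.
Note that the indicator function is closely related to the evaluation functional since they are in bijection through:
$$\forall f \in \mathbb{R}^X, \ \forall x \in X, \quad \delta_x(f) = \sum_{y \in X} \mathbb{1}_x(y) f(y) = f(x)$$
But formally, $(\mathbb{R}^X)' = \mathrm{Span}\{\delta_x\}$ is a set of linear forms while $\mathbb{R}^{[X]}$ is a set of pointwise defined functions.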
4 Reproducing Kernel Hilbert Space (RKHS)
Definition 4.1 (Hilbert space)
A vector space $H$ endowed with the positive definite dot product $\langle \cdot, \cdot \rangle_H$ is a Hilbert space if it is complete for the induced norm $\| f \|_H^2 = \langle f, f \rangle_H$ (i.e. all Cauchy sequences converge in $H$).
For instance $\mathbb{R}^n$, $P_k$ the set of polynomials of order lower than or equal to $k$, $L^2$, and $\ell^2$ the set of square summable sequences seen as functions on $\mathbb{N}$ are Hilbert spaces. $L^1$ and the set of bounded functions $L^\infty$ are not.
Definition 4.2 (reproducing kernel Hilbert space (RKHS))
A Hilbert space $H$ is an RKHS if it is defined on $\mathbb{R}^X$ (pointwise defined functions) and if the evaluation functional is continuous on $H$ (see the definition of continuity in equation 3).
For instance $\mathbb{R}^n$ and $P_k$, as any finite dimensional set of genuine functions, are RKHS. $\ell^2$ is also an RKHS. The Cameron-Martin space defined in example 8.1.2 is an RKHS while $L^2$ is not because it is not a set of pointwise defined functions.
Definition 4.3 (positive kernel)
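A function $K$ from $X \times X$ to $\mathbb{R}$ is a positive kernel if it is symmetric and if for any finite subset $\{x_i\}, i = 1, \dots, n$ of $X$ and any sequence of scalars $\{\alpha_i\}, i = 1, \dots, n$:
$$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) \geq 0$$
This definition is equivalent to the Aronszajn definition of positive kernel given in equation (1).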
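The finite-sum condition of definition 4.3 can be probed numerically. The following is a minimal sketch, not part of the original text: the Gaussian kernel, the hyperbolic tangent kernel with a negative offset, and all function names are assumptions chosen for illustration. It evaluates the quadratic form for random coefficients with the Gaussian kernel, and exhibits a one-point counterexample for the hyperbolic tangent kernel mentioned in the introduction.

```python
import math
import random


def gaussian_kernel(x, y, width=1.0):
    """Gaussian kernel exp(-(x - y)^2 / (2 * width^2)) on the real line."""
    return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))


def tanh_kernel(x, y, scale=1.0, offset=-1.0):
    """Hyperbolic tangent kernel tanh(scale * x * y + offset) (illustrative parameters)."""
    return math.tanh(scale * x * y + offset)


def quadratic_form(kernel, points, alphas):
    """Return sum_i sum_j alpha_i * alpha_j * K(x_i, x_j)."""
    return sum(
        a_i * a_j * kernel(x_i, x_j)
        for x_i, a_i in zip(points, alphas)
        for x_j, a_j in zip(points, alphas)
    )


random.seed(0)
points = [random.uniform(-3.0, 3.0) for _ in range(8)]

# Gaussian kernel: the form stays non-negative for every coefficient draw
# (up to floating-point rounding).
gaussian_values = [
    quadratic_form(gaussian_kernel, points, [random.gauss(0.0, 1.0) for _ in points])
    for _ in range(200)
]
print("smallest Gaussian quadratic form:", min(gaussian_values))

# tanh kernel with a negative offset: the single point x = 0 with alpha = 1
# gives K(0, 0) = tanh(-1) < 0, so the condition of definition 4.3 fails.
print("tanh quadratic form at x = 0:", quadratic_form(tanh_kernel, [0.0], [1.0]))
```

Random draws can only fail to find a violation; they do not prove positivity. The single negative value for the hyperbolic tangent kernel, on the other hand, is enough to show that this particular parametrization is not a positive kernel.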
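Proposition 4.1 (bijection between RKHS and Kernel)
There is a bijection between the set of all possible RKHS and the set of all positive kernels.
Proof.
(i) From RKHS to kernel. Let $H$ be an RKHS. By hypothesis the evaluation functional $\delta_x$ is a continuous linear form so that it belongs to the topological dual of $H$. Thanks to the Riesz theorem we know that for each $x \in X$ there exists a function $K_x(\cdot)$ belonging to $H$ such that for any function $f(\cdot) \in H$:
$$\delta_x(f(\cdot)) = \langle K_x(\cdot), f(\cdot) \rangle_H$$
$(x, y) \mapsto K_x(y)$ is a function from $X \times X$ to $\mathbb{R}$ and thus can be written as a two variable function $K(x,y)$. This function is symmetric and positive since, for any real finite sequence $\{\alpha_i\}, i = 1, \dots, \ell$, $\sum_{i=1}^{\ell} \alpha_i K(\cdot, x_i) \in H$, we have:
$$\Big\| \sum_{i=1}^{\ell} \alpha_i K(\cdot, x_i) \Big\|_H^2 = \Big\langle \sum_{i=1}^{\ell} \alpha_i K(\cdot, x_i), \sum_{j=1}^{\ell} \alpha_j K(\cdot, x_j) \Big\rangle_H = \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j K(x_i, x_j) \geq 0$$
(ii) From kernel to RKHS. For any couple $f(\cdot), g(\cdot)$ of $\mathbb{R}^{[X]}$ (there exist two finite sequences $\{\alpha_i\}$ and $\{\beta_j\}$ and two sequences of points $\{x_i\}$, $\{y_j\}$ such that $f(x) = \sum_i \alpha_i \mathbb{1}_{x_i}(x)$ and $g(x) = \sum_j \beta_j \mathbb{1}_{y_j}(x)$) we define the following bilinear form:
$$\langle f(\cdot), g(\cdot) \rangle_{[X]} = \sum_{i} \sum_{j} \alpha_i \beta_j K(x_i, y_j)$$
Let $H_0 = \{ f \in \mathbb{R}^{[X]} ; \ \langle f, f \rangle_{[X]} = 0 \}$. $\langle \cdot, \cdot \rangle_{[X]}$ defines a dot product on the quotient set $\mathbb{R}^{[X]} / H_0$. Now let's define $H$ as the completion of $\mathbb{R}^{[X]} / H_0$ for the corresponding norm. $H$ is an RKHS with kernel $K$ by construction.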
Proposition 4.2 (from basis to Kernel)
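Let $H$ be an RKHS. Its kernel $K$ can be written:
$$K(x,y) = \sum_{i \in I} e_i(x) e_i(y)$$
for all orthonormal bases $\{e_i\}_{i \in I}$ of $H$, $I$ being a set of indices possibly infinite and non-countable.
Proof. $K(x, \cdot) \in H$ implies there exists a real sequence $\{\alpha_i\}_{i \in I}$ such that $K(x, \cdot) = \sum_{i \in I} \alpha_i e_i(\cdot)$. Then for every element $e_i$ of the orthonormal basis:
$$\langle K(x, \cdot), e_i(\cdot) \rangle_H = e_i(x) \ \text{(reproducing property)}, \qquad \langle K(x, \cdot), e_i(\cdot) \rangle_H = \alpha_i \ \text{(orthonormality)}$$
By identification we have $\alpha_i = e_i(x)$.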
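A two-dimensional case makes the expansion concrete. The example below is an added illustration: the space of affine functions on $\mathbb{R}$ and the coefficient-wise dot product are assumptions chosen so that the basis $\{1, x\}$ is orthonormal.

```latex
% Hypothesis space: affine functions f(x) = a_0 + a_1 x, g(x) = b_0 + b_1 x,
% with the (assumed) dot product on coefficients
\langle f, g \rangle_H = a_0 b_0 + a_1 b_1
% so that e_0(x) = 1 and e_1(x) = x form an orthonormal basis.

% Kernel obtained from the basis expansion:
K(x, y) = e_0(x) e_0(y) + e_1(x) e_1(y) = 1 + x y

% Reproducing property: K(x, \cdot) has coefficients (1, x), hence
\langle f(\cdot), K(x, \cdot) \rangle_H = a_0 \cdot 1 + a_1 \cdot x = f(x)
```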
Remark 4.1
Thanks to this result it is also possible to associate with any positive kernel a basis, possibly uncountable. As a consequence of proposition 4.1 we know how to associate an RKHS with any positive kernel, and we get the result because every Hilbert space admits an orthonormal basis.
The fact that the basis is countable or uncountable (that the corresponding RKHS is separable or not) has no consequence on the nature of the hypothesis set (see example 8.1.7). Thus Mercer kernels are a particular case of a more general situation since every Mercer kernel is positive in the Aronszajn sense (definition 4.3) while the converse is false. Consequently, when possible, a functional formulation is preferable to a kernel formulation of the learning algorithm.
5 Kernel and kernel operator
5.1 How to build RKHS?
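It is possible to build RKHS from an $L^2(G, \mu)$ Hilbert space where $G$ is a set (usually $G = X$) and $\mu$ a measure. To do so, an operator $S$ is defined to map $L^2$ functions onto the set of the pointwise valued functions $\mathbb{R}^X$. A general way to define such an operator consists in remarking that the scalar product performs such a linear mapping. Based on that remark this operator is built from a family $\Gamma_x$ of $L^2(G, \mu)$ functions, $x \in X$, in the following way:
Definition 5.1 (Carleman operator)
Let $\Gamma = \{\Gamma_x, x \in X\}$ be a family of $L^2(G, \mu)$ functions. The associated Carleman operator $S$ is
$$S : L^2 \to \mathbb{R}^X, \qquad f \mapsto g(\cdot) = (Sf)(\cdot) = \langle \Gamma_{(\cdot)}, f \rangle_{L^2} = \int_G \Gamma_{(\cdot)} f \, d\mu$$
That is to say $\forall x \in X$, $g(x) = \langle \Gamma_x, f \rangle_{L^2}$. To make apparent the bijective restriction of $S$ it is convenient to factorize it as follows:
$$S : L^2 \longrightarrow L^2 / \mathrm{Ker}(S) \xrightarrow{\;T\;} \mathrm{Im}(S) \xrightarrow{\;i\;} \mathbb{R}^X \qquad (4)$$
where $L^2 / \mathrm{Ker}(S)$ is the quotient set, $T$ the bijective restriction of $S$ and $i$ the canonical injection.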
This class of integral operators is known as Carleman operators [19]. Note that this operator, unlike Hilbert-Schmidt operators, need be neither compact nor bounded. But when $G$ is a compact set or when $\Gamma_x \in L^2(G \times G)$ (it is a square integrable function with respect to both of its variables) $S$ is a Hilbert-Schmidt operator. As an illustration of this property, see the Gaussian example on $G = X = \mathbb{R}$ in table 1. In that case $\Gamma_x(\tau) \notin L^2(X \times X)$. (To clarify the not so obvious notion of pointwise defined function, whenever possible, we use the notation $f$ when the function is not a pointwise defined function and $f(\cdot)$ denotes $\mathbb{R}^X$ functions. Here $\Gamma_x(\tau)$ is a pointwise defined function with respect to variable $x$ but not with respect to variable $\tau$. Thus, whenever possible, the confusing notation $\tau$ is omitted.)
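Proposition 5.1 (bijection between Carleman operators and the set of RKHS)
Let $S$ be a Carleman operator. Its image set $H = \mathrm{Im}(S)$ is an RKHS. Conversely, if $H$ is an RKHS there exists a measure $\mu$ on some set $G$ and a Carleman operator $S$ on $L^2(G, \mu)$ such that $H = \mathrm{Im}(S)$.
Proof.
(i) Consider $T$, the bijective restriction of $S$ defined in equation (4). $H = \mathrm{Im}(S)$ can be endowed with the induced dot product defined as follows:
$$\langle g_1(\cdot), g_2(\cdot) \rangle_H = \langle T^{-1} g_1, T^{-1} g_2 \rangle_{L^2} = \langle f_1, f_2 \rangle_{L^2} \quad \text{where } g_1(\cdot) = T f_1 \text{ and } g_2(\cdot) = T f_2$$
With respect to the induced norm, $T$ is an isometry. To prove $H$ is an RKHS, we have to check the continuity of the evaluation functional. This works as follows:
$$|g(x)| = |(Tf)(x)| = |\langle \Gamma_x, f \rangle_{L^2}| \leq \| \Gamma_x \|_{L^2} \| f \|_{L^2} = M_x \| g(\cdot) \|_H$$
with $M_x = \| \Gamma_x \|_{L^2}$. In this framework the reproducing kernel $K$ of $H$ verifies $S \Gamma_x = K(x, \cdot)$. It can be built based on $\Gamma$:
$$K(x,y) = \langle K(x, \cdot), K(y, \cdot) \rangle_H = \langle \Gamma_x, \Gamma_y \rangle_{L^2}$$
(ii) Let $\{e_i\}, i \in I$ be an $L^2(G, \mu)$ orthonormal basis and $\{h_j(\cdot)\}, j \in J$ an orthonormal basis of $H$. We admit there exists a couple $(G, \mu)$ such that $\mathrm{card}(I) \geq \mathrm{card}(J)$ (take for instance the counting measure on the suitable set). Define $\Gamma_x = \sum_{j \in J} h_j(x) e_j$ as an $L^2$ family. Let $S$ be the associated Carleman operator. The image of this Carleman operator is the RKHS spanned by $\{h_j(\cdot)\}$ since:
$$\forall f \in L^2, \ f = \sum_{i \in I} \alpha_i e_i, \qquad (Sf)(x) = \langle \Gamma_x, f \rangle_{L^2} = \sum_{j \in J} \alpha_j h_j(x)$$
and the family $\{h_j(\cdot)\}$ is orthonormal since $h_j(\cdot) = S e_j$.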
To put this framework to work the relevant function $\Gamma_x$ has to be found. Some examples with popular kernels illustrating this definition are shown in table 1.
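Table 1: examples of Carleman operators and their associated reproducing kernels. Rows: Cameron Martin, Polynomial, Gaussian; columns: name, generating family $\Gamma_x$, reproducing kernel $K(x,y)$.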
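The relation between a generating family and its kernel, $K(x,y) = \langle \Gamma_x, \Gamma_y \rangle_{L^2}$, can be checked numerically. The sketch below is an added illustration under stated assumptions: the Cameron-Martin family is taken as the indicator of $[0, x]$ on $[0,1]$ (consistent with the kernel $\min(x,y)$ quoted later in the text), the Gaussian family is taken unnormalized, and the midpoint integration rule and all function names are choices made here, not the paper's.

```python
import math


def inner_product_l2(gamma_x, gamma_y, lower, upper, steps=20000):
    """Midpoint-rule approximation of the L2 inner product on [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        u = lower + (k + 0.5) * width
        total += gamma_x(u) * gamma_y(u)
    return total * width


def cameron_martin_gamma(x):
    """Assumed family: indicator of [0, x], i.e. Gamma_x(u) = 1 if u <= x else 0."""
    return lambda u: 1.0 if u <= x else 0.0


def gaussian_gamma(x):
    """Assumed family: unnormalized Gaussian bump exp(-(x - u)^2 / 2) centred at x."""
    return lambda u: math.exp(-0.5 * (x - u) ** 2)


x, y = 0.3, 0.7

# Cameron-Martin case: the overlap of [0, x] and [0, y] has length min(x, y).
k_cm = inner_product_l2(cameron_martin_gamma(x), cameron_martin_gamma(y), 0.0, 1.0)
print(k_cm, min(x, y))

# Gaussian case: completing the square gives sqrt(pi) * exp(-(x - y)^2 / 4).
k_gauss = inner_product_l2(gaussian_gamma(x), gaussian_gamma(y), -12.0, 12.0)
print(k_gauss, math.sqrt(math.pi) * math.exp(-0.25 * (x - y) ** 2))
```

In both cases the two printed numbers should agree to within the integration error. The Gaussian integral is truncated to $[-12, 12]$, which is harmless here because the integrand is negligible outside that interval for the chosen points.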
5.2 Carleman operator and the regularization operator
The same kind of operator has been introduced by Poggio and Girosi in the regularization framework [14]. They proposed to define the regularization term $\Omega(f)$ (defined in equation 2) by introducing a regularization operator $P$ from the hypothesis set $H$ to $L^2$ such that $\Omega(f) = \| P f \|_{L^2}^2$. This framework is very attractive since operator $P$ models the prior knowledge about the solution, defining its regularity in terms of derivative or Fourier decomposition properties. Furthermore the authors show that, in their framework, the solution of the learning problem is a linear combination of a kernel (a representer theorem). They also give a methodology to build this kernel as the Green function of a differential operator. Following [3] in its introduction, the link between Green function and RKHS is straightforward when the Green function is a positive kernel. But a problem arises when operator $P$ is chosen as a derivative operator and the resulting kernel is not derivable (for instance when $P$ is the simple derivation, the associated kernel is the non-derivable function $\min(x,y)$). A way to overcome this technical difficulty is to consider things the other way round by defining the regularization term as the norm of the function in the RKHS built from the Carleman operator $T$. In this case we have $\Omega(f) = \| f \|_H^2 = \| T^{-1} f \|_{L^2}^2$. Thus since $T$ is bijective we can define operator $P$ as $P = T^{-1}$. This is no longer a derivative operator but a generalized derivative operator where the derivation is defined as the inverse of the integration ($P$ is defined as $T^{-1}$).
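A worked instance of this inversion may help. It is an added illustration that assumes the Cameron-Martin setting on $[0,1]$ with the indicator family; the symbols $T$, $P$ and $\Omega$ are used here as names for the Carleman operator, the regularization operator and the stabilizer.

```latex
% Carleman operator built from the indicator family \Gamma_x(u) = 1_{\{u \le x\}} on [0,1]:
f(x) = (T g)(x) = \langle \Gamma_x, g \rangle_{L^2} = \int_0^x g(u) \, du

% Hence f(0) = 0 and f' = g almost everywhere: T is an integration
% and its inverse P = T^{-1} acts as a (generalized) derivative.
P f = T^{-1} f = f'

% The stabilizer is then the squared L^2 norm of the derivative:
\Omega(f) = \| f \|_H^2 = \| T^{-1} f \|_{L^2}^2 = \int_0^1 f'(u)^2 \, du
```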
5.3 Generalization
It is important to notice that the above framework can be generalized to non Hilbert spaces. A way to see this is to use Kolmogorov’s dilation theorem [8]. Furthermore, the notion of reproducing kernel itself can be generalized to nonpointwise defined function by emphasizing the role played by continuity through positive generalized kernels called Schwartz or hilbertian kernels [17]. But this is out of the scope of our work.
6 Reproducing kernel spaces (RKS)
By focusing on the relevant hypothesis for learning we are going to generalize the above framework to nonhilbertian spaces.
6.1 Evaluation spaces
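Definition 6.1 (ES)
Let $H$ be a real topological vector space (t.v.s.) on an arbitrary set $X$, $H \subset \mathbb{R}^X$. $H$ is an evaluation space if and only if:
$$\forall x \in X, \quad \delta_x : H \to \mathbb{R}, \ f \mapsto \delta_x(f) = f(x) \ \text{ is continuous}$$
ES are then topological vector spaces in which $\delta_x$ (the evaluation functional at $x$) is continuous, i.e. belongs to the topological dual $H'$ of $H$.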
Remark 6.1
The topological vector space $\mathbb{R}^X$ with the topology of simple convergence is by construction an ETS (evaluation topological space).
In the case of normed vector space, another characterization can be given:
Proposition 6.1 (normed ES or BES)
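Let $(H, \| \cdot \|_H)$ be a real normed vector space on an arbitrary set $X$, $H \subset \mathbb{R}^X$. $H$ is an evaluation kernel space if and only if the evaluation functional verifies:
$$\forall x \in X, \ \exists M_x \in \mathbb{R}, \ \forall f \in H, \quad |f(x)| \leq M_x \| f \|_H$$
If it is complete for the corresponding norm it is a Banach evaluation space (BES).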
Remark 6.2
In the case of a Hilbert space, we can identify $H$ and $H'$ and, thanks to the Riesz theorem, the evaluation functional can be seen as a function belonging to $H$: it is called the reproducing kernel.
This is an important point: thanks to the Hilbertian structure the evaluation functional can be seen as a hypothesis function and therefore the solution of the learning problem can be built as a linear combination of this reproducing kernel taken at different points. The representer theorem [10] demonstrates this property when the learning machine minimizes a regularized quadratic error criterion. We shall now generalize these properties to the case when no hilbertian structure is available.
6.2 Reproducing kernels
The key point when using a Hilbert space is the dot product. When no such bilinear positive functional is available its role can be played by a duality map. Without a dot product, the hypothesis set $H$ is no longer in self duality. We need another set $M$ to put in duality with $H$. This second set $M$ is a set of functions measuring how the information available at one point helps to measure the quality of the hypothesis at another point. These two sets have to be in relation through a specific bilinear form. This relation is called a duality.
Definition 6.2 (Duality between two sets)
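Two sets $(H, M)$ are in duality if there exists a bilinear form $L$ on $H \times M$ that separates $H$ and $M$ (see [11] for details on the topological aspect of this definition).
Let $L$ be such a bilinear form on $H \times M$ that separates them. Then we can define a linear application $\gamma_H$ and its reciprocal $\theta_H$ as follows:
$$\gamma_H : M \to H', \ g \mapsto \gamma_H g = L(\cdot, g), \qquad \theta_H : \mathrm{Im}(\gamma_H) \to M, \ L(\cdot, g) \mapsto \theta_H L(\cdot, g) = g$$
where $H'$ (resp. $M'$) denotes the dual set of $H$ (resp. $M$).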
Let’s take an important example of such a duality.
Proposition 6.2 (duality of pointwise defined functions)
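Let $X$ be any set (not necessarily compact). $\mathbb{R}^X$ and $\mathbb{R}^{[X]}$ are in duality.
Proof. Let's define the bilinear application $L$ as follows:
$$L : \mathbb{R}^X \times \mathbb{R}^{[X]} \to \mathbb{R}, \qquad \Big( f(\cdot), \ g(\cdot) = \sum_{i \in I} \alpha_i \mathbb{1}_{x_i}(\cdot) \Big) \mapsto \sum_{i \in I} \alpha_i f(x_i)$$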
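Another example is shown in the two following functional spaces:
$$L^1 = \Big\{ f \ : \ \int_X |f| \, d\mu < \infty \Big\} \quad \text{and} \quad L^\infty = \Big\{ f \ : \ \operatorname*{ess\,sup}_{x \in X} |f| < \infty \Big\}$$
where for instance $\mu$ denotes the Lebesgue measure. These two spaces are put in duality through the following duality map:
$$L : L^1 \times L^\infty \to \mathbb{R}, \qquad (f, g) \mapsto L(f, g) = \int_X f g \, d\mu$$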
Definition 6.3 (Evaluation subduality)
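Two sets $H$ and $M$ form an evaluation subduality iff:
(i) they are in duality through their duality map $L$,
(ii) they both are subsets of $\mathbb{R}^X$,
(iii) the continuity of the evaluation functional is preserved: for every $x \in X$, $\delta_x$ is continuous on $H$ for the weak topology induced by $M$, and on $M$ for the weak topology induced by $H$.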
The key point is the way of preserving the continuity. Here the strategy to do so is first to consider two sets in duality and then to build the (weak) topology such that the dual elements are (weakly) continuous.
[Figure 1: diagrams of the subduality kernel $\varkappa$ from $(\mathbb{R}^X)'$ to $\mathbb{R}^X$, in the Hilbertian case (left) and in the general case (right).]
Proposition 6.3 (Subduality kernel)
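A unique weakly continuous linear application $\varkappa$ is associated with each subduality. This linear application, called the subduality kernel, is defined as follows:
$$\varkappa : (\mathbb{R}^X)' \to \mathbb{R}^X, \qquad \varkappa = i \circ \theta_M \circ j^*$$
where $i$ and $j$ are the canonical injections from $H$ to $\mathbb{R}^X$ and respectively from $M$ to $\mathbb{R}^X$ (figure 1).
Proof. For details see [11].
We can illustrate this mapping by detailing all performed applications as in figure 1:
$$(\mathbb{R}^X)' \xrightarrow{\;j^*\;} M' \xrightarrow{\;\theta_M\;} H \xrightarrow{\;i\;} \mathbb{R}^X$$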
Definition 6.4 (Reproducing kernel of an evaluation subduality)
Let $(H, M)$ be an evaluation subduality with respect to map $L$ associated with subduality kernel $\varkappa$. The reproducing kernel associated with this evaluation subduality is the function of two variables defined as follows:
$$K : X \times X \to \mathbb{R}, \qquad (x, y) \mapsto K(x,y) = \big( \varkappa(\delta_y) \big)(x)$$
This structure is illustrated in figure 1. Note that this kernel no longer needs to be positive definite. If the kernel is positive definite it is associated with a unique RKHS. However, as shown in example 8.2.1, it can also be associated with evaluation subdualities. A way of looking at things is to define $\varkappa$ as the generalization of the Schwartz kernel while $K$ is the generalization of the Aronszajn kernel to non hilbertian structures. Based on these definitions the important expression property is preserved.
Proposition 6.4 (generation property)
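$$\forall f \in H, \ \exists (\alpha_i)_{i \in I} \text{ such that } f(x) \approx \sum_{i \in I} \alpha_i K(x, x_i)$$
and
$$\forall g \in M, \ \exists (\alpha_i)_{i \in I} \text{ such that } g(x) \approx \sum_{i \in I} \alpha_i K(x_i, x)$$
Proof. This property is due to the density of $\mathrm{Span}\{K(\cdot, x), x \in X\}$ in $H$. For more details see [11] Lemma 4.3.
Just like RKHS, another important point is the possibility to build an evaluation subduality, and of course its kernel, starting from any duality.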
Proposition 6.5 (building evaluation subdualities)
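Let $(A, B)$ be a duality with respect to map $L$. Let $\{\Gamma_x, x \in X\}$ be a total family in $A$ and $\{\Lambda_x, x \in X\}$ be a total family in $B$. Let $S$ (resp. $T$) be the linear mapping from $A$ (resp. $B$) to $\mathbb{R}^X$ associated with $\Lambda_x$ (resp. $\Gamma_x$) as follows:
$$S : A \to \mathbb{R}^X, \ g \mapsto (Sg)(x) = L(g, \Lambda_x), \qquad T : B \to \mathbb{R}^X, \ f \mapsto (Tf)(x) = L(\Gamma_x, f)$$
Then $S$ and $T$ are injective and $(S(A), T(B))$ is an evaluation subduality with the reproducing kernel $K$ defined by:
$$K(x,y) = L(\Gamma_y, \Lambda_x)$$
Proof. See [11] Lemma 4.5 and proposition 4.6.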
[Figure: diagrams of a duality $(A, B)$ (left) and of the evaluation subduality $(H, M)$ built from it, with its subduality kernel $\varkappa$ (right).]
An example of such a subduality is obtained by mapping a duality into $\mathbb{R}^X$ using the injective operators defined by two total families, as in proposition 6.5. See example 8.2.1 for details.
All useful properties of RKHS (pointwise evaluation, continuity of the evaluation functional, representation and building technique) are preserved. A missing dot product has no consequence on this functional aspect of the learning problem.
7 Representer theorem
Another issue is of paramount practical importance: determining the shape of the solution. To this end the representer theorem states that, when $H$ is an RKHS, the solution of the minimization of the regularized cost defined in equation (2) is a linear combination of the reproducing kernel evaluated at the training examples [10, 16]. When the hypothesis set $H$ is a reproducing space associated with a subduality we have the same kind of result. The solution lies in a finite dimensional subspace of $H$. But we don't know yet how to systematically build a convenient generating family in this subspace.
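In the hilbertian case the theorem turns the functional problem into a finite linear system. The following is a minimal sketch under assumptions that are not in the original text: a quadratic fidelity term, the stabilizer taken as `ridge` times the squared RKHS norm, a Gaussian kernel, toy data, and hypothetical helper names. Under these assumptions the coefficients of the kernel expansion solve $(K + \lambda I)\alpha = y$, where $K$ is the Gram matrix of the training points.

```python
import math


def gaussian_kernel(x, y):
    """Gaussian kernel exp(-(x - y)^2 / 2), chosen here for illustration."""
    return math.exp(-0.5 * (x - y) ** 2)


def solve_linear_system(matrix, rhs):
    """Solve matrix @ a = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        residual = aug[r][n] - sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = residual / aug[r][r]
    return solution


def fit_kernel_ridge(xs, ys, kernel, ridge):
    """Return alpha solving (Gram + ridge * I) alpha = ys."""
    gram = [[kernel(a, b) for b in xs] for a in xs]
    for i in range(len(xs)):
        gram[i][i] += ridge
    return solve_linear_system(gram, ys)


def predict(x, xs, alphas, kernel):
    """Evaluate f(x) = sum_i alpha_i * K(x_i, x), the form given by the theorem."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, xs))


train_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
train_y = [math.sin(x) for x in train_x]
alphas = fit_kernel_ridge(train_x, train_y, gaussian_kernel, ridge=1e-3)

for x, y in zip(train_x, train_y):
    print(x, round(y, 3), round(predict(x, train_x, alphas, gaussian_kernel), 3))
```

The learned function is fully described by the $n$ coefficients `alphas`, one per training point, which is the finite dimensional statement of the theorem. Nothing in this sketch covers the non hilbertian case discussed next.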
Theorem 7.1 (representer)
Assume $(H, M)$ is a subduality of $\mathbb{R}^X$ with kernel $K$. Assume the stabilizer $\Omega$ is convex and differentiable ($\partial \Omega$ denotes its subdifferential set).
If then the solution of cost minimization lies in a dimensional subspace of .
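Proof. Define a subset $M_1 \subset M$ as the span of the $n$ kernel functions that represent, through the duality map, the evaluation functionals at the training points $x_i$. Let $H_2 \subset H$ be the $M_1$ orthogonal in the sense of the duality map (i.e. $\forall f \in H_2$, $\forall g \in M_1$, $L(f, g) = 0$). Then for all $f \in H_2$, $f(x_i) = 0$, $i = 1, \dots, n$. Now let $H_1$ be the complement vector space defined such that
$$H = H_1 \oplus H_2 \iff \forall f \in H, \ \exists f_1 \in H_1, f_2 \in H_2 \text{ such that } f = f_1 + f_2$$
The solution of the minimizing problem lies in $H_1$ since:
(i) the fidelity term evaluated at $f_1 + f_2$ is constant with respect to $f_2 \in H_2$, because $f_2(x_i) = 0$ at every training point,
(ii) $\Omega(f_1 + f_2) \geq \Omega(f_1) + L(f_2, g)$ for $g \in \partial \Omega(f_1)$ (thanks to the convexity of $\Omega$),
(iii) and by hypothesis the last term in (ii) vanishes for all $f_2 \in H_2$.
By construction $H_1$ is an $n$ dimensional subspace of $H$.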
The nature of this finite dimensional vector space depends on kernel $K$ and on regularizer $\Omega$. In some cases it is possible to be more precise and retrieve its nature. Let's assume regularizer $\Omega$ is given. $H$ may be chosen as the set of functions such that $\Omega(f) < \infty$. Then, if it is possible to build a subduality $(H, M)$ with kernel $K$ such that
and if the vector space spanned by the kernel belongs to the regularizer subdifferential:
then the solution of the minimization of the regularized empirical cost is a linear combination of the kernel:
An example of such result is given with the following regularizer based on the norm on :
The hypothesis set is Sobolev space (the set of functions defined on whose generalized derivative is integrable) put in duality with (with ) through the following duality map:
The associated kernel is, just like in the Cameron Martin case, $K(x,y) = \min(x,y)$. Some tedious derivations lead to:
Thus the kernel verifies
This question of the representer theorem is far from being closed. We are still looking for a way to derive a generating family from the kernel and the regularizer. To go more deeply into general and constructive results, a possible line of investigation is to go through the Fenchel dual.
8 Examples
8.1 Examples in Hilbert space
The examples in this section all deal with r.k.h.s included in an $L^2$ space.

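Schmidt ellipsoid:
Let $(X, \mu)$ be a measure space and $\{e_i, i \in I\}$ an orthonormal basis of $L^2(X, \mu)$, $I$ being a countable set of indices. Any sequence $\{\alpha_i, i \in I\}$ with $\sum_{i \in I} \alpha_i^2 < +\infty$ defines a Hilbert-Schmidt operator on $L^2(X, \mu)$ with kernel function $\Gamma(x,y) = \sum_{i \in I} \alpha_i e_i(x) e_i(y)$, thus a reproducing kernel Hilbert space with kernel function:
$$K(x,y) = \sum_{i \in I} \alpha_i^2 e_i(x) e_i(y)$$
The closed unit ball $B_H$ of the r.k.h.s. verifies
$$B_H = T(B_{L^2}) = \Big\{ f \in L^2, \ f = \sum_{i \in I} f_i e_i \ : \ \sum_{i \in I} \Big( \frac{f_i}{\alpha_i} \Big)^2 \leq 1 \Big\}$$
and is then a Schmidt ellipsoid in $L^2$. An interesting discussion about Schmidt ellipsoids and their applications to sample continuity of Gaussian measures may be found in [7].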

Cameron-Martin space:
Let $T$ be the Carleman integral operator on $L^2([0,1], \mu)$ ($\mu$ is the Lebesgue measure) with kernel function
$$\Gamma(x,y) = \mathbb{1}_{\{y \leq x\}}$$
It defines a r.k.h.s with reproducing kernel $K(x,y) = \min(x,y)$. The space $H$ is the Sobolev space of degree 1, also called the Cameron-Martin space.