1 Introduction
In recent years it has become increasingly easy to collect large amounts of information, especially with respect to the number of explanatory variables or ‘features’. However, the additional information provided by each of these features may not be significant for explaining the phenomenon at hand. Learning the functional connection between the explanatory variables and the response from such high-dimensional data can itself be quite challenging. Moreover, some of these explanatory variables or features may contain redundant or noisy information, and this may hamper the quality of learning. One way to overcome this problem is to use variable selection (also referred to as feature elimination) techniques to find a smaller set of variables that is able to perform the learning task sufficiently well.
In this work we discuss feature elimination in empirical risk minimization and support vector machines, focusing mainly on the latter. The popularity of support vector machines (SVM) as a set of supervised learning algorithms is motivated by the fact that SVM learning methods are easy-to-compute techniques that enable estimation under weak or no assumptions on the distribution (see Steinwart and Christmann, 2008). SVM learning methods, which we review in detail in Section 2, are a collection of algorithms that attempt to minimize a regularized version of the empirical risk over some reproducing kernel Hilbert space (RKHS) with respect to some loss function. The standard SVM decision function typically utilizes all the input variables. Hence, when the input dimension is large, it can suffer from the so-called ‘Curse of Dimensionality’ (Hastie et al., 2001). A procedure for variable selection is thus of importance to obtain a more intelligible solution with improved efficiency. The advantages of variable selection are manifold: it increases the generalization performance of the learning, it clarifies the causal relationship in the input-output space, and it results in reduced cost of data collection and storage and better computational properties.
One of the earliest works on variable selection in SVM was formulated by Guyon et al. (2002). Guyon et al. developed a backward elimination procedure based on recursive computation of the SVM learning function, known widely as recursive feature elimination (RFE). The RFE algorithm performs a recursive ranking of a given set of features. At each recursive step of the algorithm, it calculates the change in the RKHS norm of the estimated SVM function after deletion of each of the features remaining in the model, and removes the one with the lowest change in this norm. The process thus performs an implicit ranking of the features and can even be generalized to remove chunks of features at each step of the recursion. A number of modified approaches inspired by RFE have been developed since then (see Rakotomamonjy, 2003; Aksu et al., 2010; Aksu, 2012). Although there is no dearth of literature on RFE for SVMs, its theoretical properties have never been studied. The arguments for RFE have mostly been heuristic, resting on its ability to produce successful data-driven performance in simulated or real-life settings. A key reason behind this lack of theory is the absence of a well-established framework for building, justifying, and collating the theoretical foundation of such a feature elimination method. This paper aims at building such a framework and validating RFE as a theoretically sound procedure for feature elimination in SVMs.
Developing a theoretical structure for RFE is challenging. At each stage of the feature elimination process, we move down to a ‘lower dimensional’ feature space, and the functional spaces need to be adjusted to cater to the appropriate version of the problem in these subspaces. Euclidean spaces, for example, as well as many specialized functional classes, admit a nested structure in this regard, but as we will see later, this is not true in general. As mentioned before, SVM attempts to minimize the empirical regularized risk within an RKHS of functions. Starting with a given RKHS, one daunting task is redefining the functional space so that it retains the premises of the original space (i.e., admits the reproducing structure) and so that these spaces remain cognate to one another. The basis for the theory on RFE depends heavily on correctly specifying these pseudo-subspaces, and a contribution of this paper is to formulate a way to do this.
Another contribution of this paper is a modification of the criterion for deletion and ranking of features in Guyon et al.’s RFE to enable theoretical consistency. Here we develop a ranking of the features based on the lowest difference observed in the regularized empirical risk after removing each feature from the existing model. The definition of RFE used here can thus be generalized to the much broader yet simpler setting of empirical risk minimization, where we can apply the same idea to the empirical risk. This can serve as a useful starting point for a more in-depth theoretical analysis of feature elimination in SVM. While Guyon et al.’s RFE relies on the penalization criterion in the SVM objective function for ranking features, our approach is risk-based, in that we utilize the entire objective function for ranking. The heuristic reasoning behind this is that if a feature does not contribute to the model at all, the increase in the regularized risk after removing it will be inconsequential.
In this paper, we show that the modified RFE is asymptotically consistent in finding the ‘correct’ feature space both for SVMs and empirical risk minimization (ERM) under reasonable regularity conditions. Although these regularity conditions hold for most of the relevant problems at hand, we show through appropriate examples that consistency results for RFE might fail in general, and that these regularity conditions are needed for RFE to be a consistent tool for feature elimination in SVMs. The notion of consistency in such a context has not been defined previously. This paper also aims at positing a basis on which such results are meaningful. A comprehensive statistical analysis of SVMs can be found in Steinwart and Christmann (2008) (hereafter abbreviated SC08), which is used in this paper to develop the concept of consistency for RFE in the context of feature elimination in SVM and ERM. We give an in-depth analysis of a few case studies, including the setting of risk minimization in linear models and SVM for classification with a Gaussian RBF kernel, to show how the results developed here can be applied to specific examples. We also provide some simulation results to validate our theoretical conclusions and discuss how to utilize the proposed deletion criteria to select the important features in a given setting.
While RFE is a popular and simple method for variable selection, several other methods exist in the context of feature elimination in SVMs. RFE is a classic example of a wrapper that uses the learning method itself to score feature subsets. Alternative wrapper-based selection methods have also been formulated for feature elimination in SVMs (Weston et al., 2001; Chapelle et al., 2002). Other basic types of variable selection techniques include filters, which select subsets of the feature space as a preprocessing step, and embedded methods, which construct the learning algorithm so that feature elimination is built in. Filters have been used for feature elimination in SVMs in many previous works (see, for example, Mladenic et al., 2004; Peng et al., 2005). Embedded variable selection methods include redefining the SVM training to include sparsity (Weston et al., 2003; Chan et al., 2007). For example, Bradley and Mangasarian (1998) suggested the use of the $L_1$ penalty to encourage feature sparsity. Zhu et al. (2003) suggested an algorithm to compute the solution path for this $1$-norm SVM efficiently. Other methods include introducing different penalty functions like the SCAD penalty (Zhang et al., 2006), the $L_q$ penalty (Liu et al., 2007), a combination of the $L_0$ and $L_1$ penalties (Liu and Wu, 2007), the elastic net (Wang et al., 2006), the $F_\infty$ norm (Zou and Yuan, 2006), and using a penalty functional in the framework of the smoothing spline ANOVA (Zhang, 2006). Although these alternative methods appear to perform well in practice, RFE remains the most widely used methodology for feature selection in support vector machines due to its simplicity and generality.
In Section 2, we give a short preliminary background for empirical risk minimization and support vector machines. In Section 3 we present the proposed version of RFE for ERM and SVM. In Section 4 we discuss the concept of feature elimination in these frameworks. In Section 5 we give the necessary assumptions for RFE in empirical risk minimization and support vector machines and provide a short discussion on the meaningfulness of these assumptions in varied situations. The associated consistency results for RFE are given in Section 6. In Section 7 we discuss our results in some known settings of ERM and SVM. In Section 8 we provide some simulation results to demonstrate how RFE works and how it can be used for intelligent selection of features. A discussion is provided in Section 9, detailed proofs are given in the Appendix, and the resources for the necessary code are given in the supplementary material.
2 Preliminaries
We start with some preliminaries and define the notation that we will follow for the rest of the paper. We also give a brief introduction to support vector machines and empirical risk minimization.
2.1 Empirical risk minimization
Empirical risk minimization (ERM) is a general setting of many supervised learning problems.
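Let the input space $(\mathcal{X}, \mathcal{A})$ be measurable, such that $\mathcal{X} \subseteq B \subset \mathbb{R}^d$ where $B$ is an open Euclidean ball centered at $0$, and $\mathcal{Y} \subset \mathbb{R}$, and let $P$ be a measure on $\mathcal{X} \times \mathcal{Y}$. A function $L: \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ is called a loss function if it is measurable. We say that a loss function $L$ is convex if $L(x, y, \cdot)$ is convex for every $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. A loss function $L$ is called locally Lipschitz continuous with local Lipschitz constant $c_L(\cdot)$ if for every $a \ge 0$,
$$\sup_{x \in \mathcal{X},\, y \in \mathcal{Y}} |L(x, y, s) - L(x, y, s')| \le c_L(a)\,|s - s'|, \qquad s, s' \in [-a, a].$$
The loss function $L$ is said to be Lipschitz continuous if there is a constant $c_L$ such that $c_L(a) \le c_L$ for all $a \ge 0$.
For any measurable function $f: \mathcal{X} \to \mathbb{R}$ we define the $L$-risk of $f$ with respect to the measure $P$ as $\mathcal{R}_{L,P}(f) = \int L(x, y, f(x))\, dP(x, y)$. The Bayes risk with respect to the loss function $L$ and the measure $P$ is defined as $\mathcal{R}^*_{L,P} = \inf_f \mathcal{R}_{L,P}(f)$, where the infimum is taken over the set of all measurable functions, $\mathcal{L}_0(\mathcal{X})$. A function that achieves this infimum is called a Bayes decision function.
Let $\mathcal{F} \subset \mathcal{L}_0(\mathcal{X})$ be a nonempty functional space, and $L$ be any loss function. Let
$$f_{P,\mathcal{F}} = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f) \qquad (1)$$
be the minimizer of the infinite-sample risk within the space $\mathcal{F}$. Define the minimal risk within the space $\mathcal{F}$ as $\mathcal{R}^*_{L,P,\mathcal{F}} = \inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f)$. Define the empirical risk as $\mathcal{R}_{L,D}(f) = \frac{1}{n} \sum_{i=1}^n L(x_i, y_i, f(x_i))$.
A learning method whose decision function $f_D$ satisfies $\mathcal{R}_{L,D}(f_D) = \inf_{f \in \mathcal{F}} \mathcal{R}_{L,D}(f)$ for all $n \ge 1$ and $D \in (\mathcal{X} \times \mathcal{Y})^n$ is called empirical risk minimization (ERM) with respect to $L$ and $\mathcal{F}$.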
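The definitions of empirical risk and ERM can be illustrated with a minimal sketch. It assumes the squared-error loss and a linear functional class, both chosen here only for illustration; the function names `squared_loss`, `empirical_risk` and `erm_linear` are hypothetical and do not come from the paper.

```python
import numpy as np

def squared_loss(y, t):
    """Squared-error loss L(y, t) = (y - t)^2 (an assumed choice of loss)."""
    return (y - t) ** 2

def empirical_risk(f, X, y, loss=squared_loss):
    """Empirical risk: the average loss of f over the sample D = (X, y)."""
    return float(np.mean(loss(y, f(X))))

def erm_linear(X, y):
    """ERM over the assumed class F = {x -> <w, x>} under squared loss,
    which reduces to ordinary least squares."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Simulated sample (illustrative data, not from the paper).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(200)

w_hat = erm_linear(X, y)
risk_erm = empirical_risk(lambda Z: Z @ w_hat, X, y)
risk_zero = empirical_risk(lambda Z: np.zeros(len(Z)), X, y)
print(risk_erm, risk_zero)
```

By construction, no other function in the linear class attains a smaller empirical risk on this sample than the fitted one, which is what the ERM property requires.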
2.2 Support vector machines
The results developed for SVM in this paper are valid not only for classification, but also for regression under some general assumptions on the output space $\mathcal{Y}$. Throughout this paper we refer to all these versions as SVM.
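Let $\mathcal{H}$ be a Hilbert function space over $\mathcal{X}$. A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$ if $k(\cdot, x) \in \mathcal{H}$ for all $x \in \mathcal{X}$ and it has the reproducing property that $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$ and all $x \in \mathcal{X}$. The space $\mathcal{H}$ is called a real-valued reproducing kernel Hilbert space (RKHS) over $\mathcal{X}$ if for all $x \in \mathcal{X}$, the Dirac functional $\delta_x: \mathcal{H} \to \mathbb{R}$ defined by $\delta_x(f) = f(x)$, $f \in \mathcal{H}$, is continuous (for more details refer to chapter of SC08).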
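Let $L$ be a convex, locally Lipschitz continuous loss function and $\mathcal{H}$ be a separable RKHS of a measurable kernel $k$ on $\mathcal{X}$. Let $D = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ be a set of $n$ i.i.d. observations drawn according to the probability measure $P$ and fix a $\lambda > 0$. Define the empirical SVM decision function as
$$f_{D,\lambda} = \arg\min_{f \in \mathcal{H}} \; \lambda \|f\|_{\mathcal{H}}^2 + \mathcal{R}_{L,D}(f), \qquad (2)$$
where $\mathcal{R}_{L,D}$ is the empirical risk defined as before.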
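To make the regularized empirical risk concrete, the following sketch minimizes $\lambda\|f\|_{\mathcal{H}}^2$ plus the empirical risk over the RKHS of a Gaussian kernel. It assumes the squared-error loss (a convex, locally Lipschitz loss) purely because the minimizer then has a closed form via the representer theorem; the hinge loss would need a numerical solver. The helper names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, gamma=1.0):
    """Gram matrix of the Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fit_regularized(K, y, lam):
    """Minimize lam * ||f||_H^2 + (1/n) * sum_i (y_i - f(x_i))^2 over the RKHS.
    With f = sum_i alpha_i k(., x_i), the minimizer is alpha = (K + n*lam*I)^{-1} y."""
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def regularized_risk(alpha, K, y, lam):
    """Value of lam * alpha' K alpha + mean squared residual for f = K alpha."""
    resid = y - K @ alpha
    return float(lam * alpha @ K @ alpha + np.mean(resid ** 2))

# Simulated sample (illustrative data, not from the paper).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(60, 2))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(60)

lam = 1e-2
K = gaussian_gram(X, X)
alpha = fit_regularized(K, y, lam)
best = regularized_risk(alpha, K, y, lam)
print(best)
```

Under these assumptions the minimized objective also equals $\lambda\, y^\top (K + n\lambda I)^{-1} y$, which gives a quick numerical check of the fit.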
For a given , define the SVM learning method as the map defined by for all . We say that a learning method is measurable if it is measurable for all with respect to the minimal completion of the product field on . Lemma of SC08, under the conditions given in Section 2.2 above yields that the corresponding SVM that produces the decision functions for is a measurable learning method for all and for all . The maps mapping to are measurable. Since Lemma of SC08 shows that the map defined by is measurable, we therefore obtain measurability of .
Define and define the approximation error
(3) 
2.3 Entropy Numbers
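For a metric space $(T, d)$ and for any integer $n \ge 1$, the $n$th entropy number of $(T, d)$ is defined as
$$e_n(T, d) = \inf\Big\{\varepsilon > 0 : \exists\, s_1, \dots, s_{2^{n-1}} \in T \text{ such that } T \subset \bigcup_{i=1}^{2^{n-1}} B_d(s_i, \varepsilon)\Big\}, \qquad (4)$$
where $B_d(s, \varepsilon)$ is the ball of radius $\varepsilon$ centered at $s$, with respect to the metric $d$. Let $S: E \to F$ be a bounded linear operator between normed spaces $E$ and $F$; then we write $e_n(S) = e_n(S B_E, \|\cdot\|_F)$, where $B_E$ is the unit ball in $E$.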
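For intuition, the entropy number of a finite point set can be bounded from above numerically. The sketch below assumes the Euclidean metric and uses greedy farthest-point selection of $2^{n-1}$ centers; this is a heuristic upper bound, not the exact infimum in the definition, and the function name `entropy_upper_bound` is hypothetical.

```python
import numpy as np

def entropy_upper_bound(T, n):
    """Upper bound on the n-th entropy number of the finite point set T
    (rows of T) under the Euclidean metric. Any choice of 2**(n-1) centers
    in T that covers T with radius eps shows e_n <= eps; here the centers
    are picked greedily (farthest-point traversal)."""
    m = 2 ** (n - 1)
    centers = [0]                                   # arbitrary starting center
    dist = np.linalg.norm(T - T[0], axis=1)         # distance to nearest center
    while len(centers) < min(m, len(T)):
        nxt = int(np.argmax(dist))                  # farthest point so far
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(T - T[nxt], axis=1))
    return float(dist.max())

# Illustrative point set in the unit square.
rng = np.random.default_rng(0)
T = rng.uniform(size=(64, 2))
bounds = [entropy_upper_bound(T, n) for n in range(1, 9)]
print(bounds)
```

The bound is non-increasing in $n$ because each extra unit of $n$ doubles the number of available centers, and it reaches zero once every point can serve as its own center.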
Note: If we have for a given kernel , Lemma of SC08 implies that every is bounded which further implies that , where .
3 Feature Elimination Algorithm
The original recursive feature elimination (RFE) method was proposed for SVMs by Guyon et al. (2002). The feature elimination procedure we propose here is similar to the one in Guyon et al. except for the elimination criterion. While Guyon et al. use the Hilbert space norm as the criterion to eliminate features recursively, we use the entire objective function, that is, the regularized Hilbert space norm together with the empirical risk. Hence, while Guyon et al.’s RFE is only applicable in analyses involving SVM, the modified RFE that we propose here can be used in ERM as well.
3.1 The Algorithms
The RFE was originally developed for support vector machines, hence we provide the algorithm for SVM first. The definition for ERM follows similarly. First we define some related concepts. A detailed discussion on these will be given in Section 4.
Definition 1.
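For any set of indices $J \subseteq \{1, 2, \dots, d\}$ and a given functional space $\mathcal{F}$, define $\mathcal{F}^J = \{f \circ \pi^{J^c} : f \in \mathcal{F}\}$, where $\pi^{J^c}$ is the projection map taking $x$ to $x^J$ (that is, $\pi^{J^c}(x) = x^J$), such that $x^J$ is produced from $x$ by replacing the elements of $x$ indexed in the set $J$ by zero.
We define the space $\mathcal{X}^J = \{\pi^{J^c}(x) : x \in \mathcal{X}\}$, such that $\pi^{J^c}: \mathcal{X} \to \mathcal{X}^J$ is a surjection.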
Definition 2.
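For a given RKHS $\mathcal{H}$ indexed by a kernel $k$ and for a given $J \subseteq \{1, 2, \dots, d\}$, define $\mathcal{H}^J$ to be the RKHS with reproducing kernel $k^J$, where $k^J(x, y) = k(\pi^{J^c}(x), \pi^{J^c}(y))$.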
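The two definitions above can be sketched in a few lines. The code assumes the projection zeroes out the coordinates indexed by $J$ while keeping the dimension, as described in Definition 1, and that the projected kernel is the original kernel composed with this projection; the names `project`, `projected_kernel` and `linear_kernel` are hypothetical.

```python
import numpy as np

def project(X, J):
    """Projection map: replace the coordinates indexed by J with zero.
    The dimension of the vectors is unchanged and X itself is not modified."""
    Xp = np.array(X, dtype=float, copy=True)
    idx = np.array(sorted(J), dtype=int)
    Xp[..., idx] = 0.0
    return Xp

def projected_kernel(k, J):
    """Kernel obtained by composing k with the projection:
    k_J(x, y) = k(project(x), project(y))."""
    return lambda A, B: k(project(A, J), project(B, J))

def linear_kernel(A, B):
    """Gram matrix of the linear kernel <a, b> (used only as an example)."""
    return A @ B.T

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
J = {1}                                   # eliminate the second feature
K_J = projected_kernel(linear_kernel, J)(X, X)
print(K_J.shape)
```

For the linear kernel, the projected Gram matrix coincides with the Gram matrix computed from the remaining columns only, which is a convenient sanity check.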
Now we are ready to provide the algorithm. Assume the support vector machine framework, where we are given an RKHS with respect to a kernel .
Algorithm 3.
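Start off with $J$ empty and let $Z = \{1, 2, \dots, d\}$.
STEP 1: In the $k$th cycle of the algorithm choose dimension $i_k \in Z \setminus J$ for which
$$i_k = \arg\min_{i \in Z \setminus J} \Big[\lambda \|f_{D,\lambda,\mathcal{H}^{J \cup \{i\}}}\|_{\mathcal{H}^{J \cup \{i\}}}^2 + \mathcal{R}_{L,D}\big(f_{D,\lambda,\mathcal{H}^{J \cup \{i\}}}\big)\Big] - \Big[\lambda \|f_{D,\lambda,\mathcal{H}^{J}}\|_{\mathcal{H}^{J}}^2 + \mathcal{R}_{L,D}\big(f_{D,\lambda,\mathcal{H}^{J}}\big)\Big]. \qquad (5)$$
STEP 2: Update $J = J \cup \{i_k\}$. Go to STEP 1.
Continue this until the minimized difference in (5) becomes larger than a predetermined quantity $\delta_n$.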
Now for an empirical risk minimization framework with respect to a given functional space , Algorithm 3 can be modified to match the setting of ERM.
Algorithm 4.
Replace the regularized empirical risk in Algorithm 3, (defined for a given set of indices ) by the empirical risk .
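The recursion in Algorithm 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the squared-error loss so that the regularized problem has a closed-form solution, a Gaussian kernel, cycle size 1, and a fixed threshold `delta`. The function names and all parameter values are hypothetical choices.

```python
import numpy as np

def gaussian_gram(X, gamma):
    """Gram matrix of the Gaussian RBF kernel exp(-gamma * ||x - x'||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def min_regularized_risk(X, y, J, lam, gamma):
    """Minimized value of lam * ||f||^2 + empirical squared-error risk over
    the RKHS of the projected kernel (features in J are set to zero)."""
    Xp = X.copy()
    Xp[:, np.array(sorted(J), dtype=int)] = 0.0
    K = gaussian_gram(Xp, gamma)
    n = len(y)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    resid = y - K @ alpha
    return float(lam * alpha @ K @ alpha + np.mean(resid ** 2))

def rfe(X, y, lam=1e-3, gamma=1.0, delta=0.1):
    """Risk-based RFE with cycle size 1. Returns the eliminated features in
    the order of removal; stops when the smallest increase exceeds delta."""
    d = X.shape[1]
    J, removed = set(), []
    current = min_regularized_risk(X, y, J, lam, gamma)
    while len(J) < d - 1:
        # STEP 1: regularized risk after deleting each remaining feature.
        risks = {i: min_regularized_risk(X, y, J | {i}, lam, gamma)
                 for i in range(d) if i not in J}
        i_k = min(risks, key=risks.get)
        if risks[i_k] - current > delta:   # stopping rule
            break
        # STEP 2: update J and repeat.
        J.add(i_k)
        removed.append(i_k)
        current = risks[i_k]
    return removed

# Simulated data: only features 0 and 1 carry signal (illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(150, 4))
y = np.sin(3.0 * X[:, 0]) + np.cos(3.0 * X[:, 1]) + 0.1 * rng.standard_normal(150)
removed = rfe(X, y)
print(removed)
```

For the ERM version in Algorithm 4, the same loop applies with the regularized objective replaced by the plain empirical risk of the empirical risk minimizer in each projected class.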
3.2 Cycle of RFE
We define the ‘cycle’ of the RFE algorithm as the number of ‘dimensions’ deleted in one step of the algorithm. The algorithms in Section 3.1 have cycle $1$. One can also define the algorithm for cycles of size greater than $1$, in which case one deletes chunks of dimensions at a time, equal to the size of the cycle. The cycle can also be defined adaptively, such that in different runs of the algorithm the cycle sizes are different. The theoretical results derived in this paper hold for cycles of any size. Hence, for the sake of simplicity, we set the cycle size to $1$.
4 Functional Spaces on Lower Dimensional Domains
4.1 Feature Elimination in ERM
Suppose we have a functional class $\mathcal{F}$ (see Footnote 1) of functions on $\mathcal{X}$, where $\mathcal{X}$ is as defined in Section 2, and let our goal be to find a function within the functional class $\mathcal{F}$ which minimizes an empirical criterion (like the empirical risk in ERM). If the dimension of the input space is too large, this might lead to more complex solutions when in fact a simpler solution might be good enough. Now suppose that the minimizer of the appropriate infinite-sample version of the empirical criterion (like the risk or expected loss in the case of ERM and SVM) with respect to the probability measure $P$ on $\mathcal{X} \times \mathcal{Y}$ and the functional class $\mathcal{F}$ actually resides in $\mathcal{F}^J$, where $\mathcal{X}^J$ is a lower dimensional version of the given input space $\mathcal{X}$. Then, to avoid overfitting, it is necessary that we try to find the empirical minimizer in a suitably defined lower dimensional version of $\mathcal{F}$. We define the lower dimensional adaptations of the original functional space as in Definition 1.
Footnote 1: Note that the loss functions we consider in this paper (unless otherwise mentioned) are convex and locally Lipschitz with , and hence by and Proposition of SC08, we have . Hence instead of it suffices to consider the smaller subspace .
First note that may not be a subspace of , because for any , may not be contained in . Note that the assertion holds trivially for any Euclidean open ball centered at , and from Section 2 we have that , for some . Hence we will assume that the functional space can be sufficiently redefined as , where the domain of functions in is instead of , such that . This makes the functional classes welldefined, and unless otherwise mentioned, we will assume from hereon that for all possible .
Note also that may not be a subspace of . Although it is more desirable for these functional classes to accept a nested structure between each other, so that as we go down from a space to its lower dimensional version (that is, from to where , we can have the simple relation that ), it does not hold in general.
We now provide a few results that connect the space with its lower dimensional versions. Note that our definition trivially implies that . Now if we define , then Lemma 5 says that is isomorphic to the space . Lemma 6 below shows that by defining the functional classes in this way, many of the nice properties of the functional class are carried forward to the s. The proofs can be found in Appendix A.1 and A.2.
Lemma 5.
.
Lemma 6.
Let be a nonempty functional subspace. Then for any ,

If is dense in , then is dense in .

If is compact, then so is .

, where is the entropy number of the set with respect to the norm as defined in Section 2.3.
4.2 Feature Elimination in SVM
In empirical risk minimization problems our primary focus is the empirical risk, whereas in the case of support vector machines we concentrate on the regularized version of the empirical risk, . The minimization is typically computed over an RKHS , that is, our objective is to find . The regularization term is used to penalize functions with a large RKHS norm. Complex functions which model too closely the output values in the training data set , tend to have large norms (Refer to Exercise in SC08 for a clear motivation). Again we assume that where is an open Euclidean ball centered at . We will also assume that we can sufficiently redefine the RKHS as , such that the domain of the functions in is the Euclidean open ball instead of . So we can extend the domain of the kernel of the RKHS from to and from here onwards we assume . The usual way that we defined the lower dimensional functional spaces in the previous section may not be sufficient here mainly because in SVM, the minimization is computed over an RKHS, and the properties of RKHSs dictate a lot of the statistical analyses. Hence we need to find a way to define them so that these spaces are RKHSs as well.
First we review some properties of RKHS:
The ‘’ space for an RKHS with kernel is defined as . is the completion of the space (See Theorem of SC08 for details).
Let be any set and be a map. Let be the kernel on . Then define the map as, . Observe that is a kernel on (Paulsen, 2009, Proposition 5.13).
The next theorem gives a natural relationship between the RKHS on and the RKHS on . Also when is a subset of and is the inclusion map of into , then the kernel is the restriction of the kernel on .
Theorem 7.
Let and be two sets and let be a kernel function on and let be a function. Then , and for we have that .
Now let be a subset of and be the restriction of the kernel on and be the RKHS admitting as its reproducing kernel and be the RKHS with its reproducing kernel . Then by the above theorem and defining to be the inclusion id map from to , we have
and
for .
For a given RKHS , we can now define these new functional spaces in the following ways:

Projection of the Functional Space: We can define it as we did in the previous section. So define on as, .

Projection of the kernel: defined on as . Note that by defining them like this, the new spaces that we obtain are all RKHSs on .
From the discussion below Theorem 7 and in Def 2 we have,
(6) 
Also note from Def 1 that
(7) 
So we see that restrictions of both of these functional spaces to are the same and the restriction space is itself an RKHS on . Also note trivially that and hence from now onwards we would refer to the space as simply .
Next we redefine Lemma 6 for the RKHSs. The proofs are similar and hence omitted:
Lemma 8.
Let be a nonempty RKHS on . Then for any ,

If is dense in , then is dense in .

If the closure of the unit ball is compact, then so is .

If is separable, then so is .

, where is the th entropy number of the unit ball of the RKHS , with respect to the norm.
In order to provide a heuristic understanding of the importance of the above projection spaces in feature selection, we give an alternative definition of lower dimensional versions of the input space. First, define the map such that for , . So is the dimensional vector containing only those elements of which are given in the index set . Hence we can define the deleted input space as, .
Consider the set up of Theorem 7, with , and . Consider the restricted kernel on with for all . Now for any define the map as for any satisfying . Or in other words the map takes an element from the deleted space, fills in the gaps with zeros and returns an element from the projected space. Note then that is a bijection, and hence the spaces and are isomorphic to each other.
Hence from Theorem 7, we see that is a kernel defined on and with the corresponding RKHS on . Suppose that instead of , our input space is . We want to know when can we define a kernel on such that it is the natural abridgment of the kernel on (in the sense of being able to define it on deleted vectors) and we want to know if there exists a natural connection between and in those cases.
The motivation for the definition of stems from previous works on feature elimination in Support Vector Machines. The Recursive Feature Elimination procedure developed in Guyon et al. (2002) and subsequently revisited and modified in Rakotomamonjy (2003) starts off with a given input space and eliminates features using a weight criteria recursively computed by retraining the SVM on the lower dimensional spaces . From their discussion, it is seen that if the Gram matrix of the training vectors is given by , then the Gram matrix of the training vectors after deleting a particular variable say is taken to be where . This clearly takes into account the assumption that the kernel can be defined on deleted vectors as well, that is, is well defined for any pair of vectors and where and . This is clearly not true in general for any kernel on . So we prefer to work with the projected spaces instead of the deleted spaces as this is more general. But we will show in the discussions below that in most practical cases as discussed in Guyon et al. (2002), and Rakotomamonjy (2003), many of the kernels that we work with satisfy an intrinsic relationship between and . Hence in those cases it is appropriate to work with either of the two setups.
4.2.1 Kernels in Statistical Learning
Most popular kernels in statistical learning can be categorized into three main groups: translation invariant kernels, kernels originating from generative models (like those of Jaakkola and Haussler, and Watkins) and dot product kernels (see Smola et al., 2000). In this paper we restrict our attention to translation invariant and dot product kernels.

Translation Invariant Kernels: A translation invariant kernel satisfies $k(x, y) = k(x - y)$. The class of translation invariant kernels also includes radial kernels, which satisfy $k(x, y) = k(\|x - y\|_2)$, where $\|x\|_2$ is the usual Euclidean norm of the vector $x$ in its correct dimension.

Dot-Product Kernels: A dot-product kernel satisfies $k(x, y) = k(\langle x, y \rangle)$, where $\langle x, y \rangle$ is the standard inner product between vectors $x$ and $y$ in their correct dimension.
Lemma 9.
For Radial Kernels and Dot Product Kernels, .
The proof is simple and therefore omitted.
Also note that for kernels defined on weighted norms, ( where , with being a positive diagonal matrix), the above condition is also satisfied.
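The relationship between projected and deleted vectors discussed above can be checked numerically for the two kernel families. The sketch below assumes a Gaussian RBF kernel as the radial example and a polynomial kernel as the dot-product example (both are illustrative choices), and compares the Gram matrix on zero-filled vectors with the Gram matrix on the lower dimensional deleted vectors.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Radial example: Gaussian RBF kernel exp(-gamma * ||a - b||^2)."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def poly(A, B, degree=3):
    """Dot-product example: polynomial kernel (1 + <a, b>)^degree."""
    return (1.0 + A @ B.T) ** degree

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
J = [1, 3]                                        # features to eliminate
keep = [j for j in range(X.shape[1]) if j not in J]

X_proj = X.copy()
X_proj[:, J] = 0.0                                # projected vectors (zeros filled in)
X_del = X[:, keep]                                # deleted vectors (lower dimension)

gram_match = {name: bool(np.allclose(k(X_proj, X_proj), k(X_del, X_del)))
              for name, k in [("rbf", rbf), ("poly", poly)]}
print(gram_match)
```

The agreement holds because both the Euclidean norm of a difference and the inner product ignore coordinates that are zero in both arguments, so either setup gives the same Gram matrix for these kernels.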
4.2.2 Universal Kernels
A continuous kernel on a compact metric space is called universal if the RKHS is dense in , i.e., for every function and all there exists an such that (where denotes the set of continuous functions from ). From Proposition of SC08 we see that if is a compact metric space with , the RKHS of a universal kernel on , a distribution on and a convex, locally Lipschitz continuous loss with , then we have that . Universal kernels produce particularly large RKHSs.
4.2.3 Universality of Kernels:
For this we refer readers to Micchelli et al. (2006), where the notion of universality for most of the special types of kernels is discussed in detail (including dot product and radial kernels). However, we state two results on radial kernels here to show that, under quite weak assumptions, all nontrivial radial kernels are universal.

Representation of Radial Kernels: Schoenberg (1938) showed that a function $k$ defined as
$$k(x, y) = g(\|x - y\|_2^2), \qquad (8)$$
where $\|\cdot\|_2$ is the usual Euclidean norm, is a valid kernel on $\mathbb{R}^d$ for all $d$ iff there exists a finite Borel measure $\mu$ on $[0, \infty)$ such that for all $t \ge 0$,
$$g(t) = \int_0^\infty e^{-t\sigma}\, d\mu(\sigma). \qquad (9)$$
Not all kernels of this type are universal. Indeed, the choice of a measure concentrated only at $\sigma = 0$ gives a kernel that is identically constant and is therefore not universal. The next result shows that this is the only exceptional case.
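A small numerical illustration of this representation follows. It assumes the common form $k(x, y) = \int_0^\infty e^{-\sigma \|x - y\|^2}\, d\mu(\sigma)$ with a discrete measure $\mu$, so that the kernel is a finite mixture of Gaussian kernels; the function name `schoenberg_kernel` and the chosen weights are hypothetical.

```python
import numpy as np

def schoenberg_kernel(X, sigmas, weights):
    """Gram matrix of k(x, y) = sum_j w_j * exp(-sigma_j * ||x - y||^2),
    i.e. the integral representation with the discrete measure
    mu = sum_j w_j * delta_{sigma_j} (an assumed, illustrative choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return sum(w * np.exp(-s * d2) for s, w in zip(sigmas, weights))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))

# Mixture of three Gaussian kernels: a valid radial kernel.
K_mix = schoenberg_kernel(X, sigmas=[0.1, 1.0, 5.0], weights=[0.2, 0.5, 0.3])
min_eig = float(np.linalg.eigvalsh(K_mix).min())

# Measure concentrated at sigma = 0: the kernel is identically constant.
K_const = schoenberg_kernel(X, sigmas=[0.0], weights=[1.0])
print(min_eig, np.unique(K_const))
```

The first Gram matrix is positive semidefinite up to rounding, while the second is the all-ones matrix, which corresponds to the degenerate constant kernel mentioned above.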
4.3 Notion of risk in Lower Dimensional Versions of the Input Space
Note that the functional space $\mathcal{F}^J$ (and equivalently the RKHS $\mathcal{H}^J$) is defined on the entire input space $\mathcal{X}$ and not only on $\mathcal{X}^J$. So we can define the risk for a function $f \in \mathcal{F}^J$ (or $\mathcal{H}^J$) for the entire input space and not just for $\mathcal{X}^J$. Hence for a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, define $\mathcal{R}_{L,P}(f)$ as $\int L(x, y, f(x))\, dP(x, y)$. This means that we can compare the risk of functions in different lower dimensional versions of the original functional space.
5 Assumptions for RFE and Their Implications
In this section we discuss the assumptions needed for consistency of RFE for both ERM and SVM. We then discuss validity of these assumptions under practical settings.
5.1 Assumptions
Consider the setting of risk minimization (regularized or non regularized) with respect to a given functional space (which are typically RKHSs in case of SVM). Our main aim is to provide a framework where the modified recursive feature elimination method we proposed earlier is consistent in finding the correct lower dimensional subspace of the input space, and the assumptions required for this are:

(A1) Let be a subset of . Let the function minimize risk within the space with respect to the probability distribution on . Here, we define . We assume that there exists a nontrivial that is and that satisfies the criterion that for any pair such that , and such that with and such that .
In other words, Assumption (A1) says that there exists a ‘path’ from the original input space to the correct lower dimensional space in the sense of equality of the minimized risk within the functional spaces s along this ‘path’. So there exists a sequence of indices from to , where , such that is the same for all . Note that may not be unique and there might be more than one path leading to . Also note that may not be unique in general, but any one of them would work for our purpose. So we will assume it to be unique in this paper.

(A2) Let be the exhaustive list of such paths from to , and let . There exists such that whenever , .
In Section 6 we will show that Assumptions (A1) and (A2) are sufficient for a recursive feature elimination algorithm like RFE to work (in terms of consistency). Here we try to show the necessity of Assumption (A1) in order for a welldefined recursive feature elimination algorithm to work.
5.2 Necessity of existence of a path in (A1)
Example 10.
Consider the empirical risk minimization framework. Let and let . Let where is some distribution on and . Let the functional space be . Let the loss function be the squared error loss, i.e., . By Definition 1, and and . We see that but both and . Hence even if the correct lowdimensional functional space may have minimized risk the same as that of the original functional space, if there does not exist a path going down to that space, the recursive algorithm will not work. Note that the minimizer of the risk belongs to but there is no path from to , in the sense of (A1).
5.3 Necessity of Equality in (A1)
It would appear that for the algorithm to work, we don’t have to necessarily work with equalities along the path and that we can relax (A1) to include inequalities as well. Suppose we redefine (A1) as (A1*), where the equality of minimized risk along the path is replaced by ‘’. So now we assume that minimized risk is not necessarily constant along the path, but that it does not increase. We show below that under this modified assumption, our recursive search algorithm might fail to find the correct lower dimensional subspace of the input space.
Example 11.
Let and such that . Let , and let the loss function be squared error loss. Now by definition, , , , , , , and .
By simple calculations, we see that , , and . Note that the correct dimensional subspace of the input space is and there exists paths leading to this space via since or via since in the sense of Assumption (A1*). But there also exists the blind path since which does not lead to the correct subspace. Hence the recursive search in this case may not be guaranteed to lead to the correct subspace.
5.4 Validity of the Assumptions in Practical Situations
In this section we discuss the validity of the assumptions in 5.1, with respect to practical situations of risk minimization.
In ERM, our main aim is to find a function within a class that minimizes the empirical risk within that class. The choice of the functional space is important, as it determines a fine balance between the complexity of the solution (see the discussion in Section 4) on the one hand and finding a function whose risk is close to the Bayes risk on the other. Often the spaces we consider for minimization satisfy properties that make the assumptions in Section 5.1 fairly natural. For SVM, the choice of the RKHS is just as important as the choice of the functional class in ERM. Again, in most practical situations, the RKHS will satisfy properties that make Assumptions (A1) and (A2) quite standard conditions for feature elimination.
5.4.1 Nested spaces in empirical risk minimization
Often the space we consider for ERM is such that for any , , that is, it admits the nested property. So for any with we have . This means we also have nested inequalities in the form of for such and . One example is the linear combination space where the coefficients are allowed to take values in a compact interval containing , . In these cases, simple observation shows that Assumption (A1) translates to saying that there exists a minimizer which minimizes infinitesample risk in , satisfying the criterion that which further implies that for any . Then the results of Section 5.1 imply that for any