# Feature Selection for Ridge Regression with Provable Guarantees

We introduce single-set spectral sparsification as a deterministic sampling based feature selection technique for regularized least squares classification, which is the classification analogue to ridge regression. The method is unsupervised and gives worst-case guarantees of the generalization power of the classification function after feature selection with respect to the classification function obtained using all features. We also introduce leverage-score sampling as an unsupervised randomized feature selection method for ridge regression. We provide risk bounds for both single-set spectral sparsification and leverage-score sampling on ridge regression in the fixed design setting and show that the risk in the sampled space is comparable to the risk in the full-feature space. We perform experiments on synthetic and real-world datasets, namely a subset of TechTC-300 datasets, to support our theory. Experimental results indicate that the proposed methods perform better than the existing feature selection methods.


## 1 Introduction

Ridge regression is a popular penalized regression technique in machine learning and statistics. The Regularized Least Squares Classifier (RLSC) is a simple classifier based on least squares and has a long history in machine learning (Zhang and Peng, 2004; Poggio and Smale, 2003; Rifkin et al., 2003; Fung and Mangasarian, 2001; Suykens and Vandewalle, 1999; Zhang and Oles, 2001; Agarwal, 2002). RLSC is also the classification analogue to ridge regression. RLSC has been known to perform comparably to the popular Support Vector Machines (SVM) (Rifkin et al., 2003; Fung and Mangasarian, 2001; Suykens and Vandewalle, 1999; Zhang and Oles, 2001). RLSC can be solved by simple vector space operations and does not require quadratic optimization techniques like SVM.

We propose a deterministic feature selection technique for RLSC with provable guarantees. There exist numerous feature selection techniques which work well empirically. There also exist randomized feature selection methods, such as leverage-score sampling (Dasgupta et al., 2007), which have provable guarantees and work well empirically. But the randomized methods have a failure probability and have to be re-run multiple times to get accurate results. Also, a randomized algorithm may not select the same features in different runs, whereas a deterministic algorithm will select the same features irrespective of how many times it is run. This becomes important in many applications. Unsupervised feature selection involves selecting features oblivious to the class or labels.

In this work, we present a new provably accurate unsupervised feature selection technique for RLSC. We study a deterministic sampling based feature selection strategy for RLSC with provable non-trivial worst-case performance bounds.
We also use single-set spectral sparsification and leverage-score sampling as unsupervised feature selection algorithms for ridge regression in the fixed design setting. Since the methods are unsupervised, the selected features do not depend on the target variables, so the methods remain applicable in the fixed design setting, where the target variables have additive homoskedastic noise. The algorithms sample a subset of the features from the original data matrix and then perform the regression task on the reduced-dimension matrix. We provide risk bounds for the feature selection algorithms on ridge regression in the fixed design setting.
The number of features selected by both algorithms is proportional to the rank of the training set. The deterministic sampling-based feature selection algorithm performs better in practice when compared to existing methods of feature selection.

## 2 Our Contributions

We introduce single-set spectral sparsification as a provably accurate deterministic feature selection technique for RLSC in an unsupervised setting. The number of features selected by the algorithm is independent of the total number of features, but depends on the number of data-points. The algorithm selects a small number of features and solves the classification problem using those features. Dasgupta et al. (2007) used a leverage-score based randomized feature selection technique for RLSC and provided worst-case guarantees comparing the approximate classifier function to the one obtained using all features. We use a deterministic algorithm to provide worst-case generalization error guarantees. The deterministic algorithm does not come with a failure probability, and the number of features it requires is smaller than that required by the randomized algorithm. The leverage-score based algorithm has a sampling complexity of $O\left(\frac{n}{\epsilon^2}\log\frac{n}{\epsilon^2\sqrt{\delta}}\right)$, whereas single-set spectral sparsification requires $O(n/\epsilon^2)$ features to be picked, where $n$ is the number of training points, $\delta \in (0,1)$ is a failure probability and $\epsilon \in (0,1/2]$ is an accuracy parameter. As in Dasgupta et al. (2007), we also provide additive-error approximation guarantees for any test-point and relative-error approximation guarantees for test-points that satisfy some conditions with respect to the training set.
We introduce single-set spectral sparsification and leverage-score sampling as unsupervised feature selection algorithms for ridge regression and provide risk bounds for the subsampled problems in the fixed design setting. The risk in the sampled space is comparable to the risk in the full-feature space. We give relative-error guarantees of the risk for both feature selection methods in the fixed design setting.
From an empirical perspective, we evaluate single-set spectral sparsification on synthetic data and 48 document-term matrices, which are a subset of the TechTC-300 (Davidov et al., 2004) dataset. We compare the single-set spectral sparsification algorithm with leverage-score sampling, information gain, rank-revealing QR factorization (RRQR) and random feature selection. We do not report running times because feature selection is an offline task. The experimental results indicate that single-set spectral sparsification out-performs all the methods in terms of out-of-sample error for all 48 TechTC-300 datasets. We observe that a much smaller number of features is required by the deterministic algorithm to achieve good performance when compared to leverage-score sampling.

## 3 Background and Related Work

### 3.1 Notation

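$A, B, \ldots$ denote matrices and $a, b, \ldots$ denote column vectors; $e_i$ (for all $i = 1, \ldots, n$) is the standard basis, whose dimensionality will be clear from context; and $I_n$ is the $n \times n$ identity matrix. The Singular Value Decomposition (SVD) of a matrix $A \in \mathbb{R}^{m \times n}$ of rank $\rho$ is equal to $A = U \Sigma V^T$, where $U \in \mathbb{R}^{m \times \rho}$ is an orthogonal matrix containing the left singular vectors, $\Sigma \in \mathbb{R}^{\rho \times \rho}$ is a diagonal matrix containing the singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_\rho > 0$, and $V \in \mathbb{R}^{n \times \rho}$ is a matrix containing the right singular vectors. The spectral norm of $A$ is $\|A\|_2 = \sigma_1$. $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values of $A$. $\kappa_A = \sigma_{\max}/\sigma_{\min}$ is the condition number of $A$. $U^{\perp}$ denotes any orthogonal matrix whose columns span the subspace orthogonal to $U$. A vector $q \in \mathbb{R}^m$ can be expressed as $q = A\alpha + U^{\perp}\beta$ for some vectors $\alpha \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^{m-\rho}$, i.e. $q$ has one component along $A$ and another component orthogonal to $A$.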

### 3.2 Matrix Sampling Formalism

We now present the tools of feature selection. Let $X \in \mathbb{R}^{d \times n}$ be the data matrix consisting of $n$ points and $d$ dimensions, and let $S \in \mathbb{R}^{r \times d}$ be a matrix such that $SX \in \mathbb{R}^{r \times n}$ contains $r$ rows of $X$. Matrix $S$ is a binary indicator matrix, which has exactly one non-zero element in each row. The non-zero element of each row of $S$ indicates which row of $X$ will be selected. Let $D \in \mathbb{R}^{r \times r}$ be the diagonal matrix such that $DSX \in \mathbb{R}^{r \times n}$ rescales the rows of $X$ that are in $SX$. The matrices $S$ and $D$ are called the sampling and re-scaling matrices respectively. We will replace the sampling and re-scaling matrices by a single matrix $R = DS \in \mathbb{R}^{r \times d}$, which specifies which of the $r$ rows of $X$ are to be sampled and how they are to be rescaled.
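As a small illustration of this formalism, the following NumPy sketch builds a sampling matrix, a rescaling matrix, and their product, and applies it to a data matrix with features as rows. The helper name `sampling_rescaling_matrix`, the chosen indices and the weights are hypothetical and only serve the example; they are not produced by any of the selection algorithms discussed later.

```python
import numpy as np

def sampling_rescaling_matrix(indices, weights, d):
    """Build R = D S for the selected row indices and their rescaling weights."""
    r = len(indices)
    S = np.zeros((r, d))
    S[np.arange(r), indices] = 1.0   # exactly one non-zero entry per row
    D = np.diag(weights)             # r x r diagonal rescaling matrix
    return D @ S

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))      # d = 6 features (rows), n = 3 points (columns)

# Hypothetical choice: keep features 4 and 1, rescaled by 2.0 and 0.5.
R = sampling_rescaling_matrix([4, 1], [2.0, 0.5], d=6)
X_tilde = R @ X                      # r x n reduced data matrix
```

In practice one would store only the indices and weights rather than the dense $r \times d$ matrix; the dense form is used here to mirror the notation.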

### 3.3 RLSC Basics

Consider training data of $n$ points in $d$ dimensions with respective labels $y_i \in \{-1, +1\}$ for $i = 1, \ldots, n$.

The solution of binary classification problems via Tikhonov regularization in a Reproducing Kernel Hilbert Space (RKHS) using the squared loss function results in the Regularized Least Squares Classification (RLSC) problem (Rifkin et al., 2003), which can be stated as:

$$\min_{x \in \mathbb{R}^n} \|Kx - y\|_2^2 + \lambda x^T K x \tag{1}$$

where $K$ is the $n \times n$ kernel matrix defined over the training dataset, $\lambda$ is a regularization parameter and $y$ is the $n$-dimensional $\{\pm 1\}$ class label vector. In matrix notation, the training data-set $X$ is a $d \times n$ matrix, consisting of $n$ data-points and $d$ features ($d \gg n$). Throughout this study, we assume that $X$ is a full-rank matrix. We shall consider the linear kernel, which can be written as $K = X^T X$. Using the SVD of $X$, the optimal solution of Eqn. 1 in the full-dimensional space is

$$x_{opt} = V(\Sigma^2 + \lambda I)^{-1} V^T y. \tag{2}$$

The vector $x_{opt}$ can be used as a classification function that generalizes to test data. If $q \in \mathbb{R}^d$ is the new test point, then the binary classification function is:

$$f(q) = x_{opt}^T X^T q. \tag{3}$$

Then, $\mathrm{sign}(f(q))$ gives the predicted label ($-1$ or $+1$) to be assigned to the new test point $q$.
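The closed-form solution and the classification rule can be sketched in NumPy as follows. This is a minimal illustration under the linear-kernel assumption $K = X^T X$ with features stored as rows of `X`; the function names `rlsc_fit` and `rlsc_predict` and the synthetic data are hypothetical. The final lines cross-check the SVD form against a direct solve of $(K + \lambda I)x = y$, which is what setting the gradient of Eqn. 1 to zero gives when $K$ is invertible.

```python
import numpy as np

def rlsc_fit(X, y, lam):
    """Return x_opt = V (Sigma^2 + lam I)^{-1} V^T y for the linear kernel K = X^T X."""
    # X is d x n (features x points); the thin SVD gives s of length n and Vt of size n x n.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((Vt @ y) / (s**2 + lam))

def rlsc_predict(X, x_opt, q):
    """Return sign(f(q)) with f(q) = x_opt^T X^T q."""
    return np.sign(x_opt @ (X.T @ q))

rng = np.random.default_rng(0)
d, n, lam = 50, 8, 0.1
X = rng.standard_normal((d, n))
y = np.sign(rng.standard_normal(n))      # synthetic +/-1 labels
x_opt = rlsc_fit(X, y, lam)

# Cross-check against the direct solve (K + lam I) x = y.
K = X.T @ X
x_direct = np.linalg.solve(K + lam * np.eye(n), y)

label = rlsc_predict(X, x_opt, rng.standard_normal(d))
```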

Our goal is to study how RLSC performs when the deterministic sampling-based feature selection algorithm is used to select features in an unsupervised setting. Let $R \in \mathbb{R}^{r \times d}$ be the matrix that samples and re-scales $r$ rows of $X$, thus reducing the dimensionality of the training set from $d$ to $r \ll d$, where $r$ is proportional to the rank of the input matrix. The transformed dataset in $r$ dimensions is given by $\tilde{X} = RX$, with kernel $\tilde{K} = \tilde{X}^T \tilde{X}$, and the RLSC problem becomes

$$\min_{x \in \mathbb{R}^n} \|\tilde{K}x - y\|_2^2 + \lambda x^T \tilde{K} x, \tag{4}$$

thus giving an optimal vector $\tilde{x}_{opt}$. The new test point $q$ is first dimensionally reduced to $\tilde{q} = Rq$, where $\tilde{q} \in \mathbb{R}^r$, and then classified by the function

$$\tilde{f} = f(\tilde{q}) = \tilde{x}_{opt}^T \tilde{X}^T \tilde{q}. \tag{5}$$

In subsequent sections, we will assume that the test-point $q$ is of the form $q = X\alpha + U^{\perp}\beta$. The first part of the expression shows the portion of the test-point that is similar to the training-set and the second part shows how much the test-point is novel compared to the training set, i.e. $\|\beta\|_2$ measures how much of $q$ lies outside the subspace spanned by the training set.
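To make the decomposition of a test point concrete, the sketch below splits a vector into the part lying in the column space of the training matrix and the orthogonal remainder, using an ordinary least-squares projection. The helper name `split_test_point` and the random data are assumptions made for illustration; the norm of the remainder equals the norm of the coefficient vector on the orthogonal complement, because that complement has orthonormal columns.

```python
import numpy as np

def split_test_point(X, q):
    """Split q into X @ alpha (inside span(X)) plus a remainder orthogonal to span(X)."""
    alpha, *_ = np.linalg.lstsq(X, q, rcond=None)   # least-squares coefficients
    in_span = X @ alpha                             # component similar to the training set
    novel = q - in_span                             # component outside span(X)
    return alpha, in_span, novel

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))    # d = 20 features, n = 4 training points
q = rng.standard_normal(20)         # test point in the full feature space
alpha, in_span, novel = split_test_point(X, q)
novelty = np.linalg.norm(novel)     # how much of q lies outside span(X)
```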

### 3.4 Ridge Regression Basics

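Consider a data-set $X$ of $n$ points in $d$ dimensions with $d \gg n$. Here $X$ contains $n$ i.i.d. samples from the $d$-dimensional independent variable, and $y \in \mathbb{R}^n$ is the real-valued response vector. Ridge Regression (RR), or Tikhonov regularization, penalizes the $\ell_2$ norm of a parameter vector $\beta$ and shrinks the estimated coefficients towards zero. In the fixed design setting, we have $y = X^T\beta + \omega$, where $\omega \in \mathbb{R}^n$ is the homoskedastic noise vector with mean 0 and variance $\sigma^2$. Let $\hat{\beta}_\lambda$ be the solution to the ridge regression problem. The RR problem is stated as:

$$\hat{\beta}_\lambda = \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n}\|y - X^T\beta\|_2^2 + \lambda\|\beta\|_2^2. \tag{6}$$

The solution to Eqn. 6 is $\hat{\beta}_\lambda = (XX^T + n\lambda I_d)^{-1}Xy$. One can also solve the same problem in the dual space. Use the change of variables $\beta = X\alpha$, where $\alpha \in \mathbb{R}^n$, and let $K = X^TX$ be the $n \times n$ linear kernel defined over the training dataset. The optimization problem becomes:

$$\hat{\alpha}_\lambda = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n}\|y - K\alpha\|_2^2 + \lambda\alpha^T K\alpha. \tag{7}$$

Throughout this study, we assume that $X$ is a full-rank matrix. Using the SVD of $X$, the optimal solution in the dual space (Eqn. 7) for the full-dimensional data is given by $\hat{\alpha}_\lambda = (K + n\lambda I_n)^{-1}y = V(\Sigma^2 + n\lambda I_n)^{-1}V^Ty$. The primal solution is $\hat{\beta}_\lambda = X\hat{\alpha}_\lambda$.

In the sampled space, we have $\tilde{K} = \tilde{X}^T\tilde{X} = X^TR^TRX$. The dual problem in the sampled space can be posed as:

$$\tilde{\alpha}_\lambda = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{n}\|y - \tilde{K}\alpha\|_2^2 + \lambda\alpha^T\tilde{K}\alpha. \tag{8}$$

The optimal dual solution in the sampled space is $\tilde{\alpha}_\lambda = (\tilde{K} + n\lambda I_n)^{-1}y$. The primal solution is $\tilde{\beta}_\lambda = \tilde{X}\tilde{\alpha}_\lambda$.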
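The equivalence of the primal and dual formulations can be checked numerically. The sketch below assumes the closed forms obtained by setting the gradients of Eqn. 6 and Eqn. 7 to zero, namely $(XX^T + n\lambda I_d)\beta = Xy$ in the primal and $(K + n\lambda I_n)\alpha = y$ in the dual with $\beta = X\alpha$; the synthetic data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, lam = 30, 10, 0.05
X = rng.standard_normal((d, n))     # d features (rows) x n points (columns)
y = rng.standard_normal(n)          # real-valued responses

# Primal: solve (X X^T + n*lam*I_d) beta = X y, a d x d system.
beta_primal = np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ y)

# Dual: solve (K + n*lam*I_n) alpha = y, an n x n system, then map back with beta = X alpha.
K = X.T @ X
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
beta_dual = X @ alpha
```

When $d \gg n$ the dual system is much smaller than the primal one, which is one reason the analysis is carried out in the dual space.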

### 3.5 Related Work

The work most closely related to ours is that of Dasgupta et al. (2007), who used a leverage-score based randomized feature selection technique for RLSC and provided worst-case bounds comparing the approximate classifier with the classifier that uses all features. The proof of their main quality-of-approximation results provided an intuition of the circumstances under which their feature selection method works well. The running time of leverage-score based sampling is dominated by the time to compute the SVD of the training set, i.e. $O(n^2 d)$, whereas, for single-set spectral sparsification, it is $O(rdn^2)$. Single-set spectral sparsification is a slower and more accurate method than leverage-score sampling. Another work on dimensionality reduction of RLSC is that of Avron et al. (2013), who used efficient randomized algorithms for solving RLSC in settings where the design matrix has a Vandermonde structure. However, this technique is different from ours, since their work is focused on dimensionality reduction using linear combinations of features, but not on actual feature selection.
Lu et al. (2013) used Randomized Walsh-Hadamard transform to lower the dimension of data matrix and subsequently solve the ridge regression problem in the lower dimensional space. They provided risk-bounds of their algorithm in the fixed design setting. However, this is different from our work, since they use linear combinations of features, while we select actual features from the data.

## 4 Our main tools

### 4.1 Single-set Spectral Sparsification

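We describe the Single-Set Spectral Sparsification algorithm (BSS for short; the name comes from the authors Batson, Spielman and Srivastava) of Batson et al. (2009) as Algorithm 1. Algorithm 1 is a greedy technique that selects columns one at a time. Consider the input matrix as a set of $d$ column vectors $U^T = [u_1, u_2, \ldots, u_d]$, with $u_i \in \mathbb{R}^{\ell}$ $(i = 1, \ldots, d)$. Given $\ell$ and $r > \ell$, we iterate over $\tau = 0, 1, 2, \ldots, r-1$. Define the parameters $L_\tau = \tau - \sqrt{r\ell}$, $\delta_L = 1$, $\delta_U = \left(1 + \sqrt{\ell/r}\right)/\left(1 - \sqrt{\ell/r}\right)$ and $U_\tau = \delta_U\left(\tau + \sqrt{\ell r}\right)$. For $U, L \in \mathbb{R}$ and $A \in \mathbb{R}^{\ell \times \ell}$ a symmetric positive definite matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_\ell$, define

$$\Phi(L, A) = \sum_{i=1}^{\ell} \frac{1}{\lambda_i - L}; \qquad \hat{\Phi}(U, A) = \sum_{i=1}^{\ell} \frac{1}{U - \lambda_i}$$

as the lower and upper potentials respectively. These potential functions measure how far the eigenvalues of $A$ are from the upper and lower barriers $U$ and $L$ respectively. We define $\mathcal{L}(u, \delta_L, A, L)$ and $\mathcal{U}(u, \delta_U, A, U)$ as follows:

$$\mathcal{L}(u, \delta_L, A, L) = \frac{u^T\left(A - (L + \delta_L)I_\ell\right)^{-2}u}{\Phi(L + \delta_L, A) - \Phi(L, A)} - u^T\left(A - (L + \delta_L)I_\ell\right)^{-1}u$$

$$\mathcal{U}(u, \delta_U, A, U) = \frac{u^T\left((U + \delta_U)I_\ell - A\right)^{-2}u}{\hat{\Phi}(U, A) - \hat{\Phi}(U + \delta_U, A)} + u^T\left((U + \delta_U)I_\ell - A\right)^{-1}u$$

At every iteration, there exists an index $i_\tau$ and a weight $t_\tau > 0$ such that $t_\tau^{-1} \leq \mathcal{L}(u_{i_\tau}, \delta_L, A_\tau, L_\tau)$ and $t_\tau^{-1} \geq \mathcal{U}(u_{i_\tau}, \delta_U, A_\tau, U_\tau)$. Thus, there will be at most $r$ columns selected after $r$ iterations. The running time of the algorithm is dominated by the search for an index $i_\tau$ satisfying

$$\mathcal{U}(u_{i_\tau}, \delta_U, A_\tau, U_\tau) \leq \mathcal{L}(u_{i_\tau}, \delta_L, A_\tau, L_\tau)$$

and computing the weight $t_\tau$. One needs to compute the upper and lower potentials $\hat{\Phi}(U, A)$ and $\Phi(L, A)$ and hence the eigenvalues of $A$. The cost per iteration is $O(\ell^3)$ and the total cost is $O(r\ell^3)$. For $i = 1, \ldots, d$, we need to compute $\mathcal{L}$ and $\mathcal{U}$ for every $u_i$, which can be done in $O(d\ell^2)$ for every iteration, for a total of $O(rd\ell^2)$. Thus the total running time of the algorithm is $O(rd\ell^2)$. We present the following lemma for the single-set spectral sparsification algorithm.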
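The greedy barrier procedure can be sketched in NumPy as below. This is an illustrative reading of the standard Batson, Spielman and Srivastava scheme rather than the paper's Algorithm 1 verbatim: the function name `bss_select`, the choice of the index maximizing the gap between the lower and upper functions, the weight taken as the midpoint of the admissible interval, and the final rescaling by $(1 - \sqrt{\ell/r})/r$ are assumptions made here. The input is a matrix with orthonormal columns whose rows are the candidate vectors.

```python
import numpy as np

def bss_select(U, r):
    """Greedily pick r rows of U (d x l, U^T U = I) together with rescaling weights."""
    d, l = U.shape
    assert l < r <= d
    x = np.sqrt(l / r)
    delta_L, delta_U = 1.0, (1 + x) / (1 - x)
    A, I = np.zeros((l, l)), np.eye(l)
    idx, t_all = [], []
    for tau in range(r):
        L = tau - np.sqrt(r * l)                  # lower barrier L_tau
        Ub = delta_U * (tau + np.sqrt(r * l))     # upper barrier U_tau
        lam = np.linalg.eigvalsh(A)
        # Potentials at the current and at the shifted barriers.
        dphi_L = np.sum(1 / (lam - L - delta_L)) - np.sum(1 / (lam - L))
        dphi_U = np.sum(1 / (Ub - lam)) - np.sum(1 / (Ub + delta_U - lam))
        ML = np.linalg.inv(A - (L + delta_L) * I)
        MU = np.linalg.inv((Ub + delta_U) * I - A)
        # Quadratic forms u_i^T M u_i for every row u_i at once.
        quad = lambda M: np.einsum('ij,jk,ik->i', U, M, U)
        low = quad(ML @ ML) / dphi_L - quad(ML)
        up = quad(MU @ MU) / dphi_U + quad(MU)
        i = int(np.argmax(low - up))              # some index has up <= low
        t = 2.0 / (low[i] + up[i])                # 1/t lies between up and low
        A += t * np.outer(U[i], U[i])
        idx.append(i)
        t_all.append(t)
    # Final rescaling (assumed) so that the spectrum lands in [(1-x)^2, (1+x)^2].
    weights = np.sqrt(np.array(t_all) * (1 - x) / r)
    return np.array(idx), weights

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((40, 3)))   # orthonormal columns, d = 40, l = 3
idx, w = bss_select(Q, 12)
RU = w[:, None] * Q[idx]                             # rows of R U
eigs = np.linalg.eigvalsh(RU.T @ RU)
```

The same row may be selected more than once, in which case it simply appears in several rows of the sampling matrix with different weights.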

###### Lemma 1.

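BSS (Batson et al., 2009): Given $U \in \mathbb{R}^{d \times \ell}$ satisfying $U^TU = I_\ell$ and $\ell < r < d$, we can deterministically construct sampling and rescaling matrices $S \in \mathbb{R}^{r \times d}$ and $D \in \mathbb{R}^{r \times r}$ with $R = DS$, such that, for all $y \in \mathbb{R}^{\ell}$:

$$\left(1 - \sqrt{\ell/r}\right)^2\|Uy\|_2^2 \leq \|RUy\|_2^2 \leq \left(1 + \sqrt{\ell/r}\right)^2\|Uy\|_2^2.$$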

We now present a slightly modified version of Lemma 1 for our theorems.

###### Lemma 2.

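Given $U \in \mathbb{R}^{d \times \ell}$ satisfying $U^TU = I_\ell$ and $\ell < r < d$, we can deterministically construct sampling and rescaling matrices $S \in \mathbb{R}^{r \times d}$ and $D \in \mathbb{R}^{r \times r}$ such that for $R = DS$,

$$\left\|U^TU - U^TR^TRU\right\|_2 \leq 3\sqrt{\ell/r}.$$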
###### Proof.

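From Lemma 1, it follows that

$$\sigma_\ell\left(U^TR^TRU\right) \geq \left(1 - \sqrt{\ell/r}\right)^2 \quad \text{and} \quad \sigma_1\left(U^TR^TRU\right) \leq \left(1 + \sqrt{\ell/r}\right)^2.$$

Thus,

$$\lambda_{\max}\left(U^TU - U^TR^TRU\right) \leq 1 - \left(1 - \sqrt{\ell/r}\right)^2 \leq 2\sqrt{\ell/r}.$$

Similarly, since $\ell/r \leq \sqrt{\ell/r}$ for $\ell < r$,

$$\lambda_{\min}\left(U^TU - U^TR^TRU\right) \geq 1 - \left(1 + \sqrt{\ell/r}\right)^2 \geq -3\sqrt{\ell/r}.$$

Combining these, we have $\left\|U^TU - U^TR^TRU\right\|_2 \leq 3\sqrt{\ell/r}$.

Note: Let $\epsilon = 3\sqrt{\ell/r}$. It is possible to set an upper bound on $\epsilon$ by setting the value of $r$. We will assume $\epsilon \in (0, 1/2]$. ∎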

### 4.2 Leverage Score Sampling

Our randomized feature selection method is based on importance sampling, or the so-called leverage-score sampling, of Rudelson and Vershynin (2007). Let $U$ be the top-$n$ left singular vectors of the training set $X$. We construct a carefully chosen probability distribution of the form

$$p_i = \frac{\|U_i\|_2^2}{n}, \quad \text{for } i = 1, 2, \ldots, d, \tag{9}$$

i.e. proportional to the squared Euclidean norms of the rows of the left singular vectors, where $U_i$ denotes the $i$-th row of $U$. We then select $r$ rows of $U$ in i.i.d. trials and re-scale the rows with $1/\sqrt{\tilde{p}_i}$, where $\tilde{p}_i = \min\{1, r p_i\}$. The time complexity is dominated by the time to compute the SVD of $X$.
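A minimal NumPy sketch of leverage-score sampling follows. It computes the probabilities of Eqn. 9 from a thin SVD and draws feature indices i.i.d. with replacement. The function name `leverage_score_sample` is hypothetical, and the rescaling weight $1/\sqrt{r p_i}$ used here is the usual importance-sampling choice, taken as an assumption for this example.

```python
import numpy as np

def leverage_score_sample(X, r, rng):
    """Sample r rows (features) of X (d x n) i.i.d. with leverage-score probabilities."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)   # U is d x n with orthonormal columns
    n = U.shape[1]
    p = np.sum(U**2, axis=1) / n                       # Eqn. 9; the entries sum to 1
    idx = rng.choice(len(p), size=r, replace=True, p=p)
    weights = 1.0 / np.sqrt(r * p[idx])                # assumed importance-sampling rescaling
    return idx, weights, p

rng = np.random.default_rng(3)
X = rng.standard_normal((25, 5))                       # d = 25 features, n = 5 points
idx, weights, p = leverage_score_sample(X, 10, rng)
X_tilde = weights[:, None] * X[idx]                    # r x n sampled and rescaled data
```

Because the draw is random, repeated calls with different seeds generally return different feature subsets, which is the behaviour the deterministic method is designed to avoid.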

###### Lemma 3.

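(Rudelson and Vershynin, 2007) Let $\epsilon \in (0, 1/2]$ be an accuracy parameter and $\delta \in (0, 1)$ be the failure probability. Let $U \in \mathbb{R}^{d \times n}$ satisfy $U^TU = I_n$. Let $\tilde{p}_i = \min\{1, r p_i\}$, let $p_i$ be as in Eqn. 9 and let $r = O\left(\frac{n}{\epsilon^2}\log\frac{n}{\epsilon^2\sqrt{\delta}}\right)$. Construct the sampling and rescaling matrix $R$. Then with probability at least $1 - \delta$,

$$\left\|U^TU - U^TR^TRU\right\|_2 \leq \epsilon.$$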

## 5 Theory

In this section we describe the theoretical guarantees of RLSC using BSS and also the risk bounds of ridge regression using BSS and Leverage-score sampling. Before we begin, we state the following lemmas from numerical linear algebra which will be required for our proofs.

###### Lemma 4.

(Stewart and Sun, 1990) For any matrix $E$, such that $I + E$ is invertible,

$$(I + E)^{-1} = I + \sum_{i=1}^{\infty}(-E)^i.$$

###### Lemma 5.

(Stewart and Sun, 1990) Let $A$ and $\tilde{A} = A + E$ be invertible matrices. Then

$$\tilde{A}^{-1} - A^{-1} = -A^{-1}E\tilde{A}^{-1}.$$
###### Lemma 6.

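(Demmel and Veselic, 1992) Let $D$ and $A$ be matrices such that the product $DAD$ is a symmetric positive definite matrix with $A_{ii} = 1$. Let the product $DED$ be a perturbation such that $\|E\|_2 = \eta < \lambda_{\min}(A)$. Here $\lambda_{\min}$ corresponds to the smallest eigenvalue of $A$. Let $\lambda_i$ be the $i$-th eigenvalue of $DAD$ and let $\tilde{\lambda}_i$ be the $i$-th eigenvalue of $D(A + E)D$. Then,

$$\left|\frac{\lambda_i - \tilde{\lambda}_i}{\lambda_i}\right| \leq \frac{\eta}{\lambda_{\min}(A)}.$$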

###### Lemma 7.

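Let $\epsilon \in (0, 1/2]$. Then

$$\left\|q^TU^{\perp}U^{\perp T}R^TRU\right\|_2 \leq \epsilon\left\|U^{\perp}U^{\perp T}q\right\|_2.$$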

The proof of this lemma is similar to Lemma 4.3 of Drineas et al. (2006).

### 5.1 Our Main Theorems on RLSC

The following theorem gives additive-error guarantees on the generalization bounds of the approximate classifier relative to the classifier with no feature selection. The classification error bound of BSS on RLSC depends on the condition number of the training set and on how much of the test-set lies in the subspace of the training set.

###### Theorem 1.

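Let $\epsilon \in (0, 1/2]$ be an accuracy parameter and $r = O(n/\epsilon^2)$ be the number of features selected by BSS. Let $R \in \mathbb{R}^{r \times d}$ be the matrix, as defined in Lemma 2. Let $X \in \mathbb{R}^{d \times n}$, with $d \gg n$, be the training set, $\tilde{X} = RX$ be the reduced dimensional matrix and $q \in \mathbb{R}^d$ be the test point of the form $q = X\alpha + U^{\perp}\beta$. Then, the following hold:

• If $\lambda = 0$, then $\left|\tilde{q}^T\tilde{X}\tilde{x}_{opt} - q^TXx_{opt}\right| \leq \frac{2\epsilon\kappa_X}{\sigma_{\max}}\|\beta\|_2\|y\|_2$.

• If $\lambda > 0$, then $\left|\tilde{q}^T\tilde{X}\tilde{x}_{opt} - q^TXx_{opt}\right| \leq 2\epsilon\kappa_X\|\alpha\|_2\|y\|_2 + \frac{2\epsilon\kappa_X}{\sigma_{\max}}\|\beta\|_2\|y\|_2$.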

###### Proof.

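We assume that $X$ is a full-rank matrix. Let $E = U^TR^TRU - I$, so that by Lemma 2, $\|E\|_2 = \left\|I - U^TR^TRU\right\|_2 \leq \epsilon \leq 1/2$. Using the SVD of $X$, we define

$$\Delta = \Sigma U^TR^TRU\Sigma = \Sigma(I + E)\Sigma. \tag{10}$$

The optimal solution in the sampled space is given by

$$\tilde{x}_{opt} = V(\Delta + \lambda I)^{-1}V^Ty. \tag{11}$$

It can be proven easily that $\Delta$ and $\Delta + \lambda I$ are invertible matrices. We focus on the term $q^TXx_{opt}$. Using the SVD of $X$, we get

$$\begin{aligned} q^TXx_{opt} &= \alpha^TX^TXx_{opt} + \beta^TU^{\perp T}\left(U\Sigma V^T\right)x_{opt} \\ &= \alpha^TV\Sigma^2\left(\Sigma^2 + \lambda I\right)^{-1}V^Ty \qquad (12) \\ &= \alpha^TV\left(I + \lambda\Sigma^{-2}\right)^{-1}V^Ty. \qquad (13) \end{aligned}$$

Eqn. (12) follows from the fact that $U^{\perp T}U = 0$ and by substituting $x_{opt}$ from Eqn. (2). Eqn. (13) follows from the fact that the matrices $\Sigma^2$ and $\Sigma^2 + \lambda I$ are invertible. Now,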

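$$\begin{aligned} \left|q^TXx_{opt} - \tilde{q}^T\tilde{X}\tilde{x}_{opt}\right| &= \left|q^TXx_{opt} - q^TR^TRX\tilde{x}_{opt}\right| \\ &\leq \left|q^TXx_{opt} - \alpha^TX^TR^TRX\tilde{x}_{opt}\right| \qquad (14) \\ &\quad + \left|\beta^TU^{\perp T}R^TRX\tilde{x}_{opt}\right|. \qquad (15) \end{aligned}$$

We bound the two terms (14) and (15) separately. Substituting the values of $\tilde{x}_{opt}$ and $\Delta$,

$$\begin{aligned} \alpha^TX^TR^TRX\tilde{x}_{opt} &= \alpha^TV\Delta V^T\tilde{x}_{opt} \\ &= \alpha^TV\Delta(\Delta + \lambda I)^{-1}V^Ty \\ &= \alpha^TV\left(I + \lambda\Delta^{-1}\right)^{-1}V^Ty \\ &= \alpha^TV\left(I + \lambda\Sigma^{-1}(I + E)^{-1}\Sigma^{-1}\right)^{-1}V^Ty \\ &= \alpha^TV\left(I + \lambda\Sigma^{-2} + \lambda\Sigma^{-1}\Phi\Sigma^{-1}\right)^{-1}V^Ty. \qquad (16) \end{aligned}$$

The last line follows from Lemma 4, which states that $(I + E)^{-1} = I + \Phi$, where $\Phi = \sum_{i=1}^{\infty}(-E)^i$. The spectral norm of $\Phi$ is bounded by

$$\|\Phi\|_2 = \left\|\sum_{i=1}^{\infty}(-E)^i\right\|_2 \leq \sum_{i=1}^{\infty}\|E\|_2^i \leq \sum_{i=1}^{\infty}\epsilon^i = \epsilon/(1 - \epsilon). \tag{17}$$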

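We now bound the first term on the right-hand side, $\left|q^TXx_{opt} - \alpha^TX^TR^TRX\tilde{x}_{opt}\right|$. Substituting (13) and (16),

$$\begin{aligned} \left|q^TXx_{opt} - \alpha^TX^TR^TRX\tilde{x}_{opt}\right| &= \left|\alpha^TV\left\{\left(I + \lambda\Sigma^{-2}\right)^{-1} - \left(I + \lambda\Sigma^{-2} + \lambda\Sigma^{-1}\Phi\Sigma^{-1}\right)^{-1}\right\}V^Ty\right| \\ &\leq \left\|\alpha^TV\left(I + \lambda\Sigma^{-2}\right)^{-1}\right\|_2\left\|V^Ty\right\|_2\|\Psi\|_2. \end{aligned}$$

The last line follows from Lemma 5 and the fact that all matrices involved are invertible. Here,

$$\begin{aligned} \Psi &= \lambda\Sigma^{-1}\Phi\Sigma^{-1}\left(I + \lambda\Sigma^{-2} + \lambda\Sigma^{-1}\Phi\Sigma^{-1}\right)^{-1} \\ &= \lambda\Sigma^{-1}\Phi\Sigma^{-1}\left(\Sigma^{-1}\left(\Sigma^2 + \lambda I + \lambda\Phi\right)\Sigma^{-1}\right)^{-1} \\ &= \lambda\Sigma^{-1}\Phi\left(\Sigma^2 + \lambda I + \lambda\Phi\right)^{-1}\Sigma. \end{aligned}$$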

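Since the spectral norms of $\Sigma$, $\Sigma^{-1}$ and $\Phi$ are bounded, we only need to bound the spectral norm of $\left(\Sigma^2 + \lambda I + \lambda\Phi\right)^{-1}$ to bound the spectral norm of $\Psi$. The spectral norm of the matrix $\left(\Sigma^2 + \lambda I + \lambda\Phi\right)^{-1}$ is the inverse of the smallest singular value of $\Sigma^2 + \lambda I + \lambda\Phi$. From perturbation theory of matrices (Stewart and Sun, 1990) and (17), we get

$$\left|\sigma_i\left(\Sigma^2 + \lambda I + \lambda\Phi\right) - \sigma_i\left(\Sigma^2 + \lambda I\right)\right| \leq \|\lambda\Phi\|_2 \leq \epsilon\lambda.$$

Here, $\sigma_i(Q)$ represents the $i$-th singular value of the matrix $Q$. Also, $\sigma_i\left(\Sigma^2 + \lambda I\right) = \sigma_i^2 + \lambda$, where $\sigma_i$ are the singular values of $X$. Hence,

$$\sigma_i^2 + (1 - \epsilon)\lambda \leq \sigma_i\left(\Sigma^2 + \lambda I + \lambda\Phi\right) \leq \sigma_i^2 + (1 + \epsilon)\lambda.$$

Thus,

$$\left\|\left(\Sigma^2 + \lambda I + \lambda\Phi\right)^{-1}\right\|_2 = 1/\sigma_{\min}\left(\Sigma^2 + \lambda I + \lambda\Phi\right) \leq 1/\left(\sigma_{\min}^2 + (1 - \epsilon)\lambda\right).$$

Here, $\sigma_{\max}$ and $\sigma_{\min}$ denote the largest and smallest singular values of $X$. Since $\|\Sigma\|_2\left\|\Sigma^{-1}\right\|_2 = \sigma_{\max}/\sigma_{\min} = \kappa_X$ (the condition number of $X$), we obtain:

$$\left|q^TXx_{opt} - \alpha^TX^TR^TRX\tilde{x}_{opt}\right| \leq \frac{\epsilon\lambda\kappa_X}{\sigma_{\min}^2 + (1 - \epsilon)\lambda}\left\|\alpha^TV\left(I + \lambda\Sigma^{-2}\right)^{-1}\right\|_2\left\|V^Ty\right\|_2. \tag{18}$$

For $\lambda > 0$, the term $\sigma_{\min}^2 + (1 - \epsilon)\lambda$ in Eqn. (18) is always larger than $(1 - \epsilon)\lambda$, so the fraction in Eqn. (18) can be upper bounded by $\epsilon\kappa_X/(1 - \epsilon) \leq 2\epsilon\kappa_X$ (assuming $\epsilon \in (0, 1/2]$). Also,

$$\left\|\alpha^TV\left(I + \lambda\Sigma^{-2}\right)^{-1}\right\|_2 \leq \left\|\alpha^TV\right\|_2\left\|\left(I + \lambda\Sigma^{-2}\right)^{-1}\right\|_2 \leq \|\alpha\|_2.$$

This follows from the fact that $\left\|\alpha^TV\right\|_2 = \|\alpha\|_2$ and $\left\|V^Ty\right\|_2 = \|y\|_2$, as $V$ is a full-rank orthonormal matrix, and the singular values of $I + \lambda\Sigma^{-2}$ are equal to $1 + \lambda/\sigma_i^2$, making the spectral norm of its inverse at most one. Thus we get

$$\left|q^TXx_{opt} - \alpha^TX^TR^TRX\tilde{x}_{opt}\right| \leq 2\epsilon\kappa_X\|\alpha\|_2\|y\|_2. \tag{19}$$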

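We now bound the second term, $\left|\beta^TU^{\perp T}R^TRX\tilde{x}_{opt}\right|$. Expanding it using the SVD of $X$ and $\tilde{x}_{opt}$,

$$\begin{aligned} \left|\beta^TU^{\perp T}R^TRX\tilde{x}_{opt}\right| &= \left|q^TU^{\perp}U^{\perp T}R^TRU\Sigma(\Delta + \lambda I)^{-1}V^Ty\right| \\ &\leq \left\|q^TU^{\perp}U^{\perp T}R^TRU\right\|_2\left\|\Sigma(\Delta + \lambda I)^{-1}\right\|_2\left\|V^Ty\right\|_2 \\ &\leq \epsilon\left\|U^{\perp}U^{\perp T}q\right\|_2\left\|V^Ty\right\|_2\left\|\Sigma(\Delta + \lambda I)^{-1}\right\|_2 \\ &\leq \epsilon\|\beta\|_2\|y\|_2\left\|\Sigma(\Delta + \lambda I)^{-1}\right\|_2. \end{aligned}$$

The first step follows from $\beta = U^{\perp T}q$ and submultiplicativity of the norm; the second inequality follows from Lemma 7. To conclude the proof, we bound the spectral norm of $\Sigma(\Delta + \lambda I)^{-1}$. Note that from Eqn. (10), $\Sigma^{-1}\Delta\Sigma^{-1} = I + E$ and $\Sigma\Sigma^{-1} = I$, so

$$\Sigma(\Delta + \lambda I)^{-1} = \left(\Sigma^{-1}\Delta\Sigma^{-1} + \lambda\Sigma^{-2}\right)^{-1}\Sigma^{-1} = \left(I + \lambda\Sigma^{-2} + E\right)^{-1}\Sigma^{-1}.$$

One can get a lower bound for the smallest singular value of $I + \lambda\Sigma^{-2} + E$ using matrix perturbation theory and by comparing the singular values of this matrix to the singular values of $I + \lambda\Sigma^{-2}$. We get

$$(1 - \epsilon) + \frac{\lambda}{\sigma_i^2} \leq \sigma_i\left(I + E + \lambda\Sigma^{-2}\right) \leq (1 + \epsilon) + \frac{\lambda}{\sigma_i^2}.$$

$$\begin{aligned} \left\|\left(I + \lambda\Sigma^{-2} + E\right)^{-1}\Sigma^{-1}\right\|_2 &\leq \frac{\sigma_{\max}^2}{\left((1 - \epsilon)\sigma_{\max}^2 + \lambda\right)\sigma_{\min}} \qquad (20) \\ &= \frac{\kappa_X\sigma_{\max}}{(1 - \epsilon)\sigma_{\max}^2 + \lambda} \leq \frac{2\kappa_X}{\sigma_{\max}}. \end{aligned}$$

We assumed that $\epsilon \in (0, 1/2]$, which implies $(1 - \epsilon) + \lambda/\sigma_{\max}^2 \geq 1/2$. Combining these, we get

$$\left|\beta^TU^{\perp T}R^TRX\tilde{x}_{opt}\right| \leq \frac{2\epsilon\kappa_X}{\sigma_{\max}}\|\beta\|_2\|y\|_2. \tag{21}$$

Combining Eqns. (19) and (21) we complete the proof for the case $\lambda > 0$. For $\lambda = 0$, the right-hand side of Eqn. (18) becomes zero and the result follows. ∎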

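Our next theorem provides relative-error guarantees to the bound on the classification error when the test-point has no new components, i.e. $\beta = 0$.

###### Theorem 2.

Let $\epsilon \in (0, 1/2]$ be an accuracy parameter, $r = O(n/\epsilon^2)$ be the number of features selected by BSS and $\lambda > 0$. Let $q \in \mathbb{R}^d$ be the test point of the form $q = X\alpha$, i.e. it lies entirely in the subspace spanned by the training set, and the two vectors $V^Ty$ and $\left(I + \lambda\Sigma^{-2}\right)^{-1}V^T\alpha$ satisfy the property,

$$\left\|\left(I + \lambda\Sigma^{-2}\right)^{-1}V^T\alpha\right\|_2\left\|V^Ty\right\|_2 \leq \omega\left|\left(\left(I + \lambda\Sigma^{-2}\right)^{-1}V^T\alpha\right)^TV^Ty\right|$$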