The problem of supervised Feature Selection (FS) (a.k.a. variable selection) has been studied for several decades in data science fields, such as statistics, machine learning, and data mining. Informally, the problem can be defined as selecting a subset of the available feature set, such that it is of minimal-size and at the same time maximally-predictive for an outcome of interest. Feature selection is performed for several reasons. It may reduce the cost or risk of measuring, computing, storing, and processing of the features. It leads to models of smaller dimensionality that are easier to visualize, inspect, and understand. It may even result in more accurate models by removing the noise, treating the curse-of-dimensionality, and facilitating model fitting. Arguably however, feature selection is primarily employed as a means for knowledge discovery and gaining intuition into the mechanisms generating the data. Indeed, a deep theoretical connection with causality has been identified(Tsamardinos2003). Actually, it is often the case that FS is the primary goal of an analysis and the learned model is only a by-product. For example, a medical researcher analyzing their molecular data may not care about the classification model to cancerous or healthy tissue; they can diagnose the tissues themselves without the use of the computer on the microscope. The whole point of their analysis is to identify the features (e.g., gene expressions) that carry all the information to perform the diagnosis.
When FS is employed for knowledge discovery and understanding considering multiple, equivalent solutions is paramount. It is misleading to present to the domain expert a single feature subset and proclaim all other features are either redundant or irrelevant, if there are multiple equivalent solutions. In addition, when features have an associated measurement cost, one seeks the least cost solution. We argue that in these cases, the analyst ought to identify all solutions and present the possibilities to the domain expert. Multiple, equivalent FS solutions often exist in practice (lagani2016feature), especially in fields with low sample size, high dimensionality and noisy features. Biology and medicine are prototypical fields where these adverse conditions are met. In a seminal paper (ein2004outcome), Ein-dor and co-authors demonstrate that multiple, equivalent prognostic signature for breast cancer can be found just by analyzing the same dataset with a different partition in training and test set, showing that several genes exist which are practically exchangeable in terms of predictive power. Statnikov and Aliferis (Statnikov2010) further show that the presence of multiple optimal signatures is not a rare occurrence, and is actually common in biological datasets.
Despite the relevance of the problem, to date only few algorithms address it directly. Two constraint-based approaches (constraints stemming from the results of conditionally independence tests) are the Target Information Equivalence (TIE (statnikov2013algorithms)) and the Statistically Equivalent Signatures (SES (lagani2016feature)) algorithms. Another algorithm which has been recently proposed by Cox and Battey (Cox2017pnas) searches for multiple solutions across a large number of separate analyses. Unfortunately, both the Cox–Battey method and TIE are quite computationally intensive.
In this work, we show that multiple equivalent solutions with controlled performance deviations can be also defined and identified for the popular least absolute shrinkage and selection operator (Lasso) method (Tibshirani1996). Previous studies based on bootstrapping (Bach2008), sub-sampling (Meinshausen2010) or locally perturbing the support of the Lasso solution (Hara2017) propose to run the Lasso multiple times and collectively establish the best (unique) solution. Their goal is to improve the ordinary Lasso solution in terms of either accuracy or stability. However, this is irrelevant to our goal which is to determine the set of equivalent Lasso solutions.
The case when strongly-equivalent Lasso solutions (SELSs) exist has been developed and studied in (Tibshirani2013)
. The author provided there various characterizations of the space of all SELSs. Multiple SELSs however, rarely occur in real datasets. Henceforth, we propose a relaxed definition of equivalence requiring the equivalent solutions to have not exactly the same, but a similar (within a tolerance threshold) in-sample loss. This is useful because with finite sample different feature subsets may appear statistically indistinguishable in terms of loss, even though asymptotically there is a single optimal solution. Specifically, using the root mean squared error (RMSE) as the loss function, we define theRMSE-equivalent Lasso solutions (RELSs). In contrast, a natural loss for
-penalized logistic regression (i.e., logistic Lasso) is the deviance leading toDeviance-equivalent Lasso solutions (DELSs). Inspired by the characterization of the SELS space, we devise an algorithm that computes a maximal subspace of the space of relaxed Lasso solutions. A set of constraints that guarantees sparsity as well as the key characteristics of the Lasso solution(s) is imposed. The proposed algorithm handles both RMSE and deviance equivalence in a unified manner and actually it is easily extensible to any convex performance metric. Furthermore, we derive theoretical bounds both for the RMSE and the deviance that relate the allowed tolerance with the spectral properties of the predictors matrix.
We perform empirical experiments on several real datasets providing corroborating evidence of the prevalence of multiple, equivalent solutions in real scenarios and quantifying the efficacy of the proposed Lasso extension. Indeed, the presence of multiple heterogeneous signatures (i.e., solutions) having at the same time low value for their coefficient of variation indicates that multiple, equivalent solutions are common across several application fields. In addition, we compare the proposed algorithm against the SES algorithm that can scale w.r.t. the dimensionality of the data. A prominent difference is that Lasso equivalent solutions have on average better prediction performance at the cost of selecting larger signatures while the SES algorithm is more parsimonious. A novelty of the comparison is the employment of a fully automated analysis pipeline (called Just Add Data Bio or JAD Bio) that tries both linear and non-linear learners, optimizes their hyper-parameter values, and corrects the estimation of performance for the bias of the tuning. Thus, the results are obtained after matching each feature selection with the most appropriate learner and its hyper-parameter values.
2. Definition of Lasso Solutions Equivalence
2.1. Lasso preliminaries
We first introduce the Lasso inference problem. Given an outcome vector (a.k.a. target variable), a matrix
with the predictor variables and the coefficient vector, the Lagrangian form of Lasso is defined as
where is a penalty parameter. Lasso solvers such as LARS (Efron2004) and FISTA (Beck2009) have been extensively applied in thousands of real problems while extensions of (1) to generalized linear models (GLMs) such as the logistic regression with -norm regularization have been also proposed (Friedman2010). In GLM Lasso, the quadratic cost term is replaced by the deviance which essentially behaves as the negative log-likelihood.
2.2. Strong equivalence
We briefly present the strong equivalence and its key properties which have been firstly defined and developed in (Tibshirani2013). It has been proved that when the entries of
are drawn from a continuous probability distribution, the uniqueness of the Lasso solution is almost surely guaranteed(Tibshirani2013). However, when the predictor variables take discrete values or there exists perfect collinearity then several optimal Lasso solutions may exist. Mathematically, given , two vectors are Lasso equivalent solutions in the strong sense if and only if . The set of all SELS is defined as
Additionally, two SELSs not only share the same cost function but also predict the same values for the target variable (Tibshirani2013) (i.e., ). Consequently, SELSs have the same norm, . Another characteristic of all SELSs is that the non-zero coefficients are not allowed to flip their sign (Tibshirani2013). Hence, if for the -th coefficient of a Lasso solution then for any SELS, , and, similarly, for the coefficients with negative values.
Proceeding, let be the Lasso solution with the largest active set (or support), which is also known as the equicorrelation set (Tibshirani2013). The maximal Lasso solution is computed using either the elastic net with vanishing penalty norm or the variant of LARS described in (Tibshirani2013). With a slight abuse of notation due to the restriction of the solution space to the predictor variables with non-zero coefficients, the set of all SELSs can be rewritten as
where is the matrix that contains only the columns of that are indexed by the equicorrelation set while is the diagonal matrix with the signs of the Lasso solution, . The first constraint ensures that all elements in will have the same fitted value while the second constraint ensures that coefficients’ sign will not flip.
2.2.1. Enumeration Algorithm for SELSs
A bounded polyhedron (i.e., a polytope) can be represented either as the convex hull of a finite set of vertices or by using a combination of linear constraint equalities and inequalities. In particular, the vertices of the convex polytope, , defined above corresponds to the “extreme” SELSs. Thus, enumerating the vertices of from the set of equality and inequality constraints, we can enumerate all SELSs. There exist algorithms that enumerate the vertices defined by a set of inequality constraints (see Fukuda et al. (Fukuda1997) and the references therein) and they can be extended to take into account equality constraints, too. In this paper, we employed the Matlab package by Jacobson (Jacobson2015) which contains tools for converting between the (in)equality and the vertices representations.
2.2.2. Cardinality of equivalent solutions
It is noteworthy that the number of vertices, like the number of edges and faces, can grow exponentially fast with the dimension of the polytope making the enumeration algorithm impractical when the dimension is larger than 20. In such cases, we cannot enumerate all the extreme SELSs in reasonable time. Nevertheless, it would be useful to know which of the variables participate in all SELSs and which are not. In Section 4, a practical categorization of the variables into dispensable (participate in some solutions) and indispensable (participate in all solutions) which has been firstly introduced in (Tibshirani2013) is presented and extended to the relaxed equivalence, too.
2.3. Relaxed equivalence
A less restrictive equivalence is to require two solutions to have similar performance (loss, error). We will search for solutions whose performance metric differ by a small tolerance from a given solution. Moreover, in order to avoid handling absolute quantities, we suggest working with the relative performance. Denoting by the performance metric, the given solution and the tolerance, we say that is performance equivalent to if and only if the following relation on the relative performance metric is satisfied
However, controlling only the performance metric may add unnecessary redundancy to the relaxed solutions destroying the desired property of sparsity. Indeed, a potentially large number of irrelevant or redundant variables may satisfy the performance constraint for a given , nonetheless, they should not belong to the set of equivalent solutions. To alleviate this issue, we propose to restrict the space of relaxed solutions by allowing only the non-zero coefficients of the given solution to vary. Thus, for a given Lasso solution, with support , a performance metric, , and a tolerance , we define the set of D-equivalent Lasso solutions as
where denotes the complement set of . The following proposition states that the convexity of the performance metric with respect to is inherited to the set of all relaxed Lasso solutions.
Proposition 0 ().
Let be a convex performance metric. Then, the set of all -equivalent solutions is convex.
Let and such that , then for we have and
Thus, and convexity is proved. ∎
Relaxed equivalence can be defined for several different performance metrics, which depend on the type of the analysis task (classification, regression, survival analysis, etc.). Different characteristics of the relaxed solution space can be conveyed by a suitable performance metric. For instance, when the Lasso cost function is used as a performance metric, the support of is maximal and is set to 0 then the strong equivalence is obtained. Next, we present one performance metric for regression problems and one for classification tasks.
2.3.1. RMSE equivalence for Lasso
The root mean squared error defined by
is a standard performance metric for regression problems. The set of RMSE-equivalent Lasso Solutions (RELSs), , is a convex set since the RMSE is a convex performance metric. We note also that RMSE-equivalence with tolerance is almost equivalent with MSE-equivalence with tolerance since implies that for small values of the tolerance.
2.3.2. Deviance equivalence for logistic Lasso
Logistic Lasso is preferred as a feature selection method in binary classification tasks compared to the ordinary Lasso because it takes into consideration the distinguishing properties of the problem. In logistic Lasso, the quadratic term in (1) is replaced by the deviance. Hence, despite the fact that RMSE is applicable, a more appropriate performance metric is the mean deviance defined as
The set of Deviance-equivalent Lasso Solutions (DELSs), , is also a convex set since the deviance is a convex function of its argument.
3. Enumerating Relaxed Lasso Solutions
In the definition of relaxed equivalence (i.e., (5)), we limit the active set of relaxed equivalent solutions to a subset of the given solution’s active set. Despite this restriction, the enumeration of all solutions is still difficult because of the shape of ; the set continuously curves resulting in an infinite number of extreme solutions. Hence, we propose to restrict the solutions that we will eventually enumerate to a subset of . Inspired by the representations of SELSs, we propose an approach that enumerates the vertices of the largest convex polytope of the relaxed space that allows relative performance be less than
. The central idea is to relax the Null space constraint while keeping intact the constraint on the sign of the coefficients. For this purpose, we rewrite the Null space constraint using the singular value decomposition for. Let be decomposed as
are orthogonal matrices with the left and right space eigenvectors, respectively, whileis a diagonal matrix with diagonal elements the ordered singular values (i.e., with ). Assume that for and for , then, the constraint in (3) is equivalent to where corresponds to the matrix with the eigenvectors whose singular values are non-zero.
Inspired by the above representation, we propose to relax the constraint to
where as above while is an integer between and to be specified later. Thus, the convex polytope for the relaxed Lasso solutions is defined as
We further constrain the relaxed Lasso solutions to live in a box and have coefficients that are at most far away from the given Lasso solution, . This constraint is added so as to guarantee the boundedness of the polyhedron defined by the other two constraint. We choose to set the box size to be which allows any coefficient to vary till the zero value. Since the relaxed polytope is performance metric ignorant, our goal is to choose appropriately so as is a maximal subset of the relaxed solution space, . Figure 1 demonstrates a two-dimensional example where the Lasso regression problem has a unique solution (small red circle). By setting the relaxed set of Lasso solutions constitutes the red line that connects the positive axes determined by the direction of . The two vertices (solid red dots) are candidates as “extreme” relaxed Lasso solutions which are further tested for ensuring that their relative performance is below the maximum tolerance. Before proceeding with an algorithm that enumerates the vertices of that is guaranteed to be a subset of the relaxed space, we derive theoretical bounds for two performance metrics.
3.1. Theoretical Bounds
Theorems 3.1 present error bounds for the RMSE and for the deviance. The bounds depend on the norm of the vector with the discarded singular values, . The proof of both parts can be found in the Appendix.
Theorem 3.1 ().
(a) Let be a Lasso solution111Variants of the standard Lasso are also valid.. For any it holds that
(b) Let be a logistic Lasso solution. For any it holds that
3.2. Enumeration algorithm
Algorithm 1 reliably enumerates relaxed solutions that belong to a subset of the set of all relaxed Lasso solutions. It takes as input a Lasso solution, a convex performance metric , a tolerance and the maximum dimension of the subspace that will be explored. We assume that the provided Lasso solution has maximal support hence the equicorrelation set and the sign vector are directly estimated as in lines 2 and 3 of Algorithm 1. In the for loop, the dimension of the subspace is increased starting from until the maximum value is reached or there are vertices whose performance metric is above the tolerance. In other words, the algorithm removes the restriction stemming from the component with the least singular value, allowing itself to search along the direction. It sequentially removes restrictions corresponding to components with larger singular values. As the dimension of the exploration subspace is increased the possibility of finding solutions whose performance exceeds the tolerance is also increased. If such solutions occur then the algorithm breaks and the solutions that exceed the tolerance are discarded. Due to the convexity of the performance metric, the polytope that is defined by the remaining solutions is also a subset of the relaxed space. Thus, the algorithm is sound. Additionally, a variant of binary search can be utilized instead of the linear search for identifying the optimal . However, the performance gains in computational time is minimal due to the fact that the number of vertices grows exponentially fast since the computational cost for the last iteration is proportional to the total cost of all previous iterations.
Remark 0 ().
We also investigated alternative definitions of the subset, , where we relax different properties of the strongly-equivalent characterization. For instance, we relaxed the condition to with resulting, however, in optimization problems which are algorithmically intractable. Overall, the choice of performance measures (RMSE and Deviance) and the particular subset, , in (7), were based on the mathematical simplicity and the practical implementation both being very critical for the adoption of a less popular but important idea of searching for and enumerating multiple equivalent solutions.
3.3. Determining the reference Lasso solution
Algorithm 1 searches for equivalent solutions in the least sensitive directions of the performance metric. It may discard features from the reference support but it cannot include features not initially selected. Therefore, it is important to provide an initial solution with the maximum support. We compute the reference Lasso solution , by optimally tuning its penalty parameter. We perform cross-validated hyper-parameter tuning using AUC (resp. MAE) as performance metric for classification (resp. regression) problems. It has been highlighted and proved (Meinshausen2006, Proposition 1) that many noise features are included in the prediction-oracle solution (i.e., the reference solution we estimate). Indeed, the value of the penalty parameter is lower in the prediction-oracle solution than the consistent one hinting towards larger support. We also observe similar behavior in our experiments with real datasets in the Results section. Thus, we expect the respective initial support to contain most, if not all, of the related to the target variable features. Of course, alternative methods based on elastic net or bootstrapping for selecting the reference solution/support can be utilized, however, such an exploration is beyond this paper’s scope and it is left as future work.
4. Variable Categorization
Due to the exponential growth of the polytope’s vertices, the complete enumeration is not always feasible. Nevertheless, there is an alternative approach to qualitatively assess the predictor variables by summarizing the interesting findings, e.g., reporting the set of features that belong to all solutions or only to some. Exploiting the fact that the feasible range of a coefficient’s value for all Lasso solutions can be efficiently computed as it was shown in (Tibshirani2013), a variable categorization is possible. We present the existing formulation for SELSs and based on it we extend it for RELSs.
SELSs. For each , the -th coefficient’s lower bound and upper bound
are computed by solving the linear programs
respectively. If is an element of the set then the -th variable is called dispensable otherwise it is called indispensable and they participate in all Lasso solutions. The fact that the sign of a coefficient remains the same in all SELSs implies that dispensable variables have either or while indispensable variables have either or . From a practical perspective, the number of linear programs to be solved is which is feasible.
Relaxed equivalence. Similarly, we can relax the criterion on dispensable/indispensable variables. Since the computational cost of the linear programs that are solved is manageable, we can discard the constraint on the control of the subspace dimension (i.e., there is no need for ). The linear programs for the lower and upper bounds for a given are
respectively. Evidently, as we increase the number of dispensable variables is increased while the number of indispensable variables is decreased.
While in this paper we follow the terminology of ”dispensable” and “indispensable” variables from (Tibshirani2013), we would like to emphasize that it may be misleading. It implies that dispensable variables could be dispensed with and filtered out in some sense. Consider however, two features and that are perfect copies of each other and at the same time, the strongest predictors. According to the definition they are dispensable but neither can be filtered out without a measurable loss in performance. Instead, we have recently proposed the terms “indispensable” and “replaceable” 222Invited talk at KDD 2017 conference “Advances in Causal-Based Feature Selection” (http://mensxmachina.org/en/presentations/). to clearly indicate that they the latter features can be replaced, not filtered out.
5. Experimental Setup
To evaluate the proposed method and compare against the SES algorithm, we used 9 datasets (Table 1
) trying to include a large variance of sample and feature sizes. No dataset was excluded from the evaluation based on the results. Seven datasets have binary outcome (i.e., classification tasks), while the other two have continuous outcome (i.e., regression tasks). The analyses were performed using the Just Add Data Bio v0.7 (JAD Bio; Gnosis Data Analysis; www.gnosisda.gr) automated predictive analysis engine.
|Dataset||Sample size||Feature Size||Field|
5.1. Just Add Data Bio Engine
JAD Bio employs a fully-automated machine learning pipeline for producing a predictive model given a training dataset, and an estimate of its predictive performance. JAD Bio performs multiple feature selection and predictive modeling. It tries several configurations, i.e., combinations of preprocessing algorithms, feature selection algorithm, predictive modeling algorithms, and values of their hyper-parameters using a grid-search in the space of hyper-parameters. The best configuration is determined using K-fold Cross Validation, which is then used to produce the final models for each feature subset on all data. The cross-validated performance of the best configuration is known to be optimistic due to the multiple tries (tsamardinos2015performance); JAD Bio estimates and removes the optimism using a bootstrap method before returning the final performance estimate (tsamardinos2017bootstrapping). The final performance estimates of JAD Bio are actually slightly conservative.
Specifically in terms of algorithms, JAD Bio trains several basic and advanced, linear and non-linear, multivariate machine learning and statistical models; namely, for classification problems trains Support Vector Machine models (SVMs) with linear, full polynomial, and Gaussian kernels, Ridge Logistic Regression models, Random Forests models, and Decision Trees, while for regression problems trains Ridge Linear Regression models instead of the logistic ones. In the following set of experiments more than 500 configurations were tried each time to find the optimal ones, resulting in more than 10000 models trained to compare two feature selection algorithms over nine datasets. JAD Bio is able to incorporate other user-defined preprocessing, transformation, and feature selection algorithms written in Java, R, or Matlab programming languages. This functionality allowed us to incorporate the proposed feature selection method into the pipeline.
5.2. Evaluation Pipeline and Tuning
Two variants of JAD Bio were used and compared: the original with the SES algorithm for the feature selection and an alternative with a Matlab implementation of the proposed method. To evaluate and compare them we split each dataset in two stratified parts. Then, JAD Bio is applied on each half (as training set) for obtaining a set of predictive models and their performance estimation, corresponding to different feature subsets (i.e., solutions found by the feature selection algorithms). The trained models are then tested on the remaining data (as test set) to compute the performance on new data. We then test whether the performance of the solutions found are actually similar. Since we partition the data into two, equivalence of the performance of the feature selection subset is determined using the same sample size as in training and thus having the same statistical power. The new proposed method was tuned with penalty parameter, , taking values from to with multiplicative step . We set the value to 0.01 allowing Lasso solutions having up to 1% performance error in the training set.
for regression), the CoV for the number of selected variables, the average hold-out performance (along with the two-tail, t-test adjusted p-value assessing the difference between SES and Lasso), as well as the average number of selected variables (along with the two-tail, t-test adjusted p-value assessing that the average number of variable selected by Lasso is different from the fixed number of variables selected by SES). Classification datasets are reported first, while the regression ones are at the bottom.
6.1. Multiple, equivalent solutions are common for predictive analytics tasks
Table 2 reports the Lasso and SES results. Both algorithms identify multiple solutions on almost all datasets and splits, indicating that the presence of multiple solutions is a common problem across several application domains.
At the same time, the coefficient of variation (CoV, defined as the ratio between standard deviation and average) for the hold-out performances is always quite low; the median CoV for Lasso isand for SES it is (maximum values are respectively and ). This means that the different solutions produce models with performance having standard deviation around the mean AUC value. These results support the claim that multiple solutions found by Lasso and SES are indeed equivalent in terms of predictive power.
Interestingly, when SES retrieves a large number of signatures (i.e., solutions) so does Lasso, and vice versa. Both algorithms provide numerous solutions for the arcene and secom datasets, only a few solutions for Ovarian, Parkinson, prostate, vV, BC_Continuous, and MP (an exception is the bankruptcy dataset). This is despite the fact that the two algorithms follow quite different approaches. Thus, the results provide evidence that the order of magnitude for the number of equivalent solutions is a characteristic of the data at hand.
6.2. Lasso and SES trade-offs
Figure 2 graphically represents the hold-out performances (x-axis, AUC for classification datasets and R for regression ones) and number of selected features (y-axis) for each signature retrieved by Lasso (circles) and SES (triangles) for the first split of each dataset (distinguished by color). The figure shows that Lasso consistently selects up to one order of magnitude more variables than SES (Table 2). In nine out of comparisons ( datasets 2 splits each) Lasso also achieves better performances than SES (adjusted p-value ), with an average AUC difference of between the two methods across classification datasets and R between regression datasets. In summary, the two algorithms are quite complementary, with SES providing more parsimonious models with a moderate cost in terms of predictive power achieving different trade-offs. SES sacrifices optimality (does not attempt to identify all predictive features, or as it is known in the respective literature the full Markov Blanket) to achieve scalability; however, it is easily applicable to different data types by equipping it with an appropriate conditional independence test (see a comparison of Lasso and SES-based feature selection for time-course data (Tsagris2018)).
6.3. Multiple solutions are heterogeneous in terms of included features
presents, for both Lasso and SES, the number of identified signatures categorized according to their size. For sake of clarity only results from the first split are presented, with the results on the second split following almost identical trends. By construction SES retrieves signatures with equal size, while Lasso identifies signatures that can have different sizes. For each group of equal-size signatures, we computed the average Jaccard index within the group as well as the average number of solution-specific features across all pairs of signatures. The Jaccard index between two sets is defined as the ratio between the size of their intersection and the size of their union, and ranges from(disjoint sets) to (perfectly overlapping sets), with two equal-size sets sharing half of their elements achieving . For computing the number of solution-specific features between two sets we first take their union and then subtract their intersection. The resulting features are either in one solution or in the other, but not in both. For the cases where numerous solutions are selected by SES (arcene and secom datasets), both the Jaccard index and the number of dispensable elements indicate a quite elevated heterogeneity, meaning that the multiple solutions proposed by SES include solutions quite different from each other. When only a handful of signatures are retrieved by SES, they tend to be more homogeneous. This is a consequence of the mechanisms used by SES for producing multiple signatures, which constraints each signature to differ from its closest sibling only for one variable. In contrast, Lasso solutions can have more than two signature-specific features even when only two signatures are retrieved. This can be seen in several rows of Table 3, e.g., for the arcene dataset (signature size , ) and for the secom dataset (signature size ). Interestingly, in several cases Lasso solutions differ from each other on average by or more solution-specific features, confirming that the multiple solutions retrieved by Lasso differ from each other due to the joint replacement of several variables.
In this paper we present an algorithm for deriving multiple, equivalent Lasso solutions with theoretical guarantees. The algorithm is empirically evaluated using an automatic platform, JAD Bio, over several learners and tuning hyper-parameters; the same platform is employed for a comparative evaluation against the SES algorithm. The results confirm that (a) multiple Lasso solutions are often found in real-world datasets, (b) these multiple solutions can substantially differ from each other, despite their similar performances, and (c) Lasso and SES provide comparable results, with the first being on average slightly more predictive at the cost of selecting a larger number of features.
|prostate||21, 22, 24||1|
We would like to thank Giorgos Borboudakis for his constructive discussions. The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 617393. PC was funded by Gnosis Data Analysis PC.
Appendix A Proof of Theorem 3.1
(a). First, apply the Minkowski inequality as follows
Next, define the matrices ,
It holds that as well as . Using these matrices we write
Hence, we bound the error term in the inequality above by
where we used the vector-induced matrix norm which is also known as the spectral norm. It holds that thus the bound is rewritten as
Due to the singular value ordering, it holds that . The proof is completed by observing that the box constraint leads to the estimate
(b). We observe that for all
where is the projection coefficient of the -th sample to the -th principal vector. Thus, it holds that
because implies that for . Proceeding, the convexity of the function implies that
where and are -dimensional vectors with elements and . Using the Cauchy-Schwarz inequality, we get
where the last inequality is the consequence of the fact that for all , hence, .
Finally, we estimate
where we use the fact that . Combining the above with the bound for the difference , we obtain the desired bound. ∎