1. Introduction
The problem of supervised Feature Selection (FS) (a.k.a. variable selection) has been studied for several decades in data science fields such as statistics, machine learning, and data mining. Informally, the problem can be defined as selecting a subset of the available feature set that is of minimal size and, at the same time, maximally predictive for an outcome of interest. Feature selection is performed for several reasons. It may reduce the cost or risk of measuring, computing, storing, and processing the features. It leads to models of smaller dimensionality that are easier to visualize, inspect, and understand. It may even result in more accurate models by removing noise, mitigating the curse of dimensionality, and facilitating model fitting. Arguably, however, feature selection is primarily employed as a means for knowledge discovery and for gaining intuition into the mechanisms generating the data. Indeed, a deep theoretical connection with causality has been identified (Tsamardinos2003). In fact, it is often the case that FS is the primary goal of an analysis and the learned model is only a byproduct. For example, a medical researcher analyzing their molecular data may not care about the model classifying tissue as cancerous or healthy; they can diagnose the tissues themselves at the microscope without the use of a computer. The whole point of their analysis is to identify the features (e.g., gene expressions) that carry all the information needed to perform the diagnosis.
When FS is employed for knowledge discovery and understanding, considering multiple, equivalent solutions is paramount. It is misleading to present to the domain expert a single feature subset and proclaim all other features either redundant or irrelevant if there are multiple equivalent solutions. In addition, when features have an associated measurement cost, one seeks the least-cost solution. We argue that in these cases the analyst ought to identify all solutions and present the possibilities to the domain expert. Multiple, equivalent FS solutions often exist in practice (lagani2016feature), especially in fields with low sample size, high dimensionality, and noisy features. Biology and medicine are prototypical fields where these adverse conditions are met. In a seminal paper (ein2004outcome), Ein-Dor and coauthors demonstrated that multiple, equivalent prognostic signatures for breast cancer can be found just by analyzing the same dataset with a different partition into training and test sets, showing that several genes exist which are practically exchangeable in terms of predictive power. Statnikov and Aliferis (Statnikov2010) further showed that the presence of multiple optimal signatures is not a rare occurrence, and is actually common in biological datasets.
Despite the relevance of the problem, to date only a few algorithms address it directly. Two constraint-based approaches (with constraints stemming from the results of conditional independence tests) are the Target Information Equivalence (TIE (statnikov2013algorithms)) and the Statistically Equivalent Signatures (SES (lagani2016feature)) algorithms. Another algorithm, recently proposed by Cox and Battey (Cox2017pnas), searches for multiple solutions across a large number of separate analyses. Unfortunately, both the Cox–Battey method and TIE are quite computationally intensive.
In this work, we show that multiple equivalent solutions with controlled performance deviations can be also defined and identified for the popular least absolute shrinkage and selection operator (Lasso) method (Tibshirani1996). Previous studies based on bootstrapping (Bach2008), subsampling (Meinshausen2010) or locally perturbing the support of the Lasso solution (Hara2017) propose to run the Lasso multiple times and collectively establish the best (unique) solution. Their goal is to improve the ordinary Lasso solution in terms of either accuracy or stability. However, this is irrelevant to our goal which is to determine the set of equivalent Lasso solutions.
The case where strongly-equivalent Lasso solutions (SELSs) exist has been developed and studied in (Tibshirani2013), where the author provided various characterizations of the space of all SELSs. Multiple SELSs, however, rarely occur in real datasets. Hence, we propose a relaxed definition of equivalence requiring the equivalent solutions to have not exactly the same, but a similar (within a tolerance threshold) in-sample loss. This is useful because with a finite sample different feature subsets may appear statistically indistinguishable in terms of loss, even though asymptotically there is a single optimal solution. Specifically, using the root mean squared error (RMSE) as the loss function, we define the RMSE-equivalent Lasso solutions (RELSs). In contrast, a natural loss for $\ell_1$-penalized logistic regression (i.e., logistic Lasso) is the deviance, leading to Deviance-equivalent Lasso solutions (DELSs). Inspired by the characterization of the SELS space, we devise an algorithm that computes a maximal subspace of the space of relaxed Lasso solutions. A set of constraints that guarantees sparsity as well as the key characteristics of the Lasso solution(s) is imposed. The proposed algorithm handles both RMSE and deviance equivalence in a unified manner and, in fact, is easily extensible to any convex performance metric. Furthermore, we derive theoretical bounds, both for the RMSE and for the deviance, that relate the allowed tolerance to the spectral properties of the predictor matrix.
We perform empirical experiments on several real datasets, providing corroborating evidence of the prevalence of multiple, equivalent solutions in real scenarios and quantifying the efficacy of the proposed Lasso extension. Indeed, the presence of multiple heterogeneous signatures (i.e., solutions) that at the same time have a low coefficient of variation in performance indicates that multiple, equivalent solutions are common across several application fields. In addition, we compare the proposed algorithm against the SES algorithm, which scales with the dimensionality of the data. A prominent difference is that the equivalent Lasso solutions have, on average, better prediction performance at the cost of selecting larger signatures, while the SES algorithm is more parsimonious. A novelty of the comparison is the employment of a fully automated analysis pipeline (called Just Add Data Bio, or JAD Bio) that tries both linear and nonlinear learners, optimizes their hyperparameter values, and corrects the performance estimate for the bias introduced by tuning. Thus, the results are obtained after matching each feature selection method with the most appropriate learner and its hyperparameter values.
2. Definition of Lasso Solutions Equivalence
2.1. Lasso preliminaries
We first introduce the Lasso inference problem. Given an outcome vector (a.k.a. target variable) $y \in \mathbb{R}^n$, a matrix $X \in \mathbb{R}^{n \times p}$ with the predictor variables, and the coefficient vector $\beta \in \mathbb{R}^p$, the Lagrangian form of Lasso is defined as
(1)  $\hat\beta \in \arg\min_{\beta} \ \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1,$
where $\lambda \ge 0$ is a penalty parameter. Lasso solvers such as LARS (Efron2004) and FISTA (Beck2009) have been extensively applied in thousands of real problems, while extensions of (1) to generalized linear models (GLMs), such as logistic regression with $\ell_1$-norm regularization, have also been proposed (Friedman2010). In GLM Lasso, the quadratic cost term is replaced by the deviance, which essentially behaves as the negative log-likelihood.
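As a concrete illustration of (1), the sketch below fits a Lasso model with scikit-learn on synthetic, purely illustrative data. Note that scikit-learn's `Lasso` minimizes the rescaled objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so its `alpha` corresponds to $\lambda/n$ in the notation of (1).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]            # only three truly relevant features
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# scikit-learn minimizes (1/(2n))||y - X b||^2 + alpha*||b||_1,
# i.e. alpha = lambda / n in the notation of eq. (1)
model = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
beta_hat = model.coef_

def lasso_cost(beta, X, y, lam):
    """Lagrangian Lasso objective of eq. (1)."""
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

print(np.flatnonzero(beta_hat))              # support of the sparse solution
```

The penalty zeroes out the seven irrelevant coefficients and mildly shrinks the three relevant ones, which is the behavior the rest of the section builds on.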
2.2. Strong equivalence
We briefly present strong equivalence and its key properties, first defined and developed in (Tibshirani2013). It has been proved that when the entries of $X$ are drawn from a continuous probability distribution, the uniqueness of the Lasso solution is almost surely guaranteed (Tibshirani2013). However, when the predictor variables take discrete values or there exists perfect collinearity, several optimal Lasso solutions may exist. Mathematically, given $\lambda$, two vectors $\hat\beta^{(1)}, \hat\beta^{(2)}$ are Lasso-equivalent solutions in the strong sense if and only if both attain the optimal value $c^*$ of the Lasso cost. The set of all SELSs is defined as
(2)  $\mathcal{B} = \{\beta \in \mathbb{R}^p : \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 = c^*\}.$
Additionally, two SELSs not only share the same cost function value but also predict the same values for the target variable (Tibshirani2013) (i.e., $X\hat\beta^{(1)} = X\hat\beta^{(2)}$). Consequently, SELSs have the same $\ell_1$ norm, $\|\hat\beta^{(1)}\|_1 = \|\hat\beta^{(2)}\|_1$. Another characteristic of all SELSs is that the nonzero coefficients are not allowed to flip their sign (Tibshirani2013). Hence, if $\hat\beta_i > 0$ for the $i$th coefficient of a Lasso solution, then $\beta_i \ge 0$ for any SELS $\beta$, and similarly for the coefficients with negative values.
Proceeding, let $\hat\beta$ be the Lasso solution with the largest active set (or support), which is also known as the equicorrelation set $E$ (Tibshirani2013). This maximal Lasso solution is computed using either the elastic net with a vanishing $\ell_2$ penalty or the variant of LARS described in (Tibshirani2013). With a slight abuse of notation, due to the restriction of the solution space to the predictor variables with nonzero coefficients, the set of all SELSs can be rewritten as
(3)  $\mathcal{B} = \{\beta \in \mathbb{R}^{|E|} : X_E(\beta - \hat\beta) = 0, \ S\beta \ge 0\},$
where $X_E$ is the matrix that contains only the columns of $X$ indexed by the equicorrelation set, while $S$ is the diagonal matrix with the signs of the Lasso solution, $S = \mathrm{diag}(\mathrm{sign}(\hat\beta))$. The first constraint ensures that all elements in $\mathcal{B}$ will have the same fitted values, while the second constraint ensures that the coefficients' signs will not flip.
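The role of perfect collinearity is easy to see with a toy example: if two columns of $X$ are identical, redistributing coefficient mass between them (keeping the sum and the signs fixed) leaves both the fitted values and the $\ell_1$ norm unchanged. A minimal numeric sketch (the two candidate vectors are illustrative, not necessarily the exact Lasso optimum):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.standard_normal(n)
X = np.column_stack([x, x])               # two perfectly collinear predictors
y = 3.0 * x + 0.05 * rng.standard_normal(n)
lam = 0.5

def cost(beta):
    """Lasso objective of eq. (1)."""
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

b1 = np.array([2.9, 0.0])                 # all mass on the first copy
b2 = np.array([1.45, 1.45])               # mass split evenly: same sum, same signs

# identical fitted values, identical l1 norm, hence identical cost
same_fit = np.allclose(X @ b1, X @ b2)
same_cost = np.isclose(cost(b1), cost(b2))
```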
2.2.1. Enumeration Algorithm for SELSs
A bounded polyhedron (i.e., a polytope) can be represented either as the convex hull of a finite set of vertices or by a combination of linear equality and inequality constraints. In particular, the vertices of the convex polytope $\mathcal{B}$ defined above correspond to the "extreme" SELSs. Thus, by enumerating the vertices of $\mathcal{B}$ from the set of equality and inequality constraints, we can enumerate all SELSs. There exist algorithms that enumerate the vertices defined by a set of inequality constraints (see Fukuda et al. (Fukuda1997) and the references therein), and they can be extended to take equality constraints into account, too. In this paper, we employed the Matlab package by Jacobson (Jacobson2015), which contains tools for converting between the (in)equality and vertex representations.
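For low-dimensional polytopes, vertex enumeration from an inequality representation can be sketched directly: intersect every pair of constraint boundaries and keep the feasible intersection points. This is a naive stand-in for the more sophisticated methods of Fukuda et al. and the Matlab tools mentioned above, and it assumes a bounded 2-D polytope.

```python
import itertools
import numpy as np

def vertices_2d(A, b, tol=1e-9):
    """Enumerate vertices of the 2-D polytope {x : A x <= b} by intersecting
    every pair of constraint boundaries and keeping feasible points."""
    verts = []
    for i, j in itertools.combinations(range(len(b)), 2):
        M = A[[i, j]]
        if abs(np.linalg.det(M)) < tol:
            continue                      # parallel boundaries: no intersection
        x = np.linalg.solve(M, b[[i, j]])
        if np.all(A @ x <= b + tol):      # keep only feasible intersections
            verts.append(x)
    return np.unique(np.round(verts, 9), axis=0)

# unit square: 0 <= x1 <= 1, 0 <= x2 <= 1
A = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
b = np.array([1., 0., 1., 0.])
V = vertices_2d(A, b)                     # the four corners of the square
```

This pairwise-intersection approach is quadratic in the number of constraints and works only in two dimensions; it is meant purely to make the vertex/halfspace duality concrete.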
2.2.2. Cardinality of equivalent solutions
It is noteworthy that the number of vertices, like the number of edges and faces, can grow exponentially fast with the dimension of the polytope, making the enumeration algorithm impractical when the dimension is larger than 20. In such cases, we cannot enumerate all the extreme SELSs in reasonable time. Nevertheless, it would be useful to know which of the variables participate in all SELSs and which do not. In Section 4, a practical categorization of the variables into dispensable (participating only in some solutions) and indispensable (participating in all solutions), first introduced in (Tibshirani2013), is presented and extended to the relaxed equivalence, too.
2.3. Relaxed equivalence
A less restrictive equivalence is to require two solutions to have similar performance (loss, error). We will search for solutions whose performance metric differs by at most a small tolerance from that of a given solution. Moreover, in order to avoid handling absolute quantities, we suggest working with the relative performance. Denoting by $D(\cdot)$ the performance metric, by $\hat\beta$ the given solution, and by $\epsilon > 0$ the tolerance, we say that $\beta$ is performance-equivalent to $\hat\beta$ if and only if the following relation on the relative performance metric is satisfied:
(4)  $\dfrac{D(\beta) - D(\hat\beta)}{D(\hat\beta)} \le \epsilon.$
However, controlling only the performance metric may add unnecessary redundancy to the relaxed solutions, destroying the desired property of sparsity. Indeed, a potentially large number of irrelevant or redundant variables may satisfy the performance constraint for a given $\epsilon$; nonetheless, they should not belong to the set of equivalent solutions. To alleviate this issue, we propose to restrict the space of relaxed solutions by allowing only the nonzero coefficients of the given solution to vary. Thus, for a given Lasso solution $\hat\beta$ with support $\mathcal{S}$, a performance metric $D(\cdot)$, and a tolerance $\epsilon$, we define the set of $D$-equivalent Lasso solutions as
(5)  $\mathcal{B}_{D,\epsilon} = \left\{\beta : \dfrac{D(\beta) - D(\hat\beta)}{D(\hat\beta)} \le \epsilon, \ \beta_{\mathcal{S}^c} = 0\right\},$
where $\mathcal{S}^c$ denotes the complement of the support $\mathcal{S}$. The following proposition states that the convexity of the performance metric with respect to $\beta$ is inherited by the set of all relaxed Lasso solutions.
Proposition 1 ().
Let $D(\cdot)$ be a convex performance metric. Then, the set $\mathcal{B}_{D,\epsilon}$ of all equivalent solutions is convex.
Proof.
Let $\beta^{(1)}, \beta^{(2)} \in \mathcal{B}_{D,\epsilon}$ and $\alpha \in [0,1]$, and set $\beta = \alpha\beta^{(1)} + (1-\alpha)\beta^{(2)}$. Then $\beta_{\mathcal{S}^c} = 0$ and, by the convexity of $D(\cdot)$,
$D(\beta) \le \alpha D(\beta^{(1)}) + (1-\alpha) D(\beta^{(2)}) \le (1+\epsilon) D(\hat\beta).$
Thus, $\beta \in \mathcal{B}_{D,\epsilon}$ and convexity is proved. ∎
Relaxed equivalence can be defined for several different performance metrics, depending on the type of analysis task (classification, regression, survival analysis, etc.). Different characteristics of the relaxed solution space can be conveyed by a suitable performance metric. For instance, when the Lasso cost function is used as the performance metric, the support of $\hat\beta$ is maximal, and $\epsilon$ is set to 0, strong equivalence is recovered. Next, we present one performance metric for regression problems and one for classification tasks.
2.3.1. RMSE equivalence for Lasso
The root mean squared error, defined by
$\mathrm{RMSE}(\beta) = \tfrac{1}{\sqrt{n}}\|y - X\beta\|_2,$
is a standard performance metric for regression problems. The set of RMSE-equivalent Lasso Solutions (RELSs), $\mathcal{B}_{\mathrm{RMSE},\epsilon}$, is a convex set since the RMSE is a convex performance metric. We note also that RMSE-equivalence with tolerance $\epsilon$ is almost equivalent to MSE-equivalence with tolerance $2\epsilon$, since $(1+\epsilon)^2 = 1 + 2\epsilon + \epsilon^2 \approx 1 + 2\epsilon$ for small values of the tolerance.
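A direct numeric check of RMSE-equivalence in the spirit of (4) can be written as follows (the function names are our own, and we use the one-sided relative criterion):

```python
import numpy as np

def rmse(beta, X, y):
    return np.sqrt(np.mean((y - X @ beta) ** 2))

def is_rmse_equivalent(beta, beta_ref, X, y, eps):
    """Relative-performance criterion:
    (RMSE(beta) - RMSE(beta_ref)) / RMSE(beta_ref) <= eps."""
    r_ref = rmse(beta_ref, X, y)
    return (rmse(beta, X, y) - r_ref) / r_ref <= eps

# tiny example: beta_ref fits well, beta_bad does not
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.1])
beta_ref = np.array([1.0, 2.0])
beta_bad = np.array([0.0, 0.0])
```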
2.3.2. Deviance equivalence for logistic Lasso
Logistic Lasso is preferred as a feature selection method in binary classification tasks, compared to the ordinary Lasso, because it takes into consideration the distinguishing properties of the problem. In logistic Lasso, the quadratic term in (1) is replaced by the deviance. Hence, despite the fact that the RMSE is applicable, a more appropriate performance metric is the mean deviance, defined as
$D(\beta) = -\tfrac{2}{n}\sum_{i=1}^{n} \left[ y_i \log p_i(\beta) + (1 - y_i)\log(1 - p_i(\beta)) \right], \qquad p_i(\beta) = \left(1 + e^{-x_i^\top \beta}\right)^{-1}.$
The set of Deviance-equivalent Lasso Solutions (DELSs), $\mathcal{B}_{\mathrm{dev},\epsilon}$, is also a convex set, since the deviance is a convex function of its argument.
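The mean binomial deviance can be sketched numerically as minus twice the average log-likelihood (for binary outcomes the saturated model has log-likelihood 0); the stable `logaddexp` form below avoids overflow for large linear predictors:

```python
import numpy as np

def mean_deviance(beta, X, y):
    """Mean binomial deviance, -(2/n) * log-likelihood; convex in beta."""
    z = X @ beta
    # per-sample negative log-likelihood: log(1 + exp(z)) - y*z, computed stably
    return 2.0 * np.mean(np.logaddexp(0.0, z) - y * z)

# toy data: the null model (beta = 0) predicts p_i = 0.5 for every sample
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
dev0 = mean_deviance(np.zeros(2), X, y)   # equals 2*log(2)
```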
3. Enumerating Relaxed Lasso Solutions
In the definition of relaxed equivalence (i.e., (5)), we limit the active set of the relaxed equivalent solutions to a subset of the given solution's active set. Despite this restriction, the enumeration of all solutions is still difficult because of the shape of $\mathcal{B}_{D,\epsilon}$: the set curves continuously, resulting in an infinite number of extreme solutions. Hence, we propose to restrict the solutions that we will eventually enumerate to a subset of $\mathcal{B}_{D,\epsilon}$. Inspired by the representations of SELSs, we propose an approach that enumerates the vertices of the largest convex polytope of the relaxed space for which the relative performance is less than $\epsilon$. The central idea is to relax the null-space constraint while keeping intact the constraint on the signs of the coefficients. For this purpose, we rewrite the null-space constraint using the singular value decomposition of $X_E$. Let $X_E$ be decomposed as
(6)  $X_E = U \Sigma V^\top,$
where $U$ and $V$ are orthogonal matrices with the left and right singular vectors, respectively, while $\Sigma$ is a diagonal matrix with the ordered singular values on its diagonal (i.e., $\sigma_1 \ge \sigma_2 \ge \cdots$). Assume that $\sigma_i > 0$ for $i \le r$ and $\sigma_i = 0$ for $i > r$; then, the constraint in (3) is equivalent to $V_r^\top(\beta - \hat\beta) = 0$, where $V_r$ corresponds to the matrix with the right singular vectors whose singular values are nonzero.
Inspired by the above representation, we propose to relax the constraint to $V_{r-k}^\top(\beta - \hat\beta) = 0$, where $V_{r-k}$ contains the right singular vectors of the $r-k$ largest singular values, while $k$ is an integer between $0$ and $r$ to be specified later. Thus, the convex polytope for the relaxed Lasso solutions is defined as
(7)  $\mathcal{P}_k = \{\beta : V_{r-k}^\top(\beta - \hat\beta) = 0, \ S\beta \ge 0, \ |\beta_i - \hat\beta_i| \le |\hat\beta_i| \ \forall i\}.$
We further constrain the relaxed Lasso solutions to live in a box, with coefficients that are a bounded distance away from the given Lasso solution $\hat\beta$. This constraint is added so as to guarantee the boundedness of the polyhedron defined by the other two constraints. We choose to set the box half-width for the $i$th coefficient to $|\hat\beta_i|$, which allows any coefficient to vary all the way to zero. Since the relaxed polytope is agnostic to the performance metric, our goal is to choose $k$ appropriately, so that $\mathcal{P}_k$ is a maximal subset of the relaxed solution space $\mathcal{B}_{D,\epsilon}$. Figure 1 demonstrates a two-dimensional example where the Lasso regression problem has a unique solution (small red circle). By setting $k=1$, the relaxed set of Lasso solutions constitutes the red line that connects the positive axes, determined by the direction of the dropped singular vector. The two vertices (solid red dots) are candidates for "extreme" relaxed Lasso solutions, which are further tested to ensure that their relative performance is below the maximum tolerance. Before proceeding with an algorithm that enumerates the vertices of $\mathcal{P}_k$ and is guaranteed to return a subset of the relaxed space, we derive theoretical bounds for two performance metrics.
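The SVD-based relaxation in (6)–(7) can be probed numerically: dropping the right-singular direction with the smallest singular value leaves a one-parameter family of vectors that still satisfy the remaining null-space constraints exactly. The reference solution below is hypothetical and the data synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
n, s = 40, 4
XE = rng.standard_normal((n, s))            # columns indexed by the equicorrelation set
beta_hat = np.array([1.5, -0.8, 0.6, 0.3])  # hypothetical reference Lasso solution

# X_E = U diag(sigma) V^T; numpy returns singular values in decreasing order
U, sigma, Vt = np.linalg.svd(XE, full_matrices=False)

k = 1                                       # relax along the k least-significant directions
V_keep = Vt[:-k]                            # equality constraints kept: V_keep (beta - beta_hat) = 0
v_dropped = Vt[-1]                          # direction freed by the relaxation

# moving along the dropped direction preserves the kept constraints exactly,
# because the rows of Vt are orthonormal
beta_relaxed = beta_hat + 0.1 * v_dropped
residual = V_keep @ (beta_relaxed - beta_hat)
```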
3.1. Theoretical Bounds
Theorem 3.1 presents error bounds for the RMSE and for the deviance. The bounds depend on the norm of the vector with the discarded singular values. The proofs of both parts can be found in the Appendix.
Theorem 3.1 ().
(a) Let $\hat\beta$ be a Lasso solution (variants of the standard Lasso are also valid). For any $\beta \in \mathcal{P}_k$ it holds that
(8) 
(b) Let $\hat\beta$ be a logistic Lasso solution. For any $\beta \in \mathcal{P}_k$ it holds that
(9) 
3.2. Enumeration algorithm
Algorithm 1 reliably enumerates relaxed solutions that belong to a subset of the set of all relaxed Lasso solutions. It takes as input a Lasso solution, a convex performance metric $D(\cdot)$, a tolerance $\epsilon$, and the maximum dimension of the subspace that will be explored. We assume that the provided Lasso solution has maximal support; hence the equicorrelation set and the sign vector are directly estimated, as in lines 2 and 3 of Algorithm 1. In the for loop, the dimension of the explored subspace is increased, starting from $k=1$, until the maximum value is reached or there are vertices whose performance metric is above the tolerance. In other words, the algorithm removes the restriction stemming from the component with the smallest singular value, allowing itself to search along the corresponding direction, and then sequentially removes the restrictions corresponding to components with larger singular values. As the dimension of the exploration subspace increases, so does the possibility of finding solutions whose performance exceeds the tolerance. If such solutions occur, the algorithm breaks and the solutions that exceed the tolerance are discarded. Due to the convexity of the performance metric, the polytope defined by the remaining solutions is also a subset of the relaxed space; thus, the algorithm is sound. Additionally, a variant of binary search could be utilized instead of the linear search for identifying the optimal $k$. However, the gain in computational time is minimal: the number of vertices grows exponentially fast, so the computational cost of the last iteration is proportional to the total cost of all previous iterations.
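A much-simplified sketch of the search loop in Algorithm 1 follows. It probes only the two extreme points along each newly relaxed singular direction, rather than enumerating all vertices of $\mathcal{P}_k$; the helper names, the choice of RMSE as the metric, and the least-squares reference solution are our own simplifications.

```python
import numpy as np

def rmse(beta, X, y):
    return np.sqrt(np.mean((y - X @ beta) ** 2))

def max_step(beta_hat, v, tol=1e-12):
    """Largest t >= 0 such that beta_hat + t*v stays in the box
    |beta_i - beta_hat_i| <= |beta_hat_i| (no sign flips, bounded moves)."""
    mask = np.abs(v) > tol
    return np.min(np.abs(beta_hat[mask]) / np.abs(v[mask]))

def relaxed_solutions(beta_hat, X, y, eps, k_max):
    """Linear search over the relaxed subspace dimension k, as in Algorithm 1:
    grow k until some probed extreme point violates the tolerance."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    ref = rmse(beta_hat, X, y)
    kept = [beta_hat.copy()]
    for k in range(1, k_max + 1):
        v = Vt[-k]                          # next least-significant direction
        cands = [beta_hat + s * max_step(beta_hat, v) * v for s in (+1.0, -1.0)]
        ok = [c for c in cands if (rmse(c, X, y) - ref) / ref <= eps]
        kept.extend(ok)
        if len(ok) < len(cands):            # tolerance exceeded: stop enlarging
            break
    return kept

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3))
beta_star = np.array([1.0, -2.0, 0.5])
y = X @ beta_star + 0.1 * rng.standard_normal(50)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sols = relaxed_solutions(beta_hat, X, y, eps=0.05, k_max=3)
```

By construction every returned solution respects the sign/box constraints and the relative-performance tolerance, mirroring the soundness argument above.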
Remark 1 ().
We also investigated alternative definitions of the subset $\mathcal{P}_k$, in which different properties of the strong-equivalence characterization are relaxed; these resulted, however, in optimization problems that are algorithmically intractable. Overall, the choice of the performance measures (RMSE and deviance) and of the particular subset $\mathcal{P}_k$ in (7) was based on mathematical simplicity and practical implementability, both of which are critical for the adoption of a less popular but important idea: searching for and enumerating multiple equivalent solutions.
3.3. Determining the reference Lasso solution
Algorithm 1 searches for equivalent solutions in the least sensitive directions of the performance metric. It may discard features from the reference support, but it cannot include features not initially selected. Therefore, it is important to provide an initial solution with maximal support. We compute the reference Lasso solution $\hat\beta$ by optimally tuning its penalty parameter: we perform cross-validated hyperparameter tuning using the AUC (resp. MAE) as the performance metric for classification (resp. regression) problems. It has been highlighted and proved (Meinshausen2006, Proposition 1) that many noise features are included in the prediction-oracle solution (i.e., the reference solution we estimate). Indeed, the value of the penalty parameter is lower in the prediction-oracle solution than in the consistent one, hinting at a larger support. We also observe similar behavior in our experiments with real datasets in the Results section. Thus, we expect the respective initial support to contain most, if not all, of the features related to the target variable. Of course, alternative methods based on the elastic net or on bootstrapping could be utilized for selecting the reference solution/support; however, such an exploration is beyond this paper's scope and is left as future work.
4. Variable Categorization
Due to the exponential growth of the number of the polytope's vertices, complete enumeration is not always feasible. Nevertheless, there is an alternative approach for qualitatively assessing the predictor variables by summarizing the interesting findings, e.g., reporting the set of features that belong to all solutions or only to some. Exploiting the fact that the feasible range of a coefficient's value over all Lasso solutions can be efficiently computed, as shown in (Tibshirani2013), a variable categorization is possible. We present the existing formulation for SELSs and, based on it, extend it to RELSs.
SELSs. For each $i \in E$, the $i$th coefficient's lower bound $l_i$ and upper bound $u_i$ are computed by solving the linear programs
$l_i = \min_{\beta} \{\beta_i : X_E(\beta - \hat\beta) = 0, \ S\beta \ge 0\}$
and
$u_i = \max_{\beta} \{\beta_i : X_E(\beta - \hat\beta) = 0, \ S\beta \ge 0\},$
respectively. If $0$ is an element of the interval $[l_i, u_i]$, then the $i$th variable is called dispensable; otherwise, it is called indispensable, and indispensable variables participate in all Lasso solutions. The fact that the sign of a coefficient remains the same in all SELSs implies that dispensable variables have either $l_i = 0$ or $u_i = 0$, while indispensable variables have either $l_i > 0$ or $u_i < 0$. From a practical perspective, the number of linear programs to be solved is $2|E|$, which is feasible.
Relaxed equivalence. Similarly, we can relax the criterion for dispensable/indispensable variables. Since the computational cost of the linear programs is manageable, we can discard the control of the subspace dimension (i.e., there is no need to bound $k$). The linear programs for the lower and upper bounds for a given tolerance are defined analogously over the relaxed polytope, with the relaxed null-space constraint replacing the equality constraint above. Evidently, as we increase the tolerance, the number of dispensable variables increases while the number of indispensable variables decreases.
While in this paper we follow the terminology of "dispensable" and "indispensable" variables from (Tibshirani2013), we would like to emphasize that it may be misleading: it implies that dispensable variables could, in some sense, be dispensed with and filtered out. Consider, however, two features that are perfect copies of each other and, at the same time, the strongest predictors. According to the definition, both are dispensable, yet neither can be filtered out without a measurable loss in performance. Instead, we have recently proposed the terms "indispensable" and "replaceable" (invited talk at the KDD 2017 conference, "Advances in Causal-Based Feature Selection", http://mensxmachina.org/en/presentations/) to clearly indicate that the latter features can be replaced, not filtered out.
5. Experimental Setup
To evaluate the proposed method and compare it against the SES algorithm, we used 9 datasets (Table 1), chosen to cover a wide range of sample and feature sizes. No dataset was excluded from the evaluation based on the results. Seven datasets have a binary outcome (i.e., classification tasks), while the other two have a continuous outcome (i.e., regression tasks). The analyses were performed using the Just Add Data Bio v0.7 (JAD Bio; Gnosis Data Analysis; www.gnosisda.gr) automated predictive analysis engine.
Dataset  Sample size  Feature Size  Field 
Classification  
arcene  200  10000  Bioinformatics 
bankruptcy  7063  147  Finance 
Ovarian  216  2190  Bioinformatics 
Parkinson  195  22  Bioinformatics 
prostate  102  5966  Bioinformatics 
secom  1567  590  Industry 
vV  224  70  Bioinformatics 
Regression  
BC_Continuous  286  22282  Bioinformatics 
MP  4401  202  Materials 
5.1. Just Add Data Bio Engine
JAD Bio employs a fully-automated machine learning pipeline for producing, given a training dataset, a predictive model and an estimate of its predictive performance. JAD Bio performs multiple feature selection and predictive modeling. It tries several configurations, i.e., combinations of preprocessing algorithms, feature selection algorithms, predictive modeling algorithms, and values of their hyperparameters, using a grid search in the space of hyperparameters. The best configuration is determined using K-fold cross-validation and is then used to produce the final models, for each feature subset, on all the data. The cross-validated performance of the best configuration is known to be optimistic due to the multiple tries (tsamardinos2015performance); JAD Bio estimates and removes this optimism using a bootstrap method before returning the final performance estimate (tsamardinos2017bootstrapping). The final performance estimates of JAD Bio are actually slightly conservative.
Specifically, in terms of algorithms, JAD Bio trains several basic and advanced, linear and nonlinear, multivariate machine learning and statistical models; namely, for classification problems it trains Support Vector Machine (SVM) models with linear, full polynomial, and Gaussian kernels, Ridge Logistic Regression models, Random Forest models, and Decision Trees, while for regression problems it trains Ridge Linear Regression models instead of the logistic ones. In the following set of experiments, more than 500 configurations were tried each time to find the optimal ones, resulting in more than 10000 models trained to compare the two feature selection algorithms over the nine datasets. JAD Bio is able to incorporate other user-defined preprocessing, transformation, and feature selection algorithms written in the Java, R, or Matlab programming languages. This functionality allowed us to incorporate the proposed feature selection method into the pipeline.
5.2. Evaluation Pipeline and Tuning
Two variants of JAD Bio were used and compared: the original, with the SES algorithm for feature selection, and an alternative with a Matlab implementation of the proposed method. To evaluate and compare them, we split each dataset into two stratified parts. JAD Bio is then applied on each half (as the training set) to obtain a set of predictive models and their performance estimates, corresponding to the different feature subsets (i.e., solutions found by the feature selection algorithms). The trained models are then tested on the remaining data (as the test set) to compute the performance on new data. We then test whether the performances of the solutions found are actually similar. Since we partition the data into two halves, the equivalence of the performance of the feature selection subsets is determined using the same sample size as in training, and thus with the same statistical power. The newly proposed method was tuned with the penalty parameter $\lambda$ taking values on a multiplicative grid. We set the tolerance $\epsilon$ to 0.01, allowing Lasso solutions to have up to 1% performance error on the training set.
6. Results
dataset  split  method  sample size  # signatures  CoV (perf.)  CoV (# vars)  avg. holdout perf.  p-value (perf.)  avg. # vars  p-value (# vars)
arcene  split1  Lasso  100  81  0.0099  0.0198  0.8524  56.5  
arcene  split1  SES  100  100  0.0373  0.7394  0.0001  5  0.0001  
arcene  split2  Lasso  100  27  0.0128  0.0326  0.8011  29.1  
arcene  split2  SES  100  100  0.0088  0.7498  0.0001  27  0.0001  
bankruptcy  split1  Lasso  3531  32  0.0005  0.022  0.9704  51.8  
bankruptcy  split1  SES  3531  6  0.0004  0.9615  0.0001  17  0.0001  
bankruptcy  split2  Lasso  3532  40  0.0005  0.0126  0.9697  72.1  
bankruptcy  split2  SES  3532  2  0.0006  0.9615  0.0435  14  0.0001  
Ovarian  split1  Lasso  107  17  0.0021  0.0309  0.9803  35.7  
Ovarian  split1  SES  107  6  0.0065  0.9555  0.0003  6  0.0001  
Ovarian  split2  Lasso  109  3  0.0047  0.0777  0.9654  19.7  
Ovarian  split2  SES  109  2  0.0008  0.9647  0.8298  7  0.00641  
Parkinson  split1  Lasso  97  2  0.0124  0.0566  0.9327  12.5  
Parkinson  split1  SES  97  2  0.0026  0.9077  0.2421  1  0.03319  
Parkinson  split2  Lasso  98  1  0.9018  8  
Parkinson  split2  SES  98  2  0.0014  0.857  2  
prostate  split1  Lasso  51  6  0.0213  0.0456  0.9282  22.7  
prostate  split1  SES  51  3  0.0276  0.9677  0.1554  3  0.0001  
prostate  split2  Lasso  51  1  0.9215  7  
prostate  split2  SES  51  7  0.0226  0.9268  2  
secom  split1  Lasso  783  80  0.0124  0.0257  0.6934  45.3  
secom  split1  SES  783  100  0.0151  0.6071  0.0001  15  0.0001  
secom  split2  Lasso  784  38  0.0126  0.0088  0.6914  124.7  
secom  split2  SES  784  100  0.0146  0.5741  0.0001  12  0.0001  
vV  split1  Lasso  112  2  0.0255  0.1286  0.6794  11  
vV  split1  SES  112  4  0.0309  0.6518  0.2422  4  0.0985  
vV  split2  Lasso  112  3  0.0421  0.1083  0.6792  5.3  
vV  split2  SES  112  2  0.0017  0.6619  0.441  5  0.4226  
BC_Continuous  split1  Lasso  143  12  0.0095  0.0584  0.8771  13.6  
BC_Continuous  split1  SES  143  6  0.0017  0.8851  0.0091  12  0.0001  
BC_Continuous  split2  Lasso  143  3  0.0013  0.0769  0.8916  13  
BC_Continuous  split2  SES  143  2  0.007  0.8743  0.1501  14  0.2254  
MP  split1  Lasso  2200  75  0.0017  0.0095  0.5435  106.8  
MP  split1  SES  2200  24  0.0034  0.4763  0.0001  17  0.0001  
MP  split2  Lasso  2201  70  0.0033  0.0184  0.5469  53.7  
MP  split2  SES  2201  3  0.0011  0.5117  0.0001  21  0.0001 
Table 2. For each dataset, split, and method: the sample size, the number of retrieved signatures, the CoV for the holdout performance (AUC for classification and R for regression), the CoV for the number of selected variables, the average holdout performance (along with the two-tail, t-test adjusted p-value assessing the difference between SES and Lasso), as well as the average number of selected variables (along with the two-tail, t-test adjusted p-value assessing whether the average number of variables selected by Lasso differs from the fixed number of variables selected by SES). Classification datasets are reported first, while the regression ones are at the bottom.
6.1. Multiple, equivalent solutions are common for predictive analytics tasks
Table 2 reports the Lasso and SES results. Both algorithms identify multiple solutions on almost all datasets and splits, indicating that the presence of multiple solutions is a common problem across several application domains.
At the same time, the coefficient of variation (CoV, defined as the ratio between the standard deviation and the average) of the holdout performances is always quite low for both Lasso and SES (see Table 2). This means that the different solutions produce models whose performance standard deviation is only a small fraction of the mean AUC value. These results support the claim that the multiple solutions found by Lasso and SES are indeed equivalent in terms of predictive power.
Interestingly, when SES retrieves a large number of signatures (i.e., solutions), so does Lasso, and vice versa. Both algorithms provide numerous solutions for the arcene and secom datasets, and only a few solutions for Ovarian, Parkinson, prostate, vV, BC_Continuous, and MP (an exception is the bankruptcy dataset). This is despite the fact that the two algorithms follow quite different approaches. Thus, the results provide evidence that the order of magnitude of the number of equivalent solutions is a characteristic of the data at hand.
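For reference, the CoV used throughout this section is simply the standard deviation over the mean; the sketch below uses illustrative numbers, not values from Table 2:

```python
import numpy as np

def cov(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=0) / values.mean()

# e.g., holdout AUCs of several equivalent signatures that barely differ
aucs = [0.85, 0.86, 0.84, 0.85]
```

A CoV near zero, as in this toy example, is exactly the behavior reported for the multiple solutions in Table 2.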
6.2. Lasso and SES tradeoffs
Figure 2 graphically represents the holdout performances (x-axis; AUC for classification datasets and R for regression ones) and the number of selected features (y-axis) for each signature retrieved by Lasso (circles) and SES (triangles) for the first split of each dataset (distinguished by color). The figure shows that Lasso consistently selects up to one order of magnitude more variables than SES (Table 2). In nine out of the 18 comparisons (9 datasets, 2 splits each) Lasso also achieves better performance than SES (adjusted p-values in Table 2), with Lasso showing a higher average AUC across the classification datasets and a higher average R across the regression ones. In summary, the two algorithms are quite complementary, achieving different trade-offs, with SES providing more parsimonious models at a moderate cost in terms of predictive power. SES sacrifices optimality (it does not attempt to identify all predictive features or, as it is known in the respective literature, the full Markov blanket) to achieve scalability; moreover, it is easily applicable to different data types by equipping it with an appropriate conditional independence test (see a comparison of Lasso- and SES-based feature selection for time-course data in (Tsagris2018)).
6.3. Multiple solutions are heterogeneous in terms of included features
Table 3 presents, for both Lasso and SES, the number of identified signatures categorized according to their size. For the sake of clarity, only results from the first split are presented; the results on the second split follow almost identical trends. By construction, SES retrieves signatures of equal size, while Lasso identifies signatures that can have different sizes. For each group of equal-size signatures, we computed the average Jaccard index within the group, as well as the average number of solution-specific features across all pairs of signatures. The Jaccard index between two sets is defined as the ratio between the size of their intersection and the size of their union; it ranges from 0 (disjoint sets) to 1 (perfectly overlapping sets), with two equal-size sets sharing half of their elements achieving 1/3. For computing the number of solution-specific features between two sets, we take their union and subtract their intersection; the resulting features belong to one solution or the other, but not to both. For the cases where numerous solutions are selected by SES (arcene and secom datasets), both the Jaccard index and the number of solution-specific features indicate quite elevated heterogeneity, meaning that the multiple solutions proposed by SES include solutions quite different from each other. When only a handful of signatures are retrieved by SES, they tend to be more homogeneous. This is a consequence of the mechanism used by SES for producing multiple signatures, which constrains each signature to differ from its closest sibling by only one variable. In contrast, Lasso solutions can have more than two signature-specific features even when only two signatures are retrieved. This can be seen in several rows of Table 3, e.g., for the arcene dataset (signature size , ) and for the secom dataset (signature size ). Interestingly, in several cases Lasso solutions differ from each other on average by or more solution-specific features, confirming that the multiple solutions retrieved by Lasso differ from each other due to the joint replacement of several variables.

7. Conclusion
In this paper we present an algorithm for deriving multiple, equivalent Lasso solutions with theoretical guarantees. The algorithm is empirically evaluated using an automatic platform, JAD Bio, over several learners and tuning hyperparameters; the same platform is employed for a comparative evaluation against the SES algorithm. The results confirm that (a) multiple Lasso solutions are often found in real-world datasets, (b) these multiple solutions can substantially differ from each other despite their similar performances, and (c) Lasso and SES provide comparable results, with the former being on average slightly more predictive at the cost of selecting a larger number of features.
dataset         signature size   # signat.   avg. Jaccard   avg. # solution-specific
Lasso
arcene          53               2           0.86           8
                54               2           0.83           10
                55               6           0.89           6.4
                56               26          0.92           4.9
                57               31          0.94           3.5
                58               13          0.97           2
                59               1           --             --
bankruptcy      49, 54           1 each      --             --
                50               3           0.92           4
                51               9           0.92           4.1
                52               10          0.94           3.1
                53               8           0.96           2
Ovarian         34               2           0.84           6
                35               6           0.89           4.3
                36               5           0.92           3
                37               3           0.95           2
                38               1           --             --
p53             71, 72           1 each      --             --
                73               7           0.93           5.3
                74               18          0.94           5
                75               18          0.95           3.6
                76               14          0.97           2
                77               1           --             --
Parkinson       12, 13           1 each      --             --
prostate        21, 22, 24       1 each      --             --
                23               3           0.92           2
secom           41, 48           1 each      --             --
                43               2           0.83           8
                44               14          0.88           5.9
                45               29          0.9            4.8
                46               21          0.93           3.4
                47               12          0.96           2
vV              10, 12           1 each      --             --
BC_Continuous   12, 15           1 each      --             --
                13               4           0.86           2
                14               6           0.87           2
MP              104, 109         1 each      --             --
                105              8           0.94           7
                106              18          0.95           5.1
                107              30          0.97           3.6
                108              17          0.98           2
SES
arcene          5                100         0.37           4.8
bankruptcy      17               6           0.85           2.8
Ovarian         6                6           0.63           2.8
Parkinson       1                2           0              2
prostate        3                3           0.5            2
secom           15               100         0.6            7.7
vV              4                4           0.51           2.7
BC_Continuous   12               6           0.79           2.8
MP              17               24          0.79           4
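The two heterogeneity measures reported in Table 3 (Jaccard index and number of solution-specific features per pair of signatures) can be computed as in the following pure-Python sketch; the example signatures are hypothetical, not taken from the paper's datasets.

```python
# Sketch: the two pairwise heterogeneity measures used in Table 3.
# Example signatures below are hypothetical.

def jaccard(a, b):
    """Intersection size over union size: 0 (disjoint) to 1 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def n_solution_specific(a, b):
    """Features in exactly one of the two signatures (union minus intersection)."""
    a, b = set(a), set(b)
    return len(a ^ b)            # symmetric difference

sig1 = {"g1", "g2", "g3", "g4"}
sig2 = {"g1", "g2", "g5", "g6"}  # shares half its elements with sig1
print(jaccard(sig1, sig2))       # two equal-size sets sharing half -> 1/3
print(n_solution_specific(sig1, sig2))
```

In Table 3, both quantities are averaged over all pairs of equal-size signatures within a group.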
Acknowledgments
We would like to thank Giorgos Borboudakis for his constructive discussions. The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 617393. PC was funded by Gnosis Data Analysis PC.
References
Appendix A Proof of Theorem 3.1
Proof.
(a). First, apply the Minkowski inequality as follows
Next, define the matrices ,
It holds that as well as . Using these matrices we write
Hence, we bound the error term in the inequality above by
where we used the vector-induced matrix norm, also known as the spectral norm. It holds that thus the bound is rewritten as
Due to the singular value ordering, it holds that . The proof is completed by observing that the box constraint leads to the estimate
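The display equations of this step did not survive extraction. For reference, the spectral-norm and singular-value facts invoked above are the standard operator-norm bound (a general fact, not a reconstruction of the missing derivation):

```latex
\[
\|A x\|_2 \;\le\; \|A\|_2 \, \|x\|_2 ,
\qquad
\|A\|_2 = \sigma_{\max}(A),
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min} \ge 0 ,
\]
```

where $\sigma_{\max}(A)$ is the largest singular value of $A$ and the ordering of the singular values is the one referred to above.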
(b). We observe that for all
where is the projection coefficient of the th sample to the th principal vector. Thus, it holds that
because implies that for . Proceeding, the convexity of the function implies that
Hence,
where and are dimensional vectors with elements and . Using the Cauchy-Schwarz inequality, we get
where the last inequality is the consequence of the fact that for all , hence, .
Finally, we estimate
where we use the fact that . Combining the above with the bound for the difference , we obtain the desired bound. ∎