Lasso, knockoff and Gaussian covariates: a comparison

Laurie Davies et al., Universität Duisburg-Essen, 05/04/2018

Given data y and k covariates x_j, one problem in linear regression is to decide which, if any, of the covariates to include when regressing the dependent variable y on the covariates x_j. In this paper three such methods, lasso, knockoff and Gaussian covariates, are compared using simulations and real data. The Gaussian covariate method is based on exact probabilities which are valid for all y and x_j, making it model free. Moreover the probabilities agree with those based on the F-distribution for the standard linear model with i.i.d. Gaussian errors. The method is conceptually, mathematically and algorithmically very simple, it is very fast and it makes no use of simulations. It outperforms lasso and knockoff in all respects by a considerable margin.


1 Introduction

There are many papers on lasso from the first [Tibshirani, 1996] to a very recent one [Bellec et al., 2017], a span of 21 years. As no theoretical comparisons are made in this paper we give no further references. The software required for the comparison is the R package glmnet which can be downloaded from

 https://CRAN.R-project.org/package=glmnet

Knockoff is much more recent. Theoretical work is to be found in [Candes et al., 2017]. The R software is obtainable from

https://cran.r-project.org/web/packages/knockoff/index.html

Part of the comparison is based on the Tutorials 1 and 2 of

https://web.stanford.edu/group/candes/knockoffs/software/knockoff/

The present version of the Gaussian covariate method is new but it is based on previous attempts, see [Davies, 2017]. It is described in Section 2. There is as yet no R package but the software is available as an ancillary file.

The real data used in the comparison include three of the data sets used in [Dettling and Bühlmann, 2003]: colon cancer ([Alon et al., 1999]), leukemia ([Golub et al., 1999]) and lymphoma ([Alizadeh et al., 2000]), available from

http://stat.ethz.ch/~dettling/bagboost.html

For more information about the data see [Dettling and Bühlmann, 2002].

The prostate cancer data is available from the lasso2 CRAN R package

 https://CRAN.R-project.org/package=lasso2

and the red wine data ([Cortez et al., 2009]) from

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

The Boston housing data and the Brownlee stack loss data are available from the R package MASS

https://CRAN.R-project.org/package=MASS

A further data set is the one on osteoarthritis considered in [Cox and Battey, 2017], available from the Gene Expression Omnibus under accession number GDS5363:

https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS5363

As in [Cox and Battey, 2017] the male subjects have been excluded. The various downloadable versions of the data set contain the same values but in permuted order, so replicating the results here requires exactly the same version. The one used here is that used in [Cox and Battey, 2017] and is, we think, the DataSet SOFT file.

For a short discussion of L1 regression use is made of the R package quantreg, available from

https://CRAN.R-project.org/package=quantreg

The comparisons require the latest version of R [R Development Core Team, 2008]. The results of this paper can be reproduced by running the file runcomp.R. This requires the FORTRAN file selvar.f and the R files comp.R and selvar.R. The running time is about eight hours.

2 A description of the Gaussian covariate method

2.1 The basic idea

Consider data (y(i), x(i)), i = 1,...,n, consisting of a dependent variable y and an explanatory variable x. The problem is to decide whether x is indeed an explanatory variable for y in the sense that the values of y can to some extent be explained by those of x. A standard method of deciding this is to postulate a linear model

y(i) = β x(i) + σ ε(i), i = 1,...,n    (1)

with the ε(i) i.i.d. standard Gaussian noise, and to test the null hypothesis H_0: β = 0. The standard test is the F-test based on the F-statistic

F = (n-1)(ss_0 - ss_1)/ss_1    (2)

where ss_0 = Σ_i y(i)^2 and ss_1 is the sum of the squared residuals after regressing y on x. The null hypothesis is rejected if the P-value

P(F_{1,n-1} ≥ F)    (3)

is less than the specified size α of the test.

The Gaussian covariate approach is to compare x with a Gaussian covariate Z with i.i.d. N(0,1) components. The comparison is done through the sum of the squared residuals. The covariate Z is clearly not an explanatory covariate, so if x is no better than Z in respect of the sum of squared residuals it is concluded that x is also not an explanatory covariate: the null hypothesis H_0: β = 0 is replaced by the question, is x better than Z?

Denote the sum of squared residuals after regressing y on Z by ss_g. It has been shown by Lutz Dümbgen ([Davies and Dümbgen, 2018]) that

ss_g/ss_0 ~ Beta((n-1)/2, 1/2)    (4)

with P-value

P(ss_g ≤ ss_1) = pbeta(ss_1/ss_0, (n-1)/2, 1/2)    (5)

and that the two P-values are equal:

P(ss_g ≤ ss_1) = P(F_{1,n-1} ≥ F).    (6)

This is a remarkable result even if it has a simple proof. It is remarkable because both P-values are exact and uniformly distributed over (0,1), but whereas the P-value on the left is valid for all (non-zero) y and x, that on the right depends on the model (1).
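
The equality (6) can be checked numerically. The following R sketch uses the formulas in the form written above; the variable names are ours and the code is not part of the ancillary software.

set.seed(1)
n <- 50
x <- rnorm(n)                       # any fixed covariate
y <- rnorm(n)                       # any fixed dependent variable
ss0 <- sum(y^2)
ss1 <- sum(lsfit(x, y, intercept = FALSE)$residuals^2)
# P-value based on the Gaussian covariate, equation (5)
p.gauss <- pbeta(ss1/ss0, (n - 1)/2, 1/2)
# P-value based on the F-test for the model (1)
Fstat <- (n - 1)*(ss0 - ss1)/ss1
p.F <- pf(Fstat, 1, n - 1, lower.tail = FALSE)
c(p.gauss, p.F)                     # the two values coincide
# Monte Carlo check of (4): how often is a Gaussian covariate at least as good as x?
ssg <- replicate(1e4, sum(lsfit(rnorm(n), y, intercept = FALSE)$residuals^2))
mean(ssg <= ss1)                    # close to p.gauss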

We need a generalization of this result, also due to Lutz Dümbgen ([Davies and Dümbgen, 2018]). Given y, k_0 linearly independent covariates x_1,...,x_{k_0} and q additional i.i.d. standard Gaussian covariates Z_1,...,Z_q we have

ss_{k_0+q}/ss_{k_0} ~ Beta((n-k_0-q)/2, q/2)    (7)

where ss_{k_0} is the sum of squared residuals after regressing y on the x_j and ss_{k_0+q} the sum of squared residuals after regressing y on all k_0 + q covariates. The case q = 1 is the one required for stepwise regression.
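
The Beta form in (7) as written above can also be checked by simulation; the following sketch is again only illustrative and the names are ours.

set.seed(2)
n <- 40; k0 <- 3; q <- 2
X <- matrix(rnorm(n*k0), n, k0)     # fixed, linearly independent covariates
y <- rnorm(n)
ssk0 <- sum(lsfit(X, y, intercept = FALSE)$residuals^2)
ratio <- replicate(1e4, {
  Z <- matrix(rnorm(n*q), n, q)     # fresh Gaussian covariates each time
  sum(lsfit(cbind(X, Z), y, intercept = FALSE)$residuals^2)/ssk0
})
# empirical quantiles against the Beta((n-k0-q)/2, q/2) quantiles
rbind(quantile(ratio, c(0.1, 0.5, 0.9)),
      qbeta(c(0.1, 0.5, 0.9), (n - k0 - q)/2, q/2))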

2.2 Gaussian covariate stepwise regression

Regress y on each covariate x_j, including an offset by default, put

ss_0 = Σ_i (y(i) - ȳ)^2

and denote the sum of squared residuals by ss_j. The best of the x_j is the one with the smallest ss_j, given by

ss_min = min_{1 ≤ j ≤ k} ss_j.

Denote the corresponding quantities for Gaussian covariates Z_1,...,Z_k by ss^g_j and ss^g_min respectively. From (4), adjusted for the offset,

P(ss^g_j ≤ ss) = pbeta(ss/ss_0, (n-2)/2, 1/2).    (8)

As the Gaussian covariates are independent it follows that

P(ss^g_min ≤ ss) = 1 - (1 - pbeta(ss/ss_0, (n-2)/2, 1/2))^k.    (9)

If ss is taken to be the ss_j of the best of the x_j, the right hand side of (9) is referred to as the P-value of this covariate. The smaller the P-value the more relevant the covariate. This corresponds to the role of testing in the linear model, where the smaller the P-value the more significant the covariate.
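
A minimal R sketch of this first selection step, based on the expressions (8) and (9) as given above, is the following; the function name is ours and this is not the FORTRAN-backed code of the ancillary files.

first_step <- function(y, X) {
  n <- length(y); k <- ncol(X)
  ss0 <- sum((y - mean(y))^2)                   # offset only
  ssj <- apply(X, 2, function(x) sum(lsfit(x, y)$residuals^2))
  j <- which.min(ssj)                           # best covariate
  p1 <- pbeta(ssj[j]/ss0, (n - 2)/2, 1/2)       # one Gaussian covariate, (8)
  pk <- 1 - (1 - p1)^k                          # best of k Gaussian covariates, (9)
  c(covariate = j, ss = as.numeric(ssj[j]), p.value = pk)
}
# e.g. first_step(ly.original, lx.original) for the leukemia data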

In some applications it is useful to be less strict when selecting variables. This may be seen as a trade-off: reducing the number of false negatives (not selecting variables with some explanatory value) at the risk of more false positives (selecting variables with no explanatory value). This may be done as follows.

The random variables pbeta(ss^g_j/ss_0, (n-2)/2, 1/2), j = 1,...,k, are i.i.d. uniform on (0,1) so that

P(ss^g_(1) ≤ ss) = 1 - (1 - pbeta(ss/ss_0, (n-2)/2, 1/2))^k.    (10)

This can be extended to the νth order statistic ss^g_(ν) to give

P(ss^g_(ν) ≤ ss) = pbeta(pbeta(ss/ss_0, (n-2)/2, 1/2), ν, k - ν + 1).    (11)

Comparing ss with ss^g_(ν) is less strict than comparing it with ss^g_(1). Although ν has a direct interpretation when an integer, this is not necessary in (11).

To incorporate this into a stepwise procedure a cut-off value α for the P-value must be specified, for example α = 0.05. Suppose at stage j of the procedure j - 1 covariates have been selected. We denote the sum of the squared residuals by ss_{j-1}, with ss_0 = Σ_i (y(i) - ȳ)^2, and the set of selected covariates by S_{j-1}. For each covariate not in S_{j-1} regress y on the covariates in S_{j-1} together with this covariate, and denote the smallest resulting sum of squared residuals by ss_j. This is now compared with the smallest sum of squares obtainable by considering k - j + 1 random Gaussian covariates, one for each remaining covariate. These are so to speak chosen anew for each stage. This is asking the question as to whether the remaining covariates are better than Gaussian noise. Regressing y on the covariates in S_{j-1} together with one Gaussian covariate gives a random sum of squared residuals ss^g_j. From (7) we have

ss^g_j/ss_{j-1} ~ Beta((n-j-1)/2, 1/2)    (12)

and the P-value for the best of the remaining covariates is given by

1 - (1 - pbeta(ss_j/ss_{j-1}, (n-j-1)/2, 1/2))^{k-j+1}.    (13)

If the P-value is greater than the cut-off value α the procedure terminates. Otherwise the best covariate is included and the procedure continues.

The extension of the above to the νth order statistic gives the P-value

pbeta(pbeta(ss_j/ss_{j-1}, (n-j-1)/2, 1/2), ν, k - j + 2 - ν)

which is the probability that the νth best of the k - j + 1 Gaussian covariates is better than the best of the remaining covariates.
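
The whole procedure can be sketched in a few lines of R using the P-value (13) in the form given above. This is our own illustrative code, far slower than the FORTRAN-backed fstepwise of the ancillary files, and it covers only the default ν = 1.

gauss_stepwise <- function(y, X, alpha = 0.05, kmax = 10) {
  n <- length(y); k <- ncol(X)
  S <- integer(0)                               # selected covariates
  ss.prev <- sum((y - mean(y))^2)               # offset-only fit
  out <- NULL
  for (j in seq_len(kmax)) {
    rest <- setdiff(seq_len(k), S)
    ssj <- sapply(rest, function(m) sum(lsfit(X[, c(S, m)], y)$residuals^2))
    best <- rest[which.min(ssj)]
    ss.new <- min(ssj)
    p1 <- pbeta(ss.new/ss.prev, (n - j - 1)/2, 1/2)
    pv <- 1 - (1 - p1)^length(rest)             # best of the remaining covariates
    if (pv > alpha) break                       # no covariate beats Gaussian noise
    out <- rbind(out, c(covariate = best, p.value = pv, ss = ss.new))
    S <- c(S, best)
    ss.prev <- ss.new
  }
  out
}

For example gauss_stepwise(ly.original, lx.original, 0.05, 10) should select essentially the same covariates as fstepwise in (** 1 **), though far more slowly.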

The default command, which includes the offset, is

fstepwise(y,x,alpha,kmax)

The parameter kmax specifies the largest number of selected covariates. The main reason for its inclusion without a default value is to limit memory size. A more experienced FORTRAN programmer may well be able to do without it. It could also be avoided if the programme were written in a language with dynamic memory such as C: "the principal weakness of FORTRAN was and is its lack of dynamic arrays", [Huber, 2011], page 67. A second reason is that it enables the user to choose a maximum number of covariates.

Applying fstepwise to the leukemia data mentioned in Section 1 with α = 0.05 and kmax = 10 gives (** 1 **)

> fstepwise(ly.original,lx.original,0.05,10,misclass=T,time=T)[[1]]
   user  system elapsed
  0.008   0.000   0.011
[1,] 1182 0.0000000000 4.256962    4
[2,] 1219 0.0008577131 2.884064    3
[3,] 2888 0.0035805523 2.023725    1

Here 1182, 1219 and 2888 are the selected covariates with P-values 0, 8.58e-4 and 3.58e-3 respectively. The third column gives the sum of squared residuals after the inclusion of the covariate and the fourth the number of misclassifications. The time required was 0.008 seconds.

The default value ν = 1 can be replaced by, for example, ν = 3 by putting nu=3 in the command to give (** 2 **)

>  fstepwise(ly.original,lx.original,0.05,10,nu=3,misclass=T,time=T)[[1]]
   user  system elapsed
  0.012   0.000   0.013
     [,1]         [,2]     [,3] [,4]
[1,] 1182 0.000000e+00 4.256962    4
[2,] 1219 1.051452e-10 2.884064    3
[3,] 2888 7.664817e-09 2.023725    1
[4,] 1946 3.353905e-03 1.602749    1
[5,] 2102 6.026398e-04 1.242921    0

Thus ν = 3 leads to the two additional covariates 1946 and 2102.

2.3 False positives and ν

The effect of the choice of ν on the results can be estimated by relating ν to the concept of false positives. A false positive is the selection of a covariate which is no better than Gaussian noise. If all covariates are Gaussian noise then all selected covariates are false positives. For given n, k, α and ν the number of false positives can be obtained using simulations which then provide a guide for real data. The command is

fsimords(n,k,alpha,nu,kmx,nsim=100,time=T)

For normal values of n and k the default value of 100 simulations is usually sufficient, so the time required is not very long. Table 1 (** 3 **) gives the result for the leukemia data with ν = 3. The time was 3.49 seconds and the average number of false positives was 0.98. On the basis of Table 1 there is no evidence that the two additional covariates obtained with ν = 3 are relevant.

number of false positives 0 1 2 3 4 5 6 7 8 9 mean
relative frequency 0.51 0.22 0.16 0.07 0.00 0.01 0.03 0.00 0.00 0.00 0.98
Table 1: The empirical frequencies for the number of false positives for the leukemia data with ν = 3, based on 100 simulations.

The definition of a false positive given above is an empirical one. In simulations based on the linear model a false positive is the selection of a covariate whose coefficient is zero and a false negative is the omission of a covariate whose coefficient is non-zero. This is the definition we use in Tables 3 and 4. In the simulations of graphs in Section 3.7 a false negative is the omission of an edge where the true graph has an edge. A false positive is the inclusion of an edge where the true graph has no edge. Again this is the definition we use.
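
The simulation behind Table 1 can be imitated with the gauss_stepwise sketch of Section 2.2. The sketch below is our own, covers only ν = 1 and is much slower than fsimords.

false_positives <- function(n, k, alpha = 0.05, kmax = 10, nsim = 100) {
  replicate(nsim, {
    y <- rnorm(n)
    X <- matrix(rnorm(n*k), n, k)               # all covariates are pure noise
    sel <- gauss_stepwise(y, X, alpha, kmax)    # so every selection is a false positive
    if (is.null(sel)) 0L else nrow(sel)
  })
}
# e.g. table(false_positives(100, 1000, nsim = 20))  (small nsim: this sketch is slow)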

2.4 Repeated Gaussian covariate stepwise regression

Once the first covariate has been chosen the stepwise procedure is conditional on this covariate. More generally once a subset has been chosen the next covariate to be chosen is dependent on this subset.

To illustrate this we consider the colon cancer data with n = 62 and k = 2000. The stepwise procedure results in (** 4 **)

>  fstepwise(colon.y,colon.x,1,2,misclass=T,time=T)[[1]]
   user  system elapsed
  0.004   0.000   0.011
     [,1]         [,2]     [,3] [,4]
[1,]  493 7.402367e-08 6.804815    9
[2,]  175 4.311166e-01 5.431871    7

For any reasonable cut-off P-value only one covariate is chosen, namely 493. This does not mean that only covariate 493 is relevant but that, given 493, the remaining 1999 are in a sense no better than white Gaussian noise.

If covariate 493 is eliminated and the stepwise procedure applied to the remaining covariates, again just one covariate is chosen, 377, with a P-value of 1.36e-07. The covariates 493 and 377 are highly correlated with a correlation coefficient of 0.778. This explains why 377 is no longer considered once 493 has been included. Now 377 can be excluded and the procedure continued in this manner until the P-value of the best of the remaining covariates exceeds the specified cut-off value α.
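
This repeated procedure can be sketched as follows using the gauss_stepwise function above. The sketch is ours, it simplifies the grouping into linear approximations to one group per repetition, and it is not the fstepstepwise of the ancillary files.

gauss_repeated <- function(y, X, alpha = 0.05, kmax = 10) {
  active <- seq_len(ncol(X))
  groups <- list()
  repeat {
    sel <- gauss_stepwise(y, X[, active, drop = FALSE], alpha, kmax)
    if (is.null(sel)) break
    chosen <- active[sel[, "covariate"]]        # map back to the original indices
    groups[[length(groups) + 1]] <- chosen      # one linear approximation
    active <- setdiff(active, chosen)           # exclude and repeat
  }
  groups
}
# e.g. gauss_repeated(colon.y, colon.x) for the colon cancer data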

The value α = 0.05 results in 82 covariates being selected (** 5 **). The time required was 0.224 seconds. The first seven are given in Table 2. The covariates 1635 and 576 are grouped together as indicated by the number 4. That is, the two taken together give a linear approximation to the dependent variable. Similarly 1423 and 353 are grouped together. In all, the 82 covariates form 49 linear approximations.

1 2 3 4 5 6 7
set 1 2 3 4 4 5 5
covariate 493 377 249 1635 576 1423 353
P-value 7.40e-8 1.35e-7 1.13e-6 2.28e-6 2.28e-6 2.76e-5 8.00e-4
ss 6.80 6.94 7.44 7.62 5.52 8.26 5.33
mis 9 11 8 19 5 9 6
Table 2: The first seven of the selected 82 covariates for the colon cancer data.

The default command is

fstepstepwise(y,x,alpha,kmax).

The number of selected covariates can be reduced by specifying a smaller cut-off P-value. Doing so for the colon data results in 45 covariates grouped into 32 linear approximations (** 6 **). The time required was 0.12 seconds. The number can also be reduced by specifying a maximum number of linear approximations using nmax or a maximum number of covariates using vmax. In the case of the leukemia data running

fstepstepwise(ly.original,lx.original,0.05,10,time=T)

results in 420 selected covariates forming 153 linear approximations (** 7 **), not printed here. Setting nmax=20 gives 20 linear approximations involving 62 covariates (** 8 **):

fstepstepwise(ly.original,lx.original,0.05,10,nmax=20,time=T)

2.5 Misclassifications and Outliers

The repeated Gaussian stepwise procedure produces several linear approximations, each of which can be used to classify the data. The numbers of misclassifications for the first five approximations for the colon cancer data are given above. These can be combined, for example, by calculating the fit for each approximation and then taking the average. The result for the 49 approximations to the colon cancer data is shown in Figure 1. The least squares fit gives similar results. There are five misclassifications shown in red and these five are clearly outliers, very different from the other observations.

Figure 1: The average fit of the 49 colon linear approximations using logistic regression.

2.6 L1 regression

The idea is not restricted to least squares. It can be applied to L1 regression but then simulations are necessary. We take the Brownlee stack loss data as an example. Suppose the covariates Air Flow and Water Temperature have been selected. Including the covariate Acid Concentration reduces the sum of the absolute residuals from 43.69355 to 42.08116. In a simulation with a Gaussian covariate replacing Acid Concentration the Gaussian covariate was better in about 20% of the simulations, giving a P-value of 0.204 for Acid Concentration (** 9 **). This uses the R package quantreg. The corresponding P-value for least squares regression is 0.344. The command is

fl1stack(stackloss,1000)
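
The following is an illustrative version of such a simulation written with quantreg; it is our own reconstruction of what fl1stack presumably does, not the code of the ancillary files.

library(quantreg)
l1_pvalue <- function(nsim = 1000) {
  sad <- function(fit) sum(abs(resid(fit)))     # sum of absolute deviations
  full <- sad(rq(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                 tau = 0.5, data = stackloss))
  # how often is a pure Gaussian covariate at least as good as Acid Concentration?
  better <- replicate(nsim, {
    z <- rnorm(nrow(stackloss))
    sad(rq(stack.loss ~ Air.Flow + Water.Temp + z, tau = 0.5, data = stackloss)) <= full
  })
  mean(better)
}
# e.g. l1_pvalue(1000)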

For robust regression with a smooth loss function and for non-linear regression such as logistic regression a chi-squared approximation is available (see [Davies, 2017]), removing the need for simulations.

2.7 Why Gaussian?

The method started with the question as to whether it was possible to judge the relevance of a covariate without assuming the model (1). The initial data set was the Brownlee stack loss data and the unfortunate covariate was Acid Concentration. This was replaced by the cosine of the average daily temperature in Berlin on the first 21 days of January 2013. The Acid Concentration won but only just. Relying on empirical alternatives to the covariates was not a promising option but from there it was but a short step to simulating alternatives. Initially an attempt was made to model the alternative covariates. If the covariate was 0-1 use the binomial, if integer valued a Poisson etc (see page 279 of [Davies, 2014]). The results showed that the modelling was not worth it. They were essentially the same when one used Gaussian alternatives.

In the case of one Gaussian covariate Z the sum of squared residuals is

ss_g = Σ_i y(i)^2 - (Σ_i y(i)Z(i))^2 / Σ_i Z(i)^2,  with  ss_g/Σ_i y(i)^2 ~ Beta((n-1)/2, 1/2).    (15)

If the Z(i) are not Gaussian but say Bernoulli then (15) will still hold asymptotically, but this will require conditions on the y(i). More generally, if the Z(i) have finite variance then, subject to conditions on the y(i), (15) will hold asymptotically. In this case there seems to be no reason not to use the exact result for Gaussian covariates.

We start from ss_0 = Σ_i y(i)^2, so when calculating the percentage reduction in the sum of squares due to regression on Z the expression Σ_i y(i)^2 cancels out, just as σ^2 cancels out in the F-test. This is why the calculated probabilities hold for all y. If the Z(i) are replaced by i.i.d. Cauchy random variables then the term (Σ_i y(i)Z(i))^2 in (15) is distributed as (Σ_i |y(i)|)^2 C^2, where C is a standard Cauchy random variable. There is no cancelling out and the distribution will depend on y even asymptotically. Moreover Cauchy random variables turn out to be less exacting than Gaussian random variables.

2.8 A summary

This completes the description of the Gaussian covariate method. It is extremely simple. There is no mention of regression parameters or of the variance σ^2 of (1). This contrasts with the treatment of lasso in [Bellec et al., 2017] where all the values of the tuning parameter λ considered involve σ. Indeed the estimation of σ is one of the main problems with lasso, as many optimality results for the choice of λ depend on the value of σ. The linear approximations, or models if the reader wishes, provided by the repeated stepwise procedure are as they stand. They take explicitly into account the number of available covariates, they do not overfit and there is no need for any form of post selection analysis.

As an example we take the colon data. All the linear approximations provided by the repeated stepwise procedure have either one or two covariates. The first one consists only of the covariate 493. If we use logistic regression with this covariate alone there are nine misclassifications. Replace the remaining 1999 covariates by Gaussian noise and choose the first three covariates. One of these is always 493, the remaining two are Gaussian noise. Simulations show that in about 2.8% of the cases there are zero misclassifications and in about 3.3% just one misclassification. This indicates that you can get overfitting with just three covariates.
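
This overfitting simulation can be imitated as follows, again with the gauss_stepwise sketch of Section 2.2; the code is ours, it assumes colon.y is coded 0-1 as in the ancillary files, and it is slow.

overfit_sim <- function(y, x493, k.noise = 1999, nsim = 100) {
  n <- length(y)
  replicate(nsim, {
    X <- cbind(x493, matrix(rnorm(n*k.noise), n, k.noise))  # 493 plus pure noise
    sel <- gauss_stepwise(y, X, alpha = 1, kmax = 3)         # force three covariates
    fit <- glm(y ~ X[, sel[, "covariate"]], family = binomial)
    sum((fitted(fit) > 0.5) != y)                            # misclassifications
  })
}
# e.g. mis <- overfit_sim(colon.y, colon.x[, 493]); mean(mis == 0); mean(mis == 1)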

3 Comparison of lasso, knockoff and the Gaussian covariate procedure

3.1 Problems with interpretation

The comparisons given here are purely empirical. It is possible to prove theoretical results on the Gaussian covariate method as a first attempt in [Davies, 2017] shows but this will not be pursued further. The comparisons are given in detail and it should be possible to repeat them using the software available as an auxiliary file. The version of lasso to be used is the cross-validation option cv.glmnet provided in the R package glmnet.

One problem when comparing lasso and knockoff with the Gaussian covariate method is how to interpret the outputs of lasso and knockoff. One application of lasso to the colon cancer data with the binomial family option resulted in four covariates (** 10 **).

The time required was 0.64 seconds.

With the binomial family option knockoff selected 10 covariates (** 11 **).

The time required was about 65 minutes.

If these are interpreted as models then, for a sample size of n = 62, both overfit to such an extent as to make them unacceptable. For this data set you can get overfitting with just four Gaussian covariates.

(** 9.5 **) The first four Gaussian covariates are chosen and the number of misclassifications based on the logit model determined. In 18% of the cases this was less than four. Overfitting already occurs with three covariates. Reasonable approximations should have either one or two covariates. This is the case for the 49 linear approximations of (** 5 **).

A second problem is that the number of covariates selected by lasso and knockoff varies from application to application. This is particularly pronounced for knockoff. Ten applications of knockoff resulted in between 5 and 12 covariates being chosen (** 20 **).

A third problem is the amount of time required. Again this is particularly a problem for knockoff which required 65 minutes to select the model based on the 10 covariates given above (** 11 **). The repeated Gaussian method required 0.17 seconds to provide 49 perfectly reasonable approximations.

The simulations described below use models, so it is possible to give the number of false positives and false negatives. For the red wine and Boston housing data the goal is to give a good linear approximation to the data. In such cases we use the output of lasso and knockoff and make no attempt at a post-selection analysis.

For the gene expression data sets, the colon data, the leukemia data, the prostate cancer data and the osteoarthritis data, the dependent variable is zero-one depending on whether the person has or has not the medical condition under investigation. Very often logistic regression is used to analyse such data but here we use least squares, which is much easier and in terms of stability much better. We point out however that the Gaussian covariate approach can be adapted to non-linear regression such as logistic regression. In the lymphoma data there are two different medical conditions plus the control group. The two conditions can be analysed separately but here we treat the data set as a whole.

For these data sets it can be argued that the problem is not to specify a set of linear approximations but to list those covariates which are significantly associated with the dependent variable. This is argued in [Cox and Battey, 2017], [Dettling and Bühlmann, 2002], [Dettling and Bühlmann, 2003] and, we think, in [Candes et al., 2017]. Nevertheless even if this is the main goal it may still be of interest to compare linear approximations. The sum of the squared residuals is part of the output of fstepwise and fstepstepwise but this can be augmented by setting misclass=T which then gives the number of misclassifications. This is based on the least squares residuals but a logistic regression can easily be used once the covariates are given. The difference is small.

3.2 Post selection analysis for lasso and knockoff

The Gaussian covariate method associates a P-value with each chosen covariate enabling the statistician to form a judgment about the relevance of the covariate. Lasso and knockoff provide so to speak naked covariates without any indication of their individual relevance. We now describe two methods for associating a P-value with each individual covariate.

To illustrate the first method we take the 4 covariates given above specified by one application of lasso (** 10 **) and the 10 covariates resulting from one application of knockoff (** 11 **).

We consider all subsets S of size three or less of the covariates specified by the method used (lasso, knockoff). For each such subset we calculate the P-value of each covariate x_j in S as in Section 2.2 with

1 - (1 - pbeta(ss(S)/ss(S\{j}), (n - |S| - 1)/2, 1/2))^k

where n is the sample size, k the number of covariates, |S| the size of S, ss(S) the sum of squared residuals when regressing the dependent variable on all covariates in S and ss(S\{j}) the sum of squared residuals when the covariate x_j is omitted. The offset is included by default. This compares the covariate with the best of k Gaussian covariates.

To reduce overfitting we require that the P-value of each covariate in S is less than a specified threshold alpha1. The value we use is alpha1 = 0.05. The final P-value for the covariate is its minimum P-value over all subsets which contain it and satisfy the threshold condition. We also restrict attention to those covariates with P-value less than a second threshold value alpha which here is also taken to be 0.05.

The default command is

fpval1(y,x,ind,alpha,alpha1)

where (y,x) is the data, ind the chosen covariates, alpha and alpha1 the threshold values.
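
A simplified version of this first method can be sketched as follows; the function is ours, it uses the P-value in the form written above and it is not the fpval1 of the ancillary files.

subset_pvalues <- function(y, X, ind, alpha = 0.05, alpha1 = 0.05, k = ncol(X)) {
  n <- length(y)
  pvals <- function(S) {                         # P-values of the covariates in S
    ssS <- sum(lsfit(X[, S, drop = FALSE], y)$residuals^2)
    sapply(seq_along(S), function(i) {
      rest <- S[-i]
      ss.wo <- if (length(rest) == 0) sum((y - mean(y))^2)
               else sum(lsfit(X[, rest, drop = FALSE], y)$residuals^2)
      1 - (1 - pbeta(ssS/ss.wo, (n - length(S) - 1)/2, 1/2))^k
    })
  }
  best <- rep(Inf, length(ind))
  names(best) <- ind
  subsets <- unlist(lapply(1:3, function(m) combn(ind, m, simplify = FALSE)),
                    recursive = FALSE)
  for (S in subsets) {
    p <- pvals(S)
    if (all(p < alpha1)) {                       # threshold condition on the subset
      pos <- match(S, ind)
      best[pos] <- pmin(best[pos], p)            # minimum over admissible subsets
    }
  }
  best[best < alpha]
}
# e.g. subset_pvalues(colon.y, colon.x, ind) with ind the covariates chosen by lasso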

The results for the lasso (** 10 **) and knockoff (** 11 **) selections are given by (** 12 **) and (** 13 **) respectively. The output gives the covariate x_j, the P-value of x_j, and the further covariates in the subset S for which the P-value of x_j is smallest; a zero entry denotes their absence. The final two items are the sum of the squared residuals and, if misclass=T, the number of misclassifications based on S. Using fpval1 all of the lasso covariates are included and 8 of the 10 knockoff covariates.

A variation on this method is to take a subset S and then to choose the best additional covariate, the one which minimizes the sum of squared residuals based on S together with this covariate. This will in general introduce new covariates which were not in the original selection. The default command is

fpval(y,x,ind,alpha,alpha1)

The additional covariates are given a minus sign if they are not in the initial choice ind. For fpval the results are again all lasso covariates and again 8 knockoff covariates but with 1360 instead of 1473.

The second method can be used when the set of selected covariates is too large for the first method. It uses the Gaussian covariate method but restricts the choice of covariates to the set of selected covariates and adjusts the P-value to take into account that these covariates were selected from a larger set. The command is

fstepwise(y,x[,ind],alpha,kmax,ek=k)

or

fstepstepwise(y,x[,ind],alpha,kmax,ek=k)

depending on whether just one or all relevant linear approximations are required. Here ind is the initial selection of covariates and k is the number of covariates from which this selection was made.

Another method of measuring the relevance of covariates is to calculate the cross-validation error. This is given by

tmpcv<-cv.glmnet(colon.x[,ind],colon.y,family="binomial")
print(min(tmpcv$cvm))

where ind denotes the covariates. Using the first three covariates from the stepwise method, 493, 175 and 1909, gives a cross-validation error of 0.39 (** 13.1 **), the four lasso covariates one of 0.58 (** 13.2 **) and the 10 knockoff covariates one of 0.46 (** 13.3 **).

3.3 Tutorials 1 and 2

The simulations of Table 3 are based on Tutorials 1 and 2 respectively of https://web.stanford.edu/group/candes/knockoffs/software/knockoff/ with the parameters as given there.

The default choice ν = 1 avoids false positives, possibly at the cost of false negatives. Using this value of ν in Tutorial 1 (** 14 **) results in on average 15 covariates being chosen with no false positives. A larger value of ν results in 53 covariates being chosen with three false positives (Table 3), which agrees well with the expected number 2.1 of Table 4. Thus compared with ν = 1 this choice increases the number of selected covariates from 14 to 49, of which according to Table 3 about 2 may be false positives. Increasing ν further results in 62 being chosen, of which about seven are false positives (Table 3). Compared to the previous choice there is an increase of about 13 in the number of selected covariates, of which about 4 may be false positives. Similar calculations apply to Tutorial 2 (** 15 **).

It follows from Table 3 that lasso selects on average approximately 140 covariates in Tutorial 1 and 100 in Tutorial 2 which is somewhat excessive.

In terms of the sum of false positives and false negatives knockoff and the two larger choices of ν are comparable. The default value of the false discovery rate fdr is 0.1 but in Tutorial 2 it is set to 0.2. If the default value 0.1 is used the false positive and false negative values become 2.48 and 44.6 respectively. Also in Tutorial 2 the covariance matrix Sigma was used to construct the knockoff variables. This seems to improve the performance but only slightly. If the knockoff filter is used as in Tutorial 1 the average numbers of false positives and negatives become 6.32 and 36.72 respectively.

The main difference between knockoff and the Gaussian covariate method is the computing time. In Tutorial 1 knockoff is 20 times slower than the Gaussian method with the largest value of ν, and 50 times slower in Tutorial 2.

Tutorial 1 Tutorial 2
method fp fn time fp fn time
lasso 82.4 0.52 7.89 60.1 15.0 11.5
knockoff 5.58 10.0 63.9 7.00 35.1 53.9
Gaussian, default ν 0.00 46.2 0.25 0.00 56.5 0.04
Gaussian, larger ν 3.36 11.6 2.29 2.78 42.5 0.44
Gaussian, largest ν 7.02 5.82 3.33 7.00 35.4 0.94
Table 3: Comparison of lasso, knockoff and Gaussian covariates with three values of ν, based on Tutorials 1 and 2 (** 14 **) and (** 15 **).
number of false positives 0 1 2 3 4 5 6 7 8 9 10 mean
larger ν 0.13 0.29 0.20 0.20 0.12 0.02 0.03 0.00 0.01 0.00 0.00 2.13
largest ν 0.02 0.02 0.04 0.08 0.15 0.11 0.17 0.20 0.09 0.04 0.08 5.84
Table 4: The empirical frequencies and the expected number of false positives for the two larger values of ν, based on 100 simulations (** 16 **).

The commands for Tutorial 1 and Tutorial 2 are as follows:

ftut(1,1000,1000,60,4.5,0.1,0.05,50)
ftut(2,1000,1000,60,7.5,0.2,0.05,50)

3.4 Red wine data

For the red wine data, with n = 1599 and k = 11 covariates, the dependent variable is variable 12, which gives subjective evaluations of the quality of the wine as integer values between three and eight.

Ten applications of lasso gave either 4 or 6 selected covariates. Ten applications of knockoff gave between 6 and 11, so that in all 11 covariates were selected (** 17 **). The Gaussian method with cut-off P-value α = 0.05 gives six covariates which are, in order, alcohol, volatile-acidity, sulphates, total-sulfur-dioxide, chlorides and pH (see Table 5 of [Lockhart et al., 2014]) (** 18 **). The repeated stepwise method gives two more linear approximations, one with four covariates and one with a single covariate (** 19 **).

fstepwise(redwine[,12],redwine[,1:11],0.05,11,misclass=T)
fstepstepwise(redwine[,12],redwine[,1:11],0.05,11,misclass=T)

3.5 Cancer data

3.5.1 Colon data

The size of the colon cancer data is n = 62 and k = 2000.

Five applications of lasso resulted in between 4 and 29 covariates (4, 5, 25 and 29 in individual runs), giving 32 variables in all. Five applications of the knockoff filter with fdr = 0.5 resulted in 5, 12, 8, 5 and 7 variables giving 13 variables in all (** 20 **). The repeated Gaussian covariate method for the colon cancer data with α = 0.05 was treated in Section 2.4. It results in 82 covariates forming 49 linear approximations (** 5 **). The computing times were two seconds for lasso, 41 minutes for knockoff and 0.01 seconds for the Gaussian covariate method. All the lasso covariates and 10 of the 13 knockoff covariates are included in the Gaussian list (** 21 **).

The post selection P-values for the lasso with alpha=alpha1=0.05 using fpval1 and fpval are given in (** 22 **), those for the knockoff covariates in (** 23 **). Of the 13 knockoff covariates two have P-values greater than 0.05 and so are not included. ** 23.5 ** gives the corresponding P-values for Gaussian covariate selection.

3.5.2 Leukemia data

The size of the leukemia data is n = 72 and k = 3571.

The knockoff procedure with fdr=0.5 resulted in 31 covariates. The time required was five hours 20 minutes. In view of this no further analysis was carried out as it would require of the order of a day’s computing time.

The lasso is much faster, requiring 0.3 seconds on average. Five applications of lasso resulted in 14, 12, 14, 16 and 14 covariates giving 17 covariates in all (** 24 **). The lasso P-values based on fpval1 with alpha1=alpha=0.05 are given by (** 25 **) and result in 16 of the 17 covariates. The repeated Gaussian method (** 26 **) with α = 0.05 gives 420 covariates forming 153 linear approximations. These include 15 of the 17 lasso covariates. The Gaussian procedure with nmax=20 results in 62 covariates which include 15 of the original 17 lasso covariates (** 26 **).

3.5.3 Prostate cancer

The size of the prostate cancer data suggests a very long computing time for the knockoff filter. It will not be considered.

Lasso was applied 5 times with 24, 14, 14, 7 and 23 covariates being selected, giving 24 in all (** 27 **). The P-value procedure with alpha=alpha1=0.05 reduces this to 18 (** 28 **). The repeated Gaussian method with α = 0.05 resulted in 278 covariates and 118 linear approximations. Putting nmax=20 gives 52 covariates (** 29 **). The 278 include 16 of the original 24 lasso covariates, the 52 include 16 of the 18 fpval1 lasso covariates (** 30 **).

3.5.4 Lymphoma

The lymphoma data are the only ones where the dependent variable takes on three values, 0, 1 and 2. The size is n = 62 and k = 4026.

Five applications of lasso resulted in 40, 41, 38, 44 and 44 covariates giving 46 in all (** 31 **). The P-value method includes all of them (** 32 **).

The repeated Gaussian method results in 1603 covariates forming 512 linear approximations. It contains all the lasso covariates. Putting nmax=20 results in 78 covariates which include 28 of the lasso covariates (** 33 **).

3.5.5 Osteoarthritis

The size of the osteoarthritis data proved too large for the knockoff filter, which required 17.7 GB of memory.

Lasso gives 10 covariates in all (** 34 **). The P-value method (** 35 **) includes all of them.

The osteoarthritis data set was analysed in [Cox and Battey, 2017]. The authors selected the following 17 covariates:

7235 11643 25125 25470 25744 27642 27920 29679 33385
36409 37443 44276 45991 46771 48415 48433 48549

The P-value method reduces this to 16 covariates (** 36 **).

The repeated Gaussian method yields 317 covariates and 107 linear models. These include all the lasso and 11 of the Cox-Battey covariates. Putting nmax=20 reduces this to 62 covariates which include all the lasso and 3 of the Cox-Battey covariates (** 37 **).

3.6 Boston housing data and interactions

For the Boston housing data n = 506 and k = 13. Allowing for interactions of order up to and including seven increases the number of covariates from 13 to 77520 (** 39 **). The command is

fgeninter(x,ord)

Five applications of lasso resulted in 39, 44, 53, 53 and 43 covariates, giving 61 in all (** 40 **). The time required was eight minutes. To choose a reasonable linear approximation from some subset of these covariates we apply the Gaussian stepwise procedure just to these covariates but set the effective number of covariates ek to 77520 (** 41 **). The results of a linear regression based on these covariates are given in (** 42 **). The sum of squared residuals is 6493. The covariates are decomposed in (** 43 **).

The Gaussian stepwise selection is given in (** 44 **). The time required was about two seconds. This is followed by a linear regression (** 45 **) with sum of squared residuals 5576. The decomposition of the covariates is given in (** 46 **).

The last three steps are repeated for interactions of degree eight or less (** 47 **), (** 48 **), (** 49 **) and (** 50 **). It may be noted that the sum of squared residuals for the regression based on the first three interactions, 441, 197063 and 197166, is 10417. This is less than that of the standard regression on all 13 initial covariates, with sum of squared residuals 11078.

3.7 Graphs

Given covariates x_1,...,x_k a graph is calculated as follows. Each x_j is regressed on the remaining covariates and connected to those covariates found to be relevant.
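
A sketch of this construction using the gauss_stepwise function of Section 2.2 is given below; it is our own illustration, not the fgraphst or the FORTRAN code of the ancillary files, and it divides the cut-off by the number of covariates as described later for the default version of fgraphst.

gauss_graph <- function(X, alpha = 0.05, kmax = 10) {
  k <- ncol(X)
  edges <- NULL
  for (j in seq_len(k)) {
    # regress covariate j on the remaining covariates
    sel <- gauss_stepwise(X[, j], X[, -j, drop = FALSE], alpha/k, kmax)
    if (!is.null(sel)) {
      nbrs <- seq_len(k)[-j][sel[, "covariate"]]   # map back to the original indices
      edges <- rbind(edges, cbind(node = j, neighbour = nbrs))
    }
  }
  edges                                            # one row per directed edge
}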

As an example we take the covariates generated in Tutorial 1 of

https://web.stanford.edu/group/candes/knockoffs/software/knockoff/

The graph is given by a tridiagonal matrix with 1 on the main diagonal and 0.25 on the two side diagonals (** 51 **).

One application of lasso (** 52 **) resulted in 4 false negatives and 82 false positives. The time required was about 150 minutes.

The Gaussian covariate method was applied with the cut-off P-value divided by the number of covariates. This resulted in 1 false negative and no false positives (** 53 **). The time required was 13 seconds.

In [Meinshausen and Bühlmann, 2006] lasso was used to calculate graphs for large k. In a simulation in Section 4 of that paper the method resulted in two false positives and 638 false negatives. The description of the generation of the graph of Figure 1 of that paper is incorrect and has been altered here so as to reproduce graphs with about 1800 edges. One application of lasso gave 571 false positives and 13 false negatives. The time taken was 46 minutes (** 54 **).

For the same data the Gaussian covariate procedure resulted in zero false positives and 118 false negatives (** 55 **). The time required was 10 seconds. A less strict choice of the parameters resulted in 8 false positives and 9 false negatives in 12 seconds (** 56 **). Simulations suggest on average about 6 false positives (** 57 **).

Graphs can also be constructed for real data sets. We take the colon cancer data with n = 62 and k = 2000. Lasso requires 10 minutes and produces a graph with 23322 edges (** 59 **). The Gaussian covariate method gives a graph with 1634 edges in 2.8 seconds. The first ten edges are given in (** 59.5 **).

The repeated Gaussian covariate method gives a graph with 24475 edges in 44 seconds (** 60 **).

A smaller cut-off P-value gives a graph with 1521 edges. The first 10 edges are given in (** 61 **).

It is seen that the repeated method picks up many more highly significant dependencies than the single method, so the recommendation would seem to be to use the repeated method with a smaller value of the cut-off P-value α.

Finally, for the osteoarthritis data lasso requires 11 seconds for one node, resulting in an estimated 11*48802/(24*3600) = 6.2 days for the whole graph. The Gaussian covariate method requires about 0.24 seconds for each node, giving an estimated time of 3 hours 15 minutes for the whole graph. A specially written FORTRAN programme gives a graph with 38986 edges in 45 minutes (** 62 **).

This data set shows the limitations of the recommendation made just above. Regress the first covariate on the remaining 44801, where the cut-off P-value has been divided by the number of variables as in the default version of fgraphst. This results in 4009 selected covariates. If a graph were to be constructed the first covariate alone would be connected to 4009 other covariates. The calculated P-values for the first 152 covariates are zero. Such a graph would be much too large to be useful.

In this particular case one can make use of the problem and restrict the construction of the graph to those covariates chosen in the first step. We take the 74 covariates selected by the Gaussian covariate method and apply the repeated Gaussian procedure to calculate a graph. It has 289 edges. The first covariate, 939, is connected with 18 of the 62 covariates (** 63 **).

4 Acknowledgment

We thank Oliver Maclaren for comments on earlier versions.

References

  • [Alizadeh et al., 2000] Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., Powell, J., Yang, L., Marti, G., Moore, T., Hudson, J., Lu, L., Lewis, D., Tibshirani, R., Sherlock, G., Chan, W., Greiner, T., Weisenburger, D., Armitage, J., Warnke, R., Levy, R., Wilson, W., Grever, M., Byrd, J., Botstein, D., Brown, P., and Staudt, L. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503–511.
  • [Alon et al., 1999] Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., and Levine, A. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS, 96(12).
  • [Bellec et al., 2017] Bellec, P. C., Lecué, G., and Tsybakov, A. B. (2017). Slope meets Lasso: improved oracle bounds and optimality. arXiv:1605.08651v3 [math.ST].
  • [Candes et al., 2017] Candes, E., Fan, Y., Janson, L., and Lv, J. (2017). Panning for gold: Model-free knockoffs for high-dimensional controlled variable selection. arXiv:1610.02351v3 [math.ST].
  • [Cortez et al., 2009] Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547–553.
  • [Cox and Battey, 2017] Cox, D. R. and Battey, H. S. (2017). Large numbers of explanatory variables, a semi-descriptive analysis. Proc. Natl. Acad. Sci. USA, 114(32):8592–8595.
  • [Davies, 2014] Davies, L. (2014). Data Analysis and Approximate Models. Monographs on Statistics and Applied Probability 133. CRC Press.
  • [Davies, 2017] Davies, L. (2017). Stepwise choice of covariates in high dimensional regression. arXiv:1610.05131v4 [math.ST].
  • [Davies and Dümbgen, 2018] Davies, L. and Dümbgen, L. (2018). A model-free approach to linear least squares regression with exact probabilities. arXiv:1807.09633v1[math.ST].
  • [Dettling and Bühlmann, 2002] Dettling, M. and Bühlmann, P. (2002). Supervised clustering of genes. Genome Biology, 3(2):1–15.
  • [Dettling and Bühlmann, 2003] Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics, 19(9):1061–1069.
  • [Golub et al., 1999] Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., and Lander, E. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(15):531–537.
  • [Huber, 2011] Huber, P. J. (2011). Data Analysis. Wiley, New Jersey.
  • [Lockhart et al., 2014] Lockhart, R., Taylor, J., Tibshirani, R. J., and Tibshirani, R. (2014). A significance test for the lasso. Ann. Statist., 42(2):413–468.
  • [Meinshausen and Bühlmann, 2006] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436–1462.
  • [R Development Core Team, 2008] R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
  • [Tibshirani, 1996] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58(1):267–288.