FATSO: A family of operators for variable selection in linear models

04/11/2019
by   Nicolás E. Kuschinski, et al.

In linear models it is common to have situations where several regression coefficients are zero. In these situations a common tool to perform regression is a variable selection operator. One of the most common such operators is the LASSO operator, which promotes point estimates that are exactly zero. The LASSO operator and similar approaches, however, offer little in terms of easily interpretable parameters for determining the degree of variable selectivity. In this paper we propose a new family of selection operators which builds on the geometry of LASSO but yields an easily interpretable way to tune selectivity. These operators correspond to Bayesian prior densities and hence are suitable for Bayesian inference. We present some examples using simulated and real data, with promising results.


1 Introduction

In standard linear models, it is not uncommon to have prior knowledge that several of the regression coefficients should be zero. This happens, for example, when it is suspected that most of the factors considered in a large model are not relevant. The identity of which factors are relevant, however, is not known beforehand. It is of interest to estimate model parameters and to identify which coefficients are nonzero. This is a classical problem with a classical solution, wherein regression is performed on the model and then the parameters are tested one at a time to determine whether they are significantly nonzero. Often this is followed by a second round of inference using only those parameters determined to be nonzero the first time through (see Rencher, 2008, for example).

This procedure works well for large sample sizes and small dimensions, but for high dimensions (large numbers of explanatory variables $p$) or small sample sizes, e.g. the $n < p$ problem, it becomes impossible to perform linear regression using classical techniques, since the response $Y$ is typically found exactly inside the column space of $X$ and there are infinitely many exact solutions.

There are several solutions to this problem, but many of them center only around estimation and do not intend to identify relevant factors. Methods that do intend to separate relevant from irrelevant variables are known as variable selection methods. There is a broad body of recent literature on the subject of variable selection in extremely high dimensional problems, such as those which are frequently encountered in gene selection and microarray data (Guyon and Elisseeff, 2003). In this paper we will focus on linear regression problems, usually with a more manageable (if still large) number of dimensions.

In order to obtain good estimates in these situations, one common solution is to use regularizing operators. One popular such operator is the LASSO operator (Tibshirani, 1996), which is designed to yield point estimates which are frequently exactly zero. Tuning the degree of selectivity of the LASSO operator, however, is not very fluid. The degree of selectivity is tied to shrinkage of the estimators and it is difficult to interpret.

While LASSO is a very popular operator for variable selection in linear models (and has been tried in nonlinear models also, see Ribbing et al., 2007), several other regularization methods exist, such as ridge regression (Hoerl and Kennard, 1970), bridge regression (Park and Yoon, 2011), the elastic net (Zou and Hastie, 2005), etc. While these operators have several important virtues, none of them address the issue of interpretability in variable selection.

In a Bayesian setting, a large number of selection operators have been suggested recently in the form of scale mixtures of normals, for instance the horseshoe prior (Carvalho et al., 2010), Dirichlet-Laplace priors (Bhattacharya et al., 2014) and others (Liang et al., 2008). These approaches also do not focus on interpretability issues. Our selection operator is not a scale mixture of normals; we follow a different strategy.

In this paper, we look into the geometric mechanism by which LASSO promotes regression estimators to zero, and we study some of the consequences. Using this information we propose a new family of operators which use the same geometric mechanism as LASSO, but provide an extra parameter which permits fluid and intuitive tuning of the degree of selectivity separately from shrinkage. We prove that this family of operators corresponds to a large family of Bayesian prior distributions, and we study the relationship between the geometry of the priors and the meaning of the parameters in a Bayesian context.

The paper is organized as follows. In section 2 we establish our notation and investigate the geometric properties of the LASSO operator. In section 3 we introduce the FATSO operator, which is based on the geometrical properties discussed in section 2. In section 4 we study the behavior of FATSO and see how it addresses the issue of parameter interpretability. Section 5 discusses the differences between FATSO and various other extensions of LASSO. In section 6 we look into the observable effects of the parameters in numerical examples and with real data. Finally, section 7 gives some final thoughts.

2 The LASSO operator

Consider a standard linear model of the form

$$Y = X\beta + \epsilon,$$

where $Y$ is the $n$-dimensional data vector, $\beta$ is a $p$-dimensional vector of parameters, $X$ is the $n \times p$ design matrix and the errors are $\epsilon \sim N(0, \sigma^2 I)$. In a situation in which we suspect that many of the coefficients in $\beta$ are 0, the problem of interest is estimating which coefficients are nonzero along with their values. When the dimension of $\beta$ is high in relation to the sample size, classical inference does not work, so this problem becomes a problem of variable selection. For this purpose, one common technique is to use the Least Absolute Shrinkage and Selection Operator (LASSO), see Tibshirani (1996). In classical statistics LASSO is seen as a likelihood penalization, and in Bayesian statistics it is treated as a Laplace prior (Park and Casella, 2008). In the Bayesian setting, the MAP corresponds with the classical estimator, namely

$$\hat{\beta} = \arg\max_{\beta} \left\{ l(\beta) - \lambda \sum_{i} |\beta_i| \right\},$$

where $l$ is the Gaussian log-likelihood function. The expression $\lambda \sum_{i} |\beta_i|$ is the LASSO operator and it depends on the value of a parameter $\lambda$.

A popular alternative parametrization for LASSO is to write the operator as $(\lambda/\sigma^{2}) \sum_{i} |\beta_i|$, which makes the LASSO estimator equal to

$$\hat{\beta} = \arg\min_{\beta} \left\{ \tfrac{1}{2}\|Y - X\beta\|^{2} + \lambda \sum_{i} |\beta_i| \right\}.$$

Thus, $\hat{\beta}$ no longer depends on $\sigma^{2}$.
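For readers who want to experiment, here is a minimal sketch (not from the paper) of computing a LASSO point estimate with scikit-learn; note that scikit-learn's `Lasso` minimizes $\frac{1}{2n}\|Y - X\beta\|^2 + \alpha\sum_i|\beta_i|$, so its `alpha` plays the role of the shrinkage parameter only up to a scaling.

```python
# Minimal LASSO sketch (illustrative; simulated data, not the paper's).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.5, 0.25]          # three nonzero coefficients (arbitrary)
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

# scikit-learn minimizes (1/(2n))*||Y - X beta||^2 + alpha*||beta||_1
fit = Lasso(alpha=0.05).fit(X, Y)
print(np.round(fit.coef_, 3))              # several entries are exactly zero
```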

The reason why LASSO produces parameter estimates that are exactly zero can perhaps best be understood by examining its level curves. In the case where $\beta$ is bivariate there are three possibilities for the geometry of the level curves at the estimator (see figure 1). The level curves of the likelihood and the operator may cross (type A), may be tangent (type B), or they might meet at a point at which the curves of the operator are non-differentiable (type C). We note that a point of type A cannot be the estimator, by a simple argument: along the same likelihood level curve there is a nearby point at which the value of the likelihood is the same but the value of the operator is greater, so the crossing point cannot be optimal. Hence, the estimator must be either of type B or of type C. It is a $\hat{\beta}$ of type C that interests us, given that these situations make the MAP estimate of one parameter exactly equal to zero.

Figure 1: The three forms of intersection of the level curves of the likelihood (ellipses) and the LASSO operator (squares). (A) cannot happen at the MAP estimator, (B) corresponds to the likelihood curve being tangent to an edge of the operator's level curve, and (C) corresponds to the likelihood curve meeting the operator's level curve at a corner; in this case the parameter on the corresponding axis is sent to exactly zero.

2.1 How LASSO promotes variable selection

When considering whether $\hat{\beta}$ is of type B or C, we find that this depends on the value of $\lambda$, and this dependence has a notable property.

Lemma 1

For fixed $X$, with probability 1, random data $Y$ will allow the LASSO estimator to fulfill the following criterion: there exists $\lambda_0$ such that if $\lambda > \lambda_0$ then $\hat{\beta}$ is of type C.

Note that the LASSO operator may be viewed as the Lagrangian for the restricted maximization of the likelihood subject to $\sum_i |\beta_i| \leq t$ for some $t$. The larger the value of $\lambda$, the smaller the value of $t$, and when $\lambda \to \infty$ then $t \to 0$.

For the bivariate case, consider the slope of the level curve of the likelihood function at the origin. With probability 1, this slope will be neither 1 nor -1.

Note that the level curves of the likelihood function are concentric. Hence, there is an open ball around the origin where the level curves do not have a slope of 1 or -1 either. In this area, it is impossible for $\hat{\beta}$ to be of type B. Therefore, for large enough $\lambda$, $\hat{\beta}$ must be of type C.

Now for the general case, note that all of the bivariate marginals behave as the bivariate case just explained.

Theorem 1

For fixed $X$ and random data $Y$, with probability 1 there exists $\lambda_0$ such that if $\lambda > \lambda_0$ then $\hat{\beta}_i = 0$ for all $i$ except one value of $i$. Here $\lambda_0$ is the maximum of the thresholds obtained from each pairwise comparison of one coefficient versus another.

In other words, there is almost certainly a threshold such that any $\lambda$ above this threshold will make all estimates zero except for one.

In general, there is no clear way to choose $\lambda$ so as to select variables in any controlled way. We know that when $\lambda$ grows, our selection becomes tighter and tighter, discarding more and more variables, but there is no interpretable measure of how much tighter. The choice of $\lambda$ can run the gamut from allowing all coefficients to be nonzero to allowing only one of them, with no good way to control the degree of selectivity.

In practice, the most common method for selecting $\lambda$ is to use data-driven techniques such as cross-validation (Obuchi, 2016; Ribbing et al., 2007).
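A sketch of this data-driven choice, here with scikit-learn's `LassoCV` (any cross-validation routine over a grid of shrinkage values would do):

```python
# Choosing the shrinkage level by cross-validation (illustrative sketch).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.5, 0.25]          # arbitrary illustrative values
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

cv_fit = LassoCV(cv=5).fit(X, Y)
print("selected alpha:", cv_fit.alpha_)                      # data-driven shrinkage
print("nonzero coefficients:", int((cv_fit.coef_ != 0).sum()))
```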

In passing, we note the following important point: a known issue with LASSO is that estimates depend on the scale of the variables, so it is common practice to center the covariates and standardize them so that every column of $X$ has the same norm (Ribbing et al., 2007), although recently there have been alternative suggestions on how to rescale the variables (Sardy, 2008). Regardless of the specific method, something must be done unless the scale of the covariates is carefully chosen. This point is critical not only in LASSO, but in other selection operators as well. For this paper, we will assume that prior to any regularization, covariates have been centered and standardized in the way described above. This will become important when performing calculations related to our proposal later on, but it is equally critical in LASSO, so we mention it now.
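A short sketch of this kind of preprocessing (here each column is centered and scaled to unit sample standard deviation; any convention that gives every column the same norm serves the same purpose):

```python
# Centering and standardizing the covariates before applying any selection
# operator (one common convention; others rescale so that x_i' x_i = n).
import numpy as np

def standardize(X):
    Xc = X - X.mean(axis=0)                 # center each covariate
    return Xc / Xc.std(axis=0, ddof=1)      # unit sample standard deviation

# Estimates obtained on standardized covariates must be rescaled if effects
# on the original measurement scale are required.
```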

3 An alternative proposal

We note, as seen in the proof of lemma 1, that the behavior of the level curves of the likelihood at zero is directly related to what variable selection choice will be made by the LASSO estimator. Essentially, the LASSO estimator will be either at a point where the level curves of the likelihood are at a 45 degree angle, or it will make a selection. The only time that it will select both variables regardless of $\lambda$ is if the likelihood level curves are at a 45 degree angle exactly at 0 (for Gaussian data, the probability of this occurring is zero). In figure 2 we see a graphical representation of exactly where the LASSO estimator may be located (depending on the choice of $\lambda$).

Figure 2: The possible locations for the LASSO estimator, as determined by the level curves. Which specific location corresponds to the LASSO estimator depends on $\lambda$. The dark line runs from the MLE along the points where the level curves are at a 45 degree angle, until it reaches an axis.

In order to address the issue of selectivity, we propose to alter the LASSO level curves. The idea is to propose a new set of level curves directly, and to build a selection operator from this proposal. The objective is to adjust the slopes of the level curves such that they span a continuous range. If the slope of the likelihood level curves at zero is in this range, then one variable will not dominate the other.

If the slope of the level curve at zero is in this range then $\hat{\beta}$ will be of type B regardless of the degree of shrinkage.

The geometry of the proposed level curves is the perimeter of an intersection of disks, as illustrated in figure 3 (or, in general, the boundary of an intersection of balls in dimension $p$). If the angle $\alpha$ in the figure is the same for all level curves, then whenever the slope of the likelihood level curves at the origin falls within the range of slopes spanned by the operator's level curve around the axis, a range determined by $\alpha$, both variables will be selected regardless of the degree of shrinkage (the parameter $\lambda$ in LASSO). This construction introduces a second parameter $w$, which determines $\alpha$ and which will be used in addition to a shrinkage parameter.

Figure 3: The proposed operator's level curves are the boundary of an intersection of disks. If the slope of the likelihood level curves at 0 falls within the range determined by the angle $\alpha$, then one variable will not dominate the other regardless of the level of shrinkage ($\lambda$ in LASSO). We will resort to the angle $\alpha$ introduced in this figure, and shown again in figure 4, in many parts of the paper. The value of the level curve corresponds to the level of shrinkage, but a new parameter $w$ is introduced to change the geometry through the angle $\alpha$, which controls the position and size of the circles. The two images show the geometry for two different values of $w$ and $\alpha$.

For the construction to make sense, the angles of intersection of the level curves with the axes must not depend on the degree of shrinkage. Consequently, the center of the corresponding circle will vary depending on which level curve we are on. We proceed to explore the necessary calculations for the construction of an operator from this idea.

Figure 4 shows the essential geometry used to calculate the location of the center of each curve. We note that the two triangles in the figure share a vertex at the origin and have equal angles where their arcs meet the horizontal axis, so these triangles are similar. We can therefore characterize the angle $\alpha$ by the ratio of the sides of either triangle. We can now write the center of the circle as $A = -a\mathbf{1}$, where $\mathbf{1}$ is a vector of ones and $a$ is the distance from $A$ to the origin along any given axis. The quantity $w$ is the additional parameter in our operator, which will be directly related to the desired level of selectivity.

Figure 4: The geometry required for calculating the value of the operator. $A$ and $A'$ are the centers of the circles whose arcs intersect the horizontal axis at $C$ and $C'$, respectively, and $B$ is the origin. Note that triangles $ABC$ and $A'BC'$ are similar. This figure is a reference for several calculations throughout the paper.

With this notation, and using $p$ for the dimension of $\beta$, it is now possible to write out the calculation that relates a point $\beta$ to the level curve on which it lies, that is, to the corresponding values of $a$ and of the radius of the circles. In the range of interest this equation has a single admissible root, and solving it yields a closed form expression for the operator. We must remember that in this section we have written $a$ and the radius out of notational convenience, but they depend on $w$ and on $\beta$, so they are really functions of $w$ and $\beta$.

Now that we have computed the geometry of the problem, the remaining issue is to use this geometry to construct an operator (in this case one that will also match a prior distribution). Any probability distribution for which the level curves of the density function are concentric circles (or their higher dimensional equivalent) centered at the origin may be used as a basis for the construction of the operator. If its density function is $f$, then we can construct a distribution whose density is obtained by evaluating $f$ along the level curves constructed above. This distribution does not have a scale parameter unless $f$ does, but most useful distributions do have one. We will refer to this family of priors as FATSOs or Flexible Axis-Thickened Selection Operators, and the basic form of FATSO will be based on a Gaussian distribution.

The FATSO will always be a probability distribution so long as $f$ is also a probability distribution, and it will have finite moments whenever $f$ does, since its tails are controlled by those of $f$.

The full formula for the Gaussian FATSO has a log density obtained by evaluating a Gaussian kernel on the geometric construction above, up to a normalizing constant $K$; this constant does not have to be computed since it does not depend on the $\beta_i$s.

This distribution also has the following useful property:

Lemma 2

The negative log density of the Gaussian FATSO is convex; equivalently, the Gaussian FATSO density is log-concave.

Note that the negative log of the univariate Gaussian density is an increasing convex function of its (positive) argument. Note also that the Gaussian FATSO evaluates this function on the quantity produced by the geometric construction, which is a convex function of $\beta$. The composition of an increasing convex function with a convex function is convex, so the result follows.

A trivial corollary is that, since the negative log-likelihood for linear regression is also convex, the negative log posterior is convex and the calculation of the MAP is a convex optimization problem. Unfortunately, most convex optimization algorithms require the use of gradients, and FATSO is not differentiable at any point where some $\beta_i = 0$, so the gradient does not exist at the expected optimum. That said, the convexity of the target function guarantees a unique optimum and other desirable properties for optimization. Almost any optimization technique which does not depend on differentiability at the optimum will calculate the FATSO estimate effectively.
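As a sketch of this point, the snippet below computes a MAP estimate by direct minimization of the negative log posterior with Powell's method, which needs no gradients. The penalty `fatso_like_gauge` is only our reconstruction of the geometric construction above (level sets built from intersections of balls with corners on the axes), included so the example runs end to end; the paper's closed-form expression should be substituted for it, and the parameter `k` below is an illustrative stand-in for the paper's $w$, not the same quantity.

```python
# Sketch: MAP estimation for a FATSO-type prior with a derivative-free
# convex optimizer.  The penalty is an illustrative reconstruction, not
# necessarily the paper's exact formula.
import numpy as np
from scipy.optimize import minimize

def fatso_like_gauge(beta, k):
    """Gauge of a convex set whose boundary is an intersection of balls with
    corners on the axes.  k -> infinity approaches the L1 (LASSO) geometry,
    k -> 0 the L2 (ridge) geometry."""
    u = np.abs(beta)
    s, q = u.sum(), u @ u
    return (k * s + np.sqrt(k**2 * s**2 + (1 + 2 * k) * q)) / (1 + 2 * k)

def neg_log_posterior(beta, X, Y, sigma2, lam, k):
    resid = Y - X @ beta
    # Gaussian negative log-likelihood plus a Gaussian-kernel FATSO-style term.
    return resid @ resid / (2 * sigma2) + lam * fatso_like_gauge(beta, k) ** 2

rng = np.random.default_rng(1)
n, p = 15, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, 0.5, 0.7]            # arbitrary illustrative values
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Powell's method uses no gradients, so non-differentiability at beta_i = 0
# is harmless; strict convexity of the objective still gives a unique optimum.
# Very small coefficients may need rounding or thresholding for display.
res = minimize(neg_log_posterior, x0=np.zeros(p),
               args=(X, Y, 0.01, 5.0, 2.0), method="Powell")
print(np.round(res.x, 3))
```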

4 Interpreting FATSO and selecting parameters

The design of FATSO is based around the idea of reducing, in a controlled way, the collection of level curves for which parameter estimates are zero. Namely, the issue is the slopes of the level curves of the likelihood function at zero. By adding the parameter $w$, we have allowed an interval of these slopes to produce nonzero parameter estimates, rather than a single slope. This seems promising, but in order to be of real use, we need a proper way to interpret this slope and to assign $w$ (and in most cases $\lambda$) to fit our problem.

As we have previously observed, in the bivariate case, if the slope of the likelihood level curves at zero falls within the range determined by the angle $\alpha$, then both variables will be selected. For interpretative purposes, let $r$ be the slope associated with $\alpha$, as in figure 3. Following the geometry from figure 4, $\alpha$ is a known function of $w$, and basic trigonometry allows us to convert between them.

We now have an easy conversion between $r$, $\alpha$ and $w$, but on its own this brings us no closer to interpreting $r$, nor to being able to set $w$ (i.e. $r$) in the FATSO operator.

The key to this crucial step is to calculate the slope of the likelihood level curve at zero. We note that the level curves are perpendicular to the gradient, so it is possible to study this slope by considering the gradient of the likelihood function at zero.

We observe that for the standard linear regression problem, the likelihood function is integrable, and a flat prior can be used to obtain a Gaussian posterior (Box and Tiao, 1992). We will not actually use a flat prior nor treat the result as a posterior, but for mathematical convenience, we can think of the likelihood function as if it were a Gaussian density with mean at the MLE $\hat{\beta}$ and covariance matrix $\Sigma = \sigma^{2}(X^{T}X)^{-1}$.

The gradient of a Gaussian density with mean $\mu$ and covariance $\Sigma$ is proportional to $-\Sigma^{-1}(x - \mu)$ times the density itself (Petersen et al., 2008), so at $x = 0$ it is proportional to $\Sigma^{-1}\mu$.

When we reduce this to the bivariate case, the slope of the gradient at zero is the ratio of the two components of $\Sigma^{-1}\hat{\beta}$.

As previously explained, the covariates have been standardized, and hence the diagonal entries of $X^{T}X$ (and therefore of $\Sigma$) are equal, so this ratio can be written in terms of the correlation between the covariates and the components of the MLE.

In the independent case, where the covariates are orthogonal (which can only happen if there is no intercept: if both the covariates and the response variable are centered, then the intercept is always 0 anyway), this result is simply the ratio of the signals of the two parameters, $\hat{\beta}_1/\hat{\beta}_2$. Note that with standardized covariates these are the pure effects on $Y$, free from the units of measurement; equivalently, the ratio can be read as a quotient of signal-to-noise ratios, which is unitless. This corresponds well with an intuitive notion of the relative importance, or difference from zero, of one parameter with respect to the other. In other words, this gives us an interpretation for the slope of the likelihood level curve at zero.

This intuitive notion is quite reasonable when the covariates are independent, but when they are correlated it is lacking. If the two estimates tend towards zero together, for instance, then we would hope that our notion of relative difference from zero would reflect that.

One way to attempt to correct this is to consider instead the conditional distribution of one parameter given the other (Eaton, 2007), and to calculate the conditional equivalent of the ratio, which we will call $r_c$: the quotient of the conditional means, each taken given that the other parameter is zero. This gives a more accurate representation of the relative difference from zero of the two variables, since it is the quotient of the means in the particular case of interest, in which the other variable is zero (and, with standardized covariates, it is still unitless).

Figure 5 gives some intuition to show how the conditional distribution is a better choice than the marginal distribution. Both of the Gaussian distributions shown have the same marginal density, but in one case they are independent and in the other they are highly correlated. The difference between the relative importance of the two variables is visually apparent: If one of the variables is set to be zero, the other should be small as well.

Figure 5: Two bivariate Gaussian densities with the same marginals and different degrees of correlation. In (A) the variables are independent. In (B) the two variables are strongly correlated: when one variable tends towards zero, the other is also very small. In this case, intuitively, the variables are closer in relative importance and there is less of a reason to prefer one over the other. This intuition is reflected by $r_c$.

When we calculate $r_c$, the result is exactly the ratio of the components of $\Sigma^{-1}\hat{\beta}$, which is precisely the slope of the gradient of the likelihood at zero.

In other words, regardless of correlation, the slope of the level curves of the likelihood at zero matches the conditional signal ratio $r_c$, which is a good intuitive measure of the relative importance of two variables in a regression problem.
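A small numerical check of this claim (a sketch with simulated data): with standardized covariates, the ratio of conditional means matches the ratio of the components of the likelihood gradient at the origin.

```python
# Sketch: with standardized covariates, the conditional signal ratio equals
# the slope of the likelihood gradient at zero (bivariate case).
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = rng.standard_normal((n, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # standardized covariates
beta_true = np.array([1.0, 0.4])                      # arbitrary illustrative values
Y = X @ beta_true + 0.3 * rng.standard_normal(n)

sigma2 = 0.3 ** 2
mu = np.linalg.solve(X.T @ X, X.T @ Y)                # MLE
Sigma = sigma2 * np.linalg.inv(X.T @ X)               # "likelihood as a Gaussian"

# Slope of the gradient of the Gaussian density at the origin: ratio of the
# components of Sigma^{-1} mu.
g = np.linalg.solve(Sigma, mu)
grad_ratio = g[0] / g[1]

# Conditional means E[beta_1 | beta_2 = 0] and E[beta_2 | beta_1 = 0].
cond_1 = mu[0] - Sigma[0, 1] / Sigma[1, 1] * mu[1]
cond_2 = mu[1] - Sigma[0, 1] / Sigma[0, 0] * mu[0]

print(grad_ratio, cond_1 / cond_2)                    # the two ratios agree
```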

The user therefore assigns $r$ as the circumstances require, so that, regardless of $\lambda$, both $\beta_1$ and $\beta_2$ are selected whenever the conditional signal ratio $r_c$ falls within the bounds determined by $r$. Our previous calculation allows us to set $w$ when $r$ is known, although it is also possible to simply use an alternate parametrization, working with $r$ directly instead of $w$. This parametrization is easier to interpret and will be used from here on out.

This is nicely interpretable in two dimensions. In higher dimensions the structure is analogous and the mathematics are identical (simply do the calculation with the marginal distribution of the two intended variables). The interpretation of the slope is slightly less intuitive, since the direction is determined by a vector rather than by a single number. The relationship between the corresponding components of the gradient, however, still matches $r_c$.

We now have a way to interpret $r$. For full Bayesian inference, one would simply select $r$ a priori, but it may also be reasonable to choose another path and simply try out values of $r$. Since the computational cost is low (unless the number of parameters is truly huge), a fair amount of information about the behavior and relative importance of parameters can be gleaned in fairly little time. Table 2, in section 6.2, shows an example of what such an exploration might look like.

One final practical note on the selection of $r$ is based on the fact that it is independent of units. Since it means the same thing at all scales, one can take reasonable values for $r$ to lie between 1.1 and 15. If $r$ is less than 1.1 then there is very little difference between the geometries of FATSO and LASSO, whereas if $r$ is greater than 15 one is hardly performing any variable selection at all.

4.1 $\lambda$ and prior conditional variance in Gaussian FATSO

We now have a handle on $r$, but the Gaussian FATSO has a second parameter $\lambda$. If $\lambda$ is (near) zero then we end up with a flat prior, which may be suitable for estimation without any selection. On the other hand, if $\lambda$ is large then the prior will be concentrated around zero. This yields higher selectivity, but also shrinks the value of all estimates.

One way to think about selecting $\lambda$ is to think of the FATSO less as an operator and more as a prior distribution. We can then study the properties of FATSO as a probability distribution, in which case $\lambda$ may be interpreted as related to the variance of this distribution. Following the geometry from figure 4, we have the next lemma.

Lemma 3

For a Gaussian FATSO, the conditional prior distribution of a single coefficient, given that all other coefficients are zero, is Gaussian with mean zero and a (prior) variance determined by $\lambda$ and the angle $\alpha$ shown in figures 3 and 4.

We note from figure 4 and Thales's theorem that the ratio of $AC$ to $BC$ is the same regardless of how far $C$ is along the horizontal axis. We then have a relationship in which the left hand side differs by a constant from the value of the FATSO prior log-density calculated at the point $C$, and the right hand side differs by a constant from a Gaussian log-density calculated at the same location ($B$ is the origin).

Some trigonometry will then yield the value of the variance, which proves the claim.

As $r$ approaches 1, this variance goes to zero and the geometry of FATSO approaches the geometry of LASSO.

Of note, if $r$ is close to 1 then this variance can become very small, and as a result $r$ will have an effect on the shrinkage of the estimates unless $\lambda$ is adjusted to compensate. This is not a very significant issue unless $r$ is taken very close to 1.

If FATSO is to be used for Bayesian analysis, this shows the effect of $\lambda$ on the FATSO prior. The conditional variance of one parameter, given that all others are zero, is a reasonable way to establish prior variability, and $\lambda$ should be selected accordingly.

Departing from a full Bayesian prior statement, one reasonable way to select $\lambda$ is to use data-driven techniques such as cross-validation, but these may come at a significant computational cost, or the sample size may be too small for cross-validation to be a reasonable choice.

If we want to set $\lambda$ using heuristics, we can turn to the observed data $Y$ for some guidance. Note that if only $\beta_i$ is active and all others are equal to zero, then, writing $x_i$ for the $i$th column of $X$, we have $Y = x_i\beta_i + \epsilon$.

Now, using the Bayesian interpretation (even if we are not going to perform Bayesian inference), we can think of $\beta_i$ as a random variable, a priori independent of $\epsilon$, so that the variance of the observations decomposes into the prior variance of $\beta_i$ plus $\sigma^2$; here we use the fact that $X$ was standardized.

The variance of $Y$ is not known, but it can be estimated with the sample variance of the observations. Hence, if $\sigma^2$ is known, we can match the conditional prior variance from lemma 3 to this estimate and obtain one choice for $\lambda$.

However, it is worth observing that in practice, the value of $\lambda$ has a relatively small effect on point estimates, as will be seen empirically in the results section of this paper. $\lambda$ acts more as an on/off switch than a dial, and hence it is not too important to worry about its exact value. One has only to find something in the (usually very large) reasonable range. The above method for selecting $\lambda$ is not meant to give a precise value, but merely a notion of where the reasonable range might be.

A second option, as in the LASSO parametrization mentioned in section 2, is to parameterize not with $\lambda$ itself but with $\lambda$ rescaled by the noise variance, yielding estimates which no longer depend on $\sigma^{2}$. Of course, this comes at the cost of being able to use knowledge of $\sigma^{2}$ in order to select the parameter, as was done above with $\lambda$. Even if we prefer this parametrization, however, it shows that we can scale $\lambda$ with the inverse of the noise variance to achieve similar results.

5 Comparison to other LASSO extensions

FATSO is not the first attempt to extend the ideas of LASSO in a new direction. There are several other regularizations which have been attempted and which yield different benefits. We make no claims that FATSO is necessarily any better than any of these, but only that the issues it aims to address are different.

5.1 Ridge and Bridge regression

Ridge regression, also known as Tikhonov regularization in inverse problems (Kaipio, 2005), is an older idea than LASSO. It is also closely related to the use of Gaussian priors in Bayesian regression. It essentially aims to estimate the regression coefficients with

$$\hat{\beta} = \arg\max_{\beta} \left\{ l(\beta) - \lambda \sum_{i} \beta_i^{2} \right\},$$

where $\lambda \sum_{i} \beta_i^{2}$ is the Ridge operator. One idea which places LASSO at one end and Ridge regression at the other is called Bridge regression, which changes the operator to $\lambda \sum_{i} |\beta_i|^{q}$ for $1 < q < 2$. Of note, however, for any value of $q > 1$ the level curves meet the axes with slope exactly zero, so there is no corner there. Hence, Ridge and Bridge are not selection operators, in the sense that the resulting estimates are not exactly zero (Hoerl and Kennard, 1970; Park and Yoon, 2011).
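A quick numerical illustration of this difference (a sketch with simulated data): ridge shrinks coefficients but does not send any of them exactly to zero, while LASSO does.

```python
# Sketch: ridge shrinks but does not zero out coefficients, unlike LASSO.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.5, 0.25]          # arbitrary illustrative values
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

ridge = Ridge(alpha=10.0).fit(X, Y)
lasso = Lasso(alpha=0.05).fit(X, Y)
print("ridge exact zeros:", int(np.sum(ridge.coef_ == 0)))   # 0
print("lasso exact zeros:", int(np.sum(lasso.coef_ == 0)))   # typically several
```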

5.2 Group LASSO

One common extension to LASSO is the group LASSO, which separates the columns of X into groups and which promotes the selection of groups of variables together. While this does extend the ability of LASSO to handle more complex situations, it also requires some degree of understanding of the relationships between covariates, which is not the goal of FATSO. In another sense, however, group LASSO is more closely related to FATSO than the other LASSO extensions since it aims to incorporate information about parameter grouping that is not immediately visible in the data but which is understood by the user (Yuan and Lin, 2006).

5.3 Scale mixtures of Normals

In recent Bayesian literature, there has been an explosion of selection operators proposed in the form of priors which are scale mixtures of normals (Carvalho et al., 2010; Bhattacharya et al., 2014; Liang et al., 2008). A scale mixture of normals is a random variable which can be represented as $\sigma Z$, where $Z$ is a random variable with a standard normal distribution and $\sigma$ is some other (continuous or discrete) nonnegative random variable (West, 1987). LASSO itself is closely related to this family, since it corresponds to a Laplace prior, and a univariate Laplace prior is a scale mixture of normals with an exponentially (i.e. Gamma) distributed mixing variance. There are various motivations for the proposed operators, but they are generally focused on some form of asymptotic convergence, either of the entire posterior distribution or of some point estimate derived from it. We are unaware of any which ease the interpretation of tuning parameters.

5.4 Elastic net

The idea with the most similar behavior to FATSO is the elastic net. The elastic net uses $\lambda_1 \sum_{i} |\beta_i| + \lambda_2 \sum_{i} \beta_i^{2}$ as a regularization operator (and then applies a correction to the estimator), essentially working as a linear combination of the Ridge and LASSO operators. The first thing to note about the elastic net operator is that the level curves are not concentric, and the slope of the curves' intersection with the axes depends on the curve. For distant curves, the geometry of Ridge is dominant, whereas with curves closer to the origin the geometry is closer to that of LASSO.

While the elastic net does not maintain the concentric level curves of FATSO, it does allow for variable selection with less stringent selectivity than LASSO, so it behaves in a similar way. In the elastic net, however, the degree of selectivity is moderated very obscurely by the interplay of $\lambda_1$ and $\lambda_2$. The common recommendation is to select both parameters by data-driven techniques, such as cross-validation. This is a valid approach, but it does not allow users to make informed decisions about the desired degree of selectivity based on their own expertise. Given that the scenario is one where the data are known to carry very little information, the goal of allowing human knowledge to participate is very sensible.
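A sketch of that common recommendation with scikit-learn's `ElasticNetCV`, which tunes both the overall penalty strength and the L1/L2 mixing weight by cross-validation (its `alpha` and `l1_ratio` jointly play the role of $\lambda_1$ and $\lambda_2$):

```python
# Sketch: cross-validated choice of both elastic net parameters.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(4)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.5, 0.25]          # arbitrary illustrative values
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, Y)
print("alpha:", enet.alpha_, "l1_ratio:", enet.l1_ratio_)
print("nonzero coefficients:", int(np.sum(enet.coef_ != 0)))
```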

While FATSO is in no way intended to replace the elastic net, it is worth noting that the two main issues with LASSO which the elastic net aims to solve are both addressed by FATSO as well. The first of these issues is that in $n < p$ situations, LASSO cannot select more than $n$ variables, and in the following section we will see an example where FATSO selects more than $n$ covariates. The second issue is that when several covariates are highly correlated LASSO tends to select only one of them. It is proven in the original elastic net paper (Zou and Hastie, 2005) that any strictly convex regularization will solve this issue, and FATSO is strictly convex.

FATSO does not aim to compete with the elastic net in terms of computational tractability or in terms of asymptotic error reduction, so while the behavior of the two operators is somewhat similar, their ultimate objectives are different.

6 Results

6.1 Simulated Data

Fifteen observations of a 20-dimensional regression problem were simulated. The true values of the regressors were zero except for three variables, which were assigned fixed nonzero values (the noise standard error was 0.1). Using these same data, FATSO estimates were calculated using different values of $r$ and $\lambda$. Table 1 shows maximum a posteriori estimates for these data based on various values of $r$ and $\lambda$ using the Gaussian FATSO.

$r$     $w$      $\lambda$   Estimates of the three truly nonzero coefficients   Other $\hat{\beta}$s
500     353.6    0.1         0.697   0.329   0.543    Many are of a similar order of magnitude
30      21.21    30          0.709   0.336   0.548    Another also active
3       2.236    30          0.814   0.397   0.602    Two others are active
2       1.581    30          0.871   0.464   0.671    Others inactive
2       1.581    0.01        0.857   0.442   0.65     Another active
2       1.581    1           0.868   0.459   0.69     Others inactive
2       1.581    100         0.868   0.46    0.658    Others inactive
2       1.581    1000        0.71    0.249   0.437    Others inactive
Table 1: Results of estimations using the FATSO operator for various values of $r$ and $\lambda$ (the column $w$ gives the value implied by $r$). We see that $r$ affects selectivity smoothly, while adjusting $\lambda$ does not allow for very flexible tuning.
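For reference, a minimal sketch of how data matching this description could be simulated (the specific nonzero values below are placeholders, since the exact values used for the paper's simulation are not reproduced here):

```python
# Sketch of a simulation setup matching the description above: n = 15
# observations, p = 20 covariates, three nonzero coefficients, noise sd 0.1.
import numpy as np

rng = np.random.default_rng(5)
n, p = 15, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # centered and standardized
beta_true = np.zeros(p)
beta_true[[0, 1, 2]] = [0.9, 0.45, 0.65]            # placeholder nonzero values
Y = X @ beta_true + 0.1 * rng.standard_normal(n)
# Y and X can now be fed to any FATSO MAP routine (e.g. the derivative-free
# optimization sketch in section 3) for different values of the parameters.
```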

With a high value for $r$ and a low value for $\lambda$, the FATSO prior is nearly flat. In these situations the estimate is nearly the MLE, and since the dimension is greater than the sample size, the MLE does not give any real information about the parameters. As FATSO becomes more informative, so do the estimates. We also note that the effects of $r$ and $\lambda$ act independently, and that the effect of $\lambda$ acts almost as if it had a threshold. For a fixed $r$ of 2, for any moderate $\lambda$ the exact value does not seem to have very much effect: $\lambda$ at 1 and $\lambda$ at 100 both yield very similar estimates for the parameters, and it is not until $\lambda$ is extremely large (1000) that the effect of shrinkage becomes noticeable. With very small $\lambda$, however, the effect is lost somewhat (the extreme case being $\lambda = 0$, where we are left with the MLE again). The same cannot be said for $r$ (or $w$), which affects estimates much more fluidly. We see that with a fixed $\lambda$ of 30, the effect of $r$ on selectivity is very clear: as $r$ approaches 1 the selection is more strict, and as $r$ grows the selection is looser. This confirms that the parameter $r$ permits tuning the degree of selectivity in a fluid way that is not possible with a shrinkage parameter alone.

It is very tempting to be seduced by the good results obtained with a small $r$ and a reasonably high $\lambda$, since the estimates are so close to the truth, but we must remember that these are synthetic data. The estimates obtained with higher values of $r$ could just as easily have produced the same dataset; in this particular case they did not. When the dimension of the parameter space is larger than the sample size, the data will lie in the column space of the design matrix, and the choice of one set of estimates over another is not information that is really in the data at all. With lower $r$, FATSO will tend to choose smaller sets of covariates which explain the data, but whether that is desirable or not is really for the user to judge.

In order to illustrate this point, data were simulated with 17 nonzero variables rather than 3, so that only three of the regressors are truly zero, and inference was performed in the same way. It is known that LASSO selects at most as many variables as the sample size, so using LASSO here would retain at most 15 of the 17 active variables. With a highly selective combination of $r$ and $\lambda$ (a geometry similar to LASSO) we estimate 6 inactive variables, which necessarily includes variables that are truly active. As we see, not only are the variables being selected more strictly, but the variable choice is simply wrong. This is caused by the insistence on a high level of selectivity. With a less selective combination of $r$ and $\lambda$ we estimate two inactive coefficients, both of which are truly zero; the third truly zero covariate happens, simply by chance, to be unusually highly correlated with the signal in this particular simulation, so it was not selected against. We note that specifying too stringent a selection criterion forces the model to shift away from the true values of the parameters, whereas allowing more nonzero entries yields a very good selection of variables. The difference between this situation and the previous one is subtle, and it may often be a good idea to make the selection based on human understanding of the situation rather than on data which is necessarily insufficient.

6.2 Real Data

We use data from Stamey et al. (1989) to evaluate the performance of FATSO and the effect of parameter adjustment. These data are often used for LASSO demonstrations. The data form a 9-column matrix describing prostate cancer measurements for 97 patients. The first 8 columns describe characteristics of the tumor and the last column is the response variable: an antigen. The data are available in the R package lasso2 (Lokhorst et al., 2014).

Since the data were sorted by the response variable, the rows of the matrix were permuted randomly. There are 97 rows; estimation was done using the first 67 rows, and the fitted models were used to predict the remaining 30. A sketch of this protocol is given below; Table 2 then displays the results of inference on these data using different values of the parameters.
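The loading step in the sketch is an assumption: here the lasso2 prostate data are assumed to have been exported to a CSV file with the response in the last column.

```python
# Sketch of the train/test protocol: permute the 97 rows, fit on the first
# 67 and report prediction MSE on the remaining 30.
import numpy as np

data = np.loadtxt("prostate.csv", delimiter=",", skiprows=1)  # assumed export
rng = np.random.default_rng(6)
data = data[rng.permutation(len(data))]

X_train, y_train = data[:67, :-1], data[:67, -1]
X_test, y_test = data[67:, :-1], data[67:, -1]

def prediction_mse(beta_hat, intercept=0.0):
    """Held-out MSE for a fitted coefficient vector (e.g. a FATSO MAP fit)."""
    resid = y_test - (X_test @ beta_hat + intercept)
    return float(np.mean(resid ** 2))
```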

$r$     $\lambda$   Active $\hat{\beta}$s   Prediction MSE   Observations
100     0.0001      all except one          0.60516          Almost exactly simple linear regression
5       0.5         all except one          0.61363
3       1                                   0.63189          Removing one variable does not greatly increase the error
2       1                                   0.65002
1.2     1                                   0.78280          With a higher error we can remove many more
Table 2: Results of estimations using the FATSO operator on the prostate cancer data. Once again we see that $r$ can be adjusted to tune the degree of selectivity with a reasonable degree of control.

The main takeaway from this experiment is the fluid way in which we can select the active $\beta$s. Since this is not a high dimensional problem, the MLE is a good estimator, but by tuning $r$ we can pick a simplified model which selects more or fewer variables. Removing variables comes at a cost, but we can see exactly how costly this removal is. Using this information it is possible to manually tune our model to whatever balance of parsimony and accuracy we want. Hence, even in this relatively low dimensional scenario, there is something to be gained by having a fluid selection operator.

7 Conclusions

In situations with high dimensional data, where $n < p$, there are infinitely many parameter combinations which might yield the observed data. In these situations, the data do not clearly favor one choice of parameters over another, so in order to make a selection, some measure of human choice is required. LASSO and other similar regularization operators are means by which a form of preference is given to one kind of solution over another. These systems all have parameters which affect, in some sense, how this choice is made. The selection of these parameters by data-driven techniques is appealing, but the information needed to make the choice is not really in the data. As a result, it becomes desirable to understand the meaning of the parameters and the effect of their choice on the resulting inference. This problem is particularly serious in the Bayesian setting, since the operators correspond to prior distributions and it is invalid to assign priors using the data that these priors are chosen to analyze. This last issue is not a vague or theoretical one, since both LASSO and the elastic net have been adapted for Bayesian inference regardless of the difficulty in assigning parameters (Park and Casella, 2008; Li and Lin, 2010).

While significant effort has been made to improve the data-driven techniques for adjusting parameters, this effort has done little to improve the interpretability of the parameters for human users who have additional information. In this sense, the elastic net is the system which boasts the lowest mean squared error in theory, but it also offers perhaps the least interpretable combination of parameters.

FATSO is an attempt to offer a means of setting the degree of selectivity by hand. While it is theoretically possible to use data-driven techniques to assign $r$ and $\lambda$, if data-driven techniques are preferred then one would probably be better served by another regularization operator. On the other hand, in situations where one intends to choose the degree of selectivity using outside knowledge, FATSO is recommended as a way to set the selectivity that is understandable and meaningful.

References

  • Bhattacharya et al. (2014) Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2014) Dirichlet-Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110, 1479–1490.
  • Box and Tiao (1992) Box, G. E. P. and Tiao, G. C. (1992) Bayesian inference in statistical analysis. New York: Wiley.
  • Carvalho et al. (2010) Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010) The horseshoe estimator for sparse signals. Biometrika, 97, 465–480.
  • Eaton (2007) Eaton, M. (2007) Multivariate statistics : a vector space approach. Beachwood, Ohio: Institute of Mathematical Statistics.
  • Guyon and Elisseeff (2003) Guyon, I. and Elisseeff, A. (2003) An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
  • Hoerl and Kennard (1970) Hoerl, A. E. and Kennard, R. W. (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
  • Kaipio (2005) Kaipio, J. (2005) Statistical and computational inverse problems. New York: Springer.
  • Li and Lin (2010) Li, Q. and Lin, N. (2010) The Bayesian elastic net. Bayesian Analysis, 5, 151–170.
  • Liang et al. (2008) Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008) Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103, 410–423.
  • Lokhorst et al. (2014) Lokhorst, J., Venables, B. and Turlach, B. (R port and tests by Maechler, M.) (2014) lasso2: L1 constrained estimation aka ‘lasso’. URL: https://CRAN.R-project.org/package=lasso2. R package version 1.2-19.
  • Obuchi (2016) Obuchi, T. and Kabashima, Y. (2016) Cross validation in LASSO and its acceleration. Journal of Statistical Mechanics: Theory and Experiment, 2016, 053304.
  • Park and Yoon (2011) Park, C. and Yoon, Y. J. (2011) Bridge regression: Adaptivity and group selection. Journal of Statistical Planning and Inference, 141, 3506–3519.
  • Park and Casella (2008) Park, T. and Casella, G. (2008) The Bayesian lasso. Journal of the American Statistical Association, 103, 681–686.
  • Petersen et al. (2008) Petersen, K. B., Pedersen, M. S. et al. (2008) The matrix cookbook. Technical University of Denmark, 7, 510.
  • Rencher (2008) Rencher, A. (2008) Linear models in statistics. Hoboken, N.J: Wiley-Interscience.
  • Ribbing et al. (2007) Ribbing, J., Nyberg, J., Caster, O. and Jonsson, E. N. (2007) The lasso—a novel method for predictive covariate model building in nonlinear mixed effects models. Journal of Pharmacokinetics and Pharmacodynamics, 34, 485–517.
  • Sardy (2008) Sardy, S. (2008) On the practice of rescaling covariates. International Statistical Review, 76, 285–297.
  • Stamey et al. (1989) Stamey, T. A., Kabalin, J. N., Ferrari, M. and Yang, N. (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. IV. Anti-androgen treated patients. The Journal of Urology, 141, 1088–1090.
  • Tibshirani (1996) Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological), 58, 267–288.
  • West (1987) West, M. (1987) On scale mixtures of normal distributions. Biometrika, 74, 646–648.
  • Yuan and Lin (2006) Yuan, M. and Lin, Y. (2006) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 49–67.
  • Zou and Hastie (2005) Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301–320.