
Estimator of Prediction Error Based on Approximate Message Passing for Penalized Linear Regression

by Ayaka Sakata et al.

We propose an estimator of prediction error using an approximate message passing (AMP) algorithm that can be applied to a broad range of sparse penalties. Following Stein's lemma, the estimator of the generalized degrees of freedom, which is a key quantity for the construction of the estimator of the prediction error, is calculated at the AMP fixed point. The resulting form of the AMP-based estimator does not depend on the penalty function, and its value can be further improved by considering the correlation between predictors. The proposed estimator is asymptotically unbiased when the components of the predictors and response variables are independently generated according to a Gaussian distribution. We examine the behaviour of the estimator for real data under nonconvex sparse penalties, where Akaike's information criterion does not correspond to an unbiased estimator of the prediction error. The model selected by the proposed estimator is close to that which minimizes the true prediction error.





1 Introduction

In recent decades, variable selection using sparse penalties, referred to here as sparse estimation, has become an attractive estimation scheme [1, 2, 3]. Sparse estimation is mathematically formulated as the minimization of an estimating function associated with the sparse penalties. In this paper, we concentrate on the linear regression problem with an arbitrary sparse regularization. Let $y \in \mathbb{R}^M$ be a response vector and $X \in \mathbb{R}^{M \times N}$ be a predictor matrix, where each column of $X$ corresponds to a predictor. We denote the regression coefficient to be estimated as $\beta \in \mathbb{R}^N$, and the problem considered here can be formulated as

$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left[ \frac{1}{2} \| y - X\beta \|_2^2 + \Omega(\beta; \lambda) \right], \tag{1}$$

where $\Omega(\beta; \lambda)$ is the sparse regularization that enhances the zero components in $\beta$, and $\lambda$ is a set of regularization parameters. In the classical variable selection problem, the variables are estimated by a maximum likelihood method under a constraint on the support of the variables used in the model. In sparse estimation, the support is determined by the regularization parameter $\lambda$; hence, the adjustment of $\lambda$ is regarded as the selection of a model, and is crucial in sparse estimation.
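As a concrete instance of problem (1), the $\ell_1$-penalized case can be solved by plain coordinate descent. The sketch below is illustrative only; the function name `lasso_cd`, the fixed sweep count, and the synthetic data are our choices, not the paper's.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2) * ||y - X b||^2 + lam * ||b||_1,
    i.e. problem (1) with the l1 penalty. A minimal sketch."""
    M, N = X.shape
    b = np.zeros(N)
    col_sq = (X ** 2).sum(axis=0)          # per-column squared norms
    for _ in range(n_iter):
        for j in range(N):
            # correlation of column j with the partial residual
            z = X[:, j] @ (y - X @ b) + col_sq[j] * b[j]
            # soft-thresholding update for coordinate j
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return b

# tiny demonstration on synthetic data with a sparse ground truth
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta0 = np.zeros(10)
beta0[:2] = [3.0, -2.0]
y = X @ beta0
beta_hat = lasso_cd(X, y, lam=0.1)
```

Nonconvex penalties such as SCAD and MCP can be handled by swapping the soft-thresholding update for the corresponding one-dimensional solution.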

In general, the regularization parameter is determined based on model selection criteria. One of the criteria is the prediction performance, which is measured by the prediction error defined for test data generated according to the true distribution. Following this criterion, the model with the regularization parameter that minimizes the prediction error provides an appropriate description of the data. However, the exact computation of the prediction error is impossible, because we do not know the true distribution. Therefore, model selection is generally implemented using estimators of the prediction error, rather than the prediction error itself. For maximum likelihood estimation, Akaike's information criterion (AIC) is an unbiased estimator of the prediction error, and is widely used for model selection when concentrating on the prediction of unknown data [4]. However, for sparse estimation, AIC is generally not an unbiased estimator of the prediction error. In such cases, a $C_p$-type criterion, which evaluates the prediction error using the generalized degrees of freedom (GDF) [5, 6], is useful when the variance of the data is known. However, the exact computation of GDF also requires the true probability distribution of the data, which means that an estimator must be constructed. In sparse estimation, GDF and its unbiased estimator can be derived for certain regularizations. A well-known result obtained using the least absolute shrinkage and selection operator (LASSO) is that AIC corresponds to the unbiased estimator of the prediction error [7], a nontrivial fact that has attracted the interest of statisticians. Recently, sparse estimation using nonconvex regularizations has been studied with the aim of achieving a more compact representation of the data than LASSO [8, 9]. Hence, model selection methods that can be applied to such regularizations are required.

In this paper, we propose a numerical method for constructing estimators of GDF and the prediction error based on approximate message passing (AMP). The derived form of GDF does not depend on the regularization, although the properties of the AMP fixed point are regularization-dependent. The estimator is asymptotically unbiased when the predictor matrix and response variables are independently and identically distributed (i.i.d.) Gaussian, which is indicated by showing the correspondence between AMP and the replica method [10]. In this case, we can confirm that AIC is the unbiased estimator for the $\ell_1$ penalty, which is consistent with an earlier study [7]. We apply our estimator to real data and show that it adequately emulates model selection according to the prediction error.

The remainder of this paper is organized as follows. In Sec. 2, we summarize the background material that supports the following discussion. The problem setting considered in this paper is explained in Sec. 3, and the AMP algorithm for the penalized linear regression problem is introduced in Sec. 4. The estimator of GDF using the AMP fixed point is derived in Sec. 5, and its asymptotic properties for Gaussian random data and predictors are studied in Sec. 6 using the replica method. In Sec. 7, the proposed estimator is applied to real data, and the improvement it offers is quantified in Sec. 8. Finally, Sec. 9 presents the conclusions to this study and a discussion of the results.

2 Background

2.1 Prediction error and generalized degrees of freedom

We denote the estimated regression coefficient obtained by solving (1) and its fit to the response variable as $\hat{\beta}(\lambda)$ and $\hat{y} = X\hat{\beta}(\lambda)$, respectively. The prediction performance of the model under the given response variable is measured by the prediction error

$$\epsilon_{\mathrm{pred}}(\lambda) = \frac{1}{M} E_{y^{\mathrm{new}}}\left[ \left\| y^{\mathrm{new}} - \hat{y} \right\|_2^2 \right], \tag{2}$$

where $y^{\mathrm{new}}$ is test data whose statistical property is equivalent to that of $y$. The value of $\lambda$ that minimizes the prediction error gives an appropriate description of the data $y$. We aim to approximate the prediction error by constructing its estimator. The simplest estimator is the training error, defined by

$$\epsilon_{\mathrm{train}}(\lambda) = \frac{1}{M} \left\| y - \hat{y} \right\|_2^2, \tag{3}$$

but this underestimates the prediction error because of the correlation between $y$ and $\hat{y}$. We consider the case in which the mean and covariance matrix of the response vector are given by

$$E[y] = \mu, \qquad E\left[ (y - \mu)(y - \mu)^{\top} \right] = \sigma^2 I_M, \tag{4}$$

where $\top$ denotes the matrix transpose, and $I_M$ is the $M$-dimensional identity matrix. In this case, the prediction error and the training error satisfy the relationship

$$E_y\left[ \epsilon_{\mathrm{pred}}(\lambda) \right] = E_y\left[ \epsilon_{\mathrm{train}}(\lambda) \right] + \frac{2\sigma^2}{M}\, \mathrm{df}(\lambda), \tag{5}$$

where $\mathrm{df}(\lambda)$ is the generalized degrees of freedom (GDF) defined by [11]

$$\mathrm{df}(\lambda) = \frac{1}{\sigma^2} \sum_{i=1}^{M} \mathrm{Cov}(y_i, \hat{y}_i), \tag{6}$$

and $\mathrm{Cov}(y_i, \hat{y}_i) = E\left[ (y_i - E[y_i])(\hat{y}_i - E[\hat{y}_i]) \right]$. The expression (5) is known as the $C_p$ criterion for model selection [5, 6].

2.2 Stein’s lemma

The covariance (6) is not observable, in the sense that an expectation with respect to the true distribution is required. Hence, estimators of df, denoted by $\widehat{\mathrm{df}}$, are used instead of $\mathrm{df}$ for approximating the prediction error. An idea for the construction of the estimator appears in Stein's unbiased risk estimate (SURE), based on Stein's lemma [12]. Following Stein's lemma, the component-wise covariance between the response variable and its fit is given by

$$\mathrm{Cov}(y_i, \hat{y}_i) = \sigma^2 E\left[ \frac{\partial \hat{y}_i}{\partial y_i} \right]. \tag{7}$$

Equation (7) means that an unbiased estimator of GDF is given by

$$\widehat{\mathrm{df}}(\lambda) = \sum_{i=1}^{M} \frac{\partial \hat{y}_i}{\partial y_i}. \tag{8}$$

Note that (7) is exact when the response variables are i.i.d. Gaussian random variables. Equation (8) is an observable quantity, and hence an estimator of the prediction error can be defined by

$$\hat{\epsilon}_{\mathrm{pred}}(\lambda) = \epsilon_{\mathrm{train}}(\lambda) + \frac{2\sigma^2}{M}\, \widehat{\mathrm{df}}(\lambda). \tag{9}$$

Equation (9) is an unbiased estimator of the prediction error when the response variables are i.i.d. Gaussian, but unbiasedness is not generally guaranteed for other distributions.
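Estimator (8) can be approximated numerically for any black-box fitting procedure by central finite differences. The helper below is an illustrative sketch (the name `df_finite_diff` is ours), checked here against ridge regression, where the fit is a linear smoother $\hat{y} = Hy$ and GDF is exactly $\mathrm{Tr}\,H$.

```python
import numpy as np

def df_finite_diff(fit, y, eps=1e-5):
    """Evaluate df_hat = sum_i d yhat_i / d y_i by central differences.
    `fit` maps a response vector to its fitted values."""
    df = 0.0
    for i in range(len(y)):
        yp, ym = y.copy(), y.copy()
        yp[i] += eps
        ym[i] -= eps
        df += (fit(yp)[i] - fit(ym)[i]) / (2.0 * eps)
    return df

# sanity check on ridge regression, whose hat matrix H is explicit
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
H = X @ np.linalg.solve(X.T @ X + np.eye(5), X.T)   # ridge hat matrix, lam = 1
y = rng.normal(size=20)
df_hat = df_finite_diff(lambda v: H @ v, y)         # should equal trace(H)
```

For a linear smoother the derivative is constant in $y$, so the finite-difference estimate recovers $\mathrm{Tr}\,H$ up to rounding error; for nonlinear fits it gives a pointwise evaluation of (8).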

For the maximum likelihood method, the unbiased estimator of GDF is given by the number of parameters used in the model, which leads to Akaike's information criterion (AIC) [4]:

$$\mathrm{AIC}(\lambda) = \epsilon_{\mathrm{train}}(\lambda) + \frac{2\sigma^2}{M} \times (\text{number of parameters}). \tag{10}$$

As defined here, AIC is rescaled compared with its conventional definition, but this does not affect the following discussion. For penalized regression problems, when the fit of the response variable is represented as $\hat{y} = Hy$ using a matrix $H$, it can be shown that GDF corresponds to $\mathrm{Tr}\,H$ [13], a result that is widely used in ridge regression. Furthermore, many studies have derived SURE for problems in which (8) is easily computed [14, 15, 16].

2.3 GDF for sparse penalties

For sparse penalties, the derivation of GDF and the construction of its estimator are not straightforward, because the penalty term is not differentiable at the origin. For the $\ell_1$ penalty, GDF is equal to the expected number of non-zero components [7]:

$$\mathrm{df}(\lambda) = E\left[ \| \hat{\beta}(\lambda) \|_0 \right], \tag{11}$$

and hence $\| \hat{\beta}(\lambda) \|_0$ is an unbiased estimator of GDF. This means that AIC corresponds to an unbiased estimator of the prediction error. The extension of this result to general problem settings has been extensively discussed [17, 18]. However, such a simple relationship does not generally hold, and efficient computation methods for (8) are required. Computation techniques for df have been developed using cross-validation and the parametric bootstrap method, but these techniques are computationally expensive [6, 19].
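A quick numerical check of this fact in the orthogonal-design case, where the LASSO fit reduces to component-wise soft-thresholding: the derivative sum in (8) then counts the active components. The variable names and the toy data are ours.

```python
import numpy as np

def soft(v, lam):
    """One-dimensional l1 (soft-thresholding) estimator."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

y = np.array([3.0, -0.5, 1.2, 0.1, -2.0])
lam, eps = 1.0, 1e-6
yhat = soft(y, lam)
df_active = np.count_nonzero(yhat)      # l1 GDF estimate: active components
# estimator (8) evaluated by central finite differences
df_sure = sum(
    (soft(y + eps * e, lam)[i] - soft(y - eps * e, lam)[i]) / (2 * eps)
    for i, e in enumerate(np.eye(len(y)))
)
```

Away from the thresholds the derivative of each fitted component with respect to its own response is 1 if the component is active and 0 otherwise, so the two quantities coincide.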

In this paper, we propose a method to construct the estimator of the prediction error based on Stein's lemma using AMP. In principle, the method can be applied to any penalty $\Omega$, and the extra computational cost of obtaining the fixed points is less than that of other methods.

2.4 Sparse penalties considered in this paper

Figure 1: Behaviour of the estimator for (a) $\ell_1$, (b) SCAD, and (c) MCP.

The penalty functions considered in this paper are the $\ell_1$ norm, the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). Under these regularizations, we explain the properties of the estimator using the one-dimensional problem

$$\hat{\beta}(y; \lambda) = \arg\min_{\beta} \left[ \frac{1}{2} (y - \beta)^2 + \Omega(\beta; \lambda) \right], \tag{12}$$

where $y \in \mathbb{R}$ is the training sample. The ordinary least squares (OLS) estimator, which minimizes the squared error between $y$ and $\beta$, is $\hat{\beta}_{\mathrm{OLS}} = y$ for any $y$. The penalty term changes the estimator from the OLS estimator. The penalties considered here give analytical solutions of (12). As explained in Sec. 4, the estimates given by AMP reduce to the one-body problem (12) with an effective training sample and effective regularization parameters. We summarize the solutions of (12) in the following subsections.

2.4.1 $\ell_1$ penalty

The $\ell_1$ penalty is given by

$$\Omega(\beta; \lambda) = \lambda |\beta|,$$

where $\lambda > 0$ is the regularization parameter determined by the model selection criterion. The solution of (12) under the $\ell_1$ penalty is the soft-thresholding estimator

$$\hat{\beta}_{\ell_1}(y; \lambda) = \mathrm{sgn}(y) \max(|y| - \lambda, 0),$$

where $\mathrm{sgn}(y)$ denotes the sign of $y$. The behaviour of the estimator is shown in Figure 1(a). Across the whole region of $y$, the estimator is shrunk from the ordinary least squares (OLS) estimator, denoted by the dashed line.
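The soft-thresholding estimator and its shrinkage behaviour can be verified in a few lines; the sketch below (names are ours) checks the properties just described on a grid of training samples.

```python
import numpy as np

def eta_l1(y, lam):
    """Solution of the one-dimensional problem (12) under the l1 penalty."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

ygrid = np.linspace(-4.0, 4.0, 81)   # grid of one-dimensional training samples
bhat = eta_l1(ygrid, 1.0)            # l1 estimate at lam = 1
```

The estimate is zero inside $[-\lambda, \lambda]$ and shifted toward zero by $\lambda$ outside, so it is shrunk from the OLS estimator $\hat{\beta}_{\mathrm{OLS}} = y$ everywhere.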

2.4.2 SCAD penalty

The SCAD penalty is defined by [8]

$$\Omega(\beta; \lambda, a) = \begin{cases} \lambda |\beta| & (|\beta| \le \lambda) \\ \dfrac{2a\lambda|\beta| - \beta^2 - \lambda^2}{2(a-1)} & (\lambda < |\beta| \le a\lambda) \\ \dfrac{(a+1)\lambda^2}{2} & (|\beta| > a\lambda), \end{cases}$$

where the regularization parameters are $\lambda > 0$ and $a > 2$. The solution of (12) under SCAD regularization is given by

$$\hat{\beta}_{\mathrm{SCAD}}(y; \lambda, a) = \begin{cases} \mathrm{sgn}(y)\max(|y| - \lambda, 0) & (|y| \le 2\lambda) \\ \dfrac{(a-1)y - \mathrm{sgn}(y)\, a\lambda}{a-2} & (2\lambda < |y| \le a\lambda) \\ y & (|y| > a\lambda). \end{cases}$$

The behaviour of the estimator under SCAD regularization is shown in Figure 1(b). The SCAD estimator behaves like the $\ell_1$ and OLS estimators when $|y| \le 2\lambda$ and $|y| > a\lambda$, respectively. In the region $2\lambda < |y| \le a\lambda$, the estimator linearly transits between the $\ell_1$ and OLS estimators. We call this region the transient region of SCAD.
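The piecewise SCAD solution can be implemented directly; a minimal sketch (the function name is ours, and the default $a = 3.7$ is the value commonly recommended in the literature, not one taken from this paper):

```python
import numpy as np

def eta_scad(y, lam, a=3.7):
    """One-dimensional SCAD estimator (requires a > 2)."""
    y = np.asarray(y, dtype=float)
    soft = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)      # l1 region
    transient = ((a - 1) * y - np.sign(y) * a * lam) / (a - 2)  # linear transit
    return np.where(np.abs(y) <= 2 * lam, soft,
                    np.where(np.abs(y) <= a * lam, transient, y))
```

The estimator is continuous at the region boundaries $|y| = 2\lambda$ and $|y| = a\lambda$, matching the soft-thresholding and OLS branches there.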

2.4.3 MCP

The MCP is defined by [9]

$$\Omega(\beta; \lambda, \gamma) = \begin{cases} \lambda|\beta| - \dfrac{\beta^2}{2\gamma} & (|\beta| \le \gamma\lambda) \\ \dfrac{\gamma\lambda^2}{2} & (|\beta| > \gamma\lambda), \end{cases}$$

where $\gamma > 1$. The estimator of (12) under MCP is given by

$$\hat{\beta}_{\mathrm{MCP}}(y; \lambda, \gamma) = \begin{cases} \dfrac{\gamma}{\gamma - 1}\, \mathrm{sgn}(y)\max(|y| - \lambda, 0) & (|y| \le \gamma\lambda) \\ y & (|y| > \gamma\lambda). \end{cases}$$

Figure 1(c) shows the behaviour of the MCP estimator. The MCP estimator behaves like the OLS estimator when $|y| > \gamma\lambda$, and is connected from zero to the OLS estimator in the region $\lambda < |y| \le \gamma\lambda$. We call this the transient region of MCP.
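The MCP estimator admits an equally compact sketch (the function name and the default $\gamma$ are our choices):

```python
import numpy as np

def eta_mcp(y, lam, gamma=3.0):
    """One-dimensional MCP estimator (requires gamma > 1)."""
    y = np.asarray(y, dtype=float)
    # rescaled soft-thresholding inside the transient region
    scaled_soft = (np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
                   / (1.0 - 1.0 / gamma))
    return np.where(np.abs(y) <= gamma * lam, scaled_soft, y)
```

Inside the transient region the soft-thresholded value is inflated by the factor $\gamma/(\gamma-1)$, which makes the estimate meet the OLS line continuously at $|y| = \gamma\lambda$.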

3 Problem setting

Problem (1) discussed in this paper is equivalent to searching for a ground state, but here we extend the problem to finite temperature by introducing the inverse temperature $\nu$. We construct the posterior distribution at finite $\nu$, and extract the ground state, corresponding to the maximum a posteriori (MAP) estimator, in the limit $\nu \to \infty$.

The likelihood we consider here is the Gaussian distribution given by

$$P(y \mid X\beta) = \left( \frac{\nu}{2\pi} \right)^{M/2} \exp\left( -\frac{\nu}{2} \| y - X\beta \|_2^2 \right), \tag{25}$$

where $\nu$ is the inverse temperature. The distribution (25) depends on the parameter $\beta$ only through the form of $u = X\beta$; hence, we hereafter denote (25) as $P(y \mid u)$, where $u = X\beta$. The prior distribution of $\beta$ is given by the penalty function and the inverse temperature as

$$P(\beta) \propto \exp\left( -\nu\, \Omega(\beta; \lambda) \right), \tag{26}$$

and hence the posterior distribution of $\beta$ is given by

$$P(\beta \mid y) = \frac{P(y \mid X\beta)\, P(\beta)}{Z(y)} \tag{27}$$

following Bayes' formula, where $Z(y)$ is the normalization constant. The estimate for the solution of (1) is expressed as

$$\hat{\beta} = \lim_{\nu \to \infty} \langle \beta \rangle_{\nu}, \tag{28}$$

where $\langle \cdot \rangle_{\nu}$ denotes the expectation according to (27) at inverse temperature $\nu$. As $\nu \to \infty$, (27) becomes the uniform distribution over the minimizers of (1). Hence, when problem (1) has a unique minimizer, it corresponds to the estimate (28).

The exact computation of the expectation in (28) is intractable. Thus, we resort to the approximate message passing (AMP) algorithm to approximate (28), a technique that is widely used in signal processing and statistical physics [20, 21, 22, 23].

4 Approximate message passing

In this section, we briefly summarize AMP for penalized regression problems. The explanation here only covers the aspects required to reach the result shown in Sec. 5; a detailed derivation is given in [24]. In the derivation of AMP, we introduce the following assumptions for sufficiently large $M$ and $N$.


A1. All components of the predictor matrix $X$ are $O(1/\sqrt{M})$.

A2. The correlation between predictors is negligible: the off-diagonal components of $X^{\top} X$ are negligibly small compared with the diagonal components.

The message passing algorithm defined for the probability distribution (27) is expressed by messages that are propagated from factors to variables and from variables to factors according to


Here, $\partial a$ and $\partial i$ represent the sets of variable nodes and factor nodes connected to the $a$-th factor node and the $i$-th variable node, respectively, and $\partial a \backslash i$ denotes the set $\partial a$ with $i$ excluded. Using the messages, the marginal distribution is given by


Updating the messages in the form of probability distributions is intractable, and so we represent the distributions by their means and rescaled variances, where the rescaled variance is the variance multiplied by the inverse temperature, assuming that these quantities can be defined:


Further, we define the mean and rescaled variance with respect to the marginal distribution as


where the $i$-th component of the estimate (28), denoted by $\hat{\beta}_i$, corresponds to the marginal mean in the zero-temperature limit. Following the calculation under assumptions A1 and A2, the marginal distribution is derived as [24]




and $Z_i$ is the normalization constant required for the marginal distribution to integrate to one. Equation (36) suggests that the product of messages from the factors connected to the $i$-th variable node corresponds to a Gaussian distribution whose mean and variance are determined by the messages. The functions appearing in (36) are given by


where we define


and the output function, which contains the dependence on the likelihood, is defined as


Under assumption A1, the following relationship holds [24]:


which leads to


4.1 Specific form for penalized linear regression

Substituting the output function for linear regression given by


into Eqs. (40) and (41), we obtain their specific forms as


Using these equations, the mean and variance parameters are given by


In summary, the AMP algorithm consists of (52)–(57).


where we define


and note that


In the zero-temperature limit, the integral of (58) can be computed by the saddle point method as


which corresponds to the one-dimensional problem (12) with an effective training sample and effective regularization parameters. Under the penalties considered here ($\ell_1$, SCAD, and MCP), these functions can be computed analytically, and have the form


where the subscript denotes the dependence on the regularization.
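Putting the pieces together, the AMP iteration for the $\ell_1$ case can be sketched as follows. This is an illustrative implementation, not the paper's exact parameterization: the threshold is set to a multiple of the estimated residual level (a common heuristic), and assumption A1 is imposed by drawing the predictors with variance $1/M$.

```python
import numpy as np

def amp_l1(X, y, alpha=1.5, n_iter=50):
    """AMP iteration for l1-penalized linear regression, a minimal sketch.
    Assumes X has i.i.d. entries of variance 1/M (assumption A1)."""
    M, N = X.shape
    b = np.zeros(N)
    z = y.copy()
    for _ in range(n_iter):
        theta = alpha * np.linalg.norm(z) / np.sqrt(M)   # noise-level threshold
        s = b + X.T @ z                                  # effective one-body data
        b_new = np.sign(s) * np.maximum(np.abs(s) - theta, 0.0)
        onsager = (N / M) * np.mean(np.abs(s) > theta)   # Onsager coefficient
        z = y - X @ b_new + onsager * z                  # corrected residual
        b = b_new
    return b

# noiseless sparse recovery demonstration
rng = np.random.default_rng(0)
M, N, k = 250, 500, 25
X = rng.normal(size=(M, N)) / np.sqrt(M)
x0 = np.zeros(N)
x0[:k] = rng.choice([-1.0, 1.0], size=k)
y = X @ x0
x_hat = amp_l1(X, y)
```

On i.i.d. Gaussian predictors this iteration typically converges in a few tens of sweeps; for strongly correlated predictors the Onsager correction is no longer justified (assumption A2).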

5 AMP-based estimator

At the AMP fixed point, Stein’s lemma implies that the covariance between the response variable and its fit can be expressed as


The dominant dependence of the estimate on the response variable is induced through the mean parameter. Hence, using (60),


Substituting the definition (42) into (55), we obtain


where the second term can be ignored under assumption A1:

Hence, we have


where (65) is used to derive (68) from (67). Substituting (68) into (65), the second term of (68) can be ignored, and we obtain


This means that


is the AMP expression of the unbiased estimator of GDF, and we define the corresponding estimator of the prediction error as


These expressions do not depend on the form of the penalty function $\Omega$. In deriving (70), note that assumptions A1 and A2 are required; hence, unbiasedness is not generally guaranteed.
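A minimal Monte Carlo check of the unbiasedness claim, assuming an orthogonal design ($X = I$) so that the $\ell_1$ fit is soft-thresholding and the GDF estimate is the number of active components; the setup values are our choices.

```python
import numpy as np

rng = np.random.default_rng(2)
M, sigma, lam, trials = 1000, 1.0, 1.0, 100
x0 = np.zeros(M)
x0[:100] = 2.0                            # sparse ground truth

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

est, pred = [], []
for _ in range(trials):
    y = x0 + sigma * rng.normal(size=M)
    yhat = soft(y, lam)                   # l1 fit on the training response
    df_hat = np.count_nonzero(yhat)       # SURE-type estimate of GDF
    train = np.mean((y - yhat) ** 2)
    est.append(train + 2.0 * sigma**2 * df_hat / M)   # estimator (9)
    y_new = x0 + sigma * rng.normal(size=M)           # independent test response
    pred.append(np.mean((y_new - yhat) ** 2))
```

Averaged over trials, the estimate (training error plus $2\sigma^2 \widehat{\mathrm{df}}/M$) matches the prediction error measured on the independent test response, as Stein's lemma guarantees for Gaussian data.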

6 Asymptotic behaviour of the AMP-based estimator for Gaussian predictor matrix

When the components of the predictor matrix and the data are i.i.d. according to a Gaussian distribution, the asymptotic properties of the AMP fixed point can be described by the replica method, where $M \to \infty$ and $N \to \infty$ while the ratio $M/N$ remains $O(1)$. Here, we concentrate on the case in which the response vector has no underlying sparse structure, but the discussion can be extended to the case in which the generative model of $y$ contains a “true” sparse expression such as $y = X\beta_0 + \xi$, where $\xi$ is a noise vector.

6.1 Replica analysis

The basis of replica analysis is the derivation of the free energy density, which is defined by


The free energy density and the physical quantities derived from it are averaged over the predictor matrix and the response variable, a procedure that is less common in the context of statistics. This averaging is implemented with the purpose of describing the typical behaviour of the problem under consideration.

After the calculation under the replica symmetry assumption, the free energy density is derived as [24]


where $\mathrm{extr}$ denotes extremization with respect to the variational parameters. The function appearing in (73) is given by


where $z$ is a Gaussian random field that effectively represents the randomness of the problem introduced by $X$ and $y$. Equation (75) is equivalent to the one-dimensional problem (12) with an effective training sample and effective regularization parameters. The solution of (75) is given by


and is statistically equivalent to the solution of the original problem (1). Therefore, the expected density of nonzero components in the estimate is given by


where $\mathbb{I}[\cdot]$ is the indicator function, which takes the value 1 when its argument is satisfied and 0 otherwise. At the extremum of (73), the variables are given by


Note that the functional forms of these parameters do not depend on the regularization, but their values are regularization-dependent.

6.2 GDF for Gaussian random predictors

As discussed in [10], the expression of GDF for Gaussian random predictors using the replica method is given by


for any regularization. The specific form of this expression depends on the regularization. We summarize the form of GDF for three regularizations.

6.3 $\ell_1$ penalty

For the $\ell_1$ penalty, the saddle point equation is given by


where we define


Substituting (83) and (81) into (82), GDF for $\ell_1$ regularization is given by


Equation (85) is the rescaled number of parameters, and coincides with the well-known result of GDF for LASSO [7], for which AIC is the unbiased estimator of the prediction error.

6.4 SCAD

The saddle point equation for SCAD regularization is given by




Here, the additional order parameter corresponds to the fraction of components of the regression coefficients that lie in the transient region of SCAD (Sec. 2.4.2). Equation (86) leads to the following expression for GDF:


The coefficient of the second term corresponds to the rescaled variance of the estimate in the transient region. From (89), the difference between the prediction error and AIC is given by


This result means that AIC underestimates the prediction error when SCAD regularization is used, and the transient region contributes to the increase of GDF.

6.5 MCP

The saddle point equation for MCP regularization is given by




where the additional order parameter is the fraction of regression coefficients in the transient region of MCP (Sec. 2.4.3). Equation (91) leads to


where the coefficient corresponds to the rescaled variance of the estimate in the transient region. Hence, the difference between the prediction error and AIC is given by