High-dimensional regression models have been studied extensively in both machine learning and statistical literature. Statistical inference in high-dimensions, where the sample sizeis smaller than the ambient dimension
, is impossible without assumptions. As the concept of parsimony is important in many scientific domains, most of the research in the area of high-dimensional statistical inference is done under the assumption that the underlying model is sparse, in the sense that the number of relevant parameters is much smaller than, or that it can be well approximated by a sparse model.
Penalization of the empirical loss by the norm has become a popular tool for obtaining sparse models and huge amount of literature exists on theoretical properties of estimation procedures (see,e.g., zhao06model; wainwright06sharp; zhang09some; zhang08sparsity, and references therein) and on efficient algorithms that numerically find estimates (see bach11optimization, for an extensive literature review). Due to limitations of the norm penalization, high-dimensional inference methods based on the class of concave penalties have been proposed that have better theoretical and numerical properties (see,e.g., fan01variable; fan09nonconcave; lv09unified; zhang11general).
In all of the above cited work, the main focus is on model selection and mean parameter estimation. Only few papers deal with estimation of the variance in high-dimensions (sun11scaled; fan12variance) although it is a fundamental problem in statistics. Variance appears in the confidence bounds on estimated regression coefficients and is important for variable selection as it appears in Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC). Furthermore, it provides confidence on the predictive performance of a forecaster.
In applied regression it is often the case that the error variance is non-constant. Although the assumption of a constant variance can sometimes be achieved by transforming the dependent variable, e.g., by using a Box-Cox transformation, in many cases transformation does not produce a constant error variance (carroll88transformation). Another approach is to ignore the heterogeneous variance and use standard estimation techniques, but such estimators are less efficient. Aside from its use in reweighting schemes, estimating variance is important because the resulting prediction intervals become more accurate and it is often important to explore which input variables drive the variance. In this paper, we will model the variance directly as a parametric function of the explanatory variables.
Heteroscedastic regression models are used in a variety of fields ranging from biostatistics to econometrics, finance and quality control in manufacturing. In this paper, we study penalized estimation in high-dimensional heteroscedastic linear regression models, where the mean and the log variance are modeled as a linear combination of explanatory variables. Modeling the log variance as a linear combination of the explanatory variables is a common choice as it guarantees positivity and is also capable of capturing variance that may vary over several orders of magnitudes carroll88transformation; harvey76. Main contributions of this paper are as follows. First, we propose HIPPO (Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer) for estimation of both the mean and variance parameters. Second, we establish the oracle property (in the sense of fan09nonconcave) for the estimated mean and variance parameters. Finally, we demonstrate numerical properties of the proposed procedure on a simulation study, where it is shown that HIPPO outperforms other methods, and analyze a real data set.
1.1 Problem Setup and Notation
Consider the usual heteroscedastic linear model,
where is an matrix of predictors with rows , is an
-vector of responses, the vectorsand are -vectors of mean and variance parameters, respectively, and is an -vector of random noise with mean and variance . We assume that the noise is independent of the predictors . The function has a known parametric form and, for simplicity of presentation, we assume that it takes a particular form .
Throughout the paper we use to denote the set . For any index set , we denote to be the subvector containing the components of the vector indexed by the set , and denotes the submatrix containing the columns of indexed by . For a vector , we denote the support set, , , the -norm defined as with the usual extensions for , that is, and . For notational simplicity, we denote the norm. For a matrix we denote the operator norm, the Frobenius norm, and and
denote the smallest and largest eigenvalue respectively.
Under the model in (1), we are interested in estimating both and . In high-dimensions, when , it is common to assume that the support is small, that is, and . Similarly, we assume that the support is small.
1.2 Related Work
Consider the model (1) with constant variance, i.e., . Most of the existing high-dimensional literature is focused on estimation of the mean parameter
in this homoscedastic regression model. Under a variety of assumptions and regularity conditions, any penalized estimation procedure mentioned in introduction can, in theory, select the correct sparse model with probability tending to. Literature on variance estimation is not as developed. fan12variance proposed a two step procedure for estimation of the unknown variance , while sun11scaled proposed an estimation procedure that jointly estimates the model and the variance.
Problem of estimation in the heteroscedastic linear regression models have been studied extensively in the classical setting with fixed, however, the problem of estimation under the model (1) when has not been adequately studied. jia10lasso assume that and show that Lasso is sign consistent for the mean parameter under certain conditions. Their study shows limitations of lasso, for which many highly scalable solvers exist. However, no new methodology is developed, as the authors acknowledge that the log-likelihood function is highly non-convex. dette11adaptive study the adaptive lasso under the model in (1). Under certain regularity conditions, they show that the adaptive lasso is consistent, with suboptimal asymptotic variance. However, the weighted adaptive lasso is both consistent and achieves optimal asymptotic variance, under the assumption that the variance function is consistently estimated. However, they do not discuss how to obtain an estimator of the variance function in a principled way and resort to an ad-hoc fitting of the residuals. daye11high develop HHR procedure that optimizes the penalized log-likelihood under (1) with the -norm penalty on both the mean and variance parameters. As the objective is not convex, HHR estimates with fixed and then estimates with fixed, until convergence. Since the objective is biconvex, HHR converges to a stationary point. However, no theory is provided for the final estimates.
In this paper, we propose HIPPO (Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer) for estimating and under model (1).
In the first step, HIPPO finds the penalized pseudolikelihood maximizer of by solving the following objective
where is the penalty function and the tuning parameter controls the sparsity of the solution .
In the second step, HIPPO forms the penalized pseudolikelihood estimate for by solving
where is the vector of residuals.
Finally, HIPPO computes the reweighted estimator of the mean by solving
where are the weights.
In classical literature, estimation under heteroscedastic models is achieved by employing a pseudolikelihood objective. The pseudolikelihood maximization principle prescribes the scientist to maximize a surrogate likelihood, i.e. one that is believed to be similar to the likelihood with the true unknown fixed variances (or means alternatively). In classical theory, central limit theorems are derived for many pseudo-maximum likelihood (PML) estimators using generalized estimating equationszieglerGEE. HIPPO fits neatly into the pseudolikelihood framework because the first step is a regularized PML where only the mean structure needs to be correctly specified. The second step and third steps may be similarly cast as PML estimators. Indeed, all our theoretical results are due to the fact that in each step we are optimizing a pseudolikelihood that is similar to the true unknown likelihoods (with alternating free parameters). Moreover, it is known that if the surrogate variances in the mean PML are more similar to the true variances then the resulting estimates will be more asymptotically efficient. With this in mind, we recommend a third reweighting procedure with the variance estimates from the second step.
fan01variable advocate usage of penalty functions that result in estimates satisfying three properties: unbiasedness, sparsity and continuity. A reasonable estimator should correctly identify the support of the true parameter with probability converging to one. Furthermore, on this support, the estimated coefficients should have the same asymptotic distribution as if an estimator that knew the true support was used. Such an estimator satisfies the oracle property. A number of concave penalties result in estimates that satisfy this property: the SCAD penalty (fan01variable), the MCP penalty (zhang2010nearly) and a class of folded concave penalties lv09unified. For concreteness, we choose to use the SCAD penalty, which is defined by its derivative
where often is used. Note that estimates produced by the -norm penalty are biased, and hence this penalty does not achieve oracle property.
HIPPO is related to the iterative HHR algorithm of daye11high. In particular, the first two iterations of HHR are equivalent to HIPPO with the SCAD penalty replaced with the norm penalty. In practice, one can continue iterating between solving (3) and (4), however, establishing theoretical properties for those iterates is a non-trivial task. From our numerical studies, we observe that HIPPO performs well when stopped after the first two iterations.
2.1 Tuning Parameter Selection
As described in the previous section, HIPPO requires selection of the tuning parameters and , which balance the complexity of the estimated model and the fit to data. A common approach is to form a grid of candidate values for the tuning parameters and and chose those that minimize the AIC or BIC criterion
where, up to constants,
is the negative log-likelihood and
is the estimated degrees of freedom. In Section4, we compare performance of the AIC and the BIC for HIPPO in a simulation study.
2.2 Optimization Procedure
In this section, we describe numerical procedures used to solve optimization problems in (2), (3) and (4). Our procedures are based on the local linear approximation for the SCAD penalty developed in zou08onestep, which gives:
and iteratively solve each objective until convergence of . We set the initial estimates and to be the solutions of the -norm penalized problems. The convergence of these iterative approximations follows from the convergence of the MM (minorize-maximize) algorithms (zou08onestep).
3 Theoretical Properties of HIPPO
In this section, we present theoretical properties of HIPPO. In particular, we show that HIPPO achieves the oracle property for estimating the mean and variance under the model (1). All the proofs are deferred to Appendix.
We will analyze HIPPO under the following assumptions, which are standard in the literature on high-dimensional statistical learning (see, e.g. fan12variance).
Assumption 1. The matrix has independent rows that satisfy where
are subgaussian random variables with, and parameter (see Appendix for more details on subgaussian random variables). Furthermore, there exist two constants such that
Assumption 2. The errors are subgaussian with zero mean and parameter .
Assumption 3. There are two constants and such that and .
Assumption 4. and for some and and constants .
The following assumption will be needed for showing the consistency of the weighted estimator in (4).
Assumption 5. Define
There exist constants such that
Furthermore, we have that
With these assumption, we state our first result, regarding the estimator in (2).
Suppose that the assumptions (1)-(4) are satisfied. Furthermore, assume that , and for some . Then there is a strict local minimizer of (2) that satisfies
for some positive constants , and and sufficiently large .
In addition, if we suppose that assumption (5) is satisfied, then for any fixed with the following weak convergence holds
where . The first result stated in Theorem 3 established that achieves the weak oracle property in the sense of lv09unified. The extra term is subpolynomial in and appears in the bound (9) due to the heteroscedastic nature of the errors. The second result establishes the strong oracle property of the estimator in the sense of fan09nonconcave, that is, we establish the asymptotic normality on the true support . The asymptotic normality shows that
has the same asymptotic variance as the ordinary least squares (OLS) estimator on the true support. However, in the case of a heteroscedastic model the OLS estimator is dominated by the generalized least squares estimator. Later in this section, we will demonstrate thathas better asymptotic variance. Note that correctly selects the mean model and estimates the parameters at the correct rate. From the upper and lower bounds on , we see how the rate at which can grow and the minimum coefficient size are related. Larger the ambient dimension gets, larger the size of , which lower bounds the size of the minimum coefficient.
Our next result establishes correct model selection for the variance parameter . Suppose that assumptions (1)-(5) are satisfied. Suppose further that and . Then there is a strict local minimizer with the strong oracle property,
Morover, for any fixed with the following weak convergence holds
With the convergence result of we can prove consistency and asymptotic normality of the weighted estimator in (4).
Suppose that the assumptions (1)-(5) are satisfied and that there exists an estimator satisfying , for a sequence and . Furthermore, assume that , and for some . Then there is a strict local minimizer of (4) that satisfies
for some positive constants , and and sufficiently large .
Furthermore, for any fixed with the following weak convergence holds
Theorem 3 establishes convergence of the weighted estimator in (4) and the model selection consistency. The rate of convergence depends on the rate of convergence of the variance estimator, . From Theorem 3, we show the parametric rate of convergence for . The second result of Theorem 3 states that the weighted estimator is asymptotically normal, with the same asymptotic variance as the generalized least squares estimator which knows the true model and variance function .
4 Monte-Carlo Simulations
In this section, we conduct two small scale simulation studies to demonstrate finite sample performance of HIPPO . We compare it to the HHR procedure (daye11high).
Convergence of the parameters is measured in the norm, and . We measure the identification of the support of and
using precision and recall. Letdenote the estimated set of non-zero coefficients of , then the precision is calculated as and the recall as . Similarly, we can define precision and recall for the variance coefficients. We report results averaged over 100 independent runs.
4.1 Example 1
Assume that the data is generated iid from the following model where
follows a standard normal distribution and the logarithm of the variance is given by
The covariates associated with the variance are jointly normal with equal correlation , and marginally . The remaining covariates, are iid random variables following the standard Normal distribution and are independent from . We set and use and . Estimation procedures know that and we only estimate the variance parameter . This example is provided to illustrate performance of the penalized pseudolikelihood estimators in an idealized situation. When the mean parameter needs to be estimated as well, we expect the performance of the procedures only to get worse. Since the mean is known, both HHR and HIPPO only solve the optimization procedure in (3), HHR with the -norm penalty and HIPPO with the SCAD penalty, without iterating between (4) and (3).
Table 1 summarizes the results. Under this toy model, we observe that HIPPO performs better than HHR when the correlation between the relevant predictors is . However, we do not observe the difference between the two procedures when . The difference between the AIC and BIC is already visible in this example when . The AIC tends to pick more complex models, while the BIC is more conservative and selects a model with fewer variables.
4.2 Example 2
The following non-trivial model is borrowed from daye11high
. The response variablesatisfies
with , , ,
and the remainder of the coefficients are . The covariates are jointly Normal with and the error follows the standard Normal distribution. This is a more realistic model than the one described in the previous example. We set and the number of samples and .
Table 2 summarizes results of the simulation. We observe that HIPPO consistently outperforms HHR in all scenarios. Again, a general observation is that the AIC selects more complex models although the difference is less pronounced when the sample size . Furthermore, we note that the estimation error significantly reduces after the first iteration, which demonstrates final sample benefits of estimating the variance. Recall that Theorem 3 proves that the estimate consistently estimates the true parameter . However, it is important to estimate the variance parameter well, both in theory (see Theorem 3) and practice.
5 Real Data Application
Forecasting the gross domestic product (GDP) of a country based on macroeconomic indicators is of significant interest to the economic community. We obtain both the country GDP figures (specifically we use the GDP per capita using current prices in units of a ‘national currency’) and macroeconomic variables from the International Monetary Fund’s World Economic Outlook (WEO) database. The WEO database contains records for macroeconomic variables from 1980 to 2016 (with forecasts).
To form our response variable, , we form log-returns of the GDP for each country () for each time point () after records began and before the forecasting commenced (each country had a different year at which forecasting began). After removing missing values, we obtained 31 variables that can be grouped into a few broad categories: balance of payments, government finance and debt, inflation, and demographics. We apply various transformations, including lagging and logarithms forming the vectors . We fit the heteroscedastic AR(1) model with HIPPO.
In order to initially assess the heteroscedasticity of the data, we form the LASSO estimator with the LARS package in R selecting with BIC. It is common practice when diagnosing heteroscedasticity to plot the studentized residuals against the fitted values. We bin the bulk of the samples into three groups by fitted values, and observe the box-plot of each bin by residuals (Figure 2). It is apparent that there is a difference of variances between these bins, which is corroborated by performing a F-test of equal variances across the second and third bins (p-value of ). We further observe differences of variance between country GDP log returns. We analyzed the distribution of responses separated by countries: Canada, Finland, Greece and Italy. The p-value from the F-test for equality of variances between the countries Canada and Greece is , which is below even the pairwise Bonferroni correction of at significance level. This demonstrates heteroscedasticity in the WEO dataset, and we are justified in fitting non-constant variance.
We compare the results from HIPPO and HHR when applied to the WEO data set. The tuning parameters were selected with BIC over a grid for and . The metrics used to compare the algorithms are mean square error (MSE) defined by , the partial prediction score defined as the average of the negative log likelihoods, and the number of selected mean parameters and variance parameters. We perform
-fold cross validation to obtain unbiased estimates of these metrics. In Table3 we observe that HIPPO outperforms HHR in terms of MSE and partial prediction score.
We have addressed the problem of statistical inference in high-dimensional linear regression models with heteroscedastic errors. Heteroscedastic errors arise in many applications and industrial settings, including biostatistics, finance and quality control in manufacturing. We have proposed HIPPO for model selection and estimation of both the mean and variance parameters under a heteroscedastic model. HIPPO can be deployed naturally into an existing data analysis work-flow. Specifically, as a first step, a statistician performs penalized estimation of the mean parameters and then, as a second step, tests for heteroscedasticity by running the second step of HIPPO. If heteroscedasticity is discovered, HIPPO can then be used to solve penalized generalized least squares objective. Furthermore, HIPPO is well motivated from the penalized pseudolikelihood maximization perspective and achieves the oracle property in high-dimensional problems.
Throughout the paper, we focus on a specific parametric form of the variance function for simplicity of presentation. Our method can be extended to any parametric form, however, the assumptions will become more cumbersome and the particular numerical procedure would change. It is of interest to develop general unified framework for estimation of arbitrary parametric form of the variance function. Another open research direction includes non-parametric estimation of the variance function in high-dimensions, which could be achieved with sparse additive models (see ravikumar2009sparse).
MK is partially supported through the grants NIH R01GM087694 and AFOSR FA9550010247. JS is partially supported by AFOSR under grant FA95501010382.
In the appendix, we collect some well known results and provide proofs for the results in the main text.
For readers convenience, we summarize the notation again. We use to denote the set . For any index set , we denote to be the subvector containing the components of the vector indexed by the set , and denotes the submatrix containing the columns of indexed by . For a vector , we denote the support set, , , the -norm defined as with the usual extensions for , that is, and . For notational simplicity, we denote the norm. The unit sphere of is denoted by . The canonical bases of we denote by . For a matrix we denote the operator norm and the Frobenius norm. For a symmetric matrix , we use to denote the smallest and largest eigenvalue respectively.
7.1 Subgaussian random variables
In this section, we define subgaussian random variables and state a few well known properties.
We denote the norm of a random variable , i.e., . Define the Orlitz function , . Using the Orlitz function, we can define the Orlitz space of real valued random variables, , equipped with the norm
We will focus on the particular choice of . Define the set of real-valued symmetric random variables satisfying
For we have a good control of the tail probability
which can be obtained using the Markov inequality
since is symmetric and .
The space is the set of subgaussian random variables. A real-valued random variable is called subgaussian with parameter , , if
It follows from this bound on the moment generating function that the following bound on the tail probability holds
We also have that is subgaussian with parameter by direct calculation.
The following few facts are useful. Let and , , be independent variables, then for any , is subgaussian with parameter .
For any , we have
The claim follows from (16). ∎
Let and , , be independent variables, then for any ,
Using Markov inequality we have,
which concludes the proof. ∎
The notion of subgaussian random variable can be easily extended to vector random variables. Let be random variable satisfying , . The random variable is subgaussian with parameter if it satisfies
Let where positive definite matrix and the symmetric matrix square root. The following result is standard in multivariate statistics [anderson2003]. Let and , . Then
and is uncorrelated with . The following lemma shows that is subgaussian. The random variable defined in (20) is subgaussian with parameter , where .
From the definition of we have that and . With this, we have
This concludes the proof. ∎
Next, we present a few results on spectral norms of random matrices obtained as sums of random subgaussian vectors outer products. [hsu2011dimension] Let be i.i.d random subgaussian vectors with parameter , then for all ,
The above result can easily be extended to variables with arbitrary covariance matrix. Let be independent random vectors satisfying with being independent subgaussian vectors with parameter and is the symmetric matrix square root of . If and , then for all
7.2 Proofs and Technical Results
For convenience, we restate technical conditions used in the paper.
Assumption 1. The matrix has independent rows that satisfy where are subgaussian random variables with , and . Furthermore, there exist two constants such that
Assumption 2. The errors are with .
Assumption 3. There are two constants and such that and .
Assumption 4. and for some and and constants .
Assumption 5. Define
There exist constants such that
7.3 Proof of Theorem 3
We will split the proof in two parts. In the first part, we show that the vector , where , is a strict local minimizer of (2) and . In the second part, we use results for pseudo-maximum likelihood estimates to establish asymptotic normality of .
From Theorem 1 in fan09nonconcave, we need to show that satisfies
in order to show that is a strict local minimizer.
Define the events