Recently, high-dimensional data sets have become increasingly available to researchers in many fields. In economics, big data can be found in the analysis of consumer behavior based on scanner data from purchases. Furthermore, many macroeconomic variables are sampled rather infrequently, leaving one with many variables compared to observations in models with many explanatory variables. Financial data is also of a high-dimensional nature, with many variables and instruments being observed at short intervals due to high-frequency trading. Alternatively, models with many variables emerge when trying to control for non-linearities in a wage regression by including basis functions of the space in which the non-linearity is supposed to be found. Clearly, including more basis functions can result in better approximations of the non-linearity. However, this also results in a model with many variables, i.e. a high-dimensional model. For these reasons, handling high-dimensional data sets has received a lot of attention in the econometrics and statistics literature in recent years. In a seminal paper, Tibshirani (1996) introduced the Lasso estimator, which carries out variable selection and parameter estimation simultaneously. The theoretical properties of this estimator have been studied extensively since then in various papers, and extensions such as the adaptive Lasso by Zou (2006), the bridge estimator by Huang et al. (2008), the sure independence screening by Fan and Lv (2008) or the square root Lasso by Belloni et al. (2011) have been proposed. For recent reviews see, e.g., Fan et al. (2011), Bühlmann and van de Geer (2011) or Belloni and Chernozhukov (2011).
In the econometrics literature Lasso-type estimators have also proven useful. For example, Belloni et al. (2012) have established results in the context of instrumental variable estimation without imposing the hitherto much used assumption of sub-gaussianity by means of moderate deviation theorems for self-normalized random variables. Furthermore, they allow for heteroscedastic error terms, which is pathbreaking and greatly widens the scope of applicability of their results.
Applications to panel data may be found in, e.g., Kock (2013). The estimators have been studied in the context of GMM, factor models, and smooth penalties by, among others, Caner and Zhang (2013), Caner and Han (2013), Cheng and Liao (2013) and Fan and Li (2001). Within linear time series models, oracle inequalities have been established by Kock and Callot (2013), and Negahban et al. (2012) have proposed a unified framework which is valid for regression as well as matrix estimation problems.
Most research has considered the linear regression model or other parametric models. In this paper we shall focus on a very general setup. In particular, we will focus on penalized empirical loss minimization of convex loss functions with potentially non-linear target functions. van de Geer (2008) studied a similar setup for the Lasso, which is a special case of our results for the elastic net. Furthermore, even though our main focus is on non-asymptotic bounds, we also present asymptotic upper bounds on the excess risk and estimation error (the latter in the case where the target is linear). We also show how our results can be used to give new non-asymptotic upper bounds on penalized series estimators with many series terms.
In particular, we provide a finite sample oracle inequality for empirical risk minimization penalized by the elastic net penalty. This inequality is valid for convex loss functions and non-linear targets and contains an oracle inequality for the Lasso as a special case.
For the case where the target function is linear this oracle inequality can be used to establish finite sample upper bounds on the estimation error of the estimated parameter vector.
The finite sample inequality is used to establish asymptotic results. In particular, the excess risk of our estimator is of the same order as that of an oracle which trades off the approximation and estimation errors. When the target is linear we give sufficient conditions for consistency of the estimated parameter vector.
In the case where the target is linear we briefly explain how a thresholded version of our estimator can unveil the correct sparsity pattern.
We provide two examples of specific loss functions covered by our general framework. We verify in detail that the abstract conditions are satisfied in common settings. Then we show how nonparametric series estimation is contained as a special case of our theory and provide a finite sample upper bound on the mean square error of an elastic net series estimator. We explain why this series estimator may be more precise than classical series estimators.
We also note that when the loss function is quadratic the sample does not have to be identically distributed and our results are therefore also valid in the presence of heteroscedasticity.
We stress here that our main objective is to establish upper bounds on the performance of the elastic net. It is not our intention to promote either the Lasso or the elastic net, merely to analyze the properties of the latter. However, we shall make some brief comments on merits of the two procedures when compared to each other. A clear ranking like the one in Hebiri and van de Geer (2011) is not available at this point. However, these authors only focus on quadratic loss for which a certain data augmentation trick facilitates the analysis.
We believe that the performance guarantees on the elastic net provided by this paper are useful for the applied researcher who increasingly faces high-dimensional data sets. The usefulness is enhanced by the fact that our results are valid for a wide range of loss functions and that heteroscedasticity is allowed for when the loss function is quadratic.
The paper is organized as follows. Section 2 puts forward the setup and notation. Section 3 introduces the main result, the oracle inequality for empirical loss minimization of convex loss functions penalized by the elastic net. Section 4 briefly discusses consistent variable selection by a thresholded version of the elastic net. Tuning parameter selection is handled in Section 5. Section 6 shows that the quadratic as well as the logistic loss are covered by our framework and provides an oracle inequality for penalized series estimators.
2 Setup and notation
We begin by setting the stage for general convex loss minimization. The setup is similar to the Lasso one in Section 6.3 in Bühlmann and van de Geer (2011). Let be a standard probability space. Consider a sample with and . Here, for the sake of exposition, can be thought of as a subset of for but as we shall see below it can be much more general. Define and let be a normed real vector space with norm . For each let be a loss function. More precisely, will be a function and the corresponding norm will most often be the -norm (in fact will be the -norm in almost all our econometric examples; this choice of norm on is suitable when the sample is supposed to be i.i.d.). Furthermore, when is a vector in , denotes the -norm while denotes the -norm. Order symbols such as and are used with their usual meanings. Also, in accordance with the usual Landau notation, means that there exists a constant such that for sufficiently large. denotes the intersection of and and contains all functions that are exactly of order . Finally, for any abstract set , denotes its cardinality. Throughout the paper we shall assume:
Assumption 0: is an independent sample and the mapping is convex for all .
The following examples provide illustrations of when the conditions in Assumption 0 are met:
where is some real error term and . Then, the standard case of quadratic loss is covered by the above setting upon choosing which is clearly convex in . By letting only consist of linear functions for some the case of linear least squares is covered. Non-linear least squares is covered by choosing for some parameter vector . As we shall see in Section 6.1.1 this setup can also be used to obtain some new upper bounds on nonparametric series estimation.
where is independent of and assumed to have a logistic distribution while . Assume that if and otherwise. Since has cdf one gets
Note that for and for some parameter vector this is the usual expression for in the logit model. The above setting is more general, however, since it allows to be non-linear.
The log-likelihood function for a given is then given by (for )
Hence, a sensible loss function is the negative log-likelihood
which is convex in .
The above two examples are both instances of the loss function being the negative of the log-likelihood. Hence, in a general setting with the negative of the log-likelihood being a convex function in our results also apply. Again, a special case is .
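The two loss functions of the examples above can be written down directly. The following is a minimal sketch, assuming the logistic loss takes the usual negative log-likelihood form -y f(x) + log(1 + exp(f(x))) for a binary outcome y; the function names and the grid-based convexity check are illustrative and not part of the paper.

```python
import math

def quadratic_loss(f_x, y):
    """Quadratic loss (y - f(x))^2, evaluated at the function value f_x."""
    return (y - f_x) ** 2

def logistic_loss(f_x, y):
    """Negative logit log-likelihood -y*f + log(1 + exp(f)) for y in {0, 1}."""
    # Numerically stable evaluation of log(1 + exp(f_x)).
    softplus = max(f_x, 0.0) + math.log1p(math.exp(-abs(f_x)))
    return -y * f_x + softplus

def is_midpoint_convex(loss, y, grid, tol=1e-12):
    """Check loss((a+b)/2) <= (loss(a) + loss(b))/2 for all pairs on a grid."""
    return all(
        loss(0.5 * (a + b), y) <= 0.5 * loss(a, y) + 0.5 * loss(b, y) + tol
        for a in grid
        for b in grid
    )

grid = [k / 4.0 for k in range(-20, 21)]  # function values in [-5, 5]
quadratic_ok = all(is_midpoint_convex(quadratic_loss, y, grid) for y in (0.0, 1.0))
logistic_ok = all(is_midpoint_convex(logistic_loss, y, grid) for y in (0, 1))
```

The midpoint check is only a numerical sanity test on a finite grid; it illustrates, but does not prove, the convexity required in Assumption 0.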
Returning to the general setup, denote by and the empirical and population means of the loss function for a fixed . We shall also denote these two quantities the empirical and population risk, respectively. Note also, that in the case of identically distributed variables the population mean reduces to the plain expectation . We define our target as the minimizer of the theoretical risk
where it is tacitly assumed that the minimizer exists and is unique for the -norm on . Then, for any , we define the excess population risk over the target as
Note that, by construction, for all . Since the joint distribution of is assumed to be unknown we shall consider empirical risk minimization instead of minimizing the population excess risk. Put differently, is minimized. Furthermore, we will consider a linear subspace of where may be thought of as basis functions of . Of course, in the case where is a subset of , one could also think of as being the ’th coordinate projection. This is the choice we make whenever is assumed to be linear. In general, denotes a vector of (transformed) covariates. is a convex subset of – in many cases we can even have . In Section 6 we shall see an example of but also an example of being a subset of . In case the target function is linear we will denote its parameter vector by .
As we shall see, it is possible to prove upper bounds on the excess risk of a penalized version of the empirical risk minimizer even when is non-linear while we only minimize over the linear subspace . This is non-trivial since the target belongs to a large set () while we only minimize over a smaller set (). The following section gives an exact definition and discussion of our estimator.
2.1 The elastic net
where and are positive constants. Hence, we are minimizing the empirical risk plus an elastic net penalty. This form of penalty was originally introduced by Zou and Hastie (2005) in the case of a linear regression model. The penalty is a compromise between the -penalty of the plain Lasso and the squared -loss in ridge regression. Ridge regression does not perform variable selection at all – all estimated coefficients are non-zero. On the other hand, if two variables are highly correlated, the Lasso has a tendency to include only one of these. The elastic net strikes a balance between these two extremes and hence performs particularly well in the presence of highly correlated variables. This benefit has been formalized by Hebiri and van de Geer (2011) in the case of quadratic loss. In particular, they have shown that the elastic net behaves better with respect to certain restricted eigenvalue conditions than the plain Lasso.
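To make the estimator concrete, the following sketch computes an elastic net fit for the special case of quadratic loss with a linear candidate function. It assumes the objective is the empirical risk (1/n) Σ (y_i - x_i'β)² plus the penalty λ1‖β‖₁ + λ2‖β‖₂²; the names `lam1` and `lam2`, the helper functions, and the use of cyclic coordinate descent are illustrative choices made here, not the procedure of the paper.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z towards zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net_objective(X, y, beta, lam1, lam2):
    """Empirical quadratic risk plus lam1*||beta||_1 + lam2*||beta||_2^2."""
    n = len(y)
    risk = sum(
        (yi - sum(xij * bj for xij, bj in zip(xi, beta))) ** 2
        for xi, yi in zip(X, y)
    ) / n
    return risk + lam1 * sum(abs(b) for b in beta) + lam2 * sum(b * b for b in beta)

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=200):
    """Cyclic coordinate descent for the objective above (assumed form)."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            # Gradient information for coordinate j, using the partial
            # residual that adds coordinate j's contribution back in.
            rho = (2.0 / n) * sum(c * (r + c * beta[j]) for c, r in zip(col, resid))
            curv = (2.0 / n) * sum(c * c for c in col) + 2.0 * lam2
            new_b = soft_threshold(rho, lam1) / curv if curv > 0 else 0.0
            delta = new_b - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                beta[j] = new_b
    return beta

# Toy design with orthogonal columns; y is generated by beta = (2, 0).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [2.0, 0.0, 2.0, 2.0]
beta_ls = elastic_net_cd(X, y, lam1=0.0, lam2=0.0)   # plain least squares
beta_en = elastic_net_cd(X, y, lam1=1.0, lam2=0.5)   # shrunk and sparse
```

With both penalty parameters set to zero the routine reduces to least squares, while a positive `lam1` sets small coefficients exactly to zero and a positive `lam2` adds ridge-type shrinkage.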
We next turn to the assumptions needed to prove oracle inequalities for the elastic net. First, define for some where . (Footnote 2: Using the stronger topology on to define instead of turns out to be useful when verifying that the margin condition is satisfied with a quadratic margin in Section 6.) The margin condition requires that in the excess loss is bounded from below by a convex function of .
Definition. We say that the margin condition holds with strictly convex margin function , if for all we have
In all examples we shall consider it can be shown that the margin condition holds for for some such that for all , . More generally, we present a sufficient condition for to be quadratic in Section 6. The convex conjugate of will also play a role in the development of the oracle inequalities below. In particular, the following definition is taken from page 121 in Bühlmann and van de Geer (2011) and many more properties of convex conjugates can be found in Rockafellar (1997).
Definition. Let be a strictly convex function on with . The convex conjugate of is defined as
Lemma 3 in the appendix establishes some properties of . Note also that if , then . Furthermore, from the definition of the convex conjugate
which is also known as Fenchel’s inequality. Next, for any subset of and we define such that for . Letting denote the cardinality of we may define
Definition. The adaptive restricted eigenvalue condition is satisfied with if
where . As mentioned already, in many econometric examples one may choose to be the -norm. In this case, and if the covariates are also identically distributed (as will be assumed in our concrete examples in Section 6), where . Hence,
for all the adaptive restricted eigenvalue condition is satisfied in particular when the smallest eigenvalue of the population covariance matrix is positive. However, since the minimum in (2) is taken over a subset of only, we may have even when is singular. Note also that the adaptive restricted eigenvalue condition is used in various guises in the literature and is similar to the eigenvalue conditions of Bickel et al. (2009) and Hebiri and van de Geer (2011).
Before defining what we understand by the oracle estimator, define as the subset of containing the indices of the non-zero coefficients. Let denote the cardinality of this set. Then, letting denote a collection of subsets of , we define
Definition. The oracle estimator is defined as
Note that the definition of the oracle still leaves considerable freedom since is defined by the user – a property which we shall utilize later when considering linear targets (see e.g. Remark 2 after Theorem 1 below). In the case where equals the power set of the oracle estimator may equivalently be written as
The definition of the oracle in (3) turns out to be convenient for technical reasons but it also has a useful interpretation as a tradeoff between approximation and estimation error: In the standard setting of a quadratic loss function with a linear target, i.e. , it is known that the squared -estimation error of Lasso type estimators when estimating parameters, of which are non-zero, is of the order . In the case of quadratic loss in the beginning of this section one has if the sample is identically distributed and and are independent. So and hence are quadratic in the definition of the margin condition. Choosing and of the order , which are both choices we shall adhere to in the sequel, one finds that is of the order . This is exactly the estimation error under quadratic loss and motivates coining the estimation error term. The term is referred to as the approximation error and (3) shows that the oracle trades off these two terms: a lower approximation error can be obtained by increasing while this also implies estimating more parameters, resulting in a higher estimation error.
Finally, letting and we denote the oracle bound (value of the objective function minimized by the oracle) by
The inequality in Theorem 1 below will be valid on a random set which we introduce next. In Theorem 2 we shall show that this set actually has a high probability by means of a suitable concentration inequality for suprema of empirical processes. Define the empirical process
Next, we introduce a local supremum of the empirical process in incremental form
Then we define
where is a positive sequence and set
The set is the one we shall work on in Theorem 1 below. Note in particular that, on , cannot be larger than , which is the minimal value of the loss function of the oracle.
We are now ready to state our assumptions:
Assumption 1. Assume the margin condition with strictly convex function .
Assumption 2. Assume that and for all .
Assumption 3. Assume that the adaptive restricted eigenvalue condition holds for , i.e. .
As discussed above, the margin condition, Assumption 1, regulates the behavior of the excess risk function. When is equipped with the we will see in Section 6 that the margin condition is actually often satisfied with being quadratic. Put differently, the margin condition is satisfied in many examples with a quadratic margin.
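For a quadratic margin the convex conjugate has a simple closed form, which can be checked numerically. The sketch below assumes a margin function G(u) = c·u² with a hypothetical constant c > 0, for which the conjugate H(v) = sup_u {uv - G(u)} equals v²/(4c); the grid search and all names are illustrative only.

```python
def convex_conjugate(G, v, u_max=20.0, steps=20001):
    """Approximate H(v) = sup over u in [0, u_max] of u*v - G(u) on a grid."""
    best = float("-inf")
    for k in range(steps):
        u = u_max * k / (steps - 1)
        best = max(best, u * v - G(u))
    return best

c = 0.5  # hypothetical margin constant

def G(u):
    """Quadratic margin function G(u) = c * u^2."""
    return c * u * u

def H_closed_form(v):
    """Closed-form conjugate of the quadratic margin: v^2 / (4c)."""
    return v * v / (4.0 * c)

H_numeric = convex_conjugate(G, 3.0)

# Fenchel's inequality u*v <= G(u) + H(v) on a small grid of (u, v) pairs.
fenchel_holds = all(
    u * v <= G(u) + H_closed_form(v) + 1e-12
    for u in [0.0, 0.5, 1.0, 2.0, 7.5]
    for v in [0.0, 0.3, 1.0, 4.0, 9.0]
)
```

The grid maximum agrees with the closed form up to discretization error, and Fenchel's inequality holds with equality exactly when u = v/(2c).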
Assumption 2 is a technical condition which enables us to use the margin condition for . The first part requires that the oracle is a good approximation to in the sup-norm. Of course the validity of this statement depends on how well linear combinations of elements in can approximate . The validity also depends on the choice of in the definition of the oracle since the precise form of depends on this. For concrete choices of one can make proper choices of bases that guarantee the desired degree of approximation. Note in particular, that it follows from Remark 2 in Section 3, that when is linear one can choose such that and so, a fortiori, . The second part of Assumption 2 states that if is close to . This is rather innocent by the triangle inequality. We will give more detailed sufficient conditions for Assumption 2 in Section 6 for concrete econometric examples. In particular, if consists of sufficiently smooth functions (Footnote 3: To be concrete, we shall be considering a Hölder class of functions to be defined precisely in Section 6.), we shall exhibit concrete choices of bases and collections of sets such that approximates to the desired degree. Assumption 3 has been discussed above and is valid when is the -norm and has full rank.
3 An Oracle Inequality
In this section we extend Theorem 6.4 of Bühlmann and van de Geer (2011) from penalty (Lasso) to penalty (Elastic Net). This is not a trivial extension since the basic inequality used to establish the result has to be altered considerably. More precisely, the inequality that ties the estimator to the oracle has to be modified. The second difference is that we need to use the adaptive restricted eigenvalue condition, which is different from the compatibility condition used in the -case. Compared to the linear target with quadratic loss in Hebiri and van de Geer (2011) estimated by the elastic net, our proof cannot benefit from the augmented regressors idea since this idea relies crucially on the loss function being quadratic. In the case of a general convex loss function the proof technique is entirely different and we use the margin condition, Fenchel’s inequality, and a careful definition of the oracle instead. We would like to stress that Theorem 1 below is purely deterministic in the sense that there are no probabilities attached to it. It is valid on the set to which we shall later attach a lower bound on its probability. It also provides a finite sample result – i.e. the result is valid for any sample size and not just asymptotically.
Theorem 1. Suppose satisfies . Then on the set , under Assumptions 1-3, we have
where is the convex conjugate of the function in the margin condition.
Note that Theorem 1 provides an upper bound, , on the excess loss of in terms of the excess loss of the oracle as well as an extra term , the estimation error, which is hopefully not too big. We shall comment much more on this extra term in the sequel. Theorem 1 can also be used to give an upper bound on the -estimation error. Due to its importance the theorem warrants some detailed remarks.
1. The result of Theorem 1 reduces to the result for the Lasso in Theorem 6.4 of Bühlmann and van de Geer (2011) when we set except for the fact that our adaptive restricted eigenvalue condition is slightly stronger than their compatibility constraint. In that sense we generalize the oracle inequality of Bühlmann and van de Geer (2011). Their oracle inequality is, with being a compatibility constant (see p.157, Bühlmann and van de Geer (2011))
As mentioned, the only difference between their Theorem 6.4 and the result that can be deduced from our Theorem 1 is that . However, we also carried out the proofs of Theorem 1 imposing the compatibility constraint instead of the adaptive restricted eigenvalue condition and Theorem 1 then reduced to Theorem 6.4 in Bühlmann and van de Geer (2011) upon setting . This result can be obtained from the authors on request. The reason that the adaptive restricted eigenvalue condition is used is that it gives sharper bounds than the compatibility condition in the general elastic net case. More precisely, if the compatibility condition is used in our case we get an extra term in front of in the function .
2. Letting denote , i.e. the best linear approximation, and setting and choosing it follows that
Note how we have used our discretion in making a choice of which will turn out to be useful below. Since the second term in the definition of does not depend on in this case it follows that is the minimizer of which itself also is the minimizer of . Hence, in this case. It follows that under the conditions of Theorem 1
So, in particular, Theorem 1 can be used to provide upper bounds on the -distance of to due to our freedom in defining the oracle . If the target is also linear then clearly the best linear approximation equals the target implying that and hence and . Using , inequality (5) yields
Hence, in case the target is linear, (6) in particular yields an upper bound on the -estimation error which does not depend on the excess loss of the oracle. We shall make use of this fact in Section 4 on variable selection. It is also worth pointing out that in practice one does not know and hence cannot choose in the development of (6). However, (6) is valid even without this knowledge since an even sharper upper bound follows from Theorem 1 by choosing , i.e. subsets of of cardinality at most . This bound only relies on sparseness of the target and since is a member of (6) follows a fortiori.
3. A key issue is to understand the effect of on the excess risk. Clearly, the right hand side of Theorem 1 is increasing in through (by Lemma 3 in the appendix is non-decreasing). However, the same is the case for the left hand side through its multiplication onto the squared -error. This illustrates a tradeoff in the size of .
4. Note also, that the very definition of the restricted set in the definition of the adaptive restricted eigenvalue condition also depends on through . In particular, increasing increases the size of the set we are minimizing over in the definition of . This implies that choosing too large may lead to , or at least undesirably small values of . Note that choosing results in . So in this case the size of the restricted set only depends on the cardinality of the oracle. Here it is worth noticing that the sparser the oracle ( small) the smaller will the restricted set be and the larger will be.
5. In many econometric examples the margin condition (Assumption 1) is satisfied with a quadratic margin resulting in for a positive constant (as argued just after the definition of the convex conjugate). Setting in Theorem 1 results in
Note that for , corresponding to a pure -loss, Theorem 1 reduces to
Recall that Theorem 1 and the remarks following it are valid on the set . As a consequence, we would like to have a large probability. This can be achieved by choosing large. However, note that Theorem 1 supposes such that the right hand side of Theorem 1 is also increasing in . Put differently, there is a tradeoff between the tightness of the bound in Theorem 1 and the probability with which the bound holds. In the following we shall give a lower bound on the probability of which trades off these two effects.
Assume that where is a constant possibly depending on . It will always be zero in our examples. We further assume that there exists a such that
for all and . In other words is assumed to be Lipschitz continuous in its second argument over with Lipschitz constant . The reason we only need Lipschitz continuity over is that it is used for a contraction inequality in connection with bounding the local supremum of the empirical process in (4). Assume furthermore that
and for a positive constant
Note that (9) implies (8) when . Assuming is rather innocent since many commonly used basis functions are bounded. As we shall see in Section 6 the most critical assumption in concrete examples is the Lipschitz continuity of the loss function. With this notation in place we state the following result which builds on Theorem 14.5 of Bühlmann and van de Geer (2011) (see also Corollary A.1 in van de Geer (2008)).
The assumption is made for purely technical reasons and does not exclude any interesting problems. Similarly, still allows to increase at an exponential rate in the sample size. (Footnote 4: This is from an asymptotic point of view, though we wish to emphasize that the inequality in Theorem 2 holds for any given sample size (satisfying the conditions of the theorem).) From an asymptotic point of view Theorem 2 reveals that the measure of the set , on which the inequality in Theorem 1 is valid, tends to 1 as . In order to also cover the case of fixed one can choose in Theorem 2 to obtain by a slight modification of the proof of Theorem 2 which tends to one as even for fixed. But since Theorem 1 requires the this will yield a bigger upper bound in that theorem since by Lemma 3 in the Appendix is a non-decreasing function. Combining Theorems 1 and 2 yields the following result.
Theorem 3 basically consists of the inequality in Theorem 1 with a lower bound attached to the measure of the set on which Theorem 1 is valid. In particular, the theorem reveals that the excess loss of will not be much larger than that of the oracle. The second term on the right hand side reflects the estimation error. Put differently, the excess loss of our estimator depends on the excess loss of the oracle as well as the distance to the oracle. From an asymptotic point of view Theorem 2 reveals that the measure of the set on which the inequality in Theorem 1 is valid tends to 1 as . In particular, we have the following result for the asymptotic excess loss of . To this end assume that and for some and .
Corollary 1 shows that asymptotically the excess loss of will be of the same order as that of the oracle. This is useful since we saw in remark 2 above that we have considerable discretion in choosing and hence in what we bound by from above in Corollary 1. In the case where is linear we know from Remark 2 above that we can choose such that and hence . In this case Corollary 1 actually reveals that the excess loss of tends to zero. We next investigate the case of linear in more detail.
3.1 Linear target
In the case where the target function is linear Theorem 3 can be used to deduce the following result.
Corollary 2. Assume that is linear.
The bounds (10) and (11) bound the -estimation error of the elastic net estimator for any type of loss function satisfying the conditions of Theorem 1. Note that there is no excess loss from the oracle entering in the upper bound. This is due to the fact that this is zero when the target is linear. The last two bounds in Corollary 2 specialize to the case where the quadratic margin condition is satisfied. We stress again, as we shall see later (see Section 6), that the margin condition is indeed quadratic in many econometric examples.
Furthermore, one sees from the above Corollary that the rate of convergence of the elastic net estimator in the -norm is provided that the adaptive restricted eigenvalue is bounded away from zero.
Corollary 3 shows that the elastic net can be consistent even when the dimension increases at a subexponential rate in the sample size. Note, however, that the number of relevant variables, , cannot increase faster than the square root of the sample size ( can be put arbitrarily close to 0 to see this). Hence, even though the total number of variables can be very large, the number of relevant variables must still be quite low. This is in line with previous findings for the linear model in the literature. We also remark that this requirement is slightly stricter than the one needed when considering the excess loss in Corollary 1 (in that corollary we only needed ). Also note that the conditions in Corollary 3 are merely sufficient. For example, one can let tend to zero at the price of reducing the growth rate of and .
4 Variable Selection
In this section we briefly comment on how the results in Section 3 can be used to perform consistent variable selection in the case where is a linear function. (Footnote 5: If the target function is not linear we do not find it sensible to talk about consistent variable selection in a linear approximation of the target. Hence, this section restricts attention to the case where the target is linear.) First note that the results in Corollaries 2 and 3 can be used to provide rates of convergence of for in in the -norm in the case of a linear target. If one furthermore assumes that is bounded away from zero by at least the rate of convergence of it follows by standard arguments, see Lounici (2008) or Kock and Callot (2013), that no non-zero will be classified as zero. Put differently, the elastic net possesses the screening property.
In order to remove all variables whose true coefficients are zero one may furthermore threshold the elastic net estimator by removing all variables with parameters below a certain threshold. Again standard arguments show that choosing the threshold of the order of the rate of convergence (details omitted) can yield consistent model selection asymptotically. Since thresholding is a generic technique which is not specific to our setup we shall not elaborate further on this at this stage.
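The thresholding step can be sketched in a few lines. The estimate `beta_hat` and the threshold `tau` below are made-up illustrative values; in the theory the threshold is of the order of the rate of convergence, which is not computed here.

```python
def threshold(beta_hat, tau):
    """Set every coefficient with absolute value below tau to zero."""
    return [b if abs(b) >= tau else 0.0 for b in beta_hat]

def support(beta):
    """Indices of the non-zero entries of a coefficient vector."""
    return {j for j, b in enumerate(beta) if b != 0.0}

beta_hat = [1.2, 0.03, -0.8, -0.01]   # hypothetical elastic net estimate
selected = support(threshold(beta_hat, tau=0.1))
```

Here the two small coefficients are removed while the two large ones are retained, which mirrors the combination of the screening property with a subsequent thresholding step.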
One technical remark is in order at this point. Since thresholding is done at the level of the individual parameter, what one really needs is an upper bound on the estimation error for each individual parameter. In other words, an upper bound on the sup-norm, , is sought. However, our results in the previous section provide rates of convergence in the much stronger -norm. (Footnote 6: On finite-dimensional vector spaces these two norms are equivalent but here we are working in a setting where the dimension, , tends to infinity.) Of course one may simply use the -rates of convergence to upper bound the sup-norm rates of convergence. But this is suboptimal. Alternatively, a strengthening of the adaptive restricted eigenvalue condition can yield rates of convergence in the - or the sup-norm which are of a lower order of magnitude than the corresponding -results. Upon request we can make results for the thresholded elastic net based on upper bounds on the -norm rate of convergence available. We have omitted the results here since thresholding is a rather standard technique.
5 Tuning Parameter Selection
Recently, Fan and Tang (2013) developed a method to select tuning parameters in high-dimensional generalized linear models with more parameters than observations. Here we briefly describe their method with the terminology translated into our setting. (Footnote 7: Fan and Tang (2013) consider log-likelihood functions of generalized linear models. However, as our loss functions can often be written as the negative of the log-likelihood (as seen in Section 2), their setup applies to many of our examples.) For a loss function of the form
where and are known functions and is a known scale parameter. The tuning parameter, , (as a generic tuning parameter) is chosen to minimize the Generalized Information Criterion (GIC)
where corresponds to the saturated model as defined in Fan and Tang (2013) and is the elastic net minimizer corresponding to the penalty parameter . is the number of non-zero coefficients for a given value of .
Theorems 1-2 and Corollary 1 in Fan and Tang (2013) establish the consistency of this approach, i.e. where indicates the support of the vector , i.e. the location of its non-zero entries. Hence, GIC will, asymptotically, select the correct model (provided there exists a for which ). A necessary condition for this to be meaningful is of course that the target is linear, i.e. for some , but nothing prevents one from using GIC even when the target is non-linear. The theoretical merits of the procedure are, to our knowledge, unknown in that case.
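A GIC-based choice of the tuning parameter can be sketched as follows. The sketch assumes that the criterion has the form (deviance + a_n × number of non-zero coefficients)/n and uses a_n = log(log n)·log p as the penalty sequence; both the exact form and this choice of a_n are assumptions made for illustration and should be checked against Fan and Tang (2013). The candidate path and its deviance values are made up.

```python
import math

def gic(deviance, n_nonzero, n, p):
    """GIC value under the assumed form (deviance + a_n * n_nonzero) / n."""
    a_n = math.log(math.log(n)) * math.log(p)  # assumed penalty sequence
    return (deviance + a_n * n_nonzero) / n

def select_by_gic(path, n, p):
    """Return the tuning parameter minimizing GIC.

    `path` is a list of tuples (lam, deviance, n_nonzero), one per candidate
    tuning parameter, computed from the fitted models beforehand.
    """
    return min(path, key=lambda t: gic(t[1], t[2], n, p))[0]

# Hypothetical path: a dense overfit, a sparse fit, and the empty model.
path = [(0.01, 90.0, 40), (0.1, 100.0, 3), (1.0, 400.0, 0)]
best_lam = select_by_gic(path, n=200, p=500)
```

In this made-up example the criterion prefers the sparse fit: the dense model is penalized for its many non-zero coefficients, and the empty model for its large deviance.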
At this point it is also worth mentioning that the quadratic loss is covered by (14) since
such that we may choose and