High-dimensional data are increasingly prevalent in various areas such as bioinformatics, astronomy, climate science and social science. When the number of variables is larger than the sample size
in the linear regression setting, statistical estimation of the regression function often requires some crucial conditions. One common condition is the sparsity of the data generating model, under which only a small portion of the variables are important to affect the response variable. Under this condition, both sparse estimation of high-dimensional linear regression functions and variable selection have been well studied with fruitful theoretical understandings in the recent decade. Minimax estimation of the regression function with main effects only are well investigated under-sparsity constraints with (e.g., van de Geer, 2007; Candes & Tao, 2007; Bunea et al., 2007; Zhang & Huang, 2008; van de Geer, 2008; van de Geer & B hlmann, 2009; Bickel et al., 2009; Zhang, 2010a; Knight & Fu, 2000; Raskutti et al., 2011; Rigollet & Tsybakov, 2011; Wang et al., 2014); model selection consistency results are also obtained for various model selection procedures (e.g., Fan & Li, 2001; Zhao & Yu, 2006; Zhang & Huang, 2008; Zou & Yuan, 2008; Lv & Fan, 2009).
However, models with only main effects are often not adequate to fully capture the nature of the data. Interaction terms may be necessary to not only improve the prediction performance but also enhance the understanding of the relationships among the variables, especially in areas such as genetics, medicine and behavioristics, where interaction effects between the covariates are of enormous interest. Hierarchical constraints are often imposed to describe the underlying structure of models with interaction effects, such as the marginality principle (Nelder, 1977), the effect heredity principle (Hamada & Wu, 1992) and the “well-formulated models” (Peixoto, 1987). We follow a popular naming convention of heredity conditions as adopted in Chipman (1996): strong heredity and weak heredity. Strong heredity assumes that if an interaction term is in the model, then both of its corresponding main effects should also be included, while weak heredity only requires that at least one of its main effects should be included. In practice, it is possible that, compared to the interaction terms, some main effects are so small that including them in modeling may not be beneficial from the perspective of estimation variability. Thus, in this work we take into consideration the additional case where no heredity condition is imposed at all, also for the purpose of theoretical comparison with the other two heredity conditions.
Many approaches are proposed for interaction selection, most of which can be categorized into two types: joint selection and stage-wise selection. The joint selection approach selects the main and interaction terms simultaneously by searching over all possible models with interactions. A typical way of joint selection is to use regularization methods with specially designed penalty terms. For example, Yuan et al. (2009) introduced a family of shrinkage estimators, which incorporate the hierarchical structures through linear equality constraints on the coefficients and possess both selection consistency and root- estimation consistency under fixed . Choi et al. (2010) re-parameterized the regression model with interactions and applied an adaptive -norm penalty. The estimators have the oracle property (Fan & Li, 2001) when . Hao et al. (2017) proposed a computationally efficient regularization algorithm under marginality principle (RAMP) that simultaneously selects the main effects, interaction effects and quadratic effects for high-dimensional data . They also verified the interaction selection consistency property of the two-stage LASSO under some sensible conditions. The stage-wise selection procedure first performs a main effect selection (by excluding the interaction terms) to reduce the dimension of variables and then carries out a joint selection on the reduced dimension of variables, which is computationally feasible and effective. For example, viewing the sliced inverse regression (Li, 1991) from a likelihood perspective, Jiang & Liu (2014) suggested a stage-wise variable selection algorithm (SIRI) via inverse regression, which is able to detect higher order interactions without imposing any hierarchical structures. Hao & Zhang (2014) proposed two stage-wise interaction selection procedures, IFORT and IFORM, both of which enjoy sure screening property in the first stage. Fan et al. (2016) proposed a method, named the interaction pursuit, that incorporates both screening and variable selection in ultra-high dimensions. The method possesses both the sure screening property and the oracle property in the two stages respectively. For some other works on interaction selection, see Zhao et al. 2009; Li et al. 2012; Bien et al. 2013; Hall & Xue 2014. While having the aforementioned good properties, both types of interaction selection approaches have their own disadvantages as well. The joint selection is usually computational infeasible (insufficient storage) when is large; the stage-wise selection, as pointed out in Hao & Zhang (2014), may be very difficult to be theoretically justified under general conditions.
Although there have been many novel developments on selection of interaction terms as described above, little work has been done on the estimation of the regression function when interactions exist. In this paper, we present some theoretical results on the minimax rate of convergence for estimating the high-dimensional regression function with interaction terms under three different hierarchical structures. Regardless of the heredity condition, our results show that the minimax rate is determined by the maximum of the total estimation price of the main effects and that of the interaction effects. Heredity conditions enter the minimax rate of convergence in terms of the estimation price of the interaction effects, namely , where is the number of non-zero interaction effects and is the number of eligible candidate interaction terms under the different heredity conditions. Consequently, a stronger heredity condition leads to possibly faster minimax rate of convergence. For example, when the underlying model has no more than non-zero main effects, at most interaction terms are allowed to enter the model under strong heredity, compared to under weak heredity. As will be seen, only in certain situations is the minimax rate improved by imposing the strong heredity, although strong heredity allows fewer eligible interaction terms than the other two heredity conditions. Also, from the perspective of estimation, there may be no difference in rate of convergence between weak heredity and no heredity in many situations. Our results provide a complete characterization and comparison of the minimax rates of convergence under the three heredity conditions.
In real applications, since one does not know the true heredity condition behind the data (or practically the best heredity condition to describe the data at the given sample size), it is desirable to construct an estimator that performs optimally no matter which of the three heredity conditions holds. Such an estimator that adapts to the true heredity condition as well as the unknown number of main and interaction effects will be obtained in this paper.
The remainder of the paper is organized as follows. In Section 2
, we introduce the model setup, the loss function and the heredity conditions for the problem. In Section3, after stating the required assumption, we present our main results of the minimax rate of convergence under strong heredity. The theoretical results under weak heredity and no heredity are presented in Section 4. Section 5.1 provides detailed rates of convergence under different heredity conditions in relation to the sparsity indices, the ambient dimension and the sample size, followed by Section 5.2 where we present some interesting implications of the detailed results. In Section 6, we extend our results to quadratic models in which both quadratic and interaction effects are considered. In Section 7, we construct an adaptive estimator that achieves the minimax rate of convergence without knowledge of the type of the heredity condition or the sparsity indices ( and ). The proofs of our results and some technical tools are presented in the Appendix.
Suppose the dataset is composed of , where is a matrix with observations on covariates and
is the response vector. We start by considering a linear regression model with both main effects and two-way interaction effects:
where is the overall coefficient vector, is the full design matrix, and the random noise vector with known . More specifically, and are the coefficients of the main effects and the two-way interaction effects respectively. Here we define as the matrix that contains all the two-way interaction terms, where denotes the point-wise product of two vectors.
In this paper, our focus is on the fixed design, i.e., the covariates are considered given. Our goal is to estimate the mean regression function by a linear combination of the covariates and interaction terms.
Denote as the mean regression function, i.e., for . Denote as an estimated function of . In our fixed design setting, we focus on the prediction loss (or the Averaged Squared Error) , where is the Euclidean norm. Set the index sets for the main effects and the interaction effects as and respectively.
Let ( is the Cartesian product) be the index set of a model with non-zero main effects and non-zero interaction effects. In this paper, we consider the data generating model (2.1) with at least two main effects and one interaction effect purely for convenience, which does not affect the conclusions. Let be the submatrix of that corresponds to the model index . Its corresponding least squares estimator is used to estimate , where is the projection matrix onto the column space of . The loss function of using model is denoted as .
Denote the space of all the -dimensional vectors with a hierarchical notation of the subscripts as
We refer to as the subvector consisting of the first elements in , and as the subvector containing the rest of the elements. We introduce the following two vector spaces:
The space captures the strong heredity condition that if the interaction term is in the model, then both of its corresponding main effects should also be included. The space characterizes the weak heredity condition that if the interaction is in the model, then at least one of its main effects should be included. As pointed out in Hao & Zhang (2016)
, the sign of the main effect coefficients are not invariant of linear transformation of the covariates individually due to the existence of the interaction terms. Heredity conditions are consequently meaningless without the specification of the model parametrization. In our paper, we stick to the parameterizationand include the no heredity condition by considering the vector space . Define the -norm of a vector as the number of its non-zero elements, i.e., . For a vector space , define the corresponding - and - of as
respectively. Note that represents the collection of coefficients with at most non-zero main effects and non-zero interaction effects under a certain hierarchical constraint . And denotes the collection of linear combinations of the covariates with coefficients . Throughout this paper, we assume that (otherwise the minimax risk may not converge or the rate may not be optimal), and .
It is helpful to consider the uniform performance of a modeling procedure when we have plentiful choices of modeling procedures during the analysis of a statistical problem. The minimax framework seeks an estimator that minimizes the worst performance (in statistical risk) assuming that the truth belongs to a function class . The minimax risk we consider is
where is over all estimators, and and may refer to and , more formally speaking. In our work, we assume that the true mean regression function has a hierarchical structure by imposing , with .
In this paper, we will use the notation or to represent . If both and hold, we denote to indicate that and are of the same order. If holds without , we use the notation or .
3 Minimax Rate of Convergence under Strong Heredity
We start by stating an assumption required for our result of the minimax rate of convergence under strong heredity. In this paper, we use to indicate that the number of main effects can go to infinity as increases. We also allow and to increase with the sample size as well.
Sparse Reisz Condition (SRC)
For some , there exist constants (not depending on ) such that for any with and , we have
The SRC assumption requires that the eigenvalues offor any sparse submatrix of are bounded above and away from 0. It was first proposed by Zhang & Huang (2008). It is similar to the sparse eigenvalue conditions in Zhang (2010b); Raskutti et al. (2011), quasi-isometry condition in Rigollet & Tsybakov (2011); it is also related to the more stringent restricted isometry property (which requires the constants , are close to 1) in Candes & Tao (2007). Such assumptions are standard in the -regularization analysis like LASSO and the Dantzig selector. See Bickel et al. (2009); Meinshausen & Yu (2009); van de Geer (2007); Koltchinskii (2009) for more references.
3.2 Minimax Rate
Now we present our main result of the minimax rate of convergence under strong heredity. A simple estimator is enough for an effective minimax upper bound. Let be the model that minimizes the residual sum of squares over all the models that have exactly non-zero main effects and non-zero interaction effects under strong heredity, denoted as , where is the projection of onto the column space of the design matrix . For lower bounding the minimax risk, the information-theoretical tool of using Fano’s inequality with metric entropy understanding (Yang & Barron, 1999) plays an important role in the proof.
Under the Sparse Reisz Condition with , and the strong heredity condition , the minimax risk is upper bounded by
where is a pure constant; the minimax risk is lower bounded by
for some positive constant that only depends on the constants and in the SRC assumption.
From the theorem, under the SRC and the strong heredity condition, the minimax rate of convergence scales as: .
The term reflects two aspects in the estimation of the main effects: the price of searching among possible models, which is of order , and the price of estimating the main effect coefficients after the search. Thus is the total price of estimating the main effects. Similarly, is the total price of estimating the interaction effects.
Our result of the upper bound is general and does not require the sparsity condition of , although it may be needed for fast rate of convergence.
4 Minimax Rate of Convergence under Weak Heredity and No Heredity
Similar results are obtained under weak heredity and no heredity. The minimax rate of convergence is still determined by the maximum of the total price of estimating the main effects and that of the interaction effects. When the heredity condition changes, the total price of estimating the interaction effects may differ, possibly substantially.
Under the Sparse Reisz Condition with , and the weak heredity condition , the minimax risk is of order
Under the Sparse Reisz Condition with , and the no heredity condition , the minimax risk is of order
5 Comparisons and Insights
In this section, we summarize the consequences of our main results in three scenarios for an integrated understanding. For brevity, we introduce the following notation. For and , define the quantity The total price of estimating the main effects and the interaction effects are then denoted as and respectively, where depends on , and the heredity condition. We also use the notation (7.5) to indicate that depends on the heredity condition . Let
denote the minimax risk under the heredity condition .
5.1 Detailed Rates of Convergence
Since the minimax rate of convergence depends on the maximum of and , we discuss the cases where one of the two quantities is greater than the other.
When there are more main effects than interaction effects in the sense that , the minimax rate of convergence is not affected by the heredity conditions. When , we always have regardless of the heredity conditions. When , it depends on the order of to further decide which estimation price is larger. When , let be such that . If , we have ; otherwise .
In summary, given that , the minimax risk is of order
This scenario also includes the special case when , where we must have and . The minimax rate of convergence is of the standard parametric order regardless of the heredity conditions.
Scenario 2: and
When there exist more interaction terms, i.e., , under weak or no heredity, the quantity is always no less than (in order) .
For strong heredity, we discuss case by case. When , we always have . When , it depends on the order of to decide which estimation price is larger in terms of order. When , let be such that . If , we have ; otherwise . In summary, given that and , the minimax risk is of order
The term deals with the case where is inactive in the sense that exceeds under the specific heredity condition. For example, with , the upper bound in (2.2) does not provide any new information of the number of non-zero interaction effects for strong heredity. Thus the -ball is automatically reduced to a subset .
Scenario 3: and
When the number of the main effects is at least exponentially as many as the non-zero main effects in the sense that , is always no less than in terms of order. In fact, in this scenario, the results of the minimax rates under weak or no heredity are exactly the same as those in Scenario 2. For completeness, we still present the results. Specifically, the minimax risk is of order
5.2 Interesting Implications
Comparing the results for weak heredity and no heredity, we may or may not have distinct rates of convergence. When there exists a small constant such that for large enough , there is no difference between weak heredity and no heredity from the perspective of rate of convergence in estimation. It still remains an open question how they are different for the problem of model identification. Without the above relationship between and , there is no guarantee that the rates of convergence are the same under weak heredity and no heredity. For example, when , if in addition we have , the minimax rates are the same under weak and no heredity, at . In contrast, if instead we have , then the minimax rates are different, with and .
Heredity conditions do not affect the rates of convergence in some situations. For example, when there exist more main effects than interaction effects (Scenario 1), the minimax rates of convergence are the same under all three heredity conditions.
From the detailed rates of convergence, under any of the three heredity conditions, the estimation of the interaction terms may become the dominating part. There are two different reasons why the price of estimating the interaction terms becomes higher than that for the main effect terms. One is that the number of interaction terms is more than that of the main effect terms. The other reason is that although the main effect terms outnumber the interaction terms, the ambient dimension is so large that even estimating a small number of the interaction terms is more challenging than estimating the main effects.
How much can the rate of convergence be improved by imposing strong heredity? We quantify this improvement by taking the ratio of two minimax rates of convergence given the ambient dimension , i.e., and . In Scenario 2 ( and ), we have , where the maximal improvement happens when and . That is, the minimax rate of convergence under strong heredity is up to times faster than that under weak heredity. Similarly we have , where the maximal improvement happens at and .
In Scenario 3 ( and ), the improvement , where the maximal improvement happens when . In this scenario, the maximal improvement of the minimax rate from weak heredity to strong heredity depends on the ambient dimension . In other words, the larger the ambient dimension is, the more improvement of minimax rate of convergence we have from weak heredity to strong heredity. Similarly we have , where the equality holds if and .
If is active for all three heredity conditions, i.e., , the maximal improvement of minimax rate from weak/no heredity to strong heredity turns out to be consistent. That is, , where the maximal improvement happens at and .
6 Extension to Quadratic Models
Our aforementioned results do not consider quadratic effects. When both quadratic and two-way interaction effects are included in a model (called a quadratic model), it is easy to see the rates of convergence in the theorems still apply under both strong heredity and weak heredity. However, in the case of no heredity, the number of quadratic terms enters into the minimax rate. Assume one model has at most extra non-zero quadratic terms. We need the following assumption.
Sparse Reisz Condition 2 (SRC2)
For some , there exist constants (not depending on ) such that for any with , and , we have
where is the new design matrix, with representing the matrix that contains all the quadratic terms.
Next we state the minimax results for quadratic models. Strong heredity and weak heredity are exactly the same condition since a quadratic term has only one corresponding main effect term. That is, both strong and weak heredity require that if a quadratic term has a non-zero coefficient, then must also have a non-zero coefficient. Similarly, under SRC2 with , the minimax rate of convergence under strong/weak heredity for the quadratic model stays the order
under no heredity, its order becomes
7 Adaptation to Heredity Conditions and Sparsity Indices
In the previous sections, we have determined the minimax rates of convergence for estimating the linear regression function with interactions under different sizes of sparsity indices and heredity conditions . These results assume that , and are known. However, in practice, we usually have no prior information about the underlying heredity condition nor the sparsity constraints. Thus it is necessary and appealing to build an estimator that adaptively achieves the minimax rate of convergence without the knowledge of , and . We construct such an adaptive estimator as below.
To achieve our goal, we consider one specific model and three types of models together as the candidate models:
where denotes the full model with main effects and all the interaction effects. It is included so that the risk of our estimator will not be worse than order , in which is the rank of the full design matrix. With a slight abuse of the notation, we use , and to represent a model with main effects and interaction effects under strong heredity, weak heredity and no heredity respectively. Note that some models appear more than once in , which does not cause any problem for the goal of estimating the regression function. The details of the range of and for each model class are shown in (7.2), (7.3) and (7.4).
To choose a model from the candidate set, we apply the ABC criterion in Yang (1999). For a model in , the criterion value is
where is the projection of onto the column space of the design matrix with rank , is the descriptive complexity of model and is a constant. The model descriptive complexity satisfies and .
The model descriptive complexity is crucial in building the adaptive model. Let be four constants such that . Set for the full model,
with for , and
for . This complexity assignment recognizes that there are three types of models under the different heredity conditions.
Let denote the model that minimizes the ABC criterion over the candidate model set and denote the least squares estimate of using the model . Then we have the following oracle inequality.
When , the worst risk of the ABC estimator is upper bounded by
where is the rank of the full design matrix and the constant only depends on the constant .
From the theorem, without any prior knowledge of the sparsity indices, the constructed ABC estimator adaptively achieves the minimax upper bound regardless of the heredity conditions. The result also indicates a major difference between estimation and model identification. For estimation, from the result, we are able to achieve adaptation with respect to the heredity condition without any additional assumption. For model identification, although we are not aware of any work that addresses the task of adaptation over the unknown heredity nature, it seems certain that much stronger assumptions than those for consistency under an individual heredity condition will be necessary to achieve adaptive selection consistency. Achieving adaptive model selection consistency under different types of conditions remains an important open problem on model selection theory and methodology.
We do not require any assumptions on the relationship among the variables for the upper bound in the theorem. In particular, the variables may be arbitrary correlated.
The order is achievable when we use the projection estimator from the full model. Thus the minimax rate of convergence is no slower than the order . As is known, the rank of the design matrix plays an important role in determining the minimax rate of convergence under fixed design (Yang, 1999; Rigollet & Tsybakov, 2011; Wang et al., 2014). For our result, when , and together make the total estimation price of the true model small enough, the upper bound will be improved from to .
The ABC estimator may not be practical when is large. In such case, stochastic search instead of all subset selection can be used for implementation.