In many scientific discoveries, a fundamental problem is to understand how the features under investigation interact with each other. Interaction estimation has been shown to be very attractive in both parameter estimation and model prediction (Bien, Taylor and Tibshirani, 2013; Hao, Feng and Zhang, 2017), especially for data sets with complicated structures. Efron et al. (2004) pointed out that for Boston housing data, prediction accuracy can be significantly improved if interactions are included in addition to all main effects. In general, ignoring interactions by considering main effects alone may lead to an inaccurate or even a biased estimation, resulting in poor prediction of an outcome of interest, whereas considering interactions as well as main effects can improve model interpretability and prediction substantially, thus achieve a better understanding of how the outcome depends on the predictive features (Fan et al., 2015). While it is important to identify interactions which may reveal real relationship between the outcome and the predictive features, the number of parameters scales squarely with that of the predictive features, making parameter estimation and model prediction very challenging for problems with large or even moderate dimensionality.
1.1 Interaction Estimation, Feature Selection and Screening
Estimating interactions is a challenging problem because the number of pairwise interactions increases quadratically with the number of the covariates. In the past decade, there has been a surge of interest in interaction estimation in quadratic regression. Roughly speaking, existing procedures for interaction estimation can be classified into three categories. In the first category of low or moderate dimensional setting, standard techniques such as ordinary least squares can be readily used to estimate all the pairwise interactions as well as the main effects. This simple one-stage strategy, however, becomes impractical or even infeasible for moderate or high dimensional problems, owing to rapid increase in dimensionality incurred by interactions. In the second category of moderate or high dimensional setting where feature selection becomes imperative, several one-stage regularization methods are proposed and some require either the strong or the weak heredity assumption. See, for example,Yuan, Joseph and Zou (2009), Choi, Li and Zhu (2010), Bien, Taylor and Tibshirani (2013), Lim and Hastie (2015), and Haris, Witten and Simon (2016). These regularization methods are computationally feasible and the theoretical properties of the resulting estimates are well understood for moderate or high dimensional problems. However, in the third category of ultrahigh dimension problems, these regularization methods are no longer feasible because their implementation requires storing and manipulating large scale design matrix and solving complex constrained optimization problems. The memory and computational cost is usually extremely expensive and prohibitive. Very recently, several two-stage approaches are proposed for both ultrahigh dimensional regression and classification problems, including Hao and Zhang (2014), Fan et al. (2015), Hao, Feng and Zhang (2017) and Kong et al. (2017). Two-stage approaches estimate main effects and interactions at two separate stages, so their computational complexity is dramatically reduced. However, these two-stage approaches hinge heavily on either the strong or weak heredity assumption. These methods are computationally scalable but may completely break down when the heredity assumption is violated.
1.2 Heredity Assumption and Invariance Principle in Quadratic Regression
As an extra layer of flexibility to linear models, quadratic regressions include both main effects and pairwise interactions between the covariates. Denote the outcome variable and
the covariate vector. For notational clarity, we define. In general, quadratic regression has the form of
where , and are all unknown parameters. To ensure model identifiability, we further assume that is symmetric, that is, , or equivalently, , . Our goal is to estimate and which characterize respectively main effects and interactions. We remark here that the intercept is also useful for prediction.
In the literature, heredity structures (Nelder, 1977; Hamada and Wu, 1992) have been widely imposed to avoid quadratic computational cost of searching over all pairs of interactions. The heredity structures assume that the support of could be inferred from the support of . The strong heredity assumption requires that an interaction between two covariates be included in the model only if both main effects are important, while the weak one relaxes such a constraint to the presence of at least one main effect being important. In symbols, the strong and weak heredity structures are defined, respectively, as follows:
With the heredity assumptions, one can first seek a small number of important main effects and then only consider interactions involving these discovered main effects. It is however quite possible that main effects corresponding to important interactions are hard to detect. An example is , where and are drawn independently from and is standard normal. In this example, . The main effects and are thus unlikely detectable through a working linear model , indicating that the heredity assumptions do not facilitate to find interactions by searching for main effects first. From a practical perspective, Ritchie et al. (2001) provided a real data example to demonstrate the existence of pure interaction models in practice. Cordell (2009) also raised serious concerns that many existing methods that depend on the heredity assumption may miss pure interactions in the absence of main effects.
An ideal quantification of importance of the main effects and interactions should satisfy the invariance principle with respect to location-scale transformation of the covariates. It is natural and a common strategy to quantify the importance of main effects and interactions through the supports of and in model (1.1). In conventional linear model where only main effects are present and interactions are absent (i.e., in model (1.1)), the invariance principle is satisfied. In contrast, in quadratic regression (1.1) with a general the invariance principle is very likely violated. To demonstrate this issue, we can recast model (1.1) as
In this model, the importance of main effects and interactions is naturally characterized through the support of and , respectively, indicating that the interactions are invariant whereas the main effects are sensitive to location transformation. In ultrahigh dimensional quadratic regression, using one-stage approaches which simultaneously estimate main effects and interactions under the heredity assumption or using two-stage approaches which search for main effects prior to searching for interactions in model (1.1) and model (1.2) may lead to quite different conclusions. It is thus desirable to estimate interactions directly without knowing the main effects in advance. Direct interaction estimation without heredity constraints is, however, to the best of our knowledge, much more challenging and still unsolved in the literature.
1.3 Our Contributions
In this article we consider interaction estimation in ultrahigh dimensional quadratic regressions without heredity assumption. We make at least the following two important contributions to the literature.
We motivate our proposal with the goal of obtaining a general and explicit expression for quadratic regression with as minimal assumptions as possible. Surprisingly, it turns out that such an explicit solution only relies on certain moment conditions on the ultrahigh dimensional covariates, which will be automatically satisfied by the widely used normality assumption. Explicit forms can be derived for both the main effects and the interactions, from which it can be seen that the quadratic regression could be implemented as two independent tasks relating to the main effects and interactions separately. Under weaker moment assumptions, our approach is still valid in detecting the direction of the true interactions. Our proposal is different from existing one-step or two-step procedures in that we do not require the heredity assumption and our proposal give explicit forms for both the main effects and the interactions. Estimating the main effects through a separate working linear model ensures that the resulting estimate satisfies the desirable invariance principle. What is more, we show that our approach for interaction detection is robust to the estimation of main effects in that even when the linear effect can not be well estimated, we can still successfully detect the interactions.
We show that the interaction inference is equivalent to a particular matrix estimation at the population level. We estimate the interactions of matrix form under penalized convex loss function, which yields a sparse solution. We derive the theoretical consistence of our proposed estimation when the covariate dimension is an exponential order of the sample size. Compared with the conventional penalized least squares approach, the penalization of matrix form is appealing in both memory storage and computation cost. An efficient ADMM algorithm is developed to implement our procedure. This algorithm fully explores the cheap computational cost for matrix multiplication and is even much more efficient than existing penalized methods. We have also developed an R package “PIE” to implement our proposal.
The remainder of this paper is organized as follows. We begin in Section 2 with the quadratic regression model and derive closed forms for both the main effects and the interactions. We propose a direct penalized estimation for high dimensional sparse quadratic model. To implement our proposal an efficient ADMM algorithm is provided. We also study the theoretical properties of our proposed estimates. We illustrate the performance of our proposal through simulations in Section 3 and an application to a real world problem in Section 4. We give some brief comments in Section 5. All technical details are relegated to Appendix.
2 The Estimation Procedure
2.1 The Rationale
In this section we discuss how to estimate and , which characterize the main effects and interactions in model (1.1), respectively. Note that and Therefore, estimating and amounts to estimating and , respectively, which is however not straightforward, especially when is ultrahigh dimensional. To illustrate the rationale of our proposal, we assume for now that follows . It follows immediately from Stein’s Lemma (Stein, 1981; Li, 1992) that
where . Define , which is the residual obtained by regressing on linearly. The Hessians of and are equal. Accordingly, we have
By Stein’s Lemma, we can obtain that
where . This indicates that, if is normal, we have explicit forms for and . Specifically,
We remark here that the normality assumption is widely used in the literature of interaction estimation. See, for example, Hao and Zhang (2014), Simon and Tibshirani (2015), Bien, Simon and Tibshirani (2015) and Hao, Feng and Zhang (2017). In the present context we show that the normality assumption can be relaxed. Let be the trace operator of matrix . In particular, .
Suppose that is drawn from the factor model , where satisfies and where are independent and identically distributed (i.i.d.) with , , , . We further assume either (C1): or (C2): . Then the parameters , and in model (1.1) have the following explicit forms:
The factor model was widely assumed in random matrix theory(Bai and Saranadasa, 1996) and high dimensional inference (Chen, Zhang and Zhong, 2010) where higher order moment assumptions of are quite often required. The moment conditions on play an important role to derive an explicit form for . Condition (C1) is satisfied if is normal. When , condition (C2) implicitly requires the absence of quadratic terms of the form in model (1.1), i.e.,
where are i.i.d covariates.
We provide two explicit forms for estimating , one is based on the response and the other is based on the residual . The difference between and is that we remove the main effects in , or equivalently, the linear trend in model (1.1), before we estimate the interactions . It is thus natural to expect that the residual-based is superior to the response-based in that the sample estimate of has smaller variabilities than that of (Cheng and Zhu, 2017). In effect, we can replace with an arbitrary , which yields that . Similarly, we can define . Under the normality assumption, is symmetric about and hence . This ensures that, to estimate accurately, our proposal does not hinge on the sparsity of main effects because we do not require to be estimated consistently. Even if the main effects are not sufficiently sparse or are not estimated very accurately, we can either directly use the response-based method , or the lousy residual-based method which utilizes a lousy residual and can be a lousy estimate of . In effect equals by setting in . This makes our proposal quite different from existing procedures which assume the heredity conditions and hence require to estimate the main effects accurately in order to recover the interactions. By contrast, our proposal does not require to estimate the main effects precisely. We will illustrate this phenomenon through simulation studies in Section 3.
2.2 Interaction Estimation
We show that both and have explicit forms under moment conditions in Section 2.1. In particular, and for being or . In this subsection, we discuss how to estimate and at the sample level. Estimating is indeed straightforward by noting that it is a solution to the minimization problem
Therefore, we can simply estimate with the penalized least squares by regressing on the ultrahigh dimensional covariates linearly. We do not give many details about how to estimate because the penalized least squares estimation has already been well documented (Tibshirani, 1996; Fan and Li, 2001). Throughout our numerical studies we use the LASSO (Tibshirani, 1996) to estimate . The resulting solution is denoted by .
In what follows we concentrate on how to estimate , where can be or . For an arbitrary matrix , we have
Ignoring the constant, the term quantifies the distance between and . Therefore, to seek a matrix which can approximate very well, it suffices to consider the following minimization problem
as long as we have faithful estimates of and . The above loss function of matrix form is convex which guarantees that local minimum must be a global minimum.
To construct faithful estimates for and , suppose is a random sample of . Denote
where . We propose the following penalized interaction estimation (PIE) to estimate , for being or :
where is a tuning parameter and . To ease subsequent illustration, we further define the following two notations:
In this section we discuss how to solve (2.2) which includes (2.3) and (2.4) as special cases. Making use of the matrix structure of (2.2), we next develop an efficient algorithm using the Alternating Direction Method of Multipliers (Boyd et al., 2011, ADMM). We rewrite the optimization problem in (2.2) as
which motivates us to form the augmented Lagrangian as
where is a step size parameter in the ADMM algorithm, and stands for the Frobenius norm of . Given the current estimate , the augmented Lagrangian (2.3) can be solved by successively updating by:
Define the elementwise soft thresholding operator . For the step, given , , and , the solution is then given by
The step amounts to solving the equation
. We make the singular value decomposition to obtain, where , and is a diagonal matrix. Define , where . Given , and , the solution to (2.10) is given by
where denotes the Hadamard product.
Details of the algorithm is summarized in Algorithm 1. This algorithm yields a symmetric estimate of , which is denoted by . The computational complexity of each iteration is no more than O and the memory requirement is no more than O since we only need to store a few or matrices in computer memory. The algorithm explores the advantages of matrix multiplications and is efficient in memory storage and computation cost and hence is appealing for high dimensional quadratic regression.
Furthermore, as a first-order method for convex problems, convergence analysis of the ADMM algorithm under various conditions has been well documented in the recent optimization literature. See, for example, Nishihara et al. (2015), Hong and Luo (2017) and Chen, Sun and Toh (2017). The following lemma states that our proposed ADMM algorithm converges linearly to zero.
It remains to choose an appropriate tuning parameter for PIE or PIE. Motivated by LARS–OLS hybrid (Efron et al., 2004), we use PIE to find the model but not to estimate the coefficients. For a given , we fit a least squares model on the support of estimated by PIE or PIE and get the residual sum of squares. We then choose by the Bayesian information criterion (BIC). Our limited experience indicates that this procedure is very fast and effective.
2.4 Asymptotic Properties
Suppose is a sparse matrix. For notational clarity, we denote the support of by , the complement of by , and the cardinality of by . Similarly, we denote by and the respective support of and , and and the respective complement of and . We define , , and , for . We further define , and . Denote a sequence of generic constants which may take different values at various places. We assume the following regularity conditions to study the asymptotic properties of and .
Assume , where and
are the respective smallest and largest eigenvalues of.
Assume s are sub-Gaussian, i.e., for any unit-length vector .
Assume for some .
Assume the irrepresentability condition holds, i.e., .
Assume is symmetric about .
Conditions (A1) and (A2) are widely assumed in high dimensional data analysis. Condition (A3) is assumed to control the tail behavior ofthrough concentration inequalities. The irrepresentability condition (A4) is nearly necessary for the consistence of -penalization (Zhao and Yu, 2006; Zou, 2006). This condition was first used by Ravikumar et al. (2011). See also Zhang and Zou (2014) and Liu and Luo (2015). We assume condition (A5) to ensure the consistency of residual-based approaches.
Let for sufficiently large and assume that . Under the conditions (A1)-(A4), we have
If we further assume for sufficiently large , then .
, for sufficiently large .
, for sufficiently large .
shows that, as long as the signal strength of the interactions is not too small, our proposal can identify the support correctly with a very high probability. In other words,is asymptotically selection consistent. Theorem 1 also shows that is a consistent estimate of under both the infinity norm and the Frobenius norm.
Let for sufficiently large and assume that . Under the conditions (A1)-(A5), we have
If we further assume for sufficiently large , then .
, for sufficiently large .
, for sufficiently large .
Theorem 2 shows that , as well as , possesses both the selection and estimation consistency asymptotically. Moreover, the convergence rate of depends on . If , the convergence rate term involving will be absorbed in the first term of Theorem 2. In other words, unless the estimation error of diverges faster than , and would share the same convergence rate.
2.5 Connections to All-Pairs-LASSO
For quadratic regression, a nature way is to fit LASSO model on all pairs of interactions,
Following Bien, Taylor and Tibshirani (2013), we refer to this approach as the all-pairs-LASSO. For brevity, we assume and ignore the main effects. Write and . The all-pairs-LASSO is equivalent to