1 Introduction
The interaction effect is an important consideration in regression problems commonly encountered in machine learning. It refers to the general situation in which it is not adequate to build a model from the original explanatory variables alone in a simple additive way, referred to as the main effects. At the same time, it can be more effective to build a model incorporating constructed variables formed as products between pairs of variables, referred to as the interactions between the variables; see, for example, the monograph of
Jaccard and Turrisi (2003) for an overview. Since the interactions are constructed from the predictor variables, it is clear that conceptually they can be further expanded in a hierarchical manner. That is, the construction can continue searching for higher-order, multiple-way interactions defined from all existing variables, including those constructed from the products. Recently, handling high-dimensional regression problems has received increasing attention. For an overview of the current state of knowledge, we refer to the monographs of Hastie et al. (2015) and Bühlmann and van de Geer (2011), the review by Fan and Lv (2010), and the references therein. Conceptually speaking, existing penalized regression methods can be applied by incorporating the constructed interactions as new variables. However, because of the aforementioned interactive and hierarchical nature, the complexity of the problem scales up very quickly, both computationally for practical implementations and theoretically for analyzing the properties of the estimators. For example, in a typical situation with $p$ predictor variables, the number of possible two-way interactions is of order $p^2$, growing at a polynomial rate of $p$, which itself is typically assumed to grow at some exponential rate of the sample size $n$. Consequently, on the one hand, with the total number of predictor variables including interactions at the order of $p^2$, the computational complexity increases substantially for the statistical methods. On the other hand, due to the increased data dimensionality, the relative signal strength from the predictor variables becomes weaker.
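To make the scaling concrete, a quick back-of-the-envelope computation (an illustrative sketch; the dimensions are generic, not taken from the paper) shows how fast the candidate set of pairwise interactions grows with $p$:

```python
from math import comb

# number of distinct two-way interaction terms X_j * X_k (j < k) among p predictors
for p in (100, 200, 300):
    print(p, comb(p, 2))   # 100 -> 4950, 200 -> 19900, 300 -> 44850
```

Even at these moderate dimensions, the candidate set is one to two orders of magnitude larger than the set of main effects.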
Recently, there have been active developments of methods for detecting and estimating interactions in high-dimensional regression problems. To manage the complexity of the problem, a main strategy is to assume some hierarchical structure between the main effects from the original predictor variables and the interactions. That is, in the absence of the original variables, the interactions alone do not contribute to the model. This is referred to as the marginality principle (Nelder, 1977). In a class of existing methods guided by this principle, the interactions are detected and estimated with two-step or iterative two-step procedures that first select the main effects with penalized regression methods, and then perform a second run of selection by incorporating all potential interactions from the selected variables. For examples of methods belonging to this class, we refer to Hao and Zhang (2014), Hao et al. (2016), Yuan et al. (2009), Zhao et al. (2009), Choi et al. (2010), Bien et al. (2013), Shah (2016), and Haris et al. (2016). More recently, She et al. (2018) investigate the problem with a group penalty approach incorporating the structural hierarchy. Requiring no hierarchical structure, and in a setting with multiple responses, Kong et al. (2017) propose to search for interactions with a sure screening procedure based on the squared distance correlation measures. Their approach is computationally efficient and only requires ranking pairwise quantities. From the inverse modeling perspective, Jiang and Liu (2014) consider testing procedures with the sliced inverse regression approach (Li, 1991) for variable selection with first- and second-order effects. For classification problems and general index models, Li and Liu (2017) study information criteria based approaches for variable and interaction selection. Recently, Fan et al. (2016) propose a method to detect the signals from the interactions by examining the pairwise covariances between the squared response and each predictor variable. Upon evaluating the covariances, Fan et al. (2016) propose a new two-step procedure that first conducts sure screening for the interactions, and then conducts a second step of penalized regression, without requiring a hierarchical structure.
For those two-step procedures, detecting interactions in the second step after selecting variables such as the main effects in the first step has a clear limitation in high-dimensional scenarios. Particularly for procedures relying on hierarchical structures, if the signal strength of the main effects relative to the noise is weak, then the penalized regression in the first step may miss the main effects. Subsequently, it becomes impossible to detect and estimate the interactions in the second step. This is more likely to happen in high-dimensional problems, where the effect of noise aggregation becomes more substantial, causing relatively low signal strength.
In this paper, we propose a new method to detect interaction effects in regression problems with a one-step penalized M-estimation. Our method does not assume a hierarchical structure, and it does not require a screening step. Our method is developed by utilizing the so-called principal Hessian matrix (Li, 1992), which is defined as the expectation of the second-order derivative of the mean function. To solve this challenging problem, we assume that the principal Hessian matrix is sparse, reflecting the reality that, given the limited data information, only a few meaningful interactions can be supported with good estimation accuracy. A sparsity-encouraging M-estimator is proposed by minimizing a dedicatedly crafted penalized squared loss function, and an alternating direction method of multipliers (ADMM) algorithm is developed to solve the optimization efficiently. We show with simulations that the proposed method outperforms existing methods, and our theory confirms that the estimator works satisfactorily while allowing the dimensionality of the predictor variables to grow exponentially with the sample size.
The rest of this paper is organized as follows. The proposed method using the principal Hessian matrix for detecting interactions is outlined in Section 2, followed by the alternating direction method of multipliers (ADMM) for solving the optimization, elaborated in Section 3. Numerical examples, including simulations and a real data example, are presented in Section 4 to demonstrate the promising performance of the method. Theoretical analysis is given in Section 5, and the proofs are outlined in the Appendix.
2 Methods
2.1 Principal Hessian Matrix
The key device in our method is the principal Hessian matrix. In Li (1992), the principal Hessian matrix is proposed as a convenient device for investigating dimension reduction and data visualization; see also Cook (1998). We consider the regression problem with response variable $Y$ and predictor variable $X$ that is taken to be a $p$-dimensional random vector. Let $m(x) = E(Y \mid X = x)$ be the conditional mean function; then a model for $Y$ can be written as $Y = m(X) + \varepsilon$, with $\varepsilon$ being a zero-mean random variable independent of $X$. Then the principal Hessian matrix introduced in Li (1992) for the mean function $m$ is defined as

$H = E\{\nabla^2 m(X)\},$

where the expectation is taken with respect to the distribution of $X$
. For ease of presentation and without loss of generality, we consider that $X$ is centered so that all of its components have zero mean. To demonstrate the connection of the principal Hessian matrix with the interactions, let us consider the following working model with both the main and interaction effects:

(1)  $Y = \beta_0 + X^\top \beta + X^\top \Omega X + \varepsilon,$

where $\Omega$ is a symmetric matrix collecting the interaction coefficients and $\varepsilon$ is independent of $X$. Then, by letting $m(x) = \beta_0 + x^\top \beta + x^\top \Omega x$, we note that from (1)

(2)  $H = E\{\nabla^2 m(X)\} = 2\Omega.$

That is, since $H$ is the expectation of the second-order derivative of $m(X)$, it automatically filters out the linear contribution from $X^\top \beta$. This is the intuition for why the principal Hessian matrix is a highly informative device for detecting interactions.
By an application of Stein’s lemma (Stein, 1981), Li (1992) shows that if $X \sim N(0, \Sigma)$, then

(3)  $H = \Sigma^{-1} \Sigma_{yxx} \Sigma^{-1},$

where $\Sigma = \mathrm{cov}(X)$ and $\Sigma_{yxx} = E\{(Y - EY)XX^\top\}$. The connection between (3) and (2) based on model (1) suggests that (3) is an ideal device for interaction detection. The identity (3) is the foundation of our work; hereinafter, the principal Hessian matrix we refer to is (3). This setting is reasonable for solving the challenging problem of large-scale interaction detection.
In the ideal case that $\Sigma = I_p$, (3) reduces to $H = E\{(Y - EY)XX^\top\}$, and under model (1) it recovers the interaction coefficients provided that the linear contribution does not interfere. The condition is satisfied by the normal and elliptically symmetric distributions. Moreover, if there is no linear contribution from $X$, or if, as the main effect, it has been estimated and removed from $Y$, such a distributional condition is not required for detecting the interactions. That is, consistent estimation of the main effects helps alleviate the conditions on the distribution of $X$.
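As a quick numerical sanity check of this identity (an illustrative sketch under our reconstructed notation: with $\Sigma = I_p$ and no main effects, the matrix $E\{(Y - EY)XX^\top\}$ should place its signal exactly on the interacting pair), one can verify by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, gamma = 20000, 5, 3.0

# X ~ N(0, I_p), so the covariance weighting in (3) is the identity
X = rng.standard_normal((n, p))
# response with a single interaction between the first two predictors, no main effects
y = gamma * X[:, 0] * X[:, 1] + rng.standard_normal(n)

yc = y - y.mean()
M = (X * yc[:, None]).T @ X / n   # sample version of E[(Y - EY) X X^T]

print(M[0, 1])   # close to gamma = 3 (the cross-partial of gamma * x1 * x2)
print(M[2, 3])   # close to 0 (no interaction between X_3 and X_4)
```

The interacting pair stands out by an order of magnitude against the non-interacting entries.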
Rigorously, we have the following proposition connecting $H$ in (3) with the interaction effects in the regression model (1).
Proposition 1.
If $X \sim N(0, \Sigma)$, then under model (1), $H = 0$ if and only if $\Omega = 0$.
Furthermore, the normal distribution assumed in Proposition 1 can be more broadly generalized with the following conditions.

Condition 1. $X$ satisfies:

(C1) for any ;

(C2) Let . It holds that if and only if or .
Condition 1 is clearly valid for the normal distribution, and it remains true more broadly. For example, for the elliptically symmetric distributions whose characteristic function takes the form , Condition 1 can be verified. Other practically appealing situations for Condition 1 include when $X$ contains independent components, and/or when only a few of the interaction coefficients in (1) are nonzero. We note that the essential role of C1 is to ensure that the linear contribution from $X$ does not interfere with that from the interactions. Condition C1 is not required if the linear contribution is estimated and removed before performing interaction detection. Existing high-dimensional penalized regression methods, e.g., those in Bühlmann and van de Geer (2011), Fan and Lv (2010), and Hastie et al. (2015), can be applied for such a purpose. Additionally, C2 of Condition 1 concerns the correlations between the products of predictor variables. It is not restrictive, observing that , implying zero correlations between and for . We also note that, by its definition, $H$ is suitable for studying interactions between numerical predictor variables. For categorical variables, dummy-variable coding is needed, resulting in subgroups of the observations, so that our method and other penalized regression analyses can be applied within the resulting groups for detecting the interactions between numerical and categorical variables.
2.2 Sparse Estimation of $H$
Our study intends to explore an effective estimator of $H$ with high-dimensional data, and then to detect the interactions. In high-dimensional regression problems, sparsity is a useful notion for statistical inference; see, among others,
Hastie et al. (2015). In the context of interactions in (1), it means that the total number of contributing pairs of predictor variables is small, so that $H$ is sparse. Thus, exploring the interactions becomes a sparse matrix estimation problem. By examining $H$ in (3), we observe that a sparse estimation of this matrix is challenging. First, estimating $\Sigma^{-1}$ is difficult when $p$ is large. Though existing methods can be applied, e.g., those reviewed in Fan et al. (2016), there are two major concerns. The first is whether or not it is necessary to impose assumptions such as $\Sigma^{-1}$ being sparse. The second is that if one constructs an estimator of $H$ from sparse estimators of its components, the implied sparse components may not be the desirable ones in $H$. As an example, we consider , , with , and . Then we have
Then, some algebra shows that in this case the principal Hessian matrix
That is, $H$ precisely reflects the signal from the interactions, even though both of its components can be dense matrices.
In our investigation, we consider the case that $H$ is sparse, with no explicit requirements on , , and . All our technical development is based on assumptions directly on $H$. Specifically, we aim at a one-step sparse estimator of $H$. The key development is the following. First, we observe from (3) that $\Sigma H \Sigma = \Sigma_{yxx}$. This motivates us to consider the weighted quadratic loss
(4)  $L(B) = \frac{1}{2}\,\mathrm{tr}(B^\top \Sigma B \Sigma) - \mathrm{tr}(B^\top \Sigma_{yxx}),$

which is minimized at $B = H$. With some elementary algebra, it can be shown that $L(B)$ depends on the underlying distribution only through $\Sigma$ and $\Sigma_{yxx}$.
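To see why the quadratic loss is minimized at the principal Hessian matrix, a short verification under the reconstructed notation above:

\[
\nabla_B L(B) \;=\; \Sigma B \Sigma - \Sigma_{yxx} \;=\; 0
\quad\Longrightarrow\quad
B \;=\; \Sigma^{-1}\Sigma_{yxx}\Sigma^{-1} \;=\; H,
\]

using the identity $\Sigma H \Sigma = \Sigma_{yxx}$ from (3); the minimizer is unique whenever $\Sigma$ is positive definite, since the Hessian of $L$ is $\Sigma \otimes \Sigma \succeq 0$.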
We denote by $\{(X_i, Y_i)\}_{i=1}^n$ the observed data. Then, we propose to replace $\Sigma$ and $\Sigma_{yxx}$ by their sample counterparts $\hat\Sigma = n^{-1}\sum_{i=1}^n X_i X_i^\top$ and $\hat\Sigma_{yxx} = n^{-1}\sum_{i=1}^n (Y_i - \bar Y) X_i X_i^\top$, with $\bar Y = n^{-1}\sum_{i=1}^n Y_i$, and to apply a sparsity-inducing penalty function. Concretely, we propose the M-estimator:
(5)  $\hat H = \arg\min_{B} \; \frac{1}{2}\,\mathrm{tr}(B^\top \hat\Sigma B \hat\Sigma) - \mathrm{tr}(B^\top \hat\Sigma_{yxx}) + \lambda \|B\|_1,$

where $\|B\|_1 = \sum_{j,k}|B_{jk}|$ is the $\ell_1$-norm of the matrix $B$, and $\lambda > 0$ is a tuning parameter. With $\hat H$, we propose to detect interactions as
(6)  $\hat{\mathcal{S}} = \{(j, k) : \hat H_{jk} \neq 0\}.$
We note that the estimator $\hat H$ may not be symmetric. For practical applications, we suggest symmetrizing it as $(\hat H + \hat H^\top)/2$. Because only linear operations are involved in (5), with no inversion of a large matrix, evaluating the loss is computationally efficient and scalable. In the next section, we design an efficient algorithm to compute the estimator.
3 Algorithm
3.1 Alternating Direction Method of Multipliers
We first observe that the Hessian of problem (5) is $\hat\Sigma \otimes \hat\Sigma$, which is positive semidefinite, where $\otimes$ denotes the Kronecker product. The loss function in (5) is thus convex. We propose to solve (5) by the alternating direction method of multipliers (ADMM). Our algorithm is inspired by the one developed in Jiang et al. (2016), in which a large matrix for classification with quadratic discriminant analysis is directly estimated in the same spirit as (5).
ADMM was first proposed in Glowinski and Marroco (1975); it is essentially a splitting version of the augmented Lagrangian method for solving optimization problems with a separable objective under a linear constraint:

\[ \min_{x, z} \; f(x) + g(z) \quad \text{subject to} \quad A_1 x + A_2 z = c, \]

where $f$ and $g$ are continuous functions. Recently, ADMM has found wide application in fields such as statistics, image processing, and machine learning, owing to its easy implementation and practical efficiency; see Boyd et al. (2011) and Eckstein and Yao (2012) for recent reviews.
To apply the ADMM algorithm to compute our estimator, we rewrite problem (5) into the following equivalent form to facilitate the algorithm design:

(7)  $\min_{B, Z} \; \frac{1}{2}\,\mathrm{tr}(B^\top \hat\Sigma B \hat\Sigma) - \mathrm{tr}(B^\top \hat\Sigma_{yxx}) + \lambda \|Z\|_1 \quad \text{subject to} \quad B = Z.$

The augmented Lagrangian associated with the above problem is

\[ L_\rho(B, Z, \Lambda) = \frac{1}{2}\,\mathrm{tr}(B^\top \hat\Sigma B \hat\Sigma) - \mathrm{tr}(B^\top \hat\Sigma_{yxx}) + \lambda\|Z\|_1 + \langle \Lambda, B - Z \rangle + \frac{\rho}{2}\|B - Z\|_F^2, \]

where $\Lambda$ is the dual variable associated with the equality constraint, and $\rho > 0$ is a penalty parameter. The ADMM algorithm runs iteratively; at the $t$-th iteration, we update the solutions by

(8)  $B^{t+1} = \arg\min_B L_\rho(B, Z^t, \Lambda^t), \quad Z^{t+1} = \arg\min_Z L_\rho(B^{t+1}, Z, \Lambda^t), \quad \Lambda^{t+1} = \Lambda^t + \rho\,(B^{t+1} - Z^{t+1}).$
Note that since our problem is convex and the objective value is bounded below, the convergence of the algorithm is well established in the existing literature; see Fang et al. (2015) and Sun et al. (2015) for examples. We further point out that in our simulation studies in Section 4, the algorithm performs well empirically.
3.2 Solving the Subproblems
The key to implementing the ADMM algorithm developed above efficiently is to solve the $B$- and $Z$-subproblems in (8) efficiently. In this subsection, we derive efficient solutions for the two subproblems. For the $B$-subproblem, we have

\[ B^{t+1} = \arg\min_B \; \frac{1}{2}\,\mathrm{vec}(B)^\top (\hat\Sigma \otimes \hat\Sigma)\,\mathrm{vec}(B) - \langle \hat\Sigma_{yxx}, B \rangle + \langle \Lambda^t, B \rangle + \frac{\rho}{2}\|B - Z^t\|_F^2, \]

where $\otimes$ denotes the Kronecker product. Following Fang et al. (2015), we add a proximal term, letting

\[ B^{t+1} = \arg\min_B \; \frac{1}{2}\,\mathrm{vec}(B)^\top (\hat\Sigma \otimes \hat\Sigma)\,\mathrm{vec}(B) - \langle \hat\Sigma_{yxx}, B \rangle + \langle \Lambda^t, B \rangle + \frac{\rho}{2}\|B - Z^t\|_F^2 + \frac{1}{2}\|B - B^t\|_S^2, \]

where $\|M\|_S^2 = \mathrm{vec}(M)^\top S\,\mathrm{vec}(M)$ is the norm induced by a positive definite matrix $S$. Letting $S = \eta I - \hat\Sigma \otimes \hat\Sigma$, where $\eta$ is greater than the largest eigenvalue of $\hat\Sigma \otimes \hat\Sigma$, we obtain the closed-form update

\[ B^{t+1} = \frac{1}{\eta + \rho}\left( \hat\Sigma_{yxx} - \Lambda^t + \rho Z^t + \eta B^t - \hat\Sigma B^t \hat\Sigma \right). \]

We point out that the most computationally expensive step is to compute the Kronecker product $\hat\Sigma \otimes \hat\Sigma$, but this only needs to be done once at the beginning.
Next, considering the $Z$-subproblem, we have

\[ Z^{t+1} = \arg\min_Z \; \lambda \|Z\|_1 - \langle \Lambda^t, Z \rangle + \frac{\rho}{2}\|B^{t+1} - Z\|_F^2. \]

It is not difficult to see that this step admits a closed-form solution by soft-thresholding:

\[ Z^{t+1} = \mathcal{S}_{\lambda/\rho}\!\left( B^{t+1} + \Lambda^t/\rho \right), \]

where $\mathcal{S}_\tau(\cdot)$ is the elementwise shrinkage operator such that, for a matrix $A$ and $\tau > 0$, the $(j,k)$-th entry is $[\mathcal{S}_\tau(A)]_{jk} = \mathrm{sign}(A_{jk}) \max(|A_{jk}| - \tau, 0)$.
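In code, the elementwise shrinkage operator is a one-liner; a small NumPy sketch (function name our own):

```python
import numpy as np

def soft_threshold(A, tau):
    # element-wise shrinkage: sign(a_jk) * max(|a_jk| - tau, 0)
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

A = np.array([[1.5, -0.3],
              [0.2, -2.0]])
print(soft_threshold(A, 0.5))   # entries within 0.5 of zero are killed, others shrunk
```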
For the stopping criterion, we examine the primal and dual residuals. The primal residual measures the feasibility for problem (7) and is defined as

\[ r^{t} = B^{t} - Z^{t}. \]

Meanwhile, the dual residual measures the convergence of the algorithm, where we take

\[ s^{t} = \rho\,(Z^{t} - Z^{t-1}). \]

In our implementation, we stop the algorithm when both the primal and dual residuals are small, i.e., when $\|r^t\|_F$ and $\|s^t\|_F$ fall below a prescribed tolerance.
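Putting the pieces together, the whole iteration can be sketched in a few lines. This is an illustrative NumPy implementation under our reconstructed notation, not the authors' code: for simplicity it solves the $B$-subproblem exactly in the eigenbasis of $\hat\Sigma$ (using the fact that the eigenvalues of $\hat\Sigma \otimes \hat\Sigma$ are products of pairs of eigenvalues of $\hat\Sigma$) instead of the proximal linearized update, and it uses a scaled dual variable; all names are assumptions.

```python
import numpy as np

def soft(M, tau):
    # element-wise shrinkage operator S_tau
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def admm_phm(Sigma, Syxx, lam, rho=1.0, n_iter=500, tol=1e-6):
    """ADMM sketch for min_B 0.5*tr(B' Sigma B Sigma) - tr(B' Syxx) + lam*||B||_1."""
    p = Sigma.shape[0]
    w, Q = np.linalg.eigh(Sigma)          # Sigma = Q diag(w) Q'
    denom = np.outer(w, w) + rho          # eigenvalues of Sigma (x) Sigma, shifted by rho
    B = np.zeros((p, p))
    Z = np.zeros((p, p))
    U = np.zeros((p, p))                  # scaled dual variable Lambda / rho
    for _ in range(n_iter):
        # B-update: solve Sigma B Sigma + rho B = Syxx + rho (Z - U) in the eigenbasis
        C = Syxx + rho * (Z - U)
        B = Q @ ((Q.T @ C @ Q) / denom) @ Q.T
        Z_old = Z
        Z = soft(B + U, lam / rho)        # Z-update: soft-thresholding
        U = U + B - Z                     # dual update
        # stop when primal (B - Z) and dual (rho (Z - Z_old)) residuals are small
        if np.linalg.norm(B - Z) < tol and rho * np.linalg.norm(Z - Z_old) < tol:
            break
    return (Z + Z.T) / 2                  # symmetrized sparse estimate
```

On a toy data set with a single interaction, the returned matrix is sparse with its large entries on the interacting pair.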
4 Numerical Studies
4.1 Synthetic Data
In this section, we conduct extensive numerical studies to demonstrate and validate the performance of our proposed method in practice. We first conduct the investigation using synthetic data. In our simulation setup, we fix the sample size, and we consider different dimensions $p = 100$, 200 and 300. Meanwhile, we generate the design matrix by drawing each sample independently from a $p$-dimensional Gaussian distribution $N(0, \Sigma)$, where the covariance matrix $\Sigma$ is either the identity matrix or a Toeplitz matrix, i.e., $\Sigma_{jk} = \rho^{|j-k|}$ for some $\rho > 0$. We then generate the noises $\varepsilon_i$ independently from a normal random variable $N(0, \sigma^2)$, and we consider different $\sigma$'s. To thoroughly compare the proposed method with other methods, we consider the following eight models:
Model 1: ;

Model 2: ;

Model 3: ;

Model 4: ;

Model 5: ;

Model 6: ;

Model 7: ;

Model 8: .
In Model 1, we consider the case where only the main effects are present. This is a benchmarking case in the sense that a valid detection method should produce no false inclusion of interactions. In the next four models, we consider cases where only interaction terms are present under different scenarios. With the interaction-only models, we intend to show the advantage of the proposed one-step procedure. Then, we consider two models where hierarchical structures exist, Model 6 and Model 7, with different signal strengths. Finally, we consider an example with heteroscedasticity in Model 8, in which the conditional variance of the response variable is not homogeneous.
We choose the tuning parameter $\lambda$ using 10-fold cross-validation, and we point out that, after extensive numerical investigation, we find our method to be insensitive to the tuning parameter selection. We report the empirical true positive rate (TPR) and false positive rate (FPR) of (6) for each data-generating scheme after repeating each scheme 200 times. In particular, let $\mathcal{S}_0$ be the true set of interaction terms, and let $\hat{\mathcal{S}}$ be the set of interaction terms selected by the estimator (6). TPR and FPR are defined as

\[ \mathrm{TPR} = \frac{|\hat{\mathcal{S}} \cap \mathcal{S}_0|}{|\mathcal{S}_0|} \quad \text{and} \quad \mathrm{FPR} = \frac{|\hat{\mathcal{S}} \setminus \mathcal{S}_0|}{\#\{\text{candidate pairs}\} - |\mathcal{S}_0|}, \]

respectively.
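The two rates can be computed directly from the selected and true index sets; a minimal sketch (function name and set representation are our own):

```python
def tpr_fpr(selected, truth, p):
    """Empirical TPR/FPR over the p*(p-1)/2 candidate interaction pairs (j < k)."""
    candidates = {(j, k) for j in range(p) for k in range(j + 1, p)}
    tp = len(selected & truth)
    fp = len(selected - truth)
    tpr = tp / len(truth) if truth else float("nan")
    fpr = fp / len(candidates - truth)
    return tpr, fpr

# example: the one true pair recovered, plus one false inclusion, among p = 5 predictors
print(tpr_fpr({(0, 1), (2, 3)}, {(0, 1)}, 5))   # TPR = 1.0, FPR = 1/9
```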
We compare our method with the interaction pursuit (IP) method (Fan et al., 2016), a two-step method with a screening procedure as its first step, and the RAMP method (Hao et al., 2018), which is based on a hierarchical structure. We report the results in Tables 1–4, each containing the results for two of the above models.
First, as seen in Table 1 for Model 1, our one-step method performs very well, with very little false inclusion when no interactions are present. The other competing methods also perform very well in the sense of little or no false inclusion. In all other models for detecting the contributing interactions, we see that the proposed method outperforms the other two methods by large margins in all settings except Model 6, where a hierarchical structure is present with good signal strength, which favors the RAMP method when the main effect is correctly detected first. Nevertheless, we see that in Model 7, though a hierarchical model is present, due to the low signal strength of the main effects, our method still outperforms the RAMP method substantially. This clearly demonstrates the advantage of our one-step estimation based interaction detection method.
4.2 Real Data
We further apply the proposed method to analyze the GPL96 microarray dataset studied in McCall et al. (2010), Wu et al. (2013), and Fang et al. (2017). This dataset contains samples from more than 2,000 biological contexts generated from Affymetrix Human 133A (GPL96) arrays, and each sample has genes. The dataset was preprocessed and normalized using frozen RMA (McCall et al., 2010) to reduce batch effects. We use the samples of breast tumor and the normal samples in the analysis. To improve the efficiency of our demonstration, we conduct some prescreening. In particular, we use a plug-in estimator to estimate , where denotes the Moore-Penrose pseudoinverse of . We then screen out the variables for which the norms of the corresponding columns of are small. In our analysis, we screen the genes and keep genes.
We treat the disease status as the response, and randomly split the dataset into a training set and a testing set. Each training set contains 100 samples from the breast tumor group and 150 samples from the normal group. We repeat the random split 100 times. We first report the averaged testing errors and their standard errors in Table 5. It is seen that the proposed method achieves better testing errors than the other two methods. We further provide the five most frequently selected interaction terms in Table 6. It is seen that BRCA1 and BRCA2 are frequently selected by our proposed method, including the interaction between these two genes, while they are less frequently selected by the other two methods. It is well known in the literature that BRCA1 and BRCA2 are of fundamental importance for breast tumors, as shown in King et al. (2003). This shows the promising applicability of our method in practice.

p = 100  p = 200  p = 300

Model  (ρ, σ)  Method  TPR  FPR  TPR  FPR  TPR  FPR
1  (0, 0.1)  ADMM  NA  0.03%  NA  0.01%  NA  0.00% 
IP  NA  0.15%  NA  0.11%  NA  0.03%  
RAMP  NA  0.00%  NA  0.00%  NA  0.00%  
(0, 2)  ADMM  NA  0.04%  NA  0.01%  NA  0.00%  
IP  NA  0.14%  NA  0.02%  NA  0.00%  
RAMP  NA  0.00%  NA  0.00%  NA  0.00%  
(0.2, 1)  ADMM  NA  0.05%  NA  0.03%  NA  0.01%  
IP  NA  0.08%  NA  0.02%  NA  0.00%  
RAMP  NA  0.00%  NA  0.00%  NA  0.00%  
(0.2, 2)  ADMM  NA  0.05%  NA  0.02%  NA  0.00%  
IP  NA  0.12%  NA  0.03%  NA  0.00%  
RAMP  NA  0.00%  NA  0.00%  NA  0.00%  
2  (0, 0.1)  ADMM  99.0%  0.08%  96.0%  0.09%  92.5%  0.13% 
IP  82.0%  0.06%  67.0%  0.08%  66.0%  0.02%  
RAMP  1.00%  0.03%  0.00%  0.06%  0.00%  0.01%  
(0, 1.0)  ADMM  98.5%  0.12%  92.0%  0.13%  90.0%  0.17%  
IP  54.0%  0.01%  47.5%  0.04%  42.5%  0.04%  
RAMP  2.00%  0.22%  0.00%  0.06%  0.00%  0.00%  
(0.1, 0.1)  ADMM  96.5%  0.14%  93.5%  0.18%  91.0%  0.11%  
IP  73.0%  0.19%  68.5%  0.11%  65.0%  0.08%  
RAMP  0.00%  0.00%  0.00%  0.00%  0.00%  0.00%  
(0.1, 1.0)  ADMM  95.0%  0.30%  91.5%  0.26%  89.0%  0.13%  
IP  49.5%  0.18%  40.5%  0.10%  28.0%  0.06%  
RAMP  0.00%  0.00%  0.00%  0.00%  0.00%  0.00% 
p = 100  p = 200  p = 300

Model  (ρ, σ)  Method  TPR  FPR  TPR  FPR  TPR  FPR
3  (0.1, 0.1)  ADMM  99.5%  0.10%  95.0%  0.08%  94.5%  0.13% 
IP  93.5%  0.19%  82.5%  0.05%  80.5%  0.03%  
RAMP  2.00%  0.02%  2.0%  0.01%  2.00%  0.00%  
(0.1, 1.0)  ADMM  96.5%  0.29%  93.0%  0.13%  91.5%  0.08%  
IP  76.0%  0.17%  68.5%  0.03%  62.0%  0.02%  
RAMP  0.00%  0.00%  1.00%  0.00%  0.00%  0.00%  
(0.4, 0.1)  ADMM  100.0%  0.20%  99.0%  0.15%  96.0%  0.07%  
IP  97.0%  0.10%  91.5%  0.08%  89.0%  0.05%  
RAMP  2.00%  0.00%  0.00%  0.00%  3.00%  0.00%  
(0.4, 1.0)  ADMM  99.0%  0.40%  99.0%  0.19%  97.0%  0.10%  
IP  92.0%  0.10%  89.5%  0.08%  86.0%  0.02%  
RAMP  2.00%  0.00%  0.50%  0.00%  0.50%  0.00%  
4  (0, 1)  ADMM  98.5%  0.12%  96.0%  0.09%  95.5%  0.03% 
IP  89.0%  0.08%  82.5%  0.04%  73.0%  0.02%  
RAMP  3.50%  0.00%  1.00%  0.06%  0.00%  0.01%  
(0, 1.5)  ADMM  98.0%  0.13%  94.0%  0.09%  91.5%  0.04%  
IP  74.0%  0.09%  62.0%  0.05%  52.0%  0.03%  
RAMP  0.50%  0.00%  2.00%  0.00%  0.00%  0.00%  
(0.05, 0.5)  ADMM  98.0%  0.15%  95.0%  0.09%  93.0%  0.13%  
IP  94.5%  0.12%  91.5%  0.10%  89.0%  0.06%  
RAMP  9.00%  0.00%  6.00%  0.00%  0.00%  0.00%  
(0.15, 1)  ADMM  98.5%  0.18%  94.0%  0.13%  93.0%  0.06%  
IP  90.5%  0.13%  81.5%  0.06%  76.5%  0.02%  
RAMP  8.00%  0.00%  3.00%  0.00%  2.50%  0.00% 
p = 100  p = 200  p = 300

Model  (ρ, σ)  Method  TPR  FPR  TPR  FPR  TPR  FPR
5  (0, 0.1)  ADMM  94.7%  0.17%  92.0%  0.09%  90.3%  0.05% 
IP  80.7%  0.14%  91.5%  0.10%  70.3%  0.04%  
RAMP  9.33%  0.00%  9.67%  0.00%  10.7%  0.00%  
(0, 1)  ADMM  92.0%  0.12%  91.3%  0.07%  89.7%  0.05%  
IP  78.3%  0.11%  75.0%  0.06%  68.7%  0.03%  
RAMP  8.00%  0.00%  3.00%  0.00%  6.67%  0.00%  
(0.2, 0.1)  ADMM  95.3%  0.31%  94.0%  0.10%  93.7%  0.04%  
IP  81.3%  0.28%  72.3%  0.09%  71.0%  0.03%  
RAMP  16.0%  0.00%  11.3%  0.00%  0.00%  0.01%  
(0.2, 1.0)  ADMM  93.7%  0.11%  92.7%  0.08%  91.5%  0.04%  
IP  79.7%  0.08%  77.3%  0.06%  69.7%  0.04%  
RAMP  10.7%  0.00%  8.33%  0.00%  7.33%  0.00%  
6  (0, 1)  ADMM  98.5%  0.23%  93.0%  0.19%  91.5%  0.10% 
IP  98.5%  0.20%  92.5%  0.12%  92.0%  0.08%  
RAMP  99.5%  0.00%  100.0%  0.06%  100.0%  0.00%  
(0, 1.5)  ADMM  92.0%  0.31%  88.5%  0.16%  82.0%  0.04%  
IP  91.5%  0.15%  86.0%  0.11%  80.0%  0.03%  
RAMP  96.5%  0.00%  95.5%  0.00%  94.0%  0.00%  
(0.2, 1)  ADMM  96.0%  0.23%  95.0%  0.14%  92.5%  0.08%  
IP  97.0%  0.12%  94.5%  0.08%  91.0%  0.05%  
RAMP  100.0%  0.00%  100.0%  0.00%  99.0%  0.00%  
(0.4, 1)  ADMM  97.5%  0.19%  95.5%  0.10%  95.0%  0.04%  
IP  98.0%  0.11%  96.0%  0.06%  93.5%  0.01%  
RAMP  100.0%  0.00%  100.0%  0.00%  100.0%  0.00% 
p = 100  p = 200  p = 300

Model  (ρ, σ)  Method  TPR  FPR  TPR  FPR  TPR  FPR
7  (0, 1)  ADMM  100.0%  0.29%  100.0%  0.19%  97.5%  0.08% 
IP  98.5%  0.27%  95.5%  0.13%  90.0%  0.04%  
RAMP  1.00%  0.00%  2.50%  0.00%  2.00%  0.00%  
(0, 1.5)  ADMM  97.0%  0.34%  95.5%  0.17%  94.5%  0.07%  
IP  94.0%  0.29%  92.0%  0.16%  89.5%  0.06%  
RAMP  1.50%  0.00%  2.00%  0.00%  1.00%  0.00%  
(0.2, 1)  ADMM  100.0%  0.23%  99.5%  0.13%  97.0%  0.06%  
IP  97.0%  0.24%  94.0%  0.12%  90.5%  0.05%  
RAMP  1.50%  0.00%  2.50%  0.00%  2.00%  0.00%  
(0.2, 1.5)  ADMM  96.5%  0.25%  95.0%  0.17%  94.0%  0.06%  
IP  93.0%  0.27%  91.5%  0.13%  88.0%  0.04%  
RAMP  1.00%  0.00%  1.50%  0.00%  1.00%  0.00%  
8  (0, 1.0)  ADMM  97.0%  0.31%  96.5%  0.18%  94.5%  0.05% 
IP  75.5%  0.29%  60.0%  0.17%  57.5%  0.08%  
RAMP  0.00%  0.00%  1.00%  0.00%  0.00%  0.00%  
(0, 1.25)  ADMM  95.0%  0.28%  93.5%  0.16%  91.0%  0.08%  
IP  48.5%  0.23%  39.0%  0.13%  35.0%  0.06%  
RAMP  1.00%  0.00%  2.50%  0.00%  1.00%  0.00%  
(0.2, 1.0)  ADMM  95.5%  0.35%  92.5%  0.23%  91.5%  0.12%  
IP  67.5%  0.28%  54.0%  0.20%  52.5%  0.13%  
RAMP  1.00%  0.00%  1.00%  0.00%  1.50%  0.00%  
(0.2, 1.25)  ADMM  94.0%  0.29%  92.0%  0.17%  90.5%  0.09%  
IP  64.0%  0.25%  52.0%  0.03%  50.5%  0.08%  
RAMP  1.00%  0.00%  2.50%  0.00%  2.00%  0.00% 
Method  Classification Error  Median Model Size 

ADMM  6.13% (0.24%)  97 
IP  7.38% (0.22%)  85 
RAMP  8.15% (0.35%)  68 
ADMM  IP  RAMP  

Interaction  Frequency  Interaction  Frequency  Interaction  Frequency 
BRCA1 × LMO3  75  cMyc × KLK3  69  ESR1 × IRF4  64 
BRCA1 × PALB2  71  FOXA1 × NFB  62  SERINC5 × CHPF  61 
TP53 × RXRA  64  JUND × GSTP1  56  EPCAM × RBP1  55 
BRCA2 × HNF4A  59  BRCA1 × ALCAM  52  MAF × LMOD1  51 
PDLIM5 × SETDB1  58  DPT × SETDB1  51  BRCA2 × TARP  51
5 Some Theoretical Analysis
5.1 Nonasymptotic Results
We now present some theoretical analysis of the estimator (5). Outlines of the proofs are provided in the Appendix. Denote by $H^*$ the unknown truth of the principal Hessian matrix, by $\mathcal{S}$ the support of $H^*$, and by $s$ the cardinality of $\mathcal{S}$. Denote by $\|M\|_{\max} = \max_{j,k} |M_{jk}|$ the max-norm of a matrix $M$.
The following lemma establishes a nonasymptotic result on the support of .
Lemma 1.
Let be the cone depending on the truth . We have that provided that the tuning parameter satisfies .
We also have the following lemma, containing a nonasymptotic oracle inequality.
Lemma 2.
Let . If , then
To establish an estimation error bound of , we need the following condition.
Condition 2.
(Restricted eigenvalue condition) Let . For all ,
(9) 
Then we have the following theorem, establishing a key nonasymptotic result on the estimating error bound of .
Theorem 1.
Suppose that the restricted eigenvalue Condition 2 holds. If , then .
The main message of Theorem 1 is the good nonasymptotic performance of the estimator in the sense of a small error bound. The restricted eigenvalue Condition 2 essentially requires that, on the restricted set in Condition 2, the smallest eigenvalue of $\hat\Sigma \otimes \hat\Sigma$ is strictly bounded away from 0. We remark that condition (9) here is not essentially stronger than the analogous one for the sparse estimator in the high-dimensional linear model, as in Hastie et al. (2015) and Bühlmann and van de Geer (2011), which requires an analogous bound for all elements in a restricted set depending on the unknown truth. Then, by observing that the smallest eigenvalue of $\hat\Sigma \otimes \hat\Sigma$ is the square of the smallest eigenvalue of $\hat\Sigma$, we see that Condition 2 is essentially of the same kind.
5.2 Asymptotic Results
The restricted eigenvalue condition (9) is known to be important for establishing error bounds of penalized estimators in regression problems; see, among others, Negahban et al. (2012) and Hastie et al. (2015). Since our estimator (5) is the minimizer of the sum of a quadratic loss and a penalty, this type of restricted eigenvalue condition is expected. It is known in the literature that, in an asymptotic setting where the entries of the design follow distributions satisfying suitable conditions on the tail probabilistic behavior, condition (9) holds with probability tending to 1. We refer to the discussion in Negahban et al. (2012) and the results in Raskutti et al. (2010), Raskutti et al. (2011), and Rudelson and Zhou (2011). As in Lemma 1 and Theorem 1, the condition imposes a requirement on the tuning parameter $\lambda$. By their definitions, $\hat\Sigma$ is an estimator of $\Sigma$, and $\hat\Sigma_{yxx}$ is an estimator of $\Sigma_{yxx}$. In an asymptotic setting with $n \to \infty$ and $p \to \infty$, they converge to the truths elementwise in probability under appropriate conditions on the tail distributions of $X$ and $\varepsilon$. Thus, the tuning parameter $\lambda$ is allowed to go to 0 as $n \to \infty$, ensuring that and in probability, i.e., consistency of $\hat H$.
Formally, for showing the asymptotic properties of , we impose the following condition.
Condition 3.
The random vectors $(X_i, Y_i)$, $i = 1, \dots, n$, are independent and identically distributed. The largest eigenvalue of $\Sigma$ is strictly bounded away from infinity. Both $X$ and $\varepsilon$ satisfy the exponential tail property; i.e., there exist constants , , and such that
Condition 3 ensures the concentration property of , so that by choosing , holds with probability at least when , with some constants .
Then we have the following asymptotic result on the error bound of .
Theorem 2.
Suppose that is sparse with support , and the cardinality . Suppose additionally that , . Then under Condition 3, by choosing , .
To ensure that zero components of are correctly estimated as zero by , we need the following irrepresentable condition.
Condition 4.
(Irrepresentable condition) Let with and corresponding to the partition of and . Then for some , where is the th column of .
Theorem 3.
References
 Bickel and Levina (2008) Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Annals of Statistics, 36:199–227.
 Bien et al. (2013) Bien, J., Taylor, J., and Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3):1111–1141.
 Boyd et al. (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
 Bühlmann and van de Geer (2011) Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. New York: Springer.
 Choi et al. (2010) Choi, N. H., Li, W., and Zhu, J. (2010). Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association, 105(489):354–364.
 Cook (1998) Cook, R. D. (1998). Principal Hessian directions revisited. Journal of the American Statistical Association, 93(441):84–94.
 Eckstein and Yao (2012) Eckstein, J. and Yao, W. (2012). Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. RUTCOR Research Reports, 32:3.
 Fan and Lv (2010) Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101–148.
 Fan et al. (2016) Fan, Y., Kong, Y., Li, D., and Lv, J. (2016). Interaction pursuit with feature screening and selection. Manuscript. arXiv:1605.08933.
 Fang et al. (2015) Fang, E. X., He, B., Liu, H., and Yuan, X. (2015). Generalized alternating direction method of multipliers: new theoretical insights and applications. Mathematical programming computation, 7(2):149–187.
 Fang et al. (2017) Fang, E. X., Li, M.D., Jordan, M. I., and Liu, H. (2017). Mining massive amounts of genomic data: a semiparametric topic modeling approach. Journal of the American Statistical Association, 112(519):921–932.
 Glowinski and Marroco (1975) Glowinski, R. and Marroco, A. (1975). Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisationdualité d’une classe de problèmes de dirichlet non linéaires. Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(R2):41–76.
 Hao et al. (2016) Hao, N., Feng, Y., and Zhang, H. H. (2016). Model selection for highdimensional quadratic regression via regularization. Journal of the American Statistical Association, pages 1–11.
 Hao et al. (2018) Hao, N., Feng, Y., and Zhang, H. H. (2018). Model selection for highdimensional quadratic regression via regularization. Journal of the American Statistical Association, pages 1–11.
 Hao and Zhang (2014) Hao, N. and Zhang, H. H. (2014). Interaction screening for ultrahighdimensional data. Journal of the American Statistical Association, 109(507):1285–1301.
 Haris et al. (2016) Haris, A., Witten, D., and Simon, N. (2016). Convex modeling of interactions with strong heredity. Journal of Computational and Graphical Statistics, 25(4):981–1004.
 Hastie et al. (2015) Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Taylor & Francis Inc.
 Jaccard and Turrisi (2003) Jaccard, J. J. and Turrisi, R. (2003). Interaction Effects in Multiple Regression. SAGE PUBN.
 Jiang and Liu (2014) Jiang, B. and Liu, J. S. (2014). Variable selection for general index models via sliced inverse regression. The Annals of Statistics, 42(5):1751–1786.
 Jiang et al. (2016) Jiang, B., Wang, X., and Leng, C. (2016). Quda: A direct approach for sparse quadratic discriminant analysis. Manuscript. arXiv:1510.00084.
 King et al. (2003) King, M.C., Marks, J. H., Mandell, J. B., et al. (2003). Breast and ovarian cancer risks due to inherited mutations in brca1 and brca2. Science, 302(5645):643–646.
 Kong et al. (2017) Kong, Y., Li, D., Fan, Y., and Lv, J. (2017). Interaction pursuit in highdimensional multiresponse regression via distance correlation. The Annals of Statistics, 45(2):897–922.
 Li (1991) Li, K.C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327.
 Li (1992) Li, K.C. (1992). On principal hessian directions for data visualization and dimension reduction: Another application of stein’s lemma. Journal of the American Statistical Association, 87(420):1025–1039.

Li and Liu (2017)
Li, Y. and Liu, J. S. (2017).
Robust variable and interaction selection for logistic regression and general index models.
Journal of the American Statistical Association, pages 1–16.  McCall et al. (2010) McCall, M. N., Bolstad, B. M., and Irizarry, R. A. (2010). Frozen robust multiarray analysis (fRMA). Biostatistics, 11(2):242–253.
 Negahban et al. (2012) Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for highdimensional analysis of $m$estimators with decomposable regularizers. Statistical Science, 27(4):538–557.
 Nelder (1977) Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society. Series A (General), 140(1):48.
 Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259.

Raskutti et al. (2011)
Raskutti, G., Wainwright, M. J., and Yu, B. (2011).
Minimax rates of estimation for highdimensional linear regression over $\ell_q$balls.
IEEE Transactions on Information Theory, 57(10):6976–6994.  Rudelson and Zhou (2011) Rudelson, M. and Zhou, S. (2011). Reconstruction from anisotropic random measurements. Manuscript. arXiv:1106.1151.
 Shah (2016) Shah, R. D. (2016). Modelling interactions in highdimensional data with backtracking. Journal of Machine Learning Research, 17(1):7225–7255.
 She et al. (2018) She, Y., Wang, Z., and Jiang, H. (2018). Group regularized estimation under structural hierarchy. Journal of the American Statistical Association, 113(521):445–454.
 Stein (1981) Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151.
 Sun et al. (2015) Sun, D., Toh, K.C., and Yang, L. (2015). A convergent 3block semiproximal alternating direction method of multipliers for conic programming with 4type constraints. SIAM Journal on Optimization, 25(2):882–915.
 Wu et al. (2013) Wu, G., Yustein, J. T., McCall, M. N., Zilliox, M., Irizarry, R. A., Zeller, K., Dang, C. V., and Ji, H. (2013). ChIPPED enhances the analysis of ChIPseq and ChIPchip data. Bioinformatics.
 Yuan et al. (2009) Yuan, M., Joseph, V. R., and Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4):1738–1757.
 Zhao et al. (2009) Zhao, P., Rocha, G., and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497.