BOLT-SSI: A Statistical Approach to Screening Interaction Effects for Ultra-High Dimensional Data

02/10/2019 ∙ by Min Zhou, et al. ∙ Victoria University of Wellington, Xi'an Jiaotong University, Duke-NUS Medical School, Hong Kong Baptist University, The Hong Kong University of Science and Technology

Detecting interaction effects is a crucial step in various applications. In this paper, we first propose a simple method for sure screening of interactions (SSI). SSI works well for problems of moderate dimensionality, without any heredity assumption. For ultra-high dimensional problems, we propose a fast algorithm named "BOLT-SSI". It is motivated by the fact that, when all variables are discrete, the interaction effects on a response variable can be evaluated exactly from the contingency table, whose cell counts can be collected efficiently via Boolean representation and Boolean operations. To generalize this idea, we propose a discretization step so that BOLT-SSI is also applicable for interaction detection among continuous variables. Statistical theory has been established for SSI and BOLT-SSI, guaranteeing their sure screening property. Experimental results demonstrate that SSI and BOLT-SSI often outperform their competitors in terms of computational efficiency and statistical accuracy, especially for data with more than 300,000 predictors. Based on these results, we believe there is a great need to rethink the relationship between statistical accuracy and computational efficiency. The computational performance of a statistical method can often be greatly improved by exploiting the advantages of the computational architecture, while the accompanying loss of statistical accuracy can be tolerated.

1 Introduction

The past two decades have been a golden age for the development of statistical science on high dimensional problems. A large number of innovative algorithms have been proposed to address the computational challenges of statistical inference in high dimensions. Despite these fruitful achievements, there still exists a gap between the established statistical theory and the computational performance of the developed algorithms. On one hand, many statistical models can deal with high dimensional problems under mild theoretical conditions, but their computational cost can become unaffordable when the dimensionality is extremely large. On the other hand, to address many real problems, more and more algorithms are developed in an unprincipled way, leading to computational results without statistical guarantees. As argued by Chandrasekaran and Jordan (2013), there is a great need to rethink the relationship between statistical accuracy and computational efficiency.

To bridge this gap, most of the statistical literature focuses on reducing the theoretical complexity of an algorithm, or simply on using parallel computing to speed it up, while paying insufficient attention to exploiting the computational architecture. In fact, the computational performance of statistical models can often be greatly improved by designing new data structures or by using hardware acceleration (e.g., graphics processing units for training deep neural networks). In this paper, we use the interaction detection problem in high dimensional models as an example to demonstrate that it is possible to design statistically guaranteed algorithms that overcome seemingly unaffordable computational costs by taking advantage of the computational architecture.

1.1 Related work for interaction effect detection

The word "interaction" is defined in the Oxford English Dictionary as reciprocal action, the action or influence of persons or things on each other. It is a relationship among two or more objects that have mutual influence upon one another. There is a long history of investigating interaction effects in many scientific fields. For example, in physical chemistry, the main topics are interactions between atoms and molecules. A simple real-world example is that neither carbon nor iron alone has much effect on strength, but their combination, steel, has a substantial effect. In medicine and pharmacology, interaction effects among multiple drugs have been widely observed [Lees et al. (2004)]. In genomics, gene-gene interactions and gene-environment interactions have been studied by biomedical researchers since Bateson (1909). In recent years, increasing interest has focused on detecting gene-gene interactions from genome-wide association studies (GWAS) [Cordell (2009)].

In this paper, we investigate interaction effects from a statistical perspective, where an interaction effect is characterized by the statistical departure from the additive effects of two or more factors [see Fisher (1918), Cox (1984)]. In the framework of high dimensional regression, it is common to use products of explanatory variables to study their interaction effects on response variables. Consider three explanatory variables $X_1$, $X_2$ and $X_3$; their two-way interaction terms are $X_1X_2$, $X_1X_3$ and $X_2X_3$. By including these interaction terms, the standard linear regression model becomes

$$Y = \beta_0 + \sum_{j=1}^{p}\beta_j X_j + \sum_{1\le k<l\le p}\beta_{kl}X_kX_l + \varepsilon, \qquad (1)$$

where $Y$ is the response variable, $\beta_0$ is the intercept term, $\beta_j$ is the coefficient of the main effect term $X_j$, $\beta_{kl}$ is the coefficient of the interaction term $X_kX_l$, and $\varepsilon$ is the independent error. For high dimensional data, the number of variables $p$ can be much larger than the sample size $n$. Clearly, the number of parameters to be determined is of order $p^2$ if all interaction terms are included. For example, in GWAS there are millions of genotyped genetic variants, i.e., $p \approx 10^6$, so the number of interaction terms goes up to an astronomical number of order $10^{12}$. The computational cost of detecting interaction effects at such a scale becomes seemingly unaffordable, although statistical theory can still be applied under mild conditions (e.g., sparsity assumptions).
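
To see the scale concretely, the parameter count of model (1) for $p = 10^6$ is simple arithmetic:

$$1 + p + \binom{p}{2} = 1 + p + \frac{p(p-1)}{2} \approx 5 \times 10^{11} \quad \text{for } p = 10^6,$$

so merely storing one 8-byte coefficient per parameter would already require about 4 terabytes, let alone fitting all of them jointly.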

To reduce the computational cost, recently developed methods often impose one of two types of heredity assumptions: the strong heredity assumption means that an interaction effect is important only if both of its parents are significant, while the weak heredity assumption requires that at least one of the parents is included in the model. To name a few, Choi et al. (2010) extended the LASSO method to identify significant interaction terms in linear and generalized linear models under the strong heredity assumption. They proved that their method possesses the oracle property [Fan and Li (2001) and Fan and Peng (2004)], that is, it performs as well as if the true model were known in advance. The algorithm hierNet was developed by Bien et al. (2013) to select interactions; it adds a set of convex constraints to the LASSO in the linear model and constructs a sparse interaction model under the strong or weak heredity assumption. For the linear model, Hao and Zhang (2014) proposed two algorithms, iFORT and iFORM, which identify interaction effects in a greedy fashion under the heredity assumption. Hao et al. (2016) further improved interaction detection by proposing a regularization algorithm under the marginality principle (RAMP). To remove the dependence on heredity assumptions, building upon a strong joint normality assumption between the response variable and the predictor variables, Fan et al. (2016) suggested a flexible sure screening procedure, called interaction pursuit (IP), for ultra-high dimensional linear interaction models. The idea of IP is first to select the "active interaction variables" by screening for predictor variables whose squares have strong Pearson correlation with the squared response, and then to detect the interaction effects among those identified active interaction variables. Kong et al. (2016) extended IP to the ultra-high dimensional linear interaction model with multiple responses, identifying the active interaction variables by the distance correlation between each predictor and the multiple response $\mathbf{Y}$, a $q$-dimensional vector of responses.

However, the heredity assumption may not hold in practice due to the existence of pure interaction effects. In human genetics, a number of gene-gene interaction effects have been detected in the absence of their main effects [Cordell (2009) and Wan et al. (2010)]. This motivates new methods that detect interactions without the heredity assumption. Recently, a new algorithm based on random projections was introduced by Thanei et al. (2016) to screen interaction effects. This algorithm does not rely on the heredity assumption, so it can detect interaction effects in the absence of the corresponding main effects. Based on our empirical observations, however, its real data performance is not entirely satisfactory, because the accuracy of interaction detection depends heavily on the number of random projections. Computationally efficient algorithms with statistically guaranteed performance for interaction detection are thus still lacking.

1.2 Our Contribution

Our contribution is to develop a computationally efficient and statistically guaranteed method for interaction detection in high dimensional problems:

  • We propose a new sure screening procedure (SSI) based on the increment of the log-likelihood function to fully detect significant interactions in high dimensional generalized linear models. Furthermore, to reduce the computational burden, we take advantage of the computer architecture, such as parallel techniques and Boolean operations, to construct a more computationally efficient algorithm, BOLT-SSI, which makes interaction detection feasible for large-scale data sets. For example, for the Northern Finland Birth Cohort (NFBC) data set, with thousands of individuals and more than 300,000 SNPs, the number of candidate interactions is on the order of $5 \times 10^{10}$, and BOLT-SSI can screen all of these interactions within a short time. Details are given in Section 6.

  • Moreover, we investigate the sure screening properties of SSI and BOLT-SSI from a theoretical perspective, and show that our computationally efficient methods are statistically guaranteed. We provide implementations of both the core SSI algorithm and its extension BOLT-SSI in the R package BOLT-SSI, available at
    https://github.com/daviddaigithub/BOLTSSIRR.

  • More importantly, our work is an attempt to integrate the advantages of well-designed computer architecture with statistically rigorous methodology. We take it as an example to promote the use of computational structure in statistical modelling and practice, especially in the era of "Big Data". We hope this example motivates further combinations of statistical methods and computational techniques, greatly improving the computational performance of statistical methods.

The rest of this paper is organized as follows. In Sections 2 and 3, we propose the sure screening algorithms SSI and BOLT-SSI for detecting interactions in ultra-high dimensional generalized linear regression models, and briefly introduce Boolean representation and operations. The theoretical sure screening properties of the proposed methods are investigated in Section 4. In Section 5, we examine the finite sample performance of SSI and BOLT-SSI in comparison with alternative methods, RAMP, the xyz algorithm, and IP, through simulation studies. In Section 6, three real data sets are used to demonstrate the utility of our approaches. Our findings and conclusions are summarized in Section 7. The detailed proofs are relegated to the Appendix.

2 Sure Screening Methods for Interaction in GLM

2.1 Generalized linear models (GLM) with two-way interactions

Assume that, given the predictor vector $\boldsymbol{x} = (x_1, \ldots, x_p)^\top$, the conditional distribution of the random variable $Y$ belongs to an exponential family whose probability density function has the canonical form

$$f(y; \theta) = \exp\{y\theta - b(\theta) + c(y)\}, \qquad (2)$$

where $b(\cdot)$ and $c(\cdot)$ are some known functions and $\theta$ is the canonical natural parameter. Here we ignore the dispersion parameter in (2), since we only concentrate on the estimation of the mean regression function. It is well known that the exponential family includes the Binomial, Gaussian, Gamma, Inverse-Gaussian and Poisson distributions.

We consider the following generalized linear model with two-way interactions:

$$g(\mu) = \beta_0 + \sum_{j=1}^{p}\beta_j x_j + \sum_{1\le k<l\le p}\beta_{kl}x_k x_l, \qquad (3)$$

for the canonical link function $g = (b')^{-1}$ with $\mu = E(Y \mid \boldsymbol{x})$. Let $\boldsymbol{X} = (\boldsymbol{x}^\top, \boldsymbol{z}^\top)^\top$, where $\boldsymbol{x} = (x_1, \ldots, x_p)^\top$ collects the main effect variables and $\boldsymbol{z} = (x_1x_2, \ldots, x_{p-1}x_p)^\top$ collects the two-way interaction variables. For simplicity, we assume that $x_0 \equiv 1$ corresponds to the intercept and each other predictor variable is standardized with mean 0 and variance 1. The corresponding sets of coefficients are $\boldsymbol{\beta}_M = (\beta_1, \ldots, \beta_p)^\top$ for the main effects and $\boldsymbol{\beta}_I = (\beta_{kl})_{1 \le k < l \le p}$ for the interactions.

In the ultra-high dimensional regression model, we usually assume that the underlying model has a sparse structure, meaning that only a few predictor variables or features are significantly correlated with the response $Y$. Hence, for the above model with two-way interactions, we assume that only a small number of interactions contribute to the response $Y$. Denote the true parameter by $\boldsymbol{\beta}^\star = (\boldsymbol{\beta}_M^{\star\top}, \boldsymbol{\beta}_I^{\star\top})^\top$, where $\boldsymbol{\beta}_M^\star = (\beta_1^\star, \ldots, \beta_p^\star)^\top$ for the main effects, and $\boldsymbol{\beta}_I^\star = (\beta_{kl}^\star)$ with $1 \le k < l \le p$ for the interactions. Let

$$\mathcal{M}^\star = \{(k, l): \beta_{kl}^\star \neq 0, \ 1 \le k < l \le p\}$$

and denote $s = |\mathcal{M}^\star|$; then the non-sparsity size $s$ is a relatively small number compared to the dimension of the model.

2.2 SSI for two-way interaction in GLM

The model (3) can be simply rewritten in the form of an ordinary generalized linear regression model,

$$g(\mu) = \boldsymbol{X}^\top \boldsymbol{\beta}^\star, \qquad (4)$$

where $\boldsymbol{X}$ stacks the main effect and interaction variables as defined above. Fan et al. (2009) suggested selecting important variables by sorting the marginal likelihoods, and Fan and Song (2010) pointed out that this technique can be viewed as marginal likelihood ratio screening, which builds on the difference between two marginal log-likelihood functions. If we regarded an interaction variable $x_kx_l$ in the same way as the main effect variables in $\boldsymbol{x}$, then by considering the marginal likelihood of $x_kx_l$ we could directly apply the sure screening techniques of Fan et al. (2009) and Fan and Song (2010) to detect significant interaction effects. But such a direct screening method ignores the main effects of $x_k$ and $x_l$; as argued by Jaccard et al. (1990), it often produces false discoveries when screening for significant pure interaction effects. Hence we consider the following sure screening procedure to detect pure interaction effects in the model (3).

Suppose the random samples $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$ are i.i.d. from the model (3) with the canonical link. For each pair $(k, l)$ with $1 \le k < l \le p$, let $\boldsymbol{z}_{kl} = (1, x_k, x_l, x_kx_l)^\top$ and $\boldsymbol{z}_{kl}^{-} = (1, x_k, x_l)^\top$, and express their coefficients as $\boldsymbol{\gamma}_{kl} = (\gamma_0, \gamma_k, \gamma_l, \gamma_{kl})^\top$ and $\boldsymbol{\gamma}_{kl}^{-} = (\gamma_0, \gamma_k, \gamma_l)^\top$, respectively.

The first step of the Sure Screening procedure for Interaction effects (SSI) is to compute the maximum marginal likelihood estimator as the minimizer of the marginal regression

$$\hat{\boldsymbol{\gamma}}_{kl} = \arg\min_{\boldsymbol{\gamma}} \ \mathbb{P}_n\,\ell(\boldsymbol{\gamma}^\top\boldsymbol{z}_{kl}, Y),$$

where $\ell(\theta, y) = b(\theta) - y\theta$ is the negative log-likelihood kernel and $\mathbb{P}_n$ is the empirical measure, $\mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(\boldsymbol{x}_i, y_i)$. Similarly, we can compute the maximum marginal likelihood estimator without the interaction effect as the minimizer of the marginal regression

$$\hat{\boldsymbol{\gamma}}_{kl}^{-} = \arg\min_{\boldsymbol{\gamma}^{-}} \ \mathbb{P}_n\,\ell(\boldsymbol{\gamma}^{-\top}\boldsymbol{z}_{kl}^{-}, Y).$$

Correspondingly, the population versions of the above minimizers of the marginal regressions are

$$\bar{\boldsymbol{\gamma}}_{kl} = \arg\min_{\boldsymbol{\gamma}} \ E\,\ell(\boldsymbol{\gamma}^\top\boldsymbol{z}_{kl}, Y) \quad \text{and} \quad \bar{\boldsymbol{\gamma}}_{kl}^{-} = \arg\min_{\boldsymbol{\gamma}^{-}} \ E\,\ell(\boldsymbol{\gamma}^{-\top}\boldsymbol{z}_{kl}^{-}, Y).$$

In fact, the population coefficient $\bar{\gamma}_{kl}$ measures the importance of the interaction term from a population perspective. Though the true joint regression parameter $\beta_{kl}^\star$ is in general not equal to the marginal regression coefficient $\bar{\gamma}_{kl}$, we can still expect that, under mild conditions, $|\bar{\gamma}_{kl}|$, or equivalently the population increment of the marginal log-likelihood,

$$L_{kl}^\star = \min_{\boldsymbol{\gamma}^{-}} E\,\ell(\boldsymbol{\gamma}^{-\top}\boldsymbol{z}_{kl}^{-}, Y) - \min_{\boldsymbol{\gamma}} E\,\ell(\boldsymbol{\gamma}^\top\boldsymbol{z}_{kl}, Y),$$

is large if and only if $|\beta_{kl}^\star|$ is somewhat large.

Hence the second step of the SSI procedure is to calculate the increment of the empirical maximum marginal likelihood,

$$L_{kl} = \mathbb{P}_n\,\ell(\hat{\boldsymbol{\gamma}}_{kl}^{-\top}\boldsymbol{z}_{kl}^{-}, Y) - \mathbb{P}_n\,\ell(\hat{\boldsymbol{\gamma}}_{kl}^{\top}\boldsymbol{z}_{kl}, Y),$$

for all $1 \le k < l \le p$. Then $L_{kl}$ measures the strength of the interaction $x_kx_l$ in the marginal model from the empirical perspective: the larger $L_{kl}$, similar to $L_{kl}^\star$, the more the interaction contributes to the response $Y$.

The final step of the SSI procedure is to sort the values $\{L_{kl}\}$ in descending order and, given a threshold value $\tau_n$, select the interaction effect variables

$$\widehat{\mathcal{M}} = \{(k, l): L_{kl} \ge \tau_n, \ 1 \le k < l \le p\}$$

as the final candidates for the significant pure interaction effects.

The computational complexity of the proposed SSI procedure is of order $O(np^2)$, since each of the $O(p^2)$ pairs requires fitting a marginal model with a fixed number of parameters over $n$ samples. When $p$ is of moderate size (e.g., in the thousands), SSI can quickly screen all interaction terms. It can be further accelerated by parallel computing, because all the interaction terms can be evaluated independently.
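
To make the procedure concrete, the following is a minimal Python sketch of the SSI screening loop for a binary response (our own illustration using statsmodels, not the authors' optimized implementation; the function names are ours):

```python
import itertools
import numpy as np
import statsmodels.api as sm

def ssi_screen(X, y, top_d=10):
    """Rank all pairs (k, l) by the empirical marginal likelihood increment L_kl.

    X: (n, p) array of standardized predictors; y: (n,) binary response.
    For each pair, fit the marginal logistic model with and without the
    product term and record the gain in maximized log-likelihood.
    """
    n, p = X.shape
    scores = {}
    for k, l in itertools.combinations(range(p), 2):
        base = np.column_stack([np.ones(n), X[:, k], X[:, l]])
        full = np.column_stack([base, X[:, k] * X[:, l]])
        ll_base = sm.GLM(y, base, family=sm.families.Binomial()).fit().llf
        ll_full = sm.GLM(y, full, family=sm.families.Binomial()).fit().llf
        scores[(k, l)] = (ll_full - ll_base) / n  # empirical increment L_kl
    # keep the top_d pairs with the largest increments
    return sorted(scores, key=scores.get, reverse=True)[:top_d]
```

Since each pair is fitted independently, the loop parallelizes trivially, which is exactly the acceleration noted above.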

3 BOLT-SSI

Despite the simplicity of SSI, it cannot scale up to handle cases where the dimensionality is very large, e.g., $p$ in the hundreds of thousands. In this paper, we present a computationally efficient algorithm named "BOLT-SSI" to detect interactions in ultra-high dimensional problems. The BOLT-SSI algorithm is motivated by the following fact: when $Y$, $x_k$ and $x_l$ are all discrete variables, the interaction effect of $x_k$ and $x_l$ on $Y$ measured by logistic regression can be exactly calculated based on a few counts in the contingency table of $Y$, $x_k$ and $x_l$. These counts can be efficiently obtained by designing a new data structure and its associated operations, i.e., Boolean representation and Boolean operations. To handle continuous variables, we propose to discretize them first and then make use of the above strategy for screening. In this section, we describe the details of the BOLT-SSI algorithm; statistical theory guaranteeing its performance is established in the next section.

3.1 Equivalence between the logistic models and log-linear models

When all predictors and the response are categorical variables, we usually fit the data with a logistic model (for a binary response) or baseline-category logit models (for a response with several categories). In fact, logistic regression models and baseline-category logit models have corresponding log-linear regression models for the contingency table when the predictors and the response are categorical [see Agresti (2002)]. Based on this equivalence, the significance of interaction effects can be measured by the likelihood increment of the corresponding log-linear regression models.

Consider the following two logistic models, the logistic regression model with main effects only and the full logistic regression model: for a predictor $X$ with levels $i = 1, \ldots, I$ and a predictor $Z$ with levels $j = 1, \ldots, J$,

$$\log\frac{P(Y=1 \mid X=i, Z=j)}{P(Y=0 \mid X=i, Z=j)} = \alpha + \beta_i^{X} + \beta_j^{Z} \qquad (5)$$

and

$$\log\frac{P(Y=1 \mid X=i, Z=j)}{P(Y=0 \mid X=i, Z=j)} = \alpha + \beta_i^{X} + \beta_j^{Z} + \beta_{ij}^{XZ}. \qquad (6)$$

Let $\ell_5$ and $\ell_6$ be the sample versions of the negative maximum log-likelihood of the logistic regression model (5) with main effects and the full logistic regression model (6), respectively. The increment of the log-likelihood function is defined as $L = \ell_5 - \ell_6$. The corresponding log-linear regression models for the expected cell counts $\mu_{ijk}$ of the three-way table of $(X, Z, Y)$ can be expressed as

$$\log \mu_{ijk} = \lambda + \lambda_i^{X} + \lambda_j^{Z} + \lambda_k^{Y} + \lambda_{ij}^{XZ} + \lambda_{ik}^{XY} + \lambda_{jk}^{ZY} \qquad (7)$$

and

$$\log \mu_{ijk} = \lambda + \lambda_i^{X} + \lambda_j^{Z} + \lambda_k^{Y} + \lambda_{ij}^{XZ} + \lambda_{ik}^{XY} + \lambda_{jk}^{ZY} + \lambda_{ijk}^{XZY}. \qquad (8)$$

Let $\ell_7$ and $\ell_8$ be the sample versions of the negative maximum log-likelihood of the homogeneous association model (7) and the saturated model (8), respectively, and let $\tilde{L} = \ell_7 - \ell_8$ be the corresponding increment of the log-likelihood function. Thus, we can take advantage of $\tilde{L}$ to screen the interaction terms instead of using $L$.

Now we want to obtain the difference $\tilde{L} = \ell_7 - \ell_8$. Suppose that we have a three-way ($I \times J \times K$) table with cell counts $n_{ijk}$ of the random variables $X$, $Z$ and $Y$. The kernel of the log-likelihood function for this contingency table is

$$\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} n_{ijk}\log \pi_{ijk},$$

where $\pi_{ijk}$ is the cell probability. Denote by $\pi_{ij+}$ the marginal probability of $\{X = i, Z = j\}$ and by $n_{ij+}$ the number of samples with $X = i$ and $Z = j$; by $\pi_{i+k}$ the marginal probability of $\{X = i, Y = k\}$ and by $n_{i+k}$ the corresponding count. Similarly, define $\pi_{+jk}$ and $n_{+jk}$, the one-way margins $\pi_{i++}$, $\pi_{+j+}$, $\pi_{++k}$, and the corresponding counts $n_{i++}$, $n_{+j+}$, $n_{++k}$.

For the saturated model (8), we know that the maximum likelihood estimate matches the observed table, and we directly get the estimate $\hat{\pi}_{ijk} = n_{ijk}/n$. For the homogeneous association model (7), the iterative proportional fitting (IPF) algorithm, introduced by Deming and Stephan (1940), is recommended for calculating the estimate of $\pi_{ijk}$. Three steps are included in each cycle of the IPF algorithm:

$$\pi_{ijk}^{(t+1)} = \pi_{ijk}^{(t)}\,\frac{n_{ij+}/n}{\pi_{ij+}^{(t)}}, \qquad \pi_{ijk}^{(t+2)} = \pi_{ijk}^{(t+1)}\,\frac{n_{i+k}/n}{\pi_{i+k}^{(t+1)}}, \qquad \pi_{ijk}^{(t+3)} = \pi_{ijk}^{(t+2)}\,\frac{n_{+jk}/n}{\pi_{+jk}^{(t+2)}},$$

where $\pi_{ij+}^{(t)} = \sum_k \pi_{ijk}^{(t)}$ and similarly for the other margins. The cycle is repeated until the process converges; the convergence property has been proved by Fienberg (1970) and Haberman (1974). We obtain the counts using the Boolean representation, so the contingency table for $X$ and $Z$ given $Y$ can be constructed quickly. Finally, the estimate $\hat{\pi}_{ijk}$ under model (7) is obtained.
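
For concreteness, here is a minimal numpy sketch of the IPF fit and the resulting likelihood increment for a single three-way count table (a textbook version of the algorithm under the formulas above, not the package's optimized routine; it assumes strictly positive margins):

```python
import numpy as np

def ipf_homogeneous(counts, tol=1e-8, max_iter=200):
    """Cell probabilities under the homogeneous association model (7),
    fitted by cycling over the three two-way margins of an I x J x K table."""
    n = counts.sum()
    target_xz = counts.sum(axis=2) / n   # (X, Z) margin
    target_xy = counts.sum(axis=1) / n   # (X, Y) margin
    target_zy = counts.sum(axis=0) / n   # (Z, Y) margin
    pi = np.full(counts.shape, 1.0 / counts.size)  # uniform start
    for _ in range(max_iter):
        old = pi.copy()
        pi *= (target_xz / pi.sum(axis=2))[:, :, None]
        pi *= (target_xy / pi.sum(axis=1))[:, None, :]
        pi *= (target_zy / pi.sum(axis=0))[None, :, :]
        if np.abs(pi - old).max() < tol:
            break
    return pi

def loglik_increment(counts):
    """Increment = (saturated maximized log-likelihood) - (model (7) value),
    per observation; 2 * n * increment is the usual G^2 statistic."""
    n = counts.sum()
    pi_sat = counts / n                 # saturated-model estimate
    pi_ha = ipf_homogeneous(counts)     # homogeneous association estimate
    mask = counts > 0
    return (pi_sat[mask] * (np.log(pi_sat[mask]) - np.log(pi_ha[mask]))).sum()
```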

Consequently, when the predictors and the response are qualitative, we can take advantage of this equivalence to efficiently estimate the corresponding increment of the log-likelihood function by the IPF algorithm. If some variables are continuous, we can discretize them; the details are given in the next section. In Section 4, we show that our algorithm remains statistically guaranteed after discretization.

3.2 Discretization

When some of the predictors and/or the response are continuous, we suggest discretizing them. Binning by equal width or equal frequency is the simplest method. Considering the variation of random observations, it is more reasonable to use the equal-frequency method, which splits the domain of a variable into several intervals by quantiles. The number of intervals is called the "arity" in the discretization context [see Liu et al. (2002)]. Assume that the arity is denoted by $K$; then $K - 1$ is the maximum number of cut-points of a continuous feature.
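
As a simple illustration, equal-frequency discretization with arity $K$ can be implemented as follows (a minimal sketch with numpy; the helper name is ours):

```python
import numpy as np

def discretize(x, K=3):
    """Split a continuous variable into K equal-frequency categories,
    using the 1/K, 2/K, ..., (K-1)/K sample quantiles as cut-points."""
    cuts = np.quantile(x, np.arange(1, K) / K)
    return np.searchsorted(cuts, x)  # category labels 0, 1, ..., K-1
```

For example, with $K = 3$, a sample of 1,000 observations is split into three categories of roughly 333 observations each.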

In more detail, we follow the assumptions of Fan and Song (2010) and consider variable or feature selection in the generalized linear model

$$E(Y \mid \boldsymbol{x}) = b'\Big(\sum_{j=1}^{p}\beta_j x_j\Big), \qquad (9)$$

where $\boldsymbol{x} = (x_1, \ldots, x_p)^\top$ is a random vector, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^\top$ is the parameter vector, $Y$ is the response, $g = (b')^{-1}$ is the canonical link function, and $\mathcal{M}^\star = \{j: \beta_j \neq 0\}$ is assumed to be the set of indexes of the nonzero parameters. Define the marginal log-likelihood increment

$$L_j = \min_{\beta_0} E\,\ell(\beta_0, Y) - \min_{\beta_0, \beta_j} E\,\ell(\beta_0 + \beta_j x_j, Y),$$

where $\ell(\theta, y) = b(\theta) - y\theta$, so that $L_j$ measures the marginal contribution of $x_j$ to the response. Let $(\boldsymbol{x}_i, y_i)$, $i = 1, \ldots, n$, be independent copies of $(\boldsymbol{x}, Y)$.
Assume that $S_j$ and $S_Y$ are the support sets of the variable $x_j$ and the response $Y$, respectively. Denote by $\{I_1, \ldots, I_{K_1}\}$ and $\{J_1, \ldots, J_{K_2}\}$ partitions of their supports, which means that

$$S_j = \bigcup_{m=1}^{K_1} I_m \quad \text{with } I_m \cap I_{m'} = \emptyset \text{ for } m \neq m',$$

and

$$S_Y = \bigcup_{c=1}^{K_2} J_c \quad \text{with } J_c \cap J_{c'} = \emptyset \text{ for } c \neq c',$$

where $K_1$ and $K_2$ are two positive constants. Here, the $K_1$-quantiles and the $K_2$-quantiles are taken as the break points for the partitions of the variables $x_j$ and $Y$. Define

$$\tilde{x}_j = \sum_{m=1}^{K_1} m\, I(x_j \in I_m), \qquad \tilde{Y} = \sum_{c=1}^{K_2} c\, I(Y \in J_c),$$

so that the variables $x_j$ and $Y$ are discretized into two categorical variables $\tilde{x}_j$ and $\tilde{Y}$, respectively. Furthermore, denote the dummy variables $D_{jm} = I(x_j \in I_m)$, $m = 1, \ldots, K_1$, and $D_{Yc} = I(Y \in J_c)$, $c = 1, \ldots, K_2$, where $I(\cdot)$ is the indicator function. After discretization, the new increment of the log-likelihood function, $\tilde{L}_j$, is defined analogously to $L_j$, with $(x_j, Y)$ replaced by the categorical pair $(\tilde{x}_j, \tilde{Y})$.

Now consider the discretization for the marginal model with the interaction effect. Assume that $S_k$, $S_l$ and $S_Y$ are the support sets of the variables $x_k$, $x_l$ and $Y$, respectively. Denote by $\{I_1, \ldots, I_{K_1}\}$, $\{J_1, \ldots, J_{K_2}\}$ and $\{R_1, \ldots, R_{K_3}\}$ partitions of their supports, which means that

$$S_k = \bigcup_{a=1}^{K_1} I_a, \qquad S_l = \bigcup_{b=1}^{K_2} J_b, \qquad S_Y = \bigcup_{c=1}^{K_3} R_c,$$

with the members of each partition pairwise disjoint, where $K_1$, $K_2$ and $K_3$ are positive constants. Here, we still take the $K_1$-quantiles, $K_2$-quantiles and $K_3$-quantiles as the break points for the partitions of the variables $x_k$, $x_l$ and $Y$, respectively. Define

$$\tilde{x}_k = \sum_{a=1}^{K_1} a\, I(x_k \in I_a), \qquad \tilde{x}_l = \sum_{b=1}^{K_2} b\, I(x_l \in J_b),$$

and also the discretized response

$$\tilde{Y} = \sum_{c=1}^{K_3} c\, I(Y \in R_c).$$

Hence, we have the new categorical predictors $\tilde{x}_k$ and $\tilde{x}_l$ and the response $\tilde{Y}$, and we also obtain the new interaction variable $\tilde{x}_k\tilde{x}_l$. Furthermore, denote the dummy variables $D_{ka} = I(x_k \in I_a)$, $D_{lb} = I(x_l \in J_b)$ and $D_{Yc} = I(Y \in R_c)$, where $I(\cdot)$ is the indicator function. After discretization, the population version of the new increment of the log-likelihood function, $\tilde{L}_{kl}$, is defined analogously to $L_{kl}^\star$, with $(x_k, x_l, Y)$ replaced by $(\tilde{x}_k, \tilde{x}_l, \tilde{Y})$.

Remark 3.1

In fact, there is a trade-off between the arity and the accuracy of the screening procedure. A higher arity makes the sure screening property hold with higher probability. However, when the sample size is large enough, our theoretical investigation and numerical studies show that a relatively small arity can also guarantee the accuracy of the screening procedure. Hence, although larger arities can be used for different continuous features, we recommend a small arity to balance the computational burden and the efficiency of model estimation for the proposed BOLT-SSI when the sample size is relatively large.

Furthermore, if $Y$ is a continuous response, we similarly suggest using the 2-quantile (median) to split the response, that is,

$$\tilde{Y} = I(Y > m_Y),$$

where $m_Y$ is the median of the response $Y$.

3.3 Boolean Representation and Logical Operations

After discretization, Boolean operations can be used to speed up the SSI procedure, especially the computation of the counts needed for $\tilde{L}_{kl}$. Boolean representation and logical operations are classical and fundamental computing techniques. A single floating-point operation, the basic computing operation of much statistical software, is composed of many Boolean operations at a low level of the machine. Hence, if Boolean operations can be applied directly in the computation of the proposed algorithm, the computing speed can be greatly improved.

Assume that the continuous data set is an $n \times p$ matrix $\mathbf{D}$ with $n$ observations and $p$ predictors, and let $\boldsymbol{y}$ be the response. After discretizing the data set $\mathbf{D}$ and the response $\boldsymbol{y}$, each predictor $\tilde{x}_j$ has $K$ levels and $\tilde{Y}$ has $C$ categories. Here, we take $K = 3$ and $C = 2$ as an example. Since $\tilde{Y}$ takes two values (0 and 1), instead of using one row for each predictor $\tilde{x}_j$, the new representation uses 3 rows, one for each of the 3 levels of $\tilde{x}_j$. Each row consists of two bit strings, one for the samples with $\tilde{Y} = 0$ and the other for those with $\tilde{Y} = 1$, and each bit represents one sample in the string. The bit values (0 and 1) indicate whether the sample belongs to the corresponding level of the predictor. For instance, consider a discretized data set $\mathbf{D}$ with 2 predictors and 16 samples, where the first 8 columns represent samples with $\tilde{Y} = 0$ and the other 8 columns represent samples with $\tilde{Y} = 1$, together with its Boolean representation $\mathbf{B}$. From the Boolean representation $\mathbf{B}$, we can directly read off, for example, that the first sample belongs to the first category of $\tilde{x}_1$ and the third category of $\tilde{x}_2$. We can also quickly obtain the number of observations that belong to any two categories by a logical operation. For example, to count the samples falling in a given category of $\tilde{x}_1$ and a given category of $\tilde{x}_2$ within one response category, we simply conduct the bitwise AND of the two corresponding bit strings and count the number of 1s in the resulting string; if the result is "00000100", the count is 1, consistent with the raw data in $\mathbf{D}$. As a result, it is more efficient to use $\mathbf{B}$ to construct the contingency table for any two discretized predictors, and the fast logical operations on $\mathbf{B}$ accelerate the calculations in our algorithm.
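
The counting scheme described above can be sketched in a few lines of Python (our own illustration; the actual package implements this in compiled code with machine-word operations):

```python
def bit_masks(x, K):
    """One integer bit mask per level of a discretized variable:
    bit i of masks[m] is set iff sample i belongs to level m."""
    masks = [0] * K
    for i, level in enumerate(x):
        masks[level] |= 1 << i
    return masks

def pair_counts(x1, x2, K1, K2):
    """Contingency table n_ab = #{i : x1[i] = a and x2[i] = b},
    obtained by bitwise AND plus a population count."""
    m1, m2 = bit_masks(x1, K1), bit_masks(x2, K2)
    return [[bin(m1[a] & m2[b]).count("1") for b in range(K2)]
            for a in range(K1)]
```

Stratifying the samples by $\tilde{Y}$ (one set of masks per response category) yields the full three-way table. On 64-bit hardware, each AND instruction processes 64 samples at once, which is the source of the speedup.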

Obviously, $\mathbf{D}$ and $\mathbf{B}$ are equivalent and store the same information. Because one byte is composed of 8 bits, $\mathbf{B}$ uses 128 bits to store the data in this example, whereas $\mathbf{D}$ would use 2048 bits, 16 times the space of $\mathbf{B}$, to store the same data on a 64-bit computer system (32 entries stored as 64-bit numbers). As a result, the Boolean representation can dramatically reduce the storage space of the data, so that large data sets can be loaded directly into RAM, or even kept in the cache. With careful programming, the number of much slower data transfers from hard disk to RAM, or from RAM to cache, can then be greatly reduced. This is the other advantage of the Boolean representation and discretization.

3.4 New algorithm “BOLT-SSI”

Now we describe our algorithm BOLT-SSI in detail. For the ultra-high dimensional generalized linear model (3), instead of calculating the increment $L_{kl}$ for each pair $(x_k, x_l)$, we compute the new increment $\tilde{L}_{kl}$ of the log-likelihood function by the IPF method. Then, by applying a threshold value $\tau_n$ or keeping the pairs with the largest values of $\tilde{L}_{kl}$, the selected sure screening set is obtained. Our algorithm BOLT-SSI is summarized as follows:

For any pair of continuous variables $x_k$ and $x_l$, $1 \le k < l \le p$, transform them into the corresponding discretized variables $\tilde{x}_k$ with $K_1$ levels and $\tilde{x}_l$ with $K_2$ levels, and change the response $Y$ to a categorical variable $\tilde{Y}$ if necessary.

Directly calculate the saturated-model estimate $\hat{\pi}_{ijk} = n_{ijk}/n$, use the IPF algorithm to approximately estimate the homogeneous association model (7), and then compute $\tilde{L}_{kl}$ for all pairs $(\tilde{x}_k, \tilde{x}_l)$.

Choose a threshold $\tau_n$ and select the following interactions:

$$\widehat{\mathcal{M}} = \{(k, l): \tilde{L}_{kl} \ge \tau_n, \ 1 \le k < l \le p\}.$$

Usually, we keep the pairs with the $d$ largest values of $\tilde{L}_{kl}$, where $d$ is a pre-specified size such as $d = O(n/\log n)$.

Sometimes the dimension $p$ is very large, and the number of candidate pairs may be of the order of tens of millions or more; the IPF method may then be too time-consuming for computing all $\tilde{L}_{kl}$. Here, we propose to use an approximation tool to prune interaction terms in the second step. For the homogeneous association model (7), the Kirkwood superposition approximation (KSA), first proposed by Kirkwood (1935), is utilized to provide an estimator of $\pi_{ijk}$ in (7). That is,

$$\hat{\pi}_{ijk}^{\mathrm{KSA}} = \frac{1}{c}\,\frac{\hat{\pi}_{ij+}\hat{\pi}_{i+k}\hat{\pi}_{+jk}}{\hat{\pi}_{i++}\hat{\pi}_{+j+}\hat{\pi}_{++k}},$$

where $c$ is a normalization term ensuring that the estimates sum to one. We then obtain the corresponding approximation $\tilde{L}_{kl}^{\mathrm{KSA}}$ of $\tilde{L}_{kl}$. Wan et al. (2010) show that $\tilde{L}_{kl}^{\mathrm{KSA}}$ is an upper bound of $\tilde{L}_{kl}$, i.e.,

$$\tilde{L}_{kl} \le \tilde{L}_{kl}^{\mathrm{KSA}}.$$

Based on this bound, by setting a threshold $\tau_n^{\mathrm{KSA}}$ in the second step, we can quickly filter out many insignificant interaction terms and thus reduce the size of the pool of all interaction effects. The threshold can be defined by the conservative Bonferroni correction or specified by the user. Obviously, if $\tau_n^{\mathrm{KSA}} = 0$, no interaction term is deleted in this step. In the final step, for the remaining interaction terms, we compute their exact $\tilde{L}_{kl}$ by the IPF algorithm. Then we keep the pairs with the $d$ largest values of $\tilde{L}_{kl}$, or apply the threshold value $\tau_n$ to obtain the sure screening set $\widehat{\mathcal{M}}$. The threshold can be taken as the Bonferroni-corrected percentile of the $\chi^2$ test with the corresponding degrees of freedom for the interaction between $\tilde{x}_k$ and $\tilde{x}_l$.
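
Under the formulas above, the KSA pruning statistic can be sketched as follows (numpy; assumes a strictly positive probability table, and the function name is ours):

```python
import numpy as np

def ksa_increment(counts):
    """Upper bound on the log-likelihood increment, computed from the
    Kirkwood superposition approximation of the homogeneous association
    model; a pair is pruned when 2 * n * ksa_increment(counts) falls
    below the chosen chi-square threshold."""
    n = counts.sum()
    pi = counts / n
    # two-way margins, kept broadcastable against the full I x J x K table
    pij = pi.sum(axis=2, keepdims=True)
    pik = pi.sum(axis=1, keepdims=True)
    pjk = pi.sum(axis=0, keepdims=True)
    # one-way margins
    pi_ = pi.sum(axis=(1, 2), keepdims=True)
    pj_ = pi.sum(axis=(0, 2), keepdims=True)
    pk_ = pi.sum(axis=(0, 1), keepdims=True)
    ksa = pij * pik * pjk / (pi_ * pj_ * pk_)
    ksa /= ksa.sum()  # the normalization term 1/c
    return (pi * (np.log(pi) - np.log(ksa))).sum()
```

Only the pairs surviving this cheap screen are passed to the IPF computation of the exact $\tilde{L}_{kl}$.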

In summary, our algorithm BOLT-SSI with KSA is listed as follows:

For any pair of continuous variables $x_k$ and $x_l$, $1 \le k < l \le p$, transform them into the corresponding discretized variables