1 Introduction
There is broad interest in incorporating interactions in linear regression. Extensions of linear regression to accommodate pairwise interactions are commonly referred to as quadratic regression. In moderate to highdimensional settings, it becomes very challenging to implement quadratic regression since the number of parameters to be estimated is
. Hence, classical methods such as least squares cannot be used and even common penalization and Bayesian methods can encounter computational hurdles. Reliable inferences on main effects and interactions is even more challenging when certain predictors are moderately to highly correlated.A lot of effort has been focused on estimating pairwise interactions in moderate highdimensional and ultra highdimensional problems. We refer to the former when the number of covariates is between and and to the latter when . When , the number of parameters to be estimated is greater than . When , onestage regularization methods like [Bien et al., 2013] and [Haris et al., 2016] can be successful. Some of these methods require a socalled heredity assumption [Chipman, 1996] to reduce dimensionality. Strong heredity means that the interaction between two variables is included in the model only if both main effects are. For weak heredity it suffices to have one main effect in the model to estimate the interaction of the corresponding variables. Heredity reduces the number of models from to or for strong or weak heredity, respectively [Chipman, 1996]. For ultra highdimensional problems, two stageapproaches have been developed, see [Hao et al., 2018] and [Wang et al., 2019]. However, these methods do not report uncertainties in model selection and parameter estimation, and rely on strong sparsity assumptions.
We are particularly motivated by studies of environmental health collecting data on mixtures of chemical exposures. These exposures can be moderately highdimensional with high correlations within blocks of variables; for example, this can arise when an individual is exposed to a product having a mixture of chemicals and when chemical measurements consist of metabolites or breakdown products of a parent compound. There is a large public heath interest in studying EE, EG and GG interactions, with E = environmental exposures and G = genetic factors. However, current methods for quadratic regression are not ideal in these applications due to the level of correlation in the predictors, the fact that strong sparsity assumptions are not appropriate, and the need for uncertainty quantification. Regarding the issue of sparsity, it is appealing given the data structure to select blocks of correlated exposures together instead of arbitrarily selecting one chemical in a group.
To address these problems, one possibility is to use a Bayesian approach to inference in order to include prior information to reduce dimensionality while characterizing uncertainty through the posterior distribution. There is an immense literature on Bayesian methods for highdimensional linear regression, including recent algorithms that can scale up to thousands of predictors [Bondell and Reich, 2012], [Rossell and Telesca, 2017], [Johndrow et al., 2017], [Nishimura and Suchard, 2018]. In addition some articles have explicitly focused on quadratic regression and interaction detection [Zhang and Liu, 2007], [Cordell, 2009], [Mackay, 2014]. Bayes variable selection and shrinkage approaches will tend to have problems when predictors are highly correlated; this has motivated a literature on Bayesian latent factor regression, [Lucas et al., 2006], [Carvalho et al., 2008].
Bayesian latent factor regression incorporates shared latent variables in the predictor and response components. This provides dimensionality reduction in modeling of the covariance structure in the predictors and characterizing the impact of correlated groups of predictors in the response. Such approaches are closely related to principal components regression, but it tends to be easier to simultaneously incorporate shrinkage and uncertainty quantification within the Bayesian framework. In addition, within the Bayes latent factor regression paradigm, typical identifiability constraints such as orthogonality are not needed (see, for example [Bhattacharya and Dunson, 2011]). The main contribution of this article is to generalize Bayesian latent factor regression to accommodate interactions using an approach inspired by [Wang et al., 2019]. This is accomplished by including pairwise interactions in the latent variables in the response component. We refer to the resulting framework as Factor analysis for INteractions (FIN). There is rich literature on quadratic and nonlinear latent variable modeling, largely in psychometrics (refer, for example, to [Arminger and Muthén, 1998]) However, to our knowledge, such approaches have not been used for inferences on interactions in regression problems.
In Section 2 we describe the proposed FIN framework, including extensions for higher order interactions and multivariate outcomes. In Section 3 we provide theory on consistency and model misspecification. Section 4 contains a simulation study. Section 5 illustrates the methods on NHANES data. Code is available at https://github.com/fedfer/factor_interactions.
2 Model
2.1 Model and Properties
Let denote a continuous health response for individual , and
denote a vector of exposure measurements. We propose a latent factor joint model, which includes shared factors in both the predictor and response components while assuming conditional independence. We include interactions among latent variables in the response component. We also assume that, given the latent variables, the explanatory variables and the response are continuous and normally distributed. We assume that the data have been normalized prior to the analysis so that we omit the intercept. The model is as follows:
(1)  
where . In a Bayesian fashion, we assume a prior for the parameters that will be specified in Section 2.2. Model is equivalent to classical latent factor regression models; refer, for example, to [West, 2003], except for the term. Here, is a symmetric matrix inducing a quadratic latent variable regression that characterizes interactions among the latent variables.
The above formulation can be shown to induce a quadratic regression of on . To build intuition consider the case in which as done in West [2003] for the special case in which . The manytoone map has multiple generalized inverses such that . If we substitute in the regression equation, we obtain
The following proposition gives a similar result in the non deterministic case:
Proposition 1.
Under model (1), the following are true:

,

,
where and .
This shows that the induced regression of on from model (1) is indeed a quadratic regression. Let us define the induced main effects as and the matrix containing the first order interactions as . Notice that we could define as a diagonal matrix and we would still estimate pairwise interactions between the regressors, further details are given in Sections 2.3 and 2.5.
In epidemiology studies, it is of interest to include interactions between chemical exposures and demographic covariates. The covariates are often binary variables, like
race or sex, or continuous variables that are nonnormally distributed, like age. Hence, we do not want to assume a latent normal structure for the covariates. Letting be a vector of covariates, we modify model to include a main effect for and an interaction term between and the latent factor :(2)  
where is a matrix of interaction coefficients between the latent variables and the covariates, and are main effects for the covariates. Following Proposition 1 we have that
where is a matrix of pairwise interactions between exposures and covariates. In the sequel, we focus our development on model for ease in exposition, but all of the details can be easily modified to pertain to model .
2.2 Priors and MCMC algorithm
In this section we define the priors for , briefly describe the computational challenges given by model
and summarize our Markov Chain Monte Carlo sampler in
Algorithm 1. We choose an InverseGamma distribution with parameters
for and for . The elements of and are given independent Gaussian priors. For , we choose the DirichletLaplace (DL) prior of [Bhattacharya et al., 2015] rowwise, corresponding towhere , , DE refers to the zero mean doubleexponential or Laplace distribution, and is an upper bound on the number of factors, as the prior allows effective deletion of redundant factors through setting all elements of columns of close to zero. The DL prior provides flexible shrinkage on the factor loadings matrix, generalizing the Bayesian Lasso [Park and Casella, 2008] to have a careful chosen hierarchical structure on exposurespecific () and local () scales. This induces a prior with concentration at zero, to strongly shrink small signals, and heavytails, to avoid overshrinking large signals. The DL prior induces near sparsity rowwise in the matrix , as it is reasonable to assume that each variable loads on few factors. In Section 2.4
, we describe how the above prior specification induces an appealing shrinkage prior on the main effects and interactions, and discuss hyperparameter choice.
The inclusion of pairwise interactions among the factors in the regression of the outcome rules out using a data augmentation Gibbs sampler, as in [West, 2003], [Bhattacharya and Dunson, 2011]. The log full conditional distribution for is:
where is a normalizing constant. We update the factors using the MetropolisAdjusted Langevin Algorithm (MALA) [Grenander and Miller, 1994], [Roberts et al., 1996]. Sampling the factors is the main computational bottleneck of our approach since we have to update vectors, each of dimension . The overall MCMC algorithm and the MALA step are summarized in Algorithm 1 in the Appendix.
2.3 Higher order Interactions
FIN can be generalized to allow for higher order interactions. In particular, we can obtain estimates for the interaction coefficients up to the order with the following model:
(3) 
which is a polynomial regression in the latent variables. We do not include interactions between the factors, so that the number of parameters to be estimated is . When , this model is equivalent to being a diagonal matrix. Recall that , where and are defined in Proposition 1. Since we do not include interactions among the factors, let us just focus on the marginal distribution of the factor, i.e where and . Below we provide an expression for
, which can be calculated using noncentral moments of a Normal distribution, see
[Winkelbauer, 2012] for a reference.where , and . We just need to sum up over the index in and we can read out the expressions for the intercept, main effects and interactions up to the order. In particular, we have that the intercept is equal to . When this reduces to , where . The expression for the main effects coefficients on is . When this becomes , hence . Similarly the expression for the interaction between and is and when we have which is equal to .
In general, if we are interested in the order interactions, we can find the expression on the top summation for when
is odd and on the bottom summation for
when is even. Finally notice that with parameters we manage to estimate parameters thanks to the low dimensional factor structure in the covariates.2.4 Induced priors
In this section, we show the behavior of the induced priors on the main effects and pairwise interaction coefficients under model using simulated examples, and we show the induced grouping of coefficients when we have prior information on the covariance structure of . We endow with a normal prior having zero mean and covariance equal to , where is a diagonal matrix. Then, conditional on and , the induced prior on is also Normal with mean and covariance equal to . Recall from Proposition 1 that the induced main effect coefficients are equal to . This expression is equivalent to [West, 2003] and we can similarly characterize the limiting case of , i.e. when the factors explain all of the variability in the matrix of regressors . Let and , together with enforcing to be orthogonal, we have that . It follows that has the generalised singular gprior (or gsgprior) distribution defined by [West, 2003], whose density is proportional to .
Now, consider the extension presented in the previous section, where we include powers of the factors in the regression of . In Figure 1, we show the induced marginal priors for main effects, pairwise interactions and order interactions when and . For a fixed
, there is increasing shrinkage towards zero with higher orders of interaction. However, we avoid assuming exact sparsity corresponding to zero values of the coefficients, a standard assumption of other methods. Although most of the mass is concentrated around zero, the distributions have heavy tails. We can indeed notice that the form of the priors resembles a mixture of two normal distributions with different variances, and that we place a higher mixture weight on the normal distribution concentrated around zero as we increase the order of interactions. Also, notice that the priors are flatter as we increase the number of latent factors
.In environmental epidemiology, it is common to have prior knowledge of groups of exposures that are highly correlated and it is natural to include such information in the specification of . Let us consider the scenario when the true number of factors is equal to , and the variables in can be divided in two groups: and , of dimensions and , each one loading on only one factor. This structure is reasonable when we have measurements of different chemicals that are breakdown products of the same exposure, for example PCB metabolites [Longnecker et al., 2001]. Then the a priori covariance between main effects is equal to , for , when the chemicals belong to the same group, and is zero otherwise. Hence, in this case, is block diagonal. On the other hand, assume that the number of factors is equal to the number of covariates, with being diagonal. In this case the induced covariance on is diagonal with elements . In general when there are groups, the variance of , with , is equal to . Hence, the variance of is lower with respect to the independent case since we are borrowing strength and information from the other covariates within the same group.
Let us now focus on the symmetric matrix of dimension , letting be the vector of lower triangular elements of . Define the duplication matrix as the matrix such that , see Magnus [1988] as a reference. The duplication matrix can be easily calculated for orders and , whereas the R package matrixcalc provides the duplication matrix for higher orders. We are interested in the distribution of . Notice that
We choose a normal prior on pairwise interactions, i.e., , so that . Computing the covariance of the induced Normal prior on the matrix containing the pairwise interactions , we find that the variables are divided in three groups: , and . The quadratic effect of the first two covariates are correlated with each other and with the interaction between them. The same holds for the variables loading on the second factor. Finally, the third group contains the interactions between one variable loading on the first factor and one variable loading on the second factor. In general, with variables and factors, we will have in total groups. In particular, the first groups will be made by the interactions between variables loading on the same factor. On the other hand, we have groups for the interactions between variables loading on different factors, as in the previous example with and .
2.5 Complexity Gains
Inference under existing approaches for Bayesian linear modeling for pairwise interactions when is moderately high is typically computationally infeasible. In fact the complexity per iteration of Gibbs sampling is and the storage is of order . This is without considering any heredity structure. On the other hand, with model we just need samples of , , and to compute main effects and interactions of on thanks to Proposition 1. The complexity per iteration of Gibbs sampling is , where is the number of factors. In our motivating applications, we have . Further, the storage complexity is only since we only need to save the samples of and .
The computational complexity could be further reduced using the algorithm of [Sabnis et al., 2016], which allows one to distribute the covariance matrix estimation to multiple cores, efficiently using a divide and conquer strategy. Letting denote the number of cores at our disposal and assume that is a multiple of . Letting , the computational complexity becomes . If we want to estimate the interactions up the order, the computational complexity becomes . Moreover, the storage complexity is .
2.6 Structural Equation Modeling
In this section we extend model to multivariate outcomes, using Structural Equation Modeling (SEM) [Bollen and Long, 1993]. SEMs are a natural generalization of factor analysis to multivariate outcomes and have been successfully employed for multilevel data [Ansari and Jedidi, 2001] and nonlinear structures [Lee et al., 2004]. For multivariate responses , we could simply model for independently using . Rather, we choose to write both the covariates and the outcome as linear combinations of latent factors, and to model the dependence between them through the latent variables. We consider the usual normal linear SEM model, also referred to as LISREL model, see [Palomo et al., 2007] for a review on Bayesian methods. For :
(4)  
(5)  
(6)  
(7) 
where and are tall and skinny matrices of dimensions and , respectively, and ,, and are all diagonal matrices. is a matrix of dimension and is a square matrix of dimension such that the diagonal is set to zero. Let us assume that is invertible so that we can equivalently write as:
Equations to define a joint normal model for . Using law of iterated expectation is it easy to show that the induced regression of on is linear:
where , which was equivalently defined by Proposition 1. In order to introduce interactions, we either can modify equation , introducing pairwise interactions between s, or , introducing pairwise interactions between s. We decide to follow the second approach:
(8) 
where is an vector such that the element is equal to , with being a matrix containing the coefficients of the pairwise interactions of regressed on . If we substitute equation with , the new induced regression of on accommodates pairwise interactions:
Now, is a vector such that the element is equal to . Using the formula for the expectation of a quadratic form we have that: , where , again defined by Proposition 1. This shows that we induce a regression on the multivariate outcome with pairwise interactions.
3 Properties of the Model
In this section we prove that the posterior distribution of is weakly consistent for a broad set of models. Let
denote the KullbackLeibler divergence between
and , whereWe will assume that represents the true datagenerating model. This assumption is not as restrictive as it may initially seem. The model is flexible enough to always characterize and model quadratic regression in the response component, while accurately approximating any covariance structure in the predictor component. In fact it always holds that:
where and are functions of as in Proposition 1, and the true number of factors is . When , we can write any covariance matrix as . We take an “overfitted” factor modeling approach, related to [Bhattacharya and Dunson, 2011], [Rousseau and Mengersen, 2011], and choose to correspond to an upper bound on the number of factors. We then choose a shrinkage prior that can automatically remove redundant factors that are unnecessary in characterizing the dependence structure. In practice, we recommend the rule of thumb that chooses the upper bound such that , where is the
largest singular value of the correlation matrix of
. We have found this choice to have good performance in a wide variety of simulation cases. However, there is nonetheless a potential concert that may be less than in some cases. Proposition 2 quantifies the distance in terms of KullbackLeibler divergence between the true data generating model and the likelihood under model missspecification as approaches infinity.Proposition 2.
Fix , , and assume that . As increases the posterior distribution of and concentrates around and , satisfying:
where is the largest singular value of .
Unsurprisingly, the bound of Proposition 2
resembles the EckartYoung theorem for lowrank approximation based on the Singular Value Decomposition of a matrix. The EckartYoung theorem states that the rank
approximation of a matrix minimizing the Frobenoius norm is such that. In a similar fashion as Principal Component Analysis and Factor Analysis, we can inspect the singular values of the correlation matrix of the regressors in order to choose the number of factors to include in the model, and thanks to
Proposition 2 we know how far the posterior distribution will concentrate from the truth.The next proposition provides a sufficient condition in order to achieve posterior consistency when . Notice that we achieve posterior consistency on the induced main effects and pairwise interactions.
Proposition 3.
Fix . Whenever , for any there exists an such that:
where is the supnorm.
One can easily define a prior on
such that it places positive probability in any small neighborhood of
, according to . The prior defined in Section 2.2 satisfies this condition. Consequently, the posterior distribution of is weakly consistent due to [Schwartz, 1965].4 Simulations Experiments
In this section we compare the performance of our FIN method with four other approaches: PIE [Wang et al., 2019], RAMP [Hao et al., 2018], Family [Haris et al., 2016] and HierNet [Bien et al., 2013]. These methods are designed for inference on interactions in moderate to high dimensional settings. We generate covariates in two ways: independently from a distribution and from model , where we set the true number of factors equal to and generate the elements of from a standard normal distribution. In the second scenario, the average absolute correlation in the covariates is equal to . For each scenario, we generate the continuous outcome according to a linear regression with pairwise interactions:
where half of the main effects are different from zero and for . We distinguish between a sparse matrix of pairwise interactions , with only nonzero interactions, or dense, where
of the elements are different from zero. In total we have four simulation scenarios: correlated or independent data combined with sparse or dense pairwise interactions. We generate the nonzero main effects and interaction coefficients from a Uniform distribution in the interval
such that the regression equation follows the strong heredity constraint. Strong heredity allows interaction between two variables to be included in the model only if the main effects are. This is done to favor RAMP, Family and HierNet, which assume the heredity condition. We repeat the simulations times and evaluate the performance on a test dataset of units computing predictive mean square error, mean square error for main effects, Frobenious norm for interaction effects, and percentage of True Positives (TP) and True Negatives (TN) for main effects and interaction effects. The percentage of TP and TN main effects is defined as follows:where is the estimated main effect for feature and is the true coefficient. FIN is the only method reporting uncertainty quantification and we set whenever zero is included in the credible interval. We equivalently define the percentage of True Positive and True Negative interactions.
The MCMC algorithm was run for iterations with a burnin of . We observed good mixing. In particular, the Effective Sample Size (ESS) was always greater than across our simulations, both for main effects and interactions. The credible intervals cover the true value of the outcome approximately of the time in all the simulation scenarios. We set the hyperparameter of the DirichletLaplace prior equal to . We obtained similar results for in the interval . The results are summarized in Table 1. Given the rule of thumb that chooses such that , we set . In the high correlation scenario, FIN outperforms the other methods in predictive performance and estimation of main effects and interactions, whereas the rate recovery of true main effects and interactions is comparable to HierNet. With independent covariates, our rule of thumb for choosing the number of factors prescribes to set . Rather, we set to show that, despite the model misspecification, FIN has a comparable predictive performance with respect to the other methods, which do not take into account correlation structure in the covariates.
HierNet  FAMILY  PIE  RAMP  FIN  
test error  
FR  
high corr  main MSE  
sparse  TP main  
TN main  
TP int  
TN int  
test error  
FR  
high corr  main MSE  
dense  TP main  
TN main  
TP int  
TN int  
test error  
FR  
indep  main MSE  
sparse  TP main  
TN main  
TP int  
TN int  
test error  
FR  
indep  main MSE  
sparse  TP main  
TN main  
TP int  
TN int 
5 Environmental Epidemiology Application
The goal of our analysis is to assess the effect of nine phthalate metabolites, Mononbutyl, Monoisobutyl, Monoethyl, Monobenzyl, Monocyclohexyl, Mono2ethyl5carboxypentyl, Mono(2ethyl5hydroxyhexyl), Mono(2ethyl5oxohexyl), Mono(2ethyl)hexyl, and two dichlorophenol pesticides, 2,5dichlorophenol and 2,4dichlorophenol, on body mass index (BMI). Phthalates are mainly used as plasticizers and can be found in toys, detergents, food packaging, and soaps. They have previously been associated with increased BMI [Hatch et al., 2008] and waist circumference (WC) [Stahlhut et al., 2007]. There is a growing health concern for the association of phthalates with childhood obesity [Kim and Park, 2014], [Zhang et al., 2014]. Dichlorophenol pesticides are derivatives as phenols and are commonly used for herbicides. Dichlorophenol pesticides have also been associated with obesity [Thayer et al., 2012], [Twum and Wei, 2011], [Wei et al., 2014]
The data are taken from the National Health and Nutrition Examination Survey (NHANES) for the years 2009 and 2011. We select a subsample of
individuals for which the measurements of phthalates and dichlorophenol pesticides are not missing, though FIN can easily accommodate missing data through adding an imputation step to the MCMC algorithm. We also include in the analysis cholesterol, creatinine, race, sex, education and age. We apply the base
logarithm transformation the chemicals, cholesterol, creatinine and BMI to make their distribution closer to normality. We assume a latent normal structure for the chemicals, creatinine and cholesterol, which are included in the matrix , and use the other variables as covariates, which are included in the matrix . We estimate a quadratic regression according to model . We specify independent Gaussian priors for elements of and . Algorithm 1 can be easily adapted for model . Some chemical measurements have been recorded below the limit of detection (LOD), hence at each iteration we sample their value from a truncated normal distribution with support in the interval, mean and standard deviation equal to
. Inspecting the Eigendecomposition of the matrix of chemicals, creatinine and cholesterol, the first eigenvectors explain most of the variability; hence we set the number of factors equal to .Figure 2 shows the estimated main effects of the chemicals. The signs of the coefficients are to be consistent across different methods. Figure 3 shows the posterior mean of the matrix of chemical interactions and the matrix of factor loadings , before and after applying the clustalign algorithm of [Poworoznek and Dunson, 2019], which resolves rotational ambiguity and column label switching for the posterior samples of . The matrix of factor loadings reflects the correlation structure of the chemicals. In fact, the chemicals appear to be grouped in three families: the environmental chemicals and two groups of phthalates. The environmental chemicals, 2,4dichlorophenol and 2,5dichlorophenol, load mostly on the factor and we estimate a negative interaction between them. The second group of chemicals load mostly on the factor and are composed of Mononbutyl, Monoisobutyl, Monoethyl and Monobenzyl. Finally, Monocarbox, Monohydrox, Monooxohexyl and Monohexyl are part of the last group of chemicals and load mostly on the factor. Interestingly, Monocyclohexyl has a low correlation with the other phthalate metabolites, but the credible intervals of the pairwise interactions with the group of chemicals does not contain zero. We did not identify any significant interactions between the chemical exposures and covariates. On a test set of observations, FIN had the same test errors of HierNet, which were lower than the other three methods. The code for reproducing the analysis is available at https://github.com/fedfer/factor_interactions.
Comments
There are no comments yet.